2023-06-23 17:26:52,794 INFO [train.py:1064] (1/4) Training started
2023-06-23 17:26:52,794 INFO [train.py:1074] (1/4) Device: cuda:1
2023-06-23 17:26:55,686 INFO [lexicon.py:168] (1/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-23 17:26:56,339 INFO [train.py:1085] (1/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '63e53ba-dirty', 'icefall-git-date': 'Wed Jun 21 18:13:24 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-6-0423201309-7c68fd68fb-6cszs', 'IP address': '10.177.28.83'}, 'world_size': 4, 'master_port': 12536, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 6, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-23 17:26:56,339 INFO [train.py:1087] (1/4) About to create model
2023-06-23 17:26:57,143 INFO [train.py:1091] (1/4) Number of model parameters: 32327030
2023-06-23 17:26:57,144 INFO [checkpoint.py:112] (1/4) Loading checkpoint from zipformer/exp_L_small/epoch-5.pt
2023-06-23 17:27:09,186 INFO [train.py:1106] (1/4) Using DDP
2023-06-23 17:27:09,635 INFO [train.py:1118] (1/4) Loading optimizer state dict
2023-06-23 17:27:10,162 INFO [train.py:1126] (1/4) Loading scheduler state dict
2023-06-23 17:27:10,162 INFO [asr_datamodule.py:390] (1/4) About to get train cuts
2023-06-23 17:27:10,165 INFO [asr_datamodule.py:398] (1/4) About to get dev cuts
2023-06-23 17:27:10,166 INFO [asr_datamodule.py:211] (1/4) About to get Musan cuts
2023-06-23 17:27:13,577 INFO [asr_datamodule.py:216] (1/4) Enable MUSAN
2023-06-23 17:27:13,578 INFO [asr_datamodule.py:239] (1/4) Enable SpecAugment
2023-06-23 17:27:13,578 INFO [asr_datamodule.py:240] (1/4) Time warp factor: 80
2023-06-23 17:27:13,578 INFO [asr_datamodule.py:250] (1/4) Num frame mask: 10
2023-06-23 17:27:13,579 INFO [asr_datamodule.py:263] (1/4) About to create train dataset
2023-06-23 17:27:13,579 INFO [asr_datamodule.py:289] (1/4) Using DynamicBucketingSampler.
2023-06-23 17:27:19,204 INFO [asr_datamodule.py:305] (1/4) About to create train dataloader
2023-06-23 17:27:19,206 INFO [asr_datamodule.py:336] (1/4) About to create dev dataset
2023-06-23 17:27:20,146 INFO [asr_datamodule.py:354] (1/4) About to create dev dataloader
2023-06-23 17:27:20,147 INFO [train.py:1206] (1/4) Loading grad scaler state dict
2023-06-23 17:29:33,344 INFO [train.py:996] (1/4) Epoch 6, batch 0, loss[loss=0.2352, simple_loss=0.2964, pruned_loss=0.08703, over 21863.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.2964, pruned_loss=0.08703, over 21863.00 frames. ], batch size: 373, lr: 5.35e-03, grad_scale: 32.0
2023-06-23 17:29:33,344 INFO [train.py:1019] (1/4) Computing validation loss
2023-06-23 17:29:50,955 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2383, simple_loss=0.345, pruned_loss=0.06586, over 1796401.00 frames.
2023-06-23 17:29:50,956 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 21628MB
2023-06-23 17:30:27,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=914898.0, ans=0.0
2023-06-23 17:30:28,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 4.794e+02 6.251e+02 8.348e+02 2.118e+03, threshold=1.250e+03, percent-clipped=42.0
2023-06-23 17:31:12,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=915018.0, ans=0.2
2023-06-23 17:31:26,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=915078.0, ans=0.0
2023-06-23 17:31:35,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0
2023-06-23 17:31:35,955 INFO [train.py:996] (1/4) Epoch 6, batch 50, loss[loss=0.257, simple_loss=0.3391, pruned_loss=0.08748, over 21869.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3173, pruned_loss=0.08046, over 961591.80 frames. ], batch size: 124, lr: 5.35e-03, grad_scale: 16.0
2023-06-23 17:31:47,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=915138.0, ans=0.0
2023-06-23 17:32:25,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0
2023-06-23 17:32:47,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=915318.0, ans=0.125
2023-06-23 17:32:57,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0
2023-06-23 17:33:14,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs.
limit=15.0 2023-06-23 17:33:15,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=915378.0, ans=0.0 2023-06-23 17:33:15,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=915378.0, ans=0.0 2023-06-23 17:33:21,864 INFO [train.py:996] (1/4) Epoch 6, batch 100, loss[loss=0.2557, simple_loss=0.3479, pruned_loss=0.08175, over 21565.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.332, pruned_loss=0.08333, over 1691452.74 frames. ], batch size: 230, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:33:27,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=915438.0, ans=0.1 2023-06-23 17:33:35,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=915438.0, ans=0.2 2023-06-23 17:33:45,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=915498.0, ans=0.2 2023-06-23 17:34:04,544 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.333e+02 2.600e+02 2.995e+02 4.991e+02, threshold=5.199e+02, percent-clipped=0.0 2023-06-23 17:34:07,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=915558.0, ans=0.125 2023-06-23 17:34:23,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=915618.0, ans=0.0 2023-06-23 17:34:49,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=915618.0, ans=0.125 2023-06-23 17:35:09,831 INFO [train.py:996] (1/4) Epoch 6, batch 150, loss[loss=0.2323, simple_loss=0.3215, pruned_loss=0.07156, over 19823.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.332, pruned_loss=0.08148, over 2257533.85 frames. ], batch size: 702, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:35:36,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-23 17:36:36,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=915918.0, ans=0.125 2023-06-23 17:36:59,856 INFO [train.py:996] (1/4) Epoch 6, batch 200, loss[loss=0.1916, simple_loss=0.2717, pruned_loss=0.05577, over 21251.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.328, pruned_loss=0.08166, over 2707304.73 frames. 
], batch size: 159, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:37:02,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=916038.0, ans=0.0 2023-06-23 17:37:14,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=916038.0, ans=0.0 2023-06-23 17:37:16,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916098.0, ans=0.1 2023-06-23 17:37:40,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.585e+02 2.985e+02 3.639e+02 6.609e+02, threshold=5.970e+02, percent-clipped=4.0 2023-06-23 17:38:23,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=916218.0, ans=0.125 2023-06-23 17:38:47,078 INFO [train.py:996] (1/4) Epoch 6, batch 250, loss[loss=0.2361, simple_loss=0.3131, pruned_loss=0.07951, over 20699.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3222, pruned_loss=0.08012, over 3055560.44 frames. ], batch size: 607, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:38:55,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-23 17:39:22,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=916398.0, ans=0.125 2023-06-23 17:39:50,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=916458.0, ans=0.125 2023-06-23 17:40:10,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-23 17:40:13,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=916578.0, ans=0.0 2023-06-23 17:40:28,684 INFO [train.py:996] (1/4) Epoch 6, batch 300, loss[loss=0.239, simple_loss=0.3147, pruned_loss=0.08168, over 21807.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3173, pruned_loss=0.08097, over 3325909.12 frames. ], batch size: 298, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:40:39,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5 2023-06-23 17:40:49,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=916698.0, ans=0.125 2023-06-23 17:41:03,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-23 17:41:08,941 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.631e+02 3.060e+02 3.627e+02 5.054e+02, threshold=6.120e+02, percent-clipped=0.0 2023-06-23 17:41:22,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.45 vs. 
limit=10.0 2023-06-23 17:41:48,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=916818.0, ans=0.125 2023-06-23 17:42:21,703 INFO [train.py:996] (1/4) Epoch 6, batch 350, loss[loss=0.1937, simple_loss=0.2657, pruned_loss=0.06088, over 21446.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3119, pruned_loss=0.07957, over 3543455.34 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:42:24,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=916938.0, ans=0.125 2023-06-23 17:43:29,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=917058.0, ans=0.2 2023-06-23 17:43:33,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=917118.0, ans=0.1 2023-06-23 17:43:36,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=917118.0, ans=0.0 2023-06-23 17:43:46,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=917118.0, ans=0.05 2023-06-23 17:44:06,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=917238.0, ans=0.125 2023-06-23 17:44:07,648 INFO [train.py:996] (1/4) Epoch 6, batch 400, loss[loss=0.244, simple_loss=0.342, pruned_loss=0.07299, over 21381.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3071, pruned_loss=0.07861, over 3709746.82 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 32.0 2023-06-23 17:44:47,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.687e+02 2.996e+02 3.462e+02 5.169e+02, threshold=5.992e+02, percent-clipped=0.0 2023-06-23 17:45:03,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=917358.0, ans=0.125 2023-06-23 17:45:15,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-06-23 17:45:55,316 INFO [train.py:996] (1/4) Epoch 6, batch 450, loss[loss=0.265, simple_loss=0.2994, pruned_loss=0.1153, over 21397.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3032, pruned_loss=0.07717, over 3835949.48 frames. ], batch size: 509, lr: 5.34e-03, grad_scale: 32.0 2023-06-23 17:46:01,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=917538.0, ans=0.125 2023-06-23 17:46:03,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=917538.0, ans=0.125 2023-06-23 17:46:25,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=917598.0, ans=0.2 2023-06-23 17:46:57,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=917658.0, ans=0.125 2023-06-23 17:47:16,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.08 vs. 
limit=15.0 2023-06-23 17:47:17,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=917718.0, ans=0.09899494936611666 2023-06-23 17:47:46,332 INFO [train.py:996] (1/4) Epoch 6, batch 500, loss[loss=0.2009, simple_loss=0.2731, pruned_loss=0.06436, over 20791.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3034, pruned_loss=0.07622, over 3934692.25 frames. ], batch size: 608, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:48:08,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=917898.0, ans=0.0 2023-06-23 17:48:32,683 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.519e+02 2.896e+02 3.744e+02 5.708e+02, threshold=5.793e+02, percent-clipped=0.0 2023-06-23 17:48:35,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=917958.0, ans=0.125 2023-06-23 17:49:30,915 INFO [train.py:996] (1/4) Epoch 6, batch 550, loss[loss=0.2072, simple_loss=0.3045, pruned_loss=0.05495, over 21345.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3026, pruned_loss=0.07437, over 4014236.11 frames. ], batch size: 211, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:50:20,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=918258.0, ans=0.1 2023-06-23 17:50:30,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=918258.0, ans=0.5 2023-06-23 17:51:13,066 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-23 17:51:15,197 INFO [train.py:996] (1/4) Epoch 6, batch 600, loss[loss=0.2166, simple_loss=0.3165, pruned_loss=0.05832, over 21641.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3088, pruned_loss=0.07452, over 4078045.42 frames. ], batch size: 263, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:51:55,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=918498.0, ans=0.0 2023-06-23 17:52:12,355 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.707e+02 3.073e+02 3.854e+02 5.945e+02, threshold=6.147e+02, percent-clipped=1.0 2023-06-23 17:52:37,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=918618.0, ans=0.95 2023-06-23 17:52:43,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-23 17:52:48,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=918678.0, ans=0.125 2023-06-23 17:52:50,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=918678.0, ans=0.125 2023-06-23 17:53:04,168 INFO [train.py:996] (1/4) Epoch 6, batch 650, loss[loss=0.2267, simple_loss=0.2932, pruned_loss=0.08007, over 21836.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3097, pruned_loss=0.0754, over 4112866.37 frames. 
], batch size: 351, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:53:29,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=918798.0, ans=0.0 2023-06-23 17:54:07,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=918858.0, ans=0.2 2023-06-23 17:54:08,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-23 17:54:46,838 INFO [train.py:996] (1/4) Epoch 6, batch 700, loss[loss=0.242, simple_loss=0.3303, pruned_loss=0.0768, over 21724.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3094, pruned_loss=0.07558, over 4152181.50 frames. ], batch size: 332, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 17:55:28,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=919098.0, ans=0.125 2023-06-23 17:55:38,549 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.507e+02 2.938e+02 3.548e+02 4.696e+02, threshold=5.875e+02, percent-clipped=0.0 2023-06-23 17:55:46,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=919158.0, ans=0.125 2023-06-23 17:55:59,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=919218.0, ans=0.125 2023-06-23 17:56:03,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919218.0, ans=0.1 2023-06-23 17:56:35,821 INFO [train.py:996] (1/4) Epoch 6, batch 750, loss[loss=0.234, simple_loss=0.3018, pruned_loss=0.08309, over 21872.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3084, pruned_loss=0.07667, over 4186038.41 frames. ], batch size: 107, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 17:57:52,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=919518.0, ans=0.125 2023-06-23 17:58:06,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=919578.0, ans=0.125 2023-06-23 17:58:24,797 INFO [train.py:996] (1/4) Epoch 6, batch 800, loss[loss=0.1926, simple_loss=0.2709, pruned_loss=0.05711, over 21804.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3063, pruned_loss=0.0774, over 4203544.65 frames. ], batch size: 118, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 17:58:50,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919698.0, ans=0.1 2023-06-23 17:59:04,527 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.561e+02 2.955e+02 3.550e+02 6.098e+02, threshold=5.911e+02, percent-clipped=2.0 2023-06-23 17:59:42,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=919818.0, ans=0.2 2023-06-23 18:00:08,606 INFO [train.py:996] (1/4) Epoch 6, batch 850, loss[loss=0.2002, simple_loss=0.2672, pruned_loss=0.06664, over 22002.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3029, pruned_loss=0.07703, over 4226402.31 frames. 
], batch size: 103, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 18:00:16,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=919938.0, ans=0.125 2023-06-23 18:00:38,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=919998.0, ans=0.0 2023-06-23 18:01:03,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-23 18:01:16,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=920058.0, ans=0.0 2023-06-23 18:01:42,447 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:01:53,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-23 18:01:59,416 INFO [train.py:996] (1/4) Epoch 6, batch 900, loss[loss=0.1815, simple_loss=0.259, pruned_loss=0.05197, over 21824.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2994, pruned_loss=0.07628, over 4241843.64 frames. ], batch size: 124, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:02:27,766 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-23 18:02:40,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.93 vs. limit=5.0 2023-06-23 18:02:53,510 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.549e+02 3.030e+02 3.332e+02 5.799e+02, threshold=6.061e+02, percent-clipped=0.0 2023-06-23 18:03:18,154 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-23 18:03:21,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-23 18:03:29,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=920418.0, ans=0.0 2023-06-23 18:03:50,053 INFO [train.py:996] (1/4) Epoch 6, batch 950, loss[loss=0.2076, simple_loss=0.2751, pruned_loss=0.07006, over 21354.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2987, pruned_loss=0.07668, over 4259347.09 frames. ], batch size: 159, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:03:56,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-23 18:04:07,902 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=15.0 2023-06-23 18:04:46,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=920658.0, ans=0.125 2023-06-23 18:04:46,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=920658.0, ans=0.0 2023-06-23 18:04:59,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-23 18:05:10,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.80 vs. limit=22.5 2023-06-23 18:05:19,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=920718.0, ans=0.125 2023-06-23 18:05:41,240 INFO [train.py:996] (1/4) Epoch 6, batch 1000, loss[loss=0.245, simple_loss=0.3194, pruned_loss=0.08534, over 21737.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2992, pruned_loss=0.07707, over 4269237.76 frames. ], batch size: 389, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:06:00,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=920838.0, ans=0.0 2023-06-23 18:06:37,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=920898.0, ans=6.0 2023-06-23 18:06:42,106 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.583e+02 2.913e+02 3.407e+02 5.854e+02, threshold=5.827e+02, percent-clipped=0.0 2023-06-23 18:07:26,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=921078.0, ans=0.2 2023-06-23 18:07:32,545 INFO [train.py:996] (1/4) Epoch 6, batch 1050, loss[loss=0.2336, simple_loss=0.2998, pruned_loss=0.08374, over 21521.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3028, pruned_loss=0.07824, over 4275208.05 frames. ], batch size: 212, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:08:00,171 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-23 18:09:14,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=22.5 2023-06-23 18:09:31,267 INFO [train.py:996] (1/4) Epoch 6, batch 1100, loss[loss=0.2213, simple_loss=0.2945, pruned_loss=0.07405, over 21518.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3026, pruned_loss=0.07678, over 4280856.53 frames. ], batch size: 548, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:10:06,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=921498.0, ans=0.125 2023-06-23 18:10:25,287 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.670e+02 3.079e+02 4.028e+02 7.418e+02, threshold=6.158e+02, percent-clipped=6.0 2023-06-23 18:11:20,101 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.33 vs. limit=10.0 2023-06-23 18:11:29,975 INFO [train.py:996] (1/4) Epoch 6, batch 1150, loss[loss=0.2197, simple_loss=0.3044, pruned_loss=0.06748, over 21738.00 frames. 
], tot_loss[loss=0.2268, simple_loss=0.3015, pruned_loss=0.076, over 4285343.70 frames. ], batch size: 282, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:12:33,049 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:12:59,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=921978.0, ans=0.0 2023-06-23 18:13:08,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.43 vs. limit=22.5 2023-06-23 18:13:10,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=921978.0, ans=0.125 2023-06-23 18:13:12,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=921978.0, ans=0.125 2023-06-23 18:13:17,165 INFO [train.py:996] (1/4) Epoch 6, batch 1200, loss[loss=0.2015, simple_loss=0.2816, pruned_loss=0.06069, over 21834.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3033, pruned_loss=0.07736, over 4280909.10 frames. ], batch size: 298, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 18:13:24,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=922038.0, ans=0.125 2023-06-23 18:13:42,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=922098.0, ans=0.1 2023-06-23 18:13:59,212 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.616e+02 3.018e+02 3.638e+02 5.698e+02, threshold=6.035e+02, percent-clipped=0.0 2023-06-23 18:14:20,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-23 18:14:33,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=922218.0, ans=0.125 2023-06-23 18:15:01,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=922278.0, ans=0.125 2023-06-23 18:15:01,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=22.5 2023-06-23 18:15:07,509 INFO [train.py:996] (1/4) Epoch 6, batch 1250, loss[loss=0.2525, simple_loss=0.325, pruned_loss=0.09006, over 21898.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3073, pruned_loss=0.07986, over 4282763.61 frames. ], batch size: 118, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:15:31,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=922398.0, ans=0.0 2023-06-23 18:15:59,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=922458.0, ans=0.1 2023-06-23 18:16:20,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-23 18:16:59,817 INFO [train.py:996] (1/4) Epoch 6, batch 1300, loss[loss=0.2165, simple_loss=0.3057, pruned_loss=0.06362, over 21788.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3079, pruned_loss=0.07965, over 4290747.59 frames. 
], batch size: 282, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:17:09,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=922638.0, ans=0.0 2023-06-23 18:17:24,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=922698.0, ans=0.025 2023-06-23 18:17:29,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=922698.0, ans=0.0 2023-06-23 18:17:32,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=922698.0, ans=0.2 2023-06-23 18:17:42,851 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.762e+02 3.245e+02 4.001e+02 7.520e+02, threshold=6.490e+02, percent-clipped=2.0 2023-06-23 18:17:58,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=922818.0, ans=0.125 2023-06-23 18:18:46,510 INFO [train.py:996] (1/4) Epoch 6, batch 1350, loss[loss=0.1983, simple_loss=0.2665, pruned_loss=0.06505, over 21717.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.308, pruned_loss=0.07931, over 4288463.62 frames. ], batch size: 247, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:19:07,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=922938.0, ans=0.0 2023-06-23 18:19:18,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=922998.0, ans=0.2 2023-06-23 18:19:35,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=923058.0, ans=0.02 2023-06-23 18:19:53,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=923118.0, ans=0.125 2023-06-23 18:20:17,033 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-23 18:20:33,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=923178.0, ans=0.1 2023-06-23 18:20:36,683 INFO [train.py:996] (1/4) Epoch 6, batch 1400, loss[loss=0.1898, simple_loss=0.2511, pruned_loss=0.06423, over 21277.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3066, pruned_loss=0.07856, over 4283724.25 frames. ], batch size: 177, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:20:55,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=15.0 2023-06-23 18:21:10,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=923298.0, ans=0.0 2023-06-23 18:21:21,166 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.458e+02 2.680e+02 3.185e+02 5.161e+02, threshold=5.361e+02, percent-clipped=0.0 2023-06-23 18:21:23,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=923358.0, ans=0.2 2023-06-23 18:22:35,797 INFO [train.py:996] (1/4) Epoch 6, batch 1450, loss[loss=0.1982, simple_loss=0.282, pruned_loss=0.05723, over 21361.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3054, pruned_loss=0.07824, over 4284340.80 frames. 
], batch size: 211, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:23:31,241 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-23 18:24:07,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=923778.0, ans=0.0 2023-06-23 18:24:26,519 INFO [train.py:996] (1/4) Epoch 6, batch 1500, loss[loss=0.2058, simple_loss=0.2764, pruned_loss=0.06763, over 20040.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3072, pruned_loss=0.07946, over 4292910.21 frames. ], batch size: 702, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:25:06,056 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.614e+02 2.900e+02 3.425e+02 5.180e+02, threshold=5.801e+02, percent-clipped=0.0 2023-06-23 18:25:12,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=923958.0, ans=0.2 2023-06-23 18:25:13,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=8.0 2023-06-23 18:25:38,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=924018.0, ans=0.2 2023-06-23 18:25:38,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=924018.0, ans=0.2 2023-06-23 18:25:43,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=924018.0, ans=0.1 2023-06-23 18:26:20,372 INFO [train.py:996] (1/4) Epoch 6, batch 1550, loss[loss=0.1989, simple_loss=0.2649, pruned_loss=0.0664, over 21637.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.304, pruned_loss=0.07777, over 4285369.51 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:26:26,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=924138.0, ans=0.1 2023-06-23 18:26:29,188 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-23 18:27:11,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=924258.0, ans=0.0 2023-06-23 18:28:14,002 INFO [train.py:996] (1/4) Epoch 6, batch 1600, loss[loss=0.1345, simple_loss=0.1842, pruned_loss=0.04237, over 16689.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3029, pruned_loss=0.07765, over 4281906.12 frames. ], batch size: 61, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:28:21,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=8.0 2023-06-23 18:28:24,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.34 vs. 
limit=15.0 2023-06-23 18:28:33,286 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:29:08,557 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.611e+02 2.907e+02 3.387e+02 5.572e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-23 18:29:21,019 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-23 18:29:24,065 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:29:47,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=924618.0, ans=0.0 2023-06-23 18:30:05,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=924678.0, ans=0.1 2023-06-23 18:30:08,446 INFO [train.py:996] (1/4) Epoch 6, batch 1650, loss[loss=0.2038, simple_loss=0.2753, pruned_loss=0.06613, over 21834.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3006, pruned_loss=0.07712, over 4275428.06 frames. ], batch size: 107, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:30:10,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=924738.0, ans=0.125 2023-06-23 18:30:37,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=924798.0, ans=0.125 2023-06-23 18:31:41,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=924918.0, ans=0.125 2023-06-23 18:32:02,890 INFO [train.py:996] (1/4) Epoch 6, batch 1700, loss[loss=0.1865, simple_loss=0.281, pruned_loss=0.04603, over 21607.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3047, pruned_loss=0.07799, over 4272411.28 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:32:36,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=925098.0, ans=0.0 2023-06-23 18:32:50,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=925158.0, ans=0.125 2023-06-23 18:33:01,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.590e+02 2.907e+02 3.447e+02 5.734e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-23 18:33:02,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=925158.0, ans=0.125 2023-06-23 18:33:02,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-23 18:33:25,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=925218.0, ans=0.125 2023-06-23 18:34:02,127 INFO [train.py:996] (1/4) Epoch 6, batch 1750, loss[loss=0.3219, simple_loss=0.3954, pruned_loss=0.1242, over 21521.00 frames. ], tot_loss[loss=0.229, simple_loss=0.305, pruned_loss=0.07652, over 4277671.77 frames. 
], batch size: 471, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:34:10,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-06-23 18:34:48,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=925458.0, ans=0.125 2023-06-23 18:35:12,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=925518.0, ans=0.1 2023-06-23 18:35:18,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=925518.0, ans=0.125 2023-06-23 18:35:37,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=925578.0, ans=0.125 2023-06-23 18:36:01,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925638.0, ans=0.1 2023-06-23 18:36:02,456 INFO [train.py:996] (1/4) Epoch 6, batch 1800, loss[loss=0.1649, simple_loss=0.2493, pruned_loss=0.04023, over 21616.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3033, pruned_loss=0.07474, over 4269144.94 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:36:39,670 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:36:39,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=925698.0, ans=0.0 2023-06-23 18:36:56,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.930e+02 2.395e+02 2.914e+02 3.634e+02 6.423e+02, threshold=5.828e+02, percent-clipped=1.0 2023-06-23 18:37:14,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-23 18:37:44,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925878.0, ans=0.1 2023-06-23 18:37:53,461 INFO [train.py:996] (1/4) Epoch 6, batch 1850, loss[loss=0.2238, simple_loss=0.3063, pruned_loss=0.0707, over 21930.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3043, pruned_loss=0.07347, over 4256682.27 frames. ], batch size: 316, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:37:56,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=925938.0, ans=0.125 2023-06-23 18:38:36,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=925998.0, ans=0.125 2023-06-23 18:38:41,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=926058.0, ans=0.125 2023-06-23 18:38:45,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-23 18:39:16,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=926118.0, ans=0.0 2023-06-23 18:39:46,172 INFO [train.py:996] (1/4) Epoch 6, batch 1900, loss[loss=0.1881, simple_loss=0.2735, pruned_loss=0.05139, over 21773.00 frames. 
], tot_loss[loss=0.2248, simple_loss=0.3034, pruned_loss=0.07311, over 4263482.03 frames. ], batch size: 282, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:40:39,855 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.383e+02 2.644e+02 3.253e+02 4.924e+02, threshold=5.288e+02, percent-clipped=0.0 2023-06-23 18:40:40,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=926358.0, ans=0.0 2023-06-23 18:41:14,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=926478.0, ans=0.125 2023-06-23 18:41:20,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.12 vs. limit=6.0 2023-06-23 18:41:37,828 INFO [train.py:996] (1/4) Epoch 6, batch 1950, loss[loss=0.1892, simple_loss=0.2502, pruned_loss=0.06407, over 21633.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2992, pruned_loss=0.07221, over 4268347.65 frames. ], batch size: 247, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:41:54,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=926538.0, ans=0.02 2023-06-23 18:41:56,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=926538.0, ans=0.125 2023-06-23 18:42:01,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=926598.0, ans=0.2 2023-06-23 18:43:37,087 INFO [train.py:996] (1/4) Epoch 6, batch 2000, loss[loss=0.2673, simple_loss=0.3623, pruned_loss=0.0861, over 21626.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2998, pruned_loss=0.07224, over 4266885.10 frames. ], batch size: 414, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:44:02,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=926898.0, ans=0.125 2023-06-23 18:44:13,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-23 18:44:22,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=926958.0, ans=0.1 2023-06-23 18:44:24,631 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.599e+02 2.979e+02 3.641e+02 7.240e+02, threshold=5.958e+02, percent-clipped=3.0 2023-06-23 18:44:31,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=926958.0, ans=0.125 2023-06-23 18:45:28,378 INFO [train.py:996] (1/4) Epoch 6, batch 2050, loss[loss=0.2203, simple_loss=0.2924, pruned_loss=0.0741, over 21556.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2975, pruned_loss=0.07193, over 4267312.89 frames. 
], batch size: 131, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:45:52,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=927198.0, ans=0.0 2023-06-23 18:46:05,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=927198.0, ans=0.125 2023-06-23 18:46:25,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=927258.0, ans=0.0 2023-06-23 18:46:54,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-23 18:47:20,410 INFO [train.py:996] (1/4) Epoch 6, batch 2100, loss[loss=0.2369, simple_loss=0.3184, pruned_loss=0.07775, over 21802.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3002, pruned_loss=0.07424, over 4273679.05 frames. ], batch size: 282, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:47:21,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=8.0 2023-06-23 18:47:47,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=927498.0, ans=0.125 2023-06-23 18:47:53,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5 2023-06-23 18:48:08,463 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.503e+02 2.741e+02 3.125e+02 4.918e+02, threshold=5.483e+02, percent-clipped=0.0 2023-06-23 18:49:12,096 INFO [train.py:996] (1/4) Epoch 6, batch 2150, loss[loss=0.2294, simple_loss=0.3187, pruned_loss=0.07007, over 21423.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3013, pruned_loss=0.07564, over 4273655.56 frames. ], batch size: 211, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:49:29,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=927738.0, ans=0.125 2023-06-23 18:49:29,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-23 18:49:36,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=927798.0, ans=0.125 2023-06-23 18:49:48,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=927798.0, ans=0.05 2023-06-23 18:49:58,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=927858.0, ans=0.0 2023-06-23 18:50:35,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=22.5 2023-06-23 18:50:59,987 INFO [train.py:996] (1/4) Epoch 6, batch 2200, loss[loss=0.1792, simple_loss=0.2564, pruned_loss=0.05097, over 21172.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3032, pruned_loss=0.07637, over 4275691.86 frames. ], batch size: 143, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:51:17,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. 
limit=6.0 2023-06-23 18:51:23,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=928098.0, ans=0.125 2023-06-23 18:51:24,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-23 18:51:47,924 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.632e+02 2.959e+02 3.421e+02 5.687e+02, threshold=5.917e+02, percent-clipped=1.0 2023-06-23 18:52:49,692 INFO [train.py:996] (1/4) Epoch 6, batch 2250, loss[loss=0.1957, simple_loss=0.2576, pruned_loss=0.06689, over 21157.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2991, pruned_loss=0.07439, over 4266352.91 frames. ], batch size: 159, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:53:23,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=928398.0, ans=0.125 2023-06-23 18:53:57,041 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=15.0 2023-06-23 18:54:40,517 INFO [train.py:996] (1/4) Epoch 6, batch 2300, loss[loss=0.1999, simple_loss=0.2608, pruned_loss=0.06949, over 21662.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2968, pruned_loss=0.07389, over 4266898.70 frames. ], batch size: 333, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:55:21,487 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-23 18:55:28,457 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.420e+02 2.816e+02 3.301e+02 5.962e+02, threshold=5.633e+02, percent-clipped=1.0 2023-06-23 18:55:46,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-23 18:55:57,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=928818.0, ans=0.125 2023-06-23 18:56:21,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=928878.0, ans=0.125 2023-06-23 18:56:35,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=928878.0, ans=0.125 2023-06-23 18:56:38,337 INFO [train.py:996] (1/4) Epoch 6, batch 2350, loss[loss=0.2157, simple_loss=0.2756, pruned_loss=0.07795, over 21244.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2951, pruned_loss=0.07374, over 4265303.92 frames. ], batch size: 144, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:56:42,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=928938.0, ans=0.2 2023-06-23 18:56:53,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=928938.0, ans=0.0 2023-06-23 18:57:05,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=928998.0, ans=0.2 2023-06-23 18:57:47,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.64 vs. 
limit=15.0 2023-06-23 18:58:28,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-23 18:58:30,895 INFO [train.py:996] (1/4) Epoch 6, batch 2400, loss[loss=0.2407, simple_loss=0.323, pruned_loss=0.07916, over 20714.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2986, pruned_loss=0.0763, over 4276300.31 frames. ], batch size: 607, lr: 5.31e-03, grad_scale: 32.0 2023-06-23 18:58:31,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=929238.0, ans=0.125 2023-06-23 18:58:33,643 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-23 18:59:21,205 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.599e+02 2.851e+02 3.513e+02 5.978e+02, threshold=5.701e+02, percent-clipped=2.0 2023-06-23 18:59:27,265 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:59:47,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=929418.0, ans=0.125 2023-06-23 19:00:22,692 INFO [train.py:996] (1/4) Epoch 6, batch 2450, loss[loss=0.2281, simple_loss=0.2942, pruned_loss=0.08099, over 22031.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3004, pruned_loss=0.07817, over 4279101.01 frames. ], batch size: 103, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:02:13,082 INFO [train.py:996] (1/4) Epoch 6, batch 2500, loss[loss=0.201, simple_loss=0.2641, pruned_loss=0.06893, over 15225.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3, pruned_loss=0.07816, over 4262692.00 frames. ], batch size: 61, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:02:58,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=929958.0, ans=0.125 2023-06-23 19:03:03,188 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.544e+02 2.837e+02 3.478e+02 5.146e+02, threshold=5.674e+02, percent-clipped=0.0 2023-06-23 19:03:15,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=930018.0, ans=0.0 2023-06-23 19:03:32,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=930018.0, ans=0.125 2023-06-23 19:03:47,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=930078.0, ans=0.125 2023-06-23 19:04:04,679 INFO [train.py:996] (1/4) Epoch 6, batch 2550, loss[loss=0.2017, simple_loss=0.2608, pruned_loss=0.07135, over 21463.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2999, pruned_loss=0.07698, over 4262528.89 frames. 
], batch size: 211, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:04:14,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=930138.0, ans=0.125 2023-06-23 19:04:15,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=930138.0, ans=0.125 2023-06-23 19:04:18,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=930138.0, ans=0.2 2023-06-23 19:04:49,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=930258.0, ans=0.0 2023-06-23 19:04:57,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=930258.0, ans=0.125 2023-06-23 19:05:38,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=930378.0, ans=0.0 2023-06-23 19:05:57,808 INFO [train.py:996] (1/4) Epoch 6, batch 2600, loss[loss=0.224, simple_loss=0.2886, pruned_loss=0.07965, over 21700.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.302, pruned_loss=0.07706, over 4265822.00 frames. ], batch size: 351, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:06:47,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.627e+02 2.988e+02 3.634e+02 5.525e+02, threshold=5.976e+02, percent-clipped=0.0 2023-06-23 19:07:05,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=930618.0, ans=0.125 2023-06-23 19:07:39,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=930678.0, ans=0.125 2023-06-23 19:07:42,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=930678.0, ans=0.0 2023-06-23 19:07:44,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=930678.0, ans=0.0 2023-06-23 19:07:44,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=930678.0, ans=0.0 2023-06-23 19:07:44,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=930678.0, ans=0.125 2023-06-23 19:07:46,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=930678.0, ans=0.125 2023-06-23 19:07:49,107 INFO [train.py:996] (1/4) Epoch 6, batch 2650, loss[loss=0.2858, simple_loss=0.3455, pruned_loss=0.113, over 21413.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3023, pruned_loss=0.07855, over 4270959.00 frames. 
], batch size: 471, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:07:53,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=930738.0, ans=0.125 2023-06-23 19:08:30,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=930858.0, ans=0.1 2023-06-23 19:09:15,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=930918.0, ans=0.0 2023-06-23 19:09:19,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5 2023-06-23 19:09:41,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-23 19:09:42,209 INFO [train.py:996] (1/4) Epoch 6, batch 2700, loss[loss=0.1896, simple_loss=0.2595, pruned_loss=0.05982, over 21610.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3007, pruned_loss=0.07792, over 4280792.06 frames. ], batch size: 230, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:09:44,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=931038.0, ans=0.0 2023-06-23 19:10:09,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=931098.0, ans=0.1 2023-06-23 19:10:32,950 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.709e+02 3.074e+02 3.590e+02 5.374e+02, threshold=6.148e+02, percent-clipped=0.0 2023-06-23 19:10:48,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931218.0, ans=0.1 2023-06-23 19:11:11,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=931278.0, ans=0.0 2023-06-23 19:11:31,902 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.23 vs. limit=15.0 2023-06-23 19:11:34,517 INFO [train.py:996] (1/4) Epoch 6, batch 2750, loss[loss=0.2256, simple_loss=0.2933, pruned_loss=0.07893, over 21813.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.301, pruned_loss=0.07761, over 4277317.86 frames. ], batch size: 298, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:11:38,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=931338.0, ans=0.125 2023-06-23 19:11:51,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931398.0, ans=0.1 2023-06-23 19:12:49,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=931518.0, ans=0.0 2023-06-23 19:13:24,218 INFO [train.py:996] (1/4) Epoch 6, batch 2800, loss[loss=0.2145, simple_loss=0.2787, pruned_loss=0.07517, over 21392.00 frames. ], tot_loss[loss=0.231, simple_loss=0.305, pruned_loss=0.07852, over 4276991.58 frames. ], batch size: 131, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:13:32,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. 
limit=10.0 2023-06-23 19:13:43,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=931638.0, ans=0.0 2023-06-23 19:14:08,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931758.0, ans=0.1 2023-06-23 19:14:22,173 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.716e+02 3.036e+02 3.413e+02 5.034e+02, threshold=6.071e+02, percent-clipped=0.0 2023-06-23 19:14:46,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=931818.0, ans=0.2 2023-06-23 19:14:47,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931818.0, ans=0.1 2023-06-23 19:15:18,157 INFO [train.py:996] (1/4) Epoch 6, batch 2850, loss[loss=0.2271, simple_loss=0.3003, pruned_loss=0.07694, over 21702.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3078, pruned_loss=0.08037, over 4278585.25 frames. ], batch size: 332, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:15:40,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=931998.0, ans=0.2 2023-06-23 19:15:51,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.14 vs. limit=12.0 2023-06-23 19:16:26,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=932118.0, ans=0.2 2023-06-23 19:16:54,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=932178.0, ans=0.125 2023-06-23 19:17:00,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=932178.0, ans=0.125 2023-06-23 19:17:04,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=932178.0, ans=0.125 2023-06-23 19:17:07,577 INFO [train.py:996] (1/4) Epoch 6, batch 2900, loss[loss=0.2183, simple_loss=0.2903, pruned_loss=0.07321, over 21377.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3019, pruned_loss=0.07872, over 4272702.69 frames. ], batch size: 159, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:17:53,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=932358.0, ans=0.2 2023-06-23 19:17:55,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-23 19:18:03,657 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.630e+02 3.132e+02 3.824e+02 7.694e+02, threshold=6.265e+02, percent-clipped=2.0 2023-06-23 19:18:04,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=932358.0, ans=0.125 2023-06-23 19:18:28,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=932418.0, ans=0.2 2023-06-23 19:18:58,063 INFO [train.py:996] (1/4) Epoch 6, batch 2950, loss[loss=0.2742, simple_loss=0.3254, pruned_loss=0.1115, over 21773.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3035, pruned_loss=0.07906, over 4279110.56 frames. 
], batch size: 508, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:19:23,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-06-23 19:19:26,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=932598.0, ans=0.0 2023-06-23 19:19:38,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932658.0, ans=0.1 2023-06-23 19:19:38,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=932658.0, ans=0.125 2023-06-23 19:19:42,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=932658.0, ans=0.125 2023-06-23 19:20:03,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=932658.0, ans=0.0 2023-06-23 19:20:11,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=932658.0, ans=0.05 2023-06-23 19:20:11,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=932658.0, ans=0.0 2023-06-23 19:20:38,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932778.0, ans=0.1 2023-06-23 19:20:50,754 INFO [train.py:996] (1/4) Epoch 6, batch 3000, loss[loss=0.2721, simple_loss=0.3362, pruned_loss=0.104, over 21248.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3079, pruned_loss=0.0802, over 4278630.92 frames. ], batch size: 143, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:20:50,755 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 19:21:13,119 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2526, simple_loss=0.3435, pruned_loss=0.08085, over 1796401.00 frames. 2023-06-23 19:21:13,120 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23385MB 2023-06-23 19:22:14,691 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.535e+02 2.851e+02 3.436e+02 5.853e+02, threshold=5.702e+02, percent-clipped=0.0 2023-06-23 19:22:15,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=932958.0, ans=0.125 2023-06-23 19:22:27,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=933018.0, ans=0.2 2023-06-23 19:23:05,123 INFO [train.py:996] (1/4) Epoch 6, batch 3050, loss[loss=0.207, simple_loss=0.2789, pruned_loss=0.06751, over 21671.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3095, pruned_loss=0.07944, over 4278256.46 frames. ], batch size: 263, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:24:53,919 INFO [train.py:996] (1/4) Epoch 6, batch 3100, loss[loss=0.212, simple_loss=0.2938, pruned_loss=0.0651, over 21462.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3089, pruned_loss=0.07857, over 4276533.74 frames. ], batch size: 211, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:25:35,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.79 vs. 
limit=15.0 2023-06-23 19:25:55,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-23 19:25:55,813 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.716e+02 3.164e+02 3.740e+02 6.470e+02, threshold=6.328e+02, percent-clipped=4.0 2023-06-23 19:26:00,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=933558.0, ans=0.125 2023-06-23 19:26:03,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=933618.0, ans=0.1 2023-06-23 19:26:27,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=933678.0, ans=0.125 2023-06-23 19:26:52,675 INFO [train.py:996] (1/4) Epoch 6, batch 3150, loss[loss=0.2336, simple_loss=0.3018, pruned_loss=0.08273, over 20695.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3108, pruned_loss=0.07967, over 4273804.71 frames. ], batch size: 607, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:27:02,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-23 19:27:03,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=933738.0, ans=0.0 2023-06-23 19:27:17,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=933738.0, ans=0.0 2023-06-23 19:27:41,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=933858.0, ans=0.125 2023-06-23 19:28:56,556 INFO [train.py:996] (1/4) Epoch 6, batch 3200, loss[loss=0.232, simple_loss=0.2887, pruned_loss=0.08762, over 21592.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3108, pruned_loss=0.079, over 4273220.83 frames. ], batch size: 548, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:29:46,296 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.521e+02 2.818e+02 3.375e+02 4.819e+02, threshold=5.636e+02, percent-clipped=0.0 2023-06-23 19:29:49,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-23 19:30:15,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=934218.0, ans=0.0 2023-06-23 19:30:37,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.03 vs. limit=10.0 2023-06-23 19:30:46,847 INFO [train.py:996] (1/4) Epoch 6, batch 3250, loss[loss=0.2196, simple_loss=0.2822, pruned_loss=0.07852, over 21876.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3135, pruned_loss=0.08075, over 4276232.83 frames. 
], batch size: 353, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:30:54,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=934338.0, ans=0.125 2023-06-23 19:31:59,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=934518.0, ans=0.0 2023-06-23 19:32:01,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=934518.0, ans=0.125 2023-06-23 19:32:01,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=934518.0, ans=0.125 2023-06-23 19:32:38,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=934578.0, ans=0.015 2023-06-23 19:32:40,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934638.0, ans=0.1 2023-06-23 19:32:41,468 INFO [train.py:996] (1/4) Epoch 6, batch 3300, loss[loss=0.2296, simple_loss=0.3099, pruned_loss=0.07462, over 21501.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3064, pruned_loss=0.07978, over 4265332.82 frames. ], batch size: 441, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:32:46,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934638.0, ans=0.1 2023-06-23 19:33:15,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=934698.0, ans=0.125 2023-06-23 19:33:38,656 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.620e+02 2.941e+02 3.334e+02 7.153e+02, threshold=5.881e+02, percent-clipped=1.0 2023-06-23 19:33:45,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=934818.0, ans=0.2 2023-06-23 19:34:12,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.02 vs. limit=12.0 2023-06-23 19:34:17,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=934878.0, ans=0.0 2023-06-23 19:34:30,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934878.0, ans=0.1 2023-06-23 19:34:31,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=934938.0, ans=0.0 2023-06-23 19:34:33,180 INFO [train.py:996] (1/4) Epoch 6, batch 3350, loss[loss=0.2643, simple_loss=0.3257, pruned_loss=0.1015, over 21437.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3086, pruned_loss=0.07951, over 4269743.89 frames. ], batch size: 548, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:34:50,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=934998.0, ans=0.1 2023-06-23 19:34:52,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
limit=15.0 2023-06-23 19:35:16,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=935058.0, ans=0.125 2023-06-23 19:35:31,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-23 19:36:10,446 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:36:25,733 INFO [train.py:996] (1/4) Epoch 6, batch 3400, loss[loss=0.2121, simple_loss=0.2992, pruned_loss=0.06245, over 21067.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3093, pruned_loss=0.08031, over 4273117.05 frames. ], batch size: 607, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:36:53,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=935298.0, ans=0.2 2023-06-23 19:36:57,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=935298.0, ans=0.125 2023-06-23 19:37:13,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=935358.0, ans=0.125 2023-06-23 19:37:30,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.631e+02 2.892e+02 3.496e+02 6.427e+02, threshold=5.784e+02, percent-clipped=1.0 2023-06-23 19:37:47,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=935418.0, ans=0.125 2023-06-23 19:38:18,978 INFO [train.py:996] (1/4) Epoch 6, batch 3450, loss[loss=0.2196, simple_loss=0.2869, pruned_loss=0.07616, over 21887.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.305, pruned_loss=0.07988, over 4279760.79 frames. ], batch size: 107, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:39:28,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=935658.0, ans=0.125 2023-06-23 19:39:59,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=935778.0, ans=15.0 2023-06-23 19:40:16,017 INFO [train.py:996] (1/4) Epoch 6, batch 3500, loss[loss=0.2909, simple_loss=0.3578, pruned_loss=0.112, over 21276.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3154, pruned_loss=0.08409, over 4282890.14 frames. 
], batch size: 143, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:40:18,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=935838.0, ans=0.0 2023-06-23 19:40:23,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=935838.0, ans=0.125 2023-06-23 19:41:16,942 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.777e+02 3.098e+02 3.671e+02 6.397e+02, threshold=6.196e+02, percent-clipped=1.0 2023-06-23 19:41:29,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=936018.0, ans=0.0 2023-06-23 19:41:34,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=936018.0, ans=0.125 2023-06-23 19:41:37,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=936018.0, ans=0.0 2023-06-23 19:41:43,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-23 19:42:08,052 INFO [train.py:996] (1/4) Epoch 6, batch 3550, loss[loss=0.258, simple_loss=0.2999, pruned_loss=0.1081, over 21373.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3174, pruned_loss=0.08527, over 4287865.76 frames. ], batch size: 508, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:42:15,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=936138.0, ans=0.125 2023-06-23 19:42:47,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=936198.0, ans=0.1 2023-06-23 19:43:03,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=936258.0, ans=0.2 2023-06-23 19:43:09,212 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-23 19:43:19,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=936318.0, ans=0.0 2023-06-23 19:43:47,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=936378.0, ans=0.125 2023-06-23 19:43:51,979 INFO [train.py:996] (1/4) Epoch 6, batch 3600, loss[loss=0.2498, simple_loss=0.3143, pruned_loss=0.09258, over 21560.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3116, pruned_loss=0.08452, over 4286584.06 frames. ], batch size: 389, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:45:00,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.636e+02 3.056e+02 3.547e+02 6.528e+02, threshold=6.113e+02, percent-clipped=1.0 2023-06-23 19:45:28,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=936678.0, ans=0.1 2023-06-23 19:45:32,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-23 19:45:48,103 INFO [train.py:996] (1/4) Epoch 6, batch 3650, loss[loss=0.1572, simple_loss=0.2054, pruned_loss=0.05451, over 17260.00 frames. 
], tot_loss[loss=0.2401, simple_loss=0.3119, pruned_loss=0.0842, over 4283632.64 frames. ], batch size: 61, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:45:57,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=936738.0, ans=0.0 2023-06-23 19:46:22,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=936798.0, ans=0.125 2023-06-23 19:46:48,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=936858.0, ans=0.0 2023-06-23 19:47:08,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=936918.0, ans=0.0 2023-06-23 19:47:36,912 INFO [train.py:996] (1/4) Epoch 6, batch 3700, loss[loss=0.2269, simple_loss=0.2985, pruned_loss=0.07764, over 21543.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.31, pruned_loss=0.08316, over 4275706.69 frames. ], batch size: 131, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:48:38,102 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.573e+02 2.941e+02 3.537e+02 5.018e+02, threshold=5.882e+02, percent-clipped=0.0 2023-06-23 19:49:05,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=937278.0, ans=0.125 2023-06-23 19:49:27,041 INFO [train.py:996] (1/4) Epoch 6, batch 3750, loss[loss=0.1917, simple_loss=0.2632, pruned_loss=0.06009, over 21600.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3079, pruned_loss=0.08235, over 4282492.74 frames. ], batch size: 195, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:49:35,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.21 vs. limit=5.0 2023-06-23 19:50:28,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=937458.0, ans=0.0 2023-06-23 19:51:29,606 INFO [train.py:996] (1/4) Epoch 6, batch 3800, loss[loss=0.22, simple_loss=0.3197, pruned_loss=0.06014, over 21197.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3047, pruned_loss=0.08038, over 4273286.46 frames. ], batch size: 548, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:51:31,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=937638.0, ans=0.0 2023-06-23 19:51:48,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=12.0 2023-06-23 19:52:01,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.30 vs. limit=22.5 2023-06-23 19:52:21,710 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.479e+02 2.831e+02 3.335e+02 6.491e+02, threshold=5.662e+02, percent-clipped=1.0 2023-06-23 19:52:22,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=937758.0, ans=0.1 2023-06-23 19:52:22,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=937758.0, ans=0.125 2023-06-23 19:53:20,148 INFO [train.py:996] (1/4) Epoch 6, batch 3850, loss[loss=0.3032, simple_loss=0.4158, pruned_loss=0.09531, over 19826.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3054, pruned_loss=0.08152, over 4261398.62 frames. 
], batch size: 702, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:53:36,664 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-06-23 19:53:58,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=938058.0, ans=10.0 2023-06-23 19:53:59,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=938058.0, ans=0.2 2023-06-23 19:55:09,879 INFO [train.py:996] (1/4) Epoch 6, batch 3900, loss[loss=0.2232, simple_loss=0.2974, pruned_loss=0.07453, over 21893.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3012, pruned_loss=0.08077, over 4262137.04 frames. ], batch size: 118, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:55:34,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=938298.0, ans=0.125 2023-06-23 19:56:02,984 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.781e+02 3.101e+02 3.883e+02 8.958e+02, threshold=6.202e+02, percent-clipped=3.0 2023-06-23 19:56:25,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=938418.0, ans=0.2 2023-06-23 19:56:52,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=938478.0, ans=0.5 2023-06-23 19:57:06,042 INFO [train.py:996] (1/4) Epoch 6, batch 3950, loss[loss=0.1791, simple_loss=0.254, pruned_loss=0.05207, over 21435.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3019, pruned_loss=0.0795, over 4265577.43 frames. ], batch size: 131, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:57:06,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=938538.0, ans=0.0 2023-06-23 19:57:30,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-23 19:57:41,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=938658.0, ans=0.07 2023-06-23 19:57:56,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-23 19:57:59,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=938718.0, ans=0.125 2023-06-23 19:58:56,505 INFO [train.py:996] (1/4) Epoch 6, batch 4000, loss[loss=0.1717, simple_loss=0.2458, pruned_loss=0.04883, over 21486.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2945, pruned_loss=0.07568, over 4268135.35 frames. 
], batch size: 230, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:59:02,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=938838.0, ans=0.125 2023-06-23 19:59:11,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=938838.0, ans=0.0 2023-06-23 19:59:18,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=938898.0, ans=0.125 2023-06-23 19:59:26,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-23 19:59:27,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=938898.0, ans=0.125 2023-06-23 19:59:29,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=938898.0, ans=0.125 2023-06-23 19:59:34,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-23 19:59:44,414 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.407e+02 2.711e+02 3.233e+02 5.039e+02, threshold=5.423e+02, percent-clipped=0.0 2023-06-23 20:00:21,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939078.0, ans=0.1 2023-06-23 20:00:47,404 INFO [train.py:996] (1/4) Epoch 6, batch 4050, loss[loss=0.204, simple_loss=0.2839, pruned_loss=0.06207, over 21470.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2942, pruned_loss=0.07396, over 4253158.76 frames. ], batch size: 211, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:00:53,718 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-23 20:01:13,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=939198.0, ans=0.125 2023-06-23 20:02:09,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=939378.0, ans=0.0 2023-06-23 20:02:32,575 INFO [train.py:996] (1/4) Epoch 6, batch 4100, loss[loss=0.2, simple_loss=0.2839, pruned_loss=0.05809, over 21602.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2971, pruned_loss=0.07409, over 4261642.40 frames. 
], batch size: 230, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:02:35,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=939438.0, ans=0.1 2023-06-23 20:03:19,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=939558.0, ans=0.0 2023-06-23 20:03:26,709 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.413e+02 2.658e+02 3.099e+02 5.779e+02, threshold=5.316e+02, percent-clipped=1.0 2023-06-23 20:03:27,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=939558.0, ans=0.125 2023-06-23 20:04:00,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-23 20:04:18,599 INFO [train.py:996] (1/4) Epoch 6, batch 4150, loss[loss=0.2307, simple_loss=0.3024, pruned_loss=0.07948, over 21577.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.298, pruned_loss=0.07234, over 4270653.15 frames. ], batch size: 548, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:04:24,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=939738.0, ans=0.125 2023-06-23 20:05:10,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=939858.0, ans=0.0 2023-06-23 20:05:51,271 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-23 20:06:00,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=939978.0, ans=0.09899494936611666 2023-06-23 20:06:01,821 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:06:12,053 INFO [train.py:996] (1/4) Epoch 6, batch 4200, loss[loss=0.2792, simple_loss=0.342, pruned_loss=0.1083, over 21469.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2974, pruned_loss=0.07158, over 4270132.36 frames. ], batch size: 473, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:07:00,981 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:07:17,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=940158.0, ans=0.2 2023-06-23 20:07:18,278 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.286e+02 2.656e+02 3.507e+02 6.693e+02, threshold=5.313e+02, percent-clipped=3.0 2023-06-23 20:08:05,644 INFO [train.py:996] (1/4) Epoch 6, batch 4250, loss[loss=0.2665, simple_loss=0.3523, pruned_loss=0.0904, over 21576.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3074, pruned_loss=0.07459, over 4266502.70 frames. ], batch size: 441, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:08:43,273 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:08:43,850 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.65 vs. 
limit=15.0 2023-06-23 20:09:29,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=940518.0, ans=0.125 2023-06-23 20:09:41,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=940578.0, ans=0.1 2023-06-23 20:09:59,055 INFO [train.py:996] (1/4) Epoch 6, batch 4300, loss[loss=0.2159, simple_loss=0.2859, pruned_loss=0.07291, over 21305.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3127, pruned_loss=0.07716, over 4266997.85 frames. ], batch size: 159, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:10:27,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=940698.0, ans=0.125 2023-06-23 20:11:11,076 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.724e+02 3.223e+02 4.213e+02 6.998e+02, threshold=6.446e+02, percent-clipped=6.0 2023-06-23 20:11:14,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=940818.0, ans=0.0 2023-06-23 20:11:31,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.88 vs. limit=15.0 2023-06-23 20:11:46,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940878.0, ans=0.1 2023-06-23 20:12:00,248 INFO [train.py:996] (1/4) Epoch 6, batch 4350, loss[loss=0.2166, simple_loss=0.2868, pruned_loss=0.0732, over 21827.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3113, pruned_loss=0.07662, over 4262222.87 frames. ], batch size: 107, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:12:16,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=940998.0, ans=0.0 2023-06-23 20:13:51,993 INFO [train.py:996] (1/4) Epoch 6, batch 4400, loss[loss=0.2211, simple_loss=0.3077, pruned_loss=0.06724, over 21902.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3049, pruned_loss=0.07534, over 4271225.94 frames. ], batch size: 373, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:13:52,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=941238.0, ans=0.125 2023-06-23 20:14:09,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=941238.0, ans=0.04949747468305833 2023-06-23 20:14:50,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=941358.0, ans=0.0 2023-06-23 20:14:53,317 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.531e+02 2.865e+02 3.462e+02 7.210e+02, threshold=5.730e+02, percent-clipped=2.0 2023-06-23 20:15:41,005 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=15.0 2023-06-23 20:15:43,304 INFO [train.py:996] (1/4) Epoch 6, batch 4450, loss[loss=0.2616, simple_loss=0.3659, pruned_loss=0.07866, over 21769.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3147, pruned_loss=0.07767, over 4270311.56 frames. 
], batch size: 332, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:16:07,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=941538.0, ans=0.125 2023-06-23 20:16:24,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-23 20:16:34,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-23 20:17:21,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=941778.0, ans=0.035 2023-06-23 20:17:38,975 INFO [train.py:996] (1/4) Epoch 6, batch 4500, loss[loss=0.2173, simple_loss=0.2894, pruned_loss=0.07263, over 20214.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3146, pruned_loss=0.07933, over 4273915.97 frames. ], batch size: 702, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:17:51,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=941838.0, ans=0.0 2023-06-23 20:18:25,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-06-23 20:18:30,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.30 vs. limit=5.0 2023-06-23 20:18:32,693 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.439e+02 2.793e+02 3.421e+02 5.110e+02, threshold=5.586e+02, percent-clipped=0.0 2023-06-23 20:18:56,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=942018.0, ans=0.0 2023-06-23 20:19:34,555 INFO [train.py:996] (1/4) Epoch 6, batch 4550, loss[loss=0.3115, simple_loss=0.3756, pruned_loss=0.1237, over 21414.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3168, pruned_loss=0.07926, over 4268399.40 frames. ], batch size: 471, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:20:02,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=942198.0, ans=0.125 2023-06-23 20:21:25,296 INFO [train.py:996] (1/4) Epoch 6, batch 4600, loss[loss=0.1999, simple_loss=0.2819, pruned_loss=0.05895, over 21821.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3188, pruned_loss=0.08136, over 4274267.58 frames. 
], batch size: 282, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:21:52,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=942498.0, ans=0.0 2023-06-23 20:22:17,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942558.0, ans=0.1 2023-06-23 20:22:25,590 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.585e+02 3.169e+02 3.580e+02 7.815e+02, threshold=6.337e+02, percent-clipped=3.0 2023-06-23 20:22:50,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=942678.0, ans=0.0 2023-06-23 20:22:50,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=942678.0, ans=0.05 2023-06-23 20:23:12,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=942738.0, ans=0.125 2023-06-23 20:23:13,750 INFO [train.py:996] (1/4) Epoch 6, batch 4650, loss[loss=0.2163, simple_loss=0.2924, pruned_loss=0.07005, over 21485.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3113, pruned_loss=0.07923, over 4283021.90 frames. ], batch size: 131, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:23:20,467 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-23 20:23:25,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-06-23 20:25:00,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=942978.0, ans=0.2 2023-06-23 20:25:03,314 INFO [train.py:996] (1/4) Epoch 6, batch 4700, loss[loss=0.207, simple_loss=0.2694, pruned_loss=0.07234, over 21518.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3017, pruned_loss=0.07648, over 4286914.73 frames. ], batch size: 391, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:25:12,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=943038.0, ans=0.025 2023-06-23 20:26:04,153 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.385e+02 2.698e+02 3.095e+02 5.090e+02, threshold=5.395e+02, percent-clipped=0.0 2023-06-23 20:26:50,578 INFO [train.py:996] (1/4) Epoch 6, batch 4750, loss[loss=0.237, simple_loss=0.3023, pruned_loss=0.08586, over 21714.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2957, pruned_loss=0.07568, over 4280626.82 frames. ], batch size: 391, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:27:02,066 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-23 20:27:14,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=943398.0, ans=0.0 2023-06-23 20:28:39,480 INFO [train.py:996] (1/4) Epoch 6, batch 4800, loss[loss=0.223, simple_loss=0.3371, pruned_loss=0.05445, over 19810.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.297, pruned_loss=0.07634, over 4279268.47 frames. 
], batch size: 703, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:28:43,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=943638.0, ans=0.0 2023-06-23 20:29:00,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. limit=15.0 2023-06-23 20:29:42,942 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.734e+02 3.125e+02 3.511e+02 5.007e+02, threshold=6.249e+02, percent-clipped=0.0 2023-06-23 20:29:50,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=943818.0, ans=0.125 2023-06-23 20:29:52,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=943818.0, ans=0.125 2023-06-23 20:30:21,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=943878.0, ans=0.0 2023-06-23 20:30:21,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-23 20:30:24,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=943878.0, ans=0.09899494936611666 2023-06-23 20:30:27,072 INFO [train.py:996] (1/4) Epoch 6, batch 4850, loss[loss=0.2211, simple_loss=0.2908, pruned_loss=0.07575, over 21640.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2956, pruned_loss=0.07508, over 4272645.39 frames. ], batch size: 230, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:30:30,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=943938.0, ans=0.0 2023-06-23 20:30:41,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=943938.0, ans=0.125 2023-06-23 20:31:07,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=943998.0, ans=0.0 2023-06-23 20:31:11,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=943998.0, ans=0.0 2023-06-23 20:31:25,575 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-23 20:31:27,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=944058.0, ans=0.125 2023-06-23 20:32:05,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=944178.0, ans=0.0 2023-06-23 20:32:17,552 INFO [train.py:996] (1/4) Epoch 6, batch 4900, loss[loss=0.2489, simple_loss=0.3208, pruned_loss=0.0885, over 21675.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2966, pruned_loss=0.07579, over 4281959.87 frames. 
], batch size: 389, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:33:28,633 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.472e+02 2.764e+02 3.016e+02 5.453e+02, threshold=5.528e+02, percent-clipped=0.0 2023-06-23 20:33:46,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=944418.0, ans=0.2 2023-06-23 20:33:58,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-23 20:34:00,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=15.0 2023-06-23 20:34:09,916 INFO [train.py:996] (1/4) Epoch 6, batch 4950, loss[loss=0.187, simple_loss=0.2956, pruned_loss=0.03916, over 21137.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3016, pruned_loss=0.0748, over 4272082.61 frames. ], batch size: 548, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:35:41,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=944778.0, ans=0.1 2023-06-23 20:35:57,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=944838.0, ans=0.5 2023-06-23 20:35:58,234 INFO [train.py:996] (1/4) Epoch 6, batch 5000, loss[loss=0.267, simple_loss=0.3486, pruned_loss=0.09265, over 21447.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3018, pruned_loss=0.07194, over 4278220.40 frames. ], batch size: 508, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:36:27,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=944898.0, ans=15.0 2023-06-23 20:36:29,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=944898.0, ans=0.125 2023-06-23 20:37:01,262 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.469e+02 2.951e+02 3.464e+02 5.172e+02, threshold=5.903e+02, percent-clipped=0.0 2023-06-23 20:37:32,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=945078.0, ans=0.0 2023-06-23 20:37:40,285 INFO [train.py:996] (1/4) Epoch 6, batch 5050, loss[loss=0.2717, simple_loss=0.3238, pruned_loss=0.1098, over 21684.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3036, pruned_loss=0.07425, over 4287188.95 frames. ], batch size: 473, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:38:14,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=945198.0, ans=0.0 2023-06-23 20:38:32,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=945258.0, ans=0.0 2023-06-23 20:39:26,500 INFO [train.py:996] (1/4) Epoch 6, batch 5100, loss[loss=0.1932, simple_loss=0.2761, pruned_loss=0.0552, over 21845.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3021, pruned_loss=0.07432, over 4285048.79 frames. 
], batch size: 332, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:39:28,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=945438.0, ans=0.0 2023-06-23 20:39:51,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=945498.0, ans=0.0 2023-06-23 20:40:30,256 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.802e+02 3.209e+02 3.785e+02 5.711e+02, threshold=6.418e+02, percent-clipped=0.0 2023-06-23 20:40:57,477 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-23 20:40:57,483 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.21 vs. limit=6.0 2023-06-23 20:41:15,795 INFO [train.py:996] (1/4) Epoch 6, batch 5150, loss[loss=0.2181, simple_loss=0.2814, pruned_loss=0.07739, over 21396.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3004, pruned_loss=0.07502, over 4290784.24 frames. ], batch size: 194, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:42:05,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=945858.0, ans=0.2 2023-06-23 20:42:25,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=945918.0, ans=0.125 2023-06-23 20:42:30,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=945918.0, ans=0.125 2023-06-23 20:43:04,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=946038.0, ans=0.0 2023-06-23 20:43:05,737 INFO [train.py:996] (1/4) Epoch 6, batch 5200, loss[loss=0.2956, simple_loss=0.3713, pruned_loss=0.1099, over 21548.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.302, pruned_loss=0.07596, over 4291003.93 frames. ], batch size: 508, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:43:17,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=946038.0, ans=0.125 2023-06-23 20:43:24,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946038.0, ans=0.1 2023-06-23 20:43:37,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.37 vs. 
limit=15.0 2023-06-23 20:44:14,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.657e+02 3.031e+02 3.767e+02 5.750e+02, threshold=6.062e+02, percent-clipped=0.0 2023-06-23 20:44:15,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=946218.0, ans=0.125 2023-06-23 20:44:38,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946278.0, ans=0.1 2023-06-23 20:44:45,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=946278.0, ans=0.125 2023-06-23 20:44:45,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=946278.0, ans=0.95 2023-06-23 20:44:49,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=946278.0, ans=0.04949747468305833 2023-06-23 20:44:59,542 INFO [train.py:996] (1/4) Epoch 6, batch 5250, loss[loss=0.2238, simple_loss=0.3159, pruned_loss=0.06586, over 21786.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3049, pruned_loss=0.07446, over 4283143.01 frames. ], batch size: 282, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:45:15,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-23 20:45:25,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=946398.0, ans=0.2 2023-06-23 20:46:52,869 INFO [train.py:996] (1/4) Epoch 6, batch 5300, loss[loss=0.2366, simple_loss=0.3027, pruned_loss=0.08528, over 21904.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3042, pruned_loss=0.07505, over 4283351.08 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:47:01,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-23 20:47:55,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.539e+02 2.781e+02 3.236e+02 4.836e+02, threshold=5.563e+02, percent-clipped=0.0 2023-06-23 20:48:03,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946818.0, ans=0.1 2023-06-23 20:48:41,810 INFO [train.py:996] (1/4) Epoch 6, batch 5350, loss[loss=0.2729, simple_loss=0.3184, pruned_loss=0.1137, over 21816.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3035, pruned_loss=0.07657, over 4286063.71 frames. ], batch size: 508, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:49:01,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=946998.0, ans=0.125 2023-06-23 20:49:06,181 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:49:15,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=947058.0, ans=0.1 2023-06-23 20:50:29,929 INFO [train.py:996] (1/4) Epoch 6, batch 5400, loss[loss=0.2172, simple_loss=0.2953, pruned_loss=0.06962, over 21920.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3022, pruned_loss=0.07792, over 4296891.28 frames. 
], batch size: 118, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:50:41,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2023-06-23 20:51:34,438 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.654e+02 3.257e+02 3.898e+02 6.722e+02, threshold=6.513e+02, percent-clipped=2.0 2023-06-23 20:52:07,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=947478.0, ans=0.125 2023-06-23 20:52:07,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=947478.0, ans=0.125 2023-06-23 20:52:19,512 INFO [train.py:996] (1/4) Epoch 6, batch 5450, loss[loss=0.2009, simple_loss=0.2894, pruned_loss=0.05618, over 21617.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3034, pruned_loss=0.0765, over 4295442.70 frames. ], batch size: 230, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:52:21,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=947538.0, ans=0.0 2023-06-23 20:52:48,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=947598.0, ans=0.1 2023-06-23 20:54:09,234 INFO [train.py:996] (1/4) Epoch 6, batch 5500, loss[loss=0.2313, simple_loss=0.3262, pruned_loss=0.06826, over 21734.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.308, pruned_loss=0.0733, over 4283722.02 frames. ], batch size: 351, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:54:51,421 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:55:03,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=947958.0, ans=0.0 2023-06-23 20:55:24,830 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.255e+02 2.654e+02 3.007e+02 4.668e+02, threshold=5.308e+02, percent-clipped=0.0 2023-06-23 20:56:04,056 INFO [train.py:996] (1/4) Epoch 6, batch 5550, loss[loss=0.1286, simple_loss=0.1834, pruned_loss=0.03694, over 16216.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3067, pruned_loss=0.07038, over 4281987.02 frames. ], batch size: 61, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:56:31,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=948198.0, ans=0.0 2023-06-23 20:57:04,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=948258.0, ans=0.1 2023-06-23 20:57:04,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=948258.0, ans=0.1 2023-06-23 20:57:21,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=948318.0, ans=0.05 2023-06-23 20:57:42,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=948378.0, ans=0.125 2023-06-23 20:57:56,424 INFO [train.py:996] (1/4) Epoch 6, batch 5600, loss[loss=0.2385, simple_loss=0.3294, pruned_loss=0.07375, over 21661.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3044, pruned_loss=0.06786, over 4276840.37 frames. 
], batch size: 263, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 20:58:02,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=948438.0, ans=0.125 2023-06-23 20:58:54,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=948558.0, ans=0.125 2023-06-23 20:58:56,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=948558.0, ans=0.125 2023-06-23 20:59:01,213 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.332e+02 2.800e+02 3.364e+02 5.770e+02, threshold=5.601e+02, percent-clipped=3.0 2023-06-23 20:59:14,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-23 20:59:44,388 INFO [train.py:996] (1/4) Epoch 6, batch 5650, loss[loss=0.2527, simple_loss=0.326, pruned_loss=0.08964, over 21724.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3083, pruned_loss=0.07048, over 4275058.56 frames. ], batch size: 389, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 20:59:57,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=948738.0, ans=0.0 2023-06-23 21:00:46,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-23 21:00:53,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=948918.0, ans=0.0 2023-06-23 21:01:01,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=948918.0, ans=0.125 2023-06-23 21:01:29,430 INFO [train.py:996] (1/4) Epoch 6, batch 5700, loss[loss=0.2047, simple_loss=0.2872, pruned_loss=0.06113, over 21616.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3082, pruned_loss=0.07243, over 4277208.74 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:02:17,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=949158.0, ans=0.125 2023-06-23 21:02:41,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.515e+02 2.975e+02 3.453e+02 5.794e+02, threshold=5.950e+02, percent-clipped=1.0 2023-06-23 21:02:58,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=949218.0, ans=0.0 2023-06-23 21:03:12,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=949278.0, ans=0.125 2023-06-23 21:03:31,979 INFO [train.py:996] (1/4) Epoch 6, batch 5750, loss[loss=0.187, simple_loss=0.2788, pruned_loss=0.04759, over 21698.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3015, pruned_loss=0.06926, over 4267237.85 frames. 
], batch size: 263, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:03:49,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=949398.0, ans=0.07 2023-06-23 21:03:59,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=949398.0, ans=0.125 2023-06-23 21:04:08,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=949458.0, ans=0.125 2023-06-23 21:05:18,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=949578.0, ans=0.125 2023-06-23 21:05:22,448 INFO [train.py:996] (1/4) Epoch 6, batch 5800, loss[loss=0.1988, simple_loss=0.2855, pruned_loss=0.05606, over 21263.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3008, pruned_loss=0.06811, over 4268142.55 frames. ], batch size: 144, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:05:31,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=949638.0, ans=0.125 2023-06-23 21:06:02,504 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-23 21:06:27,700 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 2.304e+02 2.799e+02 4.068e+02 6.558e+02, threshold=5.598e+02, percent-clipped=2.0 2023-06-23 21:06:55,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=949878.0, ans=0.0 2023-06-23 21:07:04,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=949878.0, ans=0.2 2023-06-23 21:07:06,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=949878.0, ans=0.0 2023-06-23 21:07:12,467 INFO [train.py:996] (1/4) Epoch 6, batch 5850, loss[loss=0.1762, simple_loss=0.277, pruned_loss=0.03765, over 21783.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.299, pruned_loss=0.06456, over 4265926.20 frames. ], batch size: 332, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:08:54,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=950238.0, ans=0.0 2023-06-23 21:08:55,270 INFO [train.py:996] (1/4) Epoch 6, batch 5900, loss[loss=0.2504, simple_loss=0.3171, pruned_loss=0.09185, over 21611.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2917, pruned_loss=0.05934, over 4270916.43 frames. ], batch size: 507, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:09:13,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-23 21:09:27,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950298.0, ans=0.1 2023-06-23 21:09:43,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=10.0 2023-06-23 21:09:53,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=950358.0, ans=0.125 2023-06-23 21:09:57,800 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.988e+02 2.407e+02 3.041e+02 4.833e+02, threshold=4.814e+02, percent-clipped=0.0 2023-06-23 21:10:16,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=950418.0, ans=0.0 2023-06-23 21:10:25,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-23 21:10:41,824 INFO [train.py:996] (1/4) Epoch 6, batch 5950, loss[loss=0.2152, simple_loss=0.2785, pruned_loss=0.07592, over 21216.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2909, pruned_loss=0.06297, over 4280142.19 frames. ], batch size: 176, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:11:31,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=950658.0, ans=0.125 2023-06-23 21:12:25,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=950778.0, ans=0.125 2023-06-23 21:12:30,035 INFO [train.py:996] (1/4) Epoch 6, batch 6000, loss[loss=0.1906, simple_loss=0.2591, pruned_loss=0.06102, over 21806.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2871, pruned_loss=0.06626, over 4285660.58 frames. ], batch size: 118, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:12:30,036 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 21:12:53,040 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2596, simple_loss=0.3528, pruned_loss=0.08322, over 1796401.00 frames. 2023-06-23 21:12:53,041 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23385MB 2023-06-23 21:13:26,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=950898.0, ans=0.125 2023-06-23 21:13:27,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-23 21:13:30,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950898.0, ans=0.1 2023-06-23 21:14:03,938 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.620e+02 2.865e+02 3.269e+02 5.211e+02, threshold=5.729e+02, percent-clipped=1.0 2023-06-23 21:14:04,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=951018.0, ans=0.05 2023-06-23 21:14:48,479 INFO [train.py:996] (1/4) Epoch 6, batch 6050, loss[loss=0.193, simple_loss=0.2569, pruned_loss=0.06457, over 21922.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2833, pruned_loss=0.06738, over 4284484.76 frames. ], batch size: 113, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:15:05,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. 
limit=15.0 2023-06-23 21:16:01,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=951318.0, ans=0.125 2023-06-23 21:16:30,421 INFO [train.py:996] (1/4) Epoch 6, batch 6100, loss[loss=0.2044, simple_loss=0.2803, pruned_loss=0.06421, over 21656.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2828, pruned_loss=0.06642, over 4283580.25 frames. ], batch size: 263, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:16:40,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=951438.0, ans=0.125 2023-06-23 21:16:52,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=951498.0, ans=0.125 2023-06-23 21:16:58,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=951498.0, ans=0.125 2023-06-23 21:17:29,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=951558.0, ans=0.1 2023-06-23 21:17:40,846 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 2.204e+02 2.422e+02 2.717e+02 3.811e+02, threshold=4.844e+02, percent-clipped=0.0 2023-06-23 21:17:43,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=22.5 2023-06-23 21:17:53,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=951618.0, ans=0.1 2023-06-23 21:18:18,507 INFO [train.py:996] (1/4) Epoch 6, batch 6150, loss[loss=0.1971, simple_loss=0.2628, pruned_loss=0.06568, over 21078.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2862, pruned_loss=0.06903, over 4280025.53 frames. ], batch size: 143, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:18:22,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=951738.0, ans=0.125 2023-06-23 21:18:40,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=951798.0, ans=0.0 2023-06-23 21:19:08,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=951858.0, ans=0.125 2023-06-23 21:19:21,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=951918.0, ans=0.1 2023-06-23 21:19:31,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-23 21:20:00,599 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-23 21:20:08,100 INFO [train.py:996] (1/4) Epoch 6, batch 6200, loss[loss=0.2312, simple_loss=0.3014, pruned_loss=0.08055, over 21534.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2914, pruned_loss=0.06944, over 4271173.22 frames. 
], batch size: 131, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:20:49,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=952098.0, ans=0.0 2023-06-23 21:20:52,114 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-23 21:20:56,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952158.0, ans=0.1 2023-06-23 21:21:01,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=952158.0, ans=0.1 2023-06-23 21:21:15,537 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.446e+02 2.781e+02 3.201e+02 6.151e+02, threshold=5.562e+02, percent-clipped=2.0 2023-06-23 21:21:55,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=952278.0, ans=0.04949747468305833 2023-06-23 21:21:58,227 INFO [train.py:996] (1/4) Epoch 6, batch 6250, loss[loss=0.2001, simple_loss=0.3011, pruned_loss=0.04957, over 21636.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2953, pruned_loss=0.0683, over 4270387.05 frames. ], batch size: 263, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:22:00,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=952338.0, ans=0.125 2023-06-23 21:22:02,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-23 21:23:08,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=952518.0, ans=0.04949747468305833 2023-06-23 21:23:45,355 INFO [train.py:996] (1/4) Epoch 6, batch 6300, loss[loss=0.2116, simple_loss=0.2944, pruned_loss=0.06438, over 21497.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2987, pruned_loss=0.06737, over 4270532.02 frames. ], batch size: 548, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:24:57,683 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.558e+02 3.046e+02 3.782e+02 6.709e+02, threshold=6.092e+02, percent-clipped=4.0 2023-06-23 21:25:08,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=952818.0, ans=0.125 2023-06-23 21:25:17,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=952878.0, ans=0.0 2023-06-23 21:25:24,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=952878.0, ans=0.125 2023-06-23 21:25:34,565 INFO [train.py:996] (1/4) Epoch 6, batch 6350, loss[loss=0.2369, simple_loss=0.3101, pruned_loss=0.08182, over 21774.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.302, pruned_loss=0.07188, over 4272990.26 frames. 
], batch size: 298, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:26:21,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=953058.0, ans=0.04949747468305833 2023-06-23 21:26:35,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=953058.0, ans=0.125 2023-06-23 21:26:47,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=953118.0, ans=0.2 2023-06-23 21:26:50,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=953118.0, ans=0.2 2023-06-23 21:27:03,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=953118.0, ans=0.1 2023-06-23 21:27:06,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-23 21:27:29,911 INFO [train.py:996] (1/4) Epoch 6, batch 6400, loss[loss=0.2372, simple_loss=0.3168, pruned_loss=0.07882, over 21452.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3078, pruned_loss=0.07608, over 4278119.46 frames. ], batch size: 131, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:28:01,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=953298.0, ans=0.2 2023-06-23 21:28:42,594 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.766e+02 2.997e+02 3.346e+02 4.721e+02, threshold=5.994e+02, percent-clipped=0.0 2023-06-23 21:28:46,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953418.0, ans=0.1 2023-06-23 21:29:18,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=953478.0, ans=0.125 2023-06-23 21:29:24,305 INFO [train.py:996] (1/4) Epoch 6, batch 6450, loss[loss=0.2118, simple_loss=0.2856, pruned_loss=0.06898, over 21183.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3128, pruned_loss=0.07651, over 4271165.14 frames. ], batch size: 143, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:30:02,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=953598.0, ans=0.07 2023-06-23 21:30:17,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=953658.0, ans=0.0 2023-06-23 21:31:13,656 INFO [train.py:996] (1/4) Epoch 6, batch 6500, loss[loss=0.1971, simple_loss=0.264, pruned_loss=0.06509, over 21208.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3062, pruned_loss=0.0756, over 4274939.48 frames. 
], batch size: 176, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:31:17,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=953838.0, ans=0.125 2023-06-23 21:31:59,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=953958.0, ans=0.125 2023-06-23 21:32:07,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=953958.0, ans=10.0 2023-06-23 21:32:18,653 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.470e+02 2.695e+02 2.978e+02 5.209e+02, threshold=5.391e+02, percent-clipped=0.0 2023-06-23 21:32:26,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=954018.0, ans=0.125 2023-06-23 21:32:48,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=954078.0, ans=0.1 2023-06-23 21:32:51,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=954078.0, ans=0.125 2023-06-23 21:33:00,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=22.5 2023-06-23 21:33:01,244 INFO [train.py:996] (1/4) Epoch 6, batch 6550, loss[loss=0.2115, simple_loss=0.3172, pruned_loss=0.05294, over 21305.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3042, pruned_loss=0.07409, over 4281025.12 frames. ], batch size: 548, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:33:22,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=954198.0, ans=0.125 2023-06-23 21:33:42,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=954198.0, ans=0.125 2023-06-23 21:34:24,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=954378.0, ans=0.0 2023-06-23 21:34:47,733 INFO [train.py:996] (1/4) Epoch 6, batch 6600, loss[loss=0.1956, simple_loss=0.2828, pruned_loss=0.0542, over 20972.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2977, pruned_loss=0.07371, over 4285215.27 frames. ], batch size: 608, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:34:49,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=954438.0, ans=0.2 2023-06-23 21:35:25,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=954498.0, ans=0.1 2023-06-23 21:35:41,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=954558.0, ans=0.0 2023-06-23 21:36:01,782 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.286e+02 2.575e+02 2.928e+02 5.219e+02, threshold=5.150e+02, percent-clipped=0.0 2023-06-23 21:36:27,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=954678.0, ans=0.2 2023-06-23 21:36:35,403 INFO [train.py:996] (1/4) Epoch 6, batch 6650, loss[loss=0.1999, simple_loss=0.2609, pruned_loss=0.06952, over 21522.00 frames. 
], tot_loss[loss=0.2148, simple_loss=0.2896, pruned_loss=0.07003, over 4276071.46 frames. ], batch size: 195, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:36:39,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=954738.0, ans=0.125 2023-06-23 21:36:43,905 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.79 vs. limit=6.0 2023-06-23 21:37:02,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=954798.0, ans=0.035 2023-06-23 21:37:23,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=954858.0, ans=0.125 2023-06-23 21:37:43,593 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:38:04,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=954978.0, ans=0.0 2023-06-23 21:38:18,824 INFO [train.py:996] (1/4) Epoch 6, batch 6700, loss[loss=0.2409, simple_loss=0.3061, pruned_loss=0.08781, over 21536.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2854, pruned_loss=0.07081, over 4276415.79 frames. ], batch size: 442, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:39:34,439 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.289e+02 2.607e+02 3.016e+02 4.316e+02, threshold=5.215e+02, percent-clipped=0.0 2023-06-23 21:40:07,895 INFO [train.py:996] (1/4) Epoch 6, batch 6750, loss[loss=0.2395, simple_loss=0.3041, pruned_loss=0.08747, over 21677.00 frames. ], tot_loss[loss=0.212, simple_loss=0.282, pruned_loss=0.07094, over 4269753.97 frames. ], batch size: 441, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:41:01,296 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:41:48,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=955578.0, ans=0.2 2023-06-23 21:41:54,992 INFO [train.py:996] (1/4) Epoch 6, batch 6800, loss[loss=0.2211, simple_loss=0.2803, pruned_loss=0.08096, over 21538.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2845, pruned_loss=0.07281, over 4274184.30 frames. ], batch size: 414, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:41:57,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=955638.0, ans=0.125 2023-06-23 21:42:44,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=955758.0, ans=0.0 2023-06-23 21:43:03,402 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.510e+02 2.967e+02 3.494e+02 5.351e+02, threshold=5.935e+02, percent-clipped=1.0 2023-06-23 21:43:42,661 INFO [train.py:996] (1/4) Epoch 6, batch 6850, loss[loss=0.2229, simple_loss=0.2903, pruned_loss=0.07779, over 21686.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2847, pruned_loss=0.07449, over 4278943.76 frames. 
], batch size: 389, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:44:31,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=956058.0, ans=0.0 2023-06-23 21:45:23,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=956178.0, ans=0.07 2023-06-23 21:45:32,167 INFO [train.py:996] (1/4) Epoch 6, batch 6900, loss[loss=0.2762, simple_loss=0.3928, pruned_loss=0.07982, over 19813.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2874, pruned_loss=0.0752, over 4279218.51 frames. ], batch size: 702, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:46:46,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=956418.0, ans=0.125 2023-06-23 21:46:49,834 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 2.526e+02 2.937e+02 3.629e+02 5.523e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-23 21:47:27,765 INFO [train.py:996] (1/4) Epoch 6, batch 6950, loss[loss=0.1788, simple_loss=0.2654, pruned_loss=0.04606, over 21385.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2906, pruned_loss=0.07155, over 4278535.67 frames. ], batch size: 211, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:47:36,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=956538.0, ans=0.2 2023-06-23 21:48:11,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=956658.0, ans=0.1 2023-06-23 21:48:30,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=956718.0, ans=0.2 2023-06-23 21:48:32,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=956718.0, ans=0.1 2023-06-23 21:48:49,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=956778.0, ans=0.125 2023-06-23 21:48:49,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=956778.0, ans=0.5 2023-06-23 21:48:57,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=956778.0, ans=0.0 2023-06-23 21:49:14,664 INFO [train.py:996] (1/4) Epoch 6, batch 7000, loss[loss=0.2162, simple_loss=0.2768, pruned_loss=0.07775, over 21770.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2946, pruned_loss=0.07482, over 4282601.53 frames. ], batch size: 112, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:49:22,545 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:50:27,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.602e+02 2.936e+02 3.362e+02 6.122e+02, threshold=5.872e+02, percent-clipped=1.0 2023-06-23 21:50:41,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. 
limit=15.0 2023-06-23 21:50:42,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=957078.0, ans=0.125 2023-06-23 21:51:05,575 INFO [train.py:996] (1/4) Epoch 6, batch 7050, loss[loss=0.2416, simple_loss=0.3056, pruned_loss=0.08878, over 20819.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2932, pruned_loss=0.07376, over 4276052.20 frames. ], batch size: 608, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:51:11,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=957138.0, ans=0.0 2023-06-23 21:51:59,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=957258.0, ans=0.125 2023-06-23 21:52:03,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=12.0 2023-06-23 21:52:22,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=957318.0, ans=0.125 2023-06-23 21:52:22,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=957318.0, ans=0.04949747468305833 2023-06-23 21:52:39,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957378.0, ans=0.1 2023-06-23 21:52:49,880 INFO [train.py:996] (1/4) Epoch 6, batch 7100, loss[loss=0.2223, simple_loss=0.3195, pruned_loss=0.0625, over 19882.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2975, pruned_loss=0.07552, over 4268832.40 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:53:48,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957558.0, ans=0.1 2023-06-23 21:54:06,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.381e+02 2.673e+02 3.454e+02 5.437e+02, threshold=5.346e+02, percent-clipped=0.0 2023-06-23 21:54:23,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=957678.0, ans=0.125 2023-06-23 21:54:35,283 INFO [train.py:996] (1/4) Epoch 6, batch 7150, loss[loss=0.2771, simple_loss=0.3501, pruned_loss=0.1021, over 21801.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2949, pruned_loss=0.07308, over 4274462.31 frames. ], batch size: 118, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:54:53,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=957798.0, ans=0.0 2023-06-23 21:54:54,098 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=22.5 2023-06-23 21:55:00,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.17 vs. limit=12.0 2023-06-23 21:55:24,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=957858.0, ans=0.125 2023-06-23 21:55:28,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.39 vs. 
limit=22.5 2023-06-23 21:55:55,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=957918.0, ans=0.125 2023-06-23 21:56:24,655 INFO [train.py:996] (1/4) Epoch 6, batch 7200, loss[loss=0.1964, simple_loss=0.2666, pruned_loss=0.0631, over 21658.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2968, pruned_loss=0.07464, over 4278062.53 frames. ], batch size: 298, lr: 5.23e-03, grad_scale: 32.0 2023-06-23 21:56:42,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=958098.0, ans=0.125 2023-06-23 21:57:18,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=958158.0, ans=0.125 2023-06-23 21:57:46,111 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.518e+02 2.883e+02 3.559e+02 6.632e+02, threshold=5.766e+02, percent-clipped=3.0 2023-06-23 21:57:49,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=958218.0, ans=0.2 2023-06-23 21:58:13,486 INFO [train.py:996] (1/4) Epoch 6, batch 7250, loss[loss=0.1948, simple_loss=0.2465, pruned_loss=0.07148, over 20783.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2921, pruned_loss=0.07477, over 4274198.01 frames. ], batch size: 608, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 21:58:14,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=958338.0, ans=0.125 2023-06-23 21:58:41,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-23 21:59:47,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-23 22:00:01,985 INFO [train.py:996] (1/4) Epoch 6, batch 7300, loss[loss=0.1793, simple_loss=0.2451, pruned_loss=0.05677, over 21781.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2868, pruned_loss=0.07351, over 4267917.85 frames. ], batch size: 118, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:00:33,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=958698.0, ans=0.0 2023-06-23 22:01:07,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=958758.0, ans=0.07 2023-06-23 22:01:24,348 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.461e+02 2.779e+02 3.106e+02 5.760e+02, threshold=5.558e+02, percent-clipped=0.0 2023-06-23 22:01:25,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=958818.0, ans=0.125 2023-06-23 22:01:48,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=958878.0, ans=0.125 2023-06-23 22:01:51,243 INFO [train.py:996] (1/4) Epoch 6, batch 7350, loss[loss=0.1903, simple_loss=0.2469, pruned_loss=0.06689, over 20736.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2845, pruned_loss=0.0732, over 4262804.93 frames. 
], batch size: 609, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:02:18,174 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:02:20,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=958998.0, ans=0.125 2023-06-23 22:03:37,925 INFO [train.py:996] (1/4) Epoch 6, batch 7400, loss[loss=0.2139, simple_loss=0.2858, pruned_loss=0.07095, over 21354.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2915, pruned_loss=0.07459, over 4256026.39 frames. ], batch size: 159, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:03:59,019 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:04:34,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=959358.0, ans=0.05 2023-06-23 22:04:53,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=959418.0, ans=0.125 2023-06-23 22:05:00,954 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.692e+02 3.073e+02 3.719e+02 6.060e+02, threshold=6.147e+02, percent-clipped=2.0 2023-06-23 22:05:07,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=959418.0, ans=0.0 2023-06-23 22:05:08,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=959478.0, ans=0.0 2023-06-23 22:05:12,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=959478.0, ans=0.025 2023-06-23 22:05:17,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=959478.0, ans=0.125 2023-06-23 22:05:19,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=959478.0, ans=0.125 2023-06-23 22:05:37,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=959538.0, ans=0.1 2023-06-23 22:05:39,090 INFO [train.py:996] (1/4) Epoch 6, batch 7450, loss[loss=0.2216, simple_loss=0.2808, pruned_loss=0.08123, over 21303.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2888, pruned_loss=0.07424, over 4260611.94 frames. ], batch size: 160, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:05:43,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=959538.0, ans=0.125 2023-06-23 22:06:06,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=959598.0, ans=0.125 2023-06-23 22:06:23,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=959658.0, ans=0.125 2023-06-23 22:06:40,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=959658.0, ans=0.0 2023-06-23 22:07:30,945 INFO [train.py:996] (1/4) Epoch 6, batch 7500, loss[loss=0.2562, simple_loss=0.3557, pruned_loss=0.07831, over 21625.00 frames. 
], tot_loss[loss=0.2237, simple_loss=0.2947, pruned_loss=0.07638, over 4271342.83 frames. ], batch size: 263, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:08:01,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=959898.0, ans=0.125 2023-06-23 22:08:10,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=959898.0, ans=0.07 2023-06-23 22:08:27,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=959958.0, ans=0.125 2023-06-23 22:08:30,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=959958.0, ans=0.0 2023-06-23 22:08:41,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-23 22:08:44,095 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.824e+02 3.431e+02 4.118e+02 7.261e+02, threshold=6.863e+02, percent-clipped=3.0 2023-06-23 22:08:59,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=960078.0, ans=0.2 2023-06-23 22:09:20,920 INFO [train.py:996] (1/4) Epoch 6, batch 7550, loss[loss=0.2104, simple_loss=0.2999, pruned_loss=0.06044, over 21654.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3018, pruned_loss=0.07544, over 4268883.12 frames. ], batch size: 263, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:09:23,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=960138.0, ans=0.125 2023-06-23 22:09:43,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960138.0, ans=0.1 2023-06-23 22:10:00,536 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:10:21,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960318.0, ans=0.1 2023-06-23 22:10:41,833 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:11:08,244 INFO [train.py:996] (1/4) Epoch 6, batch 7600, loss[loss=0.2034, simple_loss=0.2569, pruned_loss=0.07492, over 20953.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3002, pruned_loss=0.07378, over 4263251.87 frames. ], batch size: 613, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:11:08,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=960438.0, ans=0.125 2023-06-23 22:11:08,944 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:11:19,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=960438.0, ans=0.5 2023-06-23 22:11:59,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.78 vs. 
limit=15.0 2023-06-23 22:12:09,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=960618.0, ans=0.2 2023-06-23 22:12:18,809 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.489e+02 2.806e+02 3.400e+02 5.423e+02, threshold=5.611e+02, percent-clipped=0.0 2023-06-23 22:12:26,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=960678.0, ans=0.2 2023-06-23 22:12:48,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960678.0, ans=0.1 2023-06-23 22:12:56,163 INFO [train.py:996] (1/4) Epoch 6, batch 7650, loss[loss=0.2752, simple_loss=0.3193, pruned_loss=0.1155, over 21814.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2992, pruned_loss=0.07542, over 4275879.37 frames. ], batch size: 508, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:13:03,980 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:13:18,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=960738.0, ans=0.0 2023-06-23 22:13:18,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=960738.0, ans=0.0 2023-06-23 22:13:24,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.93 vs. limit=15.0 2023-06-23 22:13:50,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=960858.0, ans=0.0 2023-06-23 22:14:05,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=960918.0, ans=0.125 2023-06-23 22:14:51,522 INFO [train.py:996] (1/4) Epoch 6, batch 7700, loss[loss=0.224, simple_loss=0.2941, pruned_loss=0.07692, over 21840.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3029, pruned_loss=0.0788, over 4285447.59 frames. ], batch size: 282, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:15:54,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-23 22:16:06,416 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.621e+02 3.080e+02 3.761e+02 5.045e+02, threshold=6.161e+02, percent-clipped=0.0 2023-06-23 22:16:43,588 INFO [train.py:996] (1/4) Epoch 6, batch 7750, loss[loss=0.2726, simple_loss=0.3746, pruned_loss=0.08535, over 21769.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3075, pruned_loss=0.0787, over 4278487.47 frames. ], batch size: 332, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:17:00,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=961398.0, ans=0.125 2023-06-23 22:18:17,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=961578.0, ans=0.0 2023-06-23 22:18:18,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. 
limit=15.0 2023-06-23 22:18:20,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=961578.0, ans=0.0 2023-06-23 22:18:26,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=961578.0, ans=0.125 2023-06-23 22:18:34,452 INFO [train.py:996] (1/4) Epoch 6, batch 7800, loss[loss=0.2001, simple_loss=0.2737, pruned_loss=0.06326, over 21668.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3085, pruned_loss=0.07836, over 4265576.91 frames. ], batch size: 263, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:18:43,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-23 22:19:06,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-23 22:19:39,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=961818.0, ans=0.125 2023-06-23 22:19:45,757 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.845e+02 3.471e+02 4.135e+02 7.731e+02, threshold=6.941e+02, percent-clipped=4.0 2023-06-23 22:20:00,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.32 vs. limit=15.0 2023-06-23 22:20:21,462 INFO [train.py:996] (1/4) Epoch 6, batch 7850, loss[loss=0.2047, simple_loss=0.2692, pruned_loss=0.07008, over 21472.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.301, pruned_loss=0.07755, over 4273376.05 frames. ], batch size: 195, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:20:24,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=961938.0, ans=0.0 2023-06-23 22:20:46,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=961998.0, ans=0.2 2023-06-23 22:20:53,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=961998.0, ans=0.0 2023-06-23 22:21:14,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=962058.0, ans=0.0 2023-06-23 22:22:15,674 INFO [train.py:996] (1/4) Epoch 6, batch 7900, loss[loss=0.2699, simple_loss=0.3656, pruned_loss=0.08711, over 21634.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.297, pruned_loss=0.07707, over 4278652.95 frames. 
], batch size: 414, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:22:29,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=962238.0, ans=0.1 2023-06-23 22:22:44,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=962298.0, ans=0.0 2023-06-23 22:23:36,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.814e+02 3.173e+02 3.712e+02 6.452e+02, threshold=6.346e+02, percent-clipped=0.0 2023-06-23 22:24:02,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=962538.0, ans=0.125 2023-06-23 22:24:03,323 INFO [train.py:996] (1/4) Epoch 6, batch 7950, loss[loss=0.2345, simple_loss=0.3115, pruned_loss=0.07872, over 21469.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2996, pruned_loss=0.07559, over 4275761.98 frames. ], batch size: 194, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:24:55,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=962658.0, ans=0.125 2023-06-23 22:25:09,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-23 22:25:21,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=962718.0, ans=0.0 2023-06-23 22:25:44,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=962778.0, ans=0.2 2023-06-23 22:25:58,611 INFO [train.py:996] (1/4) Epoch 6, batch 8000, loss[loss=0.222, simple_loss=0.3127, pruned_loss=0.06562, over 21814.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3061, pruned_loss=0.07792, over 4271760.04 frames. ], batch size: 282, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:26:52,452 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5 2023-06-23 22:27:14,861 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-23 22:27:20,912 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.660e+02 3.200e+02 3.986e+02 6.358e+02, threshold=6.400e+02, percent-clipped=1.0 2023-06-23 22:27:50,019 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=6.0 2023-06-23 22:27:56,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=963078.0, ans=0.125 2023-06-23 22:27:59,888 INFO [train.py:996] (1/4) Epoch 6, batch 8050, loss[loss=0.2558, simple_loss=0.3376, pruned_loss=0.087, over 21719.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.309, pruned_loss=0.07836, over 4264093.32 frames. 
], batch size: 351, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:28:37,091 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:28:44,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=963258.0, ans=0.125 2023-06-23 22:28:55,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-23 22:29:09,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=963318.0, ans=0.125 2023-06-23 22:29:51,719 INFO [train.py:996] (1/4) Epoch 6, batch 8100, loss[loss=0.2242, simple_loss=0.3058, pruned_loss=0.07132, over 21518.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3092, pruned_loss=0.07977, over 4275633.96 frames. ], batch size: 131, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:29:54,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=963438.0, ans=0.04949747468305833 2023-06-23 22:31:16,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-23 22:31:23,062 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.897e+02 3.319e+02 4.086e+02 8.225e+02, threshold=6.637e+02, percent-clipped=1.0 2023-06-23 22:31:26,014 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-23 22:31:43,277 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:31:58,985 INFO [train.py:996] (1/4) Epoch 6, batch 8150, loss[loss=0.2615, simple_loss=0.379, pruned_loss=0.07207, over 21159.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3166, pruned_loss=0.08149, over 4276104.55 frames. ], batch size: 548, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:32:00,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=963738.0, ans=0.125 2023-06-23 22:32:24,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.05 vs. limit=12.0 2023-06-23 22:32:44,931 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2023-06-23 22:32:53,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.13 vs. 
limit=15.0 2023-06-23 22:32:54,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=963858.0, ans=0.125 2023-06-23 22:32:56,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963858.0, ans=0.1 2023-06-23 22:32:57,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=963858.0, ans=0.2 2023-06-23 22:33:46,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=964038.0, ans=0.2 2023-06-23 22:33:48,079 INFO [train.py:996] (1/4) Epoch 6, batch 8200, loss[loss=0.2228, simple_loss=0.2834, pruned_loss=0.08106, over 21848.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3101, pruned_loss=0.07893, over 4278027.47 frames. ], batch size: 373, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:34:04,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-23 22:34:15,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=964098.0, ans=0.125 2023-06-23 22:34:26,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964098.0, ans=0.1 2023-06-23 22:34:38,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=964158.0, ans=0.2 2023-06-23 22:34:55,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-23 22:35:09,638 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.474e+02 2.845e+02 3.510e+02 6.334e+02, threshold=5.689e+02, percent-clipped=0.0 2023-06-23 22:35:39,843 INFO [train.py:996] (1/4) Epoch 6, batch 8250, loss[loss=0.2413, simple_loss=0.3419, pruned_loss=0.07035, over 21606.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3082, pruned_loss=0.0789, over 4274831.84 frames. ], batch size: 389, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:36:11,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=964398.0, ans=10.0 2023-06-23 22:36:40,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=964458.0, ans=0.1 2023-06-23 22:36:45,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=964518.0, ans=0.0 2023-06-23 22:37:17,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.66 vs. limit=15.0 2023-06-23 22:37:19,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-23 22:37:26,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964578.0, ans=0.1 2023-06-23 22:37:30,667 INFO [train.py:996] (1/4) Epoch 6, batch 8300, loss[loss=0.2151, simple_loss=0.2867, pruned_loss=0.07176, over 21383.00 frames. 
], tot_loss[loss=0.2287, simple_loss=0.3066, pruned_loss=0.0754, over 4277868.87 frames. ], batch size: 211, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:38:25,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=964758.0, ans=0.0 2023-06-23 22:38:49,164 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.358e+02 2.866e+02 3.291e+02 6.256e+02, threshold=5.732e+02, percent-clipped=2.0 2023-06-23 22:39:02,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-23 22:39:19,234 INFO [train.py:996] (1/4) Epoch 6, batch 8350, loss[loss=0.2194, simple_loss=0.2995, pruned_loss=0.06964, over 21679.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3056, pruned_loss=0.07338, over 4277987.46 frames. ], batch size: 263, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:39:32,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=964938.0, ans=0.025 2023-06-23 22:39:59,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964998.0, ans=0.1 2023-06-23 22:40:57,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=965178.0, ans=0.2 2023-06-23 22:41:08,733 INFO [train.py:996] (1/4) Epoch 6, batch 8400, loss[loss=0.165, simple_loss=0.2274, pruned_loss=0.05133, over 21865.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3012, pruned_loss=0.07026, over 4271398.57 frames. ], batch size: 107, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:41:23,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-23 22:41:38,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=965298.0, ans=0.0 2023-06-23 22:42:19,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=965418.0, ans=0.125 2023-06-23 22:42:22,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=965418.0, ans=0.125 2023-06-23 22:42:27,773 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 2.294e+02 2.573e+02 3.024e+02 4.553e+02, threshold=5.145e+02, percent-clipped=0.0 2023-06-23 22:42:55,767 INFO [train.py:996] (1/4) Epoch 6, batch 8450, loss[loss=0.2148, simple_loss=0.2774, pruned_loss=0.07606, over 21324.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2993, pruned_loss=0.06965, over 4276088.68 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:43:34,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-23 22:43:58,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. 
limit=6.0 2023-06-23 22:44:01,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=965658.0, ans=0.0 2023-06-23 22:44:04,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=965718.0, ans=0.04949747468305833 2023-06-23 22:44:18,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=965718.0, ans=0.125 2023-06-23 22:44:44,909 INFO [train.py:996] (1/4) Epoch 6, batch 8500, loss[loss=0.2038, simple_loss=0.2712, pruned_loss=0.06816, over 21705.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2968, pruned_loss=0.07091, over 4269151.58 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:44:45,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=965838.0, ans=0.0 2023-06-23 22:44:51,033 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:45:12,057 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:45:13,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=965898.0, ans=0.1 2023-06-23 22:45:21,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=965898.0, ans=0.125 2023-06-23 22:46:02,670 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.86 vs. limit=15.0 2023-06-23 22:46:13,504 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.833e+02 3.387e+02 4.039e+02 6.147e+02, threshold=6.774e+02, percent-clipped=7.0 2023-06-23 22:46:36,913 INFO [train.py:996] (1/4) Epoch 6, batch 8550, loss[loss=0.2158, simple_loss=0.3093, pruned_loss=0.06117, over 21789.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3007, pruned_loss=0.07358, over 4259564.46 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:46:44,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=966138.0, ans=0.04949747468305833 2023-06-23 22:47:15,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=966198.0, ans=0.2 2023-06-23 22:47:38,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-23 22:47:58,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. 
limit=6.0 2023-06-23 22:48:12,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=966378.0, ans=0.0 2023-06-23 22:48:17,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=966378.0, ans=0.0 2023-06-23 22:48:21,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=966378.0, ans=0.1 2023-06-23 22:48:34,748 INFO [train.py:996] (1/4) Epoch 6, batch 8600, loss[loss=0.2637, simple_loss=0.3364, pruned_loss=0.09552, over 21585.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3062, pruned_loss=0.07522, over 4263346.60 frames. ], batch size: 389, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:49:28,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=966558.0, ans=0.07 2023-06-23 22:49:47,835 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-23 22:49:57,419 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.875e+02 3.260e+02 4.247e+02 6.190e+02, threshold=6.520e+02, percent-clipped=0.0 2023-06-23 22:49:59,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=966618.0, ans=0.0 2023-06-23 22:50:04,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=966678.0, ans=0.125 2023-06-23 22:50:20,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=966678.0, ans=0.0 2023-06-23 22:50:31,076 INFO [train.py:996] (1/4) Epoch 6, batch 8650, loss[loss=0.1818, simple_loss=0.2839, pruned_loss=0.0398, over 21846.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3121, pruned_loss=0.07798, over 4265174.59 frames. ], batch size: 316, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:50:34,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966738.0, ans=0.1 2023-06-23 22:50:43,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=966738.0, ans=0.0 2023-06-23 22:52:13,207 INFO [train.py:996] (1/4) Epoch 6, batch 8700, loss[loss=0.208, simple_loss=0.2745, pruned_loss=0.07074, over 21619.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3036, pruned_loss=0.0742, over 4261135.57 frames. ], batch size: 332, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:53:04,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-23 22:53:33,446 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.283e+02 2.590e+02 3.172e+02 4.476e+02, threshold=5.179e+02, percent-clipped=0.0 2023-06-23 22:54:08,945 INFO [train.py:996] (1/4) Epoch 6, batch 8750, loss[loss=0.2492, simple_loss=0.3143, pruned_loss=0.09204, over 21889.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3004, pruned_loss=0.07479, over 4270653.93 frames. 
], batch size: 118, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:54:14,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967338.0, ans=0.1 2023-06-23 22:54:20,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=967338.0, ans=0.125 2023-06-23 22:54:24,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=967338.0, ans=0.125 2023-06-23 22:54:39,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=967398.0, ans=0.125 2023-06-23 22:55:49,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=967578.0, ans=0.125 2023-06-23 22:56:02,187 INFO [train.py:996] (1/4) Epoch 6, batch 8800, loss[loss=0.2592, simple_loss=0.3366, pruned_loss=0.09087, over 21512.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.309, pruned_loss=0.07774, over 4267320.79 frames. ], batch size: 211, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:56:18,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=967638.0, ans=0.125 2023-06-23 22:56:45,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=967698.0, ans=0.0 2023-06-23 22:57:28,539 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.723e+02 3.088e+02 3.591e+02 5.183e+02, threshold=6.177e+02, percent-clipped=1.0 2023-06-23 22:57:56,315 INFO [train.py:996] (1/4) Epoch 6, batch 8850, loss[loss=0.2107, simple_loss=0.2959, pruned_loss=0.06273, over 21700.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.316, pruned_loss=0.07962, over 4274523.68 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:58:12,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=967998.0, ans=10.0 2023-06-23 22:58:18,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=967998.0, ans=0.0 2023-06-23 22:58:27,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-23 22:59:29,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=968178.0, ans=0.0 2023-06-23 22:59:46,016 INFO [train.py:996] (1/4) Epoch 6, batch 8900, loss[loss=0.2062, simple_loss=0.2706, pruned_loss=0.07093, over 21260.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3104, pruned_loss=0.07821, over 4271656.75 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 23:00:22,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=12.0 2023-06-23 23:01:18,315 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.656e+02 3.141e+02 3.730e+02 7.900e+02, threshold=6.282e+02, percent-clipped=6.0 2023-06-23 23:01:22,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=968478.0, ans=0.125 2023-06-23 23:01:39,329 INFO [train.py:996] (1/4) Epoch 6, batch 8950, loss[loss=0.2399, simple_loss=0.3126, pruned_loss=0.08356, over 21659.00 frames. ], tot_loss[loss=0.232, simple_loss=0.309, pruned_loss=0.07754, over 4273493.68 frames. ], batch size: 298, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:03:03,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=968718.0, ans=0.2 2023-06-23 23:03:21,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=968778.0, ans=0.125 2023-06-23 23:03:29,117 INFO [train.py:996] (1/4) Epoch 6, batch 9000, loss[loss=0.2742, simple_loss=0.3803, pruned_loss=0.08401, over 20743.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3053, pruned_loss=0.07809, over 4268906.34 frames. ], batch size: 607, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:03:29,117 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 23:03:48,703 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2652, simple_loss=0.3551, pruned_loss=0.08764, over 1796401.00 frames. 2023-06-23 23:03:48,704 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23385MB 2023-06-23 23:04:16,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=968898.0, ans=0.125 2023-06-23 23:05:12,830 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 2.551e+02 3.018e+02 3.495e+02 6.048e+02, threshold=6.037e+02, percent-clipped=0.0 2023-06-23 23:05:15,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=969078.0, ans=0.07 2023-06-23 23:05:45,394 INFO [train.py:996] (1/4) Epoch 6, batch 9050, loss[loss=0.1357, simple_loss=0.1942, pruned_loss=0.0386, over 16345.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2994, pruned_loss=0.07405, over 4266186.18 frames. ], batch size: 63, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:06:00,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=969138.0, ans=0.2 2023-06-23 23:07:07,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=969318.0, ans=0.0 2023-06-23 23:07:11,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=969318.0, ans=0.0 2023-06-23 23:07:30,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=969378.0, ans=0.125 2023-06-23 23:07:38,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=969438.0, ans=0.2 2023-06-23 23:07:39,273 INFO [train.py:996] (1/4) Epoch 6, batch 9100, loss[loss=0.2399, simple_loss=0.3293, pruned_loss=0.0753, over 21581.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3067, pruned_loss=0.07673, over 4267396.57 frames. 
], batch size: 389, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:07:47,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=15.0 2023-06-23 23:08:19,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-23 23:09:02,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=969618.0, ans=0.0 2023-06-23 23:09:04,185 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.470e+02 2.760e+02 3.335e+02 5.659e+02, threshold=5.519e+02, percent-clipped=0.0 2023-06-23 23:09:30,883 INFO [train.py:996] (1/4) Epoch 6, batch 9150, loss[loss=0.2165, simple_loss=0.3077, pruned_loss=0.06269, over 21796.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3087, pruned_loss=0.07371, over 4269731.60 frames. ], batch size: 298, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:09:36,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=969738.0, ans=0.2 2023-06-23 23:10:11,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=969798.0, ans=0.0 2023-06-23 23:10:23,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=969858.0, ans=0.125 2023-06-23 23:10:40,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-23 23:10:46,759 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:11:22,069 INFO [train.py:996] (1/4) Epoch 6, batch 9200, loss[loss=0.2689, simple_loss=0.3488, pruned_loss=0.09447, over 21793.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3093, pruned_loss=0.07249, over 4266713.08 frames. ], batch size: 124, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:11:24,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=970038.0, ans=0.125 2023-06-23 23:12:00,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=970098.0, ans=0.04949747468305833 2023-06-23 23:12:23,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=970158.0, ans=0.125 2023-06-23 23:12:51,026 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.565e+02 2.927e+02 3.982e+02 7.343e+02, threshold=5.853e+02, percent-clipped=8.0 2023-06-23 23:13:17,987 INFO [train.py:996] (1/4) Epoch 6, batch 9250, loss[loss=0.2639, simple_loss=0.3395, pruned_loss=0.09417, over 21796.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3128, pruned_loss=0.07524, over 4260065.02 frames. 
], batch size: 124, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:13:26,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=970338.0, ans=0.0 2023-06-23 23:13:32,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=970338.0, ans=10.0 2023-06-23 23:13:32,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=970338.0, ans=0.125 2023-06-23 23:13:59,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=970458.0, ans=0.2 2023-06-23 23:14:25,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=970458.0, ans=0.95 2023-06-23 23:14:28,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-23 23:14:41,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=970518.0, ans=0.125 2023-06-23 23:14:42,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=970518.0, ans=0.025 2023-06-23 23:15:03,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=970578.0, ans=0.125 2023-06-23 23:15:15,134 INFO [train.py:996] (1/4) Epoch 6, batch 9300, loss[loss=0.31, simple_loss=0.3847, pruned_loss=0.1176, over 21388.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3072, pruned_loss=0.07522, over 4247804.21 frames. ], batch size: 507, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:15:15,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=970638.0, ans=0.5 2023-06-23 23:16:19,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=970818.0, ans=0.0 2023-06-23 23:16:33,208 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.705e+02 3.300e+02 3.579e+02 5.908e+02, threshold=6.601e+02, percent-clipped=1.0 2023-06-23 23:17:01,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=970878.0, ans=0.125 2023-06-23 23:17:03,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-23 23:17:06,338 INFO [train.py:996] (1/4) Epoch 6, batch 9350, loss[loss=0.2628, simple_loss=0.3437, pruned_loss=0.09092, over 21611.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3126, pruned_loss=0.07689, over 4245774.55 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:17:32,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-06-23 23:18:18,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.37 vs. 
limit=22.5 2023-06-23 23:18:25,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=971118.0, ans=10.0 2023-06-23 23:18:43,789 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:18:57,429 INFO [train.py:996] (1/4) Epoch 6, batch 9400, loss[loss=0.2241, simple_loss=0.295, pruned_loss=0.07664, over 21739.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3154, pruned_loss=0.07703, over 4252276.92 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:20:14,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-23 23:20:25,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.477e+02 2.813e+02 3.524e+02 8.030e+02, threshold=5.626e+02, percent-clipped=3.0 2023-06-23 23:20:38,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-23 23:20:44,755 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:20:44,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=971538.0, ans=0.2 2023-06-23 23:20:46,097 INFO [train.py:996] (1/4) Epoch 6, batch 9450, loss[loss=0.191, simple_loss=0.2619, pruned_loss=0.06003, over 21800.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3063, pruned_loss=0.0755, over 4248522.44 frames. ], batch size: 118, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:20:55,902 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-23 23:21:24,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=971598.0, ans=0.0 2023-06-23 23:21:51,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=971718.0, ans=0.125 2023-06-23 23:22:06,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=12.0 2023-06-23 23:22:10,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=971778.0, ans=0.0 2023-06-23 23:22:29,378 INFO [train.py:996] (1/4) Epoch 6, batch 9500, loss[loss=0.179, simple_loss=0.2634, pruned_loss=0.04724, over 21581.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2995, pruned_loss=0.07379, over 4244454.46 frames. 
], batch size: 263, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:23:02,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=971898.0, ans=0.125 2023-06-23 23:23:14,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=971898.0, ans=0.1 2023-06-23 23:23:55,315 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.481e+02 2.768e+02 3.385e+02 5.932e+02, threshold=5.537e+02, percent-clipped=1.0 2023-06-23 23:24:20,130 INFO [train.py:996] (1/4) Epoch 6, batch 9550, loss[loss=0.2382, simple_loss=0.3339, pruned_loss=0.07119, over 21765.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3037, pruned_loss=0.07597, over 4254548.74 frames. ], batch size: 247, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:24:51,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=972198.0, ans=0.2 2023-06-23 23:25:43,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=972378.0, ans=0.0 2023-06-23 23:25:59,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=972378.0, ans=0.0 2023-06-23 23:26:02,057 INFO [train.py:996] (1/4) Epoch 6, batch 9600, loss[loss=0.236, simple_loss=0.2976, pruned_loss=0.08719, over 21287.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3056, pruned_loss=0.07775, over 4264871.82 frames. ], batch size: 143, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:26:37,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=972498.0, ans=0.125 2023-06-23 23:27:15,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=972618.0, ans=0.0 2023-06-23 23:27:19,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=972618.0, ans=0.05 2023-06-23 23:27:26,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.00 vs. limit=10.0 2023-06-23 23:27:32,658 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.542e+02 2.834e+02 3.285e+02 4.885e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-23 23:27:39,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=972678.0, ans=0.125 2023-06-23 23:28:01,869 INFO [train.py:996] (1/4) Epoch 6, batch 9650, loss[loss=0.277, simple_loss=0.3426, pruned_loss=0.1057, over 21453.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3065, pruned_loss=0.07762, over 4272662.46 frames. ], batch size: 471, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:28:09,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=972738.0, ans=0.0 2023-06-23 23:28:28,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=972798.0, ans=0.125 2023-06-23 23:29:50,929 INFO [train.py:996] (1/4) Epoch 6, batch 9700, loss[loss=0.2477, simple_loss=0.3236, pruned_loss=0.08591, over 21563.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3117, pruned_loss=0.07906, over 4274574.63 frames. 
], batch size: 471, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:30:45,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=973158.0, ans=0.125 2023-06-23 23:30:45,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=973158.0, ans=0.0 2023-06-23 23:30:52,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=973218.0, ans=0.125 2023-06-23 23:31:02,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=973218.0, ans=0.0 2023-06-23 23:31:10,723 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.422e+02 2.744e+02 3.326e+02 5.586e+02, threshold=5.488e+02, percent-clipped=0.0 2023-06-23 23:31:11,705 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=22.5 2023-06-23 23:31:24,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=973278.0, ans=0.05 2023-06-23 23:31:30,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.10 vs. limit=22.5 2023-06-23 23:31:38,372 INFO [train.py:996] (1/4) Epoch 6, batch 9750, loss[loss=0.2351, simple_loss=0.3346, pruned_loss=0.06782, over 20817.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3058, pruned_loss=0.07793, over 4274706.19 frames. ], batch size: 607, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:31:59,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=973398.0, ans=0.2 2023-06-23 23:33:19,469 INFO [train.py:996] (1/4) Epoch 6, batch 9800, loss[loss=0.2069, simple_loss=0.2885, pruned_loss=0.06265, over 16704.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3056, pruned_loss=0.07778, over 4253345.54 frames. ], batch size: 65, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:33:21,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=973638.0, ans=0.125 2023-06-23 23:33:59,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973758.0, ans=0.1 2023-06-23 23:34:09,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973758.0, ans=0.1 2023-06-23 23:34:44,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=973878.0, ans=0.0 2023-06-23 23:34:45,284 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.591e+02 2.983e+02 3.638e+02 9.651e+02, threshold=5.966e+02, percent-clipped=4.0 2023-06-23 23:34:48,230 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-23 23:35:07,685 INFO [train.py:996] (1/4) Epoch 6, batch 9850, loss[loss=0.2301, simple_loss=0.2786, pruned_loss=0.09083, over 21394.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3012, pruned_loss=0.07754, over 4259984.88 frames. 
], batch size: 473, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:35:24,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973938.0, ans=0.1 2023-06-23 23:35:50,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=974058.0, ans=0.125 2023-06-23 23:36:46,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=974178.0, ans=0.125 2023-06-23 23:36:57,014 INFO [train.py:996] (1/4) Epoch 6, batch 9900, loss[loss=0.231, simple_loss=0.3024, pruned_loss=0.07978, over 21747.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2975, pruned_loss=0.07675, over 4235356.57 frames. ], batch size: 247, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:36:59,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=974238.0, ans=0.2 2023-06-23 23:37:04,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=974238.0, ans=0.125 2023-06-23 23:38:23,768 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.567e+02 2.955e+02 3.451e+02 4.751e+02, threshold=5.911e+02, percent-clipped=0.0 2023-06-23 23:38:45,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=974538.0, ans=0.1 2023-06-23 23:38:46,962 INFO [train.py:996] (1/4) Epoch 6, batch 9950, loss[loss=0.2191, simple_loss=0.2907, pruned_loss=0.07368, over 21835.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2982, pruned_loss=0.07909, over 4243259.05 frames. ], batch size: 98, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:39:25,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-23 23:39:40,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=974658.0, ans=0.125 2023-06-23 23:39:45,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=974658.0, ans=0.125 2023-06-23 23:39:50,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=974658.0, ans=0.0 2023-06-23 23:40:25,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=974778.0, ans=0.2 2023-06-23 23:40:43,771 INFO [train.py:996] (1/4) Epoch 6, batch 10000, loss[loss=0.2426, simple_loss=0.3137, pruned_loss=0.08575, over 21965.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2928, pruned_loss=0.07767, over 4255142.18 frames. ], batch size: 373, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:41:01,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=15.0 2023-06-23 23:41:33,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=15.0 2023-06-23 23:41:58,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.18 vs. 
limit=15.0 2023-06-23 23:42:10,927 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.477e+02 2.945e+02 3.555e+02 6.332e+02, threshold=5.891e+02, percent-clipped=1.0 2023-06-23 23:42:30,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=975078.0, ans=0.125 2023-06-23 23:42:34,450 INFO [train.py:996] (1/4) Epoch 6, batch 10050, loss[loss=0.1915, simple_loss=0.2767, pruned_loss=0.05315, over 21870.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2951, pruned_loss=0.07732, over 4264578.49 frames. ], batch size: 372, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:43:00,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=975198.0, ans=0.0 2023-06-23 23:43:25,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=975258.0, ans=0.0 2023-06-23 23:43:29,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.78 vs. limit=6.0 2023-06-23 23:43:32,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=975258.0, ans=0.1 2023-06-23 23:43:41,615 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:44:25,429 INFO [train.py:996] (1/4) Epoch 6, batch 10100, loss[loss=0.1753, simple_loss=0.2382, pruned_loss=0.05619, over 21353.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2928, pruned_loss=0.0755, over 4266639.10 frames. ], batch size: 159, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:44:43,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=975438.0, ans=0.0 2023-06-23 23:45:09,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=975498.0, ans=0.1 2023-06-23 23:45:57,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=15.0 2023-06-23 23:45:59,920 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.533e+02 2.969e+02 3.783e+02 6.881e+02, threshold=5.937e+02, percent-clipped=1.0 2023-06-23 23:46:04,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=975678.0, ans=0.0 2023-06-23 23:46:21,395 INFO [train.py:996] (1/4) Epoch 6, batch 10150, loss[loss=0.2172, simple_loss=0.2895, pruned_loss=0.07238, over 21658.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2979, pruned_loss=0.07828, over 4267463.26 frames. ], batch size: 247, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:46:53,214 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.14 vs. 
limit=15.0 2023-06-23 23:46:56,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=975798.0, ans=0.0 2023-06-23 23:47:44,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=975918.0, ans=0.5 2023-06-23 23:48:09,605 INFO [train.py:996] (1/4) Epoch 6, batch 10200, loss[loss=0.234, simple_loss=0.3016, pruned_loss=0.08319, over 21880.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.299, pruned_loss=0.07705, over 4267486.43 frames. ], batch size: 107, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:48:17,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=976038.0, ans=0.2 2023-06-23 23:48:47,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=976098.0, ans=0.2 2023-06-23 23:48:48,226 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-23 23:49:38,134 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.173e+02 2.583e+02 3.025e+02 4.269e+02, threshold=5.166e+02, percent-clipped=0.0 2023-06-23 23:49:38,772 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:49:59,514 INFO [train.py:996] (1/4) Epoch 6, batch 10250, loss[loss=0.157, simple_loss=0.2435, pruned_loss=0.03528, over 21542.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2962, pruned_loss=0.07211, over 4272267.95 frames. ], batch size: 195, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:50:19,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=976338.0, ans=0.04949747468305833 2023-06-23 23:51:58,302 INFO [train.py:996] (1/4) Epoch 6, batch 10300, loss[loss=0.2352, simple_loss=0.3316, pruned_loss=0.06938, over 21831.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2984, pruned_loss=0.07308, over 4265659.11 frames. ], batch size: 282, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:52:09,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=976638.0, ans=0.1 2023-06-23 23:52:11,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=976638.0, ans=0.1 2023-06-23 23:53:28,846 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 2.521e+02 2.843e+02 3.478e+02 5.751e+02, threshold=5.686e+02, percent-clipped=3.0 2023-06-23 23:53:52,272 INFO [train.py:996] (1/4) Epoch 6, batch 10350, loss[loss=0.1478, simple_loss=0.1944, pruned_loss=0.05063, over 21679.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2997, pruned_loss=0.07276, over 4259268.34 frames. ], batch size: 112, lr: 5.17e-03, grad_scale: 16.0 2023-06-23 23:54:25,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=976998.0, ans=0.0 2023-06-23 23:55:41,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.93 vs. 
limit=10.0 2023-06-23 23:55:43,944 INFO [train.py:996] (1/4) Epoch 6, batch 10400, loss[loss=0.158, simple_loss=0.2048, pruned_loss=0.05561, over 21727.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2954, pruned_loss=0.07248, over 4262588.50 frames. ], batch size: 124, lr: 5.17e-03, grad_scale: 32.0 2023-06-23 23:56:04,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=977238.0, ans=0.0 2023-06-23 23:56:11,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=977298.0, ans=0.125 2023-06-23 23:56:30,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-23 23:57:20,289 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.786e+02 3.233e+02 3.708e+02 5.830e+02, threshold=6.465e+02, percent-clipped=3.0 2023-06-23 23:57:41,113 INFO [train.py:996] (1/4) Epoch 6, batch 10450, loss[loss=0.2397, simple_loss=0.3138, pruned_loss=0.08276, over 21429.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2999, pruned_loss=0.07498, over 4263848.06 frames. ], batch size: 211, lr: 5.17e-03, grad_scale: 32.0 2023-06-23 23:57:44,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.10 vs. limit=5.0 2023-06-23 23:57:49,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=977538.0, ans=0.0 2023-06-23 23:57:52,354 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:58:03,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=977598.0, ans=0.0 2023-06-23 23:58:05,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=977598.0, ans=0.125 2023-06-23 23:58:48,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=977658.0, ans=0.125 2023-06-23 23:58:50,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=977718.0, ans=0.125 2023-06-23 23:58:55,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=977718.0, ans=15.0 2023-06-23 23:59:00,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=977718.0, ans=0.0 2023-06-23 23:59:02,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=977718.0, ans=0.125 2023-06-23 23:59:30,738 INFO [train.py:996] (1/4) Epoch 6, batch 10500, loss[loss=0.2152, simple_loss=0.2717, pruned_loss=0.07932, over 21242.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2984, pruned_loss=0.07344, over 4272216.42 frames. 
], batch size: 159, lr: 5.17e-03, grad_scale: 16.0 2023-06-23 23:59:33,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=977838.0, ans=0.1 2023-06-23 23:59:54,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=977898.0, ans=0.2 2023-06-24 00:00:59,806 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.398e+02 2.689e+02 3.123e+02 4.066e+02, threshold=5.379e+02, percent-clipped=0.0 2023-06-24 00:01:19,037 INFO [train.py:996] (1/4) Epoch 6, batch 10550, loss[loss=0.1953, simple_loss=0.2654, pruned_loss=0.06254, over 21795.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2927, pruned_loss=0.07249, over 4259911.59 frames. ], batch size: 317, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:01:59,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=978198.0, ans=0.125 2023-06-24 00:02:04,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=978258.0, ans=0.125 2023-06-24 00:03:01,409 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.21 vs. limit=12.0 2023-06-24 00:03:09,247 INFO [train.py:996] (1/4) Epoch 6, batch 10600, loss[loss=0.237, simple_loss=0.3266, pruned_loss=0.07375, over 19707.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.289, pruned_loss=0.07158, over 4256224.84 frames. ], batch size: 702, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:03:23,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-24 00:04:42,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=978678.0, ans=0.125 2023-06-24 00:04:47,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.546e+02 2.981e+02 3.597e+02 7.487e+02, threshold=5.961e+02, percent-clipped=2.0 2023-06-24 00:05:12,643 INFO [train.py:996] (1/4) Epoch 6, batch 10650, loss[loss=0.175, simple_loss=0.2613, pruned_loss=0.04439, over 21752.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2922, pruned_loss=0.07032, over 4266164.32 frames. 
], batch size: 351, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:05:35,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=978798.0, ans=0.1 2023-06-24 00:05:57,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=978858.0, ans=0.0 2023-06-24 00:06:07,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=978858.0, ans=0.125 2023-06-24 00:06:23,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=978918.0, ans=0.125 2023-06-24 00:06:25,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=978918.0, ans=0.125 2023-06-24 00:06:27,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=978918.0, ans=0.125 2023-06-24 00:07:03,055 INFO [train.py:996] (1/4) Epoch 6, batch 10700, loss[loss=0.1809, simple_loss=0.2488, pruned_loss=0.05656, over 21362.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2908, pruned_loss=0.07002, over 4262490.53 frames. ], batch size: 211, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:07:46,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=979158.0, ans=0.125 2023-06-24 00:08:12,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=979218.0, ans=0.2 2023-06-24 00:08:13,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=979218.0, ans=0.2 2023-06-24 00:08:17,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=979218.0, ans=0.2 2023-06-24 00:08:33,230 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-24 00:08:35,825 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.562e+02 2.930e+02 3.343e+02 5.418e+02, threshold=5.860e+02, percent-clipped=0.0 2023-06-24 00:08:55,529 INFO [train.py:996] (1/4) Epoch 6, batch 10750, loss[loss=0.2355, simple_loss=0.3318, pruned_loss=0.06962, over 21757.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3013, pruned_loss=0.07467, over 4268574.66 frames. ], batch size: 298, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:09:57,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979458.0, ans=0.0 2023-06-24 00:10:08,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=979518.0, ans=0.125 2023-06-24 00:10:47,828 INFO [train.py:996] (1/4) Epoch 6, batch 10800, loss[loss=0.268, simple_loss=0.349, pruned_loss=0.09356, over 21847.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3071, pruned_loss=0.07619, over 4275863.74 frames. 
], batch size: 124, lr: 5.17e-03, grad_scale: 32.0 2023-06-24 00:11:31,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=979698.0, ans=0.0 2023-06-24 00:11:54,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=979758.0, ans=0.125 2023-06-24 00:12:01,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=979818.0, ans=0.125 2023-06-24 00:12:02,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=979818.0, ans=0.125 2023-06-24 00:12:24,835 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.761e+02 3.249e+02 3.882e+02 5.958e+02, threshold=6.498e+02, percent-clipped=1.0 2023-06-24 00:12:44,080 INFO [train.py:996] (1/4) Epoch 6, batch 10850, loss[loss=0.1914, simple_loss=0.2662, pruned_loss=0.05835, over 21658.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3066, pruned_loss=0.07576, over 4272364.98 frames. ], batch size: 247, lr: 5.17e-03, grad_scale: 32.0 2023-06-24 00:12:46,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=979938.0, ans=0.0 2023-06-24 00:12:49,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=979938.0, ans=0.125 2023-06-24 00:13:17,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-24 00:14:28,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980178.0, ans=0.1 2023-06-24 00:14:31,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-24 00:14:32,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=980178.0, ans=0.125 2023-06-24 00:14:35,110 INFO [train.py:996] (1/4) Epoch 6, batch 10900, loss[loss=0.1987, simple_loss=0.281, pruned_loss=0.05818, over 21283.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2988, pruned_loss=0.07342, over 4269104.87 frames. 
], batch size: 176, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:14:35,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=980238.0, ans=0.125 2023-06-24 00:15:09,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980298.0, ans=0.1 2023-06-24 00:15:47,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=980418.0, ans=0.2 2023-06-24 00:16:02,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980478.0, ans=0.1 2023-06-24 00:16:05,807 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.411e+02 2.776e+02 2.994e+02 5.292e+02, threshold=5.553e+02, percent-clipped=0.0 2023-06-24 00:16:07,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=980478.0, ans=0.0 2023-06-24 00:16:22,928 INFO [train.py:996] (1/4) Epoch 6, batch 10950, loss[loss=0.1961, simple_loss=0.2696, pruned_loss=0.06127, over 21581.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2947, pruned_loss=0.07184, over 4261605.45 frames. ], batch size: 263, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:17:07,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=980658.0, ans=0.125 2023-06-24 00:17:18,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=980658.0, ans=0.125 2023-06-24 00:18:13,150 INFO [train.py:996] (1/4) Epoch 6, batch 11000, loss[loss=0.2167, simple_loss=0.289, pruned_loss=0.07223, over 21638.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2941, pruned_loss=0.07284, over 4270260.92 frames. ], batch size: 263, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:18:41,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=980898.0, ans=0.125 2023-06-24 00:18:58,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=980958.0, ans=0.035 2023-06-24 00:19:24,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=981018.0, ans=0.0 2023-06-24 00:19:45,482 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.423e+02 2.754e+02 3.301e+02 6.173e+02, threshold=5.508e+02, percent-clipped=2.0 2023-06-24 00:19:49,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=981078.0, ans=0.125 2023-06-24 00:19:58,268 INFO [train.py:996] (1/4) Epoch 6, batch 11050, loss[loss=0.1946, simple_loss=0.2635, pruned_loss=0.06285, over 21626.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2935, pruned_loss=0.07416, over 4268589.50 frames. 
], batch size: 298, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:20:08,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=981138.0, ans=0.0 2023-06-24 00:20:34,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=981198.0, ans=0.125 2023-06-24 00:20:44,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=981198.0, ans=0.0 2023-06-24 00:21:40,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=981378.0, ans=0.125 2023-06-24 00:21:45,968 INFO [train.py:996] (1/4) Epoch 6, batch 11100, loss[loss=0.2108, simple_loss=0.2758, pruned_loss=0.07287, over 21208.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2925, pruned_loss=0.07438, over 4267977.23 frames. ], batch size: 144, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:21:53,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=981438.0, ans=0.125 2023-06-24 00:22:18,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=981498.0, ans=0.0 2023-06-24 00:22:45,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=981558.0, ans=0.025 2023-06-24 00:23:23,907 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.487e+02 2.801e+02 3.244e+02 5.802e+02, threshold=5.603e+02, percent-clipped=1.0 2023-06-24 00:23:36,115 INFO [train.py:996] (1/4) Epoch 6, batch 11150, loss[loss=0.2748, simple_loss=0.3504, pruned_loss=0.09964, over 21404.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2905, pruned_loss=0.07424, over 4273840.74 frames. ], batch size: 507, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:24:17,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-24 00:24:29,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=981858.0, ans=0.0 2023-06-24 00:24:31,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-24 00:24:33,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-24 00:24:38,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=981858.0, ans=0.0 2023-06-24 00:25:27,176 INFO [train.py:996] (1/4) Epoch 6, batch 11200, loss[loss=0.2007, simple_loss=0.2677, pruned_loss=0.06692, over 21832.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2884, pruned_loss=0.07367, over 4264572.29 frames. ], batch size: 372, lr: 5.16e-03, grad_scale: 32.0 2023-06-24 00:25:28,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. 
limit=15.0 2023-06-24 00:26:05,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=982098.0, ans=0.1 2023-06-24 00:26:14,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=982098.0, ans=0.0 2023-06-24 00:26:44,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=982218.0, ans=0.0 2023-06-24 00:27:03,168 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.434e+02 2.676e+02 2.972e+02 5.122e+02, threshold=5.353e+02, percent-clipped=0.0 2023-06-24 00:27:10,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=982278.0, ans=0.0 2023-06-24 00:27:15,149 INFO [train.py:996] (1/4) Epoch 6, batch 11250, loss[loss=0.2237, simple_loss=0.2982, pruned_loss=0.07455, over 21858.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2881, pruned_loss=0.07392, over 4259476.44 frames. ], batch size: 351, lr: 5.16e-03, grad_scale: 32.0 2023-06-24 00:27:43,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=982398.0, ans=0.125 2023-06-24 00:28:22,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=982518.0, ans=0.0 2023-06-24 00:29:03,621 INFO [train.py:996] (1/4) Epoch 6, batch 11300, loss[loss=0.2123, simple_loss=0.2856, pruned_loss=0.06953, over 21806.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2889, pruned_loss=0.07387, over 4257850.37 frames. ], batch size: 282, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:29:17,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-24 00:30:43,984 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.481e+02 2.716e+02 3.096e+02 3.979e+02, threshold=5.433e+02, percent-clipped=0.0 2023-06-24 00:31:00,999 INFO [train.py:996] (1/4) Epoch 6, batch 11350, loss[loss=0.2371, simple_loss=0.305, pruned_loss=0.08465, over 21258.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2912, pruned_loss=0.07325, over 4267511.63 frames. ], batch size: 159, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:31:35,221 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-24 00:31:42,772 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.53 vs. limit=15.0 2023-06-24 00:31:43,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=982998.0, ans=0.125 2023-06-24 00:32:59,516 INFO [train.py:996] (1/4) Epoch 6, batch 11400, loss[loss=0.2299, simple_loss=0.2955, pruned_loss=0.08213, over 19818.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2958, pruned_loss=0.07566, over 4265708.91 frames. 
], batch size: 702, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:33:16,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=983238.0, ans=0.125 2023-06-24 00:33:18,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=983238.0, ans=0.125 2023-06-24 00:33:19,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=983238.0, ans=0.125 2023-06-24 00:33:26,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=983298.0, ans=0.125 2023-06-24 00:34:00,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=983358.0, ans=0.0 2023-06-24 00:34:10,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=983418.0, ans=0.125 2023-06-24 00:34:38,922 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.559e+02 2.841e+02 3.332e+02 5.224e+02, threshold=5.682e+02, percent-clipped=0.0 2023-06-24 00:34:49,727 INFO [train.py:996] (1/4) Epoch 6, batch 11450, loss[loss=0.2344, simple_loss=0.2886, pruned_loss=0.09007, over 20203.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2964, pruned_loss=0.07428, over 4267743.68 frames. ], batch size: 707, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:35:12,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=6.0 2023-06-24 00:35:13,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=983598.0, ans=0.125 2023-06-24 00:35:51,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=983658.0, ans=0.0 2023-06-24 00:36:35,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=983778.0, ans=0.125 2023-06-24 00:36:46,029 INFO [train.py:996] (1/4) Epoch 6, batch 11500, loss[loss=0.1987, simple_loss=0.2884, pruned_loss=0.05448, over 21470.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3001, pruned_loss=0.07584, over 4269295.06 frames. ], batch size: 211, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:36:47,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=12.0 2023-06-24 00:38:29,587 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.699e+02 3.055e+02 3.965e+02 5.631e+02, threshold=6.111e+02, percent-clipped=0.0 2023-06-24 00:38:41,207 INFO [train.py:996] (1/4) Epoch 6, batch 11550, loss[loss=0.3032, simple_loss=0.4181, pruned_loss=0.09418, over 21209.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3049, pruned_loss=0.07529, over 4273047.55 frames. ], batch size: 548, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:38:43,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=984138.0, ans=0.1 2023-06-24 00:39:15,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.34 vs. 
limit=15.0 2023-06-24 00:39:24,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=984258.0, ans=0.0 2023-06-24 00:39:24,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=984258.0, ans=0.125 2023-06-24 00:40:38,817 INFO [train.py:996] (1/4) Epoch 6, batch 11600, loss[loss=0.2646, simple_loss=0.3474, pruned_loss=0.09087, over 21318.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3191, pruned_loss=0.077, over 4268125.01 frames. ], batch size: 176, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:41:29,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=984558.0, ans=0.1 2023-06-24 00:41:43,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=984618.0, ans=0.125 2023-06-24 00:41:55,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=984618.0, ans=0.2 2023-06-24 00:42:07,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-24 00:42:10,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=984678.0, ans=0.0 2023-06-24 00:42:12,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=984678.0, ans=0.125 2023-06-24 00:42:15,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.879e+02 3.402e+02 4.224e+02 8.565e+02, threshold=6.804e+02, percent-clipped=5.0 2023-06-24 00:42:25,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=984678.0, ans=0.125 2023-06-24 00:42:28,800 INFO [train.py:996] (1/4) Epoch 6, batch 11650, loss[loss=0.2477, simple_loss=0.3343, pruned_loss=0.08052, over 21742.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3256, pruned_loss=0.07768, over 4265253.08 frames. ], batch size: 351, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:42:38,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=984738.0, ans=0.125 2023-06-24 00:42:55,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=984798.0, ans=0.0 2023-06-24 00:43:12,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=984858.0, ans=0.0 2023-06-24 00:43:32,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=984918.0, ans=0.2 2023-06-24 00:44:12,082 INFO [train.py:996] (1/4) Epoch 6, batch 11700, loss[loss=0.2124, simple_loss=0.269, pruned_loss=0.07785, over 21496.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3169, pruned_loss=0.07757, over 4258794.50 frames. 
], batch size: 195, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:44:16,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=985038.0, ans=0.2 2023-06-24 00:45:52,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.525e+02 2.747e+02 3.370e+02 5.066e+02, threshold=5.494e+02, percent-clipped=0.0 2023-06-24 00:46:01,439 INFO [train.py:996] (1/4) Epoch 6, batch 11750, loss[loss=0.2198, simple_loss=0.2928, pruned_loss=0.0734, over 21900.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3078, pruned_loss=0.0769, over 4261514.27 frames. ], batch size: 317, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:46:19,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=985398.0, ans=0.0 2023-06-24 00:46:21,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=985398.0, ans=0.0 2023-06-24 00:47:36,017 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-24 00:47:39,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-24 00:47:49,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=985578.0, ans=0.0 2023-06-24 00:47:52,518 INFO [train.py:996] (1/4) Epoch 6, batch 11800, loss[loss=0.2503, simple_loss=0.3125, pruned_loss=0.09401, over 21323.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3102, pruned_loss=0.07926, over 4272390.50 frames. ], batch size: 176, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:47:53,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=985638.0, ans=0.0 2023-06-24 00:49:34,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=15.0 2023-06-24 00:49:34,926 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.469e+02 2.710e+02 3.084e+02 4.949e+02, threshold=5.420e+02, percent-clipped=0.0 2023-06-24 00:49:43,727 INFO [train.py:996] (1/4) Epoch 6, batch 11850, loss[loss=0.229, simple_loss=0.3102, pruned_loss=0.07386, over 21869.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3116, pruned_loss=0.07803, over 4282027.75 frames. ], batch size: 107, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:49:44,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=985938.0, ans=0.2 2023-06-24 00:50:02,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=985938.0, ans=0.0 2023-06-24 00:50:43,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=986058.0, ans=0.125 2023-06-24 00:51:34,330 INFO [train.py:996] (1/4) Epoch 6, batch 11900, loss[loss=0.2433, simple_loss=0.3305, pruned_loss=0.07801, over 21573.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.312, pruned_loss=0.07604, over 4279965.16 frames. 
], batch size: 441, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:52:14,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=986298.0, ans=0.1 2023-06-24 00:52:21,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=986298.0, ans=0.0 2023-06-24 00:52:25,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=986298.0, ans=0.125 2023-06-24 00:53:16,514 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.327e+02 2.667e+02 3.121e+02 4.121e+02, threshold=5.333e+02, percent-clipped=0.0 2023-06-24 00:53:31,111 INFO [train.py:996] (1/4) Epoch 6, batch 11950, loss[loss=0.1953, simple_loss=0.291, pruned_loss=0.04982, over 21809.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3129, pruned_loss=0.07288, over 4271660.02 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:53:47,372 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-24 00:54:02,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.26 vs. limit=15.0 2023-06-24 00:54:28,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=986658.0, ans=0.125 2023-06-24 00:55:06,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=986778.0, ans=0.125 2023-06-24 00:55:19,961 INFO [train.py:996] (1/4) Epoch 6, batch 12000, loss[loss=0.1994, simple_loss=0.2565, pruned_loss=0.07113, over 21349.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3077, pruned_loss=0.07148, over 4270877.05 frames. ], batch size: 551, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:55:19,962 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 00:55:44,721 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2624, simple_loss=0.3526, pruned_loss=0.08607, over 1796401.00 frames. 
2023-06-24 00:55:44,722 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23385MB 2023-06-24 00:56:02,446 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:56:09,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=986898.0, ans=0.125 2023-06-24 00:56:11,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=986898.0, ans=12.0 2023-06-24 00:56:39,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=987018.0, ans=0.125 2023-06-24 00:57:02,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=987078.0, ans=0.125 2023-06-24 00:57:13,629 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 2.572e+02 3.062e+02 3.583e+02 6.186e+02, threshold=6.124e+02, percent-clipped=1.0 2023-06-24 00:57:21,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=987078.0, ans=0.07 2023-06-24 00:57:27,305 INFO [train.py:996] (1/4) Epoch 6, batch 12050, loss[loss=0.2149, simple_loss=0.2784, pruned_loss=0.07573, over 21409.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3035, pruned_loss=0.07322, over 4276848.15 frames. ], batch size: 177, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:57:35,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=987138.0, ans=0.125 2023-06-24 00:57:35,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=987138.0, ans=0.1 2023-06-24 00:57:56,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=987198.0, ans=0.125 2023-06-24 00:58:50,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=987318.0, ans=0.0 2023-06-24 00:59:24,251 INFO [train.py:996] (1/4) Epoch 6, batch 12100, loss[loss=0.2502, simple_loss=0.3303, pruned_loss=0.08507, over 21875.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3082, pruned_loss=0.07741, over 4279243.81 frames. ], batch size: 316, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:00:07,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=987558.0, ans=0.07 2023-06-24 01:00:07,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. 
limit=12.0 2023-06-24 01:00:32,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=987618.0, ans=0.125 2023-06-24 01:00:45,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=987618.0, ans=0.125 2023-06-24 01:01:02,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=987678.0, ans=0.125 2023-06-24 01:01:06,899 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.682e+02 3.113e+02 3.706e+02 5.999e+02, threshold=6.227e+02, percent-clipped=0.0 2023-06-24 01:01:14,022 INFO [train.py:996] (1/4) Epoch 6, batch 12150, loss[loss=0.2217, simple_loss=0.3159, pruned_loss=0.06371, over 21645.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3109, pruned_loss=0.07712, over 4278308.21 frames. ], batch size: 263, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:01:17,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-24 01:01:25,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=987738.0, ans=0.0 2023-06-24 01:02:04,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=987858.0, ans=0.125 2023-06-24 01:02:13,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=987858.0, ans=0.1 2023-06-24 01:02:47,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0 2023-06-24 01:02:50,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=987978.0, ans=0.125 2023-06-24 01:03:08,581 INFO [train.py:996] (1/4) Epoch 6, batch 12200, loss[loss=0.2053, simple_loss=0.2718, pruned_loss=0.06937, over 21835.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.309, pruned_loss=0.07693, over 4273793.79 frames. ], batch size: 318, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:03:10,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=988038.0, ans=0.125 2023-06-24 01:03:41,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=988158.0, ans=0.1 2023-06-24 01:04:45,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 2.375e+02 2.667e+02 3.386e+02 5.475e+02, threshold=5.334e+02, percent-clipped=0.0 2023-06-24 01:04:57,279 INFO [train.py:996] (1/4) Epoch 6, batch 12250, loss[loss=0.1431, simple_loss=0.2246, pruned_loss=0.03076, over 21396.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2997, pruned_loss=0.07319, over 4267416.03 frames. ], batch size: 131, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:06:36,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=988578.0, ans=0.125 2023-06-24 01:06:40,997 INFO [train.py:996] (1/4) Epoch 6, batch 12300, loss[loss=0.2569, simple_loss=0.3531, pruned_loss=0.08038, over 21541.00 frames. 
], tot_loss[loss=0.213, simple_loss=0.2911, pruned_loss=0.0675, over 4270499.44 frames. ], batch size: 471, lr: 5.14e-03, grad_scale: 8.0 2023-06-24 01:07:36,387 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-24 01:07:57,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=988818.0, ans=0.0 2023-06-24 01:08:04,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=988818.0, ans=0.125 2023-06-24 01:08:24,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=988878.0, ans=0.0 2023-06-24 01:08:25,510 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 2.150e+02 2.660e+02 3.179e+02 5.593e+02, threshold=5.319e+02, percent-clipped=1.0 2023-06-24 01:08:36,031 INFO [train.py:996] (1/4) Epoch 6, batch 12350, loss[loss=0.2466, simple_loss=0.3133, pruned_loss=0.08998, over 21223.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2957, pruned_loss=0.06819, over 4275171.44 frames. ], batch size: 143, lr: 5.14e-03, grad_scale: 8.0 2023-06-24 01:09:34,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=989058.0, ans=0.2 2023-06-24 01:09:50,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=989118.0, ans=0.0 2023-06-24 01:09:53,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.17 vs. limit=22.5 2023-06-24 01:09:59,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=989118.0, ans=0.1 2023-06-24 01:10:24,593 INFO [train.py:996] (1/4) Epoch 6, batch 12400, loss[loss=0.2258, simple_loss=0.2889, pruned_loss=0.08138, over 21823.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2979, pruned_loss=0.07223, over 4282796.50 frames. ], batch size: 298, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:10:26,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=989238.0, ans=0.0 2023-06-24 01:10:37,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=989238.0, ans=0.125 2023-06-24 01:11:51,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=989478.0, ans=0.0 2023-06-24 01:12:08,940 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.631e+02 2.949e+02 3.533e+02 4.721e+02, threshold=5.899e+02, percent-clipped=0.0 2023-06-24 01:12:14,237 INFO [train.py:996] (1/4) Epoch 6, batch 12450, loss[loss=0.2915, simple_loss=0.3543, pruned_loss=0.1144, over 21459.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3019, pruned_loss=0.07539, over 4287672.53 frames. ], batch size: 471, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:12:41,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. 
limit=22.5 2023-06-24 01:13:13,175 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:13:48,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=989778.0, ans=0.0 2023-06-24 01:13:50,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=989778.0, ans=0.2 2023-06-24 01:14:08,580 INFO [train.py:996] (1/4) Epoch 6, batch 12500, loss[loss=0.2739, simple_loss=0.359, pruned_loss=0.09439, over 21431.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3126, pruned_loss=0.07676, over 4284091.69 frames. ], batch size: 211, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:14:31,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=989838.0, ans=0.0 2023-06-24 01:15:04,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-24 01:15:05,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=989958.0, ans=0.125 2023-06-24 01:15:07,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=989958.0, ans=0.125 2023-06-24 01:15:14,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=989958.0, ans=0.125 2023-06-24 01:16:01,940 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 2.735e+02 3.011e+02 3.446e+02 4.823e+02, threshold=6.021e+02, percent-clipped=0.0 2023-06-24 01:16:07,454 INFO [train.py:996] (1/4) Epoch 6, batch 12550, loss[loss=0.2331, simple_loss=0.3195, pruned_loss=0.07339, over 21794.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3157, pruned_loss=0.07853, over 4286119.87 frames. ], batch size: 282, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:17:12,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=990318.0, ans=0.125 2023-06-24 01:18:03,115 INFO [train.py:996] (1/4) Epoch 6, batch 12600, loss[loss=0.21, simple_loss=0.2885, pruned_loss=0.06572, over 21527.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.316, pruned_loss=0.07713, over 4275164.74 frames. ], batch size: 195, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:18:10,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=990438.0, ans=0.2 2023-06-24 01:18:48,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=22.5 2023-06-24 01:18:56,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=990558.0, ans=0.0 2023-06-24 01:19:46,574 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.342e+02 2.712e+02 3.358e+02 5.513e+02, threshold=5.424e+02, percent-clipped=0.0 2023-06-24 01:19:51,709 INFO [train.py:996] (1/4) Epoch 6, batch 12650, loss[loss=0.2792, simple_loss=0.3236, pruned_loss=0.1175, over 21778.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3086, pruned_loss=0.07473, over 4272513.37 frames. 
], batch size: 508, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:20:45,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=990858.0, ans=0.125 2023-06-24 01:21:40,698 INFO [train.py:996] (1/4) Epoch 6, batch 12700, loss[loss=0.2512, simple_loss=0.3194, pruned_loss=0.09154, over 21489.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3086, pruned_loss=0.07721, over 4278236.73 frames. ], batch size: 194, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:21:50,913 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.90 vs. limit=15.0 2023-06-24 01:22:22,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=991158.0, ans=0.0 2023-06-24 01:22:24,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=991158.0, ans=0.125 2023-06-24 01:22:37,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=991158.0, ans=0.04949747468305833 2023-06-24 01:23:21,028 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:23:25,526 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.607e+02 2.938e+02 3.445e+02 5.217e+02, threshold=5.876e+02, percent-clipped=0.0 2023-06-24 01:23:31,064 INFO [train.py:996] (1/4) Epoch 6, batch 12750, loss[loss=0.2303, simple_loss=0.3046, pruned_loss=0.07801, over 20709.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3101, pruned_loss=0.07808, over 4277472.03 frames. ], batch size: 607, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:23:37,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-24 01:23:53,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=991398.0, ans=0.125 2023-06-24 01:23:59,729 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-24 01:24:16,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=991458.0, ans=0.125 2023-06-24 01:24:48,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=991518.0, ans=0.0 2023-06-24 01:24:50,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.90 vs. limit=6.0 2023-06-24 01:25:02,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=991578.0, ans=0.0 2023-06-24 01:25:05,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=991578.0, ans=0.2 2023-06-24 01:25:19,754 INFO [train.py:996] (1/4) Epoch 6, batch 12800, loss[loss=0.2286, simple_loss=0.305, pruned_loss=0.07611, over 21872.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3086, pruned_loss=0.07765, over 4277962.55 frames. 
], batch size: 371, lr: 5.14e-03, grad_scale: 32.0 2023-06-24 01:25:28,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=991638.0, ans=0.125 2023-06-24 01:25:37,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=991638.0, ans=0.125 2023-06-24 01:25:47,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=991698.0, ans=0.125 2023-06-24 01:26:07,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-24 01:26:38,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=991818.0, ans=0.125 2023-06-24 01:27:05,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=991878.0, ans=0.0 2023-06-24 01:27:06,615 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.498e+02 2.671e+02 3.042e+02 5.514e+02, threshold=5.341e+02, percent-clipped=0.0 2023-06-24 01:27:07,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=991878.0, ans=0.2 2023-06-24 01:27:09,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=991938.0, ans=0.5 2023-06-24 01:27:09,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-24 01:27:10,345 INFO [train.py:996] (1/4) Epoch 6, batch 12850, loss[loss=0.2213, simple_loss=0.3205, pruned_loss=0.0611, over 21653.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.312, pruned_loss=0.07969, over 4271599.24 frames. ], batch size: 414, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:27:12,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=991938.0, ans=10.0 2023-06-24 01:28:26,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=22.5 2023-06-24 01:29:05,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992178.0, ans=0.1 2023-06-24 01:29:08,017 INFO [train.py:996] (1/4) Epoch 6, batch 12900, loss[loss=0.23, simple_loss=0.311, pruned_loss=0.07453, over 21748.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3093, pruned_loss=0.07596, over 4272682.34 frames. ], batch size: 352, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:29:18,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-24 01:29:22,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=992238.0, ans=10.0 2023-06-24 01:29:51,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. 
limit=15.0 2023-06-24 01:30:06,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=992358.0, ans=0.125 2023-06-24 01:30:32,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=992418.0, ans=0.125 2023-06-24 01:30:32,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=992418.0, ans=0.0 2023-06-24 01:30:50,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=992478.0, ans=0.125 2023-06-24 01:30:55,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 2.252e+02 2.502e+02 2.973e+02 5.465e+02, threshold=5.003e+02, percent-clipped=1.0 2023-06-24 01:30:58,567 INFO [train.py:996] (1/4) Epoch 6, batch 12950, loss[loss=0.222, simple_loss=0.3025, pruned_loss=0.07075, over 21868.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3058, pruned_loss=0.07348, over 4275295.41 frames. ], batch size: 372, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:32:27,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=992718.0, ans=0.95 2023-06-24 01:32:47,510 INFO [train.py:996] (1/4) Epoch 6, batch 13000, loss[loss=0.2488, simple_loss=0.3165, pruned_loss=0.09054, over 21448.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3055, pruned_loss=0.0741, over 4281186.24 frames. ], batch size: 507, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:33:24,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=992898.0, ans=0.125 2023-06-24 01:33:31,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=992898.0, ans=0.05 2023-06-24 01:33:32,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=992898.0, ans=0.125 2023-06-24 01:34:23,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=993078.0, ans=0.2 2023-06-24 01:34:25,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=993078.0, ans=0.0 2023-06-24 01:34:33,604 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.511e+02 2.962e+02 3.599e+02 5.386e+02, threshold=5.923e+02, percent-clipped=1.0 2023-06-24 01:34:34,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=993078.0, ans=0.125 2023-06-24 01:34:36,944 INFO [train.py:996] (1/4) Epoch 6, batch 13050, loss[loss=0.1951, simple_loss=0.2737, pruned_loss=0.05827, over 21774.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3006, pruned_loss=0.07161, over 4282399.42 frames. ], batch size: 247, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:35:21,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-24 01:36:21,541 INFO [train.py:996] (1/4) Epoch 6, batch 13100, loss[loss=0.2233, simple_loss=0.3062, pruned_loss=0.07022, over 21719.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3014, pruned_loss=0.07208, over 4292090.86 frames. 
], batch size: 298, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:36:56,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993498.0, ans=0.1 2023-06-24 01:36:58,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=993498.0, ans=0.2 2023-06-24 01:36:58,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=993498.0, ans=0.125 2023-06-24 01:37:33,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=993618.0, ans=0.0 2023-06-24 01:37:51,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993678.0, ans=0.1 2023-06-24 01:37:51,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=993678.0, ans=0.95 2023-06-24 01:38:09,180 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.775e+02 3.249e+02 4.198e+02 6.182e+02, threshold=6.497e+02, percent-clipped=2.0 2023-06-24 01:38:18,937 INFO [train.py:996] (1/4) Epoch 6, batch 13150, loss[loss=0.201, simple_loss=0.2733, pruned_loss=0.06437, over 21626.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.307, pruned_loss=0.07556, over 4289222.67 frames. ], batch size: 247, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:38:42,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=993798.0, ans=0.125 2023-06-24 01:39:10,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=993858.0, ans=0.125 2023-06-24 01:39:22,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-24 01:39:26,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=993918.0, ans=0.2 2023-06-24 01:40:09,768 INFO [train.py:996] (1/4) Epoch 6, batch 13200, loss[loss=0.23, simple_loss=0.2994, pruned_loss=0.08026, over 21707.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3049, pruned_loss=0.07548, over 4283746.01 frames. ], batch size: 298, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:40:10,756 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=22.5 2023-06-24 01:40:12,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-24 01:41:31,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=994278.0, ans=0.0 2023-06-24 01:41:38,765 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:41:56,114 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.674e+02 2.987e+02 3.685e+02 5.841e+02, threshold=5.974e+02, percent-clipped=0.0 2023-06-24 01:41:59,677 INFO [train.py:996] (1/4) Epoch 6, batch 13250, loss[loss=0.1992, simple_loss=0.2635, pruned_loss=0.06743, over 21554.00 frames. 
], tot_loss[loss=0.2298, simple_loss=0.3044, pruned_loss=0.07764, over 4280491.74 frames. ], batch size: 230, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:42:33,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=994398.0, ans=0.0 2023-06-24 01:42:38,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=994458.0, ans=0.125 2023-06-24 01:42:42,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=994458.0, ans=0.0 2023-06-24 01:42:47,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=994458.0, ans=0.125 2023-06-24 01:43:09,505 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 01:43:49,655 INFO [train.py:996] (1/4) Epoch 6, batch 13300, loss[loss=0.2315, simple_loss=0.3181, pruned_loss=0.07242, over 21705.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3075, pruned_loss=0.07695, over 4283503.18 frames. ], batch size: 298, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:44:15,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=994698.0, ans=0.0 2023-06-24 01:45:01,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=994818.0, ans=0.125 2023-06-24 01:45:26,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=994878.0, ans=0.2 2023-06-24 01:45:41,749 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.520e+02 2.865e+02 3.222e+02 4.480e+02, threshold=5.730e+02, percent-clipped=0.0 2023-06-24 01:45:41,781 INFO [train.py:996] (1/4) Epoch 6, batch 13350, loss[loss=0.2644, simple_loss=0.3386, pruned_loss=0.09507, over 21822.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3122, pruned_loss=0.07887, over 4283157.05 frames. ], batch size: 118, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:45:57,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=994938.0, ans=0.125 2023-06-24 01:46:07,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=994998.0, ans=0.125 2023-06-24 01:47:32,586 INFO [train.py:996] (1/4) Epoch 6, batch 13400, loss[loss=0.2327, simple_loss=0.303, pruned_loss=0.08121, over 21454.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3134, pruned_loss=0.08065, over 4282320.93 frames. ], batch size: 194, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:47:40,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. 
limit=22.5 2023-06-24 01:48:12,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=995358.0, ans=0.125 2023-06-24 01:48:28,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=995358.0, ans=0.125 2023-06-24 01:48:49,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=995418.0, ans=0.0 2023-06-24 01:49:09,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=995478.0, ans=0.0 2023-06-24 01:49:12,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=995478.0, ans=0.125 2023-06-24 01:49:23,327 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.783e+02 3.072e+02 3.557e+02 5.639e+02, threshold=6.143e+02, percent-clipped=0.0 2023-06-24 01:49:23,359 INFO [train.py:996] (1/4) Epoch 6, batch 13450, loss[loss=0.2707, simple_loss=0.3332, pruned_loss=0.1041, over 21365.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3158, pruned_loss=0.08333, over 4288712.78 frames. ], batch size: 471, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:49:53,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=995598.0, ans=0.04949747468305833 2023-06-24 01:50:00,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-24 01:51:04,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-24 01:51:13,918 INFO [train.py:996] (1/4) Epoch 6, batch 13500, loss[loss=0.1665, simple_loss=0.2286, pruned_loss=0.05217, over 21279.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3058, pruned_loss=0.08022, over 4291711.38 frames. ], batch size: 159, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:52:31,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=996018.0, ans=0.0 2023-06-24 01:53:06,787 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.607e+02 3.013e+02 3.630e+02 7.011e+02, threshold=6.026e+02, percent-clipped=1.0 2023-06-24 01:53:06,819 INFO [train.py:996] (1/4) Epoch 6, batch 13550, loss[loss=0.2164, simple_loss=0.3033, pruned_loss=0.06476, over 21407.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3098, pruned_loss=0.07963, over 4284904.38 frames. 
], batch size: 131, lr: 5.12e-03, grad_scale: 8.0 2023-06-24 01:53:23,889 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:53:52,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=996258.0, ans=0.125 2023-06-24 01:54:15,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=996318.0, ans=0.0 2023-06-24 01:54:34,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=996378.0, ans=0.1 2023-06-24 01:54:56,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=996438.0, ans=0.125 2023-06-24 01:54:57,330 INFO [train.py:996] (1/4) Epoch 6, batch 13600, loss[loss=0.221, simple_loss=0.2911, pruned_loss=0.0755, over 21290.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.312, pruned_loss=0.08016, over 4283686.88 frames. ], batch size: 159, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:55:35,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=996498.0, ans=0.1 2023-06-24 01:56:10,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=996618.0, ans=0.1 2023-06-24 01:56:17,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=996618.0, ans=0.125 2023-06-24 01:56:40,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=996678.0, ans=0.125 2023-06-24 01:56:47,197 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.489e+02 2.780e+02 3.135e+02 6.333e+02, threshold=5.560e+02, percent-clipped=1.0 2023-06-24 01:56:47,241 INFO [train.py:996] (1/4) Epoch 6, batch 13650, loss[loss=0.185, simple_loss=0.2436, pruned_loss=0.0632, over 21457.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3073, pruned_loss=0.07652, over 4277772.44 frames. ], batch size: 212, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:57:07,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=996738.0, ans=0.1 2023-06-24 01:57:25,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=996798.0, ans=0.07 2023-06-24 01:58:00,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=996918.0, ans=0.125 2023-06-24 01:58:13,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-24 01:58:37,408 INFO [train.py:996] (1/4) Epoch 6, batch 13700, loss[loss=0.3057, simple_loss=0.3661, pruned_loss=0.1226, over 21492.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3003, pruned_loss=0.07514, over 4265737.13 frames. 
], batch size: 508, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:58:39,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=997038.0, ans=0.125 2023-06-24 01:58:56,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=997038.0, ans=0.125 2023-06-24 01:59:02,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=997038.0, ans=0.125 2023-06-24 01:59:11,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-24 01:59:20,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=997098.0, ans=0.125 2023-06-24 01:59:25,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=997158.0, ans=0.1 2023-06-24 02:00:00,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=997218.0, ans=0.125 2023-06-24 02:00:16,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=997278.0, ans=0.2 2023-06-24 02:00:18,635 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:00:41,405 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.702e+02 3.112e+02 3.506e+02 5.710e+02, threshold=6.223e+02, percent-clipped=1.0 2023-06-24 02:00:41,436 INFO [train.py:996] (1/4) Epoch 6, batch 13750, loss[loss=0.273, simple_loss=0.3413, pruned_loss=0.1023, over 21458.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.299, pruned_loss=0.07503, over 4263933.60 frames. ], batch size: 508, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:00:42,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=997338.0, ans=0.0 2023-06-24 02:01:07,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.92 vs. limit=22.5 2023-06-24 02:01:40,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-24 02:01:41,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=997458.0, ans=0.0 2023-06-24 02:01:56,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=997518.0, ans=0.0 2023-06-24 02:02:11,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=997578.0, ans=0.2 2023-06-24 02:02:26,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-24 02:02:30,754 INFO [train.py:996] (1/4) Epoch 6, batch 13800, loss[loss=0.2542, simple_loss=0.3609, pruned_loss=0.07374, over 21884.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3039, pruned_loss=0.07386, over 4264938.72 frames. 
], batch size: 317, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:03:34,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=997818.0, ans=0.0 2023-06-24 02:03:34,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-24 02:04:02,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=997818.0, ans=0.125 2023-06-24 02:04:07,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=997878.0, ans=0.125 2023-06-24 02:04:08,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-24 02:04:22,870 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.948e+02 3.505e+02 4.086e+02 7.226e+02, threshold=7.009e+02, percent-clipped=3.0 2023-06-24 02:04:22,901 INFO [train.py:996] (1/4) Epoch 6, batch 13850, loss[loss=0.2633, simple_loss=0.3456, pruned_loss=0.09046, over 21712.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3104, pruned_loss=0.07546, over 4265505.78 frames. ], batch size: 351, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:04:39,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=997938.0, ans=0.0 2023-06-24 02:05:37,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.58 vs. limit=6.0 2023-06-24 02:05:38,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=998118.0, ans=0.5 2023-06-24 02:05:52,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=998118.0, ans=0.1 2023-06-24 02:06:17,493 INFO [train.py:996] (1/4) Epoch 6, batch 13900, loss[loss=0.2268, simple_loss=0.3017, pruned_loss=0.076, over 21481.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3141, pruned_loss=0.07861, over 4268685.62 frames. ], batch size: 211, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:07:45,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=998478.0, ans=0.125 2023-06-24 02:08:04,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=12.0 2023-06-24 02:08:07,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=998538.0, ans=0.125 2023-06-24 02:08:08,423 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.809e+02 3.184e+02 3.702e+02 5.147e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-24 02:08:08,454 INFO [train.py:996] (1/4) Epoch 6, batch 13950, loss[loss=0.2269, simple_loss=0.2957, pruned_loss=0.07909, over 21514.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3139, pruned_loss=0.08029, over 4280643.40 frames. 
], batch size: 131, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:08:27,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=998538.0, ans=0.125 2023-06-24 02:08:32,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=998598.0, ans=0.0 2023-06-24 02:08:34,469 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:08:53,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=998658.0, ans=0.0 2023-06-24 02:09:09,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=998658.0, ans=10.0 2023-06-24 02:09:14,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=998658.0, ans=0.125 2023-06-24 02:09:34,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-06-24 02:09:57,215 INFO [train.py:996] (1/4) Epoch 6, batch 14000, loss[loss=0.2365, simple_loss=0.311, pruned_loss=0.08094, over 21474.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3086, pruned_loss=0.07778, over 4272137.32 frames. ], batch size: 471, lr: 5.12e-03, grad_scale: 32.0 2023-06-24 02:10:21,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-24 02:11:19,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-24 02:11:46,162 INFO [train.py:996] (1/4) Epoch 6, batch 14050, loss[loss=0.1876, simple_loss=0.2552, pruned_loss=0.06004, over 21626.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3045, pruned_loss=0.074, over 4274412.24 frames. ], batch size: 231, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:11:47,705 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.313e+02 2.760e+02 3.193e+02 4.998e+02, threshold=5.521e+02, percent-clipped=0.0 2023-06-24 02:13:35,270 INFO [train.py:996] (1/4) Epoch 6, batch 14100, loss[loss=0.2623, simple_loss=0.334, pruned_loss=0.09535, over 20672.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2994, pruned_loss=0.07377, over 4262090.70 frames. ], batch size: 607, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:13:50,497 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:13:55,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=999438.0, ans=0.025 2023-06-24 02:15:15,656 INFO [train.py:996] (1/4) Epoch 6, batch 14150, loss[loss=0.2224, simple_loss=0.2978, pruned_loss=0.07355, over 15875.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3026, pruned_loss=0.07496, over 4248565.69 frames. 
], batch size: 62, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:15:17,250 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.422e+02 2.767e+02 3.253e+02 5.449e+02, threshold=5.534e+02, percent-clipped=0.0 2023-06-24 02:16:09,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=999858.0, ans=0.2 2023-06-24 02:16:50,573 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:17:02,005 INFO [train.py:996] (1/4) Epoch 6, batch 14200, loss[loss=0.2279, simple_loss=0.294, pruned_loss=0.08088, over 21676.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3007, pruned_loss=0.07366, over 4247502.76 frames. ], batch size: 298, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:18:35,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.41 vs. limit=22.5 2023-06-24 02:18:47,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1000278.0, ans=0.0 2023-06-24 02:18:52,101 INFO [train.py:996] (1/4) Epoch 6, batch 14250, loss[loss=0.2034, simple_loss=0.2798, pruned_loss=0.06349, over 21643.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2958, pruned_loss=0.07391, over 4257095.93 frames. ], batch size: 263, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:18:53,608 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.255e+02 2.600e+02 3.105e+02 6.584e+02, threshold=5.199e+02, percent-clipped=1.0 2023-06-24 02:19:13,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1000398.0, ans=0.0 2023-06-24 02:20:36,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1000578.0, ans=0.125 2023-06-24 02:20:44,853 INFO [train.py:996] (1/4) Epoch 6, batch 14300, loss[loss=0.2068, simple_loss=0.2867, pruned_loss=0.06345, over 21162.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3005, pruned_loss=0.07482, over 4246971.24 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:20:51,464 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-24 02:21:26,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-24 02:21:28,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1000698.0, ans=0.2 2023-06-24 02:21:38,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1000758.0, ans=0.125 2023-06-24 02:22:00,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=12.0 2023-06-24 02:22:20,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1000878.0, ans=0.125 2023-06-24 02:22:34,135 INFO [train.py:996] (1/4) Epoch 6, batch 14350, loss[loss=0.2282, simple_loss=0.3104, pruned_loss=0.07295, over 21415.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3056, pruned_loss=0.07639, over 4250292.58 frames. 
], batch size: 548, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:22:35,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1000938.0, ans=0.125 2023-06-24 02:22:36,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.573e+02 3.287e+02 4.161e+02 6.824e+02, threshold=6.573e+02, percent-clipped=7.0 2023-06-24 02:23:43,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1001058.0, ans=0.0 2023-06-24 02:24:25,762 INFO [train.py:996] (1/4) Epoch 6, batch 14400, loss[loss=0.2349, simple_loss=0.2991, pruned_loss=0.08532, over 21807.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3051, pruned_loss=0.07646, over 4251505.83 frames. ], batch size: 441, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:24:31,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1001238.0, ans=0.0 2023-06-24 02:25:53,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1001478.0, ans=0.125 2023-06-24 02:26:09,493 INFO [train.py:996] (1/4) Epoch 6, batch 14450, loss[loss=0.2091, simple_loss=0.2751, pruned_loss=0.0715, over 21450.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2994, pruned_loss=0.07618, over 4254635.15 frames. ], batch size: 389, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:26:16,060 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.443e+02 2.785e+02 3.113e+02 5.962e+02, threshold=5.570e+02, percent-clipped=0.0 2023-06-24 02:26:17,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1001538.0, ans=0.125 2023-06-24 02:26:27,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-24 02:27:01,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-06-24 02:27:47,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1001778.0, ans=0.09899494936611666 2023-06-24 02:27:55,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1001838.0, ans=0.0 2023-06-24 02:27:56,672 INFO [train.py:996] (1/4) Epoch 6, batch 14500, loss[loss=0.2177, simple_loss=0.2978, pruned_loss=0.06879, over 21234.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2952, pruned_loss=0.07528, over 4257823.90 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:27:57,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1001838.0, ans=0.125 2023-06-24 02:28:09,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1001838.0, ans=0.0 2023-06-24 02:28:26,358 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. 
limit=6.0 2023-06-24 02:29:11,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1002018.0, ans=0.1 2023-06-24 02:29:11,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1002018.0, ans=0.0 2023-06-24 02:29:52,398 INFO [train.py:996] (1/4) Epoch 6, batch 14550, loss[loss=0.2658, simple_loss=0.3352, pruned_loss=0.09817, over 21752.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2994, pruned_loss=0.07668, over 4266016.43 frames. ], batch size: 298, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:30:01,986 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.448e+02 2.869e+02 3.616e+02 7.079e+02, threshold=5.738e+02, percent-clipped=4.0 2023-06-24 02:31:12,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1002318.0, ans=0.125 2023-06-24 02:31:32,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.77 vs. limit=15.0 2023-06-24 02:31:38,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1002378.0, ans=0.125 2023-06-24 02:31:46,403 INFO [train.py:996] (1/4) Epoch 6, batch 14600, loss[loss=0.2339, simple_loss=0.3164, pruned_loss=0.07567, over 21246.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3067, pruned_loss=0.07962, over 4268881.66 frames. ], batch size: 176, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:31:46,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1002438.0, ans=0.2 2023-06-24 02:31:52,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1002438.0, ans=0.125 2023-06-24 02:32:12,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1002498.0, ans=0.125 2023-06-24 02:32:38,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1002558.0, ans=0.2 2023-06-24 02:32:41,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1002618.0, ans=0.1 2023-06-24 02:32:42,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-24 02:32:55,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1002618.0, ans=0.125 2023-06-24 02:32:56,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1002618.0, ans=0.0 2023-06-24 02:33:28,152 INFO [train.py:996] (1/4) Epoch 6, batch 14650, loss[loss=0.2326, simple_loss=0.3321, pruned_loss=0.06649, over 21754.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3076, pruned_loss=0.07853, over 4262710.14 frames. 
], batch size: 351, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:33:31,380 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.911e+02 3.568e+02 4.716e+02 7.092e+02, threshold=7.135e+02, percent-clipped=11.0 2023-06-24 02:33:37,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1002738.0, ans=0.125 2023-06-24 02:34:41,898 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:35:10,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1002978.0, ans=0.95 2023-06-24 02:35:15,489 INFO [train.py:996] (1/4) Epoch 6, batch 14700, loss[loss=0.2364, simple_loss=0.3404, pruned_loss=0.06621, over 21631.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3005, pruned_loss=0.07293, over 4260999.86 frames. ], batch size: 389, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:35:22,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1003038.0, ans=0.0 2023-06-24 02:35:29,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1003038.0, ans=0.0 2023-06-24 02:35:34,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-24 02:35:35,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1003098.0, ans=0.125 2023-06-24 02:37:05,513 INFO [train.py:996] (1/4) Epoch 6, batch 14750, loss[loss=0.2525, simple_loss=0.3233, pruned_loss=0.09084, over 21818.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3057, pruned_loss=0.07548, over 4262195.43 frames. ], batch size: 282, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:37:08,873 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.584e+02 3.183e+02 3.769e+02 5.952e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-24 02:37:36,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1003398.0, ans=0.125 2023-06-24 02:37:45,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1003398.0, ans=0.2 2023-06-24 02:37:55,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1003458.0, ans=0.0 2023-06-24 02:38:12,998 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:38:34,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.22 vs. limit=10.0 2023-06-24 02:38:37,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1003578.0, ans=0.0 2023-06-24 02:38:39,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.23 vs. 
limit=15.0 2023-06-24 02:38:46,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1003578.0, ans=0.2 2023-06-24 02:38:59,538 INFO [train.py:996] (1/4) Epoch 6, batch 14800, loss[loss=0.2242, simple_loss=0.3016, pruned_loss=0.07343, over 21620.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.319, pruned_loss=0.08224, over 4256672.64 frames. ], batch size: 247, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:39:15,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1003638.0, ans=0.0 2023-06-24 02:39:19,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1003638.0, ans=0.04949747468305833 2023-06-24 02:40:02,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1003818.0, ans=0.1 2023-06-24 02:40:10,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-24 02:40:10,521 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-24 02:40:13,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1003818.0, ans=0.0 2023-06-24 02:40:26,453 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-24 02:40:55,585 INFO [train.py:996] (1/4) Epoch 6, batch 14850, loss[loss=0.2922, simple_loss=0.3673, pruned_loss=0.1085, over 21611.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3117, pruned_loss=0.08152, over 4266310.25 frames. ], batch size: 441, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:40:59,018 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.678e+02 3.116e+02 4.005e+02 6.901e+02, threshold=6.233e+02, percent-clipped=1.0 2023-06-24 02:41:04,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2023-06-24 02:41:10,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1003938.0, ans=0.125 2023-06-24 02:41:52,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1004058.0, ans=0.0 2023-06-24 02:42:26,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1004118.0, ans=0.0 2023-06-24 02:42:35,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1004178.0, ans=0.125 2023-06-24 02:42:47,057 INFO [train.py:996] (1/4) Epoch 6, batch 14900, loss[loss=0.2071, simple_loss=0.272, pruned_loss=0.07105, over 21640.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3126, pruned_loss=0.08166, over 4262202.86 frames. 
], batch size: 112, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:43:24,678 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:43:36,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1004358.0, ans=0.125 2023-06-24 02:44:36,540 INFO [train.py:996] (1/4) Epoch 6, batch 14950, loss[loss=0.2536, simple_loss=0.3377, pruned_loss=0.08473, over 21796.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.314, pruned_loss=0.08192, over 4260533.88 frames. ], batch size: 118, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:44:39,952 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.635e+02 3.010e+02 3.574e+02 5.643e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-24 02:44:40,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1004538.0, ans=0.0 2023-06-24 02:45:16,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-24 02:45:52,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1004718.0, ans=0.125 2023-06-24 02:46:18,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1004778.0, ans=0.0 2023-06-24 02:46:24,990 INFO [train.py:996] (1/4) Epoch 6, batch 15000, loss[loss=0.2296, simple_loss=0.3087, pruned_loss=0.07522, over 21783.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3154, pruned_loss=0.08325, over 4259745.07 frames. ], batch size: 332, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:46:24,991 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 02:46:40,560 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.7168, 2.1537, 3.2820, 2.6660], device='cuda:1') 2023-06-24 02:46:45,298 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2621, simple_loss=0.3511, pruned_loss=0.08652, over 1796401.00 frames. 2023-06-24 02:46:45,299 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23439MB 2023-06-24 02:47:13,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=8.0 2023-06-24 02:47:52,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1004958.0, ans=0.05 2023-06-24 02:47:59,535 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:48:36,408 INFO [train.py:996] (1/4) Epoch 6, batch 15050, loss[loss=0.2121, simple_loss=0.2957, pruned_loss=0.06428, over 21637.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3159, pruned_loss=0.08362, over 4261374.34 frames. 
], batch size: 263, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:48:45,229 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.748e+02 3.194e+02 3.808e+02 5.890e+02, threshold=6.387e+02, percent-clipped=0.0 2023-06-24 02:50:05,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1005318.0, ans=0.2 2023-06-24 02:50:26,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1005378.0, ans=0.1 2023-06-24 02:50:31,383 INFO [train.py:996] (1/4) Epoch 6, batch 15100, loss[loss=0.2445, simple_loss=0.3177, pruned_loss=0.08572, over 21320.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3185, pruned_loss=0.08365, over 4262085.60 frames. ], batch size: 548, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:51:05,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1005498.0, ans=0.125 2023-06-24 02:51:48,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1005618.0, ans=0.0 2023-06-24 02:52:20,511 INFO [train.py:996] (1/4) Epoch 6, batch 15150, loss[loss=0.2233, simple_loss=0.293, pruned_loss=0.07683, over 21806.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3142, pruned_loss=0.08298, over 4255792.23 frames. ], batch size: 98, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:52:21,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1005738.0, ans=0.125 2023-06-24 02:52:29,940 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.489e+02 2.718e+02 3.127e+02 6.231e+02, threshold=5.435e+02, percent-clipped=0.0 2023-06-24 02:54:14,597 INFO [train.py:996] (1/4) Epoch 6, batch 15200, loss[loss=0.1838, simple_loss=0.2725, pruned_loss=0.04753, over 21388.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3059, pruned_loss=0.07895, over 4255916.60 frames. ], batch size: 211, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:55:01,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1006158.0, ans=0.125 2023-06-24 02:55:22,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1006218.0, ans=0.125 2023-06-24 02:55:27,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1006218.0, ans=0.125 2023-06-24 02:55:53,940 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-24 02:56:03,359 INFO [train.py:996] (1/4) Epoch 6, batch 15250, loss[loss=0.2098, simple_loss=0.2827, pruned_loss=0.06844, over 21706.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3008, pruned_loss=0.07786, over 4258102.87 frames. 
], batch size: 124, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:56:13,787 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 2.536e+02 2.850e+02 3.419e+02 5.207e+02, threshold=5.701e+02, percent-clipped=0.0 2023-06-24 02:56:32,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1006398.0, ans=0.125 2023-06-24 02:56:52,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1006458.0, ans=0.125 2023-06-24 02:56:59,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=22.5 2023-06-24 02:57:57,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1006638.0, ans=0.2 2023-06-24 02:57:58,579 INFO [train.py:996] (1/4) Epoch 6, batch 15300, loss[loss=0.238, simple_loss=0.3, pruned_loss=0.08799, over 21325.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3036, pruned_loss=0.08023, over 4264937.75 frames. ], batch size: 549, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:58:36,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1006698.0, ans=0.05 2023-06-24 02:59:48,121 INFO [train.py:996] (1/4) Epoch 6, batch 15350, loss[loss=0.2495, simple_loss=0.3318, pruned_loss=0.08361, over 21500.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3088, pruned_loss=0.08226, over 4262910.13 frames. ], batch size: 131, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:59:49,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-24 02:59:52,936 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.681e+02 3.062e+02 3.788e+02 5.909e+02, threshold=6.124e+02, percent-clipped=1.0 2023-06-24 03:00:37,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1007058.0, ans=0.2 2023-06-24 03:00:41,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1007058.0, ans=0.0 2023-06-24 03:00:41,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1007058.0, ans=0.0 2023-06-24 03:00:46,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1007118.0, ans=0.125 2023-06-24 03:01:23,794 INFO [train.py:996] (1/4) Epoch 6, batch 15400, loss[loss=0.2048, simple_loss=0.2913, pruned_loss=0.05913, over 15420.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3092, pruned_loss=0.07992, over 4262945.94 frames. 
], batch size: 60, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 03:01:24,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1007238.0, ans=0.2 2023-06-24 03:02:01,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1007298.0, ans=0.2 2023-06-24 03:02:01,218 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:02:40,756 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:03:05,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1007538.0, ans=0.0 2023-06-24 03:03:12,816 INFO [train.py:996] (1/4) Epoch 6, batch 15450, loss[loss=0.2418, simple_loss=0.3074, pruned_loss=0.08806, over 21826.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3062, pruned_loss=0.07904, over 4270645.38 frames. ], batch size: 441, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 03:03:13,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1007538.0, ans=0.0 2023-06-24 03:03:23,432 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.379e+02 2.689e+02 3.180e+02 6.204e+02, threshold=5.379e+02, percent-clipped=1.0 2023-06-24 03:03:38,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1007598.0, ans=0.95 2023-06-24 03:04:23,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1007718.0, ans=0.05 2023-06-24 03:05:07,247 INFO [train.py:996] (1/4) Epoch 6, batch 15500, loss[loss=0.2126, simple_loss=0.3092, pruned_loss=0.05801, over 20724.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3098, pruned_loss=0.07957, over 4262699.48 frames. ], batch size: 607, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:05:24,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1007898.0, ans=0.125 2023-06-24 03:06:58,618 INFO [train.py:996] (1/4) Epoch 6, batch 15550, loss[loss=0.1849, simple_loss=0.2781, pruned_loss=0.04586, over 21583.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3073, pruned_loss=0.07685, over 4267289.82 frames. ], batch size: 230, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:07:03,914 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.505e+02 2.792e+02 3.296e+02 4.983e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-24 03:07:56,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1008258.0, ans=0.1 2023-06-24 03:08:03,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1008318.0, ans=0.0 2023-06-24 03:08:12,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008318.0, ans=0.1 2023-06-24 03:08:46,198 INFO [train.py:996] (1/4) Epoch 6, batch 15600, loss[loss=0.2163, simple_loss=0.2803, pruned_loss=0.07614, over 21726.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3017, pruned_loss=0.07563, over 4274330.75 frames. 
], batch size: 334, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:09:38,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1008558.0, ans=0.05 2023-06-24 03:09:49,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1008618.0, ans=0.125 2023-06-24 03:10:22,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1008678.0, ans=0.125 2023-06-24 03:10:33,913 INFO [train.py:996] (1/4) Epoch 6, batch 15650, loss[loss=0.2097, simple_loss=0.2681, pruned_loss=0.07567, over 21176.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3009, pruned_loss=0.07517, over 4274508.96 frames. ], batch size: 548, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:10:39,280 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.465e+02 2.724e+02 3.048e+02 4.286e+02, threshold=5.447e+02, percent-clipped=0.0 2023-06-24 03:10:53,923 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:11:53,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1008918.0, ans=0.5 2023-06-24 03:12:21,530 INFO [train.py:996] (1/4) Epoch 6, batch 15700, loss[loss=0.174, simple_loss=0.2418, pruned_loss=0.05312, over 15300.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2968, pruned_loss=0.07419, over 4265648.20 frames. ], batch size: 60, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:12:25,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1009038.0, ans=0.2 2023-06-24 03:13:02,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1009098.0, ans=0.1 2023-06-24 03:13:04,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-24 03:13:37,539 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:13:59,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1009278.0, ans=0.125 2023-06-24 03:14:08,903 INFO [train.py:996] (1/4) Epoch 6, batch 15750, loss[loss=0.242, simple_loss=0.2955, pruned_loss=0.09424, over 21806.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2928, pruned_loss=0.07392, over 4262589.86 frames. ], batch size: 98, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:14:14,105 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.454e+02 2.677e+02 3.133e+02 4.467e+02, threshold=5.354e+02, percent-clipped=0.0 2023-06-24 03:15:06,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. limit=10.0 2023-06-24 03:15:17,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1009518.0, ans=0.0 2023-06-24 03:15:28,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. 
limit=15.0 2023-06-24 03:15:49,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1009578.0, ans=0.125 2023-06-24 03:15:57,551 INFO [train.py:996] (1/4) Epoch 6, batch 15800, loss[loss=0.1917, simple_loss=0.2502, pruned_loss=0.06662, over 21473.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2887, pruned_loss=0.07371, over 4261816.17 frames. ], batch size: 230, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:15:58,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1009638.0, ans=0.1 2023-06-24 03:15:59,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1009638.0, ans=0.05 2023-06-24 03:16:01,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1009638.0, ans=0.1 2023-06-24 03:17:20,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1009818.0, ans=0.0 2023-06-24 03:17:45,427 INFO [train.py:996] (1/4) Epoch 6, batch 15850, loss[loss=0.2587, simple_loss=0.3146, pruned_loss=0.1014, over 21361.00 frames. ], tot_loss[loss=0.221, simple_loss=0.291, pruned_loss=0.07544, over 4259570.24 frames. ], batch size: 471, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:17:50,483 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.697e+02 2.988e+02 3.672e+02 5.659e+02, threshold=5.976e+02, percent-clipped=2.0 2023-06-24 03:17:54,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1009938.0, ans=0.0 2023-06-24 03:18:03,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0 2023-06-24 03:18:31,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-24 03:19:32,099 INFO [train.py:996] (1/4) Epoch 6, batch 15900, loss[loss=0.2488, simple_loss=0.2919, pruned_loss=0.1028, over 21331.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2876, pruned_loss=0.07494, over 4258379.00 frames. ], batch size: 473, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:19:41,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1010238.0, ans=0.1 2023-06-24 03:19:43,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1010238.0, ans=0.2 2023-06-24 03:19:50,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. limit=10.0 2023-06-24 03:20:58,505 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-24 03:21:00,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-24 03:21:19,554 INFO [train.py:996] (1/4) Epoch 6, batch 15950, loss[loss=0.184, simple_loss=0.285, pruned_loss=0.04152, over 21762.00 frames. 
], tot_loss[loss=0.2174, simple_loss=0.2882, pruned_loss=0.07326, over 4250083.61 frames. ], batch size: 351, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:21:24,433 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 2.251e+02 2.569e+02 3.023e+02 4.641e+02, threshold=5.138e+02, percent-clipped=0.0 2023-06-24 03:21:27,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-24 03:21:39,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1010598.0, ans=0.125 2023-06-24 03:21:44,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1010598.0, ans=0.0 2023-06-24 03:21:59,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1010658.0, ans=0.0 2023-06-24 03:22:22,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1010718.0, ans=0.0 2023-06-24 03:23:07,193 INFO [train.py:996] (1/4) Epoch 6, batch 16000, loss[loss=0.2221, simple_loss=0.3142, pruned_loss=0.06504, over 21647.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2917, pruned_loss=0.07248, over 4253859.56 frames. ], batch size: 389, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:23:11,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1010838.0, ans=0.0 2023-06-24 03:24:02,912 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-24 03:24:53,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1011078.0, ans=0.0 2023-06-24 03:24:55,885 INFO [train.py:996] (1/4) Epoch 6, batch 16050, loss[loss=0.2171, simple_loss=0.3144, pruned_loss=0.05986, over 21795.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2931, pruned_loss=0.07053, over 4261099.57 frames. ], batch size: 282, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:24:58,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1011138.0, ans=0.0 2023-06-24 03:25:02,631 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.499e+02 2.877e+02 3.627e+02 5.675e+02, threshold=5.753e+02, percent-clipped=3.0 2023-06-24 03:25:15,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1011198.0, ans=0.125 2023-06-24 03:25:36,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1011258.0, ans=0.0 2023-06-24 03:25:48,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.76 vs. 
limit=22.5 2023-06-24 03:25:59,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1011318.0, ans=0.125 2023-06-24 03:26:18,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1011378.0, ans=0.0 2023-06-24 03:26:37,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1011378.0, ans=0.0 2023-06-24 03:26:42,141 INFO [train.py:996] (1/4) Epoch 6, batch 16100, loss[loss=0.2809, simple_loss=0.3418, pruned_loss=0.11, over 21603.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2998, pruned_loss=0.07225, over 4266173.92 frames. ], batch size: 471, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:27:25,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1011558.0, ans=0.0 2023-06-24 03:28:02,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1011618.0, ans=0.125 2023-06-24 03:28:31,503 INFO [train.py:996] (1/4) Epoch 6, batch 16150, loss[loss=0.2624, simple_loss=0.3166, pruned_loss=0.1041, over 21796.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3009, pruned_loss=0.07403, over 4275569.57 frames. ], batch size: 441, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:28:38,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.535e+02 2.977e+02 3.474e+02 6.271e+02, threshold=5.955e+02, percent-clipped=2.0 2023-06-24 03:29:05,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1011798.0, ans=0.1 2023-06-24 03:29:21,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1011858.0, ans=0.0 2023-06-24 03:29:32,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1011858.0, ans=0.125 2023-06-24 03:30:09,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-24 03:30:21,147 INFO [train.py:996] (1/4) Epoch 6, batch 16200, loss[loss=0.2249, simple_loss=0.3062, pruned_loss=0.07185, over 21674.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3049, pruned_loss=0.07581, over 4277991.20 frames. ], batch size: 263, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:31:02,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1012098.0, ans=0.125 2023-06-24 03:32:09,574 INFO [train.py:996] (1/4) Epoch 6, batch 16250, loss[loss=0.2298, simple_loss=0.304, pruned_loss=0.07779, over 21422.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3038, pruned_loss=0.07583, over 4270873.16 frames. ], batch size: 471, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:32:13,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.97 vs. 
limit=15.0 2023-06-24 03:32:16,272 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.579e+02 2.975e+02 3.411e+02 5.928e+02, threshold=5.950e+02, percent-clipped=0.0 2023-06-24 03:33:19,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.92 vs. limit=22.5 2023-06-24 03:33:34,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1012518.0, ans=0.125 2023-06-24 03:33:41,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1012578.0, ans=0.1 2023-06-24 03:33:57,753 INFO [train.py:996] (1/4) Epoch 6, batch 16300, loss[loss=0.1854, simple_loss=0.2552, pruned_loss=0.05783, over 21233.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2975, pruned_loss=0.07163, over 4272590.64 frames. ], batch size: 159, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:34:31,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1012698.0, ans=0.125 2023-06-24 03:34:58,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1012758.0, ans=0.0 2023-06-24 03:35:44,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1012878.0, ans=0.125 2023-06-24 03:35:47,987 INFO [train.py:996] (1/4) Epoch 6, batch 16350, loss[loss=0.2752, simple_loss=0.3432, pruned_loss=0.1036, over 21606.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2978, pruned_loss=0.0728, over 4262524.24 frames. ], batch size: 415, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:36:00,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.290e+02 2.661e+02 3.043e+02 4.876e+02, threshold=5.321e+02, percent-clipped=0.0 2023-06-24 03:36:13,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1012938.0, ans=0.0 2023-06-24 03:36:22,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1012998.0, ans=0.125 2023-06-24 03:36:55,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1013118.0, ans=0.1 2023-06-24 03:37:14,769 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=15.0 2023-06-24 03:37:22,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.51 vs. limit=15.0 2023-06-24 03:37:26,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1013178.0, ans=0.125 2023-06-24 03:37:29,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1013178.0, ans=0.0 2023-06-24 03:37:36,532 INFO [train.py:996] (1/4) Epoch 6, batch 16400, loss[loss=0.2396, simple_loss=0.308, pruned_loss=0.08566, over 21909.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2998, pruned_loss=0.07384, over 4269540.20 frames. 
], batch size: 118, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:38:00,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1013238.0, ans=0.125 2023-06-24 03:38:21,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1013358.0, ans=0.125 2023-06-24 03:39:04,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1013418.0, ans=10.0 2023-06-24 03:39:29,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0 2023-06-24 03:39:30,135 INFO [train.py:996] (1/4) Epoch 6, batch 16450, loss[loss=0.2732, simple_loss=0.3223, pruned_loss=0.1121, over 21767.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3011, pruned_loss=0.07539, over 4268237.53 frames. ], batch size: 508, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:39:42,774 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.477e+02 2.722e+02 3.151e+02 4.827e+02, threshold=5.443e+02, percent-clipped=0.0 2023-06-24 03:40:07,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1013598.0, ans=0.125 2023-06-24 03:40:22,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1013658.0, ans=0.07 2023-06-24 03:40:46,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-24 03:41:26,348 INFO [train.py:996] (1/4) Epoch 6, batch 16500, loss[loss=0.1785, simple_loss=0.2338, pruned_loss=0.06156, over 21908.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3006, pruned_loss=0.07612, over 4278456.89 frames. ], batch size: 107, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:41:35,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1013838.0, ans=0.125 2023-06-24 03:43:07,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1014078.0, ans=0.125 2023-06-24 03:43:12,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1014078.0, ans=0.0 2023-06-24 03:43:15,526 INFO [train.py:996] (1/4) Epoch 6, batch 16550, loss[loss=0.2197, simple_loss=0.2959, pruned_loss=0.07178, over 21441.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2981, pruned_loss=0.07381, over 4275508.66 frames. ], batch size: 211, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:43:22,448 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.591e+02 3.154e+02 3.856e+02 7.253e+02, threshold=6.309e+02, percent-clipped=4.0 2023-06-24 03:43:43,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1014198.0, ans=0.125 2023-06-24 03:44:43,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-24 03:45:06,826 INFO [train.py:996] (1/4) Epoch 6, batch 16600, loss[loss=0.3231, simple_loss=0.4135, pruned_loss=0.1164, over 21705.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3078, pruned_loss=0.07725, over 4279270.12 frames. 
], batch size: 441, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:45:09,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1014438.0, ans=0.0 2023-06-24 03:45:45,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1014498.0, ans=0.125 2023-06-24 03:46:11,754 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-24 03:46:36,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1014678.0, ans=0.125 2023-06-24 03:46:40,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=22.5 2023-06-24 03:46:45,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1014678.0, ans=0.0 2023-06-24 03:47:02,040 INFO [train.py:996] (1/4) Epoch 6, batch 16650, loss[loss=0.2368, simple_loss=0.3204, pruned_loss=0.07664, over 20693.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3181, pruned_loss=0.08024, over 4274550.67 frames. ], batch size: 607, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:47:14,438 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.632e+02 2.959e+02 3.254e+02 5.416e+02, threshold=5.917e+02, percent-clipped=0.0 2023-06-24 03:48:00,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1014858.0, ans=0.2 2023-06-24 03:48:13,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1014918.0, ans=0.0 2023-06-24 03:48:48,578 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-24 03:48:59,618 INFO [train.py:996] (1/4) Epoch 6, batch 16700, loss[loss=0.2378, simple_loss=0.3468, pruned_loss=0.06439, over 20691.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3213, pruned_loss=0.08082, over 4262628.87 frames. ], batch size: 607, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:49:18,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.45 vs. limit=15.0 2023-06-24 03:49:39,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1015098.0, ans=0.125 2023-06-24 03:50:58,262 INFO [train.py:996] (1/4) Epoch 6, batch 16750, loss[loss=0.2689, simple_loss=0.3632, pruned_loss=0.08731, over 21583.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3211, pruned_loss=0.08229, over 4260282.65 frames. 
], batch size: 414, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:51:13,356 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.841e+02 3.113e+02 3.878e+02 5.035e+02, threshold=6.225e+02, percent-clipped=0.0 2023-06-24 03:51:27,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1015398.0, ans=10.0 2023-06-24 03:51:40,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1015398.0, ans=0.0 2023-06-24 03:52:55,204 INFO [train.py:996] (1/4) Epoch 6, batch 16800, loss[loss=0.2934, simple_loss=0.3577, pruned_loss=0.1145, over 21615.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3253, pruned_loss=0.08262, over 4265000.24 frames. ], batch size: 471, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:53:08,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1015638.0, ans=0.125 2023-06-24 03:53:36,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1015698.0, ans=0.2 2023-06-24 03:53:36,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=12.0 2023-06-24 03:53:55,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1015758.0, ans=0.125 2023-06-24 03:53:57,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1015818.0, ans=0.125 2023-06-24 03:54:33,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1015878.0, ans=0.07 2023-06-24 03:54:44,503 INFO [train.py:996] (1/4) Epoch 6, batch 16850, loss[loss=0.2342, simple_loss=0.301, pruned_loss=0.08369, over 21892.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3211, pruned_loss=0.08181, over 4268781.98 frames. ], batch size: 414, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 03:54:53,526 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.780e+02 3.302e+02 4.313e+02 7.428e+02, threshold=6.605e+02, percent-clipped=4.0 2023-06-24 03:55:42,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1016058.0, ans=0.2 2023-06-24 03:56:24,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1016178.0, ans=0.125 2023-06-24 03:56:32,157 INFO [train.py:996] (1/4) Epoch 6, batch 16900, loss[loss=0.2006, simple_loss=0.2739, pruned_loss=0.06368, over 21566.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3143, pruned_loss=0.07957, over 4279367.51 frames. ], batch size: 389, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 03:56:32,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1016238.0, ans=0.125 2023-06-24 03:57:09,753 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.16 vs. limit=6.0 2023-06-24 03:57:12,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. 
limit=15.0 2023-06-24 03:57:17,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1016358.0, ans=0.125 2023-06-24 03:57:24,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1016358.0, ans=0.125 2023-06-24 03:57:27,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1016358.0, ans=0.125 2023-06-24 03:57:31,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1016358.0, ans=0.125 2023-06-24 03:58:19,451 INFO [train.py:996] (1/4) Epoch 6, batch 16950, loss[loss=0.2156, simple_loss=0.2811, pruned_loss=0.07506, over 21925.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3073, pruned_loss=0.07795, over 4285468.23 frames. ], batch size: 316, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 03:58:23,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1016538.0, ans=0.2 2023-06-24 03:58:29,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.437e+02 2.853e+02 3.182e+02 4.700e+02, threshold=5.707e+02, percent-clipped=0.0 2023-06-24 03:58:34,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-06-24 03:59:10,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1016658.0, ans=0.125 2023-06-24 03:59:35,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1016718.0, ans=0.2 2023-06-24 04:00:01,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1016778.0, ans=0.025 2023-06-24 04:00:03,693 INFO [train.py:996] (1/4) Epoch 6, batch 17000, loss[loss=0.2027, simple_loss=0.2653, pruned_loss=0.07003, over 21235.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3027, pruned_loss=0.07777, over 4285344.36 frames. ], batch size: 608, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:00:40,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1016898.0, ans=0.0 2023-06-24 04:00:47,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1016898.0, ans=0.125 2023-06-24 04:01:18,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1017018.0, ans=0.125 2023-06-24 04:01:50,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1017078.0, ans=0.0 2023-06-24 04:01:54,099 INFO [train.py:996] (1/4) Epoch 6, batch 17050, loss[loss=0.2719, simple_loss=0.3566, pruned_loss=0.09359, over 21837.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3113, pruned_loss=0.08097, over 4289596.27 frames. 
], batch size: 371, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:02:04,584 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 2.608e+02 3.012e+02 3.512e+02 5.895e+02, threshold=6.025e+02, percent-clipped=1.0 2023-06-24 04:02:19,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1017198.0, ans=0.0 2023-06-24 04:02:30,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-24 04:02:36,055 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-24 04:02:36,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017258.0, ans=0.1 2023-06-24 04:02:37,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0 2023-06-24 04:02:57,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-24 04:03:33,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-24 04:03:36,128 INFO [train.py:996] (1/4) Epoch 6, batch 17100, loss[loss=0.2207, simple_loss=0.295, pruned_loss=0.07318, over 21453.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3106, pruned_loss=0.08143, over 4289481.60 frames. ], batch size: 131, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:04:27,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-24 04:04:44,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1017618.0, ans=0.0 2023-06-24 04:05:01,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1017618.0, ans=0.125 2023-06-24 04:05:06,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017678.0, ans=0.1 2023-06-24 04:05:20,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=22.43 vs. limit=15.0 2023-06-24 04:05:22,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1017738.0, ans=0.125 2023-06-24 04:05:23,957 INFO [train.py:996] (1/4) Epoch 6, batch 17150, loss[loss=0.1834, simple_loss=0.266, pruned_loss=0.05043, over 21669.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3064, pruned_loss=0.08073, over 4284449.76 frames. 
], batch size: 230, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:05:44,958 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.638e+02 2.899e+02 3.354e+02 4.965e+02, threshold=5.799e+02, percent-clipped=0.0 2023-06-24 04:05:47,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1017738.0, ans=0.2 2023-06-24 04:05:52,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1017798.0, ans=0.09899494936611666 2023-06-24 04:06:01,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1017798.0, ans=0.125 2023-06-24 04:06:36,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1017918.0, ans=0.125 2023-06-24 04:06:37,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1017918.0, ans=0.1 2023-06-24 04:06:43,031 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:07:17,730 INFO [train.py:996] (1/4) Epoch 6, batch 17200, loss[loss=0.2182, simple_loss=0.2935, pruned_loss=0.07149, over 21813.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3054, pruned_loss=0.0799, over 4284855.94 frames. ], batch size: 247, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 04:07:50,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1018098.0, ans=0.09899494936611666 2023-06-24 04:07:52,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1018098.0, ans=0.0 2023-06-24 04:08:11,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1018158.0, ans=0.2 2023-06-24 04:08:20,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1018158.0, ans=0.2 2023-06-24 04:09:12,593 INFO [train.py:996] (1/4) Epoch 6, batch 17250, loss[loss=0.2521, simple_loss=0.3409, pruned_loss=0.08164, over 21348.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3093, pruned_loss=0.08207, over 4282857.19 frames. ], batch size: 176, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:09:25,160 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.699e+02 3.105e+02 3.621e+02 5.993e+02, threshold=6.210e+02, percent-clipped=1.0 2023-06-24 04:09:40,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1018398.0, ans=0.125 2023-06-24 04:10:06,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1018458.0, ans=0.125 2023-06-24 04:10:11,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1018458.0, ans=0.125 2023-06-24 04:10:59,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.93 vs. limit=6.0 2023-06-24 04:11:01,903 INFO [train.py:996] (1/4) Epoch 6, batch 17300, loss[loss=0.245, simple_loss=0.3213, pruned_loss=0.0843, over 21742.00 frames. 
], tot_loss[loss=0.2436, simple_loss=0.3179, pruned_loss=0.08462, over 4282317.93 frames. ], batch size: 298, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:11:40,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-24 04:11:56,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.95 vs. limit=15.0 2023-06-24 04:12:12,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1018818.0, ans=0.1 2023-06-24 04:12:22,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-24 04:12:40,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1018878.0, ans=0.0 2023-06-24 04:12:58,384 INFO [train.py:996] (1/4) Epoch 6, batch 17350, loss[loss=0.2063, simple_loss=0.2966, pruned_loss=0.05803, over 21804.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3186, pruned_loss=0.0846, over 4282734.05 frames. ], batch size: 316, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:13:16,107 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.820e+02 3.152e+02 3.644e+02 6.101e+02, threshold=6.303e+02, percent-clipped=0.0 2023-06-24 04:13:20,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1018998.0, ans=0.125 2023-06-24 04:14:00,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1019058.0, ans=0.0 2023-06-24 04:14:04,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1019058.0, ans=0.125 2023-06-24 04:14:10,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1019118.0, ans=0.125 2023-06-24 04:14:30,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1019178.0, ans=0.125 2023-06-24 04:14:46,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1019178.0, ans=0.0 2023-06-24 04:14:46,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1019178.0, ans=0.2 2023-06-24 04:14:54,378 INFO [train.py:996] (1/4) Epoch 6, batch 17400, loss[loss=0.1818, simple_loss=0.2347, pruned_loss=0.0645, over 21227.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3129, pruned_loss=0.08061, over 4275883.75 frames. ], batch size: 143, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:15:09,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1019238.0, ans=0.125 2023-06-24 04:15:10,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1019298.0, ans=0.0 2023-06-24 04:16:44,084 INFO [train.py:996] (1/4) Epoch 6, batch 17450, loss[loss=0.2234, simple_loss=0.2628, pruned_loss=0.09199, over 20008.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3087, pruned_loss=0.0785, over 4267860.30 frames. 
], batch size: 704, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:16:58,155 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.361e+02 2.755e+02 3.366e+02 5.958e+02, threshold=5.511e+02, percent-clipped=0.0 2023-06-24 04:17:46,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1019718.0, ans=0.125 2023-06-24 04:17:54,959 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.27 vs. limit=22.5 2023-06-24 04:18:09,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1019718.0, ans=0.95 2023-06-24 04:18:30,644 INFO [train.py:996] (1/4) Epoch 6, batch 17500, loss[loss=0.2184, simple_loss=0.289, pruned_loss=0.07389, over 21913.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3034, pruned_loss=0.07582, over 4271317.88 frames. ], batch size: 316, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:18:36,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1019838.0, ans=0.125 2023-06-24 04:19:11,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.34 vs. limit=6.0 2023-06-24 04:19:27,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019958.0, ans=0.1 2023-06-24 04:19:41,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1020018.0, ans=0.1 2023-06-24 04:19:58,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1020078.0, ans=0.125 2023-06-24 04:20:15,383 INFO [train.py:996] (1/4) Epoch 6, batch 17550, loss[loss=0.2229, simple_loss=0.3149, pruned_loss=0.06542, over 21387.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3034, pruned_loss=0.07459, over 4259334.18 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:20:25,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1020138.0, ans=0.125 2023-06-24 04:20:28,780 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.219e+02 2.535e+02 2.795e+02 4.245e+02, threshold=5.070e+02, percent-clipped=0.0 2023-06-24 04:21:58,675 INFO [train.py:996] (1/4) Epoch 6, batch 17600, loss[loss=0.2355, simple_loss=0.3161, pruned_loss=0.0774, over 21467.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3074, pruned_loss=0.07487, over 4244453.23 frames. ], batch size: 194, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:22:04,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1020438.0, ans=0.125 2023-06-24 04:22:36,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1020498.0, ans=0.125 2023-06-24 04:23:04,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1020558.0, ans=0.125 2023-06-24 04:23:48,367 INFO [train.py:996] (1/4) Epoch 6, batch 17650, loss[loss=0.2213, simple_loss=0.301, pruned_loss=0.07082, over 21558.00 frames. 
], tot_loss[loss=0.2283, simple_loss=0.3059, pruned_loss=0.07533, over 4251760.04 frames. ], batch size: 441, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:24:13,255 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.481e+02 3.096e+02 4.210e+02 8.151e+02, threshold=6.192e+02, percent-clipped=13.0 2023-06-24 04:24:17,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.56 vs. limit=10.0 2023-06-24 04:24:30,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1020798.0, ans=0.125 2023-06-24 04:24:33,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-24 04:25:11,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.69 vs. limit=15.0 2023-06-24 04:25:15,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1020918.0, ans=0.125 2023-06-24 04:25:16,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1020918.0, ans=0.125 2023-06-24 04:25:27,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1020978.0, ans=0.125 2023-06-24 04:25:42,587 INFO [train.py:996] (1/4) Epoch 6, batch 17700, loss[loss=0.2295, simple_loss=0.2953, pruned_loss=0.08182, over 19974.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3008, pruned_loss=0.07321, over 4244064.10 frames. ], batch size: 703, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:25:55,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021038.0, ans=0.1 2023-06-24 04:26:00,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1021098.0, ans=0.125 2023-06-24 04:26:04,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1021098.0, ans=0.125 2023-06-24 04:26:25,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1021098.0, ans=0.0 2023-06-24 04:26:51,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1021218.0, ans=0.0 2023-06-24 04:27:25,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.97 vs. limit=12.0 2023-06-24 04:27:30,941 INFO [train.py:996] (1/4) Epoch 6, batch 17750, loss[loss=0.2773, simple_loss=0.3599, pruned_loss=0.09737, over 21834.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3077, pruned_loss=0.07655, over 4249866.26 frames. 
], batch size: 124, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:27:31,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1021338.0, ans=0.125 2023-06-24 04:27:44,840 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.598e+02 3.053e+02 3.567e+02 5.587e+02, threshold=6.107e+02, percent-clipped=0.0 2023-06-24 04:27:48,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1021398.0, ans=0.0 2023-06-24 04:28:39,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1021518.0, ans=0.125 2023-06-24 04:29:15,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1021578.0, ans=0.1 2023-06-24 04:29:20,577 INFO [train.py:996] (1/4) Epoch 6, batch 17800, loss[loss=0.2537, simple_loss=0.3252, pruned_loss=0.09106, over 21453.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3069, pruned_loss=0.07542, over 4259815.75 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:29:46,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2023-06-24 04:30:02,917 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:30:50,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=8.0 2023-06-24 04:30:52,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1021878.0, ans=0.125 2023-06-24 04:31:20,140 INFO [train.py:996] (1/4) Epoch 6, batch 17850, loss[loss=0.2256, simple_loss=0.3002, pruned_loss=0.07551, over 21473.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3073, pruned_loss=0.07577, over 4261959.41 frames. ], batch size: 211, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:31:31,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-24 04:31:35,787 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.586e+02 3.040e+02 3.727e+02 6.886e+02, threshold=6.079e+02, percent-clipped=3.0 2023-06-24 04:32:05,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1022058.0, ans=0.1 2023-06-24 04:32:06,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1022058.0, ans=0.125 2023-06-24 04:32:20,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1022118.0, ans=0.125 2023-06-24 04:33:10,773 INFO [train.py:996] (1/4) Epoch 6, batch 17900, loss[loss=0.274, simple_loss=0.3628, pruned_loss=0.09262, over 21716.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3124, pruned_loss=0.07759, over 4267469.97 frames. 
], batch size: 441, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:33:11,294 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:33:18,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1022238.0, ans=0.125 2023-06-24 04:33:25,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1022238.0, ans=0.125 2023-06-24 04:33:43,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1022298.0, ans=0.125 2023-06-24 04:33:47,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1022358.0, ans=0.0 2023-06-24 04:33:51,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1022358.0, ans=0.125 2023-06-24 04:34:30,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1022418.0, ans=0.09899494936611666 2023-06-24 04:34:58,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1022478.0, ans=0.125 2023-06-24 04:35:01,042 INFO [train.py:996] (1/4) Epoch 6, batch 17950, loss[loss=0.2436, simple_loss=0.3351, pruned_loss=0.07604, over 21496.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3115, pruned_loss=0.07418, over 4271485.68 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:35:16,326 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 2.348e+02 2.616e+02 3.044e+02 5.736e+02, threshold=5.233e+02, percent-clipped=0.0 2023-06-24 04:36:47,685 INFO [train.py:996] (1/4) Epoch 6, batch 18000, loss[loss=0.215, simple_loss=0.281, pruned_loss=0.07453, over 21811.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3042, pruned_loss=0.0725, over 4271449.93 frames. ], batch size: 98, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:36:47,685 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 04:37:04,090 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.9235, 3.9205, 2.1216, 3.5021], device='cuda:1') 2023-06-24 04:37:05,817 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2648, simple_loss=0.3617, pruned_loss=0.08394, over 1796401.00 frames. 2023-06-24 04:37:05,818 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23439MB 2023-06-24 04:37:52,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1022958.0, ans=0.1 2023-06-24 04:38:55,651 INFO [train.py:996] (1/4) Epoch 6, batch 18050, loss[loss=0.2584, simple_loss=0.3335, pruned_loss=0.09168, over 21789.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2989, pruned_loss=0.07247, over 4262547.94 frames. 
], batch size: 124, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:39:22,851 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.462e+02 2.761e+02 3.558e+02 5.314e+02, threshold=5.521e+02, percent-clipped=1.0 2023-06-24 04:39:25,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1023198.0, ans=0.2 2023-06-24 04:39:53,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1023258.0, ans=0.125 2023-06-24 04:40:03,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1023258.0, ans=0.125 2023-06-24 04:40:22,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1023318.0, ans=0.1 2023-06-24 04:40:23,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-24 04:40:25,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1023318.0, ans=0.0 2023-06-24 04:40:46,675 INFO [train.py:996] (1/4) Epoch 6, batch 18100, loss[loss=0.2772, simple_loss=0.3537, pruned_loss=0.1004, over 21506.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.305, pruned_loss=0.07502, over 4263843.46 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:40:56,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-24 04:41:07,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1023438.0, ans=0.125 2023-06-24 04:41:27,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1023498.0, ans=0.1 2023-06-24 04:42:21,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1023678.0, ans=0.125 2023-06-24 04:42:42,388 INFO [train.py:996] (1/4) Epoch 6, batch 18150, loss[loss=0.2196, simple_loss=0.2885, pruned_loss=0.07535, over 21482.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3064, pruned_loss=0.07459, over 4269046.79 frames. ], batch size: 195, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:42:57,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1023738.0, ans=0.125 2023-06-24 04:42:57,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1023738.0, ans=0.2 2023-06-24 04:43:02,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1023738.0, ans=15.0 2023-06-24 04:43:02,497 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.411e+02 2.816e+02 3.524e+02 6.086e+02, threshold=5.632e+02, percent-clipped=3.0 2023-06-24 04:44:20,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1023978.0, ans=12.0 2023-06-24 04:44:22,764 INFO [train.py:996] (1/4) Epoch 6, batch 18200, loss[loss=0.1869, simple_loss=0.2639, pruned_loss=0.055, over 21827.00 frames. 
], tot_loss[loss=0.2245, simple_loss=0.3001, pruned_loss=0.07448, over 4262694.49 frames. ], batch size: 112, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:44:28,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1024038.0, ans=0.0 2023-06-24 04:44:29,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1024038.0, ans=0.125 2023-06-24 04:44:54,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1024098.0, ans=0.0 2023-06-24 04:45:35,402 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-24 04:45:48,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1024218.0, ans=0.125 2023-06-24 04:46:07,511 INFO [train.py:996] (1/4) Epoch 6, batch 18250, loss[loss=0.1756, simple_loss=0.2487, pruned_loss=0.05122, over 21873.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2931, pruned_loss=0.07238, over 4267064.71 frames. ], batch size: 98, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:46:08,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1024338.0, ans=0.0 2023-06-24 04:46:23,371 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.233e+02 2.540e+02 3.083e+02 5.311e+02, threshold=5.080e+02, percent-clipped=0.0 2023-06-24 04:46:24,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1024398.0, ans=0.0 2023-06-24 04:46:25,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1024398.0, ans=0.125 2023-06-24 04:46:37,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1024398.0, ans=0.125 2023-06-24 04:47:19,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1024518.0, ans=0.125 2023-06-24 04:47:43,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1024578.0, ans=0.0 2023-06-24 04:47:54,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1024578.0, ans=0.125 2023-06-24 04:47:57,593 INFO [train.py:996] (1/4) Epoch 6, batch 18300, loss[loss=0.2369, simple_loss=0.3306, pruned_loss=0.07164, over 21795.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2928, pruned_loss=0.07261, over 4265778.64 frames. ], batch size: 298, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:49:01,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1024758.0, ans=0.0 2023-06-24 04:49:06,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-24 04:49:31,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. 
limit=15.0 2023-06-24 04:49:44,703 INFO [train.py:996] (1/4) Epoch 6, batch 18350, loss[loss=0.2494, simple_loss=0.3559, pruned_loss=0.07144, over 21666.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2986, pruned_loss=0.07272, over 4248417.43 frames. ], batch size: 263, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:49:48,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1024938.0, ans=0.0 2023-06-24 04:49:53,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1024938.0, ans=0.125 2023-06-24 04:49:56,154 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-24 04:50:00,360 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.650e+02 3.163e+02 4.128e+02 7.474e+02, threshold=6.326e+02, percent-clipped=9.0 2023-06-24 04:50:02,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1024998.0, ans=0.2 2023-06-24 04:50:25,516 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:51:34,315 INFO [train.py:996] (1/4) Epoch 6, batch 18400, loss[loss=0.2036, simple_loss=0.2681, pruned_loss=0.06955, over 21493.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2934, pruned_loss=0.07151, over 4240472.17 frames. ], batch size: 212, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:52:00,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-24 04:52:16,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1025298.0, ans=0.125 2023-06-24 04:52:18,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=12.0 2023-06-24 04:52:20,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-24 04:52:30,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2023-06-24 04:53:17,796 INFO [train.py:996] (1/4) Epoch 6, batch 18450, loss[loss=0.2695, simple_loss=0.3865, pruned_loss=0.07627, over 19891.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2909, pruned_loss=0.06768, over 4247094.26 frames. 
], batch size: 702, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:53:18,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1025538.0, ans=0.2 2023-06-24 04:53:33,301 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.125e+02 2.326e+02 2.659e+02 4.995e+02, threshold=4.653e+02, percent-clipped=0.0 2023-06-24 04:54:20,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1025658.0, ans=0.125 2023-06-24 04:54:24,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1025658.0, ans=0.125 2023-06-24 04:54:41,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1025718.0, ans=0.2 2023-06-24 04:54:50,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-24 04:55:06,495 INFO [train.py:996] (1/4) Epoch 6, batch 18500, loss[loss=0.2796, simple_loss=0.3464, pruned_loss=0.1064, over 21477.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2877, pruned_loss=0.06724, over 4227706.15 frames. ], batch size: 508, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:55:32,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1025898.0, ans=0.95 2023-06-24 04:56:17,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1026018.0, ans=0.125 2023-06-24 04:56:34,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1026078.0, ans=0.125 2023-06-24 04:56:52,798 INFO [train.py:996] (1/4) Epoch 6, batch 18550, loss[loss=0.2128, simple_loss=0.2754, pruned_loss=0.07505, over 21719.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2841, pruned_loss=0.06638, over 4220641.95 frames. ], batch size: 316, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:56:58,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026138.0, ans=0.1 2023-06-24 04:57:10,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.481e+02 2.781e+02 3.235e+02 5.250e+02, threshold=5.562e+02, percent-clipped=2.0 2023-06-24 04:57:29,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1026198.0, ans=0.0 2023-06-24 04:57:36,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1026258.0, ans=0.07 2023-06-24 04:57:56,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.34 vs. 
limit=15.0 2023-06-24 04:58:00,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1026318.0, ans=10.0 2023-06-24 04:58:07,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1026318.0, ans=0.125 2023-06-24 04:58:21,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1026318.0, ans=0.2 2023-06-24 04:58:41,456 INFO [train.py:996] (1/4) Epoch 6, batch 18600, loss[loss=0.1732, simple_loss=0.2467, pruned_loss=0.04988, over 21336.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2836, pruned_loss=0.06675, over 4220358.52 frames. ], batch size: 131, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:59:17,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-24 04:59:51,556 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-24 05:00:00,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1026618.0, ans=0.0 2023-06-24 05:00:07,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1026618.0, ans=0.0 2023-06-24 05:00:29,909 INFO [train.py:996] (1/4) Epoch 6, batch 18650, loss[loss=0.2, simple_loss=0.2681, pruned_loss=0.06593, over 21782.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2833, pruned_loss=0.06692, over 4227905.56 frames. ], batch size: 107, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:00:31,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0 2023-06-24 05:00:46,767 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.410e+02 2.665e+02 3.233e+02 6.336e+02, threshold=5.330e+02, percent-clipped=1.0 2023-06-24 05:00:59,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-24 05:01:11,190 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.21 vs. 
limit=15.0 2023-06-24 05:01:12,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1026798.0, ans=0.2 2023-06-24 05:01:36,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1026918.0, ans=0.125 2023-06-24 05:01:54,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026918.0, ans=0.1 2023-06-24 05:02:01,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1026978.0, ans=0.125 2023-06-24 05:02:03,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1026978.0, ans=0.5 2023-06-24 05:02:08,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1026978.0, ans=0.2 2023-06-24 05:02:16,608 INFO [train.py:996] (1/4) Epoch 6, batch 18700, loss[loss=0.202, simple_loss=0.2545, pruned_loss=0.0747, over 21136.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2809, pruned_loss=0.06848, over 4235082.04 frames. ], batch size: 608, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:02:41,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-06-24 05:04:03,121 INFO [train.py:996] (1/4) Epoch 6, batch 18750, loss[loss=0.2131, simple_loss=0.2855, pruned_loss=0.07035, over 21294.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2819, pruned_loss=0.07043, over 4248540.05 frames. ], batch size: 159, lr: 5.05e-03, grad_scale: 8.0 2023-06-24 05:04:22,089 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.422e+02 2.735e+02 3.202e+02 4.733e+02, threshold=5.471e+02, percent-clipped=0.0 2023-06-24 05:04:53,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1027458.0, ans=0.125 2023-06-24 05:05:07,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1027458.0, ans=0.05 2023-06-24 05:05:11,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1027518.0, ans=0.125 2023-06-24 05:05:32,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027578.0, ans=0.1 2023-06-24 05:05:50,766 INFO [train.py:996] (1/4) Epoch 6, batch 18800, loss[loss=0.1534, simple_loss=0.225, pruned_loss=0.04087, over 16547.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2886, pruned_loss=0.07161, over 4240407.46 frames. ], batch size: 60, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:06:06,964 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:06:11,154 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.41 vs. limit=15.0 2023-06-24 05:06:24,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.72 vs. 
limit=22.5 2023-06-24 05:06:34,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027758.0, ans=0.1 2023-06-24 05:07:21,632 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:07:30,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1027878.0, ans=0.125 2023-06-24 05:07:38,449 INFO [train.py:996] (1/4) Epoch 6, batch 18850, loss[loss=0.227, simple_loss=0.2734, pruned_loss=0.09032, over 20345.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2838, pruned_loss=0.06777, over 4244316.03 frames. ], batch size: 703, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:07:57,082 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 2.192e+02 2.570e+02 2.921e+02 4.536e+02, threshold=5.140e+02, percent-clipped=0.0 2023-06-24 05:08:09,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1027998.0, ans=0.125 2023-06-24 05:09:26,023 INFO [train.py:996] (1/4) Epoch 6, batch 18900, loss[loss=0.186, simple_loss=0.2508, pruned_loss=0.06057, over 20742.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2817, pruned_loss=0.06806, over 4241290.15 frames. ], batch size: 608, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:09:28,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1028238.0, ans=0.125 2023-06-24 05:09:31,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1028238.0, ans=0.0 2023-06-24 05:09:40,901 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-24 05:09:42,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1028298.0, ans=0.2 2023-06-24 05:09:48,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1028298.0, ans=0.125 2023-06-24 05:10:50,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1028418.0, ans=0.015 2023-06-24 05:10:53,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1028418.0, ans=0.125 2023-06-24 05:11:01,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1028478.0, ans=0.125 2023-06-24 05:11:14,537 INFO [train.py:996] (1/4) Epoch 6, batch 18950, loss[loss=0.2114, simple_loss=0.2913, pruned_loss=0.06578, over 21668.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2829, pruned_loss=0.07002, over 4250706.82 frames. ], batch size: 263, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:11:39,573 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.683e+02 3.004e+02 3.629e+02 6.368e+02, threshold=6.008e+02, percent-clipped=2.0 2023-06-24 05:12:34,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1028718.0, ans=0.0 2023-06-24 05:13:05,428 INFO [train.py:996] (1/4) Epoch 6, batch 19000, loss[loss=0.2337, simple_loss=0.3111, pruned_loss=0.07817, over 20740.00 frames. 
], tot_loss[loss=0.2194, simple_loss=0.295, pruned_loss=0.07188, over 4262684.01 frames. ], batch size: 609, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:13:09,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-24 05:14:28,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=1029018.0, ans=12.0 2023-06-24 05:14:49,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1029078.0, ans=0.0 2023-06-24 05:14:54,017 INFO [train.py:996] (1/4) Epoch 6, batch 19050, loss[loss=0.2448, simple_loss=0.3008, pruned_loss=0.09441, over 21316.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3, pruned_loss=0.07606, over 4272174.50 frames. ], batch size: 176, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:15:10,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1029138.0, ans=0.125 2023-06-24 05:15:19,303 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.839e+02 3.291e+02 3.950e+02 6.159e+02, threshold=6.582e+02, percent-clipped=1.0 2023-06-24 05:15:44,819 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:15:48,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=22.5 2023-06-24 05:16:00,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1029258.0, ans=0.125 2023-06-24 05:16:30,794 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:16:38,769 INFO [train.py:996] (1/4) Epoch 6, batch 19100, loss[loss=0.2038, simple_loss=0.2664, pruned_loss=0.07064, over 21223.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2969, pruned_loss=0.0761, over 4273836.20 frames. ], batch size: 143, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:17:13,190 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-24 05:18:00,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1029618.0, ans=0.2 2023-06-24 05:18:35,873 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-24 05:18:35,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-24 05:18:36,314 INFO [train.py:996] (1/4) Epoch 6, batch 19150, loss[loss=0.2533, simple_loss=0.3433, pruned_loss=0.08167, over 21782.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2975, pruned_loss=0.07623, over 4262864.64 frames. 
], batch size: 282, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:19:12,549 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.501e+02 2.737e+02 3.196e+02 5.229e+02, threshold=5.475e+02, percent-clipped=0.0 2023-06-24 05:19:15,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-24 05:20:37,953 INFO [train.py:996] (1/4) Epoch 6, batch 19200, loss[loss=0.3419, simple_loss=0.4208, pruned_loss=0.1314, over 21496.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3108, pruned_loss=0.07833, over 4268932.78 frames. ], batch size: 507, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:20:41,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1030038.0, ans=0.125 2023-06-24 05:20:57,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030038.0, ans=0.1 2023-06-24 05:21:21,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1030158.0, ans=0.125 2023-06-24 05:21:27,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-24 05:21:30,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1030158.0, ans=0.0 2023-06-24 05:21:33,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030218.0, ans=0.1 2023-06-24 05:22:19,328 INFO [train.py:996] (1/4) Epoch 6, batch 19250, loss[loss=0.1953, simple_loss=0.2842, pruned_loss=0.05319, over 21860.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3097, pruned_loss=0.0739, over 4263085.64 frames. ], batch size: 316, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:22:50,489 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 2.125e+02 2.467e+02 2.912e+02 4.275e+02, threshold=4.933e+02, percent-clipped=0.0 2023-06-24 05:22:56,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1030398.0, ans=0.2 2023-06-24 05:23:10,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1030458.0, ans=0.0 2023-06-24 05:23:14,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1030458.0, ans=0.0 2023-06-24 05:24:00,294 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:24:12,794 INFO [train.py:996] (1/4) Epoch 6, batch 19300, loss[loss=0.2465, simple_loss=0.3104, pruned_loss=0.09134, over 21436.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3061, pruned_loss=0.07342, over 4271729.46 frames. ], batch size: 144, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:24:32,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1030638.0, ans=0.09899494936611666 2023-06-24 05:24:43,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.07 vs. 
limit=15.0 2023-06-24 05:25:14,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1030818.0, ans=0.125 2023-06-24 05:26:02,606 INFO [train.py:996] (1/4) Epoch 6, batch 19350, loss[loss=0.2545, simple_loss=0.3639, pruned_loss=0.07251, over 19813.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2994, pruned_loss=0.06895, over 4270320.40 frames. ], batch size: 703, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:26:28,593 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.277e+02 2.629e+02 3.333e+02 6.338e+02, threshold=5.259e+02, percent-clipped=7.0 2023-06-24 05:26:40,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1031058.0, ans=0.125 2023-06-24 05:27:06,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1031118.0, ans=0.0 2023-06-24 05:27:19,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1031118.0, ans=0.1 2023-06-24 05:27:34,527 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:27:50,210 INFO [train.py:996] (1/4) Epoch 6, batch 19400, loss[loss=0.2022, simple_loss=0.2851, pruned_loss=0.05966, over 21906.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2954, pruned_loss=0.0674, over 4273790.10 frames. ], batch size: 333, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:28:09,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-24 05:28:15,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1031298.0, ans=0.1 2023-06-24 05:28:24,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1031298.0, ans=0.0 2023-06-24 05:28:24,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1031298.0, ans=0.2 2023-06-24 05:29:01,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1031418.0, ans=0.125 2023-06-24 05:29:05,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-24 05:29:44,552 INFO [train.py:996] (1/4) Epoch 6, batch 19450, loss[loss=0.2064, simple_loss=0.2623, pruned_loss=0.07531, over 21578.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.294, pruned_loss=0.06953, over 4284367.53 frames. 
], batch size: 247, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:30:05,471 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.505e+02 2.907e+02 3.403e+02 7.011e+02, threshold=5.814e+02, percent-clipped=3.0 2023-06-24 05:30:11,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1031598.0, ans=0.125 2023-06-24 05:30:16,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1031598.0, ans=0.2 2023-06-24 05:31:08,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1031778.0, ans=0.125 2023-06-24 05:31:19,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1031778.0, ans=0.015 2023-06-24 05:31:29,138 INFO [train.py:996] (1/4) Epoch 6, batch 19500, loss[loss=0.1968, simple_loss=0.265, pruned_loss=0.06432, over 21510.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2905, pruned_loss=0.07095, over 4272902.35 frames. ], batch size: 230, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:31:51,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1031898.0, ans=0.0 2023-06-24 05:33:17,766 INFO [train.py:996] (1/4) Epoch 6, batch 19550, loss[loss=0.1876, simple_loss=0.2806, pruned_loss=0.04729, over 21628.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2887, pruned_loss=0.06941, over 4275287.88 frames. ], batch size: 230, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:33:23,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1032138.0, ans=0.125 2023-06-24 05:33:37,931 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.721e+02 3.147e+02 3.714e+02 5.540e+02, threshold=6.293e+02, percent-clipped=0.0 2023-06-24 05:34:01,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1032258.0, ans=0.125 2023-06-24 05:34:20,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1032318.0, ans=0.125 2023-06-24 05:34:22,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-24 05:34:38,305 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-24 05:34:45,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.28 vs. limit=22.5 2023-06-24 05:34:45,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=1032378.0, ans=15.0 2023-06-24 05:34:57,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1032378.0, ans=0.0 2023-06-24 05:35:04,138 INFO [train.py:996] (1/4) Epoch 6, batch 19600, loss[loss=0.2276, simple_loss=0.2916, pruned_loss=0.08179, over 21522.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2896, pruned_loss=0.06991, over 4284626.05 frames. 
], batch size: 211, lr: 5.03e-03, grad_scale: 32.0 2023-06-24 05:35:18,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1032438.0, ans=0.2 2023-06-24 05:35:25,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1032498.0, ans=0.2 2023-06-24 05:36:20,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1032618.0, ans=0.0 2023-06-24 05:36:39,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-24 05:36:53,249 INFO [train.py:996] (1/4) Epoch 6, batch 19650, loss[loss=0.2365, simple_loss=0.302, pruned_loss=0.08551, over 21899.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2952, pruned_loss=0.0746, over 4290679.21 frames. ], batch size: 316, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:36:59,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032738.0, ans=0.1 2023-06-24 05:37:16,213 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.599e+02 2.881e+02 3.237e+02 5.731e+02, threshold=5.762e+02, percent-clipped=0.0 2023-06-24 05:37:29,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=12.0 2023-06-24 05:37:36,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1032858.0, ans=0.2 2023-06-24 05:38:25,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-24 05:38:27,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1032978.0, ans=0.125 2023-06-24 05:38:27,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1032978.0, ans=0.125 2023-06-24 05:38:45,141 INFO [train.py:996] (1/4) Epoch 6, batch 19700, loss[loss=0.2136, simple_loss=0.3195, pruned_loss=0.05383, over 20793.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3001, pruned_loss=0.07555, over 4285602.64 frames. ], batch size: 608, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:40:35,421 INFO [train.py:996] (1/4) Epoch 6, batch 19750, loss[loss=0.2381, simple_loss=0.3225, pruned_loss=0.07686, over 21648.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3101, pruned_loss=0.07768, over 4282005.89 frames. ], batch size: 263, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:40:38,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=1033338.0, ans=15.0 2023-06-24 05:40:55,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1033338.0, ans=0.1 2023-06-24 05:41:09,212 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.723e+02 3.338e+02 4.190e+02 5.879e+02, threshold=6.676e+02, percent-clipped=1.0 2023-06-24 05:42:06,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. 
limit=22.5 2023-06-24 05:42:10,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1033578.0, ans=0.125 2023-06-24 05:42:16,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1033578.0, ans=0.0 2023-06-24 05:42:18,693 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0 2023-06-24 05:42:22,729 INFO [train.py:996] (1/4) Epoch 6, batch 19800, loss[loss=0.1739, simple_loss=0.2485, pruned_loss=0.04966, over 21551.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3089, pruned_loss=0.07818, over 4288688.44 frames. ], batch size: 212, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:43:12,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1033758.0, ans=0.0 2023-06-24 05:43:28,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1033758.0, ans=0.125 2023-06-24 05:43:38,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1033818.0, ans=0.0 2023-06-24 05:43:46,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1033818.0, ans=0.0 2023-06-24 05:43:46,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1033818.0, ans=0.125 2023-06-24 05:43:50,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1033818.0, ans=0.0 2023-06-24 05:44:16,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1033938.0, ans=0.0 2023-06-24 05:44:17,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-06-24 05:44:17,978 INFO [train.py:996] (1/4) Epoch 6, batch 19850, loss[loss=0.1683, simple_loss=0.2379, pruned_loss=0.04935, over 21302.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3013, pruned_loss=0.07274, over 4288808.80 frames. 
], batch size: 131, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:44:30,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1033938.0, ans=0.125 2023-06-24 05:44:52,065 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.320e+02 2.642e+02 2.979e+02 5.130e+02, threshold=5.285e+02, percent-clipped=0.0 2023-06-24 05:45:08,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1034058.0, ans=0.1 2023-06-24 05:45:11,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1034058.0, ans=0.1 2023-06-24 05:45:18,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1034058.0, ans=0.2 2023-06-24 05:45:43,407 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:45:44,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1034178.0, ans=0.0 2023-06-24 05:45:48,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1034178.0, ans=0.1 2023-06-24 05:46:03,640 INFO [train.py:996] (1/4) Epoch 6, batch 19900, loss[loss=0.2033, simple_loss=0.2758, pruned_loss=0.06544, over 21796.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3, pruned_loss=0.07004, over 4288176.61 frames. ], batch size: 371, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:47:07,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1034358.0, ans=0.125 2023-06-24 05:47:09,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1034358.0, ans=0.125 2023-06-24 05:47:58,272 INFO [train.py:996] (1/4) Epoch 6, batch 19950, loss[loss=0.2301, simple_loss=0.2865, pruned_loss=0.08686, over 21845.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2957, pruned_loss=0.07023, over 4282084.54 frames. ], batch size: 107, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:48:00,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1034538.0, ans=0.125 2023-06-24 05:48:26,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-24 05:48:33,579 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.312e+02 2.767e+02 3.263e+02 6.271e+02, threshold=5.533e+02, percent-clipped=3.0 2023-06-24 05:48:55,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1034658.0, ans=0.125 2023-06-24 05:49:02,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1034718.0, ans=0.0 2023-06-24 05:49:12,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1034718.0, ans=0.125 2023-06-24 05:49:25,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.15 vs. 
limit=15.0 2023-06-24 05:49:46,344 INFO [train.py:996] (1/4) Epoch 6, batch 20000, loss[loss=0.2099, simple_loss=0.2944, pruned_loss=0.06271, over 21809.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2969, pruned_loss=0.07087, over 4289476.00 frames. ], batch size: 298, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:50:32,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-24 05:50:52,034 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.28 vs. limit=10.0 2023-06-24 05:51:22,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1035078.0, ans=0.125 2023-06-24 05:51:33,366 INFO [train.py:996] (1/4) Epoch 6, batch 20050, loss[loss=0.2181, simple_loss=0.2919, pruned_loss=0.07211, over 21903.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2972, pruned_loss=0.07243, over 4290262.48 frames. ], batch size: 316, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:51:33,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1035138.0, ans=0.125 2023-06-24 05:51:36,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1035138.0, ans=0.025 2023-06-24 05:52:08,121 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.657e+02 2.915e+02 3.243e+02 4.793e+02, threshold=5.831e+02, percent-clipped=0.0 2023-06-24 05:52:15,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1035198.0, ans=0.125 2023-06-24 05:52:22,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1035258.0, ans=0.125 2023-06-24 05:52:24,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1035258.0, ans=0.125 2023-06-24 05:52:26,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1035258.0, ans=0.125 2023-06-24 05:52:41,188 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-24 05:52:48,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1035318.0, ans=0.0 2023-06-24 05:53:23,206 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-06-24 05:53:23,689 INFO [train.py:996] (1/4) Epoch 6, batch 20100, loss[loss=0.2375, simple_loss=0.3247, pruned_loss=0.07515, over 21686.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2994, pruned_loss=0.07416, over 4296462.25 frames. ], batch size: 263, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:54:54,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1035678.0, ans=0.125 2023-06-24 05:55:20,082 INFO [train.py:996] (1/4) Epoch 6, batch 20150, loss[loss=0.2746, simple_loss=0.3518, pruned_loss=0.09865, over 21793.00 frames. 
], tot_loss[loss=0.2321, simple_loss=0.3091, pruned_loss=0.07759, over 4294755.08 frames. ], batch size: 118, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:55:35,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1035738.0, ans=0.125 2023-06-24 05:55:37,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1035798.0, ans=0.0 2023-06-24 05:55:46,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 2.880e+02 3.455e+02 4.017e+02 7.640e+02, threshold=6.911e+02, percent-clipped=4.0 2023-06-24 05:56:40,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1035918.0, ans=0.125 2023-06-24 05:56:56,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.62 vs. limit=15.0 2023-06-24 05:57:08,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-24 05:57:12,790 INFO [train.py:996] (1/4) Epoch 6, batch 20200, loss[loss=0.2354, simple_loss=0.3357, pruned_loss=0.0675, over 21807.00 frames. ], tot_loss[loss=0.238, simple_loss=0.315, pruned_loss=0.08054, over 4281913.47 frames. ], batch size: 316, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 05:57:20,440 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:57:38,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1036098.0, ans=0.125 2023-06-24 05:58:09,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1036158.0, ans=0.0 2023-06-24 05:58:19,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1036218.0, ans=0.2 2023-06-24 05:59:01,882 INFO [train.py:996] (1/4) Epoch 6, batch 20250, loss[loss=0.2361, simple_loss=0.3079, pruned_loss=0.08213, over 21756.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3153, pruned_loss=0.07945, over 4280589.80 frames. ], batch size: 112, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 05:59:26,777 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.473e+02 2.856e+02 3.579e+02 8.091e+02, threshold=5.711e+02, percent-clipped=1.0 2023-06-24 06:00:44,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1036578.0, ans=10.0 2023-06-24 06:00:49,970 INFO [train.py:996] (1/4) Epoch 6, batch 20300, loss[loss=0.2228, simple_loss=0.3052, pruned_loss=0.07015, over 21626.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3132, pruned_loss=0.07654, over 4266961.91 frames. 
], batch size: 263, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:01:43,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1036758.0, ans=0.125 2023-06-24 06:02:13,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1036818.0, ans=0.125 2023-06-24 06:02:17,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.61 vs. limit=8.0 2023-06-24 06:02:33,237 INFO [train.py:996] (1/4) Epoch 6, batch 20350, loss[loss=0.2332, simple_loss=0.306, pruned_loss=0.08014, over 21790.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3137, pruned_loss=0.07667, over 4263812.56 frames. ], batch size: 247, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:02:35,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1036938.0, ans=0.0 2023-06-24 06:02:56,825 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.320e+02 2.555e+02 2.973e+02 6.061e+02, threshold=5.110e+02, percent-clipped=1.0 2023-06-24 06:03:11,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1037058.0, ans=0.125 2023-06-24 06:03:25,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1037058.0, ans=0.125 2023-06-24 06:04:09,391 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.57 vs. limit=6.0 2023-06-24 06:04:20,798 INFO [train.py:996] (1/4) Epoch 6, batch 20400, loss[loss=0.2487, simple_loss=0.3273, pruned_loss=0.08505, over 21783.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3158, pruned_loss=0.07922, over 4256912.64 frames. ], batch size: 332, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:04:26,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1037238.0, ans=0.0 2023-06-24 06:04:46,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1037298.0, ans=0.0 2023-06-24 06:05:15,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1037358.0, ans=0.2 2023-06-24 06:06:05,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1037478.0, ans=0.125 2023-06-24 06:06:08,174 INFO [train.py:996] (1/4) Epoch 6, batch 20450, loss[loss=0.2377, simple_loss=0.3159, pruned_loss=0.07976, over 21822.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3169, pruned_loss=0.08145, over 4254946.27 frames. 
], batch size: 124, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:06:18,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1037538.0, ans=0.125 2023-06-24 06:06:31,821 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.916e+02 3.328e+02 3.687e+02 5.878e+02, threshold=6.655e+02, percent-clipped=5.0 2023-06-24 06:07:34,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1037718.0, ans=0.125 2023-06-24 06:07:51,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037778.0, ans=0.1 2023-06-24 06:07:54,270 INFO [train.py:996] (1/4) Epoch 6, batch 20500, loss[loss=0.2163, simple_loss=0.2905, pruned_loss=0.07109, over 21897.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3112, pruned_loss=0.08128, over 4253764.40 frames. ], batch size: 107, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:08:49,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1037958.0, ans=0.125 2023-06-24 06:09:41,880 INFO [train.py:996] (1/4) Epoch 6, batch 20550, loss[loss=0.1992, simple_loss=0.2744, pruned_loss=0.062, over 21386.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3035, pruned_loss=0.07921, over 4253295.15 frames. ], batch size: 131, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:10:03,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1038198.0, ans=0.125 2023-06-24 06:10:06,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.636e+02 3.017e+02 3.648e+02 5.396e+02, threshold=6.035e+02, percent-clipped=0.0 2023-06-24 06:10:37,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1038258.0, ans=0.0 2023-06-24 06:11:13,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1038378.0, ans=0.125 2023-06-24 06:11:22,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. limit=10.0 2023-06-24 06:11:29,767 INFO [train.py:996] (1/4) Epoch 6, batch 20600, loss[loss=0.2434, simple_loss=0.3011, pruned_loss=0.09286, over 21224.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3059, pruned_loss=0.0778, over 4251506.01 frames. ], batch size: 159, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:11:48,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-24 06:12:42,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1038618.0, ans=0.2 2023-06-24 06:13:10,689 INFO [train.py:996] (1/4) Epoch 6, batch 20650, loss[loss=0.1861, simple_loss=0.2482, pruned_loss=0.062, over 21532.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3024, pruned_loss=0.07806, over 4253766.56 frames. ], batch size: 263, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:13:16,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.02 vs. 
limit=22.5 2023-06-24 06:13:40,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.453e+02 2.852e+02 3.486e+02 6.346e+02, threshold=5.704e+02, percent-clipped=1.0 2023-06-24 06:13:44,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1038798.0, ans=0.05 2023-06-24 06:14:55,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1038978.0, ans=0.0 2023-06-24 06:15:00,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-24 06:15:00,357 INFO [train.py:996] (1/4) Epoch 6, batch 20700, loss[loss=0.2331, simple_loss=0.3137, pruned_loss=0.07624, over 21824.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2963, pruned_loss=0.07517, over 4236713.28 frames. ], batch size: 371, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:15:06,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1039038.0, ans=0.0 2023-06-24 06:15:24,579 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.89 vs. limit=22.5 2023-06-24 06:16:24,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-06-24 06:16:49,943 INFO [train.py:996] (1/4) Epoch 6, batch 20750, loss[loss=0.26, simple_loss=0.3883, pruned_loss=0.06586, over 20769.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2992, pruned_loss=0.07442, over 4239329.50 frames. ], batch size: 607, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:16:59,369 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:17:26,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1039398.0, ans=0.0 2023-06-24 06:17:37,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.434e+02 2.945e+02 4.112e+02 9.661e+02, threshold=5.891e+02, percent-clipped=8.0 2023-06-24 06:17:38,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1039398.0, ans=0.125 2023-06-24 06:17:40,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1039398.0, ans=0.125 2023-06-24 06:17:59,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.26 vs. limit=10.0 2023-06-24 06:18:00,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1039458.0, ans=0.125 2023-06-24 06:18:12,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1039518.0, ans=0.2 2023-06-24 06:18:14,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1039518.0, ans=0.125 2023-06-24 06:18:14,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. 
limit=15.0 2023-06-24 06:18:38,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1039578.0, ans=0.0 2023-06-24 06:18:43,219 INFO [train.py:996] (1/4) Epoch 6, batch 20800, loss[loss=0.2147, simple_loss=0.2805, pruned_loss=0.07448, over 21730.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3027, pruned_loss=0.0758, over 4238822.48 frames. ], batch size: 124, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:19:29,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1039698.0, ans=0.125 2023-06-24 06:19:34,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1039758.0, ans=0.125 2023-06-24 06:19:35,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1039758.0, ans=0.125 2023-06-24 06:19:41,973 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:20:21,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1039878.0, ans=0.125 2023-06-24 06:20:26,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-24 06:20:29,171 INFO [train.py:996] (1/4) Epoch 6, batch 20850, loss[loss=0.1709, simple_loss=0.2428, pruned_loss=0.04948, over 21580.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2934, pruned_loss=0.07312, over 4244171.53 frames. ], batch size: 230, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:21:06,159 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.402e+02 2.795e+02 3.449e+02 6.931e+02, threshold=5.589e+02, percent-clipped=4.0 2023-06-24 06:21:06,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1039998.0, ans=0.0 2023-06-24 06:21:19,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1040058.0, ans=0.125 2023-06-24 06:21:23,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1040058.0, ans=0.025 2023-06-24 06:21:37,357 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:22:18,912 INFO [train.py:996] (1/4) Epoch 6, batch 20900, loss[loss=0.2044, simple_loss=0.2805, pruned_loss=0.06411, over 21259.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2963, pruned_loss=0.07438, over 4246210.54 frames. ], batch size: 159, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:22:47,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1040298.0, ans=0.2 2023-06-24 06:23:07,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1040358.0, ans=0.1 2023-06-24 06:23:31,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1040418.0, ans=0.0 2023-06-24 06:23:33,956 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. 
limit=15.0 2023-06-24 06:23:36,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1040418.0, ans=0.0 2023-06-24 06:23:48,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1040478.0, ans=0.0 2023-06-24 06:23:50,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-24 06:24:04,673 INFO [train.py:996] (1/4) Epoch 6, batch 20950, loss[loss=0.1658, simple_loss=0.2442, pruned_loss=0.04373, over 21331.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2924, pruned_loss=0.07083, over 4251412.26 frames. ], batch size: 194, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:24:40,106 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.258e+02 2.758e+02 3.294e+02 6.843e+02, threshold=5.516e+02, percent-clipped=1.0 2023-06-24 06:24:42,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1040598.0, ans=0.2 2023-06-24 06:25:25,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1040718.0, ans=0.0 2023-06-24 06:25:49,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1040838.0, ans=15.0 2023-06-24 06:25:50,773 INFO [train.py:996] (1/4) Epoch 6, batch 21000, loss[loss=0.2204, simple_loss=0.2915, pruned_loss=0.07471, over 21883.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.291, pruned_loss=0.07134, over 4261985.77 frames. ], batch size: 351, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:25:50,773 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 06:26:08,830 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2672, simple_loss=0.3654, pruned_loss=0.08451, over 1796401.00 frames. 2023-06-24 06:26:08,831 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 06:26:36,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1040898.0, ans=0.125 2023-06-24 06:26:51,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=22.5 2023-06-24 06:27:02,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1040958.0, ans=0.2 2023-06-24 06:27:09,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1040958.0, ans=0.0 2023-06-24 06:27:12,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1041018.0, ans=0.125 2023-06-24 06:27:44,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1041078.0, ans=0.125 2023-06-24 06:27:50,615 INFO [train.py:996] (1/4) Epoch 6, batch 21050, loss[loss=0.1995, simple_loss=0.2648, pruned_loss=0.06704, over 21248.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2885, pruned_loss=0.0715, over 4256525.67 frames. 
], batch size: 131, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:28:23,270 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.469e+02 2.621e+02 3.007e+02 4.225e+02, threshold=5.242e+02, percent-clipped=0.0 2023-06-24 06:28:41,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1041258.0, ans=0.04949747468305833 2023-06-24 06:29:00,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-06-24 06:29:32,191 INFO [train.py:996] (1/4) Epoch 6, batch 21100, loss[loss=0.2275, simple_loss=0.2858, pruned_loss=0.0846, over 21513.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2843, pruned_loss=0.07028, over 4262961.96 frames. ], batch size: 414, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:29:47,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-24 06:30:03,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1041498.0, ans=0.125 2023-06-24 06:30:07,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1041498.0, ans=0.2 2023-06-24 06:30:09,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1041498.0, ans=0.0 2023-06-24 06:30:44,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1041618.0, ans=0.1 2023-06-24 06:30:48,065 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:30:49,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1041618.0, ans=0.125 2023-06-24 06:31:20,307 INFO [train.py:996] (1/4) Epoch 6, batch 21150, loss[loss=0.2343, simple_loss=0.3415, pruned_loss=0.0635, over 19768.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2806, pruned_loss=0.07078, over 4267489.57 frames. ], batch size: 703, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:32:03,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.519e+02 2.928e+02 4.378e+02 7.241e+02, threshold=5.856e+02, percent-clipped=12.0 2023-06-24 06:32:16,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1041858.0, ans=0.0 2023-06-24 06:32:44,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1041978.0, ans=0.025 2023-06-24 06:33:01,393 INFO [train.py:996] (1/4) Epoch 6, batch 21200, loss[loss=0.212, simple_loss=0.2718, pruned_loss=0.07611, over 21348.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2765, pruned_loss=0.06981, over 4253499.46 frames. ], batch size: 131, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:33:42,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. 
limit=15.0 2023-06-24 06:33:59,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1042158.0, ans=0.125 2023-06-24 06:34:19,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-24 06:34:25,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1042218.0, ans=0.125 2023-06-24 06:34:34,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-24 06:34:49,496 INFO [train.py:996] (1/4) Epoch 6, batch 21250, loss[loss=0.1988, simple_loss=0.2682, pruned_loss=0.0647, over 21161.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2754, pruned_loss=0.07016, over 4253103.91 frames. ], batch size: 143, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:34:58,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1042338.0, ans=0.125 2023-06-24 06:35:08,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0 2023-06-24 06:35:33,815 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.602e+02 2.917e+02 3.308e+02 4.858e+02, threshold=5.834e+02, percent-clipped=0.0 2023-06-24 06:35:55,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1042458.0, ans=0.125 2023-06-24 06:36:36,696 INFO [train.py:996] (1/4) Epoch 6, batch 21300, loss[loss=0.223, simple_loss=0.2987, pruned_loss=0.07358, over 21850.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2816, pruned_loss=0.07178, over 4250463.85 frames. ], batch size: 351, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:36:49,756 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0 2023-06-24 06:37:13,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1042698.0, ans=0.0 2023-06-24 06:37:14,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1042698.0, ans=0.2 2023-06-24 06:37:30,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1042758.0, ans=0.2 2023-06-24 06:38:06,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1042878.0, ans=0.1 2023-06-24 06:38:28,061 INFO [train.py:996] (1/4) Epoch 6, batch 21350, loss[loss=0.187, simple_loss=0.2627, pruned_loss=0.05561, over 21144.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.286, pruned_loss=0.07219, over 4256673.93 frames. ], batch size: 143, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:38:40,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1042938.0, ans=0.0 2023-06-24 06:39:03,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. 
limit=12.0 2023-06-24 06:39:13,495 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.459e+02 2.698e+02 3.098e+02 4.551e+02, threshold=5.397e+02, percent-clipped=0.0 2023-06-24 06:39:37,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1043118.0, ans=0.0 2023-06-24 06:39:50,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1043118.0, ans=0.125 2023-06-24 06:40:27,125 INFO [train.py:996] (1/4) Epoch 6, batch 21400, loss[loss=0.3226, simple_loss=0.38, pruned_loss=0.1326, over 21338.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2895, pruned_loss=0.07138, over 4261943.67 frames. ], batch size: 507, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:40:41,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1043238.0, ans=0.0 2023-06-24 06:41:12,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-24 06:41:26,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1043418.0, ans=0.025 2023-06-24 06:42:14,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1043538.0, ans=0.125 2023-06-24 06:42:15,512 INFO [train.py:996] (1/4) Epoch 6, batch 21450, loss[loss=0.2406, simple_loss=0.3066, pruned_loss=0.08735, over 21462.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.294, pruned_loss=0.07368, over 4263219.82 frames. ], batch size: 131, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:42:49,430 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.545e+02 3.012e+02 3.537e+02 6.506e+02, threshold=6.024e+02, percent-clipped=2.0 2023-06-24 06:42:56,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1043658.0, ans=0.2 2023-06-24 06:43:19,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1043718.0, ans=0.035 2023-06-24 06:43:38,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1043778.0, ans=0.125 2023-06-24 06:44:02,127 INFO [train.py:996] (1/4) Epoch 6, batch 21500, loss[loss=0.2118, simple_loss=0.2764, pruned_loss=0.07362, over 21723.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2918, pruned_loss=0.07503, over 4271770.33 frames. ], batch size: 351, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:44:08,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1043838.0, ans=0.125 2023-06-24 06:44:54,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-06-24 06:45:03,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1044018.0, ans=0.125 2023-06-24 06:45:50,184 INFO [train.py:996] (1/4) Epoch 6, batch 21550, loss[loss=0.1795, simple_loss=0.2502, pruned_loss=0.05447, over 21611.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2854, pruned_loss=0.07264, over 4265240.19 frames. 
], batch size: 298, lr: 5.01e-03, grad_scale: 8.0 2023-06-24 06:46:26,727 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.541e+02 2.913e+02 3.487e+02 5.320e+02, threshold=5.826e+02, percent-clipped=0.0 2023-06-24 06:46:40,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-24 06:47:08,420 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-24 06:47:09,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1044378.0, ans=0.125 2023-06-24 06:47:20,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=22.5 2023-06-24 06:47:39,508 INFO [train.py:996] (1/4) Epoch 6, batch 21600, loss[loss=0.1959, simple_loss=0.2702, pruned_loss=0.06079, over 21246.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2825, pruned_loss=0.0713, over 4258083.36 frames. ], batch size: 159, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:47:55,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.98 vs. limit=22.5 2023-06-24 06:48:26,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1044558.0, ans=0.0 2023-06-24 06:48:31,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1044558.0, ans=0.07 2023-06-24 06:48:39,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1044618.0, ans=0.0 2023-06-24 06:49:21,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1044678.0, ans=0.125 2023-06-24 06:49:21,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1044678.0, ans=0.125 2023-06-24 06:49:27,640 INFO [train.py:996] (1/4) Epoch 6, batch 21650, loss[loss=0.1866, simple_loss=0.2714, pruned_loss=0.05092, over 21309.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.287, pruned_loss=0.06955, over 4252705.87 frames. 
], batch size: 131, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:49:28,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1044738.0, ans=0.1 2023-06-24 06:49:30,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1044738.0, ans=0.05 2023-06-24 06:49:36,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1044738.0, ans=0.125 2023-06-24 06:49:38,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1044738.0, ans=0.1 2023-06-24 06:50:03,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.515e+02 2.797e+02 3.244e+02 5.540e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-24 06:50:10,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.57 vs. limit=15.0 2023-06-24 06:50:27,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1044918.0, ans=0.125 2023-06-24 06:50:27,424 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-24 06:50:42,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1044918.0, ans=0.0 2023-06-24 06:51:14,472 INFO [train.py:996] (1/4) Epoch 6, batch 21700, loss[loss=0.2072, simple_loss=0.2746, pruned_loss=0.06988, over 21557.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2876, pruned_loss=0.06905, over 4257908.99 frames. ], batch size: 391, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:51:18,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1045038.0, ans=0.125 2023-06-24 06:51:27,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1045038.0, ans=0.125 2023-06-24 06:51:40,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.88 vs. limit=22.5 2023-06-24 06:53:01,229 INFO [train.py:996] (1/4) Epoch 6, batch 21750, loss[loss=0.2035, simple_loss=0.2557, pruned_loss=0.07564, over 21227.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2845, pruned_loss=0.06918, over 4254823.45 frames. ], batch size: 551, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:53:20,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1045398.0, ans=0.125 2023-06-24 06:53:37,711 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.476e+02 2.744e+02 3.259e+02 4.826e+02, threshold=5.488e+02, percent-clipped=0.0 2023-06-24 06:53:56,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1045458.0, ans=10.0 2023-06-24 06:54:45,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1045578.0, ans=0.125 2023-06-24 06:54:49,975 INFO [train.py:996] (1/4) Epoch 6, batch 21800, loss[loss=0.2304, simple_loss=0.3242, pruned_loss=0.06829, over 19911.00 frames. 
], tot_loss[loss=0.211, simple_loss=0.2823, pruned_loss=0.06985, over 4238174.29 frames. ], batch size: 702, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:55:16,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-24 06:55:30,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1045758.0, ans=0.125 2023-06-24 06:55:31,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1045758.0, ans=0.0 2023-06-24 06:55:58,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1045818.0, ans=0.015 2023-06-24 06:56:39,646 INFO [train.py:996] (1/4) Epoch 6, batch 21850, loss[loss=0.2181, simple_loss=0.295, pruned_loss=0.07059, over 21472.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2859, pruned_loss=0.07043, over 4238828.61 frames. ], batch size: 131, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:57:16,747 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.510e+02 2.889e+02 3.463e+02 5.314e+02, threshold=5.778e+02, percent-clipped=0.0 2023-06-24 06:57:53,020 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-24 06:58:21,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1046178.0, ans=0.125 2023-06-24 06:58:27,727 INFO [train.py:996] (1/4) Epoch 6, batch 21900, loss[loss=0.2072, simple_loss=0.274, pruned_loss=0.07025, over 21775.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.288, pruned_loss=0.07168, over 4249192.42 frames. ], batch size: 333, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:58:28,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1046238.0, ans=0.0 2023-06-24 07:00:11,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1046478.0, ans=0.125 2023-06-24 07:00:21,990 INFO [train.py:996] (1/4) Epoch 6, batch 21950, loss[loss=0.1539, simple_loss=0.2329, pruned_loss=0.03746, over 21500.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2828, pruned_loss=0.07082, over 4261016.51 frames. ], batch size: 212, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:00:53,154 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.424e+02 2.913e+02 3.468e+02 5.833e+02, threshold=5.826e+02, percent-clipped=1.0 2023-06-24 07:00:55,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-06-24 07:01:40,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1046778.0, ans=0.125 2023-06-24 07:02:10,403 INFO [train.py:996] (1/4) Epoch 6, batch 22000, loss[loss=0.2682, simple_loss=0.317, pruned_loss=0.1097, over 21391.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2776, pruned_loss=0.068, over 4260413.07 frames. 
], batch size: 507, lr: 5.00e-03, grad_scale: 32.0 2023-06-24 07:02:17,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1046838.0, ans=0.125 2023-06-24 07:02:36,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1046898.0, ans=0.0 2023-06-24 07:03:29,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1047018.0, ans=0.0 2023-06-24 07:04:00,837 INFO [train.py:996] (1/4) Epoch 6, batch 22050, loss[loss=0.2325, simple_loss=0.3093, pruned_loss=0.0778, over 21254.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2819, pruned_loss=0.06972, over 4259497.24 frames. ], batch size: 159, lr: 5.00e-03, grad_scale: 32.0 2023-06-24 07:04:09,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-24 07:04:24,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1047198.0, ans=0.1 2023-06-24 07:04:39,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.375e+02 2.787e+02 3.407e+02 5.897e+02, threshold=5.574e+02, percent-clipped=1.0 2023-06-24 07:04:40,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1047258.0, ans=0.125 2023-06-24 07:04:43,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1047258.0, ans=0.125 2023-06-24 07:04:48,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1047258.0, ans=0.1 2023-06-24 07:05:13,318 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:05:28,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1047378.0, ans=0.125 2023-06-24 07:05:32,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1047378.0, ans=0.125 2023-06-24 07:05:32,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1047378.0, ans=0.125 2023-06-24 07:05:49,385 INFO [train.py:996] (1/4) Epoch 6, batch 22100, loss[loss=0.2252, simple_loss=0.2977, pruned_loss=0.0763, over 21438.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2924, pruned_loss=0.07465, over 4269227.26 frames. ], batch size: 211, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:06:26,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1047498.0, ans=0.0 2023-06-24 07:07:10,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-24 07:07:32,054 INFO [train.py:996] (1/4) Epoch 6, batch 22150, loss[loss=0.2782, simple_loss=0.3248, pruned_loss=0.1158, over 21747.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2957, pruned_loss=0.07629, over 4278936.07 frames. 
], batch size: 508, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:07:55,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-24 07:08:10,536 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.700e+02 3.228e+02 3.782e+02 5.741e+02, threshold=6.456e+02, percent-clipped=1.0 2023-06-24 07:08:13,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1047858.0, ans=0.1 2023-06-24 07:08:13,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.45 vs. limit=15.0 2023-06-24 07:08:16,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1047858.0, ans=0.0 2023-06-24 07:09:21,404 INFO [train.py:996] (1/4) Epoch 6, batch 22200, loss[loss=0.2341, simple_loss=0.3301, pruned_loss=0.06901, over 21900.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2975, pruned_loss=0.07701, over 4282894.62 frames. ], batch size: 316, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:09:32,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1048038.0, ans=0.125 2023-06-24 07:10:03,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1048158.0, ans=0.125 2023-06-24 07:10:29,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1048218.0, ans=0.09899494936611666 2023-06-24 07:11:09,188 INFO [train.py:996] (1/4) Epoch 6, batch 22250, loss[loss=0.293, simple_loss=0.3571, pruned_loss=0.1144, over 21397.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3049, pruned_loss=0.07842, over 4288117.72 frames. ], batch size: 471, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:11:10,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-24 07:11:20,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1048338.0, ans=0.0 2023-06-24 07:11:46,678 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.521e+02 2.836e+02 3.368e+02 6.817e+02, threshold=5.671e+02, percent-clipped=1.0 2023-06-24 07:12:42,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1048578.0, ans=0.125 2023-06-24 07:12:55,395 INFO [train.py:996] (1/4) Epoch 6, batch 22300, loss[loss=0.2312, simple_loss=0.2977, pruned_loss=0.08236, over 21321.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.306, pruned_loss=0.07912, over 4274576.72 frames. 
], batch size: 176, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:14:17,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1048818.0, ans=0.125 2023-06-24 07:14:30,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1048878.0, ans=0.0 2023-06-24 07:14:30,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1048878.0, ans=0.2 2023-06-24 07:14:38,217 INFO [train.py:996] (1/4) Epoch 6, batch 22350, loss[loss=0.1777, simple_loss=0.2601, pruned_loss=0.04766, over 21542.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3042, pruned_loss=0.07958, over 4279006.02 frames. ], batch size: 212, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:15:06,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1048998.0, ans=0.125 2023-06-24 07:15:09,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1048998.0, ans=0.125 2023-06-24 07:15:15,680 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.647e+02 2.993e+02 3.483e+02 5.422e+02, threshold=5.987e+02, percent-clipped=0.0 2023-06-24 07:15:23,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1049058.0, ans=0.125 2023-06-24 07:15:31,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1049058.0, ans=0.125 2023-06-24 07:15:32,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1049058.0, ans=0.125 2023-06-24 07:15:40,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1049118.0, ans=0.0 2023-06-24 07:15:55,520 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-24 07:16:17,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049178.0, ans=0.1 2023-06-24 07:16:20,239 INFO [train.py:996] (1/4) Epoch 6, batch 22400, loss[loss=0.2104, simple_loss=0.2827, pruned_loss=0.06906, over 21559.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3001, pruned_loss=0.07632, over 4288155.58 frames. ], batch size: 230, lr: 4.99e-03, grad_scale: 32.0 2023-06-24 07:16:38,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1049238.0, ans=0.125 2023-06-24 07:17:41,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1049418.0, ans=0.1 2023-06-24 07:18:07,111 INFO [train.py:996] (1/4) Epoch 6, batch 22450, loss[loss=0.1915, simple_loss=0.2444, pruned_loss=0.06931, over 21199.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.295, pruned_loss=0.07557, over 4272761.71 frames. 
], batch size: 548, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:18:52,753 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.523e+02 2.860e+02 3.590e+02 5.659e+02, threshold=5.720e+02, percent-clipped=0.0 2023-06-24 07:18:55,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1049658.0, ans=0.2 2023-06-24 07:19:15,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1049718.0, ans=0.125 2023-06-24 07:19:44,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1049778.0, ans=0.0 2023-06-24 07:19:46,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1049778.0, ans=0.125 2023-06-24 07:19:50,737 INFO [train.py:996] (1/4) Epoch 6, batch 22500, loss[loss=0.2029, simple_loss=0.2733, pruned_loss=0.06621, over 21779.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2894, pruned_loss=0.07476, over 4273002.50 frames. ], batch size: 107, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:20:16,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-24 07:21:17,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1050018.0, ans=0.2 2023-06-24 07:21:19,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1050018.0, ans=0.2 2023-06-24 07:21:21,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1050078.0, ans=0.125 2023-06-24 07:21:34,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-24 07:21:36,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-24 07:21:40,278 INFO [train.py:996] (1/4) Epoch 6, batch 22550, loss[loss=0.2189, simple_loss=0.2908, pruned_loss=0.07344, over 21312.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2943, pruned_loss=0.07536, over 4278232.70 frames. ], batch size: 176, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:21:47,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1050138.0, ans=0.0 2023-06-24 07:21:49,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1050138.0, ans=0.2 2023-06-24 07:22:32,441 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.690e+02 3.328e+02 4.292e+02 7.428e+02, threshold=6.656e+02, percent-clipped=5.0 2023-06-24 07:22:40,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1050258.0, ans=0.0 2023-06-24 07:23:25,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1050378.0, ans=0.2 2023-06-24 07:23:28,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. 
limit=15.0 2023-06-24 07:23:30,367 INFO [train.py:996] (1/4) Epoch 6, batch 22600, loss[loss=0.1627, simple_loss=0.2047, pruned_loss=0.06037, over 16557.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2967, pruned_loss=0.07607, over 4280891.29 frames. ], batch size: 61, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:23:52,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1050438.0, ans=0.0 2023-06-24 07:23:53,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1050438.0, ans=0.025 2023-06-24 07:24:20,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1050558.0, ans=0.125 2023-06-24 07:24:37,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1050618.0, ans=0.125 2023-06-24 07:25:23,770 INFO [train.py:996] (1/4) Epoch 6, batch 22650, loss[loss=0.2147, simple_loss=0.283, pruned_loss=0.07316, over 21686.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2929, pruned_loss=0.07558, over 4275904.71 frames. ], batch size: 112, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:25:36,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1050738.0, ans=0.125 2023-06-24 07:25:40,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1050738.0, ans=0.125 2023-06-24 07:26:07,805 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.240e+02 2.713e+02 2.934e+02 3.379e+02 4.768e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-24 07:26:10,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1050858.0, ans=0.125 2023-06-24 07:26:32,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1050918.0, ans=0.125 2023-06-24 07:26:58,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1050978.0, ans=0.04949747468305833 2023-06-24 07:27:04,011 INFO [train.py:996] (1/4) Epoch 6, batch 22700, loss[loss=0.1809, simple_loss=0.256, pruned_loss=0.05287, over 21812.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2872, pruned_loss=0.07477, over 4281077.48 frames. ], batch size: 118, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:27:35,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1051098.0, ans=0.1 2023-06-24 07:28:23,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1051218.0, ans=0.0 2023-06-24 07:28:56,563 INFO [train.py:996] (1/4) Epoch 6, batch 22750, loss[loss=0.2695, simple_loss=0.3363, pruned_loss=0.1014, over 21670.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2887, pruned_loss=0.07672, over 4273334.09 frames. ], batch size: 351, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:29:40,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.10 vs. 
limit=10.0 2023-06-24 07:29:41,501 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.658e+02 2.967e+02 3.229e+02 5.067e+02, threshold=5.933e+02, percent-clipped=0.0 2023-06-24 07:30:01,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1051518.0, ans=0.125 2023-06-24 07:30:23,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-24 07:30:48,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1051638.0, ans=0.2 2023-06-24 07:30:49,464 INFO [train.py:996] (1/4) Epoch 6, batch 22800, loss[loss=0.221, simple_loss=0.3063, pruned_loss=0.06786, over 21794.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2931, pruned_loss=0.07931, over 4279119.66 frames. ], batch size: 112, lr: 4.99e-03, grad_scale: 32.0 2023-06-24 07:31:09,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1051698.0, ans=0.125 2023-06-24 07:31:23,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1051698.0, ans=0.0 2023-06-24 07:31:47,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1051818.0, ans=0.0 2023-06-24 07:32:31,115 INFO [train.py:996] (1/4) Epoch 6, batch 22850, loss[loss=0.2004, simple_loss=0.2669, pruned_loss=0.067, over 21859.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2881, pruned_loss=0.07826, over 4279027.51 frames. ], batch size: 373, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:32:43,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1051938.0, ans=0.125 2023-06-24 07:32:58,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1051998.0, ans=0.1 2023-06-24 07:33:14,541 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.514e+02 2.935e+02 3.337e+02 4.796e+02, threshold=5.870e+02, percent-clipped=0.0 2023-06-24 07:33:31,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1052118.0, ans=0.0 2023-06-24 07:34:00,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-24 07:34:07,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1052178.0, ans=0.2 2023-06-24 07:34:20,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1052178.0, ans=0.125 2023-06-24 07:34:22,985 INFO [train.py:996] (1/4) Epoch 6, batch 22900, loss[loss=0.222, simple_loss=0.3255, pruned_loss=0.05923, over 21802.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2935, pruned_loss=0.07777, over 4271440.80 frames. 
], batch size: 282, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:34:25,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1052238.0, ans=0.0 2023-06-24 07:34:29,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1052238.0, ans=0.1 2023-06-24 07:35:14,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1052358.0, ans=0.125 2023-06-24 07:35:27,937 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-06-24 07:36:14,144 INFO [train.py:996] (1/4) Epoch 6, batch 22950, loss[loss=0.2402, simple_loss=0.2945, pruned_loss=0.0929, over 21890.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3104, pruned_loss=0.07739, over 4274565.85 frames. ], batch size: 107, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:36:14,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1052538.0, ans=0.0 2023-06-24 07:36:56,202 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.392e+02 2.733e+02 3.196e+02 4.909e+02, threshold=5.466e+02, percent-clipped=0.0 2023-06-24 07:37:16,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1052718.0, ans=0.125 2023-06-24 07:37:38,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1052718.0, ans=0.2 2023-06-24 07:38:02,365 INFO [train.py:996] (1/4) Epoch 6, batch 23000, loss[loss=0.2042, simple_loss=0.2693, pruned_loss=0.06959, over 21213.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3081, pruned_loss=0.07501, over 4269550.37 frames. ], batch size: 608, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:38:28,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-24 07:38:35,670 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-24 07:39:31,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1053018.0, ans=0.125 2023-06-24 07:39:58,078 INFO [train.py:996] (1/4) Epoch 6, batch 23050, loss[loss=0.2349, simple_loss=0.307, pruned_loss=0.08144, over 21781.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3082, pruned_loss=0.07607, over 4269534.64 frames. ], batch size: 247, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:39:59,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=15.0 2023-06-24 07:39:59,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=22.5 2023-06-24 07:40:04,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1053138.0, ans=0.0 2023-06-24 07:40:13,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.89 vs. 
limit=15.0 2023-06-24 07:40:41,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.622e+02 2.848e+02 3.330e+02 6.770e+02, threshold=5.696e+02, percent-clipped=1.0 2023-06-24 07:41:48,439 INFO [train.py:996] (1/4) Epoch 6, batch 23100, loss[loss=0.2133, simple_loss=0.2768, pruned_loss=0.07485, over 21531.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3024, pruned_loss=0.0758, over 4269455.69 frames. ], batch size: 414, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:42:13,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1053498.0, ans=0.2 2023-06-24 07:43:07,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1053618.0, ans=0.0 2023-06-24 07:43:18,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1053678.0, ans=0.125 2023-06-24 07:43:35,985 INFO [train.py:996] (1/4) Epoch 6, batch 23150, loss[loss=0.1919, simple_loss=0.2604, pruned_loss=0.06169, over 22019.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2958, pruned_loss=0.07517, over 4265756.10 frames. ], batch size: 103, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:43:45,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053738.0, ans=0.1 2023-06-24 07:44:07,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-24 07:44:16,347 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.510e+02 2.867e+02 3.300e+02 5.681e+02, threshold=5.734e+02, percent-clipped=0.0 2023-06-24 07:44:19,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-24 07:45:02,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1053978.0, ans=0.05 2023-06-24 07:45:15,937 INFO [train.py:996] (1/4) Epoch 6, batch 23200, loss[loss=0.2262, simple_loss=0.3046, pruned_loss=0.07393, over 21904.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2957, pruned_loss=0.07648, over 4277575.43 frames. ], batch size: 124, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 07:46:09,018 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:46:42,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1054218.0, ans=0.125 2023-06-24 07:46:52,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-24 07:47:02,656 INFO [train.py:996] (1/4) Epoch 6, batch 23250, loss[loss=0.2619, simple_loss=0.3099, pruned_loss=0.1069, over 21786.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.295, pruned_loss=0.07709, over 4286652.43 frames. 
], batch size: 508, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 07:47:26,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1054398.0, ans=0.125 2023-06-24 07:47:56,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.641e+02 2.998e+02 3.541e+02 5.576e+02, threshold=5.996e+02, percent-clipped=0.0 2023-06-24 07:48:25,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1054518.0, ans=0.125 2023-06-24 07:48:26,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1054518.0, ans=0.125 2023-06-24 07:48:57,999 INFO [train.py:996] (1/4) Epoch 6, batch 23300, loss[loss=0.2491, simple_loss=0.349, pruned_loss=0.07464, over 21626.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3041, pruned_loss=0.0787, over 4287642.18 frames. ], batch size: 230, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:49:30,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1054698.0, ans=0.125 2023-06-24 07:49:36,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-24 07:49:56,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1054758.0, ans=0.0 2023-06-24 07:50:23,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-24 07:50:40,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1054878.0, ans=0.125 2023-06-24 07:50:46,418 INFO [train.py:996] (1/4) Epoch 6, batch 23350, loss[loss=0.1687, simple_loss=0.2331, pruned_loss=0.05219, over 21918.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.308, pruned_loss=0.07733, over 4290586.71 frames. ], batch size: 107, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:51:20,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=22.5 2023-06-24 07:51:41,202 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-24 07:51:41,622 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.545e+02 3.075e+02 3.480e+02 4.848e+02, threshold=6.150e+02, percent-clipped=0.0 2023-06-24 07:52:34,868 INFO [train.py:996] (1/4) Epoch 6, batch 23400, loss[loss=0.1862, simple_loss=0.2573, pruned_loss=0.05752, over 21303.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3006, pruned_loss=0.07333, over 4283395.69 frames. 
], batch size: 176, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:52:57,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1055238.0, ans=0.125 2023-06-24 07:53:10,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1055298.0, ans=0.2 2023-06-24 07:53:36,380 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:54:04,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 07:54:12,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1055478.0, ans=0.125 2023-06-24 07:54:27,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1055538.0, ans=0.125 2023-06-24 07:54:33,578 INFO [train.py:996] (1/4) Epoch 6, batch 23450, loss[loss=0.218, simple_loss=0.2866, pruned_loss=0.07465, over 20772.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3017, pruned_loss=0.0757, over 4284180.11 frames. ], batch size: 608, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:54:50,420 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=15.0 2023-06-24 07:55:17,215 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.530e+02 2.834e+02 3.227e+02 5.088e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-24 07:55:27,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0 2023-06-24 07:56:15,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1055838.0, ans=0.125 2023-06-24 07:56:20,856 INFO [train.py:996] (1/4) Epoch 6, batch 23500, loss[loss=0.2335, simple_loss=0.3103, pruned_loss=0.07837, over 21813.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3018, pruned_loss=0.07733, over 4288802.27 frames. ], batch size: 118, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:56:42,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1055898.0, ans=0.0 2023-06-24 07:57:02,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1055958.0, ans=0.0 2023-06-24 07:57:14,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1055958.0, ans=0.0 2023-06-24 07:57:46,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-24 07:57:50,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1056078.0, ans=0.1 2023-06-24 07:58:02,657 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:58:09,290 INFO [train.py:996] (1/4) Epoch 6, batch 23550, loss[loss=0.1907, simple_loss=0.2394, pruned_loss=0.07096, over 21295.00 frames. 
], tot_loss[loss=0.2249, simple_loss=0.2956, pruned_loss=0.07708, over 4287030.96 frames. ], batch size: 548, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:58:11,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1056138.0, ans=0.125 2023-06-24 07:58:11,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1056138.0, ans=0.125 2023-06-24 07:58:12,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-24 07:58:41,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1056198.0, ans=0.2 2023-06-24 07:58:52,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.611e+02 2.905e+02 3.629e+02 5.861e+02, threshold=5.811e+02, percent-clipped=1.0 2023-06-24 07:59:10,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1056318.0, ans=0.2 2023-06-24 07:59:21,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1056318.0, ans=0.0 2023-06-24 07:59:57,762 INFO [train.py:996] (1/4) Epoch 6, batch 23600, loss[loss=0.2283, simple_loss=0.3048, pruned_loss=0.07584, over 21478.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2973, pruned_loss=0.07656, over 4277176.83 frames. ], batch size: 211, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 08:00:49,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-24 08:01:52,004 INFO [train.py:996] (1/4) Epoch 6, batch 23650, loss[loss=0.2383, simple_loss=0.3175, pruned_loss=0.07957, over 21296.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2983, pruned_loss=0.0751, over 4280921.41 frames. ], batch size: 548, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 08:02:38,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.630e+02 3.092e+02 3.541e+02 6.593e+02, threshold=6.183e+02, percent-clipped=1.0 2023-06-24 08:03:13,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1056918.0, ans=0.125 2023-06-24 08:03:40,792 INFO [train.py:996] (1/4) Epoch 6, batch 23700, loss[loss=0.2646, simple_loss=0.3321, pruned_loss=0.09855, over 21253.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3009, pruned_loss=0.07516, over 4285081.10 frames. ], batch size: 143, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:04:23,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1057158.0, ans=0.125 2023-06-24 08:04:35,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1057158.0, ans=0.2 2023-06-24 08:05:11,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1057218.0, ans=0.125 2023-06-24 08:05:31,890 INFO [train.py:996] (1/4) Epoch 6, batch 23750, loss[loss=0.2265, simple_loss=0.3209, pruned_loss=0.06602, over 21685.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3037, pruned_loss=0.07603, over 4288949.97 frames. 
], batch size: 441, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:06:00,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1057398.0, ans=0.125 2023-06-24 08:06:04,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-24 08:06:14,448 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:06:26,810 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.305e+02 2.862e+02 3.715e+02 6.571e+02, threshold=5.724e+02, percent-clipped=1.0 2023-06-24 08:06:47,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-24 08:06:51,457 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-24 08:07:07,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1057578.0, ans=0.0 2023-06-24 08:07:21,349 INFO [train.py:996] (1/4) Epoch 6, batch 23800, loss[loss=0.273, simple_loss=0.3659, pruned_loss=0.09007, over 21757.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3009, pruned_loss=0.07366, over 4284975.56 frames. ], batch size: 351, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:07:27,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057638.0, ans=0.1 2023-06-24 08:08:37,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1057818.0, ans=0.125 2023-06-24 08:08:44,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1057818.0, ans=0.2 2023-06-24 08:08:46,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1057818.0, ans=0.2 2023-06-24 08:09:18,094 INFO [train.py:996] (1/4) Epoch 6, batch 23850, loss[loss=0.2469, simple_loss=0.3229, pruned_loss=0.08543, over 21977.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3105, pruned_loss=0.07655, over 4284834.01 frames. ], batch size: 317, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:09:36,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.16 vs. limit=5.0 2023-06-24 08:10:14,726 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.790e+02 3.206e+02 3.794e+02 6.982e+02, threshold=6.412e+02, percent-clipped=2.0 2023-06-24 08:10:35,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-24 08:11:12,040 INFO [train.py:996] (1/4) Epoch 6, batch 23900, loss[loss=0.2259, simple_loss=0.3087, pruned_loss=0.07156, over 21673.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.319, pruned_loss=0.07957, over 4282357.32 frames. 
], batch size: 332, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:11:16,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1058238.0, ans=0.1 2023-06-24 08:12:17,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1058418.0, ans=0.0 2023-06-24 08:13:00,271 INFO [train.py:996] (1/4) Epoch 6, batch 23950, loss[loss=0.2277, simple_loss=0.283, pruned_loss=0.08619, over 20133.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3122, pruned_loss=0.07917, over 4280803.24 frames. ], batch size: 702, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:13:03,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-24 08:13:41,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1058598.0, ans=0.1 2023-06-24 08:13:52,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.675e+02 3.021e+02 3.458e+02 5.557e+02, threshold=6.041e+02, percent-clipped=0.0 2023-06-24 08:13:55,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1058658.0, ans=0.125 2023-06-24 08:13:58,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1058658.0, ans=0.125 2023-06-24 08:14:40,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1058778.0, ans=0.125 2023-06-24 08:14:55,869 INFO [train.py:996] (1/4) Epoch 6, batch 24000, loss[loss=0.2452, simple_loss=0.316, pruned_loss=0.08714, over 20711.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3142, pruned_loss=0.08228, over 4287920.56 frames. ], batch size: 607, lr: 4.97e-03, grad_scale: 32.0 2023-06-24 08:14:55,870 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 08:15:17,147 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2634, simple_loss=0.3603, pruned_loss=0.08319, over 1796401.00 frames. 2023-06-24 08:15:17,148 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 08:15:21,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1058838.0, ans=0.125 2023-06-24 08:16:13,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1058958.0, ans=0.0 2023-06-24 08:16:13,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1058958.0, ans=0.125 2023-06-24 08:16:27,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1059018.0, ans=0.125 2023-06-24 08:16:29,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.05 vs. limit=15.0 2023-06-24 08:17:08,162 INFO [train.py:996] (1/4) Epoch 6, batch 24050, loss[loss=0.2249, simple_loss=0.3078, pruned_loss=0.07097, over 21749.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3153, pruned_loss=0.08246, over 4285210.85 frames. 
], batch size: 332, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:17:08,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1059138.0, ans=10.0 2023-06-24 08:17:56,744 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.625e+02 3.028e+02 3.764e+02 6.671e+02, threshold=6.056e+02, percent-clipped=1.0 2023-06-24 08:18:14,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1059318.0, ans=0.0 2023-06-24 08:18:38,598 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=22.5 2023-06-24 08:18:59,223 INFO [train.py:996] (1/4) Epoch 6, batch 24100, loss[loss=0.3226, simple_loss=0.3787, pruned_loss=0.1333, over 21328.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3148, pruned_loss=0.0805, over 4278696.48 frames. ], batch size: 507, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:19:05,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1059438.0, ans=0.125 2023-06-24 08:19:15,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1059498.0, ans=0.125 2023-06-24 08:19:18,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1059498.0, ans=0.125 2023-06-24 08:19:22,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1059498.0, ans=0.125 2023-06-24 08:20:49,103 INFO [train.py:996] (1/4) Epoch 6, batch 24150, loss[loss=0.2372, simple_loss=0.3062, pruned_loss=0.0841, over 21925.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3131, pruned_loss=0.08158, over 4284239.71 frames. ], batch size: 316, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:21:39,414 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.79 vs. limit=22.5 2023-06-24 08:21:42,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1059858.0, ans=0.0 2023-06-24 08:21:43,497 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.678e+02 3.013e+02 3.443e+02 5.621e+02, threshold=6.026e+02, percent-clipped=0.0 2023-06-24 08:22:02,366 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.47 vs. limit=10.0 2023-06-24 08:22:13,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1059918.0, ans=0.0 2023-06-24 08:22:35,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1059978.0, ans=0.1 2023-06-24 08:22:40,808 INFO [train.py:996] (1/4) Epoch 6, batch 24200, loss[loss=0.3283, simple_loss=0.3999, pruned_loss=0.1283, over 21481.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3154, pruned_loss=0.08334, over 4285613.03 frames. 
], batch size: 508, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:22:48,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1060038.0, ans=0.125 2023-06-24 08:23:06,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1060098.0, ans=0.2 2023-06-24 08:23:14,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1060098.0, ans=0.125 2023-06-24 08:23:23,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1060158.0, ans=0.0 2023-06-24 08:23:23,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=12.0 2023-06-24 08:23:40,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1060158.0, ans=0.125 2023-06-24 08:24:27,613 INFO [train.py:996] (1/4) Epoch 6, batch 24250, loss[loss=0.1766, simple_loss=0.3012, pruned_loss=0.02598, over 20773.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3126, pruned_loss=0.07669, over 4286466.44 frames. ], batch size: 607, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:24:34,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-24 08:24:35,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1060338.0, ans=0.125 2023-06-24 08:24:51,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1060338.0, ans=0.1 2023-06-24 08:24:56,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1060398.0, ans=0.0 2023-06-24 08:25:08,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1060398.0, ans=0.2 2023-06-24 08:25:10,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2023-06-24 08:25:24,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1060458.0, ans=0.125 2023-06-24 08:25:25,857 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 2.253e+02 2.770e+02 3.370e+02 5.813e+02, threshold=5.539e+02, percent-clipped=0.0 2023-06-24 08:26:15,731 INFO [train.py:996] (1/4) Epoch 6, batch 24300, loss[loss=0.1777, simple_loss=0.2496, pruned_loss=0.05287, over 21336.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3048, pruned_loss=0.07169, over 4275000.06 frames. ], batch size: 159, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:27:33,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.82 vs. limit=6.0 2023-06-24 08:28:09,204 INFO [train.py:996] (1/4) Epoch 6, batch 24350, loss[loss=0.2207, simple_loss=0.2983, pruned_loss=0.07157, over 21807.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3018, pruned_loss=0.07154, over 4281546.71 frames. 
], batch size: 298, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:28:22,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1060938.0, ans=0.025 2023-06-24 08:28:26,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.15 vs. limit=22.5 2023-06-24 08:29:01,796 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 2.610e+02 2.946e+02 3.475e+02 5.631e+02, threshold=5.892e+02, percent-clipped=1.0 2023-06-24 08:29:10,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1061058.0, ans=0.125 2023-06-24 08:29:10,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-24 08:29:58,779 INFO [train.py:996] (1/4) Epoch 6, batch 24400, loss[loss=0.2359, simple_loss=0.3146, pruned_loss=0.07856, over 21780.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3065, pruned_loss=0.07556, over 4284679.74 frames. ], batch size: 333, lr: 4.97e-03, grad_scale: 32.0 2023-06-24 08:30:04,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1061238.0, ans=0.0 2023-06-24 08:30:43,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1061358.0, ans=0.125 2023-06-24 08:31:11,822 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=15.0 2023-06-24 08:31:14,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1061418.0, ans=10.0 2023-06-24 08:31:41,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-24 08:31:49,136 INFO [train.py:996] (1/4) Epoch 6, batch 24450, loss[loss=0.2944, simple_loss=0.3786, pruned_loss=0.1051, over 21463.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.308, pruned_loss=0.07638, over 4279318.64 frames. ], batch size: 471, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:31:57,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1061538.0, ans=0.1 2023-06-24 08:32:06,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1061538.0, ans=0.125 2023-06-24 08:32:41,894 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.780e+02 3.190e+02 3.668e+02 5.575e+02, threshold=6.380e+02, percent-clipped=0.0 2023-06-24 08:32:51,752 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-24 08:32:58,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. 
limit=15.0 2023-06-24 08:33:01,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1061718.0, ans=0.1 2023-06-24 08:33:34,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1061778.0, ans=0.0 2023-06-24 08:33:36,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1061838.0, ans=0.0 2023-06-24 08:33:37,501 INFO [train.py:996] (1/4) Epoch 6, batch 24500, loss[loss=0.2277, simple_loss=0.2958, pruned_loss=0.07977, over 21217.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3079, pruned_loss=0.07641, over 4280642.66 frames. ], batch size: 143, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:33:38,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1061838.0, ans=0.125 2023-06-24 08:34:14,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-24 08:34:33,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1061958.0, ans=0.1 2023-06-24 08:34:42,797 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-24 08:34:51,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1062018.0, ans=0.125 2023-06-24 08:35:02,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062018.0, ans=0.1 2023-06-24 08:35:28,569 INFO [train.py:996] (1/4) Epoch 6, batch 24550, loss[loss=0.2377, simple_loss=0.322, pruned_loss=0.07673, over 21575.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3093, pruned_loss=0.0782, over 4282558.12 frames. ], batch size: 230, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:35:50,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1062198.0, ans=0.125 2023-06-24 08:35:54,792 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.81 vs. limit=15.0 2023-06-24 08:36:18,658 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.580e+02 2.942e+02 3.468e+02 6.882e+02, threshold=5.884e+02, percent-clipped=1.0 2023-06-24 08:36:42,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1062318.0, ans=0.015 2023-06-24 08:36:45,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1062318.0, ans=0.0 2023-06-24 08:36:47,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1062318.0, ans=0.125 2023-06-24 08:37:10,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062378.0, ans=0.1 2023-06-24 08:37:18,642 INFO [train.py:996] (1/4) Epoch 6, batch 24600, loss[loss=0.2054, simple_loss=0.2658, pruned_loss=0.07255, over 21841.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3054, pruned_loss=0.07897, over 4270533.13 frames. 
], batch size: 98, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:37:20,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1062438.0, ans=0.1 2023-06-24 08:38:13,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-24 08:38:17,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062558.0, ans=0.1 2023-06-24 08:39:14,756 INFO [train.py:996] (1/4) Epoch 6, batch 24650, loss[loss=0.1792, simple_loss=0.2373, pruned_loss=0.06054, over 21279.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.298, pruned_loss=0.07791, over 4276248.17 frames. ], batch size: 176, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:39:20,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1062738.0, ans=0.125 2023-06-24 08:39:30,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1062798.0, ans=0.0 2023-06-24 08:39:36,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1062798.0, ans=0.1 2023-06-24 08:39:50,988 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-24 08:40:03,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.707e+02 3.176e+02 3.617e+02 5.573e+02, threshold=6.353e+02, percent-clipped=0.0 2023-06-24 08:40:24,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1062918.0, ans=0.125 2023-06-24 08:41:03,581 INFO [train.py:996] (1/4) Epoch 6, batch 24700, loss[loss=0.1974, simple_loss=0.2718, pruned_loss=0.06153, over 21572.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2959, pruned_loss=0.07574, over 4277275.86 frames. ], batch size: 263, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:41:15,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-24 08:41:55,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1063158.0, ans=0.05 2023-06-24 08:42:36,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1063278.0, ans=0.2 2023-06-24 08:42:52,507 INFO [train.py:996] (1/4) Epoch 6, batch 24750, loss[loss=0.2208, simple_loss=0.2912, pruned_loss=0.07523, over 14492.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2891, pruned_loss=0.07295, over 4265435.07 frames. 
], batch size: 60, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:43:10,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1063398.0, ans=0.125 2023-06-24 08:43:40,615 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:43:41,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.438e+02 2.880e+02 3.643e+02 9.109e+02, threshold=5.760e+02, percent-clipped=1.0 2023-06-24 08:43:58,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1063518.0, ans=0.125 2023-06-24 08:44:36,342 INFO [train.py:996] (1/4) Epoch 6, batch 24800, loss[loss=0.2026, simple_loss=0.2549, pruned_loss=0.07514, over 20212.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2841, pruned_loss=0.07296, over 4260409.93 frames. ], batch size: 703, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:44:52,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1063638.0, ans=22.5 2023-06-24 08:45:51,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1063818.0, ans=0.5 2023-06-24 08:46:22,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1063878.0, ans=0.125 2023-06-24 08:46:25,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1063938.0, ans=0.125 2023-06-24 08:46:26,849 INFO [train.py:996] (1/4) Epoch 6, batch 24850, loss[loss=0.2197, simple_loss=0.2836, pruned_loss=0.07791, over 21560.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2856, pruned_loss=0.07432, over 4272756.78 frames. ], batch size: 212, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:46:59,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1063998.0, ans=0.0 2023-06-24 08:47:09,927 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:47:21,019 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.836e+02 3.370e+02 3.940e+02 7.201e+02, threshold=6.739e+02, percent-clipped=1.0 2023-06-24 08:47:51,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1064178.0, ans=0.125 2023-06-24 08:48:21,629 INFO [train.py:996] (1/4) Epoch 6, batch 24900, loss[loss=0.2513, simple_loss=0.3202, pruned_loss=0.0912, over 21595.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2892, pruned_loss=0.07522, over 4279067.42 frames. ], batch size: 263, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:48:24,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1064238.0, ans=0.0 2023-06-24 08:48:45,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1064298.0, ans=0.125 2023-06-24 08:48:46,123 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.16 vs. 
limit=12.0 2023-06-24 08:50:14,172 INFO [train.py:996] (1/4) Epoch 6, batch 24950, loss[loss=0.2824, simple_loss=0.3437, pruned_loss=0.1105, over 21787.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.296, pruned_loss=0.07871, over 4278152.67 frames. ], batch size: 441, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:50:27,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1064538.0, ans=0.125 2023-06-24 08:51:12,088 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 2.867e+02 3.295e+02 3.992e+02 6.156e+02, threshold=6.590e+02, percent-clipped=0.0 2023-06-24 08:51:42,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1064718.0, ans=0.2 2023-06-24 08:51:42,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-24 08:51:43,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1064718.0, ans=0.1 2023-06-24 08:52:06,459 INFO [train.py:996] (1/4) Epoch 6, batch 25000, loss[loss=0.2239, simple_loss=0.2916, pruned_loss=0.07811, over 21462.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3023, pruned_loss=0.08071, over 4275928.05 frames. ], batch size: 389, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:52:12,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1064838.0, ans=0.0 2023-06-24 08:52:45,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1064898.0, ans=0.125 2023-06-24 08:53:18,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1065018.0, ans=0.1 2023-06-24 08:53:32,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-24 08:53:54,473 INFO [train.py:996] (1/4) Epoch 6, batch 25050, loss[loss=0.1788, simple_loss=0.2413, pruned_loss=0.05812, over 21478.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2954, pruned_loss=0.07928, over 4278899.43 frames. ], batch size: 212, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:54:12,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1065138.0, ans=0.0 2023-06-24 08:54:32,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1065198.0, ans=0.125 2023-06-24 08:54:34,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1065198.0, ans=0.125 2023-06-24 08:54:56,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.544e+02 2.890e+02 3.638e+02 5.399e+02, threshold=5.780e+02, percent-clipped=0.0 2023-06-24 08:55:02,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1065318.0, ans=0.125 2023-06-24 08:55:11,731 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.29 vs. 
limit=15.0 2023-06-24 08:55:18,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1065318.0, ans=0.0 2023-06-24 08:55:44,169 INFO [train.py:996] (1/4) Epoch 6, batch 25100, loss[loss=0.1825, simple_loss=0.238, pruned_loss=0.06349, over 20737.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2892, pruned_loss=0.07772, over 4276885.23 frames. ], batch size: 608, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:55:46,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1065438.0, ans=0.125 2023-06-24 08:56:19,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1065498.0, ans=0.1 2023-06-24 08:56:26,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1065498.0, ans=0.125 2023-06-24 08:57:31,181 INFO [train.py:996] (1/4) Epoch 6, batch 25150, loss[loss=0.2097, simple_loss=0.3037, pruned_loss=0.05783, over 21804.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.293, pruned_loss=0.07521, over 4280235.33 frames. ], batch size: 332, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 08:58:27,074 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.368e+02 2.837e+02 3.510e+02 8.139e+02, threshold=5.674e+02, percent-clipped=4.0 2023-06-24 08:58:40,570 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:59:20,761 INFO [train.py:996] (1/4) Epoch 6, batch 25200, loss[loss=0.2267, simple_loss=0.3122, pruned_loss=0.07055, over 21804.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2928, pruned_loss=0.07299, over 4264740.31 frames. ], batch size: 414, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 08:59:28,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1066038.0, ans=0.125 2023-06-24 09:00:05,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=22.5 2023-06-24 09:01:05,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1066278.0, ans=0.125 2023-06-24 09:01:08,295 INFO [train.py:996] (1/4) Epoch 6, batch 25250, loss[loss=0.1934, simple_loss=0.2622, pruned_loss=0.06228, over 21554.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2913, pruned_loss=0.07126, over 4263350.46 frames. 
], batch size: 263, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:01:56,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1066458.0, ans=0.1 2023-06-24 09:02:01,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1066458.0, ans=0.035 2023-06-24 09:02:12,212 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.424e+02 2.718e+02 3.085e+02 4.421e+02, threshold=5.437e+02, percent-clipped=0.0 2023-06-24 09:02:55,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1066578.0, ans=0.2 2023-06-24 09:02:57,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1066638.0, ans=0.125 2023-06-24 09:02:58,771 INFO [train.py:996] (1/4) Epoch 6, batch 25300, loss[loss=0.211, simple_loss=0.2916, pruned_loss=0.06521, over 21707.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2898, pruned_loss=0.07126, over 4243126.80 frames. ], batch size: 351, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:03:08,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1066638.0, ans=0.0 2023-06-24 09:03:43,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1066758.0, ans=0.0 2023-06-24 09:04:12,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.16 vs. limit=22.5 2023-06-24 09:04:24,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1066818.0, ans=0.0 2023-06-24 09:04:31,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1066878.0, ans=0.125 2023-06-24 09:04:48,231 INFO [train.py:996] (1/4) Epoch 6, batch 25350, loss[loss=0.24, simple_loss=0.3337, pruned_loss=0.07313, over 21198.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.294, pruned_loss=0.07178, over 4252473.28 frames. ], batch size: 548, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:05:50,928 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.518e+02 2.873e+02 3.506e+02 6.244e+02, threshold=5.746e+02, percent-clipped=2.0 2023-06-24 09:06:00,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1067118.0, ans=0.125 2023-06-24 09:06:25,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1067178.0, ans=0.125 2023-06-24 09:06:28,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0 2023-06-24 09:06:35,388 INFO [train.py:996] (1/4) Epoch 6, batch 25400, loss[loss=0.2249, simple_loss=0.2885, pruned_loss=0.08067, over 21595.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2896, pruned_loss=0.07126, over 4261928.35 frames. 
], batch size: 263, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:07:01,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1067298.0, ans=0.0 2023-06-24 09:07:32,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1067358.0, ans=0.125 2023-06-24 09:08:25,235 INFO [train.py:996] (1/4) Epoch 6, batch 25450, loss[loss=0.2034, simple_loss=0.3015, pruned_loss=0.05264, over 21815.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2904, pruned_loss=0.07233, over 4242670.20 frames. ], batch size: 282, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:08:43,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-24 09:09:30,504 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.400e+02 2.613e+02 3.023e+02 4.754e+02, threshold=5.227e+02, percent-clipped=0.0 2023-06-24 09:10:23,334 INFO [train.py:996] (1/4) Epoch 6, batch 25500, loss[loss=0.2773, simple_loss=0.3535, pruned_loss=0.1005, over 21499.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2905, pruned_loss=0.07008, over 4234824.00 frames. ], batch size: 471, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:10:25,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1067838.0, ans=0.2 2023-06-24 09:10:48,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1067898.0, ans=0.125 2023-06-24 09:10:57,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1067898.0, ans=0.125 2023-06-24 09:11:16,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1067958.0, ans=0.125 2023-06-24 09:11:43,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1068018.0, ans=0.125 2023-06-24 09:11:57,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068078.0, ans=0.1 2023-06-24 09:11:59,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1068078.0, ans=0.125 2023-06-24 09:12:14,519 INFO [train.py:996] (1/4) Epoch 6, batch 25550, loss[loss=0.2294, simple_loss=0.3201, pruned_loss=0.06936, over 21669.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2968, pruned_loss=0.07, over 4238939.65 frames. ], batch size: 263, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:13:20,093 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.387e+02 2.706e+02 3.316e+02 5.632e+02, threshold=5.413e+02, percent-clipped=1.0 2023-06-24 09:14:01,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-24 09:14:05,891 INFO [train.py:996] (1/4) Epoch 6, batch 25600, loss[loss=0.233, simple_loss=0.3273, pruned_loss=0.06934, over 17220.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3023, pruned_loss=0.07085, over 4245367.42 frames. 
], batch size: 60, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:14:18,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-24 09:14:26,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068438.0, ans=0.1 2023-06-24 09:15:08,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1068558.0, ans=0.025 2023-06-24 09:15:10,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1068558.0, ans=0.2 2023-06-24 09:15:18,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068618.0, ans=0.1 2023-06-24 09:16:00,283 INFO [train.py:996] (1/4) Epoch 6, batch 25650, loss[loss=0.2187, simple_loss=0.2774, pruned_loss=0.07999, over 21592.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3032, pruned_loss=0.07282, over 4250345.27 frames. ], batch size: 415, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:16:56,702 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 2.676e+02 3.048e+02 3.761e+02 7.606e+02, threshold=6.096e+02, percent-clipped=4.0 2023-06-24 09:17:00,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068918.0, ans=0.1 2023-06-24 09:17:26,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1068978.0, ans=0.0 2023-06-24 09:17:40,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1069038.0, ans=0.125 2023-06-24 09:17:41,416 INFO [train.py:996] (1/4) Epoch 6, batch 25700, loss[loss=0.27, simple_loss=0.3196, pruned_loss=0.1102, over 21700.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3003, pruned_loss=0.07446, over 4256644.16 frames. ], batch size: 508, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:19:35,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=15.0 2023-06-24 09:19:39,956 INFO [train.py:996] (1/4) Epoch 6, batch 25750, loss[loss=0.2151, simple_loss=0.294, pruned_loss=0.06806, over 21754.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3062, pruned_loss=0.07738, over 4257793.18 frames. 
], batch size: 298, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:20:08,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069398.0, ans=0.1 2023-06-24 09:20:27,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1069458.0, ans=0.125 2023-06-24 09:20:43,888 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.640e+02 3.088e+02 3.573e+02 6.081e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-24 09:20:48,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1069518.0, ans=0.0 2023-06-24 09:21:40,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1069638.0, ans=0.0 2023-06-24 09:21:42,002 INFO [train.py:996] (1/4) Epoch 6, batch 25800, loss[loss=0.223, simple_loss=0.2959, pruned_loss=0.075, over 20783.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3143, pruned_loss=0.08117, over 4264042.91 frames. ], batch size: 608, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:21:53,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1069638.0, ans=0.0 2023-06-24 09:21:55,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1069638.0, ans=0.0 2023-06-24 09:21:58,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1069698.0, ans=0.125 2023-06-24 09:22:22,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1069758.0, ans=0.0 2023-06-24 09:22:27,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069758.0, ans=0.1 2023-06-24 09:22:57,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1069818.0, ans=0.125 2023-06-24 09:23:04,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1069818.0, ans=0.125 2023-06-24 09:23:07,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1069878.0, ans=0.125 2023-06-24 09:23:30,424 INFO [train.py:996] (1/4) Epoch 6, batch 25850, loss[loss=0.2682, simple_loss=0.3395, pruned_loss=0.09848, over 21625.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3156, pruned_loss=0.08055, over 4267803.85 frames. ], batch size: 471, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:24:02,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069998.0, ans=0.1 2023-06-24 09:24:29,132 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.685e+02 2.967e+02 3.484e+02 6.005e+02, threshold=5.935e+02, percent-clipped=0.0 2023-06-24 09:25:16,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1070178.0, ans=0.1 2023-06-24 09:25:21,128 INFO [train.py:996] (1/4) Epoch 6, batch 25900, loss[loss=0.2175, simple_loss=0.3039, pruned_loss=0.06553, over 21217.00 frames. 
], tot_loss[loss=0.2392, simple_loss=0.3166, pruned_loss=0.08086, over 4273833.18 frames. ], batch size: 143, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:26:24,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1070358.0, ans=0.125 2023-06-24 09:26:48,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1070418.0, ans=0.0 2023-06-24 09:27:16,099 INFO [train.py:996] (1/4) Epoch 6, batch 25950, loss[loss=0.2396, simple_loss=0.3185, pruned_loss=0.08037, over 21596.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3228, pruned_loss=0.08388, over 4281616.44 frames. ], batch size: 112, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:27:20,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1070538.0, ans=0.2 2023-06-24 09:27:44,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1070598.0, ans=0.125 2023-06-24 09:28:05,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=12.0 2023-06-24 09:28:08,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1070658.0, ans=0.2 2023-06-24 09:28:20,614 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.613e+02 2.969e+02 3.394e+02 6.568e+02, threshold=5.938e+02, percent-clipped=2.0 2023-06-24 09:28:23,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1070658.0, ans=0.1 2023-06-24 09:28:23,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=1070658.0, ans=15.0 2023-06-24 09:29:06,460 INFO [train.py:996] (1/4) Epoch 6, batch 26000, loss[loss=0.2238, simple_loss=0.338, pruned_loss=0.05479, over 19778.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3228, pruned_loss=0.08167, over 4278125.19 frames. ], batch size: 702, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:29:06,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1070838.0, ans=0.125 2023-06-24 09:29:32,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1070898.0, ans=0.125 2023-06-24 09:30:26,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1071018.0, ans=0.0 2023-06-24 09:30:45,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1071078.0, ans=0.07 2023-06-24 09:31:00,839 INFO [train.py:996] (1/4) Epoch 6, batch 26050, loss[loss=0.2206, simple_loss=0.297, pruned_loss=0.07212, over 21863.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3232, pruned_loss=0.08326, over 4283993.42 frames. 
], batch size: 118, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:31:06,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1071138.0, ans=0.125 2023-06-24 09:31:08,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1071138.0, ans=0.2 2023-06-24 09:31:08,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1071138.0, ans=0.0 2023-06-24 09:31:36,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1071198.0, ans=0.0 2023-06-24 09:31:47,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1071258.0, ans=0.125 2023-06-24 09:31:58,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.591e+02 3.026e+02 3.549e+02 5.342e+02, threshold=6.052e+02, percent-clipped=0.0 2023-06-24 09:32:04,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1071318.0, ans=0.1 2023-06-24 09:32:11,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1071318.0, ans=0.125 2023-06-24 09:32:30,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1071378.0, ans=0.04949747468305833 2023-06-24 09:32:47,961 INFO [train.py:996] (1/4) Epoch 6, batch 26100, loss[loss=0.2168, simple_loss=0.2801, pruned_loss=0.07674, over 21351.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3171, pruned_loss=0.08265, over 4284690.99 frames. ], batch size: 176, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:32:52,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1071438.0, ans=0.025 2023-06-24 09:33:21,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1071498.0, ans=0.125 2023-06-24 09:33:21,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1071498.0, ans=0.0 2023-06-24 09:33:29,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-24 09:33:41,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1071558.0, ans=0.0 2023-06-24 09:33:41,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1071558.0, ans=0.04949747468305833 2023-06-24 09:34:06,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1071618.0, ans=0.0 2023-06-24 09:34:14,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-24 09:34:28,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1071678.0, ans=0.125 2023-06-24 09:34:38,495 INFO [train.py:996] (1/4) Epoch 6, batch 26150, loss[loss=0.2104, simple_loss=0.2761, pruned_loss=0.07235, over 20942.00 frames. 
], tot_loss[loss=0.2401, simple_loss=0.314, pruned_loss=0.08312, over 4288265.01 frames. ], batch size: 607, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:35:20,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1071798.0, ans=0.1 2023-06-24 09:35:23,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1071858.0, ans=0.04949747468305833 2023-06-24 09:35:39,628 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.601e+02 2.864e+02 3.408e+02 4.627e+02, threshold=5.727e+02, percent-clipped=0.0 2023-06-24 09:36:28,868 INFO [train.py:996] (1/4) Epoch 6, batch 26200, loss[loss=0.2489, simple_loss=0.3477, pruned_loss=0.07506, over 21726.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3146, pruned_loss=0.08086, over 4290111.80 frames. ], batch size: 351, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:36:34,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1072038.0, ans=0.125 2023-06-24 09:36:34,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1072038.0, ans=0.1 2023-06-24 09:36:46,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-24 09:38:22,429 INFO [train.py:996] (1/4) Epoch 6, batch 26250, loss[loss=0.2585, simple_loss=0.3266, pruned_loss=0.09522, over 21754.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3169, pruned_loss=0.07965, over 4284993.12 frames. ], batch size: 441, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:39:16,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-24 09:39:20,859 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 2.533e+02 2.809e+02 3.331e+02 4.740e+02, threshold=5.619e+02, percent-clipped=0.0 2023-06-24 09:39:31,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-24 09:39:36,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1072518.0, ans=0.1 2023-06-24 09:39:48,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-24 09:40:16,013 INFO [train.py:996] (1/4) Epoch 6, batch 26300, loss[loss=0.2358, simple_loss=0.3047, pruned_loss=0.08341, over 21756.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3153, pruned_loss=0.07975, over 4290019.40 frames. ], batch size: 389, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:41:39,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1072878.0, ans=0.125 2023-06-24 09:42:05,733 INFO [train.py:996] (1/4) Epoch 6, batch 26350, loss[loss=0.2662, simple_loss=0.3313, pruned_loss=0.1005, over 21310.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3144, pruned_loss=0.08145, over 4285587.59 frames. 
], batch size: 143, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:42:45,572 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:42:53,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1073058.0, ans=0.1 2023-06-24 09:42:58,167 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.852e+02 3.248e+02 3.843e+02 6.054e+02, threshold=6.496e+02, percent-clipped=2.0 2023-06-24 09:43:09,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1073118.0, ans=0.125 2023-06-24 09:43:53,727 INFO [train.py:996] (1/4) Epoch 6, batch 26400, loss[loss=0.2047, simple_loss=0.2693, pruned_loss=0.0701, over 21651.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3079, pruned_loss=0.08145, over 4277576.37 frames. ], batch size: 298, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:44:05,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1073238.0, ans=0.0 2023-06-24 09:45:22,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1073418.0, ans=0.125 2023-06-24 09:45:24,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1073418.0, ans=0.125 2023-06-24 09:45:49,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1073538.0, ans=0.0 2023-06-24 09:45:50,253 INFO [train.py:996] (1/4) Epoch 6, batch 26450, loss[loss=0.2824, simple_loss=0.3965, pruned_loss=0.08411, over 21204.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3068, pruned_loss=0.08085, over 4268579.26 frames. ], batch size: 549, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:46:50,154 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.812e+02 3.126e+02 4.062e+02 8.206e+02, threshold=6.252e+02, percent-clipped=4.0 2023-06-24 09:47:17,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1073778.0, ans=0.125 2023-06-24 09:47:39,830 INFO [train.py:996] (1/4) Epoch 6, batch 26500, loss[loss=0.1477, simple_loss=0.1901, pruned_loss=0.0527, over 16348.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3085, pruned_loss=0.08014, over 4263798.74 frames. ], batch size: 61, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:49:10,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1074018.0, ans=0.125 2023-06-24 09:49:24,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-24 09:49:31,874 INFO [train.py:996] (1/4) Epoch 6, batch 26550, loss[loss=0.2129, simple_loss=0.3071, pruned_loss=0.05933, over 21702.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3084, pruned_loss=0.07773, over 4265178.33 frames. 
], batch size: 415, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:50:42,535 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.610e+02 3.106e+02 3.674e+02 5.828e+02, threshold=6.212e+02, percent-clipped=0.0 2023-06-24 09:50:51,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-24 09:50:55,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1074318.0, ans=0.125 2023-06-24 09:51:09,748 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-24 09:51:26,528 INFO [train.py:996] (1/4) Epoch 6, batch 26600, loss[loss=0.1835, simple_loss=0.2665, pruned_loss=0.05024, over 21573.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3074, pruned_loss=0.07469, over 4263116.89 frames. ], batch size: 263, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:52:06,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1074498.0, ans=0.125 2023-06-24 09:52:40,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1074618.0, ans=0.1 2023-06-24 09:52:54,436 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:53:15,488 INFO [train.py:996] (1/4) Epoch 6, batch 26650, loss[loss=0.1549, simple_loss=0.2475, pruned_loss=0.03112, over 21787.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2994, pruned_loss=0.07213, over 4256833.01 frames. ], batch size: 333, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:53:41,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1074798.0, ans=0.125 2023-06-24 09:53:43,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1074798.0, ans=0.125 2023-06-24 09:54:18,557 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.261e+02 2.468e+02 2.751e+02 5.054e+02, threshold=4.936e+02, percent-clipped=0.0 2023-06-24 09:54:21,348 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-24 09:54:22,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1074918.0, ans=0.1 2023-06-24 09:54:55,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1074978.0, ans=0.04949747468305833 2023-06-24 09:55:03,187 INFO [train.py:996] (1/4) Epoch 6, batch 26700, loss[loss=0.1692, simple_loss=0.2671, pruned_loss=0.03564, over 20808.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2931, pruned_loss=0.06975, over 4258359.41 frames. ], batch size: 609, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:55:39,071 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:56:01,575 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. 
limit=15.0 2023-06-24 09:56:43,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1075278.0, ans=0.0 2023-06-24 09:56:59,422 INFO [train.py:996] (1/4) Epoch 6, batch 26750, loss[loss=0.2464, simple_loss=0.3199, pruned_loss=0.08651, over 21926.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2932, pruned_loss=0.06903, over 4266173.87 frames. ], batch size: 372, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 09:57:03,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1075338.0, ans=0.125 2023-06-24 09:57:04,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-06-24 09:57:30,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1075398.0, ans=0.125 2023-06-24 09:57:55,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.365e+02 2.700e+02 3.222e+02 4.591e+02, threshold=5.400e+02, percent-clipped=0.0 2023-06-24 09:58:46,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1075578.0, ans=0.125 2023-06-24 09:58:49,564 INFO [train.py:996] (1/4) Epoch 6, batch 26800, loss[loss=0.2077, simple_loss=0.2757, pruned_loss=0.06983, over 21970.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3005, pruned_loss=0.07325, over 4273567.15 frames. ], batch size: 98, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:59:40,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1075758.0, ans=0.0 2023-06-24 10:00:24,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.46 vs. limit=15.0 2023-06-24 10:00:30,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1075878.0, ans=0.0 2023-06-24 10:00:43,887 INFO [train.py:996] (1/4) Epoch 6, batch 26850, loss[loss=0.205, simple_loss=0.266, pruned_loss=0.07199, over 15256.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3027, pruned_loss=0.07656, over 4265146.96 frames. ], batch size: 60, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:00:46,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1075938.0, ans=0.0 2023-06-24 10:01:15,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2023-06-24 10:01:18,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1076058.0, ans=0.125 2023-06-24 10:01:32,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1076058.0, ans=0.125 2023-06-24 10:01:46,039 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.745e+02 3.127e+02 3.693e+02 5.292e+02, threshold=6.255e+02, percent-clipped=0.0 2023-06-24 10:02:25,596 INFO [train.py:996] (1/4) Epoch 6, batch 26900, loss[loss=0.1919, simple_loss=0.2519, pruned_loss=0.06594, over 21270.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2954, pruned_loss=0.07587, over 4255269.75 frames. 
], batch size: 177, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:02:32,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1076238.0, ans=0.1 2023-06-24 10:02:38,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-24 10:02:48,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-24 10:03:21,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-24 10:03:28,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-24 10:03:32,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1076418.0, ans=0.125 2023-06-24 10:03:33,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-24 10:03:42,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1076418.0, ans=0.2 2023-06-24 10:04:11,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1076478.0, ans=0.125 2023-06-24 10:04:14,909 INFO [train.py:996] (1/4) Epoch 6, batch 26950, loss[loss=0.2077, simple_loss=0.2829, pruned_loss=0.06621, over 21389.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2937, pruned_loss=0.07578, over 4262381.65 frames. ], batch size: 131, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:05:14,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1076658.0, ans=0.0 2023-06-24 10:05:26,437 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.499e+02 2.950e+02 4.079e+02 6.623e+02, threshold=5.900e+02, percent-clipped=3.0 2023-06-24 10:05:29,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1076718.0, ans=0.125 2023-06-24 10:05:46,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1076778.0, ans=0.0 2023-06-24 10:06:10,648 INFO [train.py:996] (1/4) Epoch 6, batch 27000, loss[loss=0.1918, simple_loss=0.2728, pruned_loss=0.05539, over 21109.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2947, pruned_loss=0.07389, over 4267574.93 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:06:10,649 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 10:06:28,765 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2519, simple_loss=0.3439, pruned_loss=0.0799, over 1796401.00 frames. 2023-06-24 10:06:28,766 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 10:07:15,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. 
limit=22.5 2023-06-24 10:07:30,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1076958.0, ans=0.0 2023-06-24 10:08:18,375 INFO [train.py:996] (1/4) Epoch 6, batch 27050, loss[loss=0.187, simple_loss=0.3118, pruned_loss=0.03116, over 20863.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2963, pruned_loss=0.07022, over 4270448.43 frames. ], batch size: 607, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:08:20,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1077138.0, ans=0.0 2023-06-24 10:09:09,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.14 vs. limit=5.0 2023-06-24 10:09:34,300 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.401e+02 2.781e+02 3.239e+02 4.464e+02, threshold=5.563e+02, percent-clipped=0.0 2023-06-24 10:09:45,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-06-24 10:10:08,114 INFO [train.py:996] (1/4) Epoch 6, batch 27100, loss[loss=0.246, simple_loss=0.3188, pruned_loss=0.08664, over 21787.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2989, pruned_loss=0.0709, over 4268782.00 frames. ], batch size: 441, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:10:14,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=15.0 2023-06-24 10:10:43,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1077498.0, ans=15.0 2023-06-24 10:11:58,244 INFO [train.py:996] (1/4) Epoch 6, batch 27150, loss[loss=0.2613, simple_loss=0.3383, pruned_loss=0.09218, over 21406.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3105, pruned_loss=0.07452, over 4276454.24 frames. ], batch size: 194, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:12:41,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1077798.0, ans=0.0 2023-06-24 10:13:11,202 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=22.5 2023-06-24 10:13:13,430 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.603e+02 2.899e+02 3.318e+02 5.343e+02, threshold=5.797e+02, percent-clipped=0.0 2023-06-24 10:13:14,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-24 10:13:19,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1077918.0, ans=0.1 2023-06-24 10:13:52,973 INFO [train.py:996] (1/4) Epoch 6, batch 27200, loss[loss=0.2424, simple_loss=0.3289, pruned_loss=0.07789, over 21375.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3178, pruned_loss=0.07679, over 4278530.83 frames. 
], batch size: 131, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:13:55,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1078038.0, ans=0.2 2023-06-24 10:14:25,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1078098.0, ans=0.035 2023-06-24 10:14:41,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1078158.0, ans=0.0 2023-06-24 10:15:32,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1078278.0, ans=0.05 2023-06-24 10:15:48,402 INFO [train.py:996] (1/4) Epoch 6, batch 27250, loss[loss=0.2784, simple_loss=0.3457, pruned_loss=0.1055, over 21223.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3194, pruned_loss=0.08012, over 4281112.92 frames. ], batch size: 143, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:16:20,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1078398.0, ans=0.95 2023-06-24 10:16:47,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1078518.0, ans=0.025 2023-06-24 10:16:56,094 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.982e+02 3.326e+02 3.737e+02 5.172e+02, threshold=6.652e+02, percent-clipped=0.0 2023-06-24 10:17:32,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1078578.0, ans=0.025 2023-06-24 10:17:45,561 INFO [train.py:996] (1/4) Epoch 6, batch 27300, loss[loss=0.2381, simple_loss=0.321, pruned_loss=0.07761, over 21761.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3214, pruned_loss=0.08174, over 4275607.53 frames. ], batch size: 247, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:18:52,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1078818.0, ans=0.125 2023-06-24 10:19:10,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-24 10:19:11,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1078878.0, ans=0.0 2023-06-24 10:19:33,550 INFO [train.py:996] (1/4) Epoch 6, batch 27350, loss[loss=0.2331, simple_loss=0.3185, pruned_loss=0.07385, over 21337.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3246, pruned_loss=0.08219, over 4270902.61 frames. ], batch size: 548, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:19:34,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1078938.0, ans=0.2 2023-06-24 10:19:51,726 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. 
limit=15.0 2023-06-24 10:20:37,090 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.617e+02 2.947e+02 3.408e+02 6.075e+02, threshold=5.893e+02, percent-clipped=0.0 2023-06-24 10:20:50,940 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:21:19,470 INFO [train.py:996] (1/4) Epoch 6, batch 27400, loss[loss=0.1944, simple_loss=0.2608, pruned_loss=0.06399, over 21546.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3192, pruned_loss=0.08152, over 4272032.22 frames. ], batch size: 230, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:22:14,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1079358.0, ans=0.0 2023-06-24 10:22:15,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5 2023-06-24 10:22:16,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1079358.0, ans=0.125 2023-06-24 10:22:21,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1079418.0, ans=0.1 2023-06-24 10:22:36,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1079418.0, ans=0.0 2023-06-24 10:23:06,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1079538.0, ans=0.125 2023-06-24 10:23:07,160 INFO [train.py:996] (1/4) Epoch 6, batch 27450, loss[loss=0.2095, simple_loss=0.2881, pruned_loss=0.06542, over 20063.00 frames. ], tot_loss[loss=0.236, simple_loss=0.313, pruned_loss=0.07954, over 4265717.70 frames. ], batch size: 702, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:23:18,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1079538.0, ans=0.125 2023-06-24 10:23:52,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1079658.0, ans=0.125 2023-06-24 10:23:55,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1079658.0, ans=0.125 2023-06-24 10:24:07,320 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.466e+02 2.775e+02 3.164e+02 4.697e+02, threshold=5.550e+02, percent-clipped=0.0 2023-06-24 10:24:42,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1079778.0, ans=0.125 2023-06-24 10:24:50,396 INFO [train.py:996] (1/4) Epoch 6, batch 27500, loss[loss=0.2297, simple_loss=0.2946, pruned_loss=0.08242, over 21898.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3113, pruned_loss=0.08031, over 4272260.07 frames. 
], batch size: 371, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:24:54,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1079838.0, ans=0.125 2023-06-24 10:25:04,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1079838.0, ans=0.0 2023-06-24 10:25:34,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1079958.0, ans=0.0 2023-06-24 10:25:40,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1079958.0, ans=0.1 2023-06-24 10:25:47,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-24 10:25:50,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-24 10:26:24,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1080078.0, ans=0.125 2023-06-24 10:26:34,073 INFO [train.py:996] (1/4) Epoch 6, batch 27550, loss[loss=0.2497, simple_loss=0.2922, pruned_loss=0.1036, over 21426.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3069, pruned_loss=0.07842, over 4277558.45 frames. ], batch size: 508, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:26:36,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1080138.0, ans=0.0 2023-06-24 10:27:20,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1080258.0, ans=0.125 2023-06-24 10:27:43,820 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.511e+02 2.711e+02 3.223e+02 7.892e+02, threshold=5.422e+02, percent-clipped=3.0 2023-06-24 10:27:44,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-24 10:27:59,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1080318.0, ans=0.0 2023-06-24 10:28:21,566 INFO [train.py:996] (1/4) Epoch 6, batch 27600, loss[loss=0.1987, simple_loss=0.2648, pruned_loss=0.06628, over 21358.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3, pruned_loss=0.07672, over 4279918.75 frames. ], batch size: 194, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:28:27,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1080438.0, ans=0.025 2023-06-24 10:28:28,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-24 10:28:52,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1080498.0, ans=0.2 2023-06-24 10:30:06,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1080738.0, ans=0.0 2023-06-24 10:30:08,102 INFO [train.py:996] (1/4) Epoch 6, batch 27650, loss[loss=0.2077, simple_loss=0.2969, pruned_loss=0.05925, over 21318.00 frames. 
], tot_loss[loss=0.2225, simple_loss=0.2939, pruned_loss=0.0756, over 4266391.06 frames. ], batch size: 176, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:30:25,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1080798.0, ans=0.0 2023-06-24 10:31:11,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1080918.0, ans=0.125 2023-06-24 10:31:12,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.435e+02 2.709e+02 3.081e+02 4.195e+02, threshold=5.419e+02, percent-clipped=0.0 2023-06-24 10:31:12,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1080918.0, ans=0.09899494936611666 2023-06-24 10:31:51,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1080978.0, ans=0.125 2023-06-24 10:31:56,491 INFO [train.py:996] (1/4) Epoch 6, batch 27700, loss[loss=0.2635, simple_loss=0.3447, pruned_loss=0.09117, over 21720.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2935, pruned_loss=0.0739, over 4272649.70 frames. ], batch size: 351, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:33:13,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1081218.0, ans=0.1 2023-06-24 10:33:41,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.48 vs. limit=15.0 2023-06-24 10:33:45,290 INFO [train.py:996] (1/4) Epoch 6, batch 27750, loss[loss=0.2524, simple_loss=0.3291, pruned_loss=0.0879, over 21726.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2967, pruned_loss=0.07414, over 4275346.93 frames. ], batch size: 441, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:33:46,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1081338.0, ans=0.1 2023-06-24 10:33:55,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1081338.0, ans=0.125 2023-06-24 10:34:55,023 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.574e+02 2.914e+02 3.859e+02 6.202e+02, threshold=5.827e+02, percent-clipped=2.0 2023-06-24 10:35:32,877 INFO [train.py:996] (1/4) Epoch 6, batch 27800, loss[loss=0.2241, simple_loss=0.2917, pruned_loss=0.0783, over 21904.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2954, pruned_loss=0.07422, over 4275593.80 frames. 
], batch size: 351, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:35:36,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1081638.0, ans=0.125 2023-06-24 10:35:36,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1081638.0, ans=0.1 2023-06-24 10:35:45,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1081638.0, ans=0.0 2023-06-24 10:36:45,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1081818.0, ans=0.0 2023-06-24 10:37:01,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1081818.0, ans=0.0 2023-06-24 10:37:17,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-24 10:37:21,734 INFO [train.py:996] (1/4) Epoch 6, batch 27850, loss[loss=0.2272, simple_loss=0.3356, pruned_loss=0.05942, over 19732.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.296, pruned_loss=0.07467, over 4281186.79 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:38:15,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1082058.0, ans=0.125 2023-06-24 10:38:23,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1082058.0, ans=0.125 2023-06-24 10:38:39,968 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.601e+02 3.026e+02 3.751e+02 1.054e+03, threshold=6.053e+02, percent-clipped=6.0 2023-06-24 10:39:11,466 INFO [train.py:996] (1/4) Epoch 6, batch 27900, loss[loss=0.2174, simple_loss=0.3075, pruned_loss=0.06371, over 21447.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3054, pruned_loss=0.07639, over 4279104.67 frames. ], batch size: 211, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:40:00,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1082298.0, ans=0.1 2023-06-24 10:40:02,324 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-24 10:40:47,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. limit=10.0 2023-06-24 10:41:13,475 INFO [train.py:996] (1/4) Epoch 6, batch 27950, loss[loss=0.1636, simple_loss=0.2507, pruned_loss=0.03828, over 21477.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3038, pruned_loss=0.07293, over 4278384.41 frames. 
], batch size: 212, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:41:14,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1082538.0, ans=0.125 2023-06-24 10:41:43,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1082598.0, ans=0.125 2023-06-24 10:42:18,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1082718.0, ans=0.07 2023-06-24 10:42:19,428 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.526e+02 3.218e+02 4.121e+02 6.447e+02, threshold=6.437e+02, percent-clipped=1.0 2023-06-24 10:42:47,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1082778.0, ans=0.0 2023-06-24 10:43:01,522 INFO [train.py:996] (1/4) Epoch 6, batch 28000, loss[loss=0.2195, simple_loss=0.2808, pruned_loss=0.07905, over 21284.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3017, pruned_loss=0.07091, over 4283625.55 frames. ], batch size: 176, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:43:16,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1082838.0, ans=0.1 2023-06-24 10:44:07,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1083018.0, ans=0.0 2023-06-24 10:44:14,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1083018.0, ans=0.0 2023-06-24 10:44:44,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-24 10:44:57,556 INFO [train.py:996] (1/4) Epoch 6, batch 28050, loss[loss=0.2342, simple_loss=0.2993, pruned_loss=0.08457, over 21549.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3007, pruned_loss=0.07229, over 4286585.97 frames. ], batch size: 548, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:45:03,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1083138.0, ans=0.0 2023-06-24 10:45:08,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1083138.0, ans=0.2 2023-06-24 10:45:11,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1083138.0, ans=0.125 2023-06-24 10:45:36,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-24 10:46:04,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.746e+02 3.083e+02 3.764e+02 7.718e+02, threshold=6.165e+02, percent-clipped=1.0 2023-06-24 10:46:23,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1083378.0, ans=0.1 2023-06-24 10:46:27,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1083378.0, ans=0.0 2023-06-24 10:46:45,935 INFO [train.py:996] (1/4) Epoch 6, batch 28100, loss[loss=0.188, simple_loss=0.2538, pruned_loss=0.06105, over 21175.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2977, pruned_loss=0.07207, over 4274057.83 frames. 
], batch size: 176, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:46:57,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1083438.0, ans=0.125 2023-06-24 10:47:39,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1083558.0, ans=0.0 2023-06-24 10:47:54,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1083618.0, ans=0.0 2023-06-24 10:48:33,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-24 10:48:34,046 INFO [train.py:996] (1/4) Epoch 6, batch 28150, loss[loss=0.2184, simple_loss=0.2741, pruned_loss=0.08133, over 21244.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2921, pruned_loss=0.07148, over 4272313.30 frames. ], batch size: 144, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:48:37,992 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:48:43,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1083738.0, ans=0.2 2023-06-24 10:48:57,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1083798.0, ans=0.125 2023-06-24 10:49:40,075 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.816e+02 3.227e+02 4.008e+02 8.112e+02, threshold=6.453e+02, percent-clipped=1.0 2023-06-24 10:50:03,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1083918.0, ans=0.2 2023-06-24 10:50:17,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1083978.0, ans=0.125 2023-06-24 10:50:24,154 INFO [train.py:996] (1/4) Epoch 6, batch 28200, loss[loss=0.1868, simple_loss=0.2345, pruned_loss=0.06949, over 20771.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2894, pruned_loss=0.0735, over 4276953.89 frames. ], batch size: 608, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:51:08,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1084158.0, ans=0.0 2023-06-24 10:51:20,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1084158.0, ans=0.0 2023-06-24 10:51:58,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1084278.0, ans=0.0 2023-06-24 10:52:11,999 INFO [train.py:996] (1/4) Epoch 6, batch 28250, loss[loss=0.2031, simple_loss=0.2902, pruned_loss=0.058, over 16123.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2919, pruned_loss=0.0757, over 4271558.11 frames. ], batch size: 60, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:52:32,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-24 10:52:41,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-06-24 10:53:13,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1084458.0, ans=0.2 2023-06-24 10:53:30,914 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.671e+02 3.008e+02 3.478e+02 6.433e+02, threshold=6.015e+02, percent-clipped=0.0 2023-06-24 10:53:34,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1084518.0, ans=0.015 2023-06-24 10:54:03,545 INFO [train.py:996] (1/4) Epoch 6, batch 28300, loss[loss=0.1781, simple_loss=0.2706, pruned_loss=0.04283, over 21642.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2901, pruned_loss=0.07369, over 4265450.64 frames. ], batch size: 414, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:54:10,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1084638.0, ans=0.125 2023-06-24 10:54:21,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1084638.0, ans=0.0 2023-06-24 10:54:36,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.05 vs. limit=10.0 2023-06-24 10:55:09,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1084758.0, ans=0.07 2023-06-24 10:55:18,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0 2023-06-24 10:55:25,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-24 10:55:28,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1084818.0, ans=0.04949747468305833 2023-06-24 10:55:57,134 INFO [train.py:996] (1/4) Epoch 6, batch 28350, loss[loss=0.1515, simple_loss=0.227, pruned_loss=0.03802, over 21130.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2867, pruned_loss=0.06967, over 4256402.50 frames. ], batch size: 143, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:57:04,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1085058.0, ans=0.125 2023-06-24 10:57:10,498 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.270e+02 2.582e+02 2.935e+02 5.064e+02, threshold=5.164e+02, percent-clipped=0.0 2023-06-24 10:57:11,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-24 10:57:46,640 INFO [train.py:996] (1/4) Epoch 6, batch 28400, loss[loss=0.2126, simple_loss=0.2831, pruned_loss=0.07106, over 21636.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2833, pruned_loss=0.0694, over 4264776.89 frames. 
], batch size: 298, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:57:59,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1085238.0, ans=0.125 2023-06-24 10:58:13,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1085298.0, ans=0.1 2023-06-24 10:58:15,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1085298.0, ans=0.09899494936611666 2023-06-24 10:58:28,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-24 10:59:31,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1085478.0, ans=0.125 2023-06-24 10:59:36,142 INFO [train.py:996] (1/4) Epoch 6, batch 28450, loss[loss=0.2283, simple_loss=0.295, pruned_loss=0.0808, over 21949.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2888, pruned_loss=0.07355, over 4271674.92 frames. ], batch size: 316, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 11:00:42,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.782e+02 3.154e+02 3.614e+02 5.624e+02, threshold=6.308e+02, percent-clipped=2.0 2023-06-24 11:00:45,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1085718.0, ans=0.1 2023-06-24 11:01:14,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1085778.0, ans=0.1 2023-06-24 11:01:25,033 INFO [train.py:996] (1/4) Epoch 6, batch 28500, loss[loss=0.2413, simple_loss=0.3082, pruned_loss=0.0872, over 21320.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2926, pruned_loss=0.07616, over 4280470.61 frames. ], batch size: 176, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:01:27,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1085838.0, ans=0.125 2023-06-24 11:02:46,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1086018.0, ans=0.2 2023-06-24 11:02:49,591 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-06-24 11:03:17,201 INFO [train.py:996] (1/4) Epoch 6, batch 28550, loss[loss=0.2656, simple_loss=0.3542, pruned_loss=0.0885, over 21764.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3018, pruned_loss=0.07891, over 4285913.59 frames. ], batch size: 351, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:03:21,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-24 11:04:31,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0 2023-06-24 11:04:31,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.62 vs. 
limit=12.0 2023-06-24 11:04:37,434 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.822e+02 3.220e+02 3.775e+02 6.822e+02, threshold=6.440e+02, percent-clipped=1.0 2023-06-24 11:05:12,854 INFO [train.py:996] (1/4) Epoch 6, batch 28600, loss[loss=0.2431, simple_loss=0.3119, pruned_loss=0.08712, over 21862.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3087, pruned_loss=0.08121, over 4287582.38 frames. ], batch size: 372, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:05:32,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1086438.0, ans=0.1 2023-06-24 11:05:41,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1086498.0, ans=0.0 2023-06-24 11:06:07,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-24 11:06:22,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1086618.0, ans=0.2 2023-06-24 11:06:23,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-24 11:06:38,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1086678.0, ans=0.2 2023-06-24 11:06:45,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1086678.0, ans=0.1 2023-06-24 11:06:50,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1086678.0, ans=0.0 2023-06-24 11:07:06,292 INFO [train.py:996] (1/4) Epoch 6, batch 28650, loss[loss=0.2106, simple_loss=0.275, pruned_loss=0.0731, over 21674.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3049, pruned_loss=0.0802, over 4288059.78 frames. ], batch size: 333, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:07:10,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-24 11:08:14,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.693e+02 2.997e+02 3.394e+02 5.567e+02, threshold=5.993e+02, percent-clipped=0.0 2023-06-24 11:08:54,850 INFO [train.py:996] (1/4) Epoch 6, batch 28700, loss[loss=0.2398, simple_loss=0.3128, pruned_loss=0.08336, over 21864.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3036, pruned_loss=0.08086, over 4292222.62 frames. ], batch size: 371, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:09:04,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1087038.0, ans=0.2 2023-06-24 11:09:21,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087098.0, ans=0.1 2023-06-24 11:09:26,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. 
limit=15.0 2023-06-24 11:09:32,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1087098.0, ans=0.0 2023-06-24 11:10:00,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1087218.0, ans=0.125 2023-06-24 11:10:34,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1087278.0, ans=0.2 2023-06-24 11:10:44,944 INFO [train.py:996] (1/4) Epoch 6, batch 28750, loss[loss=0.2276, simple_loss=0.3162, pruned_loss=0.06952, over 21683.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3043, pruned_loss=0.08082, over 4284515.71 frames. ], batch size: 389, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:11:04,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1087398.0, ans=0.0 2023-06-24 11:11:13,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1087398.0, ans=0.2 2023-06-24 11:11:22,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087398.0, ans=0.1 2023-06-24 11:11:52,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1087518.0, ans=0.0 2023-06-24 11:11:53,478 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.664e+02 3.026e+02 3.382e+02 4.910e+02, threshold=6.051e+02, percent-clipped=0.0 2023-06-24 11:12:17,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=22.5 2023-06-24 11:12:33,148 INFO [train.py:996] (1/4) Epoch 6, batch 28800, loss[loss=0.2683, simple_loss=0.3473, pruned_loss=0.09463, over 21826.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3074, pruned_loss=0.08143, over 4290038.63 frames. ], batch size: 124, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:12:38,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1087638.0, ans=0.125 2023-06-24 11:12:39,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=15.0 2023-06-24 11:13:33,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0 2023-06-24 11:13:56,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087878.0, ans=0.1 2023-06-24 11:14:11,759 INFO [train.py:996] (1/4) Epoch 6, batch 28850, loss[loss=0.3179, simple_loss=0.4303, pruned_loss=0.1028, over 19656.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3094, pruned_loss=0.08298, over 4289772.16 frames. 
], batch size: 702, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:14:36,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1087938.0, ans=0.125 2023-06-24 11:14:40,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1087998.0, ans=0.0 2023-06-24 11:14:51,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087998.0, ans=0.1 2023-06-24 11:14:56,573 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:15:25,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.795e+02 3.097e+02 3.558e+02 6.026e+02, threshold=6.195e+02, percent-clipped=0.0 2023-06-24 11:15:52,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1088178.0, ans=0.125 2023-06-24 11:16:00,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1088238.0, ans=0.125 2023-06-24 11:16:01,656 INFO [train.py:996] (1/4) Epoch 6, batch 28900, loss[loss=0.2481, simple_loss=0.3181, pruned_loss=0.08907, over 21366.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.313, pruned_loss=0.08492, over 4291575.59 frames. ], batch size: 548, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:16:54,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1088358.0, ans=0.2 2023-06-24 11:17:24,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-24 11:17:57,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-24 11:18:05,891 INFO [train.py:996] (1/4) Epoch 6, batch 28950, loss[loss=0.2239, simple_loss=0.2993, pruned_loss=0.07425, over 21734.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3124, pruned_loss=0.08412, over 4288449.64 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:18:42,512 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:19:04,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1088658.0, ans=0.1 2023-06-24 11:19:17,995 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.066e+02 3.487e+02 4.356e+02 7.485e+02, threshold=6.974e+02, percent-clipped=4.0 2023-06-24 11:19:57,026 INFO [train.py:996] (1/4) Epoch 6, batch 29000, loss[loss=0.3043, simple_loss=0.3678, pruned_loss=0.1204, over 21348.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3143, pruned_loss=0.08288, over 4280186.98 frames. 
], batch size: 507, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:20:24,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1088898.0, ans=0.125 2023-06-24 11:21:33,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1089078.0, ans=0.125 2023-06-24 11:21:39,611 INFO [train.py:996] (1/4) Epoch 6, batch 29050, loss[loss=0.2297, simple_loss=0.2941, pruned_loss=0.08268, over 21361.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3125, pruned_loss=0.08246, over 4286604.26 frames. ], batch size: 176, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:21:50,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1089138.0, ans=0.2 2023-06-24 11:21:51,342 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.81 vs. limit=6.0 2023-06-24 11:22:33,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1089258.0, ans=0.0 2023-06-24 11:22:33,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1089258.0, ans=0.125 2023-06-24 11:22:37,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1089258.0, ans=0.125 2023-06-24 11:22:59,672 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.602e+02 2.960e+02 3.468e+02 4.732e+02, threshold=5.920e+02, percent-clipped=0.0 2023-06-24 11:23:26,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1089438.0, ans=0.04949747468305833 2023-06-24 11:23:27,546 INFO [train.py:996] (1/4) Epoch 6, batch 29100, loss[loss=0.1911, simple_loss=0.2606, pruned_loss=0.06085, over 21521.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3037, pruned_loss=0.08039, over 4283886.36 frames. ], batch size: 391, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:23:34,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-24 11:23:35,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1089438.0, ans=0.025 2023-06-24 11:24:14,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1089558.0, ans=0.1 2023-06-24 11:24:53,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-24 11:24:58,521 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-24 11:25:10,506 INFO [train.py:996] (1/4) Epoch 6, batch 29150, loss[loss=0.227, simple_loss=0.2997, pruned_loss=0.07711, over 21550.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3015, pruned_loss=0.07823, over 4278065.81 frames. 
], batch size: 230, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:25:15,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-24 11:25:35,804 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2023-06-24 11:26:07,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1089858.0, ans=0.125 2023-06-24 11:26:21,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1089918.0, ans=0.1 2023-06-24 11:26:30,814 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.501e+02 2.832e+02 3.252e+02 5.475e+02, threshold=5.663e+02, percent-clipped=0.0 2023-06-24 11:26:55,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1089978.0, ans=0.0 2023-06-24 11:26:58,347 INFO [train.py:996] (1/4) Epoch 6, batch 29200, loss[loss=0.1844, simple_loss=0.2541, pruned_loss=0.05733, over 21567.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2967, pruned_loss=0.07717, over 4274674.69 frames. ], batch size: 231, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:27:00,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1090038.0, ans=0.0 2023-06-24 11:27:44,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1090158.0, ans=0.0 2023-06-24 11:27:50,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1090158.0, ans=0.2 2023-06-24 11:27:55,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1090158.0, ans=0.0 2023-06-24 11:28:07,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1090158.0, ans=0.0 2023-06-24 11:28:47,227 INFO [train.py:996] (1/4) Epoch 6, batch 29250, loss[loss=0.2022, simple_loss=0.2959, pruned_loss=0.05424, over 20805.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2946, pruned_loss=0.07459, over 4274974.27 frames. ], batch size: 608, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:30:08,090 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.479e+02 2.949e+02 4.059e+02 6.998e+02, threshold=5.898e+02, percent-clipped=9.0 2023-06-24 11:30:40,986 INFO [train.py:996] (1/4) Epoch 6, batch 29300, loss[loss=0.2031, simple_loss=0.2785, pruned_loss=0.06388, over 21265.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.297, pruned_loss=0.07377, over 4273694.20 frames. ], batch size: 549, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:31:03,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. 
limit=15.0 2023-06-24 11:31:50,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1090818.0, ans=0.2 2023-06-24 11:32:29,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1090938.0, ans=0.0 2023-06-24 11:32:31,104 INFO [train.py:996] (1/4) Epoch 6, batch 29350, loss[loss=0.2464, simple_loss=0.3284, pruned_loss=0.08219, over 21532.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2934, pruned_loss=0.07308, over 4281211.44 frames. ], batch size: 441, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:32:45,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1090938.0, ans=0.2 2023-06-24 11:32:47,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1090938.0, ans=0.2 2023-06-24 11:32:47,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1090938.0, ans=0.0 2023-06-24 11:33:43,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.584e+02 3.038e+02 3.610e+02 5.891e+02, threshold=6.076e+02, percent-clipped=0.0 2023-06-24 11:33:43,977 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:34:21,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1091238.0, ans=0.0 2023-06-24 11:34:22,817 INFO [train.py:996] (1/4) Epoch 6, batch 29400, loss[loss=0.1549, simple_loss=0.2121, pruned_loss=0.04882, over 21169.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2937, pruned_loss=0.07161, over 4283518.69 frames. ], batch size: 159, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:34:37,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1091238.0, ans=0.2 2023-06-24 11:34:59,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-24 11:35:16,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1091358.0, ans=0.1 2023-06-24 11:35:19,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1091358.0, ans=0.1 2023-06-24 11:36:12,516 INFO [train.py:996] (1/4) Epoch 6, batch 29450, loss[loss=0.3211, simple_loss=0.398, pruned_loss=0.1221, over 21838.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2924, pruned_loss=0.07094, over 4279082.93 frames. 
], batch size: 124, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:36:50,991 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:36:51,034 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:37:12,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1091658.0, ans=0.2 2023-06-24 11:37:23,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1091718.0, ans=0.125 2023-06-24 11:37:26,462 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.589e+02 2.908e+02 3.358e+02 5.330e+02, threshold=5.817e+02, percent-clipped=0.0 2023-06-24 11:37:27,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1091718.0, ans=0.125 2023-06-24 11:38:00,158 INFO [train.py:996] (1/4) Epoch 6, batch 29500, loss[loss=0.2661, simple_loss=0.323, pruned_loss=0.1046, over 21596.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2975, pruned_loss=0.0745, over 4283140.47 frames. ], batch size: 471, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:38:23,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1091898.0, ans=0.0 2023-06-24 11:39:32,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1092078.0, ans=0.125 2023-06-24 11:39:49,832 INFO [train.py:996] (1/4) Epoch 6, batch 29550, loss[loss=0.2112, simple_loss=0.2782, pruned_loss=0.0721, over 21870.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2965, pruned_loss=0.07544, over 4289017.43 frames. ], batch size: 298, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:39:52,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1092138.0, ans=0.125 2023-06-24 11:39:53,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1092138.0, ans=0.125 2023-06-24 11:39:56,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.06 vs. limit=12.0 2023-06-24 11:40:11,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1092198.0, ans=0.125 2023-06-24 11:40:13,705 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-24 11:40:40,718 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.24 vs. 
limit=15.0 2023-06-24 11:40:43,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1092258.0, ans=0.0 2023-06-24 11:41:02,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1092318.0, ans=0.1 2023-06-24 11:41:05,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.863e+02 3.307e+02 3.931e+02 5.796e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-24 11:41:38,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.21 vs. limit=10.0 2023-06-24 11:41:39,640 INFO [train.py:996] (1/4) Epoch 6, batch 29600, loss[loss=0.2392, simple_loss=0.3158, pruned_loss=0.08131, over 21397.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3036, pruned_loss=0.07866, over 4293236.31 frames. ], batch size: 211, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:42:09,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1092498.0, ans=0.2 2023-06-24 11:43:26,425 INFO [train.py:996] (1/4) Epoch 6, batch 29650, loss[loss=0.2397, simple_loss=0.3067, pruned_loss=0.08633, over 21786.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3004, pruned_loss=0.07497, over 4293859.46 frames. ], batch size: 112, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:43:36,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-24 11:43:58,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1092798.0, ans=0.125 2023-06-24 11:44:07,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-24 11:44:12,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1092858.0, ans=0.2 2023-06-24 11:44:47,010 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.546e+02 3.028e+02 3.755e+02 5.764e+02, threshold=6.055e+02, percent-clipped=0.0 2023-06-24 11:45:06,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1092978.0, ans=0.125 2023-06-24 11:45:14,620 INFO [train.py:996] (1/4) Epoch 6, batch 29700, loss[loss=0.2143, simple_loss=0.3343, pruned_loss=0.04718, over 19779.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3007, pruned_loss=0.07447, over 4287224.99 frames. 
], batch size: 702, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:45:38,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1093038.0, ans=0.125 2023-06-24 11:45:51,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1093098.0, ans=0.125 2023-06-24 11:45:54,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1093098.0, ans=0.125 2023-06-24 11:46:28,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1093218.0, ans=0.125 2023-06-24 11:46:37,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1093218.0, ans=0.2 2023-06-24 11:47:02,446 INFO [train.py:996] (1/4) Epoch 6, batch 29750, loss[loss=0.1914, simple_loss=0.2695, pruned_loss=0.05671, over 16509.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3059, pruned_loss=0.07394, over 4282262.92 frames. ], batch size: 60, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:47:28,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1093398.0, ans=0.125 2023-06-24 11:47:50,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1093458.0, ans=0.07 2023-06-24 11:48:03,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-24 11:48:07,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1093518.0, ans=0.125 2023-06-24 11:48:23,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.425e+02 2.693e+02 3.074e+02 5.352e+02, threshold=5.385e+02, percent-clipped=0.0 2023-06-24 11:48:54,212 INFO [train.py:996] (1/4) Epoch 6, batch 29800, loss[loss=0.2191, simple_loss=0.2931, pruned_loss=0.07251, over 21482.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.308, pruned_loss=0.07536, over 4285411.41 frames. ], batch size: 211, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:49:08,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1093638.0, ans=0.125 2023-06-24 11:49:17,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1093698.0, ans=0.0 2023-06-24 11:50:09,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-24 11:50:34,853 INFO [train.py:996] (1/4) Epoch 6, batch 29850, loss[loss=0.2137, simple_loss=0.2893, pruned_loss=0.06905, over 21782.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3034, pruned_loss=0.07319, over 4286877.77 frames. 
], batch size: 414, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:50:55,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1093938.0, ans=0.1 2023-06-24 11:50:58,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1093938.0, ans=0.5 2023-06-24 11:51:04,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.84 vs. limit=22.5 2023-06-24 11:51:20,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1094058.0, ans=0.95 2023-06-24 11:51:28,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1094058.0, ans=0.2 2023-06-24 11:51:54,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1094118.0, ans=0.0 2023-06-24 11:51:55,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.491e+02 2.734e+02 3.399e+02 8.130e+02, threshold=5.469e+02, percent-clipped=4.0 2023-06-24 11:51:58,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-24 11:52:02,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1094178.0, ans=0.125 2023-06-24 11:52:09,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1094178.0, ans=0.0 2023-06-24 11:52:26,428 INFO [train.py:996] (1/4) Epoch 6, batch 29900, loss[loss=0.2489, simple_loss=0.3186, pruned_loss=0.08957, over 21319.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3023, pruned_loss=0.07436, over 4290454.64 frames. ], batch size: 176, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:52:50,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1094298.0, ans=0.05 2023-06-24 11:52:52,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1094298.0, ans=0.125 2023-06-24 11:53:31,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1094358.0, ans=0.2 2023-06-24 11:54:02,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094478.0, ans=0.1 2023-06-24 11:54:23,082 INFO [train.py:996] (1/4) Epoch 6, batch 29950, loss[loss=0.2542, simple_loss=0.359, pruned_loss=0.07471, over 17886.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3055, pruned_loss=0.07793, over 4281463.40 frames. ], batch size: 62, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:54:32,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1094538.0, ans=0.05 2023-06-24 11:54:36,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.93 vs. 
limit=10.0 2023-06-24 11:55:15,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1094658.0, ans=0.0 2023-06-24 11:55:15,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1094658.0, ans=0.0 2023-06-24 11:55:33,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094718.0, ans=0.1 2023-06-24 11:55:41,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 2.840e+02 3.123e+02 3.616e+02 5.024e+02, threshold=6.246e+02, percent-clipped=0.0 2023-06-24 11:55:44,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1094718.0, ans=0.0 2023-06-24 11:56:13,753 INFO [train.py:996] (1/4) Epoch 6, batch 30000, loss[loss=0.2166, simple_loss=0.3038, pruned_loss=0.06466, over 21791.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3075, pruned_loss=0.07805, over 4277402.28 frames. ], batch size: 282, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:56:13,753 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 11:56:30,096 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.0714, 2.7195, 4.2384, 3.1677], device='cuda:1') 2023-06-24 11:56:34,158 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2459, simple_loss=0.3437, pruned_loss=0.07409, over 1796401.00 frames. 2023-06-24 11:56:34,159 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 11:56:59,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1094898.0, ans=0.125 2023-06-24 11:57:16,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1094958.0, ans=0.125 2023-06-24 11:57:18,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1094958.0, ans=0.0 2023-06-24 11:57:41,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.69 vs. limit=10.0 2023-06-24 11:58:36,148 INFO [train.py:996] (1/4) Epoch 6, batch 30050, loss[loss=0.2638, simple_loss=0.3692, pruned_loss=0.07917, over 21842.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3124, pruned_loss=0.07616, over 4275083.33 frames. 
], batch size: 371, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:58:47,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1095138.0, ans=0.1 2023-06-24 11:59:16,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1095198.0, ans=0.2 2023-06-24 11:59:26,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1095258.0, ans=0.2 2023-06-24 11:59:35,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1095258.0, ans=0.125 2023-06-24 11:59:55,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.460e+02 2.888e+02 3.811e+02 6.345e+02, threshold=5.776e+02, percent-clipped=1.0 2023-06-24 12:00:16,438 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-24 12:00:18,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1095378.0, ans=0.125 2023-06-24 12:00:24,949 INFO [train.py:996] (1/4) Epoch 6, batch 30100, loss[loss=0.2052, simple_loss=0.2716, pruned_loss=0.06941, over 21582.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3107, pruned_loss=0.07588, over 4272728.06 frames. ], batch size: 247, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:00:38,520 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-24 12:01:05,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-24 12:01:12,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1095558.0, ans=0.0 2023-06-24 12:02:15,860 INFO [train.py:996] (1/4) Epoch 6, batch 30150, loss[loss=0.2649, simple_loss=0.344, pruned_loss=0.09293, over 21833.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3067, pruned_loss=0.07761, over 4265622.59 frames. ], batch size: 124, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:02:27,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1095738.0, ans=0.1 2023-06-24 12:02:42,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1095738.0, ans=0.2 2023-06-24 12:03:24,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1095858.0, ans=0.125 2023-06-24 12:03:44,045 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 2.662e+02 2.970e+02 3.572e+02 6.402e+02, threshold=5.941e+02, percent-clipped=1.0 2023-06-24 12:03:55,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1095978.0, ans=0.07 2023-06-24 12:04:19,440 INFO [train.py:996] (1/4) Epoch 6, batch 30200, loss[loss=0.2109, simple_loss=0.2924, pruned_loss=0.06464, over 21235.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3092, pruned_loss=0.07647, over 4271625.20 frames. 
], batch size: 176, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:05:01,647 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-24 12:06:10,627 INFO [train.py:996] (1/4) Epoch 6, batch 30250, loss[loss=0.2578, simple_loss=0.341, pruned_loss=0.08727, over 19964.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3178, pruned_loss=0.07907, over 4271044.31 frames. ], batch size: 702, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:06:25,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1096338.0, ans=0.125 2023-06-24 12:06:30,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1096338.0, ans=0.1 2023-06-24 12:06:32,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1096398.0, ans=0.025 2023-06-24 12:06:46,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1096398.0, ans=0.2 2023-06-24 12:06:55,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.51 vs. limit=15.0 2023-06-24 12:07:27,928 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 2.716e+02 3.093e+02 3.619e+02 5.439e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-24 12:07:57,887 INFO [train.py:996] (1/4) Epoch 6, batch 30300, loss[loss=0.2248, simple_loss=0.2889, pruned_loss=0.08037, over 21552.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3148, pruned_loss=0.07914, over 4257274.92 frames. ], batch size: 414, lr: 4.88e-03, grad_scale: 16.0 2023-06-24 12:08:28,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1096698.0, ans=0.95 2023-06-24 12:09:02,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1096758.0, ans=15.0 2023-06-24 12:09:53,995 INFO [train.py:996] (1/4) Epoch 6, batch 30350, loss[loss=0.2131, simple_loss=0.2878, pruned_loss=0.0692, over 21489.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3165, pruned_loss=0.08071, over 4264101.76 frames. ], batch size: 211, lr: 4.88e-03, grad_scale: 16.0 2023-06-24 12:09:55,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1096938.0, ans=0.0 2023-06-24 12:10:56,358 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.694e+02 3.043e+02 3.524e+02 5.331e+02, threshold=6.085e+02, percent-clipped=0.0 2023-06-24 12:11:08,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1097178.0, ans=0.125 2023-06-24 12:11:15,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1097178.0, ans=0.2 2023-06-24 12:11:27,900 INFO [train.py:996] (1/4) Epoch 6, batch 30400, loss[loss=0.2151, simple_loss=0.2631, pruned_loss=0.08358, over 20205.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3081, pruned_loss=0.07898, over 4255875.43 frames. 
], batch size: 703, lr: 4.88e-03, grad_scale: 32.0 2023-06-24 12:11:33,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1097238.0, ans=0.125 2023-06-24 12:11:34,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-24 12:11:36,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1097238.0, ans=0.125 2023-06-24 12:11:39,583 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:12:07,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1097358.0, ans=0.1 2023-06-24 12:12:49,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1097478.0, ans=0.04949747468305833 2023-06-24 12:12:57,209 INFO [train.py:996] (1/4) Epoch 6, batch 30450, loss[loss=0.2759, simple_loss=0.3945, pruned_loss=0.07867, over 19909.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3087, pruned_loss=0.07872, over 4197336.21 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 32.0 2023-06-24 12:13:36,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1097658.0, ans=0.0 2023-06-24 12:13:36,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097658.0, ans=0.1 2023-06-24 12:13:49,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1097718.0, ans=6.0 2023-06-24 12:13:56,608 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 4.419e+02 5.663e+02 8.899e+02 2.204e+03, threshold=1.133e+03, percent-clipped=46.0 2023-06-24 12:16:21,099 INFO [train.py:996] (1/4) Epoch 7, batch 0, loss[loss=0.2115, simple_loss=0.2766, pruned_loss=0.07323, over 21472.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2766, pruned_loss=0.07323, over 21472.00 frames. ], batch size: 195, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:16:21,100 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 12:16:38,594 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2421, simple_loss=0.346, pruned_loss=0.0691, over 1796401.00 frames. 2023-06-24 12:16:38,595 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 12:18:05,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1097982.0, ans=0.1 2023-06-24 12:18:20,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1098042.0, ans=0.125 2023-06-24 12:18:25,380 INFO [train.py:996] (1/4) Epoch 7, batch 50, loss[loss=0.2568, simple_loss=0.3645, pruned_loss=0.07454, over 21730.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3114, pruned_loss=0.07642, over 956469.96 frames. 
], batch size: 247, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:19:13,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1098222.0, ans=0.125 2023-06-24 12:20:01,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.689e+02 3.085e+02 3.734e+02 9.044e+02, threshold=6.169e+02, percent-clipped=0.0 2023-06-24 12:20:13,726 INFO [train.py:996] (1/4) Epoch 7, batch 100, loss[loss=0.2298, simple_loss=0.307, pruned_loss=0.07628, over 21468.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3246, pruned_loss=0.0789, over 1688190.78 frames. ], batch size: 211, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:20:57,270 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:21:04,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. limit=10.0 2023-06-24 12:22:00,445 INFO [train.py:996] (1/4) Epoch 7, batch 150, loss[loss=0.1899, simple_loss=0.2511, pruned_loss=0.06437, over 15264.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3252, pruned_loss=0.07784, over 2259468.71 frames. ], batch size: 60, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:22:16,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1098762.0, ans=0.0 2023-06-24 12:23:36,474 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.604e+02 2.896e+02 3.363e+02 6.379e+02, threshold=5.792e+02, percent-clipped=1.0 2023-06-24 12:23:40,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-24 12:23:45,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1098942.0, ans=0.125 2023-06-24 12:23:47,925 INFO [train.py:996] (1/4) Epoch 7, batch 200, loss[loss=0.1844, simple_loss=0.2692, pruned_loss=0.04976, over 21451.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3225, pruned_loss=0.07811, over 2708899.70 frames. ], batch size: 195, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:23:52,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1099002.0, ans=0.125 2023-06-24 12:24:16,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1099062.0, ans=0.125 2023-06-24 12:24:40,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1099122.0, ans=6.0 2023-06-24 12:24:55,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1099182.0, ans=0.125 2023-06-24 12:25:31,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1099242.0, ans=0.1 2023-06-24 12:25:36,653 INFO [train.py:996] (1/4) Epoch 7, batch 250, loss[loss=0.2216, simple_loss=0.2832, pruned_loss=0.07999, over 21476.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3187, pruned_loss=0.0773, over 3057594.54 frames. 
], batch size: 194, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:25:39,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-24 12:26:16,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099362.0, ans=0.1 2023-06-24 12:26:44,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1099482.0, ans=0.025 2023-06-24 12:26:44,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1099482.0, ans=0.0 2023-06-24 12:26:45,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-24 12:27:08,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1099542.0, ans=0.0 2023-06-24 12:27:14,905 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.504e+02 2.848e+02 3.185e+02 4.478e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-24 12:27:27,342 INFO [train.py:996] (1/4) Epoch 7, batch 300, loss[loss=0.196, simple_loss=0.2615, pruned_loss=0.06523, over 21292.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3127, pruned_loss=0.0765, over 3327769.72 frames. ], batch size: 159, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:28:29,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1099722.0, ans=0.125 2023-06-24 12:29:18,767 INFO [train.py:996] (1/4) Epoch 7, batch 350, loss[loss=0.2172, simple_loss=0.289, pruned_loss=0.07266, over 21639.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3067, pruned_loss=0.07518, over 3539646.74 frames. ], batch size: 415, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:29:25,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1099902.0, ans=0.125 2023-06-24 12:30:09,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1099962.0, ans=0.1 2023-06-24 12:30:43,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1100082.0, ans=0.125 2023-06-24 12:30:58,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.718e+02 3.112e+02 3.692e+02 6.265e+02, threshold=6.224e+02, percent-clipped=2.0 2023-06-24 12:31:03,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1100142.0, ans=0.07 2023-06-24 12:31:11,306 INFO [train.py:996] (1/4) Epoch 7, batch 400, loss[loss=0.2295, simple_loss=0.3006, pruned_loss=0.07921, over 21817.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3031, pruned_loss=0.07413, over 3701823.52 frames. 
], batch size: 118, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:32:08,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1100322.0, ans=0.125 2023-06-24 12:32:44,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1100442.0, ans=0.125 2023-06-24 12:32:53,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1100442.0, ans=0.2 2023-06-24 12:33:02,119 INFO [train.py:996] (1/4) Epoch 7, batch 450, loss[loss=0.2145, simple_loss=0.3231, pruned_loss=0.05293, over 21659.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.299, pruned_loss=0.0729, over 3835517.57 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:33:48,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1100562.0, ans=0.125 2023-06-24 12:34:28,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1100682.0, ans=0.125 2023-06-24 12:34:34,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1100742.0, ans=0.1 2023-06-24 12:34:40,555 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.615e+02 3.361e+02 4.061e+02 5.988e+02, threshold=6.722e+02, percent-clipped=0.0 2023-06-24 12:34:57,198 INFO [train.py:996] (1/4) Epoch 7, batch 500, loss[loss=0.1755, simple_loss=0.2261, pruned_loss=0.06246, over 20739.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2986, pruned_loss=0.07197, over 3932696.14 frames. ], batch size: 609, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:36:11,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-24 12:36:12,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1100982.0, ans=0.0 2023-06-24 12:36:30,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1101042.0, ans=0.0 2023-06-24 12:36:46,115 INFO [train.py:996] (1/4) Epoch 7, batch 550, loss[loss=0.2486, simple_loss=0.353, pruned_loss=0.07208, over 21772.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2986, pruned_loss=0.07159, over 4013731.64 frames. ], batch size: 282, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:36:57,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1101102.0, ans=0.0 2023-06-24 12:37:23,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-24 12:37:49,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1101282.0, ans=0.0 2023-06-24 12:37:54,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.12 vs. 
limit=15.0 2023-06-24 12:38:14,377 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.641e+02 3.136e+02 3.627e+02 5.437e+02, threshold=6.272e+02, percent-clipped=0.0 2023-06-24 12:38:28,514 INFO [train.py:996] (1/4) Epoch 7, batch 600, loss[loss=0.2577, simple_loss=0.3799, pruned_loss=0.06774, over 19756.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3044, pruned_loss=0.07152, over 4070684.84 frames. ], batch size: 702, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:38:48,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1101402.0, ans=0.125 2023-06-24 12:38:58,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=22.5 2023-06-24 12:39:04,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1101462.0, ans=0.5 2023-06-24 12:39:06,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-24 12:39:11,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1101462.0, ans=0.0 2023-06-24 12:39:21,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1101522.0, ans=0.125 2023-06-24 12:39:36,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1101582.0, ans=0.2 2023-06-24 12:40:00,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-24 12:40:04,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1101642.0, ans=0.125 2023-06-24 12:40:08,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1101642.0, ans=0.2 2023-06-24 12:40:16,851 INFO [train.py:996] (1/4) Epoch 7, batch 650, loss[loss=0.2576, simple_loss=0.3051, pruned_loss=0.1051, over 21757.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3061, pruned_loss=0.07235, over 4121382.07 frames. ], batch size: 508, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:40:52,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1101762.0, ans=0.1 2023-06-24 12:40:54,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1101762.0, ans=0.07 2023-06-24 12:41:20,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1101822.0, ans=0.125 2023-06-24 12:41:25,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1101882.0, ans=0.125 2023-06-24 12:41:51,779 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.716e+02 3.087e+02 3.645e+02 5.920e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-24 12:42:05,932 INFO [train.py:996] (1/4) Epoch 7, batch 700, loss[loss=0.2134, simple_loss=0.2825, pruned_loss=0.0722, over 21716.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3084, pruned_loss=0.07287, over 4161567.88 frames. 
], batch size: 230, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:42:08,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1102002.0, ans=0.1 2023-06-24 12:43:05,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1102122.0, ans=0.1 2023-06-24 12:43:28,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1102182.0, ans=0.0 2023-06-24 12:43:44,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1102242.0, ans=0.025 2023-06-24 12:43:59,372 INFO [train.py:996] (1/4) Epoch 7, batch 750, loss[loss=0.2323, simple_loss=0.3038, pruned_loss=0.0804, over 21828.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3088, pruned_loss=0.07366, over 4182014.69 frames. ], batch size: 118, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:44:25,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1102362.0, ans=0.0 2023-06-24 12:44:44,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1102422.0, ans=0.2 2023-06-24 12:45:06,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1102482.0, ans=0.125 2023-06-24 12:45:28,793 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.942e+02 3.385e+02 4.235e+02 7.679e+02, threshold=6.771e+02, percent-clipped=3.0 2023-06-24 12:45:43,017 INFO [train.py:996] (1/4) Epoch 7, batch 800, loss[loss=0.2539, simple_loss=0.3391, pruned_loss=0.08431, over 21769.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3049, pruned_loss=0.07401, over 4195484.95 frames. ], batch size: 298, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:45:59,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.79 vs. limit=6.0 2023-06-24 12:46:12,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1102662.0, ans=0.0 2023-06-24 12:46:14,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1102662.0, ans=0.125 2023-06-24 12:46:31,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1102662.0, ans=0.1 2023-06-24 12:46:40,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1102722.0, ans=0.125 2023-06-24 12:46:44,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1102722.0, ans=0.0 2023-06-24 12:46:44,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1102722.0, ans=0.0 2023-06-24 12:47:38,956 INFO [train.py:996] (1/4) Epoch 7, batch 850, loss[loss=0.215, simple_loss=0.2838, pruned_loss=0.07307, over 21818.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3031, pruned_loss=0.0742, over 4218109.08 frames. 
], batch size: 247, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:47:49,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1102902.0, ans=0.125 2023-06-24 12:48:19,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-24 12:49:07,527 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.715e+02 3.192e+02 3.563e+02 7.547e+02, threshold=6.383e+02, percent-clipped=1.0 2023-06-24 12:49:24,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1103142.0, ans=0.015 2023-06-24 12:49:27,462 INFO [train.py:996] (1/4) Epoch 7, batch 900, loss[loss=0.2166, simple_loss=0.2892, pruned_loss=0.07202, over 21327.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2993, pruned_loss=0.07384, over 4234747.58 frames. ], batch size: 131, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:49:31,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1103202.0, ans=0.125 2023-06-24 12:49:43,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-24 12:51:17,571 INFO [train.py:996] (1/4) Epoch 7, batch 950, loss[loss=0.2897, simple_loss=0.3445, pruned_loss=0.1174, over 21567.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2977, pruned_loss=0.07344, over 4250102.82 frames. ], batch size: 507, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:51:21,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1103502.0, ans=0.125 2023-06-24 12:51:43,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=12.0 2023-06-24 12:52:04,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1103622.0, ans=0.125 2023-06-24 12:52:20,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1103682.0, ans=0.125 2023-06-24 12:52:24,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-24 12:52:59,202 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.594e+02 2.897e+02 3.337e+02 7.292e+02, threshold=5.794e+02, percent-clipped=1.0 2023-06-24 12:53:07,713 INFO [train.py:996] (1/4) Epoch 7, batch 1000, loss[loss=0.2254, simple_loss=0.2896, pruned_loss=0.08062, over 21563.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2978, pruned_loss=0.07332, over 4254660.04 frames. ], batch size: 548, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:53:33,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1103802.0, ans=0.09899494936611666 2023-06-24 12:53:37,879 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. 
limit=15.0 2023-06-24 12:54:02,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-24 12:54:11,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1103982.0, ans=0.1 2023-06-24 12:54:23,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1103982.0, ans=0.125 2023-06-24 12:54:33,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1103982.0, ans=0.1 2023-06-24 12:54:47,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1104042.0, ans=0.0 2023-06-24 12:55:12,146 INFO [train.py:996] (1/4) Epoch 7, batch 1050, loss[loss=0.2055, simple_loss=0.2728, pruned_loss=0.06907, over 21866.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2958, pruned_loss=0.07306, over 4269626.90 frames. ], batch size: 298, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:55:53,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1104222.0, ans=0.125 2023-06-24 12:56:01,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1104222.0, ans=0.05 2023-06-24 12:56:20,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1104282.0, ans=0.2 2023-06-24 12:56:42,631 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.809e+02 3.239e+02 3.685e+02 6.477e+02, threshold=6.478e+02, percent-clipped=3.0 2023-06-24 12:56:57,485 INFO [train.py:996] (1/4) Epoch 7, batch 1100, loss[loss=0.2337, simple_loss=0.3051, pruned_loss=0.08117, over 21478.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2953, pruned_loss=0.073, over 4272594.89 frames. ], batch size: 194, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:57:27,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1104462.0, ans=0.2 2023-06-24 12:57:48,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0 2023-06-24 12:57:55,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1104582.0, ans=0.0 2023-06-24 12:58:06,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.55 vs. limit=22.5 2023-06-24 12:58:11,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.68 vs. 
limit=15.0 2023-06-24 12:58:24,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1104642.0, ans=0.125 2023-06-24 12:58:38,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1104642.0, ans=0.125 2023-06-24 12:58:40,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1104642.0, ans=0.2 2023-06-24 12:58:43,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1104642.0, ans=0.0 2023-06-24 12:58:45,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1104642.0, ans=0.2 2023-06-24 12:58:48,072 INFO [train.py:996] (1/4) Epoch 7, batch 1150, loss[loss=0.1727, simple_loss=0.2603, pruned_loss=0.04258, over 21616.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2967, pruned_loss=0.0727, over 4281051.33 frames. ], batch size: 230, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:59:02,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1104702.0, ans=0.0 2023-06-24 12:59:23,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1104822.0, ans=0.0 2023-06-24 12:59:36,561 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-24 12:59:36,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=22.5 2023-06-24 12:59:48,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1104822.0, ans=0.125 2023-06-24 12:59:54,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1104882.0, ans=0.95 2023-06-24 13:00:30,288 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.493e+02 2.841e+02 3.361e+02 6.236e+02, threshold=5.682e+02, percent-clipped=0.0 2023-06-24 13:00:38,727 INFO [train.py:996] (1/4) Epoch 7, batch 1200, loss[loss=0.2423, simple_loss=0.3156, pruned_loss=0.0845, over 21411.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2993, pruned_loss=0.07287, over 4277767.73 frames. ], batch size: 194, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:00:46,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1105002.0, ans=0.1 2023-06-24 13:00:58,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1105062.0, ans=0.125 2023-06-24 13:01:01,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1105062.0, ans=0.5 2023-06-24 13:01:05,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105062.0, ans=0.1 2023-06-24 13:02:15,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.85 vs. 
limit=22.5 2023-06-24 13:02:27,870 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-24 13:02:28,499 INFO [train.py:996] (1/4) Epoch 7, batch 1250, loss[loss=0.2953, simple_loss=0.3814, pruned_loss=0.1046, over 21591.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2989, pruned_loss=0.07328, over 4279494.01 frames. ], batch size: 414, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:04:09,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.694e+02 3.114e+02 3.849e+02 5.488e+02, threshold=6.227e+02, percent-clipped=0.0 2023-06-24 13:04:18,075 INFO [train.py:996] (1/4) Epoch 7, batch 1300, loss[loss=0.2562, simple_loss=0.3214, pruned_loss=0.09552, over 21790.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3013, pruned_loss=0.07378, over 4285958.27 frames. ], batch size: 441, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:05:50,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1105842.0, ans=0.0 2023-06-24 13:06:06,842 INFO [train.py:996] (1/4) Epoch 7, batch 1350, loss[loss=0.1879, simple_loss=0.2771, pruned_loss=0.04935, over 21798.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3024, pruned_loss=0.07462, over 4290574.92 frames. ], batch size: 351, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:07:17,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1106082.0, ans=0.125 2023-06-24 13:07:43,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1106142.0, ans=0.0 2023-06-24 13:07:48,036 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 2.498e+02 2.809e+02 3.151e+02 4.941e+02, threshold=5.617e+02, percent-clipped=0.0 2023-06-24 13:07:56,338 INFO [train.py:996] (1/4) Epoch 7, batch 1400, loss[loss=0.1989, simple_loss=0.271, pruned_loss=0.06342, over 21700.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2997, pruned_loss=0.07422, over 4281126.05 frames. ], batch size: 316, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:08:42,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-24 13:08:55,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-24 13:09:22,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-24 13:09:45,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1106502.0, ans=0.125 2023-06-24 13:09:46,116 INFO [train.py:996] (1/4) Epoch 7, batch 1450, loss[loss=0.2461, simple_loss=0.317, pruned_loss=0.08759, over 21798.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3014, pruned_loss=0.075, over 4288002.07 frames. 
], batch size: 282, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:10:25,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1106622.0, ans=0.125 2023-06-24 13:10:27,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106622.0, ans=0.1 2023-06-24 13:10:31,217 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-24 13:10:32,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1106622.0, ans=0.015 2023-06-24 13:10:34,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1106622.0, ans=0.125 2023-06-24 13:10:45,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1106622.0, ans=0.125 2023-06-24 13:11:13,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1106682.0, ans=0.125 2023-06-24 13:11:13,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106682.0, ans=0.1 2023-06-24 13:11:13,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-24 13:11:28,960 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.734e+02 3.228e+02 3.700e+02 6.613e+02, threshold=6.455e+02, percent-clipped=4.0 2023-06-24 13:11:36,323 INFO [train.py:996] (1/4) Epoch 7, batch 1500, loss[loss=0.1941, simple_loss=0.2618, pruned_loss=0.06323, over 21676.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3032, pruned_loss=0.07562, over 4291406.80 frames. ], batch size: 333, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:12:23,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1106922.0, ans=0.0 2023-06-24 13:12:25,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1106922.0, ans=0.1 2023-06-24 13:13:24,208 INFO [train.py:996] (1/4) Epoch 7, batch 1550, loss[loss=0.2651, simple_loss=0.357, pruned_loss=0.08657, over 21713.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3028, pruned_loss=0.07487, over 4292761.91 frames. ], batch size: 298, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:13:41,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. 
limit=15.0 2023-06-24 13:14:43,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1107282.0, ans=0.025 2023-06-24 13:14:58,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1107342.0, ans=0.125 2023-06-24 13:15:03,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1107342.0, ans=0.0 2023-06-24 13:15:06,847 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.619e+02 3.008e+02 3.656e+02 5.850e+02, threshold=6.017e+02, percent-clipped=0.0 2023-06-24 13:15:13,448 INFO [train.py:996] (1/4) Epoch 7, batch 1600, loss[loss=0.172, simple_loss=0.2335, pruned_loss=0.05529, over 21760.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3003, pruned_loss=0.07383, over 4286068.70 frames. ], batch size: 124, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:15:28,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1107402.0, ans=0.125 2023-06-24 13:15:34,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1107402.0, ans=0.125 2023-06-24 13:15:46,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1107462.0, ans=0.125 2023-06-24 13:15:47,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-24 13:16:01,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1107462.0, ans=0.2 2023-06-24 13:16:21,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1107522.0, ans=0.1 2023-06-24 13:16:27,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1107582.0, ans=0.2 2023-06-24 13:16:47,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1107642.0, ans=0.0 2023-06-24 13:17:11,119 INFO [train.py:996] (1/4) Epoch 7, batch 1650, loss[loss=0.2313, simple_loss=0.2933, pruned_loss=0.0847, over 20135.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3002, pruned_loss=0.074, over 4286298.32 frames. 
], batch size: 703, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:17:18,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1107702.0, ans=0.1 2023-06-24 13:18:10,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1107822.0, ans=0.2 2023-06-24 13:18:24,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1107882.0, ans=0.125 2023-06-24 13:18:42,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1107942.0, ans=0.125 2023-06-24 13:18:55,909 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.769e+02 3.129e+02 3.705e+02 6.024e+02, threshold=6.259e+02, percent-clipped=1.0 2023-06-24 13:19:03,700 INFO [train.py:996] (1/4) Epoch 7, batch 1700, loss[loss=0.2268, simple_loss=0.2877, pruned_loss=0.08297, over 20088.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.302, pruned_loss=0.07452, over 4281677.24 frames. ], batch size: 702, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:19:17,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1108002.0, ans=0.0 2023-06-24 13:20:01,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1108122.0, ans=0.0 2023-06-24 13:20:36,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1108182.0, ans=0.125 2023-06-24 13:20:42,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1108242.0, ans=0.07 2023-06-24 13:21:02,722 INFO [train.py:996] (1/4) Epoch 7, batch 1750, loss[loss=0.221, simple_loss=0.2962, pruned_loss=0.07294, over 20660.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3009, pruned_loss=0.07302, over 4279029.56 frames. ], batch size: 607, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:21:03,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=22.5 2023-06-24 13:21:37,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1108362.0, ans=0.95 2023-06-24 13:21:49,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-24 13:21:50,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1108422.0, ans=0.2 2023-06-24 13:22:18,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1108482.0, ans=0.125 2023-06-24 13:22:27,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1108482.0, ans=0.2 2023-06-24 13:22:56,563 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.765e+02 3.316e+02 4.331e+02 7.357e+02, threshold=6.632e+02, percent-clipped=3.0 2023-06-24 13:23:07,040 INFO [train.py:996] (1/4) Epoch 7, batch 1800, loss[loss=0.2407, simple_loss=0.3237, pruned_loss=0.0788, over 21761.00 frames. 
], tot_loss[loss=0.2194, simple_loss=0.2979, pruned_loss=0.07039, over 4271352.25 frames. ], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:24:42,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-24 13:24:45,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1108842.0, ans=0.0 2023-06-24 13:24:51,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1108902.0, ans=0.125 2023-06-24 13:24:52,486 INFO [train.py:996] (1/4) Epoch 7, batch 1850, loss[loss=0.1774, simple_loss=0.2301, pruned_loss=0.06233, over 20042.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2994, pruned_loss=0.06966, over 4271907.69 frames. ], batch size: 702, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:25:49,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1109022.0, ans=0.1 2023-06-24 13:26:00,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1109082.0, ans=0.125 2023-06-24 13:26:02,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.22 vs. limit=10.0 2023-06-24 13:26:27,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5 2023-06-24 13:26:36,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1109142.0, ans=0.125 2023-06-24 13:26:38,960 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.879e+02 3.449e+02 4.316e+02 7.592e+02, threshold=6.898e+02, percent-clipped=3.0 2023-06-24 13:26:47,940 INFO [train.py:996] (1/4) Epoch 7, batch 1900, loss[loss=0.193, simple_loss=0.2657, pruned_loss=0.06015, over 21738.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2994, pruned_loss=0.07018, over 4267791.38 frames. ], batch size: 282, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:27:07,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1109262.0, ans=0.125 2023-06-24 13:27:07,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1109262.0, ans=0.125 2023-06-24 13:27:16,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-24 13:27:45,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1109382.0, ans=0.0 2023-06-24 13:28:16,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1109442.0, ans=0.0 2023-06-24 13:28:38,151 INFO [train.py:996] (1/4) Epoch 7, batch 1950, loss[loss=0.204, simple_loss=0.2973, pruned_loss=0.05538, over 21721.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2959, pruned_loss=0.06936, over 4266385.40 frames. 
], batch size: 352, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:28:56,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1109562.0, ans=0.0 2023-06-24 13:29:02,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-24 13:29:05,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1109562.0, ans=0.2 2023-06-24 13:29:32,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1109622.0, ans=0.125 2023-06-24 13:29:37,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1109682.0, ans=0.0 2023-06-24 13:30:01,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-06-24 13:30:26,483 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.664e+02 3.137e+02 3.840e+02 6.499e+02, threshold=6.275e+02, percent-clipped=0.0 2023-06-24 13:30:29,931 INFO [train.py:996] (1/4) Epoch 7, batch 2000, loss[loss=0.2399, simple_loss=0.3093, pruned_loss=0.0852, over 20677.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2926, pruned_loss=0.06735, over 4267632.19 frames. ], batch size: 607, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:32:17,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1110042.0, ans=0.2 2023-06-24 13:32:20,841 INFO [train.py:996] (1/4) Epoch 7, batch 2050, loss[loss=0.2043, simple_loss=0.2822, pruned_loss=0.06316, over 21776.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2941, pruned_loss=0.06835, over 4278688.72 frames. ], batch size: 316, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:32:31,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1110102.0, ans=0.1 2023-06-24 13:32:32,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-24 13:32:49,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-24 13:33:05,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1110222.0, ans=0.1 2023-06-24 13:33:53,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1110342.0, ans=0.125 2023-06-24 13:33:56,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1110342.0, ans=0.125 2023-06-24 13:34:07,173 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.697e+02 3.083e+02 3.787e+02 7.892e+02, threshold=6.165e+02, percent-clipped=1.0 2023-06-24 13:34:10,760 INFO [train.py:996] (1/4) Epoch 7, batch 2100, loss[loss=0.2166, simple_loss=0.2866, pruned_loss=0.07328, over 21537.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2989, pruned_loss=0.07075, over 4272396.95 frames. 
], batch size: 230, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:35:01,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1110522.0, ans=0.1 2023-06-24 13:35:55,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1110642.0, ans=0.125 2023-06-24 13:36:02,110 INFO [train.py:996] (1/4) Epoch 7, batch 2150, loss[loss=0.1835, simple_loss=0.2575, pruned_loss=0.05479, over 21602.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2995, pruned_loss=0.07213, over 4259430.49 frames. ], batch size: 263, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:36:47,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-24 13:36:48,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1110822.0, ans=0.125 2023-06-24 13:37:49,021 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.806e+02 3.490e+02 4.529e+02 7.299e+02, threshold=6.981e+02, percent-clipped=4.0 2023-06-24 13:37:52,590 INFO [train.py:996] (1/4) Epoch 7, batch 2200, loss[loss=0.1875, simple_loss=0.2654, pruned_loss=0.0548, over 21400.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3, pruned_loss=0.07174, over 4258365.10 frames. ], batch size: 194, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:38:19,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1111062.0, ans=0.125 2023-06-24 13:39:40,162 INFO [train.py:996] (1/4) Epoch 7, batch 2250, loss[loss=0.2061, simple_loss=0.2689, pruned_loss=0.07162, over 21617.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2963, pruned_loss=0.07032, over 4269715.58 frames. ], batch size: 332, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:39:47,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1111302.0, ans=0.0 2023-06-24 13:40:18,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1111422.0, ans=0.04949747468305833 2023-06-24 13:41:24,939 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.731e+02 3.125e+02 3.958e+02 6.138e+02, threshold=6.249e+02, percent-clipped=0.0 2023-06-24 13:41:28,580 INFO [train.py:996] (1/4) Epoch 7, batch 2300, loss[loss=0.2227, simple_loss=0.3144, pruned_loss=0.06551, over 21793.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2929, pruned_loss=0.07062, over 4267221.50 frames. ], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:41:38,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1111602.0, ans=0.125 2023-06-24 13:42:50,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111782.0, ans=0.1 2023-06-24 13:43:14,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1111842.0, ans=0.95 2023-06-24 13:43:17,688 INFO [train.py:996] (1/4) Epoch 7, batch 2350, loss[loss=0.2118, simple_loss=0.272, pruned_loss=0.07579, over 21538.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2912, pruned_loss=0.071, over 4269219.48 frames. 
], batch size: 391, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:44:04,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1112022.0, ans=0.035 2023-06-24 13:44:25,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1112022.0, ans=0.2 2023-06-24 13:44:29,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-24 13:45:00,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1112142.0, ans=0.0 2023-06-24 13:45:04,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1112142.0, ans=0.0 2023-06-24 13:45:05,383 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.752e+02 3.198e+02 3.763e+02 6.793e+02, threshold=6.396e+02, percent-clipped=2.0 2023-06-24 13:45:08,856 INFO [train.py:996] (1/4) Epoch 7, batch 2400, loss[loss=0.2503, simple_loss=0.3212, pruned_loss=0.08972, over 21711.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2932, pruned_loss=0.07208, over 4274511.90 frames. ], batch size: 332, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:45:33,743 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=22.5 2023-06-24 13:46:09,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1112322.0, ans=0.2 2023-06-24 13:46:11,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1112322.0, ans=0.0 2023-06-24 13:46:13,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1112322.0, ans=0.125 2023-06-24 13:46:59,005 INFO [train.py:996] (1/4) Epoch 7, batch 2450, loss[loss=0.2015, simple_loss=0.2724, pruned_loss=0.06528, over 21491.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.298, pruned_loss=0.07364, over 4279904.75 frames. ], batch size: 230, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:47:22,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1112562.0, ans=0.1 2023-06-24 13:47:26,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1112562.0, ans=0.05 2023-06-24 13:47:27,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1112562.0, ans=0.125 2023-06-24 13:48:19,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1112682.0, ans=0.0 2023-06-24 13:48:25,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1112682.0, ans=0.1 2023-06-24 13:48:48,358 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.774e+02 3.648e+02 4.607e+02 7.858e+02, threshold=7.296e+02, percent-clipped=5.0 2023-06-24 13:48:51,790 INFO [train.py:996] (1/4) Epoch 7, batch 2500, loss[loss=0.2283, simple_loss=0.3107, pruned_loss=0.07298, over 21495.00 frames. 
], tot_loss[loss=0.2197, simple_loss=0.2942, pruned_loss=0.07258, over 4274146.48 frames. ], batch size: 389, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 13:48:52,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1112802.0, ans=0.125 2023-06-24 13:50:42,240 INFO [train.py:996] (1/4) Epoch 7, batch 2550, loss[loss=0.2059, simple_loss=0.2743, pruned_loss=0.06877, over 21440.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2943, pruned_loss=0.07154, over 4268435.11 frames. ], batch size: 131, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 13:51:02,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1113162.0, ans=0.0 2023-06-24 13:52:30,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.860e+02 3.358e+02 4.176e+02 6.278e+02, threshold=6.716e+02, percent-clipped=0.0 2023-06-24 13:52:32,040 INFO [train.py:996] (1/4) Epoch 7, batch 2600, loss[loss=0.2596, simple_loss=0.3418, pruned_loss=0.08871, over 21323.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.298, pruned_loss=0.0735, over 4274348.58 frames. ], batch size: 548, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:52:40,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1113402.0, ans=0.125 2023-06-24 13:53:37,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1113522.0, ans=0.125 2023-06-24 13:54:18,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-24 13:54:23,122 INFO [train.py:996] (1/4) Epoch 7, batch 2650, loss[loss=0.2027, simple_loss=0.2764, pruned_loss=0.06452, over 21879.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2981, pruned_loss=0.07432, over 4278070.47 frames. ], batch size: 118, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:54:29,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-24 13:55:34,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1113882.0, ans=0.125 2023-06-24 13:55:42,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1113882.0, ans=0.125 2023-06-24 13:56:12,124 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.624e+02 3.107e+02 3.655e+02 6.528e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-24 13:56:14,281 INFO [train.py:996] (1/4) Epoch 7, batch 2700, loss[loss=0.1895, simple_loss=0.2697, pruned_loss=0.0547, over 21818.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.295, pruned_loss=0.07364, over 4271414.33 frames. 
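The optim.py:471 records log Clipping_scale=2.0, five quantiles of recent gradient norms, a clipping threshold, and the share of recent batches that were clipped. In every record in this stretch the threshold is the scale times the middle quantile (for the record above, 2.0 * 3.358e+02 = 6.716e+02). The sketch below clips against a threshold derived that way from a window of recent norms; the window length, the use of statistics.median, and the call to torch.nn.utils.clip_grad_norm_ are illustrative choices, not the actual optim.py implementation.

    import statistics
    from collections import deque

    import torch

    class MedianNormClipper:
        """Clip gradients at clipping_scale times the median of recent grad norms."""

        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.recent_norms = deque(maxlen=window)

        def __call__(self, parameters) -> float:
            params = [p for p in parameters if p.grad is not None]
            if not params:
                return 0.0
            total_norm = torch.norm(
                torch.stack([p.grad.detach().norm(2) for p in params]), 2
            ).item()
            self.recent_norms.append(total_norm)
            threshold = self.clipping_scale * statistics.median(self.recent_norms)
            # Rescale gradients in place whenever the current norm exceeds the
            # threshold; the fraction of batches where that happens is what the
            # "percent-clipped" field tracks.
            torch.nn.utils.clip_grad_norm_(params, max_norm=threshold)
            return threshold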
], batch size: 316, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:57:17,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1114122.0, ans=0.0 2023-06-24 13:57:17,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1114122.0, ans=0.0 2023-06-24 13:57:44,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1114182.0, ans=0.05 2023-06-24 13:57:47,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114242.0, ans=0.1 2023-06-24 13:58:04,703 INFO [train.py:996] (1/4) Epoch 7, batch 2750, loss[loss=0.2586, simple_loss=0.3166, pruned_loss=0.1003, over 21704.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2947, pruned_loss=0.07444, over 4272719.25 frames. ], batch size: 473, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:58:22,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1114302.0, ans=0.125 2023-06-24 13:59:01,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1114422.0, ans=0.09899494936611666 2023-06-24 13:59:07,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114422.0, ans=0.1 2023-06-24 13:59:15,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1114482.0, ans=0.125 2023-06-24 13:59:17,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1114482.0, ans=0.125 2023-06-24 13:59:32,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-24 14:00:01,267 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.964e+02 3.229e+02 3.808e+02 6.340e+02, threshold=6.458e+02, percent-clipped=1.0 2023-06-24 14:00:03,066 INFO [train.py:996] (1/4) Epoch 7, batch 2800, loss[loss=0.2611, simple_loss=0.334, pruned_loss=0.09408, over 21643.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2997, pruned_loss=0.07611, over 4268119.87 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:00:21,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1114602.0, ans=0.0 2023-06-24 14:00:39,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-24 14:00:39,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1114662.0, ans=0.0 2023-06-24 14:00:51,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1114722.0, ans=10.0 2023-06-24 14:01:54,115 INFO [train.py:996] (1/4) Epoch 7, batch 2850, loss[loss=0.2088, simple_loss=0.2825, pruned_loss=0.06757, over 21794.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3021, pruned_loss=0.07757, over 4272763.25 frames. 
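The scaling.py:182 records trace quantities such as skip rates, balancer probabilities, and dropout probabilities whose value in effect ("ans=") is scheduled as a function of batch_count. A minimal sketch of a batch-count-indexed schedule with piecewise-linear interpolation between breakpoints is given below; it illustrates the idea only and is not the ScheduledFloat implementation in scaling.py, and the example breakpoints are made up.

    from bisect import bisect_right

    class ScheduledValue:
        """A float that follows piecewise-linear breakpoints over batch_count."""

        def __init__(self, *points):
            # points: (batch_count, value) pairs, e.g. (0.0, 0.2), (4000.0, 0.0)
            self.points = sorted(points)

        def value_at(self, batch_count: float) -> float:
            counts = [c for c, _ in self.points]
            i = bisect_right(counts, batch_count)
            if i == 0:
                return self.points[0][1]
            if i == len(self.points):
                return self.points[-1][1]
            (c0, v0), (c1, v1) = self.points[i - 1], self.points[i]
            frac = (batch_count - c0) / (c1 - c0)
            return v0 + frac * (v1 - v0)

    # Example: a skip rate decaying from 0.2 to 0.0 over the first 4000 batches
    # (hypothetical breakpoints, for illustration only).
    skip_rate = ScheduledValue((0.0, 0.2), (4000.0, 0.0))
    print(skip_rate.value_at(1000.0))     # 0.15
    print(skip_rate.value_at(1110522.0))  # 0.0, far past the last breakpoint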
], batch size: 333, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:01:58,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-06-24 14:02:29,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1114962.0, ans=0.0 2023-06-24 14:02:40,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1115022.0, ans=0.125 2023-06-24 14:02:59,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1115022.0, ans=0.1 2023-06-24 14:03:36,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1115142.0, ans=0.0 2023-06-24 14:03:42,914 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 2.854e+02 3.316e+02 3.985e+02 8.556e+02, threshold=6.632e+02, percent-clipped=4.0 2023-06-24 14:03:42,946 INFO [train.py:996] (1/4) Epoch 7, batch 2900, loss[loss=0.3108, simple_loss=0.3933, pruned_loss=0.1141, over 21654.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3001, pruned_loss=0.0771, over 4276381.52 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:04:01,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1115202.0, ans=0.1 2023-06-24 14:04:51,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1115382.0, ans=0.125 2023-06-24 14:04:56,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1115382.0, ans=0.07 2023-06-24 14:05:07,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1115382.0, ans=0.05 2023-06-24 14:05:09,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1115442.0, ans=0.125 2023-06-24 14:05:18,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1115442.0, ans=0.125 2023-06-24 14:05:23,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1115442.0, ans=0.0 2023-06-24 14:05:33,564 INFO [train.py:996] (1/4) Epoch 7, batch 2950, loss[loss=0.2512, simple_loss=0.3163, pruned_loss=0.09302, over 21720.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3022, pruned_loss=0.07795, over 4282541.84 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:05:52,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1115502.0, ans=0.125 2023-06-24 14:06:44,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1115682.0, ans=0.1 2023-06-24 14:07:24,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.857e+02 3.209e+02 3.929e+02 8.381e+02, threshold=6.419e+02, percent-clipped=2.0 2023-06-24 14:07:24,687 INFO [train.py:996] (1/4) Epoch 7, batch 3000, loss[loss=0.2293, simple_loss=0.3115, pruned_loss=0.07351, over 21776.00 frames. 
], tot_loss[loss=0.2307, simple_loss=0.3057, pruned_loss=0.07787, over 4278900.94 frames. ], batch size: 332, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:07:24,687 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 14:07:46,556 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2481, simple_loss=0.3407, pruned_loss=0.0778, over 1796401.00 frames. 2023-06-24 14:07:46,557 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 14:07:56,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1115802.0, ans=0.1 2023-06-24 14:09:37,531 INFO [train.py:996] (1/4) Epoch 7, batch 3050, loss[loss=0.1707, simple_loss=0.2543, pruned_loss=0.04358, over 21390.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3074, pruned_loss=0.07679, over 4282102.10 frames. ], batch size: 194, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:09:47,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1116102.0, ans=0.0 2023-06-24 14:10:03,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-24 14:10:12,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-24 14:10:42,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1116222.0, ans=0.125 2023-06-24 14:11:03,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1116342.0, ans=0.1 2023-06-24 14:11:33,742 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.533e+02 2.915e+02 3.819e+02 6.639e+02, threshold=5.830e+02, percent-clipped=1.0 2023-06-24 14:11:33,773 INFO [train.py:996] (1/4) Epoch 7, batch 3100, loss[loss=0.1952, simple_loss=0.2868, pruned_loss=0.05177, over 21784.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3059, pruned_loss=0.07527, over 4277876.74 frames. ], batch size: 282, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:11:35,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1116402.0, ans=0.1 2023-06-24 14:11:43,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1116402.0, ans=0.125 2023-06-24 14:13:25,681 INFO [train.py:996] (1/4) Epoch 7, batch 3150, loss[loss=0.2941, simple_loss=0.3744, pruned_loss=0.1069, over 21504.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3061, pruned_loss=0.07637, over 4274503.56 frames. ], batch size: 131, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:13:38,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1116702.0, ans=0.125 2023-06-24 14:14:32,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1116822.0, ans=0.2 2023-06-24 14:15:07,792 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.45 vs. 
limit=15.0 2023-06-24 14:15:22,162 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.712e+02 3.098e+02 3.534e+02 5.991e+02, threshold=6.196e+02, percent-clipped=1.0 2023-06-24 14:15:22,194 INFO [train.py:996] (1/4) Epoch 7, batch 3200, loss[loss=0.2189, simple_loss=0.3011, pruned_loss=0.06836, over 21723.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3065, pruned_loss=0.07553, over 4280448.11 frames. ], batch size: 298, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:15:28,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1117002.0, ans=0.0 2023-06-24 14:16:07,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1117122.0, ans=0.125 2023-06-24 14:16:59,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1117242.0, ans=0.0 2023-06-24 14:17:13,190 INFO [train.py:996] (1/4) Epoch 7, batch 3250, loss[loss=0.208, simple_loss=0.2976, pruned_loss=0.05926, over 21602.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3093, pruned_loss=0.07748, over 4282405.59 frames. ], batch size: 263, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:17:42,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1117362.0, ans=0.2 2023-06-24 14:17:44,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1117362.0, ans=0.125 2023-06-24 14:17:53,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1117422.0, ans=0.125 2023-06-24 14:19:05,823 INFO [train.py:996] (1/4) Epoch 7, batch 3300, loss[loss=0.2432, simple_loss=0.332, pruned_loss=0.07716, over 21214.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3038, pruned_loss=0.07673, over 4267110.32 frames. ], batch size: 549, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:19:07,511 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.718e+02 3.384e+02 4.609e+02 8.476e+02, threshold=6.767e+02, percent-clipped=13.0 2023-06-24 14:19:30,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1117662.0, ans=0.0 2023-06-24 14:20:06,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1117722.0, ans=0.2 2023-06-24 14:20:27,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0 2023-06-24 14:20:56,361 INFO [train.py:996] (1/4) Epoch 7, batch 3350, loss[loss=0.2712, simple_loss=0.3307, pruned_loss=0.1059, over 21580.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3062, pruned_loss=0.07727, over 4272118.88 frames. 
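The validation pass logged at batch 3000 above scores a fixed dev set (1,796,401.00 frames) without updating the model, then reports the CUDA allocation high-water mark. A generic sketch of such a pass is below; loss_fn here is a hypothetical callable standing in for whatever computes the summed loss and frame count per batch, and none of this is the actual train.py validation code.

    import torch

    def compute_validation_loss(model, dev_loader, loss_fn) -> float:
        """Frame-weighted average loss over a fixed dev set; no gradient updates."""
        model.eval()
        total_loss, total_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                # loss_fn is a hypothetical helper returning
                # (summed loss over the batch, number of frames in the batch).
                loss_sum, num_frames = loss_fn(model, batch)
                total_loss += loss_sum
                total_frames += num_frames
        model.train()
        return total_loss / max(total_frames, 1.0)

    # The "Maximum memory allocated" line corresponds to the value of
    # torch.cuda.max_memory_allocated() (bytes; the log prints it in MB).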
], batch size: 471, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:20:58,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1117902.0, ans=0.125 2023-06-24 14:21:20,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1117962.0, ans=0.125 2023-06-24 14:21:29,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1117962.0, ans=0.1 2023-06-24 14:22:50,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1118142.0, ans=0.125 2023-06-24 14:22:52,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.58 vs. limit=15.0 2023-06-24 14:22:53,118 INFO [train.py:996] (1/4) Epoch 7, batch 3400, loss[loss=0.2387, simple_loss=0.3209, pruned_loss=0.07825, over 21595.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3062, pruned_loss=0.07749, over 4278597.02 frames. ], batch size: 389, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:22:54,759 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.810e+02 3.179e+02 3.983e+02 5.568e+02, threshold=6.357e+02, percent-clipped=0.0 2023-06-24 14:23:11,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1118262.0, ans=0.125 2023-06-24 14:24:31,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1118442.0, ans=0.1 2023-06-24 14:24:43,508 INFO [train.py:996] (1/4) Epoch 7, batch 3450, loss[loss=0.21, simple_loss=0.2905, pruned_loss=0.06472, over 21624.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2998, pruned_loss=0.07608, over 4274485.60 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:25:06,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1118562.0, ans=0.0 2023-06-24 14:25:52,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-24 14:26:36,920 INFO [train.py:996] (1/4) Epoch 7, batch 3500, loss[loss=0.2499, simple_loss=0.3352, pruned_loss=0.08229, over 21728.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3084, pruned_loss=0.07906, over 4279734.62 frames. 
], batch size: 247, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:26:38,659 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.694e+02 2.966e+02 3.710e+02 5.580e+02, threshold=5.932e+02, percent-clipped=0.0 2023-06-24 14:27:16,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1118862.0, ans=0.0 2023-06-24 14:27:18,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1118862.0, ans=0.125 2023-06-24 14:27:35,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1118922.0, ans=0.0 2023-06-24 14:28:22,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1119042.0, ans=0.0 2023-06-24 14:28:33,479 INFO [train.py:996] (1/4) Epoch 7, batch 3550, loss[loss=0.2387, simple_loss=0.3091, pruned_loss=0.08414, over 21766.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3121, pruned_loss=0.08098, over 4273834.31 frames. ], batch size: 118, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:28:51,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1119102.0, ans=0.125 2023-06-24 14:28:51,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1119102.0, ans=0.125 2023-06-24 14:29:32,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1119222.0, ans=0.0 2023-06-24 14:29:38,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.86 vs. limit=6.0 2023-06-24 14:29:52,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1119282.0, ans=0.125 2023-06-24 14:30:12,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1119342.0, ans=0.0 2023-06-24 14:30:24,487 INFO [train.py:996] (1/4) Epoch 7, batch 3600, loss[loss=0.2262, simple_loss=0.284, pruned_loss=0.08426, over 21161.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3056, pruned_loss=0.07945, over 4271214.40 frames. ], batch size: 143, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:30:31,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 2.873e+02 3.282e+02 3.993e+02 6.971e+02, threshold=6.565e+02, percent-clipped=2.0 2023-06-24 14:30:59,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1119462.0, ans=0.125 2023-06-24 14:31:52,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-24 14:32:11,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1119642.0, ans=0.0 2023-06-24 14:32:22,656 INFO [train.py:996] (1/4) Epoch 7, batch 3650, loss[loss=0.1421, simple_loss=0.1825, pruned_loss=0.0509, over 16965.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3074, pruned_loss=0.08047, over 4261113.49 frames. 
], batch size: 60, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:34:06,795 INFO [train.py:996] (1/4) Epoch 7, batch 3700, loss[loss=0.212, simple_loss=0.2988, pruned_loss=0.06258, over 21221.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3062, pruned_loss=0.079, over 4267307.79 frames. ], batch size: 176, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:34:08,347 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.822e+02 3.276e+02 3.785e+02 7.589e+02, threshold=6.551e+02, percent-clipped=1.0 2023-06-24 14:34:41,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-06-24 14:36:02,189 INFO [train.py:996] (1/4) Epoch 7, batch 3750, loss[loss=0.1825, simple_loss=0.2471, pruned_loss=0.05891, over 21149.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3045, pruned_loss=0.07847, over 4271909.16 frames. ], batch size: 143, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:37:34,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1120542.0, ans=0.125 2023-06-24 14:37:41,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1120542.0, ans=0.07 2023-06-24 14:37:43,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.61 vs. limit=15.0 2023-06-24 14:37:57,858 INFO [train.py:996] (1/4) Epoch 7, batch 3800, loss[loss=0.2463, simple_loss=0.3257, pruned_loss=0.08346, over 21533.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3026, pruned_loss=0.07737, over 4269531.35 frames. ], batch size: 131, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:38:01,821 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.713e+02 3.064e+02 3.470e+02 5.470e+02, threshold=6.128e+02, percent-clipped=0.0 2023-06-24 14:39:20,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1120782.0, ans=0.035 2023-06-24 14:39:49,644 INFO [train.py:996] (1/4) Epoch 7, batch 3850, loss[loss=0.2038, simple_loss=0.2723, pruned_loss=0.06766, over 21739.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2996, pruned_loss=0.07696, over 4267650.33 frames. ], batch size: 112, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:40:11,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1120962.0, ans=0.125 2023-06-24 14:40:40,905 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-24 14:40:46,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1121082.0, ans=0.0 2023-06-24 14:41:03,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1121082.0, ans=0.025 2023-06-24 14:41:33,233 INFO [train.py:996] (1/4) Epoch 7, batch 3900, loss[loss=0.2199, simple_loss=0.287, pruned_loss=0.07642, over 15239.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2939, pruned_loss=0.07592, over 4271201.98 frames. 
], batch size: 61, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:41:36,445 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.710e+02 3.145e+02 3.584e+02 6.226e+02, threshold=6.291e+02, percent-clipped=1.0 2023-06-24 14:43:30,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1121502.0, ans=0.0 2023-06-24 14:43:31,400 INFO [train.py:996] (1/4) Epoch 7, batch 3950, loss[loss=0.171, simple_loss=0.2453, pruned_loss=0.0483, over 21268.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2962, pruned_loss=0.0752, over 4265857.26 frames. ], batch size: 159, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:43:35,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1121502.0, ans=0.2 2023-06-24 14:43:57,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1121562.0, ans=0.1 2023-06-24 14:44:12,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1121622.0, ans=0.0 2023-06-24 14:45:22,788 INFO [train.py:996] (1/4) Epoch 7, batch 4000, loss[loss=0.2193, simple_loss=0.3023, pruned_loss=0.06817, over 19833.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2888, pruned_loss=0.07169, over 4268408.93 frames. ], batch size: 702, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:45:26,593 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.561e+02 2.888e+02 3.482e+02 6.063e+02, threshold=5.775e+02, percent-clipped=0.0 2023-06-24 14:45:45,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1121862.0, ans=0.125 2023-06-24 14:45:50,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1121862.0, ans=0.1 2023-06-24 14:45:51,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-24 14:46:07,140 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:46:37,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1121982.0, ans=0.0 2023-06-24 14:47:13,474 INFO [train.py:996] (1/4) Epoch 7, batch 4050, loss[loss=0.212, simple_loss=0.3059, pruned_loss=0.05899, over 21610.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2894, pruned_loss=0.07054, over 4269874.97 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:48:41,014 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.52 vs. limit=12.0 2023-06-24 14:49:04,256 INFO [train.py:996] (1/4) Epoch 7, batch 4100, loss[loss=0.2411, simple_loss=0.315, pruned_loss=0.08361, over 21842.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.292, pruned_loss=0.07126, over 4281431.57 frames. 
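The grad_scale field in the batch records moves between 16.0 and 32.0 in this stretch (it is 32.0 at batch 4000, for instance), which is the usual behaviour of dynamic loss scaling in fp16 training: the scale grows after a run of stable steps and is halved when an inf/nan gradient is detected. A minimal mixed-precision step using torch.cuda.amp is sketched below; the model, batch, and optimizer are placeholders and this is standard PyTorch usage, not the actual train.py loop.

    import torch

    def train_step(model, batch, optimizer, scaler: torch.cuda.amp.GradScaler):
        """One fp16 training step with dynamic loss scaling."""
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(**batch)          # forward pass in reduced precision
        scaler.scale(loss).backward()      # backward on the scaled loss
        scaler.step(optimizer)             # skips the update if inf/nan appeared
        scaler.update()                    # grows or halves the scale accordingly
        return loss.detach(), scaler.get_scale()

    # e.g. scaler = torch.cuda.amp.GradScaler(init_scale=16.0)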
], batch size: 414, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:49:08,894 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.546e+02 2.998e+02 3.545e+02 8.551e+02, threshold=5.997e+02, percent-clipped=3.0 2023-06-24 14:49:20,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1122402.0, ans=0.2 2023-06-24 14:49:32,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1122462.0, ans=0.2 2023-06-24 14:49:43,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1122522.0, ans=0.1 2023-06-24 14:50:01,244 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:50:17,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1122582.0, ans=0.0 2023-06-24 14:50:54,031 INFO [train.py:996] (1/4) Epoch 7, batch 4150, loss[loss=0.2238, simple_loss=0.3088, pruned_loss=0.0694, over 21732.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2927, pruned_loss=0.06899, over 4284777.66 frames. ], batch size: 351, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:51:55,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1122822.0, ans=0.125 2023-06-24 14:52:40,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-24 14:52:46,883 INFO [train.py:996] (1/4) Epoch 7, batch 4200, loss[loss=0.1791, simple_loss=0.2707, pruned_loss=0.04374, over 21426.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.294, pruned_loss=0.06958, over 4273491.86 frames. ], batch size: 212, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:52:57,886 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.672e+02 2.976e+02 3.504e+02 5.360e+02, threshold=5.952e+02, percent-clipped=0.0 2023-06-24 14:53:03,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1123002.0, ans=0.2 2023-06-24 14:54:42,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1123242.0, ans=0.125 2023-06-24 14:54:45,207 INFO [train.py:996] (1/4) Epoch 7, batch 4250, loss[loss=0.2487, simple_loss=0.3227, pruned_loss=0.08732, over 21775.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3034, pruned_loss=0.07254, over 4280910.42 frames. ], batch size: 298, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:55:06,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1123362.0, ans=0.0 2023-06-24 14:55:25,931 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.27 vs. 
limit=12.0 2023-06-24 14:55:52,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1123422.0, ans=0.125 2023-06-24 14:55:58,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1123482.0, ans=0.125 2023-06-24 14:56:06,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1123482.0, ans=0.0 2023-06-24 14:56:40,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1123542.0, ans=0.1 2023-06-24 14:56:43,619 INFO [train.py:996] (1/4) Epoch 7, batch 4300, loss[loss=0.2084, simple_loss=0.3071, pruned_loss=0.05489, over 21752.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3073, pruned_loss=0.07411, over 4273231.66 frames. ], batch size: 298, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:56:48,731 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.063e+02 3.693e+02 4.827e+02 7.345e+02, threshold=7.385e+02, percent-clipped=7.0 2023-06-24 14:57:46,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1123782.0, ans=0.07 2023-06-24 14:57:59,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-06-24 14:58:19,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1123842.0, ans=0.125 2023-06-24 14:58:24,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1123842.0, ans=0.125 2023-06-24 14:58:39,569 INFO [train.py:996] (1/4) Epoch 7, batch 4350, loss[loss=0.1991, simple_loss=0.2648, pruned_loss=0.06675, over 21362.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3053, pruned_loss=0.07333, over 4259759.24 frames. ], batch size: 160, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:59:43,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1124082.0, ans=0.0 2023-06-24 14:59:49,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.21 vs. limit=15.0 2023-06-24 15:00:35,519 INFO [train.py:996] (1/4) Epoch 7, batch 4400, loss[loss=0.2018, simple_loss=0.2896, pruned_loss=0.05699, over 21618.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3032, pruned_loss=0.07279, over 4256121.87 frames. 
], batch size: 263, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:00:41,287 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.891e+02 3.329e+02 4.006e+02 7.259e+02, threshold=6.659e+02, percent-clipped=0.0 2023-06-24 15:01:00,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1124262.0, ans=0.125 2023-06-24 15:01:04,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1124262.0, ans=0.1 2023-06-24 15:01:24,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1124322.0, ans=0.0 2023-06-24 15:01:33,648 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:02:28,239 INFO [train.py:996] (1/4) Epoch 7, batch 4450, loss[loss=0.2674, simple_loss=0.349, pruned_loss=0.09292, over 21717.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3109, pruned_loss=0.07441, over 4264437.70 frames. ], batch size: 389, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:04:12,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=22.5 2023-06-24 15:04:20,240 INFO [train.py:996] (1/4) Epoch 7, batch 4500, loss[loss=0.2107, simple_loss=0.3027, pruned_loss=0.05936, over 20889.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3119, pruned_loss=0.07653, over 4269494.12 frames. ], batch size: 608, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:04:25,142 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.933e+02 3.595e+02 4.328e+02 6.220e+02, threshold=7.189e+02, percent-clipped=0.0 2023-06-24 15:06:10,783 INFO [train.py:996] (1/4) Epoch 7, batch 4550, loss[loss=0.2674, simple_loss=0.3438, pruned_loss=0.09555, over 21759.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3139, pruned_loss=0.07652, over 4272379.87 frames. ], batch size: 441, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:06:27,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1125162.0, ans=0.0 2023-06-24 15:07:09,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1125222.0, ans=0.125 2023-06-24 15:07:54,597 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-24 15:07:56,746 INFO [train.py:996] (1/4) Epoch 7, batch 4600, loss[loss=0.1984, simple_loss=0.2783, pruned_loss=0.05928, over 21748.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3143, pruned_loss=0.07751, over 4276919.85 frames. 
], batch size: 247, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:08:02,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 3.095e+02 3.765e+02 5.007e+02 9.113e+02, threshold=7.530e+02, percent-clipped=6.0 2023-06-24 15:08:02,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1125402.0, ans=10.0 2023-06-24 15:08:23,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1125462.0, ans=0.0 2023-06-24 15:08:32,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1125462.0, ans=0.125 2023-06-24 15:08:33,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-06-24 15:08:41,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=15.0 2023-06-24 15:09:38,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-24 15:09:45,802 INFO [train.py:996] (1/4) Epoch 7, batch 4650, loss[loss=0.1625, simple_loss=0.2432, pruned_loss=0.04095, over 21786.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3074, pruned_loss=0.0754, over 4288646.31 frames. ], batch size: 282, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:11:20,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1125942.0, ans=0.1 2023-06-24 15:11:32,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1125942.0, ans=0.125 2023-06-24 15:11:34,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1126002.0, ans=0.125 2023-06-24 15:11:35,508 INFO [train.py:996] (1/4) Epoch 7, batch 4700, loss[loss=0.2402, simple_loss=0.349, pruned_loss=0.06567, over 21198.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2974, pruned_loss=0.0727, over 4286625.20 frames. ], batch size: 548, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:11:45,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.572e+02 2.876e+02 3.232e+02 6.204e+02, threshold=5.752e+02, percent-clipped=0.0 2023-06-24 15:11:59,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1126062.0, ans=0.95 2023-06-24 15:12:00,536 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-24 15:12:31,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1126122.0, ans=0.05 2023-06-24 15:12:45,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1126182.0, ans=0.125 2023-06-24 15:13:17,050 INFO [train.py:996] (1/4) Epoch 7, batch 4750, loss[loss=0.2054, simple_loss=0.2743, pruned_loss=0.06827, over 21654.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2931, pruned_loss=0.07272, over 4290638.50 frames. 
], batch size: 230, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:13:25,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1126302.0, ans=0.125 2023-06-24 15:13:36,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.24 vs. limit=6.0 2023-06-24 15:13:57,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-24 15:14:10,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1126422.0, ans=0.125 2023-06-24 15:14:48,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.52 vs. limit=15.0 2023-06-24 15:15:04,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-24 15:15:13,766 INFO [train.py:996] (1/4) Epoch 7, batch 4800, loss[loss=0.1997, simple_loss=0.2938, pruned_loss=0.0528, over 21719.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2945, pruned_loss=0.07309, over 4297737.14 frames. ], batch size: 247, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:15:19,153 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.780e+02 3.342e+02 3.933e+02 6.055e+02, threshold=6.684e+02, percent-clipped=1.0 2023-06-24 15:16:10,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.87 vs. limit=22.5 2023-06-24 15:16:11,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1126722.0, ans=0.0 2023-06-24 15:16:20,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1126782.0, ans=15.0 2023-06-24 15:16:27,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1126782.0, ans=0.1 2023-06-24 15:16:39,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1126842.0, ans=0.1 2023-06-24 15:16:58,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1126902.0, ans=10.0 2023-06-24 15:16:59,128 INFO [train.py:996] (1/4) Epoch 7, batch 4850, loss[loss=0.2356, simple_loss=0.2989, pruned_loss=0.08615, over 21695.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2933, pruned_loss=0.07274, over 4301895.34 frames. ], batch size: 441, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:17:13,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1126902.0, ans=0.05 2023-06-24 15:17:41,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. 
limit=22.5 2023-06-24 15:18:02,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1127022.0, ans=0.125 2023-06-24 15:18:50,502 INFO [train.py:996] (1/4) Epoch 7, batch 4900, loss[loss=0.2158, simple_loss=0.2917, pruned_loss=0.06994, over 21858.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2957, pruned_loss=0.07374, over 4307560.86 frames. ], batch size: 118, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:18:55,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.644e+02 3.017e+02 3.473e+02 6.026e+02, threshold=6.033e+02, percent-clipped=0.0 2023-06-24 15:19:36,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1127322.0, ans=0.0 2023-06-24 15:20:31,999 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-06-24 15:20:41,538 INFO [train.py:996] (1/4) Epoch 7, batch 4950, loss[loss=0.2041, simple_loss=0.299, pruned_loss=0.0546, over 21724.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2978, pruned_loss=0.07182, over 4296042.79 frames. ], batch size: 351, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:22:22,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1127742.0, ans=0.0 2023-06-24 15:22:26,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-24 15:22:30,052 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=19.64 vs. limit=15.0 2023-06-24 15:22:30,507 INFO [train.py:996] (1/4) Epoch 7, batch 5000, loss[loss=0.1595, simple_loss=0.2296, pruned_loss=0.04466, over 17666.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.296, pruned_loss=0.06851, over 4287134.43 frames. ], batch size: 65, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:22:35,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.510e+02 2.912e+02 3.367e+02 5.959e+02, threshold=5.824e+02, percent-clipped=0.0 2023-06-24 15:22:53,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1127862.0, ans=0.125 2023-06-24 15:24:13,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-24 15:24:19,961 INFO [train.py:996] (1/4) Epoch 7, batch 5050, loss[loss=0.2315, simple_loss=0.3489, pruned_loss=0.05708, over 20688.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2962, pruned_loss=0.07019, over 4296026.41 frames. ], batch size: 607, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:25:47,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-24 15:26:10,467 INFO [train.py:996] (1/4) Epoch 7, batch 5100, loss[loss=0.1725, simple_loss=0.2535, pruned_loss=0.04577, over 21621.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2962, pruned_loss=0.07078, over 4296421.44 frames. 
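Each batch record carries two summaries: loss[... over N frames] for the current batch and tot_loss[... over M frames], where M stays between roughly 4.26 and 4.31 million frames throughout this stretch rather than growing without bound. That pattern is consistent with a frame-weighted running aggregate that decays older batches as new ones arrive. One way to keep such an aggregate is sketched below; the decay factor and the bookkeeping are assumptions for illustration, not the metrics code used by train.py.

    class RunningFrameLoss:
        """Frame-weighted, exponentially decayed aggregate of per-batch losses."""

        def __init__(self, decay: float = 0.995):
            self.decay = decay
            self.loss_sum = 0.0   # decayed sum of (loss * frames)
            self.frames = 0.0     # decayed sum of frames

        def update(self, batch_loss: float, batch_frames: float) -> None:
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    # Usage: after each batch call tracker.update(loss, num_frames) and log
    # something like f"tot_loss[loss={tracker.value:.4f}, over {tracker.frames:.2f} frames]".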
], batch size: 230, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:26:17,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.691e+02 3.129e+02 3.589e+02 6.328e+02, threshold=6.257e+02, percent-clipped=2.0 2023-06-24 15:26:27,963 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:28:00,598 INFO [train.py:996] (1/4) Epoch 7, batch 5150, loss[loss=0.2234, simple_loss=0.3046, pruned_loss=0.07106, over 21869.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2954, pruned_loss=0.07169, over 4296284.28 frames. ], batch size: 371, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:28:36,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1128762.0, ans=0.2 2023-06-24 15:29:52,278 INFO [train.py:996] (1/4) Epoch 7, batch 5200, loss[loss=0.224, simple_loss=0.3217, pruned_loss=0.06319, over 21624.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2999, pruned_loss=0.07257, over 4293389.71 frames. ], batch size: 263, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:29:59,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.769e+02 3.246e+02 4.133e+02 8.749e+02, threshold=6.492e+02, percent-clipped=7.0 2023-06-24 15:30:56,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1129122.0, ans=0.0 2023-06-24 15:31:10,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1129182.0, ans=0.5 2023-06-24 15:31:13,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1129182.0, ans=0.04949747468305833 2023-06-24 15:31:41,111 INFO [train.py:996] (1/4) Epoch 7, batch 5250, loss[loss=0.1763, simple_loss=0.2533, pruned_loss=0.04967, over 16462.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.304, pruned_loss=0.07114, over 4284939.15 frames. ], batch size: 62, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:31:51,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1129302.0, ans=0.0 2023-06-24 15:33:31,871 INFO [train.py:996] (1/4) Epoch 7, batch 5300, loss[loss=0.2019, simple_loss=0.2793, pruned_loss=0.06224, over 21663.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3033, pruned_loss=0.07275, over 4289264.09 frames. ], batch size: 263, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:33:38,414 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.522e+02 2.825e+02 3.420e+02 5.349e+02, threshold=5.650e+02, percent-clipped=0.0 2023-06-24 15:34:03,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1129662.0, ans=0.0 2023-06-24 15:34:17,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.39 vs. 
limit=15.0 2023-06-24 15:34:23,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1129722.0, ans=0.1 2023-06-24 15:34:33,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1129722.0, ans=0.125 2023-06-24 15:34:52,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1129782.0, ans=0.0 2023-06-24 15:35:17,635 INFO [train.py:996] (1/4) Epoch 7, batch 5350, loss[loss=0.2401, simple_loss=0.3018, pruned_loss=0.08919, over 21907.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3018, pruned_loss=0.07423, over 4294655.30 frames. ], batch size: 414, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:35:19,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1129902.0, ans=0.2 2023-06-24 15:35:37,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1129962.0, ans=0.125 2023-06-24 15:36:10,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-24 15:36:44,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1130082.0, ans=0.125 2023-06-24 15:37:07,119 INFO [train.py:996] (1/4) Epoch 7, batch 5400, loss[loss=0.2414, simple_loss=0.3053, pruned_loss=0.0888, over 21561.00 frames. ], tot_loss[loss=0.224, simple_loss=0.299, pruned_loss=0.07453, over 4293972.54 frames. ], batch size: 471, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:37:16,428 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.748e+02 3.020e+02 3.535e+02 6.573e+02, threshold=6.041e+02, percent-clipped=2.0 2023-06-24 15:37:33,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1130262.0, ans=0.125 2023-06-24 15:37:51,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-24 15:38:14,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1130322.0, ans=0.5 2023-06-24 15:38:16,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=15.0 2023-06-24 15:38:49,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-24 15:38:59,041 INFO [train.py:996] (1/4) Epoch 7, batch 5450, loss[loss=0.2975, simple_loss=0.377, pruned_loss=0.109, over 21563.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3002, pruned_loss=0.07348, over 4287376.41 frames. 
], batch size: 471, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:39:56,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1130622.0, ans=0.2 2023-06-24 15:40:08,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1130682.0, ans=0.125 2023-06-24 15:40:50,129 INFO [train.py:996] (1/4) Epoch 7, batch 5500, loss[loss=0.1902, simple_loss=0.2777, pruned_loss=0.0513, over 21335.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3055, pruned_loss=0.07108, over 4284514.72 frames. ], batch size: 176, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:40:51,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-24 15:40:56,242 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-24 15:40:57,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1130802.0, ans=0.0 2023-06-24 15:40:58,240 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.852e+02 3.783e+02 5.353e+02 8.274e+02, threshold=7.565e+02, percent-clipped=13.0 2023-06-24 15:41:01,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0 2023-06-24 15:41:33,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1130922.0, ans=0.1 2023-06-24 15:41:40,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1130922.0, ans=0.125 2023-06-24 15:41:51,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1130982.0, ans=0.0 2023-06-24 15:42:40,381 INFO [train.py:996] (1/4) Epoch 7, batch 5550, loss[loss=0.1871, simple_loss=0.2759, pruned_loss=0.04919, over 21369.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3047, pruned_loss=0.06872, over 4284004.54 frames. ], batch size: 211, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:43:27,241 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. limit=6.0 2023-06-24 15:44:31,795 INFO [train.py:996] (1/4) Epoch 7, batch 5600, loss[loss=0.2173, simple_loss=0.303, pruned_loss=0.06578, over 21239.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3011, pruned_loss=0.06636, over 4281354.96 frames. ], batch size: 159, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:44:45,773 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.480e+02 2.959e+02 3.871e+02 8.894e+02, threshold=5.918e+02, percent-clipped=1.0 2023-06-24 15:45:24,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1131522.0, ans=0.0 2023-06-24 15:45:33,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1131522.0, ans=0.1 2023-06-24 15:46:19,842 INFO [train.py:996] (1/4) Epoch 7, batch 5650, loss[loss=0.2406, simple_loss=0.3136, pruned_loss=0.08378, over 21784.00 frames. 
], tot_loss[loss=0.2199, simple_loss=0.3032, pruned_loss=0.06835, over 4286779.01 frames. ], batch size: 112, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:46:20,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.05 vs. limit=10.0 2023-06-24 15:46:34,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1131702.0, ans=0.1 2023-06-24 15:46:44,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=22.5 2023-06-24 15:46:56,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1131762.0, ans=0.125 2023-06-24 15:47:00,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=22.5 2023-06-24 15:47:21,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-24 15:47:58,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=10.0 2023-06-24 15:48:15,336 INFO [train.py:996] (1/4) Epoch 7, batch 5700, loss[loss=0.2346, simple_loss=0.3135, pruned_loss=0.07789, over 21618.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3033, pruned_loss=0.06984, over 4291580.26 frames. ], batch size: 441, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:48:26,251 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.629e+02 3.066e+02 3.731e+02 7.827e+02, threshold=6.133e+02, percent-clipped=4.0 2023-06-24 15:49:07,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1132122.0, ans=0.125 2023-06-24 15:49:19,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1132122.0, ans=0.0 2023-06-24 15:49:19,801 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:49:53,113 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0 2023-06-24 15:50:06,421 INFO [train.py:996] (1/4) Epoch 7, batch 5750, loss[loss=0.2069, simple_loss=0.2987, pruned_loss=0.05751, over 21601.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2998, pruned_loss=0.06747, over 4287698.81 frames. ], batch size: 441, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:50:10,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1132302.0, ans=0.125 2023-06-24 15:50:13,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1132302.0, ans=15.0 2023-06-24 15:50:21,664 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.25 vs. 
limit=22.5 2023-06-24 15:50:45,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1132362.0, ans=0.0 2023-06-24 15:51:02,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1132422.0, ans=0.0 2023-06-24 15:51:34,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1132482.0, ans=0.125 2023-06-24 15:51:51,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1132542.0, ans=0.035 2023-06-24 15:51:56,229 INFO [train.py:996] (1/4) Epoch 7, batch 5800, loss[loss=0.2468, simple_loss=0.3451, pruned_loss=0.07427, over 21633.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.3002, pruned_loss=0.06639, over 4286657.73 frames. ], batch size: 389, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:52:12,047 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.681e+02 3.323e+02 4.302e+02 6.884e+02, threshold=6.646e+02, percent-clipped=1.0 2023-06-24 15:52:53,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1132722.0, ans=0.125 2023-06-24 15:53:15,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1132782.0, ans=0.1 2023-06-24 15:53:24,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1132782.0, ans=0.125 2023-06-24 15:53:32,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1132842.0, ans=0.0 2023-06-24 15:53:35,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1132842.0, ans=10.0 2023-06-24 15:53:47,454 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:53:58,490 INFO [train.py:996] (1/4) Epoch 7, batch 5850, loss[loss=0.1782, simple_loss=0.2705, pruned_loss=0.04291, over 21414.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2978, pruned_loss=0.06396, over 4278691.07 frames. ], batch size: 194, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:54:08,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=22.5 2023-06-24 15:54:53,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-24 15:55:02,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1133082.0, ans=0.125 2023-06-24 15:55:51,559 INFO [train.py:996] (1/4) Epoch 7, batch 5900, loss[loss=0.1557, simple_loss=0.2301, pruned_loss=0.04065, over 21864.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2909, pruned_loss=0.05921, over 4276929.69 frames. 
], batch size: 102, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 15:56:01,765 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 2.024e+02 2.372e+02 2.933e+02 6.586e+02, threshold=4.744e+02, percent-clipped=0.0 2023-06-24 15:56:21,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1133262.0, ans=0.125 2023-06-24 15:56:25,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-24 15:56:26,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1133322.0, ans=0.1 2023-06-24 15:57:00,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1133382.0, ans=0.125 2023-06-24 15:57:39,657 INFO [train.py:996] (1/4) Epoch 7, batch 5950, loss[loss=0.2167, simple_loss=0.2862, pruned_loss=0.07361, over 21688.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2906, pruned_loss=0.06241, over 4284311.77 frames. ], batch size: 389, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 15:58:16,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1133562.0, ans=0.1 2023-06-24 15:58:44,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1133682.0, ans=0.2 2023-06-24 15:59:12,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1133742.0, ans=0.125 2023-06-24 15:59:27,089 INFO [train.py:996] (1/4) Epoch 7, batch 6000, loss[loss=0.183, simple_loss=0.3032, pruned_loss=0.03145, over 21247.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.288, pruned_loss=0.06511, over 4291451.15 frames. ], batch size: 548, lr: 4.41e-03, grad_scale: 32.0 2023-06-24 15:59:27,090 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 15:59:44,450 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2613, simple_loss=0.3539, pruned_loss=0.08436, over 1796401.00 frames. 2023-06-24 15:59:44,451 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 15:59:57,238 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 3.144e+02 3.731e+02 4.665e+02 6.977e+02, threshold=7.462e+02, percent-clipped=24.0 2023-06-24 16:00:15,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1133862.0, ans=0.05 2023-06-24 16:01:12,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1133982.0, ans=0.125 2023-06-24 16:01:36,673 INFO [train.py:996] (1/4) Epoch 7, batch 6050, loss[loss=0.1857, simple_loss=0.2642, pruned_loss=0.05365, over 21597.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2842, pruned_loss=0.06501, over 4272104.39 frames. 
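The "Maximum memory allocated so far is ...MB" lines printed around each validation pass presumably come from the CUDA allocator's peak statistics; a minimal sketch that produces a line with the same wording:

    import torch

    # Sketch: report the peak CUDA memory on the current device in MB.
    if torch.cuda.is_available():
        mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"Maximum memory allocated so far is {mb}MB")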
], batch size: 414, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:01:45,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1134102.0, ans=0.2 2023-06-24 16:02:01,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1134162.0, ans=0.1 2023-06-24 16:02:11,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1134162.0, ans=0.125 2023-06-24 16:03:27,722 INFO [train.py:996] (1/4) Epoch 7, batch 6100, loss[loss=0.2059, simple_loss=0.2832, pruned_loss=0.06431, over 20138.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2823, pruned_loss=0.06323, over 4270524.71 frames. ], batch size: 702, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:03:39,887 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.425e+02 2.947e+02 3.693e+02 6.413e+02, threshold=5.895e+02, percent-clipped=0.0 2023-06-24 16:03:58,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1134462.0, ans=0.0 2023-06-24 16:04:03,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1134462.0, ans=0.125 2023-06-24 16:04:05,137 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:04:30,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1134582.0, ans=0.0 2023-06-24 16:05:01,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1134642.0, ans=0.0 2023-06-24 16:05:17,147 INFO [train.py:996] (1/4) Epoch 7, batch 6150, loss[loss=0.2275, simple_loss=0.3055, pruned_loss=0.07479, over 21645.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2847, pruned_loss=0.06579, over 4266877.22 frames. ], batch size: 415, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:05:21,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1134702.0, ans=0.125 2023-06-24 16:06:06,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1134822.0, ans=0.2 2023-06-24 16:06:45,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-24 16:07:02,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1134942.0, ans=0.2 2023-06-24 16:07:05,672 INFO [train.py:996] (1/4) Epoch 7, batch 6200, loss[loss=0.2507, simple_loss=0.316, pruned_loss=0.09267, over 21489.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2884, pruned_loss=0.06672, over 4276733.49 frames. 
], batch size: 509, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:07:25,603 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.569e+02 3.119e+02 3.567e+02 5.212e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-24 16:07:40,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1135062.0, ans=0.1 2023-06-24 16:07:45,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1135122.0, ans=0.125 2023-06-24 16:08:52,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1135242.0, ans=0.2 2023-06-24 16:08:56,819 INFO [train.py:996] (1/4) Epoch 7, batch 6250, loss[loss=0.2459, simple_loss=0.3467, pruned_loss=0.0726, over 21672.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2954, pruned_loss=0.06673, over 4277936.43 frames. ], batch size: 414, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:09:20,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1135362.0, ans=0.125 2023-06-24 16:09:27,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-24 16:09:49,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1135422.0, ans=0.2 2023-06-24 16:10:18,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1135482.0, ans=10.0 2023-06-24 16:10:25,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1135482.0, ans=0.0 2023-06-24 16:10:50,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1135602.0, ans=0.125 2023-06-24 16:10:51,876 INFO [train.py:996] (1/4) Epoch 7, batch 6300, loss[loss=0.237, simple_loss=0.3478, pruned_loss=0.06309, over 21218.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2986, pruned_loss=0.06584, over 4275291.55 frames. ], batch size: 548, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:11:06,080 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.617e+02 3.122e+02 4.088e+02 6.551e+02, threshold=6.244e+02, percent-clipped=1.0 2023-06-24 16:11:39,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1135722.0, ans=0.04949747468305833 2023-06-24 16:12:01,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. 
limit=22.5 2023-06-24 16:12:04,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1135782.0, ans=0.125 2023-06-24 16:12:13,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1135782.0, ans=0.2 2023-06-24 16:12:18,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1135842.0, ans=0.04949747468305833 2023-06-24 16:12:20,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1135842.0, ans=0.125 2023-06-24 16:12:40,740 INFO [train.py:996] (1/4) Epoch 7, batch 6350, loss[loss=0.2482, simple_loss=0.3266, pruned_loss=0.08485, over 21464.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3012, pruned_loss=0.06981, over 4283975.05 frames. ], batch size: 194, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:13:40,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1136022.0, ans=0.07 2023-06-24 16:13:48,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-24 16:14:30,411 INFO [train.py:996] (1/4) Epoch 7, batch 6400, loss[loss=0.2425, simple_loss=0.32, pruned_loss=0.08252, over 21743.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3064, pruned_loss=0.07433, over 4286620.76 frames. ], batch size: 298, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:14:55,408 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.966e+02 3.361e+02 3.840e+02 6.220e+02, threshold=6.721e+02, percent-clipped=0.0 2023-06-24 16:15:25,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-24 16:15:51,337 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:16:05,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1136442.0, ans=0.125 2023-06-24 16:16:12,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-24 16:16:25,897 INFO [train.py:996] (1/4) Epoch 7, batch 6450, loss[loss=0.2417, simple_loss=0.3238, pruned_loss=0.07982, over 21455.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3088, pruned_loss=0.07358, over 4289690.82 frames. ], batch size: 131, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:16:49,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1136502.0, ans=0.0 2023-06-24 16:17:31,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=10.0 2023-06-24 16:18:00,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1136742.0, ans=0.125 2023-06-24 16:18:14,954 INFO [train.py:996] (1/4) Epoch 7, batch 6500, loss[loss=0.2008, simple_loss=0.2871, pruned_loss=0.05726, over 21628.00 frames. 
], tot_loss[loss=0.2238, simple_loss=0.3034, pruned_loss=0.07208, over 4277721.78 frames. ], batch size: 263, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:18:38,257 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.852e+02 3.600e+02 4.849e+02 8.797e+02, threshold=7.199e+02, percent-clipped=3.0 2023-06-24 16:18:51,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1136862.0, ans=0.125 2023-06-24 16:18:52,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1136862.0, ans=0.2 2023-06-24 16:19:03,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136922.0, ans=0.1 2023-06-24 16:19:05,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1136922.0, ans=0.2 2023-06-24 16:19:51,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1137042.0, ans=0.0 2023-06-24 16:20:03,503 INFO [train.py:996] (1/4) Epoch 7, batch 6550, loss[loss=0.2132, simple_loss=0.328, pruned_loss=0.04919, over 21208.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3015, pruned_loss=0.0699, over 4276368.86 frames. ], batch size: 548, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:20:08,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=6.0 2023-06-24 16:20:33,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1137162.0, ans=0.0 2023-06-24 16:20:39,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.39 vs. limit=10.0 2023-06-24 16:21:53,186 INFO [train.py:996] (1/4) Epoch 7, batch 6600, loss[loss=0.2002, simple_loss=0.2525, pruned_loss=0.07391, over 21260.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2957, pruned_loss=0.06991, over 4283093.92 frames. ], batch size: 548, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:22:04,157 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:22:17,172 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.518e+02 2.917e+02 3.263e+02 5.305e+02, threshold=5.833e+02, percent-clipped=0.0 2023-06-24 16:22:20,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1137462.0, ans=0.0 2023-06-24 16:22:53,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1137522.0, ans=0.1 2023-06-24 16:23:01,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1137582.0, ans=0.0 2023-06-24 16:23:44,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1137642.0, ans=0.5 2023-06-24 16:23:53,027 INFO [train.py:996] (1/4) Epoch 7, batch 6650, loss[loss=0.2077, simple_loss=0.2796, pruned_loss=0.06792, over 21649.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2867, pruned_loss=0.06707, over 4274277.82 frames. 
], batch size: 391, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:25:43,961 INFO [train.py:996] (1/4) Epoch 7, batch 6700, loss[loss=0.2224, simple_loss=0.2935, pruned_loss=0.0756, over 21652.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2811, pruned_loss=0.0665, over 4271844.12 frames. ], batch size: 415, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:25:44,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1138002.0, ans=0.0 2023-06-24 16:25:45,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1138002.0, ans=0.1 2023-06-24 16:25:57,323 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.457e+02 2.786e+02 3.230e+02 4.297e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-24 16:26:08,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1138062.0, ans=6.0 2023-06-24 16:27:26,645 INFO [train.py:996] (1/4) Epoch 7, batch 6750, loss[loss=0.2147, simple_loss=0.2784, pruned_loss=0.07545, over 21763.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2791, pruned_loss=0.06718, over 4273241.22 frames. ], batch size: 351, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:27:48,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1138362.0, ans=0.125 2023-06-24 16:28:24,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1138482.0, ans=0.125 2023-06-24 16:28:37,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1138482.0, ans=0.0 2023-06-24 16:28:38,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1138482.0, ans=0.2 2023-06-24 16:29:05,375 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-24 16:29:09,292 INFO [train.py:996] (1/4) Epoch 7, batch 6800, loss[loss=0.1933, simple_loss=0.26, pruned_loss=0.06326, over 21724.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2817, pruned_loss=0.06912, over 4273068.38 frames. ], batch size: 282, lr: 4.40e-03, grad_scale: 32.0 2023-06-24 16:29:15,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1138602.0, ans=0.1 2023-06-24 16:29:23,223 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 2.710e+02 3.194e+02 3.747e+02 5.784e+02, threshold=6.389e+02, percent-clipped=2.0 2023-06-24 16:29:33,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1138662.0, ans=0.125 2023-06-24 16:30:20,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1138782.0, ans=0.07 2023-06-24 16:30:50,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1138902.0, ans=0.1 2023-06-24 16:30:51,565 INFO [train.py:996] (1/4) Epoch 7, batch 6850, loss[loss=0.208, simple_loss=0.2717, pruned_loss=0.07217, over 21759.00 frames. 
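The scaling.py "ScheduledFloat: name=..., batch_count=..., ans=..." records above report the current value (ans) of a hyperparameter scheduled against the global batch count. A minimal piecewise-linear scheduler in the same spirit (the schedule points are invented for illustration and are not the recipe's actual values):

    def scheduled_float(batch_count, points):
        """points: sorted (batch_count, value) pairs; linear interpolation between them."""
        if batch_count <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        return points[-1][1]

    # Far past the last schedule point, the value has settled at its final level.
    print(scheduled_float(1138002.0, [(0.0, 0.2), (4000.0, 0.1), (16000.0, 0.0)]))  # -> 0.0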
], tot_loss[loss=0.2123, simple_loss=0.2822, pruned_loss=0.07118, over 4274998.82 frames. ], batch size: 351, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:31:44,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1139022.0, ans=0.2 2023-06-24 16:31:54,572 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.86 vs. limit=15.0 2023-06-24 16:32:17,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1139142.0, ans=0.1 2023-06-24 16:32:40,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1139202.0, ans=0.0 2023-06-24 16:32:41,595 INFO [train.py:996] (1/4) Epoch 7, batch 6900, loss[loss=0.275, simple_loss=0.3898, pruned_loss=0.08009, over 19814.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2848, pruned_loss=0.07127, over 4279381.09 frames. ], batch size: 702, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:33:03,119 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.809e+02 3.309e+02 4.065e+02 7.013e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-24 16:33:16,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1139262.0, ans=0.125 2023-06-24 16:33:27,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-24 16:33:28,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1139322.0, ans=0.2 2023-06-24 16:34:28,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1139442.0, ans=0.1 2023-06-24 16:34:37,857 INFO [train.py:996] (1/4) Epoch 7, batch 6950, loss[loss=0.2273, simple_loss=0.3065, pruned_loss=0.07401, over 21718.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2863, pruned_loss=0.06897, over 4279367.09 frames. ], batch size: 298, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:34:45,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1139502.0, ans=0.0 2023-06-24 16:34:47,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1139502.0, ans=0.125 2023-06-24 16:35:51,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1139682.0, ans=0.2 2023-06-24 16:36:16,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-24 16:36:27,773 INFO [train.py:996] (1/4) Epoch 7, batch 7000, loss[loss=0.2372, simple_loss=0.2889, pruned_loss=0.09274, over 21310.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.289, pruned_loss=0.07107, over 4283640.23 frames. 
], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:36:39,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1139802.0, ans=0.1 2023-06-24 16:36:49,410 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.855e+02 3.392e+02 4.148e+02 6.941e+02, threshold=6.785e+02, percent-clipped=1.0 2023-06-24 16:37:22,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1139922.0, ans=0.125 2023-06-24 16:37:22,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1139922.0, ans=0.125 2023-06-24 16:37:49,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1139982.0, ans=0.2 2023-06-24 16:38:18,590 INFO [train.py:996] (1/4) Epoch 7, batch 7050, loss[loss=0.1862, simple_loss=0.2679, pruned_loss=0.05223, over 21694.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2884, pruned_loss=0.07076, over 4277347.40 frames. ], batch size: 247, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:38:32,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1140102.0, ans=0.0 2023-06-24 16:38:42,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1140162.0, ans=0.125 2023-06-24 16:38:53,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-24 16:39:36,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1140282.0, ans=0.125 2023-06-24 16:40:15,910 INFO [train.py:996] (1/4) Epoch 7, batch 7100, loss[loss=0.2343, simple_loss=0.3092, pruned_loss=0.07972, over 21691.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2925, pruned_loss=0.07208, over 4280525.24 frames. ], batch size: 298, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:40:28,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1140402.0, ans=0.125 2023-06-24 16:40:31,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.792e+02 3.207e+02 3.771e+02 5.994e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-24 16:40:47,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-24 16:41:08,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1140522.0, ans=0.1 2023-06-24 16:41:24,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1140582.0, ans=0.125 2023-06-24 16:41:36,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1140582.0, ans=0.125 2023-06-24 16:42:06,512 INFO [train.py:996] (1/4) Epoch 7, batch 7150, loss[loss=0.2038, simple_loss=0.2887, pruned_loss=0.05949, over 21341.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2898, pruned_loss=0.06905, over 4278602.06 frames. 
], batch size: 549, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:42:21,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1140702.0, ans=0.125 2023-06-24 16:42:58,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1140822.0, ans=0.0 2023-06-24 16:43:16,936 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-06-24 16:43:56,410 INFO [train.py:996] (1/4) Epoch 7, batch 7200, loss[loss=0.2047, simple_loss=0.3015, pruned_loss=0.05394, over 20925.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2921, pruned_loss=0.07139, over 4277991.85 frames. ], batch size: 607, lr: 4.40e-03, grad_scale: 32.0 2023-06-24 16:44:12,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.840e+02 3.235e+02 4.044e+02 5.731e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-24 16:44:15,537 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. limit=10.0 2023-06-24 16:45:21,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1141182.0, ans=0.0 2023-06-24 16:45:31,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1141242.0, ans=0.125 2023-06-24 16:45:40,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1141242.0, ans=0.125 2023-06-24 16:45:45,329 INFO [train.py:996] (1/4) Epoch 7, batch 7250, loss[loss=0.1917, simple_loss=0.2565, pruned_loss=0.06349, over 21882.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2886, pruned_loss=0.0712, over 4274589.88 frames. ], batch size: 373, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:46:20,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1141362.0, ans=0.05 2023-06-24 16:46:32,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1141422.0, ans=0.125 2023-06-24 16:47:15,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1141542.0, ans=0.04949747468305833 2023-06-24 16:47:34,236 INFO [train.py:996] (1/4) Epoch 7, batch 7300, loss[loss=0.1878, simple_loss=0.2537, pruned_loss=0.06091, over 21655.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2833, pruned_loss=0.07002, over 4267821.51 frames. ], batch size: 333, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:47:37,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-24 16:47:48,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1141602.0, ans=0.0 2023-06-24 16:47:49,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.54 vs. 
limit=5.0 2023-06-24 16:47:50,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1141662.0, ans=0.125 2023-06-24 16:47:51,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.579e+02 3.088e+02 3.610e+02 6.583e+02, threshold=6.177e+02, percent-clipped=0.0 2023-06-24 16:48:38,413 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:49:00,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-24 16:49:07,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1141842.0, ans=0.125 2023-06-24 16:49:10,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1141842.0, ans=0.0 2023-06-24 16:49:16,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1141842.0, ans=0.025 2023-06-24 16:49:25,138 INFO [train.py:996] (1/4) Epoch 7, batch 7350, loss[loss=0.2614, simple_loss=0.337, pruned_loss=0.09288, over 21456.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2802, pruned_loss=0.07002, over 4260904.50 frames. ], batch size: 131, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:49:34,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1141902.0, ans=0.1 2023-06-24 16:50:00,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-24 16:50:04,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1141962.0, ans=0.125 2023-06-24 16:50:45,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1142082.0, ans=0.125 2023-06-24 16:51:02,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1142142.0, ans=0.125 2023-06-24 16:51:11,764 INFO [train.py:996] (1/4) Epoch 7, batch 7400, loss[loss=0.2652, simple_loss=0.3497, pruned_loss=0.09035, over 21477.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2845, pruned_loss=0.07208, over 4257881.36 frames. 
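The scaling.py "Whitening: ... metric=X vs. limit=Y" records above compare a measured whiteness statistic of a layer's activations against that layer's configured limit; the values suggest a statistic normalized so that perfectly "white" (identity-covariance) features give 1.0 and anisotropic features give larger numbers. One common way to compute such a statistic (an illustration; the exact formula in scaling.py may differ):

    import numpy as np

    # Illustrative whiteness statistic for activations x of shape
    # (num_frames, num_channels): 1.0 for an identity-like channel covariance,
    # larger when a few directions dominate.
    def whiteness_metric(x: np.ndarray) -> float:
        cov = x.T @ x / x.shape[0]
        eigs = np.linalg.eigvalsh(cov)
        return float(x.shape[1] * (eigs ** 2).sum() / eigs.sum() ** 2)

    x_white = np.random.randn(20000, 256)
    x_skewed = x_white * np.linspace(0.1, 3.0, 256)   # anisotropic channel scales
    print(whiteness_metric(x_white))    # ~1.0
    print(whiteness_metric(x_skewed))   # noticeably larger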
], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:51:41,598 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.851e+02 3.315e+02 4.181e+02 6.542e+02, threshold=6.630e+02, percent-clipped=3.0 2023-06-24 16:51:45,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1142262.0, ans=0.125 2023-06-24 16:52:01,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1142262.0, ans=0.125 2023-06-24 16:52:03,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1142322.0, ans=0.125 2023-06-24 16:52:07,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1142322.0, ans=0.125 2023-06-24 16:52:08,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1142322.0, ans=0.125 2023-06-24 16:52:40,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1142382.0, ans=0.2 2023-06-24 16:52:58,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1142442.0, ans=0.125 2023-06-24 16:53:03,472 INFO [train.py:996] (1/4) Epoch 7, batch 7450, loss[loss=0.2425, simple_loss=0.2962, pruned_loss=0.09444, over 21368.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2839, pruned_loss=0.07151, over 4251498.97 frames. ], batch size: 473, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:54:30,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1142682.0, ans=0.125 2023-06-24 16:54:34,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1142682.0, ans=0.05 2023-06-24 16:55:06,453 INFO [train.py:996] (1/4) Epoch 7, batch 7500, loss[loss=0.2351, simple_loss=0.3242, pruned_loss=0.07294, over 21432.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2892, pruned_loss=0.07368, over 4256092.46 frames. ], batch size: 194, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:55:29,607 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.030e+02 3.534e+02 4.560e+02 9.672e+02, threshold=7.067e+02, percent-clipped=6.0 2023-06-24 16:56:08,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1142922.0, ans=0.125 2023-06-24 16:56:45,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1143042.0, ans=0.1 2023-06-24 16:56:56,940 INFO [train.py:996] (1/4) Epoch 7, batch 7550, loss[loss=0.2335, simple_loss=0.3343, pruned_loss=0.06635, over 21661.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.297, pruned_loss=0.07277, over 4257413.49 frames. ], batch size: 414, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:58:41,045 INFO [train.py:996] (1/4) Epoch 7, batch 7600, loss[loss=0.2262, simple_loss=0.2935, pruned_loss=0.07943, over 21350.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2959, pruned_loss=0.0714, over 4258335.49 frames. 
], batch size: 143, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 16:59:03,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1143462.0, ans=0.04949747468305833 2023-06-24 16:59:09,492 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.834e+02 3.229e+02 4.103e+02 6.859e+02, threshold=6.458e+02, percent-clipped=0.0 2023-06-24 16:59:10,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1143462.0, ans=0.125 2023-06-24 16:59:39,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=22.5 2023-06-24 16:59:47,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.56 vs. limit=15.0 2023-06-24 16:59:58,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1143582.0, ans=0.1 2023-06-24 17:00:11,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1143642.0, ans=0.125 2023-06-24 17:00:36,330 INFO [train.py:996] (1/4) Epoch 7, batch 7650, loss[loss=0.2662, simple_loss=0.3117, pruned_loss=0.1103, over 21779.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.296, pruned_loss=0.07362, over 4276269.92 frames. ], batch size: 508, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:00:49,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1143702.0, ans=0.125 2023-06-24 17:01:10,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1143762.0, ans=0.2 2023-06-24 17:01:20,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1143762.0, ans=0.2 2023-06-24 17:02:06,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1143942.0, ans=0.125 2023-06-24 17:02:28,490 INFO [train.py:996] (1/4) Epoch 7, batch 7700, loss[loss=0.2692, simple_loss=0.3973, pruned_loss=0.07052, over 19779.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3, pruned_loss=0.07641, over 4281815.05 frames. ], batch size: 702, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:02:49,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.45 vs. limit=10.0 2023-06-24 17:02:53,672 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.786e+02 3.159e+02 3.961e+02 6.423e+02, threshold=6.319e+02, percent-clipped=0.0 2023-06-24 17:03:14,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1144062.0, ans=0.2 2023-06-24 17:03:42,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1144182.0, ans=0.0 2023-06-24 17:04:27,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1144302.0, ans=10.0 2023-06-24 17:04:29,077 INFO [train.py:996] (1/4) Epoch 7, batch 7750, loss[loss=0.1857, simple_loss=0.2625, pruned_loss=0.05448, over 21910.00 frames. 
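The grad_scale field in the per-batch records (16.0 and 32.0 in this stretch) is the dynamic loss scale of mixed-precision training: it is periodically doubled while steps succeed and halved when inf/nan gradients are detected. A minimal sketch of such a step using PyTorch's GradScaler (illustrative; model, optimizer, features, targets and loss_fn are placeholders, not the recipe's actual training loop):

    import torch

    # Sketch of one fp16 training step; grad_scale in the log corresponds to
    # scaler.get_scale().
    scaler = torch.cuda.amp.GradScaler(enabled=True)

    def train_step(model, optimizer, features, targets, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=True):
            loss = loss_fn(model(features), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)   # skipped internally if inf/nan grads are found
        scaler.update()          # grows or backs off the scale
        return loss.detach(), scaler.get_scale()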
], tot_loss[loss=0.2282, simple_loss=0.3041, pruned_loss=0.07618, over 4282590.86 frames. ], batch size: 98, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:04:52,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1144362.0, ans=0.125 2023-06-24 17:06:27,893 INFO [train.py:996] (1/4) Epoch 7, batch 7800, loss[loss=0.2519, simple_loss=0.327, pruned_loss=0.08837, over 21685.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3052, pruned_loss=0.07627, over 4266431.32 frames. ], batch size: 414, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:06:30,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1144602.0, ans=0.125 2023-06-24 17:06:47,324 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.311e+02 4.032e+02 5.871e+02 9.097e+02, threshold=8.064e+02, percent-clipped=12.0 2023-06-24 17:06:55,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-24 17:06:56,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1144662.0, ans=0.125 2023-06-24 17:07:09,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1144722.0, ans=0.125 2023-06-24 17:08:11,748 INFO [train.py:996] (1/4) Epoch 7, batch 7850, loss[loss=0.233, simple_loss=0.2758, pruned_loss=0.09513, over 21378.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2991, pruned_loss=0.07606, over 4273958.31 frames. ], batch size: 509, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:09:08,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1145022.0, ans=0.1 2023-06-24 17:09:10,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1145082.0, ans=0.0 2023-06-24 17:09:34,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1145082.0, ans=0.0 2023-06-24 17:10:10,657 INFO [train.py:996] (1/4) Epoch 7, batch 7900, loss[loss=0.1242, simple_loss=0.1797, pruned_loss=0.03432, over 16119.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2919, pruned_loss=0.07445, over 4262558.02 frames. ], batch size: 60, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:10:11,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1145202.0, ans=0.125 2023-06-24 17:10:14,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1145202.0, ans=0.0 2023-06-24 17:10:23,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1145202.0, ans=0.125 2023-06-24 17:10:30,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.893e+02 3.310e+02 4.075e+02 8.177e+02, threshold=6.621e+02, percent-clipped=1.0 2023-06-24 17:11:33,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1145382.0, ans=0.07 2023-06-24 17:11:44,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. 
limit=15.0 2023-06-24 17:12:02,895 INFO [train.py:996] (1/4) Epoch 7, batch 7950, loss[loss=0.2271, simple_loss=0.2984, pruned_loss=0.07796, over 20792.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2967, pruned_loss=0.07404, over 4257029.28 frames. ], batch size: 611, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:12:30,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1145562.0, ans=0.2 2023-06-24 17:13:54,792 INFO [train.py:996] (1/4) Epoch 7, batch 8000, loss[loss=0.2295, simple_loss=0.3247, pruned_loss=0.06712, over 21739.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3014, pruned_loss=0.07562, over 4262283.61 frames. ], batch size: 351, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:14:07,026 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:14:22,382 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.777e+02 3.258e+02 3.899e+02 6.990e+02, threshold=6.515e+02, percent-clipped=3.0 2023-06-24 17:15:20,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1145982.0, ans=0.1 2023-06-24 17:15:30,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.39 vs. limit=22.5 2023-06-24 17:15:57,269 INFO [train.py:996] (1/4) Epoch 7, batch 8050, loss[loss=0.2574, simple_loss=0.3412, pruned_loss=0.08682, over 21742.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3035, pruned_loss=0.07594, over 4258038.99 frames. ], batch size: 351, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:17:30,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1146342.0, ans=0.125 2023-06-24 17:17:48,335 INFO [train.py:996] (1/4) Epoch 7, batch 8100, loss[loss=0.2312, simple_loss=0.2995, pruned_loss=0.08141, over 21310.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3051, pruned_loss=0.0769, over 4260066.61 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:18:05,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1146402.0, ans=0.0 2023-06-24 17:18:21,659 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.042e+02 3.840e+02 5.397e+02 9.623e+02, threshold=7.680e+02, percent-clipped=13.0 2023-06-24 17:18:39,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1146522.0, ans=0.2 2023-06-24 17:18:40,542 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=12.0 2023-06-24 17:19:25,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1146642.0, ans=0.125 2023-06-24 17:19:55,077 INFO [train.py:996] (1/4) Epoch 7, batch 8150, loss[loss=0.2015, simple_loss=0.2933, pruned_loss=0.05483, over 20071.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3127, pruned_loss=0.07924, over 4261689.45 frames. 
], batch size: 703, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:19:59,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1146702.0, ans=0.125 2023-06-24 17:20:11,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1146762.0, ans=0.0 2023-06-24 17:20:46,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1146882.0, ans=0.1 2023-06-24 17:21:36,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1146942.0, ans=10.0 2023-06-24 17:21:36,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1146942.0, ans=0.2 2023-06-24 17:21:44,387 INFO [train.py:996] (1/4) Epoch 7, batch 8200, loss[loss=0.17, simple_loss=0.2297, pruned_loss=0.05515, over 21221.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3053, pruned_loss=0.07645, over 4269974.80 frames. ], batch size: 176, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:21:54,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.39 vs. limit=22.5 2023-06-24 17:22:06,169 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.961e+02 3.959e+02 5.617e+02 1.113e+03, threshold=7.919e+02, percent-clipped=3.0 2023-06-24 17:22:13,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1147062.0, ans=0.0 2023-06-24 17:23:01,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1147242.0, ans=0.125 2023-06-24 17:23:29,574 INFO [train.py:996] (1/4) Epoch 7, batch 8250, loss[loss=0.2062, simple_loss=0.2902, pruned_loss=0.06108, over 21287.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.305, pruned_loss=0.07703, over 4266965.32 frames. ], batch size: 159, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:23:52,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1147362.0, ans=0.0 2023-06-24 17:24:00,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1147362.0, ans=0.125 2023-06-24 17:24:06,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1147422.0, ans=0.0 2023-06-24 17:24:09,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1147422.0, ans=0.125 2023-06-24 17:25:15,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1147542.0, ans=0.2 2023-06-24 17:25:22,786 INFO [train.py:996] (1/4) Epoch 7, batch 8300, loss[loss=0.2151, simple_loss=0.2917, pruned_loss=0.06921, over 21350.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3044, pruned_loss=0.07485, over 4263445.82 frames. 
], batch size: 176, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:25:43,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.710e+02 3.107e+02 3.703e+02 5.803e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-24 17:26:16,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-24 17:26:17,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1147722.0, ans=0.125 2023-06-24 17:26:34,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1147782.0, ans=0.125 2023-06-24 17:26:46,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1147782.0, ans=0.035 2023-06-24 17:26:58,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1147842.0, ans=0.2 2023-06-24 17:27:04,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1147842.0, ans=0.2 2023-06-24 17:27:12,193 INFO [train.py:996] (1/4) Epoch 7, batch 8350, loss[loss=0.204, simple_loss=0.2759, pruned_loss=0.06605, over 21774.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3024, pruned_loss=0.07255, over 4258511.66 frames. ], batch size: 112, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:27:19,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=22.5 2023-06-24 17:27:26,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1147902.0, ans=0.07 2023-06-24 17:29:03,703 INFO [train.py:996] (1/4) Epoch 7, batch 8400, loss[loss=0.1767, simple_loss=0.2473, pruned_loss=0.053, over 21204.00 frames. ], tot_loss[loss=0.219, simple_loss=0.299, pruned_loss=0.06949, over 4263585.98 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:29:25,440 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.527e+02 3.220e+02 3.909e+02 1.035e+03, threshold=6.440e+02, percent-clipped=5.0 2023-06-24 17:29:30,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1148262.0, ans=0.2 2023-06-24 17:29:46,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1148322.0, ans=0.1 2023-06-24 17:30:03,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1148382.0, ans=0.0 2023-06-24 17:30:30,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1148442.0, ans=0.125 2023-06-24 17:30:47,864 INFO [train.py:996] (1/4) Epoch 7, batch 8450, loss[loss=0.2339, simple_loss=0.3035, pruned_loss=0.08211, over 21236.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.297, pruned_loss=0.06932, over 4268775.74 frames. 
], batch size: 143, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:30:48,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1148502.0, ans=0.125 2023-06-24 17:32:27,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1148742.0, ans=0.0 2023-06-24 17:32:36,629 INFO [train.py:996] (1/4) Epoch 7, batch 8500, loss[loss=0.2111, simple_loss=0.2509, pruned_loss=0.08564, over 20073.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2932, pruned_loss=0.07035, over 4268612.45 frames. ], batch size: 704, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:32:48,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1148802.0, ans=0.125 2023-06-24 17:32:57,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 2.839e+02 3.413e+02 4.005e+02 7.078e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-24 17:34:04,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1148982.0, ans=0.1 2023-06-24 17:34:11,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1149042.0, ans=0.125 2023-06-24 17:34:17,005 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:34:26,835 INFO [train.py:996] (1/4) Epoch 7, batch 8550, loss[loss=0.2246, simple_loss=0.3187, pruned_loss=0.06519, over 21708.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2968, pruned_loss=0.07251, over 4274448.97 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:34:36,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1149102.0, ans=0.125 2023-06-24 17:34:45,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1149162.0, ans=0.2 2023-06-24 17:35:17,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1149222.0, ans=0.2 2023-06-24 17:35:25,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-06-24 17:35:56,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1149282.0, ans=0.2 2023-06-24 17:36:08,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-24 17:36:18,078 INFO [train.py:996] (1/4) Epoch 7, batch 8600, loss[loss=0.2442, simple_loss=0.3194, pruned_loss=0.08447, over 21410.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3027, pruned_loss=0.07424, over 4272347.09 frames. 
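Each per-batch line reports loss together with its two components, simple_loss and pruned_loss. The values in this section are consistent with the total being a fixed weighted sum after warm-up, loss = 0.5 * simple_loss + pruned_loss; for example the batch-8500 entry above gives 0.5 * 0.2509 + 0.08564 = 0.2111. A hedged sketch of that combination (the scales are inferred from the logged numbers, not quoted from the training script, and are typically ramped during warm-up):

def combine_transducer_losses(simple_loss, pruned_loss,
                              simple_loss_scale=0.5, pruned_loss_scale=1.0):
    """Weighted total of the simple and pruned transducer losses.

    The 0.5 / 1.0 defaults are what the logged numbers imply for this run
    after warm-up; treat them as assumptions rather than the exact code.
    """
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

# Reproduces the batch-8500 entry above: 0.5 * 0.2509 + 0.08564 = 0.21109
assert abs(combine_transducer_losses(0.2509, 0.08564) - 0.2111) < 5e-4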
], batch size: 211, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:36:28,313 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:36:31,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1149402.0, ans=0.0 2023-06-24 17:36:40,531 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.018e+02 3.698e+02 4.926e+02 7.683e+02, threshold=7.396e+02, percent-clipped=5.0 2023-06-24 17:36:42,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1149462.0, ans=0.2 2023-06-24 17:36:42,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1149462.0, ans=0.0 2023-06-24 17:37:07,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1149522.0, ans=0.025 2023-06-24 17:37:43,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1149642.0, ans=0.125 2023-06-24 17:37:58,816 INFO [train.py:996] (1/4) Epoch 7, batch 8650, loss[loss=0.2145, simple_loss=0.2817, pruned_loss=0.07367, over 21087.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3093, pruned_loss=0.07509, over 4279185.93 frames. ], batch size: 607, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:38:24,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1149762.0, ans=0.125 2023-06-24 17:38:47,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1149822.0, ans=0.1 2023-06-24 17:39:03,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1149822.0, ans=0.0 2023-06-24 17:39:07,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1149882.0, ans=0.125 2023-06-24 17:39:08,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1149882.0, ans=0.0 2023-06-24 17:39:36,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1149942.0, ans=0.125 2023-06-24 17:39:42,718 INFO [train.py:996] (1/4) Epoch 7, batch 8700, loss[loss=0.1938, simple_loss=0.2569, pruned_loss=0.0654, over 21453.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.304, pruned_loss=0.07223, over 4271247.82 frames. 
], batch size: 131, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:40:09,728 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 2.588e+02 3.028e+02 3.644e+02 6.697e+02, threshold=6.057e+02, percent-clipped=0.0 2023-06-24 17:40:52,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1150122.0, ans=0.1 2023-06-24 17:41:11,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1150242.0, ans=0.2 2023-06-24 17:41:13,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1150242.0, ans=0.125 2023-06-24 17:41:30,717 INFO [train.py:996] (1/4) Epoch 7, batch 8750, loss[loss=0.1975, simple_loss=0.2664, pruned_loss=0.06427, over 21362.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3011, pruned_loss=0.0725, over 4273834.51 frames. ], batch size: 159, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:41:54,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1150362.0, ans=0.125 2023-06-24 17:43:08,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1150542.0, ans=0.0 2023-06-24 17:43:21,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1150602.0, ans=0.09899494936611666 2023-06-24 17:43:22,456 INFO [train.py:996] (1/4) Epoch 7, batch 8800, loss[loss=0.2355, simple_loss=0.321, pruned_loss=0.07503, over 21567.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3076, pruned_loss=0.0749, over 4269868.46 frames. ], batch size: 230, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:43:30,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1150602.0, ans=0.125 2023-06-24 17:43:55,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-24 17:44:02,354 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.059e+02 3.780e+02 4.742e+02 8.855e+02, threshold=7.560e+02, percent-clipped=10.0 2023-06-24 17:44:06,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1150662.0, ans=0.09899494936611666 2023-06-24 17:44:40,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1150782.0, ans=0.1 2023-06-24 17:44:48,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1150782.0, ans=0.125 2023-06-24 17:45:11,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1150842.0, ans=0.125 2023-06-24 17:45:15,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1150842.0, ans=0.125 2023-06-24 17:45:24,892 INFO [train.py:996] (1/4) Epoch 7, batch 8850, loss[loss=0.2328, simple_loss=0.3285, pruned_loss=0.06853, over 16045.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3116, pruned_loss=0.07555, over 4263524.38 frames. 
], batch size: 61, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:45:27,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1150902.0, ans=0.125 2023-06-24 17:46:31,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1151082.0, ans=0.125 2023-06-24 17:46:57,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1151142.0, ans=0.1 2023-06-24 17:47:13,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1151142.0, ans=0.125 2023-06-24 17:47:16,884 INFO [train.py:996] (1/4) Epoch 7, batch 8900, loss[loss=0.2239, simple_loss=0.3069, pruned_loss=0.07043, over 21852.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3077, pruned_loss=0.07463, over 4263418.86 frames. ], batch size: 372, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:47:30,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1151202.0, ans=0.0 2023-06-24 17:47:41,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1151202.0, ans=0.125 2023-06-24 17:47:54,305 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.946e+02 3.604e+02 5.046e+02 1.118e+03, threshold=7.207e+02, percent-clipped=3.0 2023-06-24 17:47:54,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1151262.0, ans=0.1 2023-06-24 17:48:00,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1151262.0, ans=0.0 2023-06-24 17:48:04,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1151322.0, ans=0.05 2023-06-24 17:48:04,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1151322.0, ans=0.2 2023-06-24 17:48:54,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2023-06-24 17:49:21,206 INFO [train.py:996] (1/4) Epoch 7, batch 8950, loss[loss=0.2931, simple_loss=0.4118, pruned_loss=0.08715, over 19769.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3102, pruned_loss=0.07447, over 4268561.36 frames. ], batch size: 702, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:49:47,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1151562.0, ans=0.0 2023-06-24 17:49:49,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1151562.0, ans=0.0 2023-06-24 17:49:57,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-24 17:50:02,729 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.18 vs. 
limit=15.0 2023-06-24 17:50:24,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1151682.0, ans=0.95 2023-06-24 17:51:07,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1151742.0, ans=0.125 2023-06-24 17:51:10,376 INFO [train.py:996] (1/4) Epoch 7, batch 9000, loss[loss=0.2062, simple_loss=0.2682, pruned_loss=0.07207, over 21728.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3055, pruned_loss=0.0746, over 4270638.55 frames. ], batch size: 300, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:51:10,377 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 17:51:24,548 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3642, 2.2228, 4.0799, 3.9101], device='cuda:1') 2023-06-24 17:51:28,283 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2657, simple_loss=0.3576, pruned_loss=0.0869, over 1796401.00 frames. 2023-06-24 17:51:28,285 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 17:51:53,793 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.929e+02 3.694e+02 4.955e+02 7.799e+02, threshold=7.388e+02, percent-clipped=3.0 2023-06-24 17:52:36,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.22 vs. limit=6.0 2023-06-24 17:52:56,307 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-24 17:53:21,712 INFO [train.py:996] (1/4) Epoch 7, batch 9050, loss[loss=0.2191, simple_loss=0.2994, pruned_loss=0.06938, over 21747.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3014, pruned_loss=0.07177, over 4269507.86 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:53:40,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1152162.0, ans=0.0 2023-06-24 17:53:58,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1152162.0, ans=0.1 2023-06-24 17:54:19,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1152222.0, ans=0.09899494936611666 2023-06-24 17:54:51,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1152282.0, ans=0.125 2023-06-24 17:55:14,793 INFO [train.py:996] (1/4) Epoch 7, batch 9100, loss[loss=0.2367, simple_loss=0.3307, pruned_loss=0.07137, over 21673.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3061, pruned_loss=0.07433, over 4269082.15 frames. 
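The validation block above also records the peak GPU memory ("Maximum memory allocated so far is 23743MB"). That figure corresponds to PyTorch's CUDA memory statistics; a minimal way to print such a line on this rank (assuming a CUDA build and that cuda:1 is the device in use, as in this run) is:

import torch

if torch.cuda.is_available():
    peak_mb = torch.cuda.max_memory_allocated(torch.device("cuda:1")) / (1024 ** 2)
    print(f"Maximum memory allocated so far is {int(peak_mb)}MB")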
], batch size: 414, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:55:17,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1152402.0, ans=0.125 2023-06-24 17:55:36,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152462.0, ans=0.1 2023-06-24 17:55:45,275 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.655e+02 3.193e+02 3.861e+02 6.275e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-24 17:55:47,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152462.0, ans=0.1 2023-06-24 17:56:09,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1152522.0, ans=0.2 2023-06-24 17:56:20,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1152582.0, ans=0.0 2023-06-24 17:56:31,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1152582.0, ans=0.125 2023-06-24 17:56:39,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-24 17:57:01,007 INFO [train.py:996] (1/4) Epoch 7, batch 9150, loss[loss=0.213, simple_loss=0.3251, pruned_loss=0.05048, over 21208.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3088, pruned_loss=0.07234, over 4271140.11 frames. ], batch size: 548, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:57:49,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1152822.0, ans=0.0 2023-06-24 17:58:05,964 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:58:19,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-24 17:58:54,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1152942.0, ans=0.125 2023-06-24 17:58:58,734 INFO [train.py:996] (1/4) Epoch 7, batch 9200, loss[loss=0.2552, simple_loss=0.3406, pruned_loss=0.08494, over 21641.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3103, pruned_loss=0.07093, over 4267481.35 frames. ], batch size: 414, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:59:05,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1153002.0, ans=0.125 2023-06-24 17:59:28,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1153062.0, ans=0.0 2023-06-24 17:59:29,536 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.740e+02 3.426e+02 4.320e+02 8.569e+02, threshold=6.853e+02, percent-clipped=6.0 2023-06-24 18:00:50,654 INFO [train.py:996] (1/4) Epoch 7, batch 9250, loss[loss=0.2323, simple_loss=0.3079, pruned_loss=0.07837, over 21679.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3116, pruned_loss=0.07333, over 4266201.77 frames. 
], batch size: 298, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 18:01:05,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1153302.0, ans=0.125 2023-06-24 18:01:54,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1153422.0, ans=0.2 2023-06-24 18:02:04,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1153482.0, ans=0.0 2023-06-24 18:02:12,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1153482.0, ans=0.125 2023-06-24 18:02:42,902 INFO [train.py:996] (1/4) Epoch 7, batch 9300, loss[loss=0.1963, simple_loss=0.2626, pruned_loss=0.06501, over 21564.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3046, pruned_loss=0.0731, over 4269815.74 frames. ], batch size: 263, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 18:02:50,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1153602.0, ans=0.2 2023-06-24 18:03:13,911 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.058e+02 3.549e+02 4.364e+02 7.419e+02, threshold=7.098e+02, percent-clipped=2.0 2023-06-24 18:04:04,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1153782.0, ans=0.1 2023-06-24 18:04:24,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1153842.0, ans=0.0 2023-06-24 18:04:29,057 INFO [train.py:996] (1/4) Epoch 7, batch 9350, loss[loss=0.2356, simple_loss=0.3233, pruned_loss=0.07397, over 21735.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3116, pruned_loss=0.07529, over 4271133.38 frames. ], batch size: 298, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:05:13,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1153962.0, ans=0.125 2023-06-24 18:05:21,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-24 18:05:32,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1154022.0, ans=0.1 2023-06-24 18:06:10,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1154142.0, ans=0.125 2023-06-24 18:06:28,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1154142.0, ans=0.125 2023-06-24 18:06:30,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1154202.0, ans=0.125 2023-06-24 18:06:31,722 INFO [train.py:996] (1/4) Epoch 7, batch 9400, loss[loss=0.1824, simple_loss=0.2483, pruned_loss=0.05828, over 21304.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3128, pruned_loss=0.07538, over 4276989.29 frames. 
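The scaling.py "Whitening" lines report a per-module metric compared against a limit (for instance metric=10.81 vs. limit=15.0 above); while the metric stays under its limit, the channel covariance of that module's activations is treated as close enough to white. One plausible definition of such a metric, equal to 1.0 for perfectly white features and growing as the covariance departs from a multiple of the identity, is sketched below. It is an assumption for illustration, not a verbatim copy of the scaling.py formula.

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """d * trace(C @ C) / trace(C)**2 for the (uncentered) covariance C = x^T x.

    This ratio is >= 1 and equals 1 exactly when C is a multiple of the
    identity, i.e. when the channels are white and equally scaled.
    """
    x = x.reshape(-1, x.shape[-1]).to(torch.float32)
    d = x.shape[-1]
    cov = x.t() @ x
    return d * torch.diagonal(cov @ cov).sum() / (torch.diagonal(cov).sum() ** 2 + 1e-20)

# White Gaussian features give a metric near 1; a limit such as 15.0 would flag
# modules whose channel statistics have drifted far from that.
print(whitening_metric(torch.randn(1000, 256)))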
], batch size: 549, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:06:36,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154202.0, ans=0.1 2023-06-24 18:06:54,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=12.0 2023-06-24 18:07:01,678 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-24 18:07:02,156 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.873e+02 3.280e+02 3.858e+02 8.681e+02, threshold=6.561e+02, percent-clipped=2.0 2023-06-24 18:07:08,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1154262.0, ans=0.0 2023-06-24 18:07:15,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1154322.0, ans=0.125 2023-06-24 18:07:33,753 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-24 18:07:57,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1154442.0, ans=0.125 2023-06-24 18:08:21,775 INFO [train.py:996] (1/4) Epoch 7, batch 9450, loss[loss=0.1808, simple_loss=0.2569, pruned_loss=0.05235, over 21770.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3044, pruned_loss=0.07417, over 4261395.70 frames. ], batch size: 124, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:08:23,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1154502.0, ans=0.0 2023-06-24 18:08:24,438 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.67 vs. limit=15.0 2023-06-24 18:09:00,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1154622.0, ans=0.125 2023-06-24 18:09:45,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154742.0, ans=0.1 2023-06-24 18:10:10,153 INFO [train.py:996] (1/4) Epoch 7, batch 9500, loss[loss=0.2344, simple_loss=0.3035, pruned_loss=0.08263, over 21823.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2965, pruned_loss=0.07228, over 4259473.24 frames. ], batch size: 118, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:10:42,235 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.886e+02 3.476e+02 4.165e+02 8.781e+02, threshold=6.953e+02, percent-clipped=4.0 2023-06-24 18:11:44,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1155042.0, ans=0.125 2023-06-24 18:11:58,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1155042.0, ans=0.2 2023-06-24 18:11:59,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1155102.0, ans=0.0 2023-06-24 18:12:01,001 INFO [train.py:996] (1/4) Epoch 7, batch 9550, loss[loss=0.2263, simple_loss=0.3252, pruned_loss=0.06366, over 21811.00 frames. 
], tot_loss[loss=0.2241, simple_loss=0.3003, pruned_loss=0.07399, over 4263337.96 frames. ], batch size: 282, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:12:55,337 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:13:12,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1155282.0, ans=22.5 2023-06-24 18:13:47,556 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:13:50,403 INFO [train.py:996] (1/4) Epoch 7, batch 9600, loss[loss=0.1917, simple_loss=0.2693, pruned_loss=0.05703, over 21293.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3029, pruned_loss=0.07546, over 4267639.77 frames. ], batch size: 176, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:14:06,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1155402.0, ans=0.2 2023-06-24 18:14:20,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-24 18:14:23,104 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.053e+02 3.563e+02 4.666e+02 8.626e+02, threshold=7.126e+02, percent-clipped=5.0 2023-06-24 18:14:52,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1155582.0, ans=0.0 2023-06-24 18:14:52,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1155582.0, ans=0.1 2023-06-24 18:15:19,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1155642.0, ans=0.0 2023-06-24 18:15:45,052 INFO [train.py:996] (1/4) Epoch 7, batch 9650, loss[loss=0.2335, simple_loss=0.3126, pruned_loss=0.07716, over 21692.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3027, pruned_loss=0.07623, over 4268001.63 frames. ], batch size: 351, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:16:07,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.07 vs. limit=5.0 2023-06-24 18:16:13,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1155762.0, ans=0.125 2023-06-24 18:17:34,788 INFO [train.py:996] (1/4) Epoch 7, batch 9700, loss[loss=0.2179, simple_loss=0.2924, pruned_loss=0.07165, over 21279.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3064, pruned_loss=0.07606, over 4270913.25 frames. 
], batch size: 143, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:17:42,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1156002.0, ans=0.0 2023-06-24 18:17:56,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156062.0, ans=0.1 2023-06-24 18:18:08,283 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.706e+02 3.025e+02 3.673e+02 7.479e+02, threshold=6.049e+02, percent-clipped=1.0 2023-06-24 18:18:17,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156122.0, ans=0.1 2023-06-24 18:18:35,940 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 18:19:18,113 INFO [train.py:996] (1/4) Epoch 7, batch 9750, loss[loss=0.2743, simple_loss=0.3591, pruned_loss=0.09473, over 21449.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3011, pruned_loss=0.07501, over 4258674.67 frames. ], batch size: 131, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:19:22,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1156302.0, ans=0.09899494936611666 2023-06-24 18:19:39,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1156362.0, ans=0.125 2023-06-24 18:19:57,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=15.0 2023-06-24 18:20:06,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1156422.0, ans=0.2 2023-06-24 18:20:07,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1156422.0, ans=0.05 2023-06-24 18:20:11,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156422.0, ans=0.1 2023-06-24 18:20:19,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1156482.0, ans=0.125 2023-06-24 18:20:40,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1156542.0, ans=0.95 2023-06-24 18:21:07,433 INFO [train.py:996] (1/4) Epoch 7, batch 9800, loss[loss=0.216, simple_loss=0.2936, pruned_loss=0.06919, over 21692.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3005, pruned_loss=0.07467, over 4262617.69 frames. 
], batch size: 389, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:21:39,990 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.762e+02 3.059e+02 4.077e+02 6.018e+02, threshold=6.118e+02, percent-clipped=0.0 2023-06-24 18:21:51,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1156722.0, ans=0.125 2023-06-24 18:22:37,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1156842.0, ans=0.0 2023-06-24 18:22:55,809 INFO [train.py:996] (1/4) Epoch 7, batch 9850, loss[loss=0.1903, simple_loss=0.2315, pruned_loss=0.07461, over 20074.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2977, pruned_loss=0.07442, over 4260020.29 frames. ], batch size: 703, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:23:04,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1156902.0, ans=0.2 2023-06-24 18:23:40,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1157022.0, ans=0.125 2023-06-24 18:23:40,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1157022.0, ans=0.0 2023-06-24 18:23:45,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1157022.0, ans=0.2 2023-06-24 18:24:09,096 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-24 18:24:14,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1157142.0, ans=0.1 2023-06-24 18:24:38,512 INFO [train.py:996] (1/4) Epoch 7, batch 9900, loss[loss=0.1938, simple_loss=0.2694, pruned_loss=0.05912, over 21369.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.294, pruned_loss=0.07381, over 4256841.14 frames. ], batch size: 211, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:25:12,372 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.791e+02 3.369e+02 4.122e+02 6.726e+02, threshold=6.739e+02, percent-clipped=1.0 2023-06-24 18:26:07,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.18 vs. limit=10.0 2023-06-24 18:26:27,528 INFO [train.py:996] (1/4) Epoch 7, batch 9950, loss[loss=0.2852, simple_loss=0.3881, pruned_loss=0.09114, over 19783.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2953, pruned_loss=0.07587, over 4258998.59 frames. 
], batch size: 702, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:26:33,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1157502.0, ans=0.125 2023-06-24 18:27:11,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1157622.0, ans=0.0 2023-06-24 18:27:14,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1157622.0, ans=0.2 2023-06-24 18:27:36,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1157682.0, ans=0.125 2023-06-24 18:27:36,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1157682.0, ans=0.125 2023-06-24 18:27:38,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1157682.0, ans=0.0 2023-06-24 18:28:09,487 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.20 vs. limit=6.0 2023-06-24 18:28:16,552 INFO [train.py:996] (1/4) Epoch 7, batch 10000, loss[loss=0.19, simple_loss=0.27, pruned_loss=0.05503, over 21668.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2911, pruned_loss=0.07443, over 4255859.34 frames. ], batch size: 391, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:28:17,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1157802.0, ans=0.0 2023-06-24 18:28:49,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.643e+02 3.254e+02 4.440e+02 7.063e+02, threshold=6.507e+02, percent-clipped=1.0 2023-06-24 18:29:01,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1157922.0, ans=0.2 2023-06-24 18:30:04,067 INFO [train.py:996] (1/4) Epoch 7, batch 10050, loss[loss=0.194, simple_loss=0.261, pruned_loss=0.06348, over 21203.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2935, pruned_loss=0.07516, over 4260940.22 frames. ], batch size: 159, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:30:10,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1158102.0, ans=0.125 2023-06-24 18:31:19,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1158282.0, ans=0.125 2023-06-24 18:31:37,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1158342.0, ans=0.1 2023-06-24 18:32:01,198 INFO [train.py:996] (1/4) Epoch 7, batch 10100, loss[loss=0.1778, simple_loss=0.2617, pruned_loss=0.04697, over 21003.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.292, pruned_loss=0.07362, over 4252105.04 frames. ], batch size: 608, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:32:22,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. 
limit=15.0 2023-06-24 18:32:30,795 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.650e+02 3.073e+02 3.822e+02 6.259e+02, threshold=6.145e+02, percent-clipped=0.0 2023-06-24 18:32:34,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1158522.0, ans=0.0 2023-06-24 18:33:31,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1158642.0, ans=0.5 2023-06-24 18:33:50,329 INFO [train.py:996] (1/4) Epoch 7, batch 10150, loss[loss=0.2228, simple_loss=0.3079, pruned_loss=0.0688, over 21814.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2997, pruned_loss=0.07657, over 4256067.65 frames. ], batch size: 316, lr: 4.37e-03, grad_scale: 8.0 2023-06-24 18:33:54,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1158702.0, ans=0.0 2023-06-24 18:34:46,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1158822.0, ans=10.0 2023-06-24 18:35:34,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1158942.0, ans=0.125 2023-06-24 18:35:39,222 INFO [train.py:996] (1/4) Epoch 7, batch 10200, loss[loss=0.205, simple_loss=0.2987, pruned_loss=0.0556, over 21182.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2983, pruned_loss=0.07436, over 4256164.94 frames. ], batch size: 548, lr: 4.37e-03, grad_scale: 8.0 2023-06-24 18:35:47,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-24 18:35:59,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1159062.0, ans=0.125 2023-06-24 18:36:17,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.567e+02 2.979e+02 3.564e+02 7.472e+02, threshold=5.959e+02, percent-clipped=1.0 2023-06-24 18:36:26,835 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-24 18:36:51,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=22.5 2023-06-24 18:36:54,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1159182.0, ans=0.125 2023-06-24 18:37:03,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1159182.0, ans=0.0 2023-06-24 18:37:17,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1159242.0, ans=0.0 2023-06-24 18:37:19,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1159242.0, ans=0.2 2023-06-24 18:37:24,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1159242.0, ans=0.125 2023-06-24 18:37:28,868 INFO [train.py:996] (1/4) Epoch 7, batch 10250, loss[loss=0.2568, simple_loss=0.3339, pruned_loss=0.08986, over 21406.00 frames. 
], tot_loss[loss=0.2151, simple_loss=0.2921, pruned_loss=0.06905, over 4265569.12 frames. ], batch size: 131, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:37:29,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1159302.0, ans=0.0 2023-06-24 18:38:46,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1159482.0, ans=0.125 2023-06-24 18:39:05,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1159542.0, ans=0.125 2023-06-24 18:39:19,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1159542.0, ans=0.2 2023-06-24 18:39:22,103 INFO [train.py:996] (1/4) Epoch 7, batch 10300, loss[loss=0.2298, simple_loss=0.3116, pruned_loss=0.07398, over 21418.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2978, pruned_loss=0.07121, over 4265355.88 frames. ], batch size: 211, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:39:28,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1159602.0, ans=0.0 2023-06-24 18:39:29,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1159602.0, ans=0.07 2023-06-24 18:40:11,436 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.687e+02 3.369e+02 4.671e+02 1.084e+03, threshold=6.737e+02, percent-clipped=9.0 2023-06-24 18:40:12,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1159662.0, ans=0.125 2023-06-24 18:41:14,514 INFO [train.py:996] (1/4) Epoch 7, batch 10350, loss[loss=0.1767, simple_loss=0.2483, pruned_loss=0.05257, over 21508.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2992, pruned_loss=0.07092, over 4267865.29 frames. ], batch size: 195, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:41:45,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1159962.0, ans=0.125 2023-06-24 18:42:17,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1160022.0, ans=0.0 2023-06-24 18:42:17,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1160022.0, ans=0.0 2023-06-24 18:42:38,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1160082.0, ans=0.0 2023-06-24 18:42:47,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1160142.0, ans=0.125 2023-06-24 18:43:12,829 INFO [train.py:996] (1/4) Epoch 7, batch 10400, loss[loss=0.1964, simple_loss=0.2654, pruned_loss=0.06373, over 21623.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2913, pruned_loss=0.06988, over 4274521.04 frames. ], batch size: 263, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:43:46,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. 
limit=15.0 2023-06-24 18:43:55,302 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:43:56,275 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.812e+02 3.590e+02 4.501e+02 9.958e+02, threshold=7.181e+02, percent-clipped=6.0 2023-06-24 18:44:12,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1160322.0, ans=0.125 2023-06-24 18:44:30,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1160382.0, ans=15.0 2023-06-24 18:44:33,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1160382.0, ans=0.2 2023-06-24 18:44:37,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-24 18:45:14,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1160502.0, ans=0.125 2023-06-24 18:45:15,901 INFO [train.py:996] (1/4) Epoch 7, batch 10450, loss[loss=0.2835, simple_loss=0.346, pruned_loss=0.1105, over 21804.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2969, pruned_loss=0.07229, over 4270433.20 frames. ], batch size: 441, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:45:17,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1160502.0, ans=0.125 2023-06-24 18:45:18,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1160502.0, ans=0.2 2023-06-24 18:46:20,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1160682.0, ans=0.125 2023-06-24 18:47:06,305 INFO [train.py:996] (1/4) Epoch 7, batch 10500, loss[loss=0.1972, simple_loss=0.2619, pruned_loss=0.06624, over 21513.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2965, pruned_loss=0.07098, over 4264972.33 frames. ], batch size: 230, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:47:40,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1160862.0, ans=0.125 2023-06-24 18:47:43,176 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.810e+02 3.423e+02 4.183e+02 6.636e+02, threshold=6.845e+02, percent-clipped=0.0 2023-06-24 18:48:09,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1160982.0, ans=0.125 2023-06-24 18:48:54,920 INFO [train.py:996] (1/4) Epoch 7, batch 10550, loss[loss=0.2067, simple_loss=0.2762, pruned_loss=0.0686, over 21859.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2899, pruned_loss=0.07027, over 4253228.92 frames. ], batch size: 107, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:49:48,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1161222.0, ans=0.1 2023-06-24 18:50:46,830 INFO [train.py:996] (1/4) Epoch 7, batch 10600, loss[loss=0.1889, simple_loss=0.2586, pruned_loss=0.05958, over 21992.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2857, pruned_loss=0.06879, over 4256710.36 frames. 
], batch size: 103, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:50:47,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1161402.0, ans=0.125 2023-06-24 18:50:55,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1161402.0, ans=0.5 2023-06-24 18:51:17,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1161462.0, ans=0.0 2023-06-24 18:51:24,989 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.607e+02 2.934e+02 3.561e+02 5.999e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-24 18:51:27,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1161522.0, ans=0.125 2023-06-24 18:51:36,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1161522.0, ans=0.2 2023-06-24 18:51:43,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1161522.0, ans=0.125 2023-06-24 18:51:53,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1161582.0, ans=0.125 2023-06-24 18:52:38,202 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-24 18:52:38,843 INFO [train.py:996] (1/4) Epoch 7, batch 10650, loss[loss=0.1617, simple_loss=0.2455, pruned_loss=0.03892, over 21748.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2887, pruned_loss=0.06823, over 4255550.16 frames. ], batch size: 282, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:53:00,682 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-24 18:53:45,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1161822.0, ans=0.0 2023-06-24 18:54:12,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1161942.0, ans=0.125 2023-06-24 18:54:29,866 INFO [train.py:996] (1/4) Epoch 7, batch 10700, loss[loss=0.2247, simple_loss=0.2964, pruned_loss=0.07649, over 21766.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2887, pruned_loss=0.0681, over 4251053.27 frames. ], batch size: 247, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:54:45,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1162002.0, ans=0.0 2023-06-24 18:55:08,597 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.935e+02 3.419e+02 4.511e+02 9.695e+02, threshold=6.839e+02, percent-clipped=12.0 2023-06-24 18:56:21,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1162242.0, ans=0.0 2023-06-24 18:56:28,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1162302.0, ans=0.2 2023-06-24 18:56:29,683 INFO [train.py:996] (1/4) Epoch 7, batch 10750, loss[loss=0.2623, simple_loss=0.3504, pruned_loss=0.08713, over 21392.00 frames. 
], tot_loss[loss=0.2235, simple_loss=0.3014, pruned_loss=0.07276, over 4254900.53 frames. ], batch size: 211, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:56:32,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1162302.0, ans=0.125 2023-06-24 18:56:40,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-24 18:56:48,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-24 18:56:51,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1162362.0, ans=0.0 2023-06-24 18:57:32,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=12.0 2023-06-24 18:58:03,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1162542.0, ans=12.0 2023-06-24 18:58:10,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1162542.0, ans=0.0 2023-06-24 18:58:16,027 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-24 18:58:21,621 INFO [train.py:996] (1/4) Epoch 7, batch 10800, loss[loss=0.3079, simple_loss=0.3685, pruned_loss=0.1237, over 21329.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3066, pruned_loss=0.07422, over 4261840.89 frames. ], batch size: 507, lr: 4.36e-03, grad_scale: 32.0 2023-06-24 18:59:05,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1162662.0, ans=0.015 2023-06-24 18:59:06,335 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 2.815e+02 3.156e+02 3.825e+02 7.344e+02, threshold=6.312e+02, percent-clipped=1.0 2023-06-24 18:59:49,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1162842.0, ans=0.2 2023-06-24 19:00:07,154 INFO [train.py:996] (1/4) Epoch 7, batch 10850, loss[loss=0.1983, simple_loss=0.2717, pruned_loss=0.06241, over 21623.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3046, pruned_loss=0.0739, over 4267684.01 frames. ], batch size: 415, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:00:07,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1162902.0, ans=0.125 2023-06-24 19:01:08,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1163022.0, ans=0.125 2023-06-24 19:01:08,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1163022.0, ans=0.125 2023-06-24 19:02:04,083 INFO [train.py:996] (1/4) Epoch 7, batch 10900, loss[loss=0.2761, simple_loss=0.3576, pruned_loss=0.09732, over 21400.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2971, pruned_loss=0.0714, over 4260415.65 frames. 
], batch size: 507, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:02:47,847 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.711e+02 3.083e+02 3.861e+02 1.043e+03, threshold=6.166e+02, percent-clipped=5.0 2023-06-24 19:02:58,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-24 19:03:19,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.12 vs. limit=10.0 2023-06-24 19:03:29,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1163442.0, ans=0.125 2023-06-24 19:03:38,338 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-24 19:03:53,385 INFO [train.py:996] (1/4) Epoch 7, batch 10950, loss[loss=0.2115, simple_loss=0.2949, pruned_loss=0.06401, over 19903.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2939, pruned_loss=0.06938, over 4255511.48 frames. ], batch size: 702, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:03:55,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1163502.0, ans=0.0 2023-06-24 19:04:31,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1163562.0, ans=0.125 2023-06-24 19:04:50,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1163622.0, ans=0.1 2023-06-24 19:05:27,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1163742.0, ans=0.125 2023-06-24 19:05:42,598 INFO [train.py:996] (1/4) Epoch 7, batch 11000, loss[loss=0.2132, simple_loss=0.2799, pruned_loss=0.07323, over 21804.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2912, pruned_loss=0.06961, over 4268012.20 frames. ], batch size: 282, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:05:50,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1163802.0, ans=0.0 2023-06-24 19:05:53,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1163802.0, ans=0.07 2023-06-24 19:05:56,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=22.5 2023-06-24 19:06:26,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.764e+02 3.110e+02 3.886e+02 6.584e+02, threshold=6.221e+02, percent-clipped=1.0 2023-06-24 19:06:29,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-06-24 19:06:53,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.38 vs. limit=15.0 2023-06-24 19:07:31,771 INFO [train.py:996] (1/4) Epoch 7, batch 11050, loss[loss=0.1983, simple_loss=0.2603, pruned_loss=0.06816, over 21423.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2884, pruned_loss=0.07075, over 4274929.91 frames. 
], batch size: 131, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:08:07,155 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-24 19:08:16,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1164222.0, ans=0.125 2023-06-24 19:08:54,824 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.85 vs. limit=15.0 2023-06-24 19:08:59,738 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-24 19:09:04,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1164342.0, ans=0.0 2023-06-24 19:09:17,949 INFO [train.py:996] (1/4) Epoch 7, batch 11100, loss[loss=0.2038, simple_loss=0.2885, pruned_loss=0.05956, over 21409.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2872, pruned_loss=0.0711, over 4267874.44 frames. ], batch size: 194, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:10:00,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.678e+02 3.103e+02 3.561e+02 5.692e+02, threshold=6.205e+02, percent-clipped=0.0 2023-06-24 19:10:35,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1164582.0, ans=0.0 2023-06-24 19:10:49,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-24 19:11:05,126 INFO [train.py:996] (1/4) Epoch 7, batch 11150, loss[loss=0.1988, simple_loss=0.2854, pruned_loss=0.05603, over 21323.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2855, pruned_loss=0.07077, over 4266887.60 frames. ], batch size: 176, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:11:21,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1164762.0, ans=0.0 2023-06-24 19:12:08,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.69 vs. limit=15.0 2023-06-24 19:12:20,242 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.84 vs. limit=10.0 2023-06-24 19:12:31,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1164942.0, ans=0.125 2023-06-24 19:12:52,247 INFO [train.py:996] (1/4) Epoch 7, batch 11200, loss[loss=0.2048, simple_loss=0.2715, pruned_loss=0.06904, over 21844.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2836, pruned_loss=0.06999, over 4264679.26 frames. ], batch size: 373, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:13:01,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-24 19:13:02,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1165002.0, ans=0.0 2023-06-24 19:13:02,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=12.0 2023-06-24 19:13:35,975 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.570e+02 2.865e+02 3.266e+02 5.455e+02, threshold=5.730e+02, percent-clipped=0.0 2023-06-24 19:13:43,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1165122.0, ans=0.125 2023-06-24 19:14:41,000 INFO [train.py:996] (1/4) Epoch 7, batch 11250, loss[loss=0.2351, simple_loss=0.3204, pruned_loss=0.07489, over 21863.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2838, pruned_loss=0.07035, over 4261795.11 frames. ], batch size: 124, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:14:41,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1165302.0, ans=0.125 2023-06-24 19:14:56,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1165302.0, ans=0.125 2023-06-24 19:15:38,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1165422.0, ans=0.025 2023-06-24 19:15:40,442 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:15:47,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1165482.0, ans=0.0 2023-06-24 19:16:26,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1165542.0, ans=0.0 2023-06-24 19:16:31,000 INFO [train.py:996] (1/4) Epoch 7, batch 11300, loss[loss=0.1889, simple_loss=0.2635, pruned_loss=0.05712, over 21579.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2852, pruned_loss=0.07096, over 4268462.80 frames. ], batch size: 195, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:17:11,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-24 19:17:13,932 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 2.821e+02 3.305e+02 4.579e+02 7.835e+02, threshold=6.611e+02, percent-clipped=6.0 2023-06-24 19:17:42,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1165782.0, ans=0.1 2023-06-24 19:17:51,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1165782.0, ans=0.2 2023-06-24 19:17:58,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1165842.0, ans=0.125 2023-06-24 19:18:19,906 INFO [train.py:996] (1/4) Epoch 7, batch 11350, loss[loss=0.2516, simple_loss=0.3291, pruned_loss=0.08705, over 21903.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2873, pruned_loss=0.0703, over 4265022.91 frames. 
], batch size: 372, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:18:22,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1165902.0, ans=0.1 2023-06-24 19:18:44,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1165962.0, ans=0.125 2023-06-24 19:19:34,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1166082.0, ans=0.1 2023-06-24 19:20:11,170 INFO [train.py:996] (1/4) Epoch 7, batch 11400, loss[loss=0.2581, simple_loss=0.3302, pruned_loss=0.09297, over 21340.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2928, pruned_loss=0.07268, over 4268361.79 frames. ], batch size: 549, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:20:39,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1166262.0, ans=0.2 2023-06-24 19:20:56,091 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.882e+02 3.810e+02 4.991e+02 7.494e+02, threshold=7.619e+02, percent-clipped=6.0 2023-06-24 19:21:43,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1166442.0, ans=0.0 2023-06-24 19:22:06,434 INFO [train.py:996] (1/4) Epoch 7, batch 11450, loss[loss=0.2008, simple_loss=0.2815, pruned_loss=0.06007, over 21695.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2932, pruned_loss=0.07181, over 4263454.82 frames. ], batch size: 247, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:22:37,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-24 19:23:27,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1166682.0, ans=0.2 2023-06-24 19:23:37,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-24 19:23:51,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1166742.0, ans=0.125 2023-06-24 19:23:59,143 INFO [train.py:996] (1/4) Epoch 7, batch 11500, loss[loss=0.1969, simple_loss=0.2845, pruned_loss=0.0546, over 21159.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2976, pruned_loss=0.07382, over 4266082.05 frames. ], batch size: 159, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:24:40,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1166862.0, ans=0.0 2023-06-24 19:24:44,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.827e+02 3.371e+02 4.045e+02 6.932e+02, threshold=6.743e+02, percent-clipped=0.0 2023-06-24 19:25:39,387 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.23 vs. limit=15.0 2023-06-24 19:25:56,873 INFO [train.py:996] (1/4) Epoch 7, batch 11550, loss[loss=0.2799, simple_loss=0.38, pruned_loss=0.08994, over 21752.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3015, pruned_loss=0.07302, over 4266185.08 frames. 
], batch size: 351, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:27:14,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1167282.0, ans=0.125 2023-06-24 19:27:48,859 INFO [train.py:996] (1/4) Epoch 7, batch 11600, loss[loss=0.2655, simple_loss=0.349, pruned_loss=0.09097, over 21321.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3158, pruned_loss=0.07515, over 4269902.76 frames. ], batch size: 143, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:28:34,723 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.839e+02 3.611e+02 4.809e+02 8.575e+02, threshold=7.221e+02, percent-clipped=4.0 2023-06-24 19:28:54,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1167522.0, ans=0.1 2023-06-24 19:28:59,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1167582.0, ans=0.1 2023-06-24 19:29:22,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1167642.0, ans=0.0 2023-06-24 19:29:42,816 INFO [train.py:996] (1/4) Epoch 7, batch 11650, loss[loss=0.2607, simple_loss=0.3306, pruned_loss=0.09538, over 21498.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3226, pruned_loss=0.07572, over 4257813.81 frames. ], batch size: 441, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:29:43,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1167702.0, ans=0.035 2023-06-24 19:30:42,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1167822.0, ans=0.125 2023-06-24 19:31:33,861 INFO [train.py:996] (1/4) Epoch 7, batch 11700, loss[loss=0.2059, simple_loss=0.2732, pruned_loss=0.06931, over 21985.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3131, pruned_loss=0.07461, over 4264031.47 frames. ], batch size: 119, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:31:36,597 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-24 19:31:37,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1168002.0, ans=0.125 2023-06-24 19:32:13,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1168122.0, ans=0.2 2023-06-24 19:32:16,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.666e+02 3.050e+02 3.571e+02 8.433e+02, threshold=6.100e+02, percent-clipped=2.0 2023-06-24 19:32:28,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168122.0, ans=0.1 2023-06-24 19:33:22,091 INFO [train.py:996] (1/4) Epoch 7, batch 11750, loss[loss=0.1939, simple_loss=0.2612, pruned_loss=0.06332, over 21781.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3044, pruned_loss=0.07407, over 4261918.01 frames. 
], batch size: 112, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:33:35,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1168302.0, ans=0.04949747468305833 2023-06-24 19:33:40,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168362.0, ans=0.1 2023-06-24 19:34:14,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=15.0 2023-06-24 19:34:43,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1168482.0, ans=0.125 2023-06-24 19:35:14,523 INFO [train.py:996] (1/4) Epoch 7, batch 11800, loss[loss=0.2373, simple_loss=0.3087, pruned_loss=0.08298, over 21220.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3058, pruned_loss=0.0756, over 4268541.48 frames. ], batch size: 143, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:35:26,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.44 vs. limit=12.0 2023-06-24 19:35:28,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1168602.0, ans=0.0 2023-06-24 19:35:45,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.88 vs. limit=22.5 2023-06-24 19:36:03,568 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.959e+02 3.685e+02 4.448e+02 7.783e+02, threshold=7.370e+02, percent-clipped=3.0 2023-06-24 19:36:39,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-24 19:37:00,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1168842.0, ans=0.125 2023-06-24 19:37:05,808 INFO [train.py:996] (1/4) Epoch 7, batch 11850, loss[loss=0.2075, simple_loss=0.2899, pruned_loss=0.06255, over 21482.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3065, pruned_loss=0.07426, over 4277102.72 frames. ], batch size: 211, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:37:21,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1168902.0, ans=0.125 2023-06-24 19:37:22,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1168902.0, ans=0.125 2023-06-24 19:38:05,772 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-24 19:38:09,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-24 19:39:02,917 INFO [train.py:996] (1/4) Epoch 7, batch 11900, loss[loss=0.1892, simple_loss=0.2784, pruned_loss=0.04995, over 21566.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3092, pruned_loss=0.07277, over 4271686.56 frames. 
], batch size: 263, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:39:03,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1169202.0, ans=0.02 2023-06-24 19:39:20,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.42 vs. limit=15.0 2023-06-24 19:39:44,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1169262.0, ans=0.125 2023-06-24 19:39:48,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1169322.0, ans=0.125 2023-06-24 19:39:51,155 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.709e+02 3.163e+02 3.879e+02 8.042e+02, threshold=6.325e+02, percent-clipped=1.0 2023-06-24 19:39:57,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1169322.0, ans=0.125 2023-06-24 19:39:58,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.02 vs. limit=8.0 2023-06-24 19:40:58,216 INFO [train.py:996] (1/4) Epoch 7, batch 11950, loss[loss=0.2077, simple_loss=0.3096, pruned_loss=0.05291, over 21677.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3111, pruned_loss=0.07032, over 4271268.99 frames. ], batch size: 414, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:41:40,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1169622.0, ans=0.125 2023-06-24 19:42:07,901 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:42:08,464 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-24 19:42:09,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1169682.0, ans=0.125 2023-06-24 19:42:40,542 INFO [train.py:996] (1/4) Epoch 7, batch 12000, loss[loss=0.1921, simple_loss=0.2676, pruned_loss=0.05832, over 21641.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3032, pruned_loss=0.0681, over 4279736.05 frames. ], batch size: 298, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:42:40,543 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 19:42:53,733 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.5520, 4.1181, 4.2508, 3.8117], device='cuda:1') 2023-06-24 19:43:01,772 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.261, simple_loss=0.3543, pruned_loss=0.08379, over 1796401.00 frames. 
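Note on the recurring "[optim.py:471] Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." entries in this log: they report gradient clipping against a threshold derived from recent gradient-norm statistics. The following is a minimal, hypothetical Python sketch of how a quartile-based threshold and a percent-clipped figure could be produced; the class name, history length, and the scale-times-median rule are illustrative assumptions for exposition, not the actual optim.py implementation used in this run.

import torch

class QuartileGradClipper:
    """Illustrative sketch: clip gradients against scale * median of recent grad norms."""

    def __init__(self, clipping_scale=2.0, history=128):
        self.clipping_scale = clipping_scale  # e.g. the logged Clipping_scale=2.0
        self.history = history                # assumed length of the norm buffer
        self.norms = []                       # recent total gradient norms
        self.num_clipped = 0
        self.num_steps = 0

    def __call__(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        # Total L2 norm of all gradients for this step.
        total_norm = torch.norm(
            torch.stack([p.grad.detach().norm(2) for p in params]), 2
        ).item()
        self.norms.append(total_norm)
        self.norms = self.norms[-self.history:]
        # Quartiles of the recent norms (min, 25%, median, 75%, max), as logged above.
        q = torch.tensor(self.norms).quantile(
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
        )
        threshold = self.clipping_scale * q[2].item()  # assumed rule: scale * median
        self.num_steps += 1
        if total_norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / total_norm)  # rescale gradients in place
        percent_clipped = 100.0 * self.num_clipped / self.num_steps
        return q.tolist(), threshold, percent_clipped

In use, such a clipper would be called on model.parameters() once per optimizer step, with its quartiles, threshold, and percent-clipped values written to the log at intervals, mirroring the entries in this file.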
2023-06-24 19:43:01,773 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 19:43:34,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1169862.0, ans=0.0 2023-06-24 19:43:36,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1169862.0, ans=0.125 2023-06-24 19:43:44,288 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.826e+02 3.232e+02 4.022e+02 5.951e+02, threshold=6.465e+02, percent-clipped=0.0 2023-06-24 19:44:33,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-24 19:44:56,187 INFO [train.py:996] (1/4) Epoch 7, batch 12050, loss[loss=0.2152, simple_loss=0.2811, pruned_loss=0.07467, over 21501.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2987, pruned_loss=0.07039, over 4288226.86 frames. ], batch size: 211, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:44:58,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1170102.0, ans=0.2 2023-06-24 19:45:12,415 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:45:25,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1170162.0, ans=0.0 2023-06-24 19:45:40,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.57 vs. limit=10.0 2023-06-24 19:45:48,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1170222.0, ans=0.0 2023-06-24 19:46:46,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1170402.0, ans=0.1 2023-06-24 19:46:48,328 INFO [train.py:996] (1/4) Epoch 7, batch 12100, loss[loss=0.1874, simple_loss=0.2355, pruned_loss=0.06967, over 20107.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3027, pruned_loss=0.07401, over 4281562.26 frames. ], batch size: 702, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:47:10,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1170462.0, ans=0.2 2023-06-24 19:47:33,914 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.018e+02 3.555e+02 4.988e+02 8.352e+02, threshold=7.110e+02, percent-clipped=5.0 2023-06-24 19:47:57,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1170582.0, ans=0.5 2023-06-24 19:48:07,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1170582.0, ans=0.125 2023-06-24 19:48:41,146 INFO [train.py:996] (1/4) Epoch 7, batch 12150, loss[loss=0.3128, simple_loss=0.3969, pruned_loss=0.1143, over 21455.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.307, pruned_loss=0.07406, over 4282448.61 frames. 
], batch size: 507, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:48:56,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1170702.0, ans=0.07 2023-06-24 19:48:59,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1170702.0, ans=0.2 2023-06-24 19:49:45,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1170822.0, ans=0.125 2023-06-24 19:49:46,049 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:50:17,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1170942.0, ans=0.125 2023-06-24 19:50:30,863 INFO [train.py:996] (1/4) Epoch 7, batch 12200, loss[loss=0.241, simple_loss=0.2817, pruned_loss=0.1001, over 21392.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3047, pruned_loss=0.073, over 4275105.01 frames. ], batch size: 508, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:51:25,422 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.032e+02 3.828e+02 4.856e+02 1.056e+03, threshold=7.657e+02, percent-clipped=7.0 2023-06-24 19:51:33,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-24 19:52:18,162 INFO [train.py:996] (1/4) Epoch 7, batch 12250, loss[loss=0.1642, simple_loss=0.2541, pruned_loss=0.03718, over 21767.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2965, pruned_loss=0.06948, over 4272505.58 frames. ], batch size: 371, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:52:42,160 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:53:36,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1171482.0, ans=0.95 2023-06-24 19:53:37,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=22.5 2023-06-24 19:54:06,972 INFO [train.py:996] (1/4) Epoch 7, batch 12300, loss[loss=0.1637, simple_loss=0.2415, pruned_loss=0.04296, over 21148.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2867, pruned_loss=0.06413, over 4261960.01 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:54:11,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-24 19:54:12,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1171602.0, ans=0.125 2023-06-24 19:54:49,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.00 vs. 
limit=15.0 2023-06-24 19:54:56,069 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.162e+02 2.543e+02 3.041e+02 6.823e+02, threshold=5.086e+02, percent-clipped=0.0 2023-06-24 19:54:58,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1171722.0, ans=0.1 2023-06-24 19:55:17,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1171782.0, ans=0.0 2023-06-24 19:55:20,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1171782.0, ans=0.2 2023-06-24 19:55:42,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=22.5 2023-06-24 19:55:51,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1171842.0, ans=0.0 2023-06-24 19:55:54,582 INFO [train.py:996] (1/4) Epoch 7, batch 12350, loss[loss=0.1902, simple_loss=0.2738, pruned_loss=0.05332, over 21287.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.291, pruned_loss=0.06518, over 4263469.16 frames. ], batch size: 176, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:56:00,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1171902.0, ans=0.1 2023-06-24 19:57:27,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-24 19:57:42,412 INFO [train.py:996] (1/4) Epoch 7, batch 12400, loss[loss=0.2398, simple_loss=0.3015, pruned_loss=0.08902, over 21553.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2941, pruned_loss=0.06831, over 4271776.08 frames. ], batch size: 548, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:58:06,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1172262.0, ans=0.05 2023-06-24 19:58:37,883 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.786e+02 3.157e+02 3.873e+02 7.298e+02, threshold=6.314e+02, percent-clipped=10.0 2023-06-24 19:59:04,575 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-24 19:59:26,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1172442.0, ans=0.0 2023-06-24 19:59:31,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1172502.0, ans=0.125 2023-06-24 19:59:33,101 INFO [train.py:996] (1/4) Epoch 7, batch 12450, loss[loss=0.3027, simple_loss=0.3554, pruned_loss=0.1251, over 21449.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2982, pruned_loss=0.07119, over 4276037.47 frames. ], batch size: 510, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:00:43,441 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.74 vs. limit=6.0 2023-06-24 20:01:04,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
limit=15.0 2023-06-24 20:01:14,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1172742.0, ans=0.1 2023-06-24 20:01:18,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1172742.0, ans=0.125 2023-06-24 20:01:25,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=15.0 2023-06-24 20:01:30,061 INFO [train.py:996] (1/4) Epoch 7, batch 12500, loss[loss=0.2658, simple_loss=0.376, pruned_loss=0.07776, over 21653.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3107, pruned_loss=0.0744, over 4273715.63 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:02:24,571 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.093e+02 3.470e+02 4.423e+02 7.018e+02, threshold=6.940e+02, percent-clipped=1.0 2023-06-24 20:03:09,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1173042.0, ans=0.125 2023-06-24 20:03:10,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.98 vs. limit=6.0 2023-06-24 20:03:12,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1173042.0, ans=0.125 2023-06-24 20:03:19,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1173042.0, ans=0.2 2023-06-24 20:03:31,035 INFO [train.py:996] (1/4) Epoch 7, batch 12550, loss[loss=0.2453, simple_loss=0.3277, pruned_loss=0.08143, over 21605.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3142, pruned_loss=0.07647, over 4280539.57 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:03:35,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1173102.0, ans=0.125 2023-06-24 20:04:37,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1173282.0, ans=0.125 2023-06-24 20:05:21,003 INFO [train.py:996] (1/4) Epoch 7, batch 12600, loss[loss=0.1973, simple_loss=0.2878, pruned_loss=0.05346, over 21827.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3133, pruned_loss=0.07474, over 4278641.03 frames. ], batch size: 333, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:06:01,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1173522.0, ans=0.125 2023-06-24 20:06:05,563 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.821e+02 3.460e+02 4.531e+02 8.641e+02, threshold=6.920e+02, percent-clipped=2.0 2023-06-24 20:06:26,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1173582.0, ans=0.2 2023-06-24 20:06:26,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1173582.0, ans=0.125 2023-06-24 20:06:46,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.20 vs. 
limit=15.0 2023-06-24 20:06:48,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1173642.0, ans=0.0 2023-06-24 20:07:13,630 INFO [train.py:996] (1/4) Epoch 7, batch 12650, loss[loss=0.2532, simple_loss=0.3095, pruned_loss=0.09844, over 21706.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3062, pruned_loss=0.07167, over 4271550.94 frames. ], batch size: 507, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:07:19,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1173702.0, ans=0.05 2023-06-24 20:07:39,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-06-24 20:08:16,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1173882.0, ans=0.0 2023-06-24 20:08:32,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1173942.0, ans=0.025 2023-06-24 20:09:02,170 INFO [train.py:996] (1/4) Epoch 7, batch 12700, loss[loss=0.2324, simple_loss=0.3185, pruned_loss=0.07319, over 21468.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3042, pruned_loss=0.07309, over 4270909.52 frames. ], batch size: 131, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:09:11,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1174002.0, ans=0.125 2023-06-24 20:09:47,810 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 2.796e+02 3.277e+02 3.938e+02 5.852e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-24 20:09:57,436 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-24 20:10:20,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1174182.0, ans=0.2 2023-06-24 20:10:50,766 INFO [train.py:996] (1/4) Epoch 7, batch 12750, loss[loss=0.2226, simple_loss=0.3103, pruned_loss=0.06741, over 21691.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3059, pruned_loss=0.07383, over 4270027.98 frames. ], batch size: 351, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:11:24,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1174362.0, ans=0.125 2023-06-24 20:11:32,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1174422.0, ans=0.0 2023-06-24 20:11:42,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-24 20:12:03,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.01 vs. limit=22.5 2023-06-24 20:12:23,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1174542.0, ans=0.125 2023-06-24 20:12:39,035 INFO [train.py:996] (1/4) Epoch 7, batch 12800, loss[loss=0.2183, simple_loss=0.2983, pruned_loss=0.06913, over 21639.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3059, pruned_loss=0.0743, over 4278595.90 frames. 
], batch size: 230, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:13:02,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1174662.0, ans=0.05 2023-06-24 20:13:20,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1174722.0, ans=0.1 2023-06-24 20:13:25,311 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.978e+02 3.549e+02 4.677e+02 8.571e+02, threshold=7.098e+02, percent-clipped=5.0 2023-06-24 20:13:35,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1174722.0, ans=0.0 2023-06-24 20:13:43,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1174782.0, ans=0.125 2023-06-24 20:13:58,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1174782.0, ans=0.125 2023-06-24 20:14:25,240 INFO [train.py:996] (1/4) Epoch 7, batch 12850, loss[loss=0.2357, simple_loss=0.3369, pruned_loss=0.06727, over 19901.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3084, pruned_loss=0.07609, over 4276575.62 frames. ], batch size: 704, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:14:46,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1174962.0, ans=0.125 2023-06-24 20:15:12,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1175022.0, ans=0.125 2023-06-24 20:15:20,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1175022.0, ans=0.0 2023-06-24 20:15:42,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1175082.0, ans=0.2 2023-06-24 20:16:16,201 INFO [train.py:996] (1/4) Epoch 7, batch 12900, loss[loss=0.2837, simple_loss=0.3513, pruned_loss=0.108, over 21471.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3059, pruned_loss=0.07357, over 4278870.47 frames. ], batch size: 471, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:17:14,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.556e+02 2.922e+02 3.625e+02 8.221e+02, threshold=5.845e+02, percent-clipped=4.0 2023-06-24 20:17:18,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1175322.0, ans=0.125 2023-06-24 20:17:50,433 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:18:05,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=22.5 2023-06-24 20:18:05,565 INFO [train.py:996] (1/4) Epoch 7, batch 12950, loss[loss=0.2443, simple_loss=0.3157, pruned_loss=0.0864, over 21483.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3036, pruned_loss=0.07156, over 4276641.69 frames. ], batch size: 211, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:19:45,545 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:19:53,388 INFO [train.py:996] (1/4) Epoch 7, batch 13000, loss[loss=0.2159, simple_loss=0.2975, pruned_loss=0.06715, over 21787.00 frames. 
], tot_loss[loss=0.2231, simple_loss=0.3032, pruned_loss=0.07155, over 4265127.73 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:19:56,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-24 20:20:50,813 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.748e+02 3.242e+02 4.275e+02 7.846e+02, threshold=6.485e+02, percent-clipped=8.0 2023-06-24 20:21:12,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1175982.0, ans=0.125 2023-06-24 20:21:43,442 INFO [train.py:996] (1/4) Epoch 7, batch 13050, loss[loss=0.2299, simple_loss=0.3011, pruned_loss=0.07938, over 21405.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3004, pruned_loss=0.06955, over 4261751.91 frames. ], batch size: 159, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:22:13,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1176162.0, ans=0.125 2023-06-24 20:22:49,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-24 20:23:05,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.72 vs. limit=15.0 2023-06-24 20:23:15,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1176342.0, ans=0.0 2023-06-24 20:23:22,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1176342.0, ans=0.125 2023-06-24 20:23:32,684 INFO [train.py:996] (1/4) Epoch 7, batch 13100, loss[loss=0.2315, simple_loss=0.3168, pruned_loss=0.07307, over 21755.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3018, pruned_loss=0.06968, over 4272445.04 frames. ], batch size: 332, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:24:00,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1176462.0, ans=0.125 2023-06-24 20:24:30,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1176522.0, ans=0.0 2023-06-24 20:24:31,401 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.745e+02 3.057e+02 3.676e+02 6.184e+02, threshold=6.113e+02, percent-clipped=0.0 2023-06-24 20:24:43,113 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.02 vs. limit=15.0 2023-06-24 20:25:33,936 INFO [train.py:996] (1/4) Epoch 7, batch 13150, loss[loss=0.1892, simple_loss=0.263, pruned_loss=0.05772, over 21598.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3036, pruned_loss=0.07155, over 4271969.00 frames. 
], batch size: 263, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:25:55,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1176762.0, ans=0.125 2023-06-24 20:26:22,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1176822.0, ans=0.2 2023-06-24 20:26:36,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1176882.0, ans=0.125 2023-06-24 20:27:28,452 INFO [train.py:996] (1/4) Epoch 7, batch 13200, loss[loss=0.2351, simple_loss=0.2844, pruned_loss=0.0929, over 20021.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3022, pruned_loss=0.07207, over 4271115.55 frames. ], batch size: 702, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:27:31,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-24 20:27:35,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.27 vs. limit=15.0 2023-06-24 20:28:17,663 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 2.990e+02 3.679e+02 4.765e+02 8.248e+02, threshold=7.359e+02, percent-clipped=11.0 2023-06-24 20:28:23,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1177122.0, ans=0.0 2023-06-24 20:28:41,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1177182.0, ans=0.125 2023-06-24 20:29:18,291 INFO [train.py:996] (1/4) Epoch 7, batch 13250, loss[loss=0.2151, simple_loss=0.2931, pruned_loss=0.06852, over 21260.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3011, pruned_loss=0.07342, over 4275235.65 frames. ], batch size: 176, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:29:42,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1177362.0, ans=0.1 2023-06-24 20:30:46,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1177482.0, ans=0.025 2023-06-24 20:31:09,724 INFO [train.py:996] (1/4) Epoch 7, batch 13300, loss[loss=0.2575, simple_loss=0.3417, pruned_loss=0.08664, over 21653.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3038, pruned_loss=0.07291, over 4279761.93 frames. 
], batch size: 389, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:31:13,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1177602.0, ans=0.125 2023-06-24 20:31:17,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1177602.0, ans=0.1 2023-06-24 20:31:24,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1177602.0, ans=0.125 2023-06-24 20:32:10,684 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.867e+02 3.500e+02 4.353e+02 7.353e+02, threshold=7.001e+02, percent-clipped=0.0 2023-06-24 20:32:29,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1177782.0, ans=0.125 2023-06-24 20:33:00,259 INFO [train.py:996] (1/4) Epoch 7, batch 13350, loss[loss=0.2368, simple_loss=0.3214, pruned_loss=0.07615, over 21740.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3087, pruned_loss=0.07516, over 4281079.03 frames. ], batch size: 247, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:33:36,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-24 20:34:00,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1178022.0, ans=0.2 2023-06-24 20:34:16,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1178082.0, ans=0.05 2023-06-24 20:34:30,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1178142.0, ans=0.0 2023-06-24 20:34:48,860 INFO [train.py:996] (1/4) Epoch 7, batch 13400, loss[loss=0.2906, simple_loss=0.3482, pruned_loss=0.1164, over 21491.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3105, pruned_loss=0.07709, over 4279509.78 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:34:49,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1178202.0, ans=0.1 2023-06-24 20:35:02,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=12.0 2023-06-24 20:35:27,195 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:35:54,569 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.870e+02 3.236e+02 3.893e+02 7.079e+02, threshold=6.472e+02, percent-clipped=1.0 2023-06-24 20:36:01,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1178322.0, ans=0.0 2023-06-24 20:36:09,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1178382.0, ans=0.1 2023-06-24 20:36:18,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.25 vs. 
limit=15.0 2023-06-24 20:36:42,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1178502.0, ans=0.025 2023-06-24 20:36:43,532 INFO [train.py:996] (1/4) Epoch 7, batch 13450, loss[loss=0.2772, simple_loss=0.3468, pruned_loss=0.1038, over 21303.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3107, pruned_loss=0.07853, over 4278024.54 frames. ], batch size: 143, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:36:43,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1178502.0, ans=0.1 2023-06-24 20:38:03,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1178682.0, ans=0.125 2023-06-24 20:38:31,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1178802.0, ans=0.2 2023-06-24 20:38:33,324 INFO [train.py:996] (1/4) Epoch 7, batch 13500, loss[loss=0.2533, simple_loss=0.3288, pruned_loss=0.0889, over 21643.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2996, pruned_loss=0.0757, over 4271718.88 frames. ], batch size: 441, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:38:39,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-24 20:39:35,907 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 3.362e+02 3.847e+02 4.790e+02 7.815e+02, threshold=7.695e+02, percent-clipped=4.0 2023-06-24 20:40:13,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-24 20:40:13,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-24 20:40:30,479 INFO [train.py:996] (1/4) Epoch 7, batch 13550, loss[loss=0.3109, simple_loss=0.4037, pruned_loss=0.1091, over 21564.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3039, pruned_loss=0.07523, over 4272271.33 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:42:19,505 INFO [train.py:996] (1/4) Epoch 7, batch 13600, loss[loss=0.1987, simple_loss=0.2846, pruned_loss=0.05645, over 21802.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3052, pruned_loss=0.07605, over 4278116.59 frames. 
], batch size: 298, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:42:20,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1179402.0, ans=0.2 2023-06-24 20:42:36,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1179402.0, ans=0.125 2023-06-24 20:42:50,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1179462.0, ans=0.0 2023-06-24 20:42:59,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1179462.0, ans=0.125 2023-06-24 20:43:13,906 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.752e+02 3.319e+02 4.170e+02 8.424e+02, threshold=6.637e+02, percent-clipped=2.0 2023-06-24 20:43:21,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1179582.0, ans=0.1 2023-06-24 20:43:44,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1179642.0, ans=0.125 2023-06-24 20:43:48,337 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-24 20:44:10,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1179642.0, ans=0.125 2023-06-24 20:44:13,939 INFO [train.py:996] (1/4) Epoch 7, batch 13650, loss[loss=0.226, simple_loss=0.2867, pruned_loss=0.08263, over 21319.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.302, pruned_loss=0.07335, over 4277655.18 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:44:29,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1179702.0, ans=0.2 2023-06-24 20:44:33,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1179702.0, ans=0.0 2023-06-24 20:44:41,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1179762.0, ans=0.2 2023-06-24 20:45:14,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1179882.0, ans=10.0 2023-06-24 20:45:38,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-24 20:45:49,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1179942.0, ans=0.0 2023-06-24 20:46:02,770 INFO [train.py:996] (1/4) Epoch 7, batch 13700, loss[loss=0.3292, simple_loss=0.3861, pruned_loss=0.1362, over 21517.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2963, pruned_loss=0.07274, over 4278178.36 frames. 
], batch size: 508, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:46:06,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1180002.0, ans=0.0 2023-06-24 20:46:53,960 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.933e+02 3.408e+02 4.386e+02 8.480e+02, threshold=6.816e+02, percent-clipped=3.0 2023-06-24 20:47:12,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1180182.0, ans=0.125 2023-06-24 20:47:58,388 INFO [train.py:996] (1/4) Epoch 7, batch 13750, loss[loss=0.1876, simple_loss=0.2567, pruned_loss=0.05926, over 21240.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2953, pruned_loss=0.07213, over 4268007.31 frames. ], batch size: 176, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:48:04,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1180302.0, ans=0.125 2023-06-24 20:48:13,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1180302.0, ans=0.05 2023-06-24 20:49:43,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-24 20:49:46,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0 2023-06-24 20:49:50,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1180602.0, ans=0.0 2023-06-24 20:49:51,683 INFO [train.py:996] (1/4) Epoch 7, batch 13800, loss[loss=0.2623, simple_loss=0.3659, pruned_loss=0.07934, over 21655.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3001, pruned_loss=0.0711, over 4258956.09 frames. ], batch size: 414, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:50:07,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.58 vs. limit=10.0 2023-06-24 20:50:16,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-24 20:50:55,000 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.819e+02 3.661e+02 5.277e+02 1.106e+03, threshold=7.321e+02, percent-clipped=8.0 2023-06-24 20:51:37,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1180842.0, ans=0.2 2023-06-24 20:51:42,291 INFO [train.py:996] (1/4) Epoch 7, batch 13850, loss[loss=0.2795, simple_loss=0.3643, pruned_loss=0.09734, over 21673.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3062, pruned_loss=0.07173, over 4269787.14 frames. 
], batch size: 414, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:51:49,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1180902.0, ans=0.0 2023-06-24 20:51:55,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1180902.0, ans=0.015 2023-06-24 20:53:09,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1181082.0, ans=0.2 2023-06-24 20:53:33,252 INFO [train.py:996] (1/4) Epoch 7, batch 13900, loss[loss=0.2537, simple_loss=0.3161, pruned_loss=0.09566, over 21773.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3109, pruned_loss=0.0758, over 4277682.07 frames. ], batch size: 441, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:53:56,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1181262.0, ans=0.125 2023-06-24 20:54:34,858 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.148e+02 3.792e+02 4.891e+02 9.530e+02, threshold=7.583e+02, percent-clipped=4.0 2023-06-24 20:55:17,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1181442.0, ans=0.125 2023-06-24 20:55:21,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1181502.0, ans=0.0 2023-06-24 20:55:22,186 INFO [train.py:996] (1/4) Epoch 7, batch 13950, loss[loss=0.2159, simple_loss=0.2886, pruned_loss=0.07163, over 21781.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.31, pruned_loss=0.07675, over 4280640.38 frames. ], batch size: 247, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:55:38,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181502.0, ans=0.1 2023-06-24 20:55:50,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-24 20:56:05,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1181562.0, ans=0.2 2023-06-24 20:57:09,134 INFO [train.py:996] (1/4) Epoch 7, batch 14000, loss[loss=0.1715, simple_loss=0.2409, pruned_loss=0.051, over 21304.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3085, pruned_loss=0.07575, over 4278095.38 frames. ], batch size: 159, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 20:57:39,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-24 20:57:43,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1181862.0, ans=0.125 2023-06-24 20:58:14,895 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.939e+02 3.299e+02 3.866e+02 1.368e+03, threshold=6.598e+02, percent-clipped=4.0 2023-06-24 20:58:56,591 INFO [train.py:996] (1/4) Epoch 7, batch 14050, loss[loss=0.2058, simple_loss=0.2673, pruned_loss=0.07208, over 14790.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3023, pruned_loss=0.07218, over 4262334.37 frames. 
], batch size: 60, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 20:58:56,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1182102.0, ans=0.125 2023-06-24 20:59:26,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1182162.0, ans=0.1 2023-06-24 20:59:28,459 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:00:01,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1182222.0, ans=0.015 2023-06-24 21:00:01,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1182222.0, ans=0.0 2023-06-24 21:00:43,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1182402.0, ans=0.1 2023-06-24 21:00:44,819 INFO [train.py:996] (1/4) Epoch 7, batch 14100, loss[loss=0.1985, simple_loss=0.2646, pruned_loss=0.06624, over 21709.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2957, pruned_loss=0.07133, over 4254539.97 frames. ], batch size: 282, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 21:01:31,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1182462.0, ans=0.125 2023-06-24 21:01:52,072 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.658e+02 3.185e+02 3.657e+02 7.559e+02, threshold=6.369e+02, percent-clipped=1.0 2023-06-24 21:02:13,199 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-24 21:02:29,764 INFO [train.py:996] (1/4) Epoch 7, batch 14150, loss[loss=0.2139, simple_loss=0.2985, pruned_loss=0.06464, over 21845.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3002, pruned_loss=0.07285, over 4254541.34 frames. ], batch size: 102, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:03:02,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1182762.0, ans=0.2 2023-06-24 21:03:37,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1182882.0, ans=0.125 2023-06-24 21:04:02,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-24 21:04:10,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1182942.0, ans=0.125 2023-06-24 21:04:14,272 INFO [train.py:996] (1/4) Epoch 7, batch 14200, loss[loss=0.2038, simple_loss=0.2718, pruned_loss=0.06785, over 21574.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2985, pruned_loss=0.07125, over 4265507.35 frames. 
], batch size: 263, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:04:33,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1183062.0, ans=0.125 2023-06-24 21:05:17,833 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 2.638e+02 3.052e+02 3.885e+02 7.622e+02, threshold=6.105e+02, percent-clipped=2.0 2023-06-24 21:05:47,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1183242.0, ans=0.125 2023-06-24 21:06:03,313 INFO [train.py:996] (1/4) Epoch 7, batch 14250, loss[loss=0.22, simple_loss=0.2866, pruned_loss=0.0767, over 21265.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2923, pruned_loss=0.0708, over 4257287.41 frames. ], batch size: 143, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:06:31,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1183362.0, ans=0.1 2023-06-24 21:07:04,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1183422.0, ans=0.125 2023-06-24 21:07:20,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=15.0 2023-06-24 21:07:52,512 INFO [train.py:996] (1/4) Epoch 7, batch 14300, loss[loss=0.3194, simple_loss=0.4012, pruned_loss=0.1188, over 21596.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2931, pruned_loss=0.0706, over 4253017.57 frames. ], batch size: 441, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:08:01,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1183602.0, ans=0.0 2023-06-24 21:08:56,719 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.824e+02 3.398e+02 4.914e+02 1.429e+03, threshold=6.796e+02, percent-clipped=17.0 2023-06-24 21:09:40,480 INFO [train.py:996] (1/4) Epoch 7, batch 14350, loss[loss=0.2878, simple_loss=0.3679, pruned_loss=0.1039, over 21521.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2998, pruned_loss=0.07186, over 4260338.03 frames. ], batch size: 507, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:11:08,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1184082.0, ans=0.125 2023-06-24 21:11:09,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1184142.0, ans=0.125 2023-06-24 21:11:15,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1184142.0, ans=0.125 2023-06-24 21:11:28,478 INFO [train.py:996] (1/4) Epoch 7, batch 14400, loss[loss=0.2193, simple_loss=0.2915, pruned_loss=0.07358, over 21727.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2967, pruned_loss=0.07166, over 4270767.18 frames. 
], batch size: 124, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:11:41,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1184202.0, ans=0.0 2023-06-24 21:12:32,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.835e+02 3.374e+02 4.163e+02 7.231e+02, threshold=6.749e+02, percent-clipped=2.0 2023-06-24 21:12:33,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1184322.0, ans=0.2 2023-06-24 21:13:09,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1184442.0, ans=0.0 2023-06-24 21:13:14,852 INFO [train.py:996] (1/4) Epoch 7, batch 14450, loss[loss=0.204, simple_loss=0.2677, pruned_loss=0.07018, over 21759.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2913, pruned_loss=0.07205, over 4262600.55 frames. ], batch size: 333, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:13:16,811 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:13:43,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1184562.0, ans=0.1 2023-06-24 21:14:27,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1184682.0, ans=0.2 2023-06-24 21:14:49,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1184742.0, ans=0.0 2023-06-24 21:15:03,195 INFO [train.py:996] (1/4) Epoch 7, batch 14500, loss[loss=0.2319, simple_loss=0.2833, pruned_loss=0.09029, over 21243.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2901, pruned_loss=0.07197, over 4263605.58 frames. ], batch size: 471, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:15:24,363 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:16:08,440 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.747e+02 3.186e+02 4.190e+02 7.871e+02, threshold=6.373e+02, percent-clipped=3.0 2023-06-24 21:16:47,448 INFO [train.py:996] (1/4) Epoch 7, batch 14550, loss[loss=0.2565, simple_loss=0.3337, pruned_loss=0.08959, over 21613.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.296, pruned_loss=0.07411, over 4265061.52 frames. ], batch size: 389, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:16:51,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1185102.0, ans=0.0 2023-06-24 21:17:45,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1185222.0, ans=0.1 2023-06-24 21:18:17,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1185282.0, ans=0.125 2023-06-24 21:18:37,544 INFO [train.py:996] (1/4) Epoch 7, batch 14600, loss[loss=0.2818, simple_loss=0.3433, pruned_loss=0.1102, over 21803.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3039, pruned_loss=0.07777, over 4269082.46 frames. 
], batch size: 441, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:18:43,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1185402.0, ans=0.1 2023-06-24 21:19:25,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1185462.0, ans=0.125 2023-06-24 21:19:41,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1185522.0, ans=0.0 2023-06-24 21:19:42,434 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.108e+02 3.903e+02 5.552e+02 1.166e+03, threshold=7.806e+02, percent-clipped=17.0 2023-06-24 21:20:20,970 INFO [train.py:996] (1/4) Epoch 7, batch 14650, loss[loss=0.1986, simple_loss=0.2931, pruned_loss=0.05201, over 21747.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3058, pruned_loss=0.07682, over 4271719.07 frames. ], batch size: 332, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:20:35,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1185702.0, ans=0.04949747468305833 2023-06-24 21:21:07,021 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:21:38,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1185882.0, ans=0.125 2023-06-24 21:22:00,305 INFO [train.py:996] (1/4) Epoch 7, batch 14700, loss[loss=0.2372, simple_loss=0.3307, pruned_loss=0.07184, over 21692.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3, pruned_loss=0.07169, over 4261632.28 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:22:15,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1186002.0, ans=0.0 2023-06-24 21:22:43,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1186062.0, ans=0.015 2023-06-24 21:22:44,514 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-24 21:23:06,188 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.369e+02 2.874e+02 3.417e+02 6.463e+02, threshold=5.748e+02, percent-clipped=0.0 2023-06-24 21:23:09,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=22.5 2023-06-24 21:23:20,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1186182.0, ans=0.125 2023-06-24 21:23:32,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1186242.0, ans=0.015 2023-06-24 21:23:51,806 INFO [train.py:996] (1/4) Epoch 7, batch 14750, loss[loss=0.2584, simple_loss=0.3269, pruned_loss=0.09494, over 21266.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.305, pruned_loss=0.07404, over 4265000.28 frames. 
], batch size: 159, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:24:52,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1186422.0, ans=0.125 2023-06-24 21:24:54,336 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:25:38,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1186542.0, ans=0.05 2023-06-24 21:25:48,307 INFO [train.py:996] (1/4) Epoch 7, batch 14800, loss[loss=0.2193, simple_loss=0.287, pruned_loss=0.0758, over 21366.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3168, pruned_loss=0.07911, over 4263224.33 frames. ], batch size: 211, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:25:49,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1186602.0, ans=0.2 2023-06-24 21:26:14,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1186662.0, ans=0.0 2023-06-24 21:26:19,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1186662.0, ans=0.1 2023-06-24 21:26:44,092 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.322e+02 4.309e+02 5.612e+02 1.041e+03, threshold=8.619e+02, percent-clipped=22.0 2023-06-24 21:27:39,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1186842.0, ans=0.2 2023-06-24 21:27:44,130 INFO [train.py:996] (1/4) Epoch 7, batch 14850, loss[loss=0.2228, simple_loss=0.2795, pruned_loss=0.08305, over 21445.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3103, pruned_loss=0.07833, over 4249289.50 frames. ], batch size: 441, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:28:01,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1186962.0, ans=0.2 2023-06-24 21:28:06,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1186962.0, ans=0.125 2023-06-24 21:29:30,062 INFO [train.py:996] (1/4) Epoch 7, batch 14900, loss[loss=0.2542, simple_loss=0.3271, pruned_loss=0.09066, over 21617.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3139, pruned_loss=0.08051, over 4251234.06 frames. ], batch size: 389, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:29:41,504 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.64 vs. limit=6.0 2023-06-24 21:29:57,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.24 vs. limit=22.5 2023-06-24 21:30:16,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1187322.0, ans=0.2 2023-06-24 21:30:18,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1187322.0, ans=0.2 2023-06-24 21:30:36,821 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.164e+02 3.884e+02 4.869e+02 8.267e+02, threshold=7.767e+02, percent-clipped=0.0 2023-06-24 21:31:20,157 INFO [train.py:996] (1/4) Epoch 7, batch 14950, loss[loss=0.2373, simple_loss=0.3184, pruned_loss=0.07806, over 21437.00 frames. 
], tot_loss[loss=0.2361, simple_loss=0.3134, pruned_loss=0.07936, over 4258236.66 frames. ], batch size: 211, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:31:23,259 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-24 21:31:36,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1187562.0, ans=0.0 2023-06-24 21:32:46,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1187682.0, ans=0.05 2023-06-24 21:32:56,242 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. limit=6.0 2023-06-24 21:33:09,356 INFO [train.py:996] (1/4) Epoch 7, batch 15000, loss[loss=0.2398, simple_loss=0.3101, pruned_loss=0.0847, over 15360.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3151, pruned_loss=0.08079, over 4259721.86 frames. ], batch size: 60, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:33:09,357 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 21:33:26,462 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2547, simple_loss=0.3504, pruned_loss=0.07951, over 1796401.00 frames. 2023-06-24 21:33:26,463 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 21:33:44,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1187802.0, ans=0.125 2023-06-24 21:34:39,131 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.767e+02 3.159e+02 3.696e+02 5.819e+02, threshold=6.318e+02, percent-clipped=0.0 2023-06-24 21:34:50,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1187982.0, ans=0.2 2023-06-24 21:34:52,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1187982.0, ans=0.1 2023-06-24 21:35:07,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1188042.0, ans=0.2 2023-06-24 21:35:17,607 INFO [train.py:996] (1/4) Epoch 7, batch 15050, loss[loss=0.2698, simple_loss=0.3588, pruned_loss=0.09044, over 21677.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3139, pruned_loss=0.08072, over 4266196.26 frames. ], batch size: 441, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:36:30,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1188282.0, ans=0.2 2023-06-24 21:36:57,701 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.48 vs. limit=10.0 2023-06-24 21:37:00,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1188402.0, ans=0.05 2023-06-24 21:37:07,752 INFO [train.py:996] (1/4) Epoch 7, batch 15100, loss[loss=0.2524, simple_loss=0.3278, pruned_loss=0.08844, over 21616.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3137, pruned_loss=0.07977, over 4263255.21 frames. 
], batch size: 389, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:37:44,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1188462.0, ans=0.0 2023-06-24 21:37:46,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1188462.0, ans=0.125 2023-06-24 21:37:56,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1188522.0, ans=0.125 2023-06-24 21:38:13,585 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.973e+02 3.589e+02 4.717e+02 7.835e+02, threshold=7.177e+02, percent-clipped=5.0 2023-06-24 21:38:16,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188582.0, ans=0.1 2023-06-24 21:38:26,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1188582.0, ans=0.125 2023-06-24 21:38:55,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1188642.0, ans=0.5 2023-06-24 21:39:00,096 INFO [train.py:996] (1/4) Epoch 7, batch 15150, loss[loss=0.2483, simple_loss=0.2917, pruned_loss=0.1025, over 21223.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3096, pruned_loss=0.08032, over 4262244.20 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:39:27,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1188762.0, ans=15.0 2023-06-24 21:40:43,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188942.0, ans=0.1 2023-06-24 21:40:49,636 INFO [train.py:996] (1/4) Epoch 7, batch 15200, loss[loss=0.1954, simple_loss=0.271, pruned_loss=0.05988, over 21392.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3014, pruned_loss=0.07615, over 4257664.37 frames. ], batch size: 194, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:41:27,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1189062.0, ans=0.125 2023-06-24 21:41:51,474 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.555e+02 2.882e+02 3.442e+02 5.882e+02, threshold=5.763e+02, percent-clipped=0.0 2023-06-24 21:41:59,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1189182.0, ans=0.1 2023-06-24 21:42:27,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1189242.0, ans=0.0 2023-06-24 21:42:32,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1189242.0, ans=0.0 2023-06-24 21:42:41,591 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-24 21:42:49,800 INFO [train.py:996] (1/4) Epoch 7, batch 15250, loss[loss=0.2051, simple_loss=0.2724, pruned_loss=0.0689, over 21830.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2953, pruned_loss=0.07457, over 4262320.06 frames. 
], batch size: 317, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:43:15,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1189362.0, ans=0.2 2023-06-24 21:43:16,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1189362.0, ans=0.125 2023-06-24 21:44:38,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1189602.0, ans=0.1 2023-06-24 21:44:40,023 INFO [train.py:996] (1/4) Epoch 7, batch 15300, loss[loss=0.2522, simple_loss=0.3254, pruned_loss=0.08951, over 21818.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2999, pruned_loss=0.07636, over 4261294.91 frames. ], batch size: 124, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:44:54,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1189602.0, ans=0.1 2023-06-24 21:45:37,482 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.236e+02 3.827e+02 4.813e+02 8.149e+02, threshold=7.653e+02, percent-clipped=14.0 2023-06-24 21:45:46,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1189782.0, ans=0.1 2023-06-24 21:45:54,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1189782.0, ans=0.125 2023-06-24 21:45:54,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1189782.0, ans=0.125 2023-06-24 21:46:13,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1189842.0, ans=0.0 2023-06-24 21:46:27,869 INFO [train.py:996] (1/4) Epoch 7, batch 15350, loss[loss=0.2253, simple_loss=0.3231, pruned_loss=0.0637, over 21881.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3044, pruned_loss=0.07908, over 4264247.08 frames. ], batch size: 316, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:46:40,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1189902.0, ans=0.0 2023-06-24 21:47:07,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1190022.0, ans=0.04949747468305833 2023-06-24 21:48:14,152 INFO [train.py:996] (1/4) Epoch 7, batch 15400, loss[loss=0.2373, simple_loss=0.3129, pruned_loss=0.08085, over 21243.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3056, pruned_loss=0.07804, over 4273696.53 frames. 
], batch size: 143, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:48:15,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1190202.0, ans=15.0 2023-06-24 21:48:33,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1190262.0, ans=0.125 2023-06-24 21:48:55,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1190322.0, ans=0.025 2023-06-24 21:49:00,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1190322.0, ans=0.125 2023-06-24 21:49:01,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=22.5 2023-06-24 21:49:01,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0 2023-06-24 21:49:05,373 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.624e+02 3.015e+02 3.662e+02 6.507e+02, threshold=6.030e+02, percent-clipped=0.0 2023-06-24 21:49:05,973 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:49:09,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-06-24 21:50:02,507 INFO [train.py:996] (1/4) Epoch 7, batch 15450, loss[loss=0.211, simple_loss=0.2902, pruned_loss=0.06594, over 21477.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3025, pruned_loss=0.07658, over 4272721.00 frames. ], batch size: 548, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:50:17,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.80 vs. limit=22.5 2023-06-24 21:50:36,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1190622.0, ans=0.125 2023-06-24 21:51:11,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1190682.0, ans=0.0 2023-06-24 21:51:43,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1190742.0, ans=0.125 2023-06-24 21:51:52,856 INFO [train.py:996] (1/4) Epoch 7, batch 15500, loss[loss=0.2471, simple_loss=0.3159, pruned_loss=0.08913, over 21469.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3065, pruned_loss=0.0765, over 4269018.40 frames. 
], batch size: 211, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:51:56,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1190802.0, ans=0.1 2023-06-24 21:52:02,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1190802.0, ans=0.125 2023-06-24 21:52:23,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1190862.0, ans=0.02 2023-06-24 21:52:41,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1190922.0, ans=0.0 2023-06-24 21:52:51,802 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.878e+02 3.263e+02 4.056e+02 7.756e+02, threshold=6.526e+02, percent-clipped=2.0 2023-06-24 21:53:37,043 INFO [train.py:996] (1/4) Epoch 7, batch 15550, loss[loss=0.1761, simple_loss=0.2637, pruned_loss=0.04418, over 21597.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3042, pruned_loss=0.07385, over 4273004.45 frames. ], batch size: 263, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:53:48,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1191102.0, ans=0.125 2023-06-24 21:54:00,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1191162.0, ans=0.125 2023-06-24 21:54:05,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1191162.0, ans=0.125 2023-06-24 21:54:31,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1191222.0, ans=0.125 2023-06-24 21:55:14,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-24 21:55:20,516 INFO [train.py:996] (1/4) Epoch 7, batch 15600, loss[loss=0.2669, simple_loss=0.3995, pruned_loss=0.06713, over 19849.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.299, pruned_loss=0.07214, over 4254464.73 frames. ], batch size: 702, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:55:29,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-24 21:56:01,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1191522.0, ans=0.2 2023-06-24 21:56:23,745 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.726e+02 3.210e+02 4.134e+02 7.598e+02, threshold=6.420e+02, percent-clipped=3.0 2023-06-24 21:56:59,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1191642.0, ans=0.1 2023-06-24 21:57:08,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1191702.0, ans=0.125 2023-06-24 21:57:09,397 INFO [train.py:996] (1/4) Epoch 7, batch 15650, loss[loss=0.2138, simple_loss=0.2871, pruned_loss=0.07025, over 21363.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3004, pruned_loss=0.07224, over 4250923.66 frames. 
], batch size: 131, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 21:57:11,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1191702.0, ans=0.0 2023-06-24 21:58:08,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=12.0 2023-06-24 21:58:24,183 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-24 21:58:38,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1191942.0, ans=0.125 2023-06-24 21:58:41,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1191942.0, ans=0.125 2023-06-24 21:58:57,014 INFO [train.py:996] (1/4) Epoch 7, batch 15700, loss[loss=0.1964, simple_loss=0.2871, pruned_loss=0.05285, over 21727.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2963, pruned_loss=0.07134, over 4251802.85 frames. ], batch size: 282, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 21:59:06,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1192002.0, ans=0.125 2023-06-24 21:59:40,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.47 vs. limit=15.0 2023-06-24 22:00:00,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.614e+02 3.168e+02 3.646e+02 5.632e+02, threshold=6.336e+02, percent-clipped=0.0 2023-06-24 22:00:25,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1192242.0, ans=0.0 2023-06-24 22:00:36,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.14 vs. limit=5.0 2023-06-24 22:00:43,468 INFO [train.py:996] (1/4) Epoch 7, batch 15750, loss[loss=0.1865, simple_loss=0.2624, pruned_loss=0.05529, over 21755.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2918, pruned_loss=0.07152, over 4248477.42 frames. ], batch size: 112, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:00:55,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1192302.0, ans=0.125 2023-06-24 22:02:18,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1192542.0, ans=0.1 2023-06-24 22:02:32,327 INFO [train.py:996] (1/4) Epoch 7, batch 15800, loss[loss=0.2052, simple_loss=0.272, pruned_loss=0.06917, over 21504.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2872, pruned_loss=0.0712, over 4255509.41 frames. 
], batch size: 195, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:02:34,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1192602.0, ans=0.125 2023-06-24 22:02:36,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1192602.0, ans=0.0 2023-06-24 22:03:37,343 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 2.697e+02 3.086e+02 3.699e+02 6.270e+02, threshold=6.172e+02, percent-clipped=0.0 2023-06-24 22:03:57,272 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:04:15,600 INFO [train.py:996] (1/4) Epoch 7, batch 15850, loss[loss=0.24, simple_loss=0.3138, pruned_loss=0.08305, over 21563.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2895, pruned_loss=0.07317, over 4251462.12 frames. ], batch size: 230, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:05:02,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1193022.0, ans=0.125 2023-06-24 22:05:53,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1193142.0, ans=0.125 2023-06-24 22:06:04,579 INFO [train.py:996] (1/4) Epoch 7, batch 15900, loss[loss=0.2067, simple_loss=0.2762, pruned_loss=0.06859, over 21820.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2877, pruned_loss=0.07348, over 4253514.07 frames. ], batch size: 107, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:06:17,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1193202.0, ans=0.1 2023-06-24 22:07:09,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 2.996e+02 3.520e+02 4.315e+02 6.246e+02, threshold=7.040e+02, percent-clipped=3.0 2023-06-24 22:07:53,081 INFO [train.py:996] (1/4) Epoch 7, batch 15950, loss[loss=0.1596, simple_loss=0.2447, pruned_loss=0.03727, over 21496.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.288, pruned_loss=0.07103, over 4254733.86 frames. ], batch size: 211, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:07:59,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1193502.0, ans=0.125 2023-06-24 22:08:16,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1193562.0, ans=0.125 2023-06-24 22:08:20,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1193562.0, ans=0.0 2023-06-24 22:08:25,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1193562.0, ans=0.125 2023-06-24 22:09:43,010 INFO [train.py:996] (1/4) Epoch 7, batch 16000, loss[loss=0.1974, simple_loss=0.2832, pruned_loss=0.05575, over 21399.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2894, pruned_loss=0.0692, over 4249006.25 frames. 
], batch size: 211, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:09:47,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1193802.0, ans=0.0 2023-06-24 22:10:03,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1193862.0, ans=0.0 2023-06-24 22:10:15,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1193862.0, ans=0.1 2023-06-24 22:10:55,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 3.010e+02 3.950e+02 5.010e+02 9.750e+02, threshold=7.899e+02, percent-clipped=10.0 2023-06-24 22:11:32,504 INFO [train.py:996] (1/4) Epoch 7, batch 16050, loss[loss=0.1785, simple_loss=0.2623, pruned_loss=0.04739, over 21438.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2926, pruned_loss=0.06796, over 4259912.45 frames. ], batch size: 194, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:11:36,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1194102.0, ans=0.1 2023-06-24 22:11:55,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1194162.0, ans=0.2 2023-06-24 22:11:55,173 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:12:35,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1194282.0, ans=0.2 2023-06-24 22:12:53,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1194282.0, ans=0.125 2023-06-24 22:13:17,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1194342.0, ans=0.0 2023-06-24 22:13:20,135 INFO [train.py:996] (1/4) Epoch 7, batch 16100, loss[loss=0.1954, simple_loss=0.2889, pruned_loss=0.05097, over 21674.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2971, pruned_loss=0.06876, over 4270734.25 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:14:13,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1194522.0, ans=0.2 2023-06-24 22:14:25,471 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.065e+02 3.753e+02 4.772e+02 1.110e+03, threshold=7.506e+02, percent-clipped=5.0 2023-06-24 22:14:53,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1194642.0, ans=0.2 2023-06-24 22:15:06,568 INFO [train.py:996] (1/4) Epoch 7, batch 16150, loss[loss=0.2052, simple_loss=0.2789, pruned_loss=0.06577, over 21481.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2973, pruned_loss=0.07055, over 4282494.25 frames. ], batch size: 131, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:15:19,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1194702.0, ans=0.1 2023-06-24 22:15:38,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. 
limit=22.5 2023-06-24 22:15:52,153 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:16:34,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1194882.0, ans=0.0 2023-06-24 22:16:38,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-24 22:16:45,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1194942.0, ans=0.1 2023-06-24 22:16:52,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1194942.0, ans=0.0 2023-06-24 22:16:57,002 INFO [train.py:996] (1/4) Epoch 7, batch 16200, loss[loss=0.2574, simple_loss=0.334, pruned_loss=0.09044, over 21453.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3009, pruned_loss=0.07245, over 4280877.01 frames. ], batch size: 131, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:17:20,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1195062.0, ans=0.0 2023-06-24 22:17:24,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-24 22:18:15,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.476e+02 2.895e+02 3.394e+02 4.172e+02 8.958e+02, threshold=6.788e+02, percent-clipped=2.0 2023-06-24 22:18:18,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1195182.0, ans=0.07 2023-06-24 22:18:26,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1195182.0, ans=0.1 2023-06-24 22:18:46,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1195302.0, ans=0.0 2023-06-24 22:18:47,724 INFO [train.py:996] (1/4) Epoch 7, batch 16250, loss[loss=0.2098, simple_loss=0.2805, pruned_loss=0.06959, over 21856.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3003, pruned_loss=0.07267, over 4277795.91 frames. ], batch size: 373, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:19:39,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1195422.0, ans=0.125 2023-06-24 22:19:51,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1195422.0, ans=0.125 2023-06-24 22:20:22,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1195542.0, ans=0.125 2023-06-24 22:20:31,151 INFO [train.py:996] (1/4) Epoch 7, batch 16300, loss[loss=0.1628, simple_loss=0.2457, pruned_loss=0.03991, over 21247.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2943, pruned_loss=0.06928, over 4270854.33 frames. 
], batch size: 176, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:20:33,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1195602.0, ans=0.0 2023-06-24 22:20:47,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1195662.0, ans=0.125 2023-06-24 22:21:48,946 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.667e+02 3.225e+02 3.668e+02 6.965e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-24 22:22:03,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1195842.0, ans=0.125 2023-06-24 22:22:04,690 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.48 vs. limit=22.5 2023-06-24 22:22:20,752 INFO [train.py:996] (1/4) Epoch 7, batch 16350, loss[loss=0.1838, simple_loss=0.2662, pruned_loss=0.05068, over 21671.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2937, pruned_loss=0.06965, over 4269593.69 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:22:35,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1195902.0, ans=0.0 2023-06-24 22:22:41,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=22.5 2023-06-24 22:24:03,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1196202.0, ans=0.125 2023-06-24 22:24:04,381 INFO [train.py:996] (1/4) Epoch 7, batch 16400, loss[loss=0.2155, simple_loss=0.3121, pruned_loss=0.05945, over 21310.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2969, pruned_loss=0.0716, over 4276830.61 frames. ], batch size: 548, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:24:19,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1196202.0, ans=0.125 2023-06-24 22:24:23,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.57 vs. limit=15.0 2023-06-24 22:24:24,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1196262.0, ans=0.2 2023-06-24 22:25:16,391 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.934e+02 3.396e+02 4.473e+02 6.388e+02, threshold=6.793e+02, percent-clipped=0.0 2023-06-24 22:25:48,958 INFO [train.py:996] (1/4) Epoch 7, batch 16450, loss[loss=0.2125, simple_loss=0.2891, pruned_loss=0.06797, over 21419.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2978, pruned_loss=0.07299, over 4285011.99 frames. ], batch size: 131, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:27:32,620 INFO [train.py:996] (1/4) Epoch 7, batch 16500, loss[loss=0.2027, simple_loss=0.2772, pruned_loss=0.06407, over 21759.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.297, pruned_loss=0.07294, over 4285249.94 frames. 
], batch size: 298, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:28:28,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1196922.0, ans=0.125 2023-06-24 22:28:49,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1196982.0, ans=0.125 2023-06-24 22:28:51,335 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.249e+02 4.017e+02 5.671e+02 1.121e+03, threshold=8.034e+02, percent-clipped=17.0 2023-06-24 22:28:58,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1196982.0, ans=0.0 2023-06-24 22:29:06,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1197042.0, ans=0.125 2023-06-24 22:29:06,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1197042.0, ans=0.1 2023-06-24 22:29:11,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1197042.0, ans=0.125 2023-06-24 22:29:15,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1197042.0, ans=15.0 2023-06-24 22:29:20,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1197042.0, ans=0.2 2023-06-24 22:29:22,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-24 22:29:23,183 INFO [train.py:996] (1/4) Epoch 7, batch 16550, loss[loss=0.3506, simple_loss=0.461, pruned_loss=0.1201, over 19792.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2966, pruned_loss=0.07226, over 4269447.98 frames. ], batch size: 702, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:30:07,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1197162.0, ans=0.125 2023-06-24 22:30:11,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1197162.0, ans=0.125 2023-06-24 22:30:20,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1197222.0, ans=0.125 2023-06-24 22:31:37,495 INFO [train.py:996] (1/4) Epoch 7, batch 16600, loss[loss=0.214, simple_loss=0.3097, pruned_loss=0.05912, over 20722.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3034, pruned_loss=0.07446, over 4270808.69 frames. ], batch size: 607, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:31:49,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-24 22:31:52,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1197402.0, ans=0.0 2023-06-24 22:32:18,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. 
limit=15.0 2023-06-24 22:32:32,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1197582.0, ans=0.2 2023-06-24 22:32:36,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.261e+02 4.003e+02 5.335e+02 1.096e+03, threshold=8.006e+02, percent-clipped=4.0 2023-06-24 22:32:58,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1197642.0, ans=0.125 2023-06-24 22:33:00,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1197642.0, ans=0.0 2023-06-24 22:33:26,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1197642.0, ans=0.125 2023-06-24 22:33:29,480 INFO [train.py:996] (1/4) Epoch 7, batch 16650, loss[loss=0.2805, simple_loss=0.3589, pruned_loss=0.101, over 21464.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3127, pruned_loss=0.07587, over 4268347.59 frames. ], batch size: 131, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:34:03,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1197762.0, ans=0.0 2023-06-24 22:35:02,206 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-24 22:35:17,466 INFO [train.py:996] (1/4) Epoch 7, batch 16700, loss[loss=0.1936, simple_loss=0.2561, pruned_loss=0.06552, over 21184.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3152, pruned_loss=0.07738, over 4266296.62 frames. ], batch size: 143, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:36:39,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.449e+02 4.344e+02 5.804e+02 8.392e+02, threshold=8.689e+02, percent-clipped=2.0 2023-06-24 22:37:03,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1198242.0, ans=0.0 2023-06-24 22:37:12,524 INFO [train.py:996] (1/4) Epoch 7, batch 16750, loss[loss=0.2697, simple_loss=0.3568, pruned_loss=0.09133, over 21641.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3184, pruned_loss=0.07985, over 4265359.95 frames. ], batch size: 389, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:37:42,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1198362.0, ans=0.1 2023-06-24 22:38:24,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1198422.0, ans=0.1 2023-06-24 22:38:39,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1198482.0, ans=0.1 2023-06-24 22:38:54,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1198542.0, ans=0.125 2023-06-24 22:39:02,803 INFO [train.py:996] (1/4) Epoch 7, batch 16800, loss[loss=0.2639, simple_loss=0.3658, pruned_loss=0.08095, over 21298.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.323, pruned_loss=0.08005, over 4260656.50 frames. ], batch size: 548, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:39:42,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. 
limit=15.0 2023-06-24 22:40:20,483 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.537e+02 4.384e+02 6.125e+02 1.119e+03, threshold=8.769e+02, percent-clipped=3.0 2023-06-24 22:40:55,218 INFO [train.py:996] (1/4) Epoch 7, batch 16850, loss[loss=0.2712, simple_loss=0.3256, pruned_loss=0.1085, over 21765.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3186, pruned_loss=0.07953, over 4267628.23 frames. ], batch size: 473, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:41:52,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-24 22:42:01,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=12.25 vs. limit=15.0 2023-06-24 22:42:29,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-24 22:42:47,359 INFO [train.py:996] (1/4) Epoch 7, batch 16900, loss[loss=0.1855, simple_loss=0.2569, pruned_loss=0.05705, over 21639.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3125, pruned_loss=0.07791, over 4269078.69 frames. ], batch size: 247, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:42:47,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1199202.0, ans=0.125 2023-06-24 22:42:58,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1199202.0, ans=0.125 2023-06-24 22:43:27,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1199262.0, ans=0.125 2023-06-24 22:43:54,624 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.672e+02 3.013e+02 3.696e+02 7.423e+02, threshold=6.025e+02, percent-clipped=0.0 2023-06-24 22:44:00,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199382.0, ans=0.1 2023-06-24 22:44:33,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1199502.0, ans=0.125 2023-06-24 22:44:34,199 INFO [train.py:996] (1/4) Epoch 7, batch 16950, loss[loss=0.2067, simple_loss=0.2832, pruned_loss=0.06513, over 21891.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3045, pruned_loss=0.07537, over 4274712.75 frames. ], batch size: 118, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:46:05,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1199742.0, ans=0.05 2023-06-24 22:46:21,574 INFO [train.py:996] (1/4) Epoch 7, batch 17000, loss[loss=0.2403, simple_loss=0.2971, pruned_loss=0.09172, over 21597.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3008, pruned_loss=0.07568, over 4281994.66 frames. 
], batch size: 548, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:46:53,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1199862.0, ans=0.0 2023-06-24 22:47:33,178 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.127e+02 3.708e+02 4.467e+02 7.774e+02, threshold=7.417e+02, percent-clipped=6.0 2023-06-24 22:47:45,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-24 22:48:17,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1200102.0, ans=0.0 2023-06-24 22:48:18,535 INFO [train.py:996] (1/4) Epoch 7, batch 17050, loss[loss=0.2469, simple_loss=0.326, pruned_loss=0.08392, over 21813.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3088, pruned_loss=0.07814, over 4287912.37 frames. ], batch size: 298, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:48:18,971 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:48:34,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1200102.0, ans=0.2 2023-06-24 22:49:45,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1200342.0, ans=0.0 2023-06-24 22:50:04,880 INFO [train.py:996] (1/4) Epoch 7, batch 17100, loss[loss=0.2377, simple_loss=0.302, pruned_loss=0.08668, over 21783.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3085, pruned_loss=0.07883, over 4294134.64 frames. ], batch size: 441, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:50:33,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1200462.0, ans=0.1 2023-06-24 22:50:52,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1200522.0, ans=0.0 2023-06-24 22:50:56,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1200522.0, ans=0.125 2023-06-24 22:51:00,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1200582.0, ans=0.125 2023-06-24 22:51:07,031 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.876e+02 3.458e+02 4.009e+02 6.895e+02, threshold=6.917e+02, percent-clipped=0.0 2023-06-24 22:51:12,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1200582.0, ans=0.1 2023-06-24 22:51:30,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=12.0 2023-06-24 22:51:46,954 INFO [train.py:996] (1/4) Epoch 7, batch 17150, loss[loss=0.1819, simple_loss=0.2652, pruned_loss=0.04933, over 21723.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3045, pruned_loss=0.07889, over 4300766.53 frames. 
], batch size: 247, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:52:53,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1200882.0, ans=0.125 2023-06-24 22:53:42,022 INFO [train.py:996] (1/4) Epoch 7, batch 17200, loss[loss=0.2392, simple_loss=0.3103, pruned_loss=0.08409, over 21470.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3036, pruned_loss=0.0778, over 4296019.03 frames. ], batch size: 194, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:54:45,119 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:54:53,173 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.812e+02 3.269e+02 4.158e+02 6.698e+02, threshold=6.538e+02, percent-clipped=0.0 2023-06-24 22:55:33,445 INFO [train.py:996] (1/4) Epoch 7, batch 17250, loss[loss=0.2474, simple_loss=0.3285, pruned_loss=0.08312, over 21342.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3075, pruned_loss=0.07978, over 4292589.16 frames. ], batch size: 159, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:55:39,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1201302.0, ans=15.0 2023-06-24 22:57:24,157 INFO [train.py:996] (1/4) Epoch 7, batch 17300, loss[loss=0.2494, simple_loss=0.3329, pruned_loss=0.08297, over 21175.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3162, pruned_loss=0.083, over 4291638.38 frames. ], batch size: 143, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:57:38,147 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-24 22:58:06,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=22.5 2023-06-24 22:58:23,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1201722.0, ans=0.025 2023-06-24 22:58:36,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.86 vs. limit=15.0 2023-06-24 22:58:47,368 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.589e+02 3.154e+02 3.783e+02 4.784e+02 7.470e+02, threshold=7.566e+02, percent-clipped=5.0 2023-06-24 22:58:52,006 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:59:15,012 INFO [train.py:996] (1/4) Epoch 7, batch 17350, loss[loss=0.1985, simple_loss=0.2812, pruned_loss=0.05794, over 21645.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3169, pruned_loss=0.08263, over 4296246.10 frames. 
], batch size: 230, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:00:13,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1202022.0, ans=0.0 2023-06-24 23:00:26,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1202022.0, ans=0.2 2023-06-24 23:00:46,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1202142.0, ans=0.0 2023-06-24 23:01:04,945 INFO [train.py:996] (1/4) Epoch 7, batch 17400, loss[loss=0.3138, simple_loss=0.3854, pruned_loss=0.1211, over 21508.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3128, pruned_loss=0.07917, over 4288644.41 frames. ], batch size: 508, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:02:25,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1202382.0, ans=0.125 2023-06-24 23:02:28,350 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.058e+02 3.682e+02 4.915e+02 8.567e+02, threshold=7.364e+02, percent-clipped=2.0 2023-06-24 23:02:34,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.03 vs. limit=15.0 2023-06-24 23:03:05,883 INFO [train.py:996] (1/4) Epoch 7, batch 17450, loss[loss=0.1966, simple_loss=0.2965, pruned_loss=0.04842, over 21701.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3096, pruned_loss=0.07671, over 4284461.36 frames. ], batch size: 414, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:03:30,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1202562.0, ans=0.1 2023-06-24 23:03:48,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1202562.0, ans=0.0 2023-06-24 23:04:14,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1202682.0, ans=0.125 2023-06-24 23:04:21,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.03 vs. limit=15.0 2023-06-24 23:04:56,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1202742.0, ans=0.125 2023-06-24 23:04:59,040 INFO [train.py:996] (1/4) Epoch 7, batch 17500, loss[loss=0.1999, simple_loss=0.2713, pruned_loss=0.06424, over 21138.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.305, pruned_loss=0.0739, over 4282809.89 frames. 
], batch size: 608, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:05:01,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1202802.0, ans=0.0 2023-06-24 23:05:31,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1202862.0, ans=0.125 2023-06-24 23:06:04,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.853e+02 3.403e+02 4.672e+02 8.323e+02, threshold=6.806e+02, percent-clipped=1.0 2023-06-24 23:06:05,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1202982.0, ans=0.125 2023-06-24 23:06:44,182 INFO [train.py:996] (1/4) Epoch 7, batch 17550, loss[loss=0.221, simple_loss=0.3088, pruned_loss=0.06653, over 21796.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.305, pruned_loss=0.07318, over 4282824.30 frames. ], batch size: 112, lr: 4.28e-03, grad_scale: 8.0 2023-06-24 23:07:28,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1203222.0, ans=0.125 2023-06-24 23:08:32,340 INFO [train.py:996] (1/4) Epoch 7, batch 17600, loss[loss=0.2451, simple_loss=0.3184, pruned_loss=0.08589, over 21933.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3083, pruned_loss=0.07435, over 4285766.36 frames. ], batch size: 372, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:08:36,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1203402.0, ans=0.1 2023-06-24 23:08:58,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1203462.0, ans=0.0 2023-06-24 23:09:01,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1203462.0, ans=0.0 2023-06-24 23:09:03,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1203462.0, ans=0.2 2023-06-24 23:09:04,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1203462.0, ans=0.0 2023-06-24 23:09:21,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-24 23:09:41,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.778e+02 3.294e+02 4.134e+02 8.304e+02, threshold=6.589e+02, percent-clipped=2.0 2023-06-24 23:10:20,634 INFO [train.py:996] (1/4) Epoch 7, batch 17650, loss[loss=0.1692, simple_loss=0.2396, pruned_loss=0.0494, over 21635.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3061, pruned_loss=0.0749, over 4279918.95 frames. 
], batch size: 263, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:10:21,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1203702.0, ans=0.0 2023-06-24 23:10:29,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1203702.0, ans=0.125 2023-06-24 23:10:50,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1203762.0, ans=0.0 2023-06-24 23:10:58,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-24 23:11:13,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1203822.0, ans=0.5 2023-06-24 23:11:17,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-24 23:11:30,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. limit=8.0 2023-06-24 23:12:12,064 INFO [train.py:996] (1/4) Epoch 7, batch 17700, loss[loss=0.2329, simple_loss=0.3195, pruned_loss=0.07312, over 21742.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2992, pruned_loss=0.07153, over 4281261.16 frames. ], batch size: 351, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:12:12,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1204002.0, ans=0.125 2023-06-24 23:12:28,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1204002.0, ans=0.0 2023-06-24 23:13:11,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1204182.0, ans=0.0 2023-06-24 23:13:22,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=15.0 2023-06-24 23:13:30,881 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.965e+02 3.854e+02 5.323e+02 9.978e+02, threshold=7.709e+02, percent-clipped=16.0 2023-06-24 23:13:43,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-24 23:14:06,514 INFO [train.py:996] (1/4) Epoch 7, batch 17750, loss[loss=0.2739, simple_loss=0.3611, pruned_loss=0.09339, over 21844.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3072, pruned_loss=0.07491, over 4280760.30 frames. ], batch size: 124, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:15:08,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1204482.0, ans=0.125 2023-06-24 23:15:56,639 INFO [train.py:996] (1/4) Epoch 7, batch 17800, loss[loss=0.2096, simple_loss=0.2934, pruned_loss=0.06289, over 20764.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3053, pruned_loss=0.07373, over 4274193.78 frames. 
], batch size: 607, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:16:15,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1204662.0, ans=0.125 2023-06-24 23:16:21,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-24 23:16:26,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1204662.0, ans=0.125 2023-06-24 23:16:50,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1204722.0, ans=0.09899494936611666 2023-06-24 23:17:18,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1204782.0, ans=0.0 2023-06-24 23:17:23,012 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.835e+02 3.431e+02 4.472e+02 1.183e+03, threshold=6.863e+02, percent-clipped=3.0 2023-06-24 23:17:47,961 INFO [train.py:996] (1/4) Epoch 7, batch 17850, loss[loss=0.3044, simple_loss=0.3771, pruned_loss=0.1158, over 21500.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3059, pruned_loss=0.07457, over 4271859.75 frames. ], batch size: 471, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:18:47,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-24 23:18:58,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1205022.0, ans=0.0 2023-06-24 23:19:02,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1205082.0, ans=0.04949747468305833 2023-06-24 23:19:05,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1205082.0, ans=0.125 2023-06-24 23:19:38,022 INFO [train.py:996] (1/4) Epoch 7, batch 17900, loss[loss=0.2017, simple_loss=0.2653, pruned_loss=0.06902, over 20302.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3111, pruned_loss=0.07681, over 4270642.15 frames. ], batch size: 702, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:19:47,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1205202.0, ans=0.0 2023-06-24 23:20:02,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1205262.0, ans=0.125 2023-06-24 23:20:28,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-24 23:20:50,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-06-24 23:21:03,390 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.987e+02 3.415e+02 4.264e+02 7.391e+02, threshold=6.831e+02, percent-clipped=3.0 2023-06-24 23:21:27,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. 
limit=6.0 2023-06-24 23:21:27,867 INFO [train.py:996] (1/4) Epoch 7, batch 17950, loss[loss=0.2064, simple_loss=0.3048, pruned_loss=0.05397, over 21596.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3097, pruned_loss=0.07358, over 4266163.12 frames. ], batch size: 441, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:22:27,735 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:22:38,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1205622.0, ans=0.125 2023-06-24 23:23:06,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1205742.0, ans=0.2 2023-06-24 23:23:19,952 INFO [train.py:996] (1/4) Epoch 7, batch 18000, loss[loss=0.1975, simple_loss=0.2654, pruned_loss=0.06477, over 21787.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3021, pruned_loss=0.07102, over 4252243.26 frames. ], batch size: 118, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:23:19,953 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 23:23:40,280 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2616, simple_loss=0.3599, pruned_loss=0.08162, over 1796401.00 frames. 2023-06-24 23:23:40,281 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-24 23:23:43,037 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:24:02,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1205802.0, ans=0.125 2023-06-24 23:24:11,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1205862.0, ans=0.125 2023-06-24 23:24:52,531 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-24 23:24:55,044 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.947e+02 3.493e+02 4.464e+02 9.866e+02, threshold=6.986e+02, percent-clipped=5.0 2023-06-24 23:25:24,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1206042.0, ans=0.125 2023-06-24 23:25:35,634 INFO [train.py:996] (1/4) Epoch 7, batch 18050, loss[loss=0.2201, simple_loss=0.2912, pruned_loss=0.07451, over 21613.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2959, pruned_loss=0.0702, over 4250208.09 frames. ], batch size: 263, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:26:03,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1206162.0, ans=0.125 2023-06-24 23:26:03,726 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-24 23:26:10,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1206162.0, ans=0.125 2023-06-24 23:27:08,034 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. 
limit=15.0 2023-06-24 23:27:30,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1206402.0, ans=0.125 2023-06-24 23:27:32,133 INFO [train.py:996] (1/4) Epoch 7, batch 18100, loss[loss=0.2458, simple_loss=0.3318, pruned_loss=0.0799, over 21289.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2999, pruned_loss=0.07171, over 4247063.52 frames. ], batch size: 143, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:27:42,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1206402.0, ans=0.1 2023-06-24 23:27:51,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.13 vs. limit=10.0 2023-06-24 23:27:55,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1206462.0, ans=0.035 2023-06-24 23:28:44,811 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.881e+02 3.345e+02 4.009e+02 7.924e+02, threshold=6.690e+02, percent-clipped=2.0 2023-06-24 23:28:50,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1206642.0, ans=0.125 2023-06-24 23:29:09,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1206642.0, ans=0.1 2023-06-24 23:29:14,114 INFO [train.py:996] (1/4) Epoch 7, batch 18150, loss[loss=0.1906, simple_loss=0.261, pruned_loss=0.06005, over 21852.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3024, pruned_loss=0.07201, over 4255413.00 frames. ], batch size: 107, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:29:42,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1206762.0, ans=0.125 2023-06-24 23:29:46,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-24 23:30:16,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1206882.0, ans=0.1 2023-06-24 23:30:39,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1206942.0, ans=0.0 2023-06-24 23:30:59,297 INFO [train.py:996] (1/4) Epoch 7, batch 18200, loss[loss=0.2029, simple_loss=0.2686, pruned_loss=0.06862, over 21392.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2974, pruned_loss=0.07228, over 4255706.70 frames. ], batch size: 144, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:31:53,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-24 23:32:02,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1207182.0, ans=0.07 2023-06-24 23:32:05,010 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.874e+02 3.635e+02 5.188e+02 1.150e+03, threshold=7.270e+02, percent-clipped=9.0 2023-06-24 23:32:17,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. 
limit=15.0 2023-06-24 23:32:38,591 INFO [train.py:996] (1/4) Epoch 7, batch 18250, loss[loss=0.1788, simple_loss=0.2536, pruned_loss=0.05197, over 21776.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2892, pruned_loss=0.06952, over 4270616.35 frames. ], batch size: 124, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:32:59,418 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:34:04,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1207542.0, ans=0.0 2023-06-24 23:34:24,257 INFO [train.py:996] (1/4) Epoch 7, batch 18300, loss[loss=0.2256, simple_loss=0.3364, pruned_loss=0.05738, over 20962.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2885, pruned_loss=0.06968, over 4265917.46 frames. ], batch size: 607, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:34:29,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1207602.0, ans=0.95 2023-06-24 23:34:37,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1207602.0, ans=0.2 2023-06-24 23:34:46,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1207662.0, ans=0.0 2023-06-24 23:35:16,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1207722.0, ans=0.125 2023-06-24 23:35:28,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1207782.0, ans=0.2 2023-06-24 23:35:39,690 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.910e+02 3.541e+02 4.206e+02 1.059e+03, threshold=7.082e+02, percent-clipped=3.0 2023-06-24 23:36:12,354 INFO [train.py:996] (1/4) Epoch 7, batch 18350, loss[loss=0.2031, simple_loss=0.2776, pruned_loss=0.06433, over 21711.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2957, pruned_loss=0.0702, over 4268394.14 frames. ], batch size: 316, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:37:25,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1208082.0, ans=0.0 2023-06-24 23:37:53,262 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-24 23:38:01,016 INFO [train.py:996] (1/4) Epoch 7, batch 18400, loss[loss=0.1796, simple_loss=0.2497, pruned_loss=0.05474, over 21394.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2903, pruned_loss=0.06877, over 4258489.65 frames. ], batch size: 131, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:38:01,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208202.0, ans=0.1 2023-06-24 23:38:37,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.98 vs. 
limit=5.0 2023-06-24 23:39:10,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1208382.0, ans=0.0 2023-06-24 23:39:16,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.559e+02 3.009e+02 3.655e+02 5.951e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-24 23:39:44,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1208442.0, ans=0.125 2023-06-24 23:39:49,330 INFO [train.py:996] (1/4) Epoch 7, batch 18450, loss[loss=0.2047, simple_loss=0.2925, pruned_loss=0.0585, over 21556.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2876, pruned_loss=0.06585, over 4252996.13 frames. ], batch size: 442, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:40:28,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208562.0, ans=0.1 2023-06-24 23:41:19,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208742.0, ans=0.1 2023-06-24 23:41:38,244 INFO [train.py:996] (1/4) Epoch 7, batch 18500, loss[loss=0.2002, simple_loss=0.2618, pruned_loss=0.06934, over 21380.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.282, pruned_loss=0.06413, over 4239322.72 frames. ], batch size: 160, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:41:40,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1208802.0, ans=0.125 2023-06-24 23:41:54,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1208802.0, ans=0.0 2023-06-24 23:42:16,608 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-24 23:42:21,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208862.0, ans=0.1 2023-06-24 23:42:59,433 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.868e+02 3.588e+02 5.410e+02 1.340e+03, threshold=7.175e+02, percent-clipped=18.0 2023-06-24 23:43:18,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-24 23:43:21,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-24 23:43:25,442 INFO [train.py:996] (1/4) Epoch 7, batch 18550, loss[loss=0.21, simple_loss=0.2761, pruned_loss=0.07193, over 21742.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2813, pruned_loss=0.06394, over 4233605.93 frames. ], batch size: 351, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:43:49,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-24 23:44:06,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1209162.0, ans=0.0 2023-06-24 23:44:30,055 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. 
limit=15.0 2023-06-24 23:45:02,405 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:45:13,361 INFO [train.py:996] (1/4) Epoch 7, batch 18600, loss[loss=0.1885, simple_loss=0.249, pruned_loss=0.06405, over 20764.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2802, pruned_loss=0.06482, over 4245020.12 frames. ], batch size: 608, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:45:16,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1209402.0, ans=0.125 2023-06-24 23:45:19,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-24 23:45:50,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1209462.0, ans=0.125 2023-06-24 23:45:52,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1209462.0, ans=0.125 2023-06-24 23:46:35,146 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.703e+02 3.435e+02 4.233e+02 7.811e+02, threshold=6.869e+02, percent-clipped=3.0 2023-06-24 23:46:42,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1209642.0, ans=0.125 2023-06-24 23:47:01,136 INFO [train.py:996] (1/4) Epoch 7, batch 18650, loss[loss=0.1991, simple_loss=0.2621, pruned_loss=0.06804, over 21489.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2792, pruned_loss=0.06445, over 4244697.70 frames. ], batch size: 230, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:47:19,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1209702.0, ans=0.0 2023-06-24 23:47:40,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1209762.0, ans=0.0 2023-06-24 23:48:25,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1209942.0, ans=0.125 2023-06-24 23:48:28,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1209942.0, ans=0.0 2023-06-24 23:48:48,668 INFO [train.py:996] (1/4) Epoch 7, batch 18700, loss[loss=0.1928, simple_loss=0.2639, pruned_loss=0.06083, over 21690.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2779, pruned_loss=0.06576, over 4243132.19 frames. ], batch size: 264, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:48:54,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1210002.0, ans=0.1 2023-06-24 23:48:56,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1210002.0, ans=0.05 2023-06-24 23:50:10,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.803e+02 3.350e+02 3.905e+02 5.845e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-24 23:50:36,538 INFO [train.py:996] (1/4) Epoch 7, batch 18750, loss[loss=0.2035, simple_loss=0.2725, pruned_loss=0.06727, over 21343.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2798, pruned_loss=0.06846, over 4256268.93 frames. 
], batch size: 159, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:51:04,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5 2023-06-24 23:51:22,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1210422.0, ans=0.0 2023-06-24 23:51:36,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1210422.0, ans=0.125 2023-06-24 23:51:42,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1210482.0, ans=0.0 2023-06-24 23:52:01,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1210542.0, ans=0.0 2023-06-24 23:52:07,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-24 23:52:22,878 INFO [train.py:996] (1/4) Epoch 7, batch 18800, loss[loss=0.2455, simple_loss=0.3402, pruned_loss=0.07543, over 21625.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2872, pruned_loss=0.07041, over 4265651.61 frames. ], batch size: 389, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:52:38,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1210602.0, ans=0.1 2023-06-24 23:52:42,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1210602.0, ans=0.125 2023-06-24 23:53:43,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.643e+02 3.373e+02 4.457e+02 8.790e+02, threshold=6.746e+02, percent-clipped=4.0 2023-06-24 23:54:09,244 INFO [train.py:996] (1/4) Epoch 7, batch 18850, loss[loss=0.1638, simple_loss=0.2589, pruned_loss=0.03433, over 21691.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2855, pruned_loss=0.0667, over 4274732.81 frames. ], batch size: 298, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:54:50,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1210962.0, ans=0.0 2023-06-24 23:55:56,272 INFO [train.py:996] (1/4) Epoch 7, batch 18900, loss[loss=0.2223, simple_loss=0.2896, pruned_loss=0.07751, over 21856.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2821, pruned_loss=0.06671, over 4270153.86 frames. ], batch size: 98, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:56:02,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.78 vs. limit=10.0 2023-06-24 23:56:18,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1211262.0, ans=0.125 2023-06-24 23:56:29,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1211262.0, ans=0.125 2023-06-24 23:56:47,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1211322.0, ans=0.2 2023-06-24 23:57:04,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. 
limit=6.0 2023-06-24 23:57:07,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1211382.0, ans=0.125 2023-06-24 23:57:17,723 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.759e+02 3.206e+02 4.379e+02 8.069e+02, threshold=6.411e+02, percent-clipped=2.0 2023-06-24 23:57:25,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1211442.0, ans=0.1 2023-06-24 23:57:32,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1211442.0, ans=0.0 2023-06-24 23:57:42,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1211502.0, ans=0.125 2023-06-24 23:57:44,052 INFO [train.py:996] (1/4) Epoch 7, batch 18950, loss[loss=0.2345, simple_loss=0.3282, pruned_loss=0.07036, over 21710.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2818, pruned_loss=0.06885, over 4271728.44 frames. ], batch size: 298, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:57:48,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1211502.0, ans=0.125 2023-06-24 23:58:03,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-24 23:58:14,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-24 23:58:20,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1211562.0, ans=0.0 2023-06-24 23:58:29,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1211622.0, ans=0.5 2023-06-24 23:59:00,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1211682.0, ans=0.0 2023-06-24 23:59:11,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1211682.0, ans=0.125 2023-06-24 23:59:39,029 INFO [train.py:996] (1/4) Epoch 7, batch 19000, loss[loss=0.2314, simple_loss=0.3307, pruned_loss=0.06602, over 21714.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2919, pruned_loss=0.07048, over 4274028.18 frames. ], batch size: 351, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:01:02,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 3.130e+02 3.898e+02 4.619e+02 8.945e+02, threshold=7.797e+02, percent-clipped=5.0 2023-06-25 00:01:05,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.69 vs. limit=10.0 2023-06-25 00:01:06,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1212042.0, ans=0.0 2023-06-25 00:01:26,873 INFO [train.py:996] (1/4) Epoch 7, batch 19050, loss[loss=0.2478, simple_loss=0.3117, pruned_loss=0.09192, over 21719.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2972, pruned_loss=0.07408, over 4278816.72 frames. 
], batch size: 389, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:01:27,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1212102.0, ans=0.1 2023-06-25 00:03:13,223 INFO [train.py:996] (1/4) Epoch 7, batch 19100, loss[loss=0.1981, simple_loss=0.267, pruned_loss=0.06459, over 21760.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2957, pruned_loss=0.07518, over 4278373.59 frames. ], batch size: 112, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:03:17,003 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:03:28,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1212402.0, ans=0.0 2023-06-25 00:04:38,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.814e+02 3.416e+02 4.391e+02 9.529e+02, threshold=6.832e+02, percent-clipped=4.0 2023-06-25 00:05:04,650 INFO [train.py:996] (1/4) Epoch 7, batch 19150, loss[loss=0.2793, simple_loss=0.3727, pruned_loss=0.09296, over 21612.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2972, pruned_loss=0.07549, over 4275485.87 frames. ], batch size: 414, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:05:38,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=15.0 2023-06-25 00:05:54,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1212822.0, ans=0.0 2023-06-25 00:06:35,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1212882.0, ans=0.0 2023-06-25 00:07:00,455 INFO [train.py:996] (1/4) Epoch 7, batch 19200, loss[loss=0.2199, simple_loss=0.3211, pruned_loss=0.05931, over 21155.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3055, pruned_loss=0.07603, over 4275201.60 frames. ], batch size: 143, lr: 4.27e-03, grad_scale: 32.0 2023-06-25 00:08:03,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1213182.0, ans=0.125 2023-06-25 00:08:23,635 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 3.201e+02 4.532e+02 8.099e+02 1.362e+03, threshold=9.063e+02, percent-clipped=31.0 2023-06-25 00:08:24,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1213242.0, ans=0.125 2023-06-25 00:08:48,679 INFO [train.py:996] (1/4) Epoch 7, batch 19250, loss[loss=0.1961, simple_loss=0.2751, pruned_loss=0.05855, over 21478.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3056, pruned_loss=0.07109, over 4269640.98 frames. ], batch size: 131, lr: 4.27e-03, grad_scale: 32.0 2023-06-25 00:09:12,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. 
limit=6.0 2023-06-25 00:09:14,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1213362.0, ans=10.0 2023-06-25 00:10:14,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1213542.0, ans=0.1 2023-06-25 00:10:29,819 INFO [train.py:996] (1/4) Epoch 7, batch 19300, loss[loss=0.2403, simple_loss=0.3068, pruned_loss=0.08688, over 21793.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3033, pruned_loss=0.06978, over 4271216.48 frames. ], batch size: 112, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:10:58,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1213662.0, ans=0.0 2023-06-25 00:11:17,580 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:11:25,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-25 00:11:33,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1213722.0, ans=0.0 2023-06-25 00:12:02,047 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.613e+02 3.067e+02 3.986e+02 9.865e+02, threshold=6.134e+02, percent-clipped=1.0 2023-06-25 00:12:24,985 INFO [train.py:996] (1/4) Epoch 7, batch 19350, loss[loss=0.188, simple_loss=0.2742, pruned_loss=0.05092, over 21687.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2983, pruned_loss=0.06676, over 4277884.36 frames. ], batch size: 247, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:13:00,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1213962.0, ans=0.125 2023-06-25 00:13:03,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1213962.0, ans=0.0 2023-06-25 00:13:14,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1214022.0, ans=0.1 2023-06-25 00:13:16,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1214022.0, ans=0.0 2023-06-25 00:13:33,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1214082.0, ans=0.125 2023-06-25 00:14:11,257 INFO [train.py:996] (1/4) Epoch 7, batch 19400, loss[loss=0.2336, simple_loss=0.2979, pruned_loss=0.08464, over 21548.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2958, pruned_loss=0.06599, over 4280981.33 frames. 
], batch size: 548, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:14:36,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1214262.0, ans=0.1 2023-06-25 00:15:13,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1214322.0, ans=0.125 2023-06-25 00:15:27,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1214382.0, ans=0.0 2023-06-25 00:15:29,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-25 00:15:34,735 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.871e+02 3.427e+02 4.239e+02 8.208e+02, threshold=6.853e+02, percent-clipped=6.0 2023-06-25 00:15:51,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1214502.0, ans=0.125 2023-06-25 00:15:58,301 INFO [train.py:996] (1/4) Epoch 7, batch 19450, loss[loss=0.2007, simple_loss=0.2684, pruned_loss=0.06651, over 20181.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2932, pruned_loss=0.06747, over 4285483.05 frames. ], batch size: 702, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:16:10,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1214502.0, ans=0.125 2023-06-25 00:16:11,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.93 vs. limit=22.5 2023-06-25 00:16:12,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1214502.0, ans=0.125 2023-06-25 00:17:04,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1214682.0, ans=0.125 2023-06-25 00:17:46,959 INFO [train.py:996] (1/4) Epoch 7, batch 19500, loss[loss=0.1859, simple_loss=0.2441, pruned_loss=0.06391, over 21157.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2873, pruned_loss=0.06805, over 4285955.15 frames. 
], batch size: 143, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:17:57,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1214802.0, ans=0.0 2023-06-25 00:18:14,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1214862.0, ans=0.2 2023-06-25 00:18:23,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1214862.0, ans=0.2 2023-06-25 00:18:39,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1214922.0, ans=0.0 2023-06-25 00:19:01,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1214982.0, ans=0.2 2023-06-25 00:19:06,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1214982.0, ans=0.125 2023-06-25 00:19:14,505 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.919e+02 3.343e+02 4.176e+02 7.589e+02, threshold=6.686e+02, percent-clipped=2.0 2023-06-25 00:19:21,420 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:19:36,591 INFO [train.py:996] (1/4) Epoch 7, batch 19550, loss[loss=0.1909, simple_loss=0.2867, pruned_loss=0.0475, over 21376.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2833, pruned_loss=0.06647, over 4273256.96 frames. ], batch size: 194, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:19:43,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1215102.0, ans=0.0 2023-06-25 00:20:39,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1215222.0, ans=0.0 2023-06-25 00:20:53,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1215282.0, ans=0.0 2023-06-25 00:21:25,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1215402.0, ans=0.0 2023-06-25 00:21:26,535 INFO [train.py:996] (1/4) Epoch 7, batch 19600, loss[loss=0.2146, simple_loss=0.2892, pruned_loss=0.07007, over 21849.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2853, pruned_loss=0.06705, over 4278228.54 frames. ], batch size: 332, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:22:40,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1215582.0, ans=0.125 2023-06-25 00:22:52,375 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.425e+02 3.092e+02 3.648e+02 4.642e+02 7.608e+02, threshold=7.295e+02, percent-clipped=3.0 2023-06-25 00:23:21,452 INFO [train.py:996] (1/4) Epoch 7, batch 19650, loss[loss=0.2194, simple_loss=0.2932, pruned_loss=0.07274, over 21888.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2904, pruned_loss=0.07107, over 4282652.51 frames. 
], batch size: 371, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:24:47,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1215882.0, ans=10.0 2023-06-25 00:24:57,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1215942.0, ans=0.0 2023-06-25 00:25:16,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=22.5 2023-06-25 00:25:19,692 INFO [train.py:996] (1/4) Epoch 7, batch 19700, loss[loss=0.1894, simple_loss=0.2707, pruned_loss=0.05407, over 21509.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2916, pruned_loss=0.07121, over 4273458.26 frames. ], batch size: 212, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:26:53,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 3.060e+02 3.533e+02 4.552e+02 9.773e+02, threshold=7.066e+02, percent-clipped=3.0 2023-06-25 00:27:15,067 INFO [train.py:996] (1/4) Epoch 7, batch 19750, loss[loss=0.2089, simple_loss=0.2889, pruned_loss=0.06442, over 21753.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3005, pruned_loss=0.07233, over 4270097.15 frames. ], batch size: 124, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:27:47,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-25 00:29:02,171 INFO [train.py:996] (1/4) Epoch 7, batch 19800, loss[loss=0.2034, simple_loss=0.2737, pruned_loss=0.0665, over 21826.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3009, pruned_loss=0.07344, over 4282905.04 frames. ], batch size: 247, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:29:27,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1216662.0, ans=0.0 2023-06-25 00:29:45,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1216722.0, ans=0.05 2023-06-25 00:30:30,867 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.745e+02 3.353e+02 4.359e+02 1.129e+03, threshold=6.706e+02, percent-clipped=10.0 2023-06-25 00:30:52,379 INFO [train.py:996] (1/4) Epoch 7, batch 19850, loss[loss=0.1675, simple_loss=0.2435, pruned_loss=0.04572, over 21447.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2943, pruned_loss=0.06968, over 4273428.21 frames. ], batch size: 194, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:30:52,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1216902.0, ans=0.1 2023-06-25 00:31:09,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-06-25 00:32:39,662 INFO [train.py:996] (1/4) Epoch 7, batch 19900, loss[loss=0.2596, simple_loss=0.3725, pruned_loss=0.07333, over 19705.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2967, pruned_loss=0.06787, over 4274913.33 frames. 
], batch size: 703, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:33:10,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1217262.0, ans=15.0 2023-06-25 00:34:12,735 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.818e+02 3.439e+02 4.122e+02 9.461e+02, threshold=6.879e+02, percent-clipped=3.0 2023-06-25 00:34:26,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-06-25 00:34:28,694 INFO [train.py:996] (1/4) Epoch 7, batch 19950, loss[loss=0.1769, simple_loss=0.249, pruned_loss=0.05237, over 21215.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2901, pruned_loss=0.0674, over 4272683.00 frames. ], batch size: 131, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:34:33,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1217502.0, ans=0.125 2023-06-25 00:35:24,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1217622.0, ans=0.125 2023-06-25 00:35:59,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.45 vs. limit=10.0 2023-06-25 00:36:00,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1217742.0, ans=0.125 2023-06-25 00:36:17,065 INFO [train.py:996] (1/4) Epoch 7, batch 20000, loss[loss=0.2124, simple_loss=0.2879, pruned_loss=0.06845, over 21818.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2925, pruned_loss=0.06824, over 4277986.02 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:36:23,087 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:37:09,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1217922.0, ans=0.0 2023-06-25 00:37:47,459 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.923e+02 3.292e+02 4.012e+02 7.608e+02, threshold=6.584e+02, percent-clipped=1.0 2023-06-25 00:38:03,224 INFO [train.py:996] (1/4) Epoch 7, batch 20050, loss[loss=0.234, simple_loss=0.3082, pruned_loss=0.07988, over 21868.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2943, pruned_loss=0.0707, over 4282793.96 frames. ], batch size: 414, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:38:12,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1218102.0, ans=0.125 2023-06-25 00:38:12,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-25 00:39:01,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1218222.0, ans=0.125 2023-06-25 00:39:34,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1218342.0, ans=0.125 2023-06-25 00:39:53,778 INFO [train.py:996] (1/4) Epoch 7, batch 20100, loss[loss=0.2169, simple_loss=0.3142, pruned_loss=0.05977, over 21784.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2963, pruned_loss=0.07221, over 4287903.48 frames. 
], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:39:59,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1218402.0, ans=0.125 2023-06-25 00:41:07,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-06-25 00:41:10,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1218582.0, ans=0.1 2023-06-25 00:41:17,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1218582.0, ans=0.125 2023-06-25 00:41:29,510 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 2.968e+02 3.649e+02 4.781e+02 8.701e+02, threshold=7.299e+02, percent-clipped=5.0 2023-06-25 00:41:49,525 INFO [train.py:996] (1/4) Epoch 7, batch 20150, loss[loss=0.2692, simple_loss=0.3401, pruned_loss=0.0992, over 21675.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3056, pruned_loss=0.07578, over 4288353.82 frames. ], batch size: 351, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:42:09,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-25 00:42:41,387 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=22.5 2023-06-25 00:42:49,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1218822.0, ans=0.025 2023-06-25 00:43:01,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1218882.0, ans=0.2 2023-06-25 00:43:12,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-25 00:43:51,472 INFO [train.py:996] (1/4) Epoch 7, batch 20200, loss[loss=0.2578, simple_loss=0.3499, pruned_loss=0.0828, over 21825.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3123, pruned_loss=0.07923, over 4290563.22 frames. ], batch size: 371, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:44:00,262 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-25 00:44:19,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1219062.0, ans=0.0 2023-06-25 00:45:22,359 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.473e+02 3.331e+02 3.947e+02 5.099e+02 9.386e+02, threshold=7.894e+02, percent-clipped=7.0 2023-06-25 00:45:26,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1219242.0, ans=0.2 2023-06-25 00:45:36,260 INFO [train.py:996] (1/4) Epoch 7, batch 20250, loss[loss=0.2176, simple_loss=0.2923, pruned_loss=0.07144, over 21011.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3129, pruned_loss=0.07745, over 4284987.86 frames. 
], batch size: 607, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:46:00,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1219362.0, ans=0.0 2023-06-25 00:46:25,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1219422.0, ans=0.125 2023-06-25 00:46:29,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1219422.0, ans=0.0 2023-06-25 00:46:42,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1219482.0, ans=0.0 2023-06-25 00:47:15,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-25 00:47:25,051 INFO [train.py:996] (1/4) Epoch 7, batch 20300, loss[loss=0.2319, simple_loss=0.3199, pruned_loss=0.07195, over 21730.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.311, pruned_loss=0.07479, over 4289006.81 frames. ], batch size: 351, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:48:02,114 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-25 00:48:08,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1219722.0, ans=0.125 2023-06-25 00:48:17,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1219722.0, ans=0.2 2023-06-25 00:48:52,965 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.615e+02 3.044e+02 3.787e+02 8.411e+02, threshold=6.088e+02, percent-clipped=1.0 2023-06-25 00:49:11,891 INFO [train.py:996] (1/4) Epoch 7, batch 20350, loss[loss=0.188, simple_loss=0.2665, pruned_loss=0.05478, over 17320.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3106, pruned_loss=0.07493, over 4277468.84 frames. ], batch size: 65, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:49:23,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1219902.0, ans=0.0 2023-06-25 00:49:54,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1220022.0, ans=0.0 2023-06-25 00:49:56,694 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:50:56,352 INFO [train.py:996] (1/4) Epoch 7, batch 20400, loss[loss=0.3065, simple_loss=0.3683, pruned_loss=0.1223, over 21779.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3122, pruned_loss=0.07704, over 4255184.42 frames. 
], batch size: 441, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 00:50:59,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1220202.0, ans=0.2 2023-06-25 00:51:04,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1220202.0, ans=0.125 2023-06-25 00:51:40,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1220322.0, ans=0.0 2023-06-25 00:51:49,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1220322.0, ans=0.125 2023-06-25 00:51:51,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-25 00:52:32,830 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.347e+02 3.963e+02 4.819e+02 8.468e+02, threshold=7.927e+02, percent-clipped=6.0 2023-06-25 00:52:33,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1220442.0, ans=0.125 2023-06-25 00:52:33,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-25 00:52:44,834 INFO [train.py:996] (1/4) Epoch 7, batch 20450, loss[loss=0.2481, simple_loss=0.3123, pruned_loss=0.09192, over 21818.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3131, pruned_loss=0.07975, over 4266263.54 frames. ], batch size: 441, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:52:46,801 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:53:06,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1220562.0, ans=0.0 2023-06-25 00:53:10,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1220562.0, ans=0.5 2023-06-25 00:53:27,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1220622.0, ans=0.125 2023-06-25 00:53:50,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1220682.0, ans=0.125 2023-06-25 00:54:25,822 INFO [train.py:996] (1/4) Epoch 7, batch 20500, loss[loss=0.2129, simple_loss=0.2786, pruned_loss=0.07355, over 21872.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3086, pruned_loss=0.07993, over 4262541.50 frames. ], batch size: 107, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:55:27,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1220982.0, ans=0.0 2023-06-25 00:55:56,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1221042.0, ans=0.125 2023-06-25 00:56:00,855 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.204e+02 4.054e+02 5.426e+02 8.867e+02, threshold=8.109e+02, percent-clipped=2.0 2023-06-25 00:56:01,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.75 vs. 
limit=22.5 2023-06-25 00:56:13,075 INFO [train.py:996] (1/4) Epoch 7, batch 20550, loss[loss=0.2166, simple_loss=0.3089, pruned_loss=0.06217, over 21851.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2999, pruned_loss=0.07727, over 4265647.05 frames. ], batch size: 372, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:57:06,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1221222.0, ans=0.0 2023-06-25 00:57:20,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1221282.0, ans=0.2 2023-06-25 00:57:31,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.48 vs. limit=10.0 2023-06-25 00:57:46,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1221342.0, ans=0.125 2023-06-25 00:57:56,523 INFO [train.py:996] (1/4) Epoch 7, batch 20600, loss[loss=0.2189, simple_loss=0.3136, pruned_loss=0.06204, over 21733.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3009, pruned_loss=0.07549, over 4257327.74 frames. ], batch size: 298, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:58:27,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-25 00:58:40,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1221522.0, ans=0.025 2023-06-25 00:59:15,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1221582.0, ans=0.125 2023-06-25 00:59:25,528 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.095e+02 3.828e+02 5.103e+02 1.106e+03, threshold=7.656e+02, percent-clipped=7.0 2023-06-25 00:59:37,757 INFO [train.py:996] (1/4) Epoch 7, batch 20650, loss[loss=0.1937, simple_loss=0.2654, pruned_loss=0.06103, over 21848.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2982, pruned_loss=0.07529, over 4253598.23 frames. ], batch size: 98, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:59:59,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-25 01:00:06,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1221762.0, ans=0.125 2023-06-25 01:00:06,949 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=15.0 2023-06-25 01:00:16,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1221822.0, ans=0.0 2023-06-25 01:00:57,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1221882.0, ans=0.125 2023-06-25 01:01:08,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1221942.0, ans=0.125 2023-06-25 01:01:14,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1221942.0, ans=0.0 2023-06-25 01:01:32,844 INFO [train.py:996] (1/4) Epoch 7, batch 20700, loss[loss=0.2112, simple_loss=0.301, pruned_loss=0.06067, over 21722.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2912, pruned_loss=0.07214, over 4246873.52 frames. ], batch size: 332, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:01:33,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1222002.0, ans=0.1 2023-06-25 01:01:42,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1222002.0, ans=0.125 2023-06-25 01:01:48,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1222062.0, ans=0.125 2023-06-25 01:01:53,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1222062.0, ans=0.2 2023-06-25 01:02:16,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1222122.0, ans=0.125 2023-06-25 01:02:32,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1222122.0, ans=0.125 2023-06-25 01:03:06,666 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.936e+02 3.801e+02 5.565e+02 1.085e+03, threshold=7.602e+02, percent-clipped=14.0 2023-06-25 01:03:19,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1222242.0, ans=0.0 2023-06-25 01:03:24,055 INFO [train.py:996] (1/4) Epoch 7, batch 20750, loss[loss=0.2622, simple_loss=0.3859, pruned_loss=0.06925, over 20752.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2932, pruned_loss=0.07093, over 4250455.44 frames. 
], batch size: 607, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:04:19,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1222422.0, ans=0.125 2023-06-25 01:04:23,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1222422.0, ans=0.125 2023-06-25 01:04:23,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1222422.0, ans=0.0 2023-06-25 01:04:26,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1222482.0, ans=0.0 2023-06-25 01:04:38,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1222482.0, ans=0.125 2023-06-25 01:04:50,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1222542.0, ans=0.125 2023-06-25 01:04:56,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1222542.0, ans=0.125 2023-06-25 01:05:03,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0 2023-06-25 01:05:07,446 INFO [train.py:996] (1/4) Epoch 7, batch 20800, loss[loss=0.2101, simple_loss=0.2759, pruned_loss=0.07219, over 21821.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2952, pruned_loss=0.07118, over 4245629.87 frames. ], batch size: 372, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:05:08,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1222602.0, ans=0.125 2023-06-25 01:05:13,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-25 01:05:20,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1222602.0, ans=0.2 2023-06-25 01:05:34,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-25 01:05:47,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1222662.0, ans=0.07 2023-06-25 01:06:43,829 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.312e+02 4.339e+02 6.808e+02 1.439e+03, threshold=8.678e+02, percent-clipped=19.0 2023-06-25 01:06:46,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1222842.0, ans=0.125 2023-06-25 01:06:55,804 INFO [train.py:996] (1/4) Epoch 7, batch 20850, loss[loss=0.2081, simple_loss=0.2851, pruned_loss=0.06556, over 21842.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.291, pruned_loss=0.06998, over 4249387.35 frames. 
], batch size: 371, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:07:01,369 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:07:49,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1223022.0, ans=0.0 2023-06-25 01:08:04,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.34 vs. limit=8.0 2023-06-25 01:08:29,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1223142.0, ans=0.0 2023-06-25 01:08:33,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1223142.0, ans=0.1 2023-06-25 01:08:44,705 INFO [train.py:996] (1/4) Epoch 7, batch 20900, loss[loss=0.2785, simple_loss=0.3974, pruned_loss=0.07985, over 19821.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2921, pruned_loss=0.07095, over 4254773.08 frames. ], batch size: 702, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:09:22,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1223322.0, ans=0.125 2023-06-25 01:09:41,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1223322.0, ans=0.1 2023-06-25 01:09:58,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1223382.0, ans=0.125 2023-06-25 01:10:03,613 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:10:14,246 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:10:19,932 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.894e+02 3.467e+02 4.402e+02 7.475e+02, threshold=6.935e+02, percent-clipped=1.0 2023-06-25 01:10:30,273 INFO [train.py:996] (1/4) Epoch 7, batch 20950, loss[loss=0.2031, simple_loss=0.2798, pruned_loss=0.06316, over 21849.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2868, pruned_loss=0.0674, over 4264909.30 frames. ], batch size: 371, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:10:43,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-25 01:11:46,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1223682.0, ans=0.2 2023-06-25 01:12:09,737 INFO [train.py:996] (1/4) Epoch 7, batch 21000, loss[loss=0.2334, simple_loss=0.3015, pruned_loss=0.08262, over 21195.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2856, pruned_loss=0.06841, over 4273139.66 frames. 
], batch size: 143, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:12:09,737 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 01:12:23,957 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3757, 2.1834, 4.0624, 3.8629], device='cuda:1') 2023-06-25 01:12:24,580 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.5393, 4.0802, 3.6221, 2.6078], device='cuda:1') 2023-06-25 01:12:27,617 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2666, simple_loss=0.3633, pruned_loss=0.08493, over 1796401.00 frames. 2023-06-25 01:12:27,618 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 01:12:59,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1223862.0, ans=0.125 2023-06-25 01:13:28,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1223922.0, ans=0.0 2023-06-25 01:13:48,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1223982.0, ans=0.125 2023-06-25 01:13:58,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1224042.0, ans=0.0 2023-06-25 01:14:06,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.703e+02 3.087e+02 3.976e+02 6.503e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-25 01:14:10,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1224042.0, ans=0.0 2023-06-25 01:14:17,186 INFO [train.py:996] (1/4) Epoch 7, batch 21050, loss[loss=0.1993, simple_loss=0.2719, pruned_loss=0.06331, over 21858.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2845, pruned_loss=0.06902, over 4278059.00 frames. ], batch size: 118, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:14:23,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-25 01:14:31,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1224102.0, ans=0.125 2023-06-25 01:15:32,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.54 vs. limit=15.0 2023-06-25 01:15:54,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1224342.0, ans=0.125 2023-06-25 01:16:05,182 INFO [train.py:996] (1/4) Epoch 7, batch 21100, loss[loss=0.1874, simple_loss=0.2569, pruned_loss=0.05897, over 21816.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2815, pruned_loss=0.06879, over 4279445.56 frames. 
], batch size: 352, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:16:05,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1224402.0, ans=0.125 2023-06-25 01:16:19,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1224402.0, ans=0.125 2023-06-25 01:16:28,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1224462.0, ans=0.1 2023-06-25 01:17:42,062 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.657e+02 3.143e+02 4.101e+02 9.163e+02, threshold=6.287e+02, percent-clipped=4.0 2023-06-25 01:17:52,619 INFO [train.py:996] (1/4) Epoch 7, batch 21150, loss[loss=0.2406, simple_loss=0.2854, pruned_loss=0.09791, over 21330.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2787, pruned_loss=0.06952, over 4277477.13 frames. ], batch size: 473, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:19:20,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1224942.0, ans=0.125 2023-06-25 01:19:39,300 INFO [train.py:996] (1/4) Epoch 7, batch 21200, loss[loss=0.2082, simple_loss=0.2674, pruned_loss=0.07447, over 21269.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2748, pruned_loss=0.06853, over 4269160.94 frames. ], batch size: 144, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:20:32,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1225122.0, ans=0.125 2023-06-25 01:21:17,800 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.659e+02 3.125e+02 3.870e+02 6.186e+02, threshold=6.250e+02, percent-clipped=0.0 2023-06-25 01:21:28,367 INFO [train.py:996] (1/4) Epoch 7, batch 21250, loss[loss=0.1919, simple_loss=0.2613, pruned_loss=0.06129, over 21589.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2735, pruned_loss=0.06835, over 4273327.01 frames. ], batch size: 263, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:21:28,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1225302.0, ans=0.2 2023-06-25 01:21:51,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1225362.0, ans=0.0 2023-06-25 01:23:15,832 INFO [train.py:996] (1/4) Epoch 7, batch 21300, loss[loss=0.176, simple_loss=0.243, pruned_loss=0.05449, over 21267.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2808, pruned_loss=0.07099, over 4275165.43 frames. ], batch size: 176, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:23:27,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.51 vs. 
limit=15.0 2023-06-25 01:23:46,410 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:23:46,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1225662.0, ans=0.025 2023-06-25 01:24:31,630 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:24:42,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1225782.0, ans=0.125 2023-06-25 01:24:55,314 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 2.894e+02 3.300e+02 4.575e+02 9.382e+02, threshold=6.600e+02, percent-clipped=9.0 2023-06-25 01:25:04,015 INFO [train.py:996] (1/4) Epoch 7, batch 21350, loss[loss=0.1777, simple_loss=0.2573, pruned_loss=0.04909, over 16718.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2848, pruned_loss=0.07196, over 4272711.31 frames. ], batch size: 62, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:25:43,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1225962.0, ans=0.125 2023-06-25 01:26:51,934 INFO [train.py:996] (1/4) Epoch 7, batch 21400, loss[loss=0.2977, simple_loss=0.3636, pruned_loss=0.116, over 21408.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2885, pruned_loss=0.07161, over 4281330.06 frames. ], batch size: 471, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:27:57,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 01:28:08,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.88 vs. limit=10.0 2023-06-25 01:28:23,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1226442.0, ans=0.1 2023-06-25 01:28:28,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1226442.0, ans=0.0 2023-06-25 01:28:31,714 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.088e+02 4.012e+02 5.119e+02 7.296e+02, threshold=8.024e+02, percent-clipped=4.0 2023-06-25 01:28:40,329 INFO [train.py:996] (1/4) Epoch 7, batch 21450, loss[loss=0.2039, simple_loss=0.2728, pruned_loss=0.06755, over 21812.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2916, pruned_loss=0.07269, over 4279697.38 frames. ], batch size: 247, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:28:45,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-25 01:29:44,358 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. 
limit=15.0 2023-06-25 01:30:11,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1226742.0, ans=0.125 2023-06-25 01:30:18,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1226742.0, ans=0.0 2023-06-25 01:30:21,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1226742.0, ans=0.0 2023-06-25 01:30:28,719 INFO [train.py:996] (1/4) Epoch 7, batch 21500, loss[loss=0.2108, simple_loss=0.2732, pruned_loss=0.07424, over 21881.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2908, pruned_loss=0.07362, over 4260804.52 frames. ], batch size: 373, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:30:34,514 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-25 01:30:59,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1226862.0, ans=0.0 2023-06-25 01:31:41,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1226982.0, ans=0.95 2023-06-25 01:31:53,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1226982.0, ans=0.0 2023-06-25 01:31:59,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1227042.0, ans=0.125 2023-06-25 01:32:06,009 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.889e+02 3.383e+02 4.228e+02 8.142e+02, threshold=6.766e+02, percent-clipped=1.0 2023-06-25 01:32:08,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1227042.0, ans=0.0 2023-06-25 01:32:14,646 INFO [train.py:996] (1/4) Epoch 7, batch 21550, loss[loss=0.1783, simple_loss=0.2431, pruned_loss=0.05675, over 21474.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2845, pruned_loss=0.07094, over 4261407.05 frames. ], batch size: 212, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:32:50,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1227162.0, ans=0.0 2023-06-25 01:33:45,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1227342.0, ans=0.025 2023-06-25 01:33:52,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1227342.0, ans=0.1 2023-06-25 01:33:59,005 INFO [train.py:996] (1/4) Epoch 7, batch 21600, loss[loss=0.1952, simple_loss=0.2641, pruned_loss=0.06313, over 21332.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2826, pruned_loss=0.07004, over 4255704.15 frames. ], batch size: 144, lr: 4.24e-03, grad_scale: 32.0 2023-06-25 01:34:26,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1227462.0, ans=0.125 2023-06-25 01:34:41,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1227462.0, ans=0.0 2023-06-25 01:35:04,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.47 vs. 
limit=15.0 2023-06-25 01:35:40,096 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.809e+02 3.415e+02 4.856e+02 1.279e+03, threshold=6.830e+02, percent-clipped=8.0 2023-06-25 01:35:53,429 INFO [train.py:996] (1/4) Epoch 7, batch 21650, loss[loss=0.1987, simple_loss=0.2949, pruned_loss=0.0512, over 21633.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2851, pruned_loss=0.06815, over 4256357.57 frames. ], batch size: 230, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:36:12,308 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:36:24,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1227762.0, ans=0.0 2023-06-25 01:37:08,804 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.01 vs. limit=5.0 2023-06-25 01:37:13,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-25 01:37:34,936 INFO [train.py:996] (1/4) Epoch 7, batch 21700, loss[loss=0.1967, simple_loss=0.268, pruned_loss=0.0627, over 21644.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2842, pruned_loss=0.06585, over 4265379.57 frames. ], batch size: 332, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:37:55,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1228002.0, ans=0.125 2023-06-25 01:38:45,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1228182.0, ans=0.125 2023-06-25 01:38:50,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1228182.0, ans=0.0 2023-06-25 01:39:07,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1228242.0, ans=0.125 2023-06-25 01:39:14,325 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.013e+02 3.692e+02 5.814e+02 1.203e+03, threshold=7.384e+02, percent-clipped=13.0 2023-06-25 01:39:16,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1228242.0, ans=0.1 2023-06-25 01:39:21,000 INFO [train.py:996] (1/4) Epoch 7, batch 21750, loss[loss=0.1949, simple_loss=0.254, pruned_loss=0.06792, over 21488.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2802, pruned_loss=0.06632, over 4267178.35 frames. ], batch size: 212, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:39:35,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1228302.0, ans=0.05 2023-06-25 01:41:08,660 INFO [train.py:996] (1/4) Epoch 7, batch 21800, loss[loss=0.2219, simple_loss=0.2892, pruned_loss=0.07735, over 21320.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2785, pruned_loss=0.0671, over 4267825.28 frames. 
], batch size: 160, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:41:21,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1228602.0, ans=0.125 2023-06-25 01:42:45,918 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.194e+02 4.069e+02 5.190e+02 9.750e+02, threshold=8.138e+02, percent-clipped=3.0 2023-06-25 01:42:53,028 INFO [train.py:996] (1/4) Epoch 7, batch 21850, loss[loss=0.1904, simple_loss=0.2559, pruned_loss=0.06246, over 21596.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2823, pruned_loss=0.06718, over 4259694.08 frames. ], batch size: 263, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:43:13,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1228902.0, ans=0.125 2023-06-25 01:43:17,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1228902.0, ans=0.125 2023-06-25 01:43:36,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1229022.0, ans=0.125 2023-06-25 01:44:09,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1229082.0, ans=0.0 2023-06-25 01:44:11,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1229082.0, ans=0.0 2023-06-25 01:44:44,858 INFO [train.py:996] (1/4) Epoch 7, batch 21900, loss[loss=0.2212, simple_loss=0.3006, pruned_loss=0.07086, over 20059.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2842, pruned_loss=0.0687, over 4267569.57 frames. ], batch size: 702, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:45:10,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1229262.0, ans=0.125 2023-06-25 01:45:17,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1229262.0, ans=0.1 2023-06-25 01:45:32,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1229322.0, ans=0.09899494936611666 2023-06-25 01:45:55,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1229382.0, ans=0.0 2023-06-25 01:46:00,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=12.0 2023-06-25 01:46:19,671 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 2.991e+02 3.581e+02 4.789e+02 1.002e+03, threshold=7.161e+02, percent-clipped=1.0 2023-06-25 01:46:29,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1229502.0, ans=0.0 2023-06-25 01:46:31,072 INFO [train.py:996] (1/4) Epoch 7, batch 21950, loss[loss=0.1481, simple_loss=0.2325, pruned_loss=0.0318, over 21713.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2794, pruned_loss=0.06757, over 4268464.84 frames. 
], batch size: 282, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:46:38,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1229502.0, ans=0.1 2023-06-25 01:47:30,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1229622.0, ans=0.125 2023-06-25 01:47:43,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1229682.0, ans=0.0 2023-06-25 01:48:26,815 INFO [train.py:996] (1/4) Epoch 7, batch 22000, loss[loss=0.1957, simple_loss=0.2686, pruned_loss=0.06138, over 21798.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2742, pruned_loss=0.06493, over 4274147.25 frames. ], batch size: 352, lr: 4.24e-03, grad_scale: 32.0 2023-06-25 01:48:29,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1229802.0, ans=0.125 2023-06-25 01:49:39,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229982.0, ans=0.1 2023-06-25 01:50:07,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1230042.0, ans=0.125 2023-06-25 01:50:12,192 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 3.193e+02 3.853e+02 5.102e+02 1.201e+03, threshold=7.707e+02, percent-clipped=7.0 2023-06-25 01:50:17,729 INFO [train.py:996] (1/4) Epoch 7, batch 22050, loss[loss=0.2558, simple_loss=0.3242, pruned_loss=0.09365, over 21307.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.277, pruned_loss=0.06568, over 4261000.00 frames. ], batch size: 159, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:51:26,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1230282.0, ans=0.125 2023-06-25 01:51:50,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1230342.0, ans=0.1 2023-06-25 01:51:57,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1230342.0, ans=0.1 2023-06-25 01:52:06,968 INFO [train.py:996] (1/4) Epoch 7, batch 22100, loss[loss=0.2819, simple_loss=0.3493, pruned_loss=0.1072, over 21765.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2878, pruned_loss=0.07048, over 4270079.88 frames. ], batch size: 441, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:52:26,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1230402.0, ans=0.0 2023-06-25 01:52:59,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1230522.0, ans=0.125 2023-06-25 01:53:05,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-25 01:53:11,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1230582.0, ans=0.125 2023-06-25 01:53:12,094 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. 
limit=15.0 2023-06-25 01:53:12,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-25 01:53:49,131 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.415e+02 4.118e+02 5.475e+02 8.069e+02, threshold=8.235e+02, percent-clipped=4.0 2023-06-25 01:53:54,219 INFO [train.py:996] (1/4) Epoch 7, batch 22150, loss[loss=0.1984, simple_loss=0.2859, pruned_loss=0.05544, over 21818.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2913, pruned_loss=0.07294, over 4280917.31 frames. ], batch size: 282, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:54:05,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1230702.0, ans=0.0 2023-06-25 01:54:43,848 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:54:50,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1230822.0, ans=0.125 2023-06-25 01:55:04,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. limit=10.0 2023-06-25 01:55:32,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-25 01:55:35,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-25 01:55:41,159 INFO [train.py:996] (1/4) Epoch 7, batch 22200, loss[loss=0.2186, simple_loss=0.3181, pruned_loss=0.05951, over 21822.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2933, pruned_loss=0.07342, over 4292291.65 frames. ], batch size: 282, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:56:24,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1231062.0, ans=0.05 2023-06-25 01:57:25,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.120e+02 3.891e+02 5.411e+02 1.488e+03, threshold=7.782e+02, percent-clipped=8.0 2023-06-25 01:57:31,130 INFO [train.py:996] (1/4) Epoch 7, batch 22250, loss[loss=0.2231, simple_loss=0.3067, pruned_loss=0.06975, over 21496.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3, pruned_loss=0.07514, over 4286342.45 frames. ], batch size: 194, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:57:44,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0 2023-06-25 01:57:45,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-25 01:58:37,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1231482.0, ans=10.0 2023-06-25 01:58:42,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1231482.0, ans=0.125 2023-06-25 01:59:18,343 INFO [train.py:996] (1/4) Epoch 7, batch 22300, loss[loss=0.2046, simple_loss=0.2629, pruned_loss=0.07314, over 21229.00 frames. 
], tot_loss[loss=0.2269, simple_loss=0.3009, pruned_loss=0.07647, over 4281655.49 frames. ], batch size: 608, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 01:59:39,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1231662.0, ans=0.0 2023-06-25 02:00:05,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1231722.0, ans=0.125 2023-06-25 02:00:40,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231782.0, ans=0.1 2023-06-25 02:01:00,002 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.515e+02 3.143e+02 3.997e+02 5.587e+02 8.969e+02, threshold=7.995e+02, percent-clipped=6.0 2023-06-25 02:01:00,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1231842.0, ans=0.125 2023-06-25 02:01:10,873 INFO [train.py:996] (1/4) Epoch 7, batch 22350, loss[loss=0.2269, simple_loss=0.297, pruned_loss=0.07843, over 21821.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2996, pruned_loss=0.07761, over 4287676.07 frames. ], batch size: 414, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:01:18,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1231902.0, ans=0.125 2023-06-25 02:01:26,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.85 vs. limit=22.5 2023-06-25 02:01:34,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.75 vs. limit=10.0 2023-06-25 02:01:43,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1231962.0, ans=0.125 2023-06-25 02:01:48,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1231962.0, ans=0.125 2023-06-25 02:01:57,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1232022.0, ans=0.125 2023-06-25 02:01:57,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1232022.0, ans=0.2 2023-06-25 02:01:58,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1232022.0, ans=0.0 2023-06-25 02:02:29,348 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.32 vs. limit=15.0 2023-06-25 02:02:59,812 INFO [train.py:996] (1/4) Epoch 7, batch 22400, loss[loss=0.2172, simple_loss=0.2868, pruned_loss=0.07377, over 21480.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2968, pruned_loss=0.07423, over 4283578.06 frames. ], batch size: 389, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:03:00,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1232202.0, ans=0.125 2023-06-25 02:03:41,514 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. 
limit=6.0 2023-06-25 02:04:01,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1232322.0, ans=0.1 2023-06-25 02:04:42,722 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.737e+02 3.177e+02 4.252e+02 6.969e+02, threshold=6.354e+02, percent-clipped=0.0 2023-06-25 02:04:47,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1232502.0, ans=0.2 2023-06-25 02:04:48,408 INFO [train.py:996] (1/4) Epoch 7, batch 22450, loss[loss=0.2218, simple_loss=0.2692, pruned_loss=0.08716, over 21333.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.291, pruned_loss=0.07371, over 4275994.50 frames. ], batch size: 473, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:05:23,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.59 vs. limit=15.0 2023-06-25 02:06:05,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1232682.0, ans=0.035 2023-06-25 02:06:12,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1232682.0, ans=0.2 2023-06-25 02:06:43,941 INFO [train.py:996] (1/4) Epoch 7, batch 22500, loss[loss=0.2082, simple_loss=0.2669, pruned_loss=0.0747, over 16594.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.286, pruned_loss=0.07246, over 4275302.20 frames. ], batch size: 69, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:06:49,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1232802.0, ans=0.125 2023-06-25 02:06:53,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-25 02:07:01,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1232862.0, ans=0.2 2023-06-25 02:07:35,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1232922.0, ans=0.0 2023-06-25 02:07:41,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1232922.0, ans=0.1 2023-06-25 02:08:11,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.73 vs. limit=22.5 2023-06-25 02:08:17,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1233042.0, ans=0.125 2023-06-25 02:08:22,743 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 2.995e+02 3.831e+02 4.510e+02 7.998e+02, threshold=7.663e+02, percent-clipped=9.0 2023-06-25 02:08:32,941 INFO [train.py:996] (1/4) Epoch 7, batch 22550, loss[loss=0.2428, simple_loss=0.3175, pruned_loss=0.08411, over 22000.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2892, pruned_loss=0.07217, over 4278012.72 frames. 
], batch size: 113, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:08:56,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1233162.0, ans=0.1 2023-06-25 02:09:55,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1233282.0, ans=0.0 2023-06-25 02:10:23,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-25 02:10:23,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-25 02:10:24,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1233402.0, ans=0.0 2023-06-25 02:10:25,392 INFO [train.py:996] (1/4) Epoch 7, batch 22600, loss[loss=0.1882, simple_loss=0.2543, pruned_loss=0.06108, over 21309.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2921, pruned_loss=0.0729, over 4272376.02 frames. ], batch size: 131, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:10:31,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1233402.0, ans=0.2 2023-06-25 02:10:52,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1233462.0, ans=0.125 2023-06-25 02:11:04,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1233462.0, ans=0.0 2023-06-25 02:12:06,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1233642.0, ans=0.125 2023-06-25 02:12:08,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=12.0 2023-06-25 02:12:10,426 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.222e+02 3.850e+02 5.288e+02 1.031e+03, threshold=7.700e+02, percent-clipped=4.0 2023-06-25 02:12:14,432 INFO [train.py:996] (1/4) Epoch 7, batch 22650, loss[loss=0.181, simple_loss=0.2557, pruned_loss=0.05311, over 21783.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2896, pruned_loss=0.07197, over 4274631.28 frames. ], batch size: 118, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:12:39,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.08 vs. limit=6.0 2023-06-25 02:12:42,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1233762.0, ans=0.035 2023-06-25 02:13:06,163 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-25 02:13:27,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1233882.0, ans=0.125 2023-06-25 02:13:58,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1233942.0, ans=0.05 2023-06-25 02:14:01,939 INFO [train.py:996] (1/4) Epoch 7, batch 22700, loss[loss=0.2127, simple_loss=0.2627, pruned_loss=0.08138, over 21436.00 frames. 
], tot_loss[loss=0.2137, simple_loss=0.2852, pruned_loss=0.07111, over 4269199.26 frames. ], batch size: 476, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:14:38,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1234062.0, ans=0.0 2023-06-25 02:14:41,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1234062.0, ans=0.125 2023-06-25 02:14:46,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1234122.0, ans=0.125 2023-06-25 02:15:33,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1234242.0, ans=0.125 2023-06-25 02:15:46,860 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.300e+02 4.052e+02 5.642e+02 1.079e+03, threshold=8.104e+02, percent-clipped=7.0 2023-06-25 02:15:49,894 INFO [train.py:996] (1/4) Epoch 7, batch 22750, loss[loss=0.2015, simple_loss=0.2642, pruned_loss=0.06941, over 21729.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2876, pruned_loss=0.07298, over 4275086.53 frames. ], batch size: 300, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:17:35,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1234602.0, ans=0.125 2023-06-25 02:17:36,746 INFO [train.py:996] (1/4) Epoch 7, batch 22800, loss[loss=0.2403, simple_loss=0.3085, pruned_loss=0.08609, over 21732.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2896, pruned_loss=0.07479, over 4285140.23 frames. ], batch size: 389, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:18:15,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1234662.0, ans=0.2 2023-06-25 02:18:38,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1234722.0, ans=0.0 2023-06-25 02:18:38,745 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. limit=15.0 2023-06-25 02:18:39,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1234722.0, ans=0.125 2023-06-25 02:18:43,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1234782.0, ans=0.125 2023-06-25 02:19:01,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1234782.0, ans=0.2 2023-06-25 02:19:23,105 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.142e+02 3.789e+02 4.718e+02 7.259e+02, threshold=7.578e+02, percent-clipped=0.0 2023-06-25 02:19:25,135 INFO [train.py:996] (1/4) Epoch 7, batch 22850, loss[loss=0.2237, simple_loss=0.2884, pruned_loss=0.07951, over 21760.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2868, pruned_loss=0.07484, over 4276438.80 frames. 
], batch size: 371, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:20:14,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1235022.0, ans=0.1 2023-06-25 02:20:24,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1235022.0, ans=0.125 2023-06-25 02:20:26,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1235082.0, ans=0.125 2023-06-25 02:20:53,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1235142.0, ans=0.0 2023-06-25 02:20:54,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-25 02:21:08,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1235202.0, ans=0.125 2023-06-25 02:21:09,505 INFO [train.py:996] (1/4) Epoch 7, batch 22900, loss[loss=0.2083, simple_loss=0.2947, pruned_loss=0.06101, over 21352.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2879, pruned_loss=0.07382, over 4257380.69 frames. ], batch size: 176, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:21:10,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1235202.0, ans=15.0 2023-06-25 02:21:15,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1235202.0, ans=0.0 2023-06-25 02:22:33,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1235382.0, ans=0.125 2023-06-25 02:23:04,025 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.455e+02 4.744e+02 6.371e+02 1.430e+03, threshold=9.487e+02, percent-clipped=13.0 2023-06-25 02:23:05,581 INFO [train.py:996] (1/4) Epoch 7, batch 22950, loss[loss=0.2288, simple_loss=0.3309, pruned_loss=0.06337, over 21553.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3022, pruned_loss=0.07302, over 4262202.00 frames. ], batch size: 195, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:23:14,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1235502.0, ans=0.0 2023-06-25 02:23:21,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1235562.0, ans=0.0 2023-06-25 02:24:05,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1235622.0, ans=0.125 2023-06-25 02:24:53,114 INFO [train.py:996] (1/4) Epoch 7, batch 23000, loss[loss=0.2239, simple_loss=0.3513, pruned_loss=0.04827, over 20770.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3009, pruned_loss=0.07131, over 4262492.73 frames. 
], batch size: 607, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:26:02,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1235982.0, ans=0.125 2023-06-25 02:26:40,403 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.049e+02 3.858e+02 4.759e+02 9.781e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-25 02:26:42,793 INFO [train.py:996] (1/4) Epoch 7, batch 23050, loss[loss=0.251, simple_loss=0.3174, pruned_loss=0.09228, over 21369.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3013, pruned_loss=0.07317, over 4264409.18 frames. ], batch size: 176, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:27:43,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1236222.0, ans=0.125 2023-06-25 02:28:31,616 INFO [train.py:996] (1/4) Epoch 7, batch 23100, loss[loss=0.194, simple_loss=0.258, pruned_loss=0.065, over 21622.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2985, pruned_loss=0.07338, over 4272751.18 frames. ], batch size: 298, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:28:52,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1236402.0, ans=0.025 2023-06-25 02:29:16,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1236522.0, ans=0.125 2023-06-25 02:29:21,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1236522.0, ans=0.125 2023-06-25 02:29:24,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1236522.0, ans=0.125 2023-06-25 02:29:26,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-06-25 02:29:27,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236522.0, ans=0.1 2023-06-25 02:30:16,715 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.056e+02 3.591e+02 4.604e+02 9.748e+02, threshold=7.182e+02, percent-clipped=1.0 2023-06-25 02:30:18,304 INFO [train.py:996] (1/4) Epoch 7, batch 23150, loss[loss=0.1988, simple_loss=0.2728, pruned_loss=0.0624, over 21677.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2938, pruned_loss=0.07297, over 4280163.34 frames. 
], batch size: 263, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:30:31,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1236702.0, ans=0.0 2023-06-25 02:30:49,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236762.0, ans=0.1 2023-06-25 02:31:07,030 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:32:00,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1236942.0, ans=0.125 2023-06-25 02:32:02,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1237002.0, ans=0.125 2023-06-25 02:32:03,913 INFO [train.py:996] (1/4) Epoch 7, batch 23200, loss[loss=0.2237, simple_loss=0.284, pruned_loss=0.08175, over 21588.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2932, pruned_loss=0.07384, over 4286372.51 frames. ], batch size: 548, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:32:25,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1237062.0, ans=0.0 2023-06-25 02:33:07,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1237182.0, ans=0.125 2023-06-25 02:33:11,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1237182.0, ans=0.125 2023-06-25 02:33:52,454 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.126e+02 3.728e+02 5.060e+02 1.069e+03, threshold=7.456e+02, percent-clipped=4.0 2023-06-25 02:33:52,485 INFO [train.py:996] (1/4) Epoch 7, batch 23250, loss[loss=0.221, simple_loss=0.2898, pruned_loss=0.07607, over 21826.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2924, pruned_loss=0.07375, over 4285881.49 frames. ], batch size: 332, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:34:03,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1237302.0, ans=0.0 2023-06-25 02:34:51,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0 2023-06-25 02:35:43,501 INFO [train.py:996] (1/4) Epoch 7, batch 23300, loss[loss=0.2689, simple_loss=0.3747, pruned_loss=0.08155, over 21252.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3002, pruned_loss=0.07554, over 4286608.52 frames. 
], batch size: 548, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:35:47,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1237602.0, ans=0.0 2023-06-25 02:36:21,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1237662.0, ans=0.125 2023-06-25 02:36:23,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1237662.0, ans=0.0 2023-06-25 02:37:17,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1237842.0, ans=0.125 2023-06-25 02:37:39,006 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.209e+02 3.833e+02 5.523e+02 1.342e+03, threshold=7.666e+02, percent-clipped=15.0 2023-06-25 02:37:39,038 INFO [train.py:996] (1/4) Epoch 7, batch 23350, loss[loss=0.1995, simple_loss=0.2548, pruned_loss=0.07207, over 20257.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.304, pruned_loss=0.07549, over 4281037.16 frames. ], batch size: 702, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:39:23,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-25 02:39:33,714 INFO [train.py:996] (1/4) Epoch 7, batch 23400, loss[loss=0.2323, simple_loss=0.3019, pruned_loss=0.08139, over 21819.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2964, pruned_loss=0.07134, over 4283322.04 frames. ], batch size: 124, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:39:51,555 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=22.5 2023-06-25 02:39:54,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1238262.0, ans=0.0 2023-06-25 02:40:48,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1238382.0, ans=0.015 2023-06-25 02:41:18,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1238442.0, ans=0.0 2023-06-25 02:41:23,136 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 3.153e+02 4.336e+02 5.410e+02 1.099e+03, threshold=8.672e+02, percent-clipped=12.0 2023-06-25 02:41:23,168 INFO [train.py:996] (1/4) Epoch 7, batch 23450, loss[loss=0.243, simple_loss=0.3145, pruned_loss=0.08578, over 21702.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2986, pruned_loss=0.07352, over 4280308.97 frames. 
], batch size: 351, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:41:30,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1238502.0, ans=0.125 2023-06-25 02:41:34,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1238502.0, ans=0.0 2023-06-25 02:41:55,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1238562.0, ans=0.125 2023-06-25 02:42:00,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1238562.0, ans=0.0 2023-06-25 02:42:06,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1238622.0, ans=0.125 2023-06-25 02:42:55,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1238742.0, ans=0.125 2023-06-25 02:42:59,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1238742.0, ans=0.125 2023-06-25 02:43:06,182 INFO [train.py:996] (1/4) Epoch 7, batch 23500, loss[loss=0.2456, simple_loss=0.3136, pruned_loss=0.08876, over 21506.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3005, pruned_loss=0.07577, over 4280441.77 frames. ], batch size: 211, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:44:11,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1238982.0, ans=0.125 2023-06-25 02:44:25,161 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0 2023-06-25 02:44:53,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 2.970e+02 3.465e+02 4.227e+02 7.885e+02, threshold=6.930e+02, percent-clipped=0.0 2023-06-25 02:44:53,901 INFO [train.py:996] (1/4) Epoch 7, batch 23550, loss[loss=0.2028, simple_loss=0.2715, pruned_loss=0.06704, over 21817.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2957, pruned_loss=0.07597, over 4274788.52 frames. ], batch size: 98, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:46:42,569 INFO [train.py:996] (1/4) Epoch 7, batch 23600, loss[loss=0.2281, simple_loss=0.3081, pruned_loss=0.0741, over 21790.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2943, pruned_loss=0.07542, over 4268640.28 frames. ], batch size: 332, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 02:46:50,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1239402.0, ans=0.125 2023-06-25 02:46:52,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1239402.0, ans=0.0 2023-06-25 02:46:56,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1239402.0, ans=0.0 2023-06-25 02:47:00,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. 
limit=15.0 2023-06-25 02:47:08,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1239462.0, ans=0.125 2023-06-25 02:47:23,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1239522.0, ans=0.125 2023-06-25 02:47:26,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1239522.0, ans=0.125 2023-06-25 02:47:59,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1239582.0, ans=0.125 2023-06-25 02:47:59,744 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=12.0 2023-06-25 02:48:28,065 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.161e+02 4.117e+02 5.105e+02 1.053e+03, threshold=8.234e+02, percent-clipped=8.0 2023-06-25 02:48:28,096 INFO [train.py:996] (1/4) Epoch 7, batch 23650, loss[loss=0.2766, simple_loss=0.3474, pruned_loss=0.1029, over 21451.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.295, pruned_loss=0.07392, over 4275247.65 frames. ], batch size: 471, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 02:48:30,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1239702.0, ans=0.125 2023-06-25 02:48:37,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.56 vs. limit=22.5 2023-06-25 02:49:22,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1239822.0, ans=0.125 2023-06-25 02:49:30,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-25 02:49:44,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1239882.0, ans=0.2 2023-06-25 02:49:46,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.77 vs. limit=6.0 2023-06-25 02:50:17,006 INFO [train.py:996] (1/4) Epoch 7, batch 23700, loss[loss=0.2486, simple_loss=0.3253, pruned_loss=0.08595, over 21424.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2974, pruned_loss=0.0738, over 4278859.14 frames. ], batch size: 131, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:51:05,323 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 02:51:37,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1240182.0, ans=0.125 2023-06-25 02:52:12,606 INFO [train.py:996] (1/4) Epoch 7, batch 23750, loss[loss=0.1797, simple_loss=0.2747, pruned_loss=0.0424, over 21264.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2995, pruned_loss=0.07371, over 4281800.09 frames. 
], batch size: 176, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:52:14,421 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 3.374e+02 3.894e+02 5.027e+02 8.477e+02, threshold=7.788e+02, percent-clipped=1.0 2023-06-25 02:52:20,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1240302.0, ans=0.125 2023-06-25 02:53:06,921 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:53:28,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1240482.0, ans=0.125 2023-06-25 02:53:34,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-25 02:54:03,169 INFO [train.py:996] (1/4) Epoch 7, batch 23800, loss[loss=0.2544, simple_loss=0.3499, pruned_loss=0.07948, over 21669.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2988, pruned_loss=0.07171, over 4280786.24 frames. ], batch size: 298, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:54:42,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1240662.0, ans=0.125 2023-06-25 02:54:46,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1240662.0, ans=0.0 2023-06-25 02:55:05,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1240722.0, ans=0.04949747468305833 2023-06-25 02:55:05,368 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:55:25,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1240782.0, ans=0.0 2023-06-25 02:56:06,035 INFO [train.py:996] (1/4) Epoch 7, batch 23850, loss[loss=0.261, simple_loss=0.3356, pruned_loss=0.09327, over 21207.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3064, pruned_loss=0.07318, over 4280406.56 frames. ], batch size: 143, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:56:07,965 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.127e+02 4.092e+02 4.859e+02 9.689e+02, threshold=8.184e+02, percent-clipped=5.0 2023-06-25 02:57:55,508 INFO [train.py:996] (1/4) Epoch 7, batch 23900, loss[loss=0.2169, simple_loss=0.2985, pruned_loss=0.06767, over 21704.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3123, pruned_loss=0.07573, over 4283106.35 frames. 
], batch size: 282, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:57:55,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1241202.0, ans=0.125 2023-06-25 02:58:12,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1241262.0, ans=0.025 2023-06-25 02:58:18,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1241262.0, ans=0.125 2023-06-25 02:58:32,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1241322.0, ans=0.2 2023-06-25 02:58:37,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1241322.0, ans=0.125 2023-06-25 02:59:06,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1241382.0, ans=0.2 2023-06-25 02:59:38,297 INFO [train.py:996] (1/4) Epoch 7, batch 23950, loss[loss=0.2487, simple_loss=0.3258, pruned_loss=0.08583, over 21171.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3069, pruned_loss=0.07523, over 4278081.13 frames. ], batch size: 143, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:59:39,936 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.625e+02 3.372e+02 4.562e+02 5.557e+02 1.074e+03, threshold=9.124e+02, percent-clipped=7.0 2023-06-25 02:59:44,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1241502.0, ans=0.1 2023-06-25 02:59:45,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-25 03:00:28,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1241622.0, ans=0.125 2023-06-25 03:00:54,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1241682.0, ans=0.0 2023-06-25 03:01:27,365 INFO [train.py:996] (1/4) Epoch 7, batch 24000, loss[loss=0.2369, simple_loss=0.3131, pruned_loss=0.08032, over 21809.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3073, pruned_loss=0.07813, over 4277572.23 frames. ], batch size: 282, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 03:01:27,366 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 03:01:41,495 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.6998, 4.2778, 4.3785, 3.5948], device='cuda:1') 2023-06-25 03:01:45,549 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2668, simple_loss=0.3629, pruned_loss=0.0854, over 1796401.00 frames. 2023-06-25 03:01:45,550 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 03:01:54,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1241802.0, ans=0.125 2023-06-25 03:01:55,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.65 vs. 
limit=22.5 2023-06-25 03:02:29,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1241862.0, ans=0.0 2023-06-25 03:02:33,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1241922.0, ans=0.125 2023-06-25 03:03:22,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1242042.0, ans=0.125 2023-06-25 03:03:35,954 INFO [train.py:996] (1/4) Epoch 7, batch 24050, loss[loss=0.2019, simple_loss=0.2972, pruned_loss=0.0533, over 21630.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3089, pruned_loss=0.07811, over 4284045.71 frames. ], batch size: 263, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:03:39,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.516e+02 4.440e+02 5.748e+02 1.093e+03, threshold=8.881e+02, percent-clipped=2.0 2023-06-25 03:03:48,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1242102.0, ans=0.0 2023-06-25 03:04:06,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1242162.0, ans=0.0 2023-06-25 03:04:42,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1242222.0, ans=0.125 2023-06-25 03:04:56,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-25 03:05:13,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1242342.0, ans=15.0 2023-06-25 03:05:20,277 INFO [train.py:996] (1/4) Epoch 7, batch 24100, loss[loss=0.2266, simple_loss=0.3073, pruned_loss=0.07294, over 21298.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3085, pruned_loss=0.07669, over 4282678.13 frames. ], batch size: 176, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:06:02,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1242462.0, ans=0.0 2023-06-25 03:06:46,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242582.0, ans=0.1 2023-06-25 03:07:09,403 INFO [train.py:996] (1/4) Epoch 7, batch 24150, loss[loss=0.2252, simple_loss=0.2922, pruned_loss=0.07911, over 21892.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3086, pruned_loss=0.07804, over 4286672.30 frames. ], batch size: 371, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:07:12,835 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.235e+02 4.030e+02 4.867e+02 1.048e+03, threshold=8.060e+02, percent-clipped=3.0 2023-06-25 03:07:39,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1242762.0, ans=0.0 2023-06-25 03:08:05,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. 
limit=15.0 2023-06-25 03:08:20,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1242882.0, ans=0.0 2023-06-25 03:08:36,561 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-25 03:08:53,060 INFO [train.py:996] (1/4) Epoch 7, batch 24200, loss[loss=0.2148, simple_loss=0.3039, pruned_loss=0.06283, over 21805.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3121, pruned_loss=0.0799, over 4294217.54 frames. ], batch size: 282, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:10:16,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1243182.0, ans=0.0 2023-06-25 03:10:22,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1243242.0, ans=0.0 2023-06-25 03:10:24,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1243242.0, ans=0.2 2023-06-25 03:10:45,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1243242.0, ans=0.125 2023-06-25 03:10:48,488 INFO [train.py:996] (1/4) Epoch 7, batch 24250, loss[loss=0.1807, simple_loss=0.2641, pruned_loss=0.0486, over 21202.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3098, pruned_loss=0.07447, over 4288731.05 frames. ], batch size: 143, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:10:51,920 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 3.061e+02 3.870e+02 4.839e+02 8.744e+02, threshold=7.741e+02, percent-clipped=3.0 2023-06-25 03:11:14,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1243302.0, ans=0.125 2023-06-25 03:11:15,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-25 03:12:14,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=22.5 2023-06-25 03:12:17,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-25 03:12:38,084 INFO [train.py:996] (1/4) Epoch 7, batch 24300, loss[loss=0.1938, simple_loss=0.2695, pruned_loss=0.05907, over 21858.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3033, pruned_loss=0.06889, over 4272706.25 frames. ], batch size: 107, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:13:06,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1243662.0, ans=15.0 2023-06-25 03:13:17,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1243662.0, ans=0.125 2023-06-25 03:13:18,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1243662.0, ans=0.125 2023-06-25 03:14:26,049 INFO [train.py:996] (1/4) Epoch 7, batch 24350, loss[loss=0.2266, simple_loss=0.2986, pruned_loss=0.07727, over 21818.00 frames. 
], tot_loss[loss=0.216, simple_loss=0.2966, pruned_loss=0.06773, over 4275670.80 frames. ], batch size: 247, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:14:34,784 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.804e+02 3.474e+02 4.596e+02 8.821e+02, threshold=6.948e+02, percent-clipped=1.0 2023-06-25 03:14:53,044 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:14:57,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.86 vs. limit=22.5 2023-06-25 03:16:20,443 INFO [train.py:996] (1/4) Epoch 7, batch 24400, loss[loss=0.2185, simple_loss=0.3022, pruned_loss=0.06736, over 21680.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3012, pruned_loss=0.07077, over 4274313.56 frames. ], batch size: 332, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:16:31,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1244202.0, ans=0.1 2023-06-25 03:17:22,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1244382.0, ans=0.0 2023-06-25 03:17:35,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1244382.0, ans=0.1 2023-06-25 03:17:39,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-25 03:17:54,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1244442.0, ans=0.2 2023-06-25 03:18:15,750 INFO [train.py:996] (1/4) Epoch 7, batch 24450, loss[loss=0.1993, simple_loss=0.2958, pruned_loss=0.05142, over 21657.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3025, pruned_loss=0.07202, over 4260973.08 frames. ], batch size: 263, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:18:19,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.592e+02 3.443e+02 3.965e+02 5.571e+02 1.139e+03, threshold=7.931e+02, percent-clipped=16.0 2023-06-25 03:18:25,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0 2023-06-25 03:19:22,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-25 03:20:03,676 INFO [train.py:996] (1/4) Epoch 7, batch 24500, loss[loss=0.2899, simple_loss=0.3557, pruned_loss=0.112, over 21501.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3049, pruned_loss=0.07308, over 4271615.38 frames. ], batch size: 507, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:20:44,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=12.0 2023-06-25 03:20:50,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1244922.0, ans=0.0 2023-06-25 03:21:48,774 INFO [train.py:996] (1/4) Epoch 7, batch 24550, loss[loss=0.254, simple_loss=0.3315, pruned_loss=0.08825, over 21796.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3068, pruned_loss=0.07492, over 4276867.30 frames. 
], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:21:53,851 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.970e+02 3.569e+02 4.682e+02 1.145e+03, threshold=7.139e+02, percent-clipped=2.0 2023-06-25 03:21:54,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1245102.0, ans=0.0 2023-06-25 03:22:45,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.17 vs. limit=10.0 2023-06-25 03:23:23,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1245342.0, ans=0.1 2023-06-25 03:23:28,581 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:23:31,370 INFO [train.py:996] (1/4) Epoch 7, batch 24600, loss[loss=0.1893, simple_loss=0.2575, pruned_loss=0.06055, over 21251.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3039, pruned_loss=0.07629, over 4273575.24 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:23:34,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1245402.0, ans=0.05 2023-06-25 03:23:52,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1245462.0, ans=0.1 2023-06-25 03:24:12,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1245522.0, ans=0.025 2023-06-25 03:24:30,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1245522.0, ans=0.035 2023-06-25 03:24:33,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1245582.0, ans=0.0 2023-06-25 03:24:39,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1245582.0, ans=0.125 2023-06-25 03:24:56,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1245642.0, ans=0.1 2023-06-25 03:25:14,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. limit=10.0 2023-06-25 03:25:14,511 INFO [train.py:996] (1/4) Epoch 7, batch 24650, loss[loss=0.2179, simple_loss=0.3536, pruned_loss=0.04113, over 19740.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2969, pruned_loss=0.0747, over 4266415.34 frames. 
], batch size: 702, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:25:16,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1245702.0, ans=0.0 2023-06-25 03:25:19,705 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 3.258e+02 3.830e+02 5.672e+02 1.406e+03, threshold=7.660e+02, percent-clipped=13.0 2023-06-25 03:25:20,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1245702.0, ans=0.2 2023-06-25 03:25:22,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1245702.0, ans=0.0 2023-06-25 03:25:25,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1245702.0, ans=0.0 2023-06-25 03:25:26,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-25 03:26:36,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1245882.0, ans=0.0 2023-06-25 03:26:53,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1245942.0, ans=0.1 2023-06-25 03:26:57,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1245942.0, ans=0.2 2023-06-25 03:27:02,265 INFO [train.py:996] (1/4) Epoch 7, batch 24700, loss[loss=0.199, simple_loss=0.2708, pruned_loss=0.06363, over 21787.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2947, pruned_loss=0.07258, over 4270994.28 frames. ], batch size: 102, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:27:08,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-25 03:27:09,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1246002.0, ans=0.0 2023-06-25 03:27:16,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1246002.0, ans=0.2 2023-06-25 03:28:43,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1246242.0, ans=0.125 2023-06-25 03:28:49,789 INFO [train.py:996] (1/4) Epoch 7, batch 24750, loss[loss=0.1706, simple_loss=0.2443, pruned_loss=0.04842, over 21735.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2891, pruned_loss=0.07065, over 4274765.53 frames. ], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:28:54,677 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.901e+02 3.279e+02 4.785e+02 1.213e+03, threshold=6.557e+02, percent-clipped=5.0 2023-06-25 03:29:58,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1246482.0, ans=0.125 2023-06-25 03:30:26,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1246542.0, ans=0.1 2023-06-25 03:30:35,881 INFO [train.py:996] (1/4) Epoch 7, batch 24800, loss[loss=0.2243, simple_loss=0.2943, pruned_loss=0.07711, over 21874.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2841, pruned_loss=0.07016, over 4261512.15 frames. 
], batch size: 351, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:30:37,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.55 vs. limit=10.0 2023-06-25 03:30:50,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1246602.0, ans=0.125 2023-06-25 03:31:11,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1246662.0, ans=0.0 2023-06-25 03:31:33,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1246722.0, ans=0.2 2023-06-25 03:31:43,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1246782.0, ans=0.125 2023-06-25 03:32:18,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1246842.0, ans=0.0 2023-06-25 03:32:23,877 INFO [train.py:996] (1/4) Epoch 7, batch 24850, loss[loss=0.1912, simple_loss=0.2667, pruned_loss=0.05782, over 21820.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2851, pruned_loss=0.07152, over 4272926.55 frames. ], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:32:30,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 03:32:30,856 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 3.124e+02 3.906e+02 4.909e+02 9.613e+02, threshold=7.812e+02, percent-clipped=9.0 2023-06-25 03:33:10,321 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:33:46,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1247082.0, ans=0.05 2023-06-25 03:34:14,037 INFO [train.py:996] (1/4) Epoch 7, batch 24900, loss[loss=0.3054, simple_loss=0.3637, pruned_loss=0.1236, over 21399.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2885, pruned_loss=0.07286, over 4273845.35 frames. ], batch size: 471, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:35:09,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1247322.0, ans=0.07 2023-06-25 03:35:16,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1247322.0, ans=0.1 2023-06-25 03:35:36,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1247382.0, ans=0.09899494936611666 2023-06-25 03:35:57,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1247442.0, ans=0.125 2023-06-25 03:36:08,401 INFO [train.py:996] (1/4) Epoch 7, batch 24950, loss[loss=0.2643, simple_loss=0.3309, pruned_loss=0.09889, over 21527.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2957, pruned_loss=0.07652, over 4281592.45 frames. 
], batch size: 211, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:36:15,235 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.748e+02 3.765e+02 4.804e+02 6.774e+02 1.687e+03, threshold=9.608e+02, percent-clipped=17.0 2023-06-25 03:36:18,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-25 03:37:05,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1247622.0, ans=0.025 2023-06-25 03:37:34,828 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.63 vs. limit=22.5 2023-06-25 03:37:57,850 INFO [train.py:996] (1/4) Epoch 7, batch 25000, loss[loss=0.2399, simple_loss=0.3271, pruned_loss=0.07636, over 21649.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3018, pruned_loss=0.07774, over 4285602.14 frames. ], batch size: 263, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:38:03,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1247802.0, ans=0.125 2023-06-25 03:38:13,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.20 vs. limit=10.0 2023-06-25 03:38:21,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1247802.0, ans=0.0 2023-06-25 03:38:31,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1247862.0, ans=0.2 2023-06-25 03:38:50,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1247922.0, ans=0.0 2023-06-25 03:39:06,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1247982.0, ans=0.0 2023-06-25 03:39:42,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1248042.0, ans=0.0 2023-06-25 03:39:44,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1248042.0, ans=0.0 2023-06-25 03:39:47,802 INFO [train.py:996] (1/4) Epoch 7, batch 25050, loss[loss=0.1784, simple_loss=0.2428, pruned_loss=0.05699, over 21456.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.296, pruned_loss=0.07649, over 4284983.87 frames. ], batch size: 212, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:39:59,670 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.536e+02 3.278e+02 3.984e+02 5.261e+02 1.222e+03, threshold=7.967e+02, percent-clipped=1.0 2023-06-25 03:40:54,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1248282.0, ans=0.125 2023-06-25 03:41:01,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1248282.0, ans=0.05 2023-06-25 03:41:19,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1248342.0, ans=0.125 2023-06-25 03:41:19,607 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. 
limit=15.0 2023-06-25 03:41:35,612 INFO [train.py:996] (1/4) Epoch 7, batch 25100, loss[loss=0.1956, simple_loss=0.2619, pruned_loss=0.06466, over 21552.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2906, pruned_loss=0.07506, over 4287342.46 frames. ], batch size: 391, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:41:55,715 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-25 03:42:03,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1248462.0, ans=0.125 2023-06-25 03:43:15,191 INFO [train.py:996] (1/4) Epoch 7, batch 25150, loss[loss=0.2079, simple_loss=0.2905, pruned_loss=0.06263, over 21875.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2933, pruned_loss=0.07294, over 4282855.31 frames. ], batch size: 371, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:43:21,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-25 03:43:22,414 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.917e+02 3.507e+02 4.290e+02 7.134e+02, threshold=7.014e+02, percent-clipped=0.0 2023-06-25 03:43:38,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1248762.0, ans=0.0 2023-06-25 03:43:38,638 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-25 03:44:14,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1248822.0, ans=0.125 2023-06-25 03:45:03,179 INFO [train.py:996] (1/4) Epoch 7, batch 25200, loss[loss=0.1984, simple_loss=0.2872, pruned_loss=0.05482, over 21434.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2931, pruned_loss=0.07146, over 4275073.47 frames. ], batch size: 211, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:45:29,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1249062.0, ans=0.09899494936611666 2023-06-25 03:45:29,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1249062.0, ans=0.125 2023-06-25 03:45:40,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.69 vs. limit=15.0 2023-06-25 03:45:45,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-25 03:46:24,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1249242.0, ans=0.125 2023-06-25 03:46:36,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1249242.0, ans=0.0 2023-06-25 03:46:44,515 INFO [train.py:996] (1/4) Epoch 7, batch 25250, loss[loss=0.197, simple_loss=0.2535, pruned_loss=0.07026, over 21257.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2913, pruned_loss=0.07002, over 4263201.47 frames. 
], batch size: 144, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 03:46:45,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-25 03:46:50,725 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 3.493e+02 4.531e+02 6.299e+02 1.264e+03, threshold=9.062e+02, percent-clipped=19.0 2023-06-25 03:47:00,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1249362.0, ans=0.125 2023-06-25 03:47:08,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1249362.0, ans=0.125 2023-06-25 03:47:24,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=12.0 2023-06-25 03:47:32,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1249422.0, ans=0.0 2023-06-25 03:48:32,320 INFO [train.py:996] (1/4) Epoch 7, batch 25300, loss[loss=0.1737, simple_loss=0.2452, pruned_loss=0.05106, over 21363.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2875, pruned_loss=0.06877, over 4245060.02 frames. ], batch size: 131, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 03:49:11,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-25 03:49:26,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1249722.0, ans=0.035 2023-06-25 03:49:28,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1249722.0, ans=0.1 2023-06-25 03:49:45,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=22.5 2023-06-25 03:50:20,519 INFO [train.py:996] (1/4) Epoch 7, batch 25350, loss[loss=0.207, simple_loss=0.3096, pruned_loss=0.0522, over 20781.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2902, pruned_loss=0.06925, over 4233033.84 frames. ], batch size: 607, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:50:21,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1249902.0, ans=6.0 2023-06-25 03:50:29,457 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.853e+02 3.365e+02 4.532e+02 7.857e+02, threshold=6.730e+02, percent-clipped=0.0 2023-06-25 03:51:22,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1250082.0, ans=0.0 2023-06-25 03:51:36,072 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.37 vs. limit=12.0 2023-06-25 03:52:03,093 INFO [train.py:996] (1/4) Epoch 7, batch 25400, loss[loss=0.2158, simple_loss=0.2673, pruned_loss=0.08216, over 21500.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2851, pruned_loss=0.06781, over 4198906.65 frames. 
], batch size: 441, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:52:12,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1250202.0, ans=0.1 2023-06-25 03:52:31,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1250262.0, ans=0.125 2023-06-25 03:52:56,151 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-06-25 03:53:15,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1250382.0, ans=0.125 2023-06-25 03:53:28,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1250442.0, ans=0.125 2023-06-25 03:53:46,476 INFO [train.py:996] (1/4) Epoch 7, batch 25450, loss[loss=0.2152, simple_loss=0.3129, pruned_loss=0.0588, over 21817.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2864, pruned_loss=0.06998, over 4211011.49 frames. ], batch size: 351, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:53:55,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.979e+02 3.775e+02 5.252e+02 7.977e+02, threshold=7.549e+02, percent-clipped=6.0 2023-06-25 03:53:55,663 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:54:21,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1250562.0, ans=0.0 2023-06-25 03:54:23,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1250562.0, ans=0.0 2023-06-25 03:55:21,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1250742.0, ans=0.1 2023-06-25 03:55:32,118 INFO [train.py:996] (1/4) Epoch 7, batch 25500, loss[loss=0.1693, simple_loss=0.2593, pruned_loss=0.03967, over 21580.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2867, pruned_loss=0.06629, over 4226150.24 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:55:59,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1250862.0, ans=0.0 2023-06-25 03:56:43,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1250982.0, ans=6.0 2023-06-25 03:57:03,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1251042.0, ans=0.1 2023-06-25 03:57:11,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1251042.0, ans=0.2 2023-06-25 03:57:27,518 INFO [train.py:996] (1/4) Epoch 7, batch 25550, loss[loss=0.2207, simple_loss=0.3259, pruned_loss=0.05773, over 21633.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2942, pruned_loss=0.06691, over 4239178.19 frames. 
], batch size: 414, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:57:41,638 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.132e+02 4.314e+02 5.832e+02 9.037e+02, threshold=8.627e+02, percent-clipped=4.0 2023-06-25 03:57:44,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1251102.0, ans=0.125 2023-06-25 03:58:12,285 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:58:33,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1251282.0, ans=0.1 2023-06-25 03:58:34,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.03 vs. limit=10.0 2023-06-25 03:58:47,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1251282.0, ans=0.035 2023-06-25 03:58:53,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-25 03:59:11,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1251342.0, ans=0.95 2023-06-25 03:59:14,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1251342.0, ans=0.0 2023-06-25 03:59:21,923 INFO [train.py:996] (1/4) Epoch 7, batch 25600, loss[loss=0.2392, simple_loss=0.3179, pruned_loss=0.08023, over 21729.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2984, pruned_loss=0.0674, over 4247449.55 frames. ], batch size: 351, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 04:00:59,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-25 04:01:09,207 INFO [train.py:996] (1/4) Epoch 7, batch 25650, loss[loss=0.1958, simple_loss=0.2623, pruned_loss=0.06467, over 21631.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.298, pruned_loss=0.06967, over 4258621.09 frames. ], batch size: 282, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:01:19,262 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.050e+02 3.577e+02 4.545e+02 8.924e+02, threshold=7.154e+02, percent-clipped=2.0 2023-06-25 04:01:23,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1251702.0, ans=0.035 2023-06-25 04:01:30,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1251762.0, ans=0.125 2023-06-25 04:01:42,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1251822.0, ans=0.0 2023-06-25 04:01:43,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1251822.0, ans=0.0 2023-06-25 04:02:11,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1251882.0, ans=0.125 2023-06-25 04:02:24,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.84 vs. 
limit=12.0 2023-06-25 04:02:48,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1251942.0, ans=0.2 2023-06-25 04:02:54,041 INFO [train.py:996] (1/4) Epoch 7, batch 25700, loss[loss=0.1914, simple_loss=0.2731, pruned_loss=0.05492, over 21370.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2954, pruned_loss=0.07079, over 4257382.63 frames. ], batch size: 211, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:04:42,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1252302.0, ans=0.0 2023-06-25 04:04:43,958 INFO [train.py:996] (1/4) Epoch 7, batch 25750, loss[loss=0.3947, simple_loss=0.4664, pruned_loss=0.1615, over 21486.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3013, pruned_loss=0.07449, over 4257120.70 frames. ], batch size: 471, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:04:55,427 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.207e+02 3.828e+02 5.534e+02 9.207e+02, threshold=7.655e+02, percent-clipped=4.0 2023-06-25 04:04:57,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1252302.0, ans=0.0 2023-06-25 04:05:16,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1252362.0, ans=0.125 2023-06-25 04:05:57,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1252422.0, ans=0.1 2023-06-25 04:06:30,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-25 04:06:41,398 INFO [train.py:996] (1/4) Epoch 7, batch 25800, loss[loss=0.2366, simple_loss=0.3169, pruned_loss=0.07821, over 21741.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3128, pruned_loss=0.07796, over 4262512.24 frames. ], batch size: 332, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:07:48,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1252782.0, ans=0.125 2023-06-25 04:07:56,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-25 04:08:02,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1252782.0, ans=0.125 2023-06-25 04:08:36,059 INFO [train.py:996] (1/4) Epoch 7, batch 25850, loss[loss=0.2194, simple_loss=0.2906, pruned_loss=0.07415, over 21515.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3144, pruned_loss=0.07783, over 4266662.03 frames. ], batch size: 131, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:08:46,063 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.799e+02 4.980e+02 7.138e+02 1.041e+03, threshold=9.960e+02, percent-clipped=14.0 2023-06-25 04:08:50,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1252902.0, ans=0.07 2023-06-25 04:09:43,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. 
limit=22.5 2023-06-25 04:09:46,964 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:10:24,678 INFO [train.py:996] (1/4) Epoch 7, batch 25900, loss[loss=0.2395, simple_loss=0.3319, pruned_loss=0.07351, over 21399.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3158, pruned_loss=0.07905, over 4271232.72 frames. ], batch size: 211, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:10:26,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1253202.0, ans=0.2 2023-06-25 04:12:06,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1253442.0, ans=0.04949747468305833 2023-06-25 04:12:11,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1253442.0, ans=0.0 2023-06-25 04:12:19,420 INFO [train.py:996] (1/4) Epoch 7, batch 25950, loss[loss=0.2787, simple_loss=0.3441, pruned_loss=0.1066, over 21574.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3232, pruned_loss=0.08206, over 4269751.53 frames. ], batch size: 414, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:12:30,261 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.924e+02 4.825e+02 6.667e+02 9.345e+02, threshold=9.651e+02, percent-clipped=0.0 2023-06-25 04:12:45,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-25 04:12:49,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1253562.0, ans=0.0 2023-06-25 04:12:58,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1253622.0, ans=0.2 2023-06-25 04:13:20,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1253682.0, ans=0.0 2023-06-25 04:14:08,572 INFO [train.py:996] (1/4) Epoch 7, batch 26000, loss[loss=0.2448, simple_loss=0.31, pruned_loss=0.08977, over 20035.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.321, pruned_loss=0.07992, over 4265343.51 frames. ], batch size: 703, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 04:14:31,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1253862.0, ans=0.125 2023-06-25 04:14:32,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1253862.0, ans=0.1 2023-06-25 04:15:58,148 INFO [train.py:996] (1/4) Epoch 7, batch 26050, loss[loss=0.2261, simple_loss=0.2908, pruned_loss=0.08066, over 21684.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3215, pruned_loss=0.08116, over 4266732.54 frames. 
], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:16:10,015 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.188e+02 3.821e+02 5.430e+02 8.574e+02, threshold=7.643e+02, percent-clipped=0.0 2023-06-25 04:16:56,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1254222.0, ans=0.1 2023-06-25 04:16:56,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1254222.0, ans=0.0 2023-06-25 04:17:25,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1254342.0, ans=0.0 2023-06-25 04:17:27,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1254342.0, ans=0.2 2023-06-25 04:17:45,911 INFO [train.py:996] (1/4) Epoch 7, batch 26100, loss[loss=0.2264, simple_loss=0.2951, pruned_loss=0.0789, over 21878.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3155, pruned_loss=0.08064, over 4275890.88 frames. ], batch size: 391, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:17:53,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.65 vs. limit=10.0 2023-06-25 04:18:16,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1254462.0, ans=0.2 2023-06-25 04:18:56,943 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=15.0 2023-06-25 04:19:08,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254582.0, ans=0.1 2023-06-25 04:19:35,052 INFO [train.py:996] (1/4) Epoch 7, batch 26150, loss[loss=0.2368, simple_loss=0.308, pruned_loss=0.08279, over 21474.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3125, pruned_loss=0.08012, over 4277307.89 frames. ], batch size: 194, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:19:47,504 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.240e+02 3.858e+02 5.306e+02 8.605e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-25 04:19:55,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1254762.0, ans=0.0 2023-06-25 04:19:55,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1254762.0, ans=0.125 2023-06-25 04:20:01,072 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=12.0 2023-06-25 04:20:44,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1254882.0, ans=0.2 2023-06-25 04:20:47,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1254882.0, ans=0.125 2023-06-25 04:21:24,135 INFO [train.py:996] (1/4) Epoch 7, batch 26200, loss[loss=0.2047, simple_loss=0.3046, pruned_loss=0.05242, over 21626.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.313, pruned_loss=0.07805, over 4281647.79 frames. 
], batch size: 230, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:21:28,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255002.0, ans=0.1 2023-06-25 04:21:56,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1255062.0, ans=0.2 2023-06-25 04:21:58,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1255062.0, ans=0.125 2023-06-25 04:22:33,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1255182.0, ans=0.2 2023-06-25 04:22:56,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-25 04:23:03,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1255242.0, ans=0.125 2023-06-25 04:23:13,406 INFO [train.py:996] (1/4) Epoch 7, batch 26250, loss[loss=0.2324, simple_loss=0.3096, pruned_loss=0.07764, over 21471.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3152, pruned_loss=0.07675, over 4288047.49 frames. ], batch size: 131, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:23:25,312 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.172e+02 3.762e+02 4.925e+02 1.309e+03, threshold=7.524e+02, percent-clipped=5.0 2023-06-25 04:23:41,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1255362.0, ans=0.125 2023-06-25 04:24:09,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1255422.0, ans=0.0 2023-06-25 04:24:20,743 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:24:39,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1255482.0, ans=0.125 2023-06-25 04:25:01,096 INFO [train.py:996] (1/4) Epoch 7, batch 26300, loss[loss=0.2186, simple_loss=0.285, pruned_loss=0.0761, over 21482.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3113, pruned_loss=0.0772, over 4290931.95 frames. ], batch size: 211, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:25:07,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. 
limit=6.0 2023-06-25 04:25:22,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1255662.0, ans=0.0 2023-06-25 04:26:08,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1255722.0, ans=0.125 2023-06-25 04:26:17,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255782.0, ans=0.1 2023-06-25 04:26:41,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1255842.0, ans=0.1 2023-06-25 04:26:51,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1255842.0, ans=0.0 2023-06-25 04:26:51,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1255842.0, ans=0.125 2023-06-25 04:26:53,860 INFO [train.py:996] (1/4) Epoch 7, batch 26350, loss[loss=0.2422, simple_loss=0.3107, pruned_loss=0.08688, over 21625.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3102, pruned_loss=0.0782, over 4296577.55 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:27:09,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-25 04:27:11,529 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.447e+02 3.110e+02 3.681e+02 4.505e+02 7.991e+02, threshold=7.361e+02, percent-clipped=2.0 2023-06-25 04:27:29,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1255962.0, ans=0.2 2023-06-25 04:27:57,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1256022.0, ans=0.2 2023-06-25 04:28:24,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1256142.0, ans=0.0 2023-06-25 04:28:38,239 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-25 04:28:40,449 INFO [train.py:996] (1/4) Epoch 7, batch 26400, loss[loss=0.1981, simple_loss=0.2582, pruned_loss=0.06897, over 21271.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3049, pruned_loss=0.07874, over 4284123.02 frames. ], batch size: 549, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:28:57,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1256202.0, ans=0.1 2023-06-25 04:29:46,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1256322.0, ans=0.1 2023-06-25 04:29:59,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1256382.0, ans=0.125 2023-06-25 04:30:14,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1256442.0, ans=0.2 2023-06-25 04:30:39,801 INFO [train.py:996] (1/4) Epoch 7, batch 26450, loss[loss=0.3383, simple_loss=0.4239, pruned_loss=0.1263, over 21408.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3047, pruned_loss=0.07826, over 4277925.03 frames. 
], batch size: 507, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:30:42,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1256502.0, ans=0.0 2023-06-25 04:30:57,248 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.534e+02 4.471e+02 5.534e+02 1.801e+03, threshold=8.941e+02, percent-clipped=10.0 2023-06-25 04:32:10,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-25 04:32:36,248 INFO [train.py:996] (1/4) Epoch 7, batch 26500, loss[loss=0.1847, simple_loss=0.2475, pruned_loss=0.06092, over 21365.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.306, pruned_loss=0.07636, over 4267648.87 frames. ], batch size: 176, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:32:38,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1256802.0, ans=10.0 2023-06-25 04:33:11,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1256862.0, ans=0.1 2023-06-25 04:33:14,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-25 04:33:36,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1256922.0, ans=0.125 2023-06-25 04:34:10,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1257042.0, ans=0.1 2023-06-25 04:34:33,094 INFO [train.py:996] (1/4) Epoch 7, batch 26550, loss[loss=0.2297, simple_loss=0.3353, pruned_loss=0.06207, over 19852.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.304, pruned_loss=0.07419, over 4262697.36 frames. ], batch size: 703, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:34:47,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.332e+02 4.391e+02 7.235e+02 1.419e+03, threshold=8.782e+02, percent-clipped=20.0 2023-06-25 04:35:29,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-25 04:35:35,908 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:36:06,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1257342.0, ans=0.125 2023-06-25 04:36:21,171 INFO [train.py:996] (1/4) Epoch 7, batch 26600, loss[loss=0.2141, simple_loss=0.2866, pruned_loss=0.07078, over 21579.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3015, pruned_loss=0.07187, over 4257373.91 frames. ], batch size: 414, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:38:03,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1257642.0, ans=0.125 2023-06-25 04:38:09,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0 2023-06-25 04:38:10,078 INFO [train.py:996] (1/4) Epoch 7, batch 26650, loss[loss=0.1782, simple_loss=0.2535, pruned_loss=0.05149, over 21459.00 frames. 
], tot_loss[loss=0.2178, simple_loss=0.2947, pruned_loss=0.07044, over 4251467.97 frames. ], batch size: 195, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:38:28,646 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.895e+02 3.400e+02 5.153e+02 1.068e+03, threshold=6.799e+02, percent-clipped=4.0 2023-06-25 04:39:09,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2023-06-25 04:39:10,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1257882.0, ans=0.125 2023-06-25 04:39:41,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1257942.0, ans=0.1 2023-06-25 04:39:56,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1258002.0, ans=0.2 2023-06-25 04:39:57,580 INFO [train.py:996] (1/4) Epoch 7, batch 26700, loss[loss=0.1955, simple_loss=0.2675, pruned_loss=0.06171, over 21574.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2882, pruned_loss=0.0675, over 4261674.66 frames. ], batch size: 212, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:40:10,265 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:40:17,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1258002.0, ans=0.125 2023-06-25 04:41:05,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1258182.0, ans=0.125 2023-06-25 04:41:52,590 INFO [train.py:996] (1/4) Epoch 7, batch 26750, loss[loss=0.2213, simple_loss=0.3081, pruned_loss=0.06718, over 20702.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2887, pruned_loss=0.06684, over 4270565.85 frames. ], batch size: 607, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:42:06,348 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.716e+02 3.514e+02 4.569e+02 1.217e+03, threshold=7.028e+02, percent-clipped=8.0 2023-06-25 04:42:43,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1258422.0, ans=0.125 2023-06-25 04:42:59,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1258422.0, ans=0.125 2023-06-25 04:43:43,456 INFO [train.py:996] (1/4) Epoch 7, batch 26800, loss[loss=0.3066, simple_loss=0.3633, pruned_loss=0.1249, over 21442.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2966, pruned_loss=0.0716, over 4270153.79 frames. ], batch size: 471, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:44:25,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1258662.0, ans=0.1 2023-06-25 04:44:28,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1258722.0, ans=0.125 2023-06-25 04:45:32,715 INFO [train.py:996] (1/4) Epoch 7, batch 26850, loss[loss=0.2515, simple_loss=0.3177, pruned_loss=0.09261, over 21548.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2994, pruned_loss=0.07512, over 4257637.71 frames. 
], batch size: 389, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:45:39,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1258902.0, ans=0.0 2023-06-25 04:45:58,735 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.727e+02 3.580e+02 4.511e+02 5.580e+02 1.314e+03, threshold=9.022e+02, percent-clipped=13.0 2023-06-25 04:46:19,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1259022.0, ans=0.125 2023-06-25 04:46:58,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-25 04:47:22,399 INFO [train.py:996] (1/4) Epoch 7, batch 26900, loss[loss=0.1911, simple_loss=0.2549, pruned_loss=0.06368, over 21537.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.291, pruned_loss=0.07334, over 4264910.42 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:48:02,085 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-25 04:48:54,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.07 vs. limit=10.0 2023-06-25 04:48:59,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-25 04:49:06,712 INFO [train.py:996] (1/4) Epoch 7, batch 26950, loss[loss=0.2387, simple_loss=0.3173, pruned_loss=0.08006, over 21788.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2901, pruned_loss=0.07281, over 4263022.94 frames. ], batch size: 371, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:49:33,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.559e+02 3.020e+02 3.484e+02 4.294e+02 8.554e+02, threshold=6.967e+02, percent-clipped=0.0 2023-06-25 04:49:41,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1259562.0, ans=10.0 2023-06-25 04:49:44,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1259562.0, ans=0.125 2023-06-25 04:49:45,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. limit=10.0 2023-06-25 04:50:24,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=15.0 2023-06-25 04:50:28,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1259682.0, ans=0.0 2023-06-25 04:51:02,040 INFO [train.py:996] (1/4) Epoch 7, batch 27000, loss[loss=0.1852, simple_loss=0.2342, pruned_loss=0.06811, over 19949.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2903, pruned_loss=0.07078, over 4257853.89 frames. 
], batch size: 703, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:51:02,040 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 04:51:14,046 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7893, 2.5374, 3.9213, 2.4929], device='cuda:1') 2023-06-25 04:51:24,271 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2512, simple_loss=0.3463, pruned_loss=0.07806, over 1796401.00 frames. 2023-06-25 04:51:24,272 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 04:51:53,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-25 04:52:38,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-25 04:53:14,906 INFO [train.py:996] (1/4) Epoch 7, batch 27050, loss[loss=0.2004, simple_loss=0.2952, pruned_loss=0.05276, over 21799.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2918, pruned_loss=0.06769, over 4264909.65 frames. ], batch size: 282, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:53:34,731 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.897e+02 3.762e+02 4.771e+02 8.226e+02, threshold=7.524e+02, percent-clipped=2.0 2023-06-25 04:54:14,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1260222.0, ans=0.2 2023-06-25 04:54:50,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1260342.0, ans=0.0 2023-06-25 04:55:01,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1260342.0, ans=0.125 2023-06-25 04:55:04,178 INFO [train.py:996] (1/4) Epoch 7, batch 27100, loss[loss=0.2048, simple_loss=0.3076, pruned_loss=0.05095, over 21619.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2941, pruned_loss=0.06923, over 4273348.45 frames. ], batch size: 230, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:55:43,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1260462.0, ans=0.0 2023-06-25 04:55:50,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1260522.0, ans=0.125 2023-06-25 04:55:59,431 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:56:53,967 INFO [train.py:996] (1/4) Epoch 7, batch 27150, loss[loss=0.2823, simple_loss=0.3924, pruned_loss=0.08614, over 20770.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3055, pruned_loss=0.07192, over 4272360.21 frames. 
], batch size: 607, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:57:08,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1260702.0, ans=0.0 2023-06-25 04:57:19,851 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.400e+02 4.098e+02 5.830e+02 1.178e+03, threshold=8.196e+02, percent-clipped=9.0 2023-06-25 04:57:48,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1260822.0, ans=0.125 2023-06-25 04:58:00,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1260882.0, ans=0.1 2023-06-25 04:58:07,347 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:58:47,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1261002.0, ans=0.125 2023-06-25 04:58:53,807 INFO [train.py:996] (1/4) Epoch 7, batch 27200, loss[loss=0.26, simple_loss=0.3269, pruned_loss=0.09654, over 21797.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3145, pruned_loss=0.07474, over 4271021.81 frames. ], batch size: 124, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:59:14,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1261062.0, ans=0.1 2023-06-25 04:59:39,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1261122.0, ans=0.125 2023-06-25 05:00:16,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1261182.0, ans=0.2 2023-06-25 05:00:21,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1261242.0, ans=0.125 2023-06-25 05:00:23,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1261242.0, ans=0.125 2023-06-25 05:00:35,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1261242.0, ans=0.0 2023-06-25 05:00:44,391 INFO [train.py:996] (1/4) Epoch 7, batch 27250, loss[loss=0.2548, simple_loss=0.3224, pruned_loss=0.09353, over 21818.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3167, pruned_loss=0.07849, over 4272344.52 frames. 
], batch size: 247, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:01:02,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.541e+02 3.239e+02 3.756e+02 4.583e+02 7.251e+02, threshold=7.513e+02, percent-clipped=0.0 2023-06-25 05:01:29,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1261422.0, ans=0.04949747468305833 2023-06-25 05:01:39,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1261422.0, ans=0.1 2023-06-25 05:01:55,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1261482.0, ans=0.0 2023-06-25 05:02:17,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1261542.0, ans=0.125 2023-06-25 05:02:19,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1261542.0, ans=0.1 2023-06-25 05:02:36,161 INFO [train.py:996] (1/4) Epoch 7, batch 27300, loss[loss=0.2107, simple_loss=0.3257, pruned_loss=0.04789, over 20757.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3189, pruned_loss=0.07928, over 4278920.07 frames. ], batch size: 607, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:03:36,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1261722.0, ans=0.1 2023-06-25 05:04:25,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1261902.0, ans=0.1 2023-06-25 05:04:26,392 INFO [train.py:996] (1/4) Epoch 7, batch 27350, loss[loss=0.243, simple_loss=0.3186, pruned_loss=0.08368, over 21210.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3212, pruned_loss=0.07958, over 4274660.22 frames. ], batch size: 143, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:04:47,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1261962.0, ans=0.125 2023-06-25 05:04:48,254 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.470e+02 4.790e+02 5.992e+02 9.415e+02, threshold=9.580e+02, percent-clipped=9.0 2023-06-25 05:05:01,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.48 vs. limit=15.0 2023-06-25 05:05:21,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1262022.0, ans=0.125 2023-06-25 05:05:27,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1262022.0, ans=0.125 2023-06-25 05:06:18,602 INFO [train.py:996] (1/4) Epoch 7, batch 27400, loss[loss=0.2088, simple_loss=0.2759, pruned_loss=0.07084, over 21400.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3153, pruned_loss=0.07888, over 4279075.53 frames. ], batch size: 131, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:07:40,466 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.83 vs. limit=22.5 2023-06-25 05:08:08,412 INFO [train.py:996] (1/4) Epoch 7, batch 27450, loss[loss=0.249, simple_loss=0.3202, pruned_loss=0.08887, over 21861.00 frames. 
], tot_loss[loss=0.2319, simple_loss=0.3087, pruned_loss=0.07751, over 4276083.54 frames. ], batch size: 118, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:08:36,542 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.458e+02 3.140e+02 3.820e+02 5.353e+02 9.307e+02, threshold=7.640e+02, percent-clipped=0.0 2023-06-25 05:09:19,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1262682.0, ans=0.2 2023-06-25 05:09:31,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1262742.0, ans=0.035 2023-06-25 05:09:31,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1262742.0, ans=0.0 2023-06-25 05:09:33,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1262742.0, ans=0.0 2023-06-25 05:09:50,492 INFO [train.py:996] (1/4) Epoch 7, batch 27500, loss[loss=0.2123, simple_loss=0.2877, pruned_loss=0.0685, over 21910.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3076, pruned_loss=0.07788, over 4287644.00 frames. ], batch size: 351, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:10:00,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1262802.0, ans=0.2 2023-06-25 05:11:43,564 INFO [train.py:996] (1/4) Epoch 7, batch 27550, loss[loss=0.2348, simple_loss=0.2935, pruned_loss=0.08804, over 21361.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3025, pruned_loss=0.07467, over 4284579.27 frames. ], batch size: 471, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:12:10,958 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.562e+02 3.311e+02 4.001e+02 4.826e+02 1.149e+03, threshold=8.002e+02, percent-clipped=4.0 2023-06-25 05:12:17,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-25 05:12:37,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1263222.0, ans=0.1 2023-06-25 05:13:08,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1263342.0, ans=0.125 2023-06-25 05:13:29,711 INFO [train.py:996] (1/4) Epoch 7, batch 27600, loss[loss=0.1868, simple_loss=0.2434, pruned_loss=0.06509, over 21158.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2962, pruned_loss=0.07378, over 4280585.50 frames. ], batch size: 548, lr: 4.18e-03, grad_scale: 32.0 2023-06-25 05:13:49,051 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:14:09,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1263462.0, ans=0.1 2023-06-25 05:14:58,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1263642.0, ans=0.0 2023-06-25 05:15:10,386 INFO [train.py:996] (1/4) Epoch 7, batch 27650, loss[loss=0.2191, simple_loss=0.3016, pruned_loss=0.06829, over 21755.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2907, pruned_loss=0.07295, over 4277952.02 frames. 
], batch size: 332, lr: 4.18e-03, grad_scale: 32.0 2023-06-25 05:15:37,187 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.109e+02 3.684e+02 5.059e+02 1.214e+03, threshold=7.368e+02, percent-clipped=6.0 2023-06-25 05:15:39,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1263762.0, ans=0.0 2023-06-25 05:16:49,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1263942.0, ans=0.125 2023-06-25 05:16:57,724 INFO [train.py:996] (1/4) Epoch 7, batch 27700, loss[loss=0.1995, simple_loss=0.279, pruned_loss=0.06003, over 21461.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2907, pruned_loss=0.07088, over 4281596.72 frames. ], batch size: 211, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:17:09,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1264002.0, ans=0.0 2023-06-25 05:17:20,434 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:17:36,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1264062.0, ans=0.0 2023-06-25 05:17:36,834 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:18:10,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1264182.0, ans=0.2 2023-06-25 05:18:15,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-25 05:18:49,452 INFO [train.py:996] (1/4) Epoch 7, batch 27750, loss[loss=0.1809, simple_loss=0.2687, pruned_loss=0.04653, over 21772.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.295, pruned_loss=0.07112, over 4278922.87 frames. ], batch size: 298, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:18:49,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1264302.0, ans=0.0 2023-06-25 05:18:51,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1264302.0, ans=0.125 2023-06-25 05:19:19,114 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 2.962e+02 3.488e+02 4.454e+02 9.416e+02, threshold=6.976e+02, percent-clipped=4.0 2023-06-25 05:19:31,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1264362.0, ans=0.2 2023-06-25 05:19:38,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1264422.0, ans=0.0 2023-06-25 05:19:46,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1264422.0, ans=0.0 2023-06-25 05:19:53,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1264422.0, ans=0.2 2023-06-25 05:19:57,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-25 05:20:36,086 INFO [train.py:996] (1/4) Epoch 7, batch 27800, loss[loss=0.1998, simple_loss=0.2619, pruned_loss=0.06883, over 21207.00 frames. 
], tot_loss[loss=0.2193, simple_loss=0.2948, pruned_loss=0.07191, over 4284097.94 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:20:42,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1264602.0, ans=0.125 2023-06-25 05:20:53,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1264602.0, ans=0.125 2023-06-25 05:20:57,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1264662.0, ans=0.125 2023-06-25 05:21:08,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1264662.0, ans=0.0 2023-06-25 05:21:21,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1264722.0, ans=0.95 2023-06-25 05:21:46,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.09 vs. limit=5.0 2023-06-25 05:22:04,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1264842.0, ans=0.125 2023-06-25 05:22:24,418 INFO [train.py:996] (1/4) Epoch 7, batch 27850, loss[loss=0.2255, simple_loss=0.3162, pruned_loss=0.06734, over 21796.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2941, pruned_loss=0.07284, over 4291138.96 frames. ], batch size: 247, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:22:56,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-06-25 05:22:57,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.134e+02 3.811e+02 5.096e+02 8.843e+02, threshold=7.621e+02, percent-clipped=7.0 2023-06-25 05:23:25,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1265022.0, ans=0.1 2023-06-25 05:24:10,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1265142.0, ans=0.0 2023-06-25 05:24:25,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1265202.0, ans=0.1 2023-06-25 05:24:27,024 INFO [train.py:996] (1/4) Epoch 7, batch 27900, loss[loss=0.2213, simple_loss=0.3095, pruned_loss=0.06657, over 21418.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3044, pruned_loss=0.07484, over 4285114.55 frames. ], batch size: 194, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:24:32,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.72 vs. limit=22.5 2023-06-25 05:24:47,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1265262.0, ans=0.2 2023-06-25 05:25:18,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-25 05:26:21,599 INFO [train.py:996] (1/4) Epoch 7, batch 27950, loss[loss=0.2313, simple_loss=0.3239, pruned_loss=0.06936, over 21227.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3057, pruned_loss=0.07184, over 4279078.50 frames. 
], batch size: 549, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:26:32,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1265502.0, ans=0.125 2023-06-25 05:26:42,603 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.117e+02 4.053e+02 5.979e+02 1.114e+03, threshold=8.107e+02, percent-clipped=11.0 2023-06-25 05:27:58,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1265742.0, ans=0.0 2023-06-25 05:28:09,596 INFO [train.py:996] (1/4) Epoch 7, batch 28000, loss[loss=0.2276, simple_loss=0.3015, pruned_loss=0.07683, over 21883.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.301, pruned_loss=0.06897, over 4287353.86 frames. ], batch size: 351, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:28:12,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1265802.0, ans=0.0 2023-06-25 05:28:48,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1265922.0, ans=0.125 2023-06-25 05:30:01,610 INFO [train.py:996] (1/4) Epoch 7, batch 28050, loss[loss=0.1664, simple_loss=0.2232, pruned_loss=0.05476, over 21283.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2976, pruned_loss=0.06993, over 4294766.34 frames. ], batch size: 159, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:30:22,380 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.952e+02 3.818e+02 5.160e+02 1.220e+03, threshold=7.636e+02, percent-clipped=4.0 2023-06-25 05:30:47,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1266222.0, ans=0.125 2023-06-25 05:30:57,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1266222.0, ans=0.2 2023-06-25 05:31:51,592 INFO [train.py:996] (1/4) Epoch 7, batch 28100, loss[loss=0.2122, simple_loss=0.2772, pruned_loss=0.07359, over 21445.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2957, pruned_loss=0.07005, over 4288064.31 frames. ], batch size: 389, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:31:58,948 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:32:04,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1266402.0, ans=0.0 2023-06-25 05:33:06,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1266582.0, ans=0.125 2023-06-25 05:33:11,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1266582.0, ans=0.2 2023-06-25 05:33:13,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1266582.0, ans=0.0 2023-06-25 05:33:23,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1266642.0, ans=0.125 2023-06-25 05:33:28,368 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:33:40,486 INFO [train.py:996] (1/4) Epoch 7, batch 28150, loss[loss=0.189, simple_loss=0.2533, pruned_loss=0.06232, over 21827.00 frames. 
], tot_loss[loss=0.2139, simple_loss=0.2879, pruned_loss=0.06997, over 4289351.74 frames. ], batch size: 352, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:34:00,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1266762.0, ans=0.05 2023-06-25 05:34:01,756 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.369e+02 4.176e+02 5.786e+02 1.041e+03, threshold=8.353e+02, percent-clipped=8.0 2023-06-25 05:34:23,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1266762.0, ans=0.1 2023-06-25 05:34:38,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1266822.0, ans=0.125 2023-06-25 05:35:29,307 INFO [train.py:996] (1/4) Epoch 7, batch 28200, loss[loss=0.2229, simple_loss=0.2904, pruned_loss=0.07776, over 20706.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2869, pruned_loss=0.07113, over 4286764.85 frames. ], batch size: 607, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:37:09,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1267242.0, ans=0.125 2023-06-25 05:37:12,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-25 05:37:12,239 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=12.0 2023-06-25 05:37:17,969 INFO [train.py:996] (1/4) Epoch 7, batch 28250, loss[loss=0.2215, simple_loss=0.2854, pruned_loss=0.07882, over 21602.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2919, pruned_loss=0.0735, over 4272432.87 frames. ], batch size: 415, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:37:21,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1267302.0, ans=0.0 2023-06-25 05:37:32,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1267302.0, ans=0.1 2023-06-25 05:37:43,657 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.555e+02 3.449e+02 4.309e+02 5.866e+02 1.082e+03, threshold=8.618e+02, percent-clipped=6.0 2023-06-25 05:38:56,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1267542.0, ans=0.0 2023-06-25 05:39:08,746 INFO [train.py:996] (1/4) Epoch 7, batch 28300, loss[loss=0.2001, simple_loss=0.2632, pruned_loss=0.06849, over 21819.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2891, pruned_loss=0.07155, over 4274213.25 frames. ], batch size: 102, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:40:16,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1267722.0, ans=0.125 2023-06-25 05:40:23,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1267782.0, ans=0.95 2023-06-25 05:41:03,570 INFO [train.py:996] (1/4) Epoch 7, batch 28350, loss[loss=0.1975, simple_loss=0.286, pruned_loss=0.05449, over 21533.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.287, pruned_loss=0.06578, over 4274023.62 frames. 
], batch size: 389, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:41:28,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1267962.0, ans=0.125 2023-06-25 05:41:29,699 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.753e+02 3.449e+02 4.988e+02 1.144e+03, threshold=6.899e+02, percent-clipped=4.0 2023-06-25 05:41:55,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1268022.0, ans=0.125 2023-06-25 05:42:19,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-25 05:42:25,757 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:42:45,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1268202.0, ans=0.125 2023-06-25 05:42:51,316 INFO [train.py:996] (1/4) Epoch 7, batch 28400, loss[loss=0.2701, simple_loss=0.3444, pruned_loss=0.09787, over 21803.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2834, pruned_loss=0.06592, over 4271531.24 frames. ], batch size: 118, lr: 4.17e-03, grad_scale: 32.0 2023-06-25 05:43:40,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1268322.0, ans=0.125 2023-06-25 05:44:01,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1268382.0, ans=0.0 2023-06-25 05:44:07,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1268382.0, ans=0.125 2023-06-25 05:44:40,880 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:44:42,021 INFO [train.py:996] (1/4) Epoch 7, batch 28450, loss[loss=0.2847, simple_loss=0.3328, pruned_loss=0.1183, over 21682.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2889, pruned_loss=0.06997, over 4273957.64 frames. ], batch size: 508, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:45:05,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1268562.0, ans=0.125 2023-06-25 05:45:15,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.249e+02 3.944e+02 5.811e+02 1.668e+03, threshold=7.889e+02, percent-clipped=19.0 2023-06-25 05:45:35,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1268622.0, ans=0.0 2023-06-25 05:45:36,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1268622.0, ans=0.0 2023-06-25 05:45:43,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1268682.0, ans=0.2 2023-06-25 05:46:36,210 INFO [train.py:996] (1/4) Epoch 7, batch 28500, loss[loss=0.2536, simple_loss=0.3202, pruned_loss=0.09355, over 21947.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2906, pruned_loss=0.07217, over 4278414.18 frames. 
], batch size: 316, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:46:58,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-25 05:46:58,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-25 05:48:31,019 INFO [train.py:996] (1/4) Epoch 7, batch 28550, loss[loss=0.2559, simple_loss=0.3491, pruned_loss=0.08137, over 21619.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2987, pruned_loss=0.07448, over 4277855.09 frames. ], batch size: 230, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:48:42,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1269102.0, ans=0.0 2023-06-25 05:48:53,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.516e+02 4.419e+02 5.883e+02 1.246e+03, threshold=8.838e+02, percent-clipped=8.0 2023-06-25 05:48:56,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1269162.0, ans=15.0 2023-06-25 05:49:51,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1269282.0, ans=0.1 2023-06-25 05:50:10,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1269342.0, ans=0.0 2023-06-25 05:50:18,707 INFO [train.py:996] (1/4) Epoch 7, batch 28600, loss[loss=0.2401, simple_loss=0.3138, pruned_loss=0.08322, over 21857.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3043, pruned_loss=0.07658, over 4279575.17 frames. ], batch size: 372, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:50:37,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1269462.0, ans=0.0 2023-06-25 05:50:44,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1269462.0, ans=0.125 2023-06-25 05:50:44,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1269462.0, ans=0.02 2023-06-25 05:51:09,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1269522.0, ans=0.2 2023-06-25 05:52:07,664 INFO [train.py:996] (1/4) Epoch 7, batch 28650, loss[loss=0.2064, simple_loss=0.2672, pruned_loss=0.07284, over 21678.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2998, pruned_loss=0.07628, over 4269291.68 frames. 
], batch size: 417, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:52:27,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1269762.0, ans=0.0 2023-06-25 05:52:29,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1269762.0, ans=0.125 2023-06-25 05:52:30,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.536e+02 4.575e+02 6.589e+02 8.896e+02, threshold=9.150e+02, percent-clipped=1.0 2023-06-25 05:53:31,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1269882.0, ans=0.0 2023-06-25 05:53:43,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1269942.0, ans=0.125 2023-06-25 05:53:55,688 INFO [train.py:996] (1/4) Epoch 7, batch 28700, loss[loss=0.2389, simple_loss=0.3149, pruned_loss=0.08145, over 21350.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2984, pruned_loss=0.07674, over 4261184.39 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:54:01,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1270002.0, ans=0.0 2023-06-25 05:54:13,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1270062.0, ans=0.1 2023-06-25 05:54:26,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1270062.0, ans=0.0 2023-06-25 05:54:51,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-25 05:54:58,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1270122.0, ans=0.0 2023-06-25 05:55:02,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1270182.0, ans=0.125 2023-06-25 05:55:43,751 INFO [train.py:996] (1/4) Epoch 7, batch 28750, loss[loss=0.196, simple_loss=0.2874, pruned_loss=0.05231, over 21644.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2992, pruned_loss=0.07707, over 4268528.84 frames. ], batch size: 263, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:55:47,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1270302.0, ans=0.125 2023-06-25 05:56:06,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.238e+02 3.725e+02 5.020e+02 9.578e+02, threshold=7.449e+02, percent-clipped=2.0 2023-06-25 05:56:39,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1270422.0, ans=0.05 2023-06-25 05:57:14,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1270542.0, ans=0.2 2023-06-25 05:57:23,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1270542.0, ans=0.025 2023-06-25 05:57:33,211 INFO [train.py:996] (1/4) Epoch 7, batch 28800, loss[loss=0.2356, simple_loss=0.3101, pruned_loss=0.08051, over 21571.00 frames. 
], tot_loss[loss=0.2281, simple_loss=0.3018, pruned_loss=0.07715, over 4272562.11 frames. ], batch size: 263, lr: 4.17e-03, grad_scale: 32.0 2023-06-25 05:58:18,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1270722.0, ans=0.0 2023-06-25 05:58:47,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1270782.0, ans=0.1 2023-06-25 05:59:04,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1270842.0, ans=0.125 2023-06-25 05:59:22,101 INFO [train.py:996] (1/4) Epoch 7, batch 28850, loss[loss=0.2497, simple_loss=0.3157, pruned_loss=0.0918, over 21478.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3031, pruned_loss=0.07808, over 4279643.56 frames. ], batch size: 131, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:59:54,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1270962.0, ans=0.125 2023-06-25 05:59:54,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1270962.0, ans=0.1 2023-06-25 06:00:02,818 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.622e+02 3.393e+02 4.119e+02 6.059e+02 1.112e+03, threshold=8.239e+02, percent-clipped=12.0 2023-06-25 06:00:03,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1270962.0, ans=0.125 2023-06-25 06:00:32,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1271022.0, ans=0.05 2023-06-25 06:00:58,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1271142.0, ans=0.5 2023-06-25 06:01:17,964 INFO [train.py:996] (1/4) Epoch 7, batch 28900, loss[loss=0.2375, simple_loss=0.3046, pruned_loss=0.08515, over 21481.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3059, pruned_loss=0.0795, over 4283216.43 frames. ], batch size: 211, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:01:41,846 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:01:48,407 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-25 06:02:14,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-25 06:02:39,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1271442.0, ans=0.04949747468305833 2023-06-25 06:03:09,346 INFO [train.py:996] (1/4) Epoch 7, batch 28950, loss[loss=0.2027, simple_loss=0.282, pruned_loss=0.0617, over 21712.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3041, pruned_loss=0.07856, over 4285129.03 frames. 
], batch size: 247, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:03:46,132 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.609e+02 4.387e+02 5.987e+02 1.071e+03, threshold=8.774e+02, percent-clipped=6.0 2023-06-25 06:03:59,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1271622.0, ans=0.1 2023-06-25 06:03:59,156 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:04:02,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1271622.0, ans=0.125 2023-06-25 06:04:09,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1271622.0, ans=0.0 2023-06-25 06:04:32,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1271682.0, ans=0.125 2023-06-25 06:04:33,157 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.92 vs. limit=22.5 2023-06-25 06:05:02,834 INFO [train.py:996] (1/4) Epoch 7, batch 29000, loss[loss=0.2409, simple_loss=0.3237, pruned_loss=0.07911, over 21831.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3081, pruned_loss=0.07819, over 4281396.95 frames. ], batch size: 124, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:05:35,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1271862.0, ans=0.125 2023-06-25 06:06:51,567 INFO [train.py:996] (1/4) Epoch 7, batch 29050, loss[loss=0.2398, simple_loss=0.3035, pruned_loss=0.08808, over 21771.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3084, pruned_loss=0.07918, over 4291905.36 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:07:21,610 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.493e+02 3.635e+02 4.186e+02 5.307e+02 1.029e+03, threshold=8.372e+02, percent-clipped=1.0 2023-06-25 06:07:25,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1272162.0, ans=0.125 2023-06-25 06:07:31,059 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:08:01,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1272282.0, ans=0.0 2023-06-25 06:08:37,153 INFO [train.py:996] (1/4) Epoch 7, batch 29100, loss[loss=0.2018, simple_loss=0.2686, pruned_loss=0.06753, over 21629.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2992, pruned_loss=0.07657, over 4288619.12 frames. ], batch size: 416, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:08:49,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1272402.0, ans=0.125 2023-06-25 06:08:56,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.74 vs. 
limit=10.0 2023-06-25 06:09:10,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272462.0, ans=0.1 2023-06-25 06:09:11,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1272462.0, ans=0.125 2023-06-25 06:09:27,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1272522.0, ans=10.0 2023-06-25 06:10:23,682 INFO [train.py:996] (1/4) Epoch 7, batch 29150, loss[loss=0.1705, simple_loss=0.2266, pruned_loss=0.0572, over 20760.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2984, pruned_loss=0.075, over 4290324.89 frames. ], batch size: 608, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:10:44,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-25 06:10:51,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1272762.0, ans=0.0 2023-06-25 06:10:54,195 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.210e+02 4.222e+02 5.476e+02 9.873e+02, threshold=8.444e+02, percent-clipped=1.0 2023-06-25 06:11:10,295 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:12:10,646 INFO [train.py:996] (1/4) Epoch 7, batch 29200, loss[loss=0.1905, simple_loss=0.2575, pruned_loss=0.06173, over 21222.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2948, pruned_loss=0.07434, over 4281773.50 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:13:30,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1273182.0, ans=0.0 2023-06-25 06:14:05,290 INFO [train.py:996] (1/4) Epoch 7, batch 29250, loss[loss=0.2159, simple_loss=0.3055, pruned_loss=0.06309, over 21738.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2921, pruned_loss=0.07146, over 4286753.03 frames. ], batch size: 282, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:14:31,558 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.162e+02 4.067e+02 5.479e+02 1.081e+03, threshold=8.134e+02, percent-clipped=3.0 2023-06-25 06:14:50,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1273422.0, ans=0.125 2023-06-25 06:14:55,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=22.5 2023-06-25 06:15:36,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1273542.0, ans=0.0 2023-06-25 06:15:53,721 INFO [train.py:996] (1/4) Epoch 7, batch 29300, loss[loss=0.1996, simple_loss=0.2673, pruned_loss=0.06597, over 21703.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2949, pruned_loss=0.07108, over 4287432.99 frames. 
], batch size: 112, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:16:44,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1273722.0, ans=0.125 2023-06-25 06:16:46,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1273722.0, ans=0.1 2023-06-25 06:17:05,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1273782.0, ans=0.1 2023-06-25 06:17:42,113 INFO [train.py:996] (1/4) Epoch 7, batch 29350, loss[loss=0.1941, simple_loss=0.2595, pruned_loss=0.06432, over 21639.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2915, pruned_loss=0.07072, over 4281127.28 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:17:42,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1273902.0, ans=0.0 2023-06-25 06:18:13,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.302e+02 3.026e+02 3.822e+02 5.352e+02 1.093e+03, threshold=7.644e+02, percent-clipped=3.0 2023-06-25 06:19:30,140 INFO [train.py:996] (1/4) Epoch 7, batch 29400, loss[loss=0.1966, simple_loss=0.295, pruned_loss=0.0491, over 21392.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2908, pruned_loss=0.06911, over 4280283.31 frames. ], batch size: 211, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:20:04,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1274262.0, ans=0.125 2023-06-25 06:20:08,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1274262.0, ans=0.125 2023-06-25 06:20:08,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1274262.0, ans=0.0 2023-06-25 06:20:31,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1274322.0, ans=0.0 2023-06-25 06:20:54,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1274382.0, ans=0.0 2023-06-25 06:20:54,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1274382.0, ans=0.125 2023-06-25 06:20:57,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1274382.0, ans=0.0 2023-06-25 06:21:20,150 INFO [train.py:996] (1/4) Epoch 7, batch 29450, loss[loss=0.2669, simple_loss=0.3564, pruned_loss=0.08869, over 19731.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2878, pruned_loss=0.06819, over 4261597.83 frames. 
], batch size: 703, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:21:48,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1274562.0, ans=0.2 2023-06-25 06:21:53,719 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.532e+02 4.385e+02 5.559e+02 1.410e+03, threshold=8.770e+02, percent-clipped=9.0 2023-06-25 06:22:44,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1274682.0, ans=0.125 2023-06-25 06:23:08,517 INFO [train.py:996] (1/4) Epoch 7, batch 29500, loss[loss=0.2122, simple_loss=0.2794, pruned_loss=0.07249, over 21515.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2936, pruned_loss=0.07159, over 4269234.22 frames. ], batch size: 194, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:23:54,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1274922.0, ans=0.04949747468305833 2023-06-25 06:24:53,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1275042.0, ans=0.125 2023-06-25 06:24:55,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1275102.0, ans=0.0 2023-06-25 06:24:56,251 INFO [train.py:996] (1/4) Epoch 7, batch 29550, loss[loss=0.2015, simple_loss=0.2708, pruned_loss=0.06615, over 21901.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2936, pruned_loss=0.07352, over 4285832.46 frames. ], batch size: 316, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:25:23,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1275162.0, ans=0.125 2023-06-25 06:25:29,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2023-06-25 06:25:30,061 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.666e+02 3.932e+02 4.748e+02 5.685e+02 9.373e+02, threshold=9.495e+02, percent-clipped=3.0 2023-06-25 06:25:54,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1275222.0, ans=0.0 2023-06-25 06:26:13,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-25 06:26:23,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1275282.0, ans=0.1 2023-06-25 06:26:39,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1275342.0, ans=0.125 2023-06-25 06:26:41,743 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-25 06:26:45,581 INFO [train.py:996] (1/4) Epoch 7, batch 29600, loss[loss=0.226, simple_loss=0.2972, pruned_loss=0.07744, over 20142.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2995, pruned_loss=0.07638, over 4281231.81 frames. 
], batch size: 702, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:28:25,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1275642.0, ans=0.1 2023-06-25 06:28:33,294 INFO [train.py:996] (1/4) Epoch 7, batch 29650, loss[loss=0.1681, simple_loss=0.2456, pruned_loss=0.04527, over 21754.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2966, pruned_loss=0.07225, over 4276387.53 frames. ], batch size: 247, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:29:16,844 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.458e+02 4.326e+02 5.325e+02 1.074e+03, threshold=8.651e+02, percent-clipped=3.0 2023-06-25 06:29:47,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1275882.0, ans=0.125 2023-06-25 06:29:50,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1275882.0, ans=0.0 2023-06-25 06:29:50,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1275882.0, ans=0.125 2023-06-25 06:30:04,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1275942.0, ans=0.1 2023-06-25 06:30:20,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1276002.0, ans=0.125 2023-06-25 06:30:27,050 INFO [train.py:996] (1/4) Epoch 7, batch 29700, loss[loss=0.2424, simple_loss=0.3449, pruned_loss=0.06999, over 21436.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2995, pruned_loss=0.07223, over 4267896.33 frames. ], batch size: 194, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:31:21,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1276122.0, ans=0.0 2023-06-25 06:31:24,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1276122.0, ans=0.125 2023-06-25 06:31:40,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1276182.0, ans=0.125 2023-06-25 06:32:02,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1276242.0, ans=0.125 2023-06-25 06:32:16,253 INFO [train.py:996] (1/4) Epoch 7, batch 29750, loss[loss=0.2242, simple_loss=0.3185, pruned_loss=0.06498, over 21688.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3037, pruned_loss=0.07197, over 4274932.82 frames. 
], batch size: 263, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:32:32,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1276362.0, ans=0.0 2023-06-25 06:32:32,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1276362.0, ans=0.125 2023-06-25 06:32:33,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1276362.0, ans=0.0 2023-06-25 06:32:54,077 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.299e+02 3.896e+02 4.722e+02 1.232e+03, threshold=7.792e+02, percent-clipped=5.0 2023-06-25 06:34:03,388 INFO [train.py:996] (1/4) Epoch 7, batch 29800, loss[loss=0.2177, simple_loss=0.2826, pruned_loss=0.07639, over 21533.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3045, pruned_loss=0.07256, over 4281273.52 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:35:50,571 INFO [train.py:996] (1/4) Epoch 7, batch 29850, loss[loss=0.1804, simple_loss=0.2653, pruned_loss=0.04777, over 21395.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2998, pruned_loss=0.07017, over 4279174.35 frames. ], batch size: 194, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:36:28,325 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.948e+02 3.373e+02 4.045e+02 7.832e+02, threshold=6.745e+02, percent-clipped=1.0 2023-06-25 06:36:47,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1277022.0, ans=0.125 2023-06-25 06:37:20,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1277142.0, ans=0.2 2023-06-25 06:37:21,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1277142.0, ans=0.0 2023-06-25 06:37:36,840 INFO [train.py:996] (1/4) Epoch 7, batch 29900, loss[loss=0.2611, simple_loss=0.3252, pruned_loss=0.09844, over 21487.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2989, pruned_loss=0.07215, over 4282649.11 frames. ], batch size: 194, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:37:44,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=22.5 2023-06-25 06:37:46,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1277202.0, ans=10.0 2023-06-25 06:38:18,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-25 06:39:20,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1277442.0, ans=0.125 2023-06-25 06:39:25,215 INFO [train.py:996] (1/4) Epoch 7, batch 29950, loss[loss=0.2743, simple_loss=0.3424, pruned_loss=0.1031, over 21306.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3039, pruned_loss=0.07588, over 4278408.92 frames. 
], batch size: 143, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:39:56,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1277562.0, ans=0.2 2023-06-25 06:40:02,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1277562.0, ans=0.125 2023-06-25 06:40:08,694 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.319e+02 4.450e+02 5.387e+02 9.920e+02, threshold=8.899e+02, percent-clipped=12.0 2023-06-25 06:40:34,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1277682.0, ans=0.1 2023-06-25 06:40:48,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1277682.0, ans=0.125 2023-06-25 06:41:04,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1277742.0, ans=0.125 2023-06-25 06:41:04,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1277742.0, ans=0.0 2023-06-25 06:41:19,265 INFO [train.py:996] (1/4) Epoch 7, batch 30000, loss[loss=0.2171, simple_loss=0.3029, pruned_loss=0.06559, over 21406.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3048, pruned_loss=0.07526, over 4270845.87 frames. ], batch size: 211, lr: 4.16e-03, grad_scale: 32.0 2023-06-25 06:41:19,266 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 06:41:36,385 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.6087, 3.0225, 3.1942, 3.6492, 2.0380, 3.5055, 3.3951, 2.3243], device='cuda:1') 2023-06-25 06:41:39,212 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2493, simple_loss=0.346, pruned_loss=0.07628, over 1796401.00 frames. 2023-06-25 06:41:39,213 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 06:41:44,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-25 06:41:53,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1277802.0, ans=0.125 2023-06-25 06:42:03,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.80 vs. limit=10.0 2023-06-25 06:43:06,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1277982.0, ans=0.125 2023-06-25 06:43:30,343 INFO [train.py:996] (1/4) Epoch 7, batch 30050, loss[loss=0.2508, simple_loss=0.3516, pruned_loss=0.075, over 21691.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3081, pruned_loss=0.073, over 4269100.44 frames. 
], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:44:04,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278162.0, ans=0.1 2023-06-25 06:44:05,681 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.420e+02 3.279e+02 4.155e+02 5.724e+02 1.149e+03, threshold=8.309e+02, percent-clipped=6.0 2023-06-25 06:45:16,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278402.0, ans=0.1 2023-06-25 06:45:17,722 INFO [train.py:996] (1/4) Epoch 7, batch 30100, loss[loss=0.1939, simple_loss=0.2621, pruned_loss=0.06281, over 21828.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3056, pruned_loss=0.07224, over 4270719.17 frames. ], batch size: 107, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:46:23,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1278522.0, ans=0.125 2023-06-25 06:46:25,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1278522.0, ans=0.0 2023-06-25 06:46:30,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1278582.0, ans=0.0 2023-06-25 06:46:50,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278642.0, ans=0.1 2023-06-25 06:46:53,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1278642.0, ans=0.125 2023-06-25 06:46:59,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1278642.0, ans=0.125 2023-06-25 06:47:10,877 INFO [train.py:996] (1/4) Epoch 7, batch 30150, loss[loss=0.2572, simple_loss=0.3412, pruned_loss=0.08664, over 21494.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.302, pruned_loss=0.07382, over 4272742.94 frames. ], batch size: 131, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:47:47,379 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.267e+02 3.809e+02 4.984e+02 9.103e+02, threshold=7.618e+02, percent-clipped=3.0 2023-06-25 06:47:57,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1278822.0, ans=0.0 2023-06-25 06:48:11,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-25 06:48:13,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1278822.0, ans=10.0 2023-06-25 06:48:55,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1279002.0, ans=0.0 2023-06-25 06:48:56,770 INFO [train.py:996] (1/4) Epoch 7, batch 30200, loss[loss=0.2285, simple_loss=0.3269, pruned_loss=0.06508, over 21307.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3039, pruned_loss=0.07245, over 4271951.21 frames. 
], batch size: 549, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:49:30,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279062.0, ans=0.1 2023-06-25 06:50:31,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1279242.0, ans=0.125 2023-06-25 06:50:42,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-25 06:50:47,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1279242.0, ans=0.125 2023-06-25 06:50:59,059 INFO [train.py:996] (1/4) Epoch 7, batch 30250, loss[loss=0.251, simple_loss=0.3526, pruned_loss=0.07472, over 21810.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3116, pruned_loss=0.07497, over 4270234.59 frames. ], batch size: 282, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:51:33,133 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 3.334e+02 4.601e+02 6.960e+02 1.343e+03, threshold=9.203e+02, percent-clipped=16.0 2023-06-25 06:51:40,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1279422.0, ans=0.125 2023-06-25 06:52:13,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1279482.0, ans=0.0 2023-06-25 06:52:41,318 INFO [train.py:996] (1/4) Epoch 7, batch 30300, loss[loss=0.2083, simple_loss=0.2712, pruned_loss=0.07264, over 21501.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.31, pruned_loss=0.0755, over 4269601.38 frames. ], batch size: 441, lr: 4.15e-03, grad_scale: 16.0 2023-06-25 06:52:50,296 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:53:06,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1279662.0, ans=0.2 2023-06-25 06:53:17,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1279662.0, ans=0.0 2023-06-25 06:54:27,339 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:54:37,766 INFO [train.py:996] (1/4) Epoch 7, batch 30350, loss[loss=0.2672, simple_loss=0.3624, pruned_loss=0.08598, over 21222.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3117, pruned_loss=0.07669, over 4254130.26 frames. ], batch size: 549, lr: 4.15e-03, grad_scale: 16.0 2023-06-25 06:54:46,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.72 vs. 
limit=15.0 2023-06-25 06:54:55,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1279962.0, ans=0.125 2023-06-25 06:54:59,631 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:55:05,063 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.772e+02 4.635e+02 6.721e+02 1.384e+03, threshold=9.269e+02, percent-clipped=9.0 2023-06-25 06:55:17,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-25 06:55:23,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1280082.0, ans=0.025 2023-06-25 06:56:00,343 INFO [train.py:996] (1/4) Epoch 7, batch 30400, loss[loss=0.227, simple_loss=0.2657, pruned_loss=0.09418, over 20338.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3023, pruned_loss=0.07478, over 4238216.10 frames. ], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-25 06:56:45,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1280322.0, ans=0.125 2023-06-25 06:57:15,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1280442.0, ans=0.1 2023-06-25 06:57:28,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1280442.0, ans=0.2 2023-06-25 06:57:33,108 INFO [train.py:996] (1/4) Epoch 7, batch 30450, loss[loss=0.256, simple_loss=0.3686, pruned_loss=0.07171, over 19936.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3033, pruned_loss=0.07502, over 4184840.82 frames. ], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-25 06:57:36,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1280502.0, ans=0.04949747468305833 2023-06-25 06:58:02,609 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.728e+02 6.501e+02 9.013e+02 1.486e+03 3.895e+03, threshold=1.803e+03, percent-clipped=46.0 2023-06-25 06:58:16,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-25 06:58:19,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1280622.0, ans=0.2 2023-06-25 06:58:25,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1280682.0, ans=0.0 2023-06-25 07:01:02,055 INFO [train.py:996] (1/4) Epoch 8, batch 0, loss[loss=0.2613, simple_loss=0.3091, pruned_loss=0.1067, over 21350.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3091, pruned_loss=0.1067, over 21350.00 frames. ], batch size: 473, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:01:02,056 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 07:01:19,563 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2406, simple_loss=0.3467, pruned_loss=0.06724, over 1796401.00 frames. 
2023-06-25 07:01:19,564 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 07:01:20,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-25 07:02:18,003 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:02:41,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1281012.0, ans=0.125 2023-06-25 07:03:05,849 INFO [train.py:996] (1/4) Epoch 8, batch 50, loss[loss=0.2366, simple_loss=0.324, pruned_loss=0.07459, over 19903.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3121, pruned_loss=0.0767, over 970473.12 frames. ], batch size: 703, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:03:13,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1281072.0, ans=0.2 2023-06-25 07:03:18,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-25 07:03:21,483 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:03:31,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1281132.0, ans=0.0 2023-06-25 07:03:31,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1281132.0, ans=0.0 2023-06-25 07:03:49,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.736e+02 3.478e+02 5.204e+02 1.094e+03 2.896e+03, threshold=1.041e+03, percent-clipped=7.0 2023-06-25 07:04:51,363 INFO [train.py:996] (1/4) Epoch 8, batch 100, loss[loss=0.2707, simple_loss=0.3543, pruned_loss=0.09355, over 21462.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3288, pruned_loss=0.07923, over 1703849.45 frames. ], batch size: 471, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:05:25,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1281432.0, ans=0.125 2023-06-25 07:06:01,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1281552.0, ans=0.2 2023-06-25 07:06:37,748 INFO [train.py:996] (1/4) Epoch 8, batch 150, loss[loss=0.2502, simple_loss=0.3487, pruned_loss=0.07587, over 21864.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3313, pruned_loss=0.07867, over 2272830.39 frames. 
], batch size: 371, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:07:02,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1281732.0, ans=0.125 2023-06-25 07:07:27,442 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.041e+02 3.436e+02 4.359e+02 9.068e+02, threshold=6.872e+02, percent-clipped=0.0 2023-06-25 07:07:59,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1281852.0, ans=0.04949747468305833 2023-06-25 07:08:15,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1281912.0, ans=0.125 2023-06-25 07:08:18,567 INFO [train.py:996] (1/4) Epoch 8, batch 200, loss[loss=0.2922, simple_loss=0.346, pruned_loss=0.1192, over 21429.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3257, pruned_loss=0.07653, over 2719578.66 frames. ], batch size: 471, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:08:42,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1282032.0, ans=0.0 2023-06-25 07:09:03,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1282092.0, ans=0.125 2023-06-25 07:09:33,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1282152.0, ans=0.125 2023-06-25 07:09:57,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1282212.0, ans=0.125 2023-06-25 07:09:59,992 INFO [train.py:996] (1/4) Epoch 8, batch 250, loss[loss=0.2268, simple_loss=0.2902, pruned_loss=0.08176, over 21581.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3185, pruned_loss=0.07606, over 3072383.62 frames. ], batch size: 548, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:10:16,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1282332.0, ans=0.1 2023-06-25 07:10:21,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1282332.0, ans=0.125 2023-06-25 07:10:27,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-25 07:10:45,280 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.611e+02 3.498e+02 4.445e+02 5.647e+02 1.101e+03, threshold=8.891e+02, percent-clipped=14.0 2023-06-25 07:11:21,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1282452.0, ans=0.1 2023-06-25 07:11:33,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1282512.0, ans=0.2 2023-06-25 07:11:42,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1282512.0, ans=0.125 2023-06-25 07:11:49,072 INFO [train.py:996] (1/4) Epoch 8, batch 300, loss[loss=0.1977, simple_loss=0.2844, pruned_loss=0.05555, over 21236.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3133, pruned_loss=0.0749, over 3337767.91 frames. 
], batch size: 176, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:12:14,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1282632.0, ans=0.125 2023-06-25 07:13:39,780 INFO [train.py:996] (1/4) Epoch 8, batch 350, loss[loss=0.1829, simple_loss=0.2505, pruned_loss=0.05763, over 21194.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3067, pruned_loss=0.07327, over 3543533.65 frames. ], batch size: 548, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:13:46,623 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-25 07:14:09,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-25 07:14:30,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.131e+02 3.897e+02 5.934e+02 1.239e+03, threshold=7.794e+02, percent-clipped=5.0 2023-06-25 07:15:04,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=12.0 2023-06-25 07:15:18,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1283112.0, ans=0.2 2023-06-25 07:15:19,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1283112.0, ans=0.125 2023-06-25 07:15:27,511 INFO [train.py:996] (1/4) Epoch 8, batch 400, loss[loss=0.2232, simple_loss=0.2772, pruned_loss=0.08458, over 21317.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2996, pruned_loss=0.07216, over 3707588.20 frames. ], batch size: 473, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:15:28,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1283172.0, ans=0.0 2023-06-25 07:15:28,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-25 07:15:40,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-25 07:15:55,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-25 07:16:35,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.20 vs. limit=5.0 2023-06-25 07:16:43,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1283352.0, ans=0.0 2023-06-25 07:16:58,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1283352.0, ans=0.125 2023-06-25 07:17:19,197 INFO [train.py:996] (1/4) Epoch 8, batch 450, loss[loss=0.2008, simple_loss=0.2648, pruned_loss=0.06838, over 21780.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2962, pruned_loss=0.06982, over 3836107.06 frames. 
], batch size: 352, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:17:25,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1283472.0, ans=0.125 2023-06-25 07:17:27,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1283472.0, ans=0.1 2023-06-25 07:18:16,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 3.535e+02 4.359e+02 5.649e+02 1.208e+03, threshold=8.718e+02, percent-clipped=9.0 2023-06-25 07:19:01,980 INFO [train.py:996] (1/4) Epoch 8, batch 500, loss[loss=0.1925, simple_loss=0.2377, pruned_loss=0.07359, over 20734.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3006, pruned_loss=0.0706, over 3930596.40 frames. ], batch size: 609, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:19:14,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1283772.0, ans=0.2 2023-06-25 07:20:01,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1283892.0, ans=0.1 2023-06-25 07:20:34,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1284012.0, ans=0.125 2023-06-25 07:20:49,168 INFO [train.py:996] (1/4) Epoch 8, batch 550, loss[loss=0.2568, simple_loss=0.3642, pruned_loss=0.07472, over 21677.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3059, pruned_loss=0.07114, over 4007818.74 frames. ], batch size: 441, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:21:10,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1284132.0, ans=0.125 2023-06-25 07:21:17,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1284132.0, ans=0.1 2023-06-25 07:21:45,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.578e+02 5.101e+02 7.574e+02 1.639e+03, threshold=1.020e+03, percent-clipped=17.0 2023-06-25 07:21:48,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=12.0 2023-06-25 07:21:49,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1284192.0, ans=0.125 2023-06-25 07:21:52,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1284192.0, ans=0.125 2023-06-25 07:21:56,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1284252.0, ans=0.0 2023-06-25 07:21:59,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1284252.0, ans=0.2 2023-06-25 07:22:13,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1284312.0, ans=6.0 2023-06-25 07:22:28,846 INFO [train.py:996] (1/4) Epoch 8, batch 600, loss[loss=0.2154, simple_loss=0.2923, pruned_loss=0.06928, over 21510.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3052, pruned_loss=0.07048, over 4066642.28 frames. 
], batch size: 230, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:23:32,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0 2023-06-25 07:23:39,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1284552.0, ans=0.125 2023-06-25 07:23:41,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1284552.0, ans=0.125 2023-06-25 07:24:10,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1284612.0, ans=0.0 2023-06-25 07:24:14,790 INFO [train.py:996] (1/4) Epoch 8, batch 650, loss[loss=0.2197, simple_loss=0.2968, pruned_loss=0.07134, over 15431.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3052, pruned_loss=0.07037, over 4107444.76 frames. ], batch size: 63, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:24:34,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-25 07:24:41,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1284732.0, ans=0.2 2023-06-25 07:24:52,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1284792.0, ans=0.125 2023-06-25 07:25:03,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1284792.0, ans=0.2 2023-06-25 07:25:03,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1284792.0, ans=0.125 2023-06-25 07:25:16,312 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.313e+02 4.571e+02 7.176e+02 1.629e+03, threshold=9.143e+02, percent-clipped=10.0 2023-06-25 07:25:16,855 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:25:23,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1284792.0, ans=0.125 2023-06-25 07:25:40,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1284852.0, ans=0.125 2023-06-25 07:26:00,094 INFO [train.py:996] (1/4) Epoch 8, batch 700, loss[loss=0.221, simple_loss=0.3443, pruned_loss=0.04885, over 20836.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3077, pruned_loss=0.0717, over 4148613.51 frames. ], batch size: 608, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:26:09,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1284972.0, ans=0.1 2023-06-25 07:26:30,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-25 07:27:34,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1285212.0, ans=0.0 2023-06-25 07:27:44,312 INFO [train.py:996] (1/4) Epoch 8, batch 750, loss[loss=0.2093, simple_loss=0.2815, pruned_loss=0.06851, over 21280.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3053, pruned_loss=0.07252, over 4173449.16 frames. 
], batch size: 143, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:28:10,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1285332.0, ans=0.125 2023-06-25 07:28:47,153 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.536e+02 3.601e+02 4.438e+02 5.764e+02 1.140e+03, threshold=8.877e+02, percent-clipped=3.0 2023-06-25 07:28:49,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1285392.0, ans=0.125 2023-06-25 07:28:52,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.56 vs. limit=15.0 2023-06-25 07:29:32,234 INFO [train.py:996] (1/4) Epoch 8, batch 800, loss[loss=0.2705, simple_loss=0.3714, pruned_loss=0.08481, over 20910.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3019, pruned_loss=0.07283, over 4194123.52 frames. ], batch size: 608, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:29:50,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-25 07:29:53,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=22.5 2023-06-25 07:30:03,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-25 07:30:42,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1285692.0, ans=0.0 2023-06-25 07:30:51,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-25 07:31:25,109 INFO [train.py:996] (1/4) Epoch 8, batch 850, loss[loss=0.2024, simple_loss=0.2818, pruned_loss=0.06149, over 21667.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2993, pruned_loss=0.07317, over 4221299.90 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:31:29,783 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.76 vs. limit=10.0 2023-06-25 07:31:44,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1285932.0, ans=0.125 2023-06-25 07:31:56,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-25 07:32:23,991 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.482e+02 3.234e+02 3.833e+02 4.866e+02 9.722e+02, threshold=7.666e+02, percent-clipped=1.0 2023-06-25 07:32:49,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.19 vs. limit=12.0 2023-06-25 07:33:13,027 INFO [train.py:996] (1/4) Epoch 8, batch 900, loss[loss=0.2235, simple_loss=0.2935, pruned_loss=0.07678, over 21926.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2957, pruned_loss=0.07217, over 4238062.28 frames. 
], batch size: 316, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:33:40,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1286232.0, ans=0.125 2023-06-25 07:35:01,291 INFO [train.py:996] (1/4) Epoch 8, batch 950, loss[loss=0.2159, simple_loss=0.2848, pruned_loss=0.0735, over 21294.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2938, pruned_loss=0.07177, over 4252826.66 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:35:03,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1286472.0, ans=0.1 2023-06-25 07:35:38,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1286532.0, ans=0.5 2023-06-25 07:35:54,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.907e+02 3.602e+02 4.628e+02 6.707e+02 1.446e+03, threshold=9.256e+02, percent-clipped=20.0 2023-06-25 07:36:42,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-25 07:36:42,687 INFO [train.py:996] (1/4) Epoch 8, batch 1000, loss[loss=0.2952, simple_loss=0.4059, pruned_loss=0.09225, over 19775.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2948, pruned_loss=0.07225, over 4266035.84 frames. ], batch size: 703, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:37:56,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1286952.0, ans=0.0 2023-06-25 07:38:15,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1287012.0, ans=0.0 2023-06-25 07:38:27,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1287012.0, ans=0.125 2023-06-25 07:38:31,296 INFO [train.py:996] (1/4) Epoch 8, batch 1050, loss[loss=0.1805, simple_loss=0.2469, pruned_loss=0.05701, over 21232.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2936, pruned_loss=0.07226, over 4273843.74 frames. ], batch size: 143, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:39:25,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.89 vs. limit=15.0 2023-06-25 07:39:30,743 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 3.439e+02 4.407e+02 5.715e+02 1.308e+03, threshold=8.815e+02, percent-clipped=4.0 2023-06-25 07:40:19,270 INFO [train.py:996] (1/4) Epoch 8, batch 1100, loss[loss=0.2759, simple_loss=0.3359, pruned_loss=0.1079, over 21547.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2942, pruned_loss=0.07253, over 4279290.91 frames. 
], batch size: 471, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:40:30,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1287372.0, ans=0.2 2023-06-25 07:41:20,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1287492.0, ans=0.125 2023-06-25 07:41:26,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1287552.0, ans=0.0 2023-06-25 07:42:15,439 INFO [train.py:996] (1/4) Epoch 8, batch 1150, loss[loss=0.2177, simple_loss=0.3074, pruned_loss=0.06397, over 21782.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2955, pruned_loss=0.07267, over 4286714.74 frames. ], batch size: 332, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:42:35,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1287732.0, ans=0.0 2023-06-25 07:42:45,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1287732.0, ans=0.0 2023-06-25 07:42:59,617 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.638e+02 3.529e+02 4.325e+02 5.726e+02 1.140e+03, threshold=8.649e+02, percent-clipped=5.0 2023-06-25 07:43:22,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-25 07:43:24,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1287852.0, ans=0.0 2023-06-25 07:43:25,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1287852.0, ans=0.0 2023-06-25 07:43:25,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-25 07:43:27,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-25 07:43:59,518 INFO [train.py:996] (1/4) Epoch 8, batch 1200, loss[loss=0.2629, simple_loss=0.3391, pruned_loss=0.09332, over 21558.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2979, pruned_loss=0.07299, over 4284762.15 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:44:43,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1288092.0, ans=0.0 2023-06-25 07:45:14,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1288212.0, ans=0.125 2023-06-25 07:45:14,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1288212.0, ans=0.125 2023-06-25 07:45:26,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1288212.0, ans=0.125 2023-06-25 07:45:47,937 INFO [train.py:996] (1/4) Epoch 8, batch 1250, loss[loss=0.2001, simple_loss=0.2831, pruned_loss=0.05851, over 21271.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2988, pruned_loss=0.0726, over 4288929.30 frames. 
], batch size: 176, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:46:02,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1288272.0, ans=0.125 2023-06-25 07:46:13,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1288332.0, ans=0.2 2023-06-25 07:46:25,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1288392.0, ans=0.125 2023-06-25 07:46:28,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-25 07:46:38,022 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.488e+02 3.316e+02 4.127e+02 5.335e+02 1.234e+03, threshold=8.255e+02, percent-clipped=5.0 2023-06-25 07:47:30,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1288512.0, ans=0.125 2023-06-25 07:47:36,787 INFO [train.py:996] (1/4) Epoch 8, batch 1300, loss[loss=0.2112, simple_loss=0.2862, pruned_loss=0.06811, over 21819.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2994, pruned_loss=0.07253, over 4289556.85 frames. ], batch size: 282, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:47:46,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1288572.0, ans=0.125 2023-06-25 07:48:01,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1288632.0, ans=0.2 2023-06-25 07:48:03,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1288632.0, ans=0.125 2023-06-25 07:48:41,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1288752.0, ans=0.035 2023-06-25 07:48:52,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1288752.0, ans=0.125 2023-06-25 07:49:24,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1288872.0, ans=0.0 2023-06-25 07:49:25,889 INFO [train.py:996] (1/4) Epoch 8, batch 1350, loss[loss=0.2417, simple_loss=0.3342, pruned_loss=0.07458, over 21689.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3026, pruned_loss=0.07379, over 4290276.27 frames. ], batch size: 389, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:50:11,766 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-25 07:50:15,507 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.456e+02 4.378e+02 5.897e+02 1.151e+03, threshold=8.757e+02, percent-clipped=2.0 2023-06-25 07:50:19,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1288992.0, ans=0.125 2023-06-25 07:50:47,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289112.0, ans=0.1 2023-06-25 07:51:08,357 INFO [train.py:996] (1/4) Epoch 8, batch 1400, loss[loss=0.2, simple_loss=0.2669, pruned_loss=0.06659, over 21687.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3, pruned_loss=0.07291, over 4295347.45 frames. 
], batch size: 298, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:51:56,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-25 07:52:51,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1289412.0, ans=0.125 2023-06-25 07:52:57,305 INFO [train.py:996] (1/4) Epoch 8, batch 1450, loss[loss=0.2219, simple_loss=0.299, pruned_loss=0.07235, over 21911.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2985, pruned_loss=0.0733, over 4296846.38 frames. ], batch size: 316, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:53:03,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1289472.0, ans=0.125 2023-06-25 07:53:05,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-25 07:53:13,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1289532.0, ans=0.125 2023-06-25 07:53:22,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1289532.0, ans=0.125 2023-06-25 07:53:48,356 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.442e+02 4.414e+02 6.258e+02 1.881e+03, threshold=8.827e+02, percent-clipped=13.0 2023-06-25 07:54:05,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1289652.0, ans=0.125 2023-06-25 07:54:11,772 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.13 vs. limit=10.0 2023-06-25 07:54:28,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=22.5 2023-06-25 07:54:37,671 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-25 07:54:47,191 INFO [train.py:996] (1/4) Epoch 8, batch 1500, loss[loss=0.2444, simple_loss=0.3136, pruned_loss=0.08754, over 21333.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3004, pruned_loss=0.07468, over 4297434.11 frames. 
], batch size: 548, lr: 3.85e-03, grad_scale: 8.0 2023-06-25 07:54:55,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1289772.0, ans=0.125 2023-06-25 07:55:05,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289832.0, ans=0.1 2023-06-25 07:55:24,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1289892.0, ans=0.2 2023-06-25 07:55:49,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1289952.0, ans=0.2 2023-06-25 07:56:23,150 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:56:26,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1290012.0, ans=0.07 2023-06-25 07:56:40,533 INFO [train.py:996] (1/4) Epoch 8, batch 1550, loss[loss=0.1858, simple_loss=0.2817, pruned_loss=0.04498, over 21570.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2982, pruned_loss=0.07389, over 4289925.21 frames. ], batch size: 389, lr: 3.85e-03, grad_scale: 8.0 2023-06-25 07:56:57,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1290132.0, ans=0.1 2023-06-25 07:57:31,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1290192.0, ans=0.05 2023-06-25 07:57:35,137 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 3.681e+02 5.239e+02 6.621e+02 1.108e+03, threshold=1.048e+03, percent-clipped=5.0 2023-06-25 07:58:33,626 INFO [train.py:996] (1/4) Epoch 8, batch 1600, loss[loss=0.2645, simple_loss=0.319, pruned_loss=0.105, over 21605.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2981, pruned_loss=0.07353, over 4277115.35 frames. ], batch size: 471, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:58:58,784 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:59:31,659 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:59:43,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1290492.0, ans=0.1 2023-06-25 08:00:11,669 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:00:26,986 INFO [train.py:996] (1/4) Epoch 8, batch 1650, loss[loss=0.2395, simple_loss=0.3248, pruned_loss=0.0771, over 20683.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2976, pruned_loss=0.07247, over 4274986.45 frames. 
], batch size: 607, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 08:00:41,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1290672.0, ans=0.05 2023-06-25 08:01:38,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.337e+02 4.261e+02 5.571e+02 1.006e+03, threshold=8.522e+02, percent-clipped=0.0 2023-06-25 08:01:49,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1290852.0, ans=0.1 2023-06-25 08:02:20,414 INFO [train.py:996] (1/4) Epoch 8, batch 1700, loss[loss=0.1895, simple_loss=0.2626, pruned_loss=0.05816, over 21448.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3003, pruned_loss=0.07335, over 4275403.73 frames. ], batch size: 195, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:03:14,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1291032.0, ans=0.1 2023-06-25 08:03:27,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1291092.0, ans=0.125 2023-06-25 08:03:43,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1291152.0, ans=0.2 2023-06-25 08:04:10,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1291212.0, ans=0.125 2023-06-25 08:04:17,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.08 vs. limit=5.0 2023-06-25 08:04:20,213 INFO [train.py:996] (1/4) Epoch 8, batch 1750, loss[loss=0.3055, simple_loss=0.3666, pruned_loss=0.1222, over 21382.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2987, pruned_loss=0.07105, over 4272444.06 frames. ], batch size: 471, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:04:34,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-25 08:05:26,793 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.271e+02 4.291e+02 6.912e+02 1.295e+03, threshold=8.582e+02, percent-clipped=12.0 2023-06-25 08:06:19,511 INFO [train.py:996] (1/4) Epoch 8, batch 1800, loss[loss=0.1938, simple_loss=0.2583, pruned_loss=0.06463, over 21492.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.299, pruned_loss=0.06992, over 4269311.50 frames. ], batch size: 212, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:06:38,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1291572.0, ans=0.0 2023-06-25 08:07:43,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1291752.0, ans=0.0 2023-06-25 08:08:10,403 INFO [train.py:996] (1/4) Epoch 8, batch 1850, loss[loss=0.1687, simple_loss=0.2442, pruned_loss=0.04656, over 21265.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2979, pruned_loss=0.06813, over 4268654.00 frames. 
], batch size: 176, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:08:42,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1291932.0, ans=0.1 2023-06-25 08:08:54,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-25 08:09:08,863 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.967e+02 5.452e+02 7.986e+02 1.937e+03, threshold=1.090e+03, percent-clipped=22.0 2023-06-25 08:09:11,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1291992.0, ans=0.125 2023-06-25 08:09:22,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1292052.0, ans=0.0 2023-06-25 08:09:31,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1292052.0, ans=0.0 2023-06-25 08:09:38,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1292112.0, ans=15.0 2023-06-25 08:09:54,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1292112.0, ans=0.125 2023-06-25 08:10:05,939 INFO [train.py:996] (1/4) Epoch 8, batch 1900, loss[loss=0.2058, simple_loss=0.2664, pruned_loss=0.07261, over 21196.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2991, pruned_loss=0.06972, over 4268314.04 frames. ], batch size: 144, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:10:48,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.79 vs. limit=10.0 2023-06-25 08:11:03,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1292352.0, ans=0.2 2023-06-25 08:11:43,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1292412.0, ans=0.1 2023-06-25 08:12:04,362 INFO [train.py:996] (1/4) Epoch 8, batch 1950, loss[loss=0.2339, simple_loss=0.2772, pruned_loss=0.0953, over 21362.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2936, pruned_loss=0.06905, over 4276595.17 frames. ], batch size: 507, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:12:05,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1292472.0, ans=0.125 2023-06-25 08:12:14,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1292472.0, ans=0.2 2023-06-25 08:12:36,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1292532.0, ans=0.0 2023-06-25 08:13:00,224 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.618e+02 4.190e+02 5.257e+02 7.093e+02 1.583e+03, threshold=1.051e+03, percent-clipped=6.0 2023-06-25 08:13:52,820 INFO [train.py:996] (1/4) Epoch 8, batch 2000, loss[loss=0.2066, simple_loss=0.3014, pruned_loss=0.05587, over 21581.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2907, pruned_loss=0.06645, over 4259716.56 frames. 
], batch size: 441, lr: 3.84e-03, grad_scale: 32.0 2023-06-25 08:14:02,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-06-25 08:14:07,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1292772.0, ans=0.0 2023-06-25 08:14:15,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1292832.0, ans=0.0 2023-06-25 08:14:57,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.90 vs. limit=15.0 2023-06-25 08:15:15,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1293012.0, ans=0.0 2023-06-25 08:15:44,225 INFO [train.py:996] (1/4) Epoch 8, batch 2050, loss[loss=0.2331, simple_loss=0.3164, pruned_loss=0.07489, over 21869.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2935, pruned_loss=0.06658, over 4263508.65 frames. ], batch size: 371, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:15:51,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1293072.0, ans=0.125 2023-06-25 08:16:25,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293192.0, ans=0.1 2023-06-25 08:16:39,106 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 4.169e+02 5.197e+02 7.491e+02 1.738e+03, threshold=1.039e+03, percent-clipped=10.0 2023-06-25 08:17:35,782 INFO [train.py:996] (1/4) Epoch 8, batch 2100, loss[loss=0.2616, simple_loss=0.3281, pruned_loss=0.09761, over 21635.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2954, pruned_loss=0.06772, over 4273810.52 frames. ], batch size: 471, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:17:40,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-25 08:18:02,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1293432.0, ans=0.2 2023-06-25 08:18:23,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1293492.0, ans=0.0 2023-06-25 08:18:36,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1293552.0, ans=12.0 2023-06-25 08:19:27,056 INFO [train.py:996] (1/4) Epoch 8, batch 2150, loss[loss=0.1816, simple_loss=0.2648, pruned_loss=0.0492, over 21595.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2935, pruned_loss=0.06897, over 4280785.44 frames. 
], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:20:01,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293792.0, ans=0.1 2023-06-25 08:20:22,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1293792.0, ans=0.125 2023-06-25 08:20:23,095 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.526e+02 3.335e+02 3.972e+02 5.687e+02 1.021e+03, threshold=7.943e+02, percent-clipped=0.0 2023-06-25 08:20:47,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1293852.0, ans=0.125 2023-06-25 08:21:11,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293912.0, ans=0.1 2023-06-25 08:21:16,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1293912.0, ans=0.125 2023-06-25 08:21:19,278 INFO [train.py:996] (1/4) Epoch 8, batch 2200, loss[loss=0.1972, simple_loss=0.2807, pruned_loss=0.05691, over 21764.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2973, pruned_loss=0.07006, over 4284785.42 frames. ], batch size: 247, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:21:28,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1293972.0, ans=0.0 2023-06-25 08:21:54,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1294092.0, ans=0.0 2023-06-25 08:22:02,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=15.0 2023-06-25 08:22:07,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1294092.0, ans=0.0 2023-06-25 08:22:17,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1294152.0, ans=0.0 2023-06-25 08:22:26,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-25 08:22:56,691 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:23:03,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1294212.0, ans=0.125 2023-06-25 08:23:08,643 INFO [train.py:996] (1/4) Epoch 8, batch 2250, loss[loss=0.1869, simple_loss=0.2531, pruned_loss=0.06037, over 21353.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2933, pruned_loss=0.06836, over 4282854.57 frames. ], batch size: 131, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:24:02,839 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.466e+02 3.638e+02 4.452e+02 6.050e+02 1.629e+03, threshold=8.904e+02, percent-clipped=11.0 2023-06-25 08:24:16,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1294452.0, ans=0.0 2023-06-25 08:24:48,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=12.0 2023-06-25 08:24:52,814 INFO [train.py:996] (1/4) Epoch 8, batch 2300, loss[loss=0.2199, simple_loss=0.285, pruned_loss=0.07737, over 21849.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2882, pruned_loss=0.06804, over 4269774.34 frames. ], batch size: 107, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:25:16,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1294632.0, ans=0.0 2023-06-25 08:25:24,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1294632.0, ans=0.04949747468305833 2023-06-25 08:25:53,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1294752.0, ans=0.1 2023-06-25 08:26:00,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1294752.0, ans=0.0 2023-06-25 08:26:16,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1294752.0, ans=0.125 2023-06-25 08:26:46,476 INFO [train.py:996] (1/4) Epoch 8, batch 2350, loss[loss=0.1963, simple_loss=0.2814, pruned_loss=0.05565, over 21704.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2882, pruned_loss=0.06854, over 4268131.20 frames. ], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:26:47,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1294872.0, ans=0.0 2023-06-25 08:27:41,217 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 4.172e+02 5.399e+02 7.196e+02 1.286e+03, threshold=1.080e+03, percent-clipped=11.0 2023-06-25 08:28:37,751 INFO [train.py:996] (1/4) Epoch 8, batch 2400, loss[loss=0.2311, simple_loss=0.3143, pruned_loss=0.07393, over 21367.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2894, pruned_loss=0.06975, over 4268132.87 frames. ], batch size: 131, lr: 3.84e-03, grad_scale: 32.0 2023-06-25 08:28:55,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1295232.0, ans=10.0 2023-06-25 08:29:14,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1295232.0, ans=0.125 2023-06-25 08:29:14,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1295232.0, ans=0.09899494936611666 2023-06-25 08:29:46,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1295352.0, ans=0.125 2023-06-25 08:30:27,362 INFO [train.py:996] (1/4) Epoch 8, batch 2450, loss[loss=0.1873, simple_loss=0.2576, pruned_loss=0.05844, over 21630.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2938, pruned_loss=0.07222, over 4267781.51 frames. 
], batch size: 282, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:30:55,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1295532.0, ans=0.0 2023-06-25 08:31:24,808 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.841e+02 6.208e+02 9.164e+02 1.809e+03, threshold=1.242e+03, percent-clipped=16.0 2023-06-25 08:31:52,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1295652.0, ans=0.125 2023-06-25 08:32:08,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1295712.0, ans=0.0 2023-06-25 08:32:12,772 INFO [train.py:996] (1/4) Epoch 8, batch 2500, loss[loss=0.2111, simple_loss=0.2816, pruned_loss=0.07025, over 21828.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2937, pruned_loss=0.07199, over 4262688.26 frames. ], batch size: 107, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:32:39,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1295832.0, ans=0.1 2023-06-25 08:32:52,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1295892.0, ans=0.1 2023-06-25 08:33:09,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-25 08:33:52,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1296012.0, ans=0.125 2023-06-25 08:33:59,230 INFO [train.py:996] (1/4) Epoch 8, batch 2550, loss[loss=0.1849, simple_loss=0.2488, pruned_loss=0.06048, over 21551.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2918, pruned_loss=0.07173, over 4259420.33 frames. ], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:33:59,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1296072.0, ans=0.2 2023-06-25 08:33:59,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1296072.0, ans=0.125 2023-06-25 08:34:52,361 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-25 08:34:56,188 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.632e+02 3.347e+02 3.968e+02 6.148e+02 1.129e+03, threshold=7.936e+02, percent-clipped=0.0 2023-06-25 08:35:13,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1296252.0, ans=10.0 2023-06-25 08:35:49,663 INFO [train.py:996] (1/4) Epoch 8, batch 2600, loss[loss=0.2256, simple_loss=0.2887, pruned_loss=0.08128, over 19926.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2936, pruned_loss=0.07372, over 4267831.93 frames. 
], batch size: 702, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:36:09,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1296432.0, ans=0.125 2023-06-25 08:36:11,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1296432.0, ans=0.5 2023-06-25 08:36:35,050 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-25 08:36:43,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1296492.0, ans=0.125 2023-06-25 08:37:33,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1296612.0, ans=0.125 2023-06-25 08:37:40,593 INFO [train.py:996] (1/4) Epoch 8, batch 2650, loss[loss=0.2333, simple_loss=0.3032, pruned_loss=0.08171, over 21859.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2957, pruned_loss=0.07466, over 4270788.17 frames. ], batch size: 351, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:38:10,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. limit=6.0 2023-06-25 08:38:37,354 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.828e+02 4.857e+02 7.020e+02 1.360e+03, threshold=9.714e+02, percent-clipped=21.0 2023-06-25 08:38:57,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-25 08:39:07,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1296912.0, ans=0.2 2023-06-25 08:39:20,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1296912.0, ans=0.125 2023-06-25 08:39:24,881 INFO [train.py:996] (1/4) Epoch 8, batch 2700, loss[loss=0.2127, simple_loss=0.2903, pruned_loss=0.06759, over 21817.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2935, pruned_loss=0.07322, over 4273850.54 frames. ], batch size: 371, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:39:42,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1297032.0, ans=0.125 2023-06-25 08:40:18,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1297092.0, ans=0.125 2023-06-25 08:40:59,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=12.0 2023-06-25 08:41:00,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1297212.0, ans=0.0 2023-06-25 08:41:17,919 INFO [train.py:996] (1/4) Epoch 8, batch 2750, loss[loss=0.2149, simple_loss=0.3082, pruned_loss=0.06079, over 21847.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2939, pruned_loss=0.07257, over 4273543.03 frames. 
], batch size: 371, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:41:31,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-25 08:41:33,937 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=15.0 2023-06-25 08:41:57,525 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:42:13,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1297392.0, ans=0.0 2023-06-25 08:42:27,787 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.672e+02 4.055e+02 5.362e+02 7.595e+02 1.481e+03, threshold=1.072e+03, percent-clipped=12.0 2023-06-25 08:42:28,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1297452.0, ans=0.0 2023-06-25 08:42:41,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1297452.0, ans=0.07 2023-06-25 08:42:53,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-25 08:42:57,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1297512.0, ans=0.125 2023-06-25 08:43:11,596 INFO [train.py:996] (1/4) Epoch 8, batch 2800, loss[loss=0.2033, simple_loss=0.2698, pruned_loss=0.06835, over 21259.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2988, pruned_loss=0.07466, over 4266315.38 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:43:53,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-06-25 08:43:56,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1297692.0, ans=10.0 2023-06-25 08:43:58,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1297692.0, ans=0.025 2023-06-25 08:44:28,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1297752.0, ans=0.2 2023-06-25 08:44:37,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1297752.0, ans=0.2 2023-06-25 08:44:41,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-25 08:44:46,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1297812.0, ans=0.125 2023-06-25 08:44:55,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1297812.0, ans=0.125 2023-06-25 08:44:55,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. 
limit=6.0 2023-06-25 08:44:59,917 INFO [train.py:996] (1/4) Epoch 8, batch 2850, loss[loss=0.2047, simple_loss=0.2869, pruned_loss=0.0612, over 21797.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2983, pruned_loss=0.075, over 4257733.24 frames. ], batch size: 371, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:45:00,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1297872.0, ans=0.125 2023-06-25 08:45:28,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1297932.0, ans=0.125 2023-06-25 08:45:55,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1297992.0, ans=0.125 2023-06-25 08:46:13,108 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.766e+02 3.662e+02 5.066e+02 7.139e+02 1.545e+03, threshold=1.013e+03, percent-clipped=5.0 2023-06-25 08:46:50,050 INFO [train.py:996] (1/4) Epoch 8, batch 2900, loss[loss=0.216, simple_loss=0.2813, pruned_loss=0.07536, over 21694.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2949, pruned_loss=0.07442, over 4265777.37 frames. ], batch size: 263, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:46:53,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1298172.0, ans=0.125 2023-06-25 08:47:20,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1298232.0, ans=0.0 2023-06-25 08:47:23,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1298232.0, ans=0.2 2023-06-25 08:47:33,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1298292.0, ans=0.04949747468305833 2023-06-25 08:48:36,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-25 08:48:42,068 INFO [train.py:996] (1/4) Epoch 8, batch 2950, loss[loss=0.2305, simple_loss=0.3312, pruned_loss=0.06491, over 20817.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2983, pruned_loss=0.07428, over 4280533.16 frames. ], batch size: 607, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:49:40,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.61 vs. limit=10.0 2023-06-25 08:49:41,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1298592.0, ans=0.0 2023-06-25 08:49:57,022 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.695e+02 3.497e+02 4.851e+02 7.009e+02 1.350e+03, threshold=9.702e+02, percent-clipped=11.0 2023-06-25 08:50:33,466 INFO [train.py:996] (1/4) Epoch 8, batch 3000, loss[loss=0.2427, simple_loss=0.3193, pruned_loss=0.08307, over 21913.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3032, pruned_loss=0.07495, over 4286113.26 frames. 
], batch size: 316, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:50:33,467 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 08:50:50,230 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.7632, 4.3846, 4.5064, 3.4833], device='cuda:1') 2023-06-25 08:50:54,959 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2557, simple_loss=0.3462, pruned_loss=0.08265, over 1796401.00 frames. 2023-06-25 08:50:54,960 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 08:52:03,167 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:52:31,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-25 08:52:40,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1299012.0, ans=0.125 2023-06-25 08:52:45,445 INFO [train.py:996] (1/4) Epoch 8, batch 3050, loss[loss=0.1831, simple_loss=0.268, pruned_loss=0.04904, over 21653.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3013, pruned_loss=0.07246, over 4290696.91 frames. ], batch size: 263, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:53:20,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-25 08:53:38,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1299192.0, ans=0.0 2023-06-25 08:53:43,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1299192.0, ans=0.0 2023-06-25 08:53:55,754 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.327e+02 3.997e+02 5.438e+02 1.383e+03, threshold=7.994e+02, percent-clipped=4.0 2023-06-25 08:54:30,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1299312.0, ans=0.5 2023-06-25 08:54:37,055 INFO [train.py:996] (1/4) Epoch 8, batch 3100, loss[loss=0.2333, simple_loss=0.3276, pruned_loss=0.06952, over 21664.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3033, pruned_loss=0.0724, over 4292038.39 frames. ], batch size: 414, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:55:06,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-25 08:55:32,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299492.0, ans=0.1 2023-06-25 08:55:51,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1299552.0, ans=0.1 2023-06-25 08:56:03,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1299552.0, ans=0.0 2023-06-25 08:56:15,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1299612.0, ans=0.125 2023-06-25 08:56:39,306 INFO [train.py:996] (1/4) Epoch 8, batch 3150, loss[loss=0.2388, simple_loss=0.319, pruned_loss=0.07935, over 21655.00 frames. 
], tot_loss[loss=0.2255, simple_loss=0.3053, pruned_loss=0.07289, over 4285864.44 frames. ], batch size: 351, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:57:42,389 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-25 08:57:44,761 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.426e+02 4.350e+02 5.969e+02 1.538e+03, threshold=8.700e+02, percent-clipped=12.0 2023-06-25 08:58:23,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1299912.0, ans=0.125 2023-06-25 08:58:36,646 INFO [train.py:996] (1/4) Epoch 8, batch 3200, loss[loss=0.1959, simple_loss=0.2751, pruned_loss=0.05837, over 21319.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3044, pruned_loss=0.0721, over 4284212.56 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:59:13,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1300092.0, ans=0.0 2023-06-25 09:00:28,333 INFO [train.py:996] (1/4) Epoch 8, batch 3250, loss[loss=0.2302, simple_loss=0.2997, pruned_loss=0.08031, over 21202.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3055, pruned_loss=0.07363, over 4282384.59 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:00:32,688 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:00:50,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1300332.0, ans=0.2 2023-06-25 09:01:30,043 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 3.942e+02 5.285e+02 9.066e+02 2.066e+03, threshold=1.057e+03, percent-clipped=29.0 2023-06-25 09:02:20,242 INFO [train.py:996] (1/4) Epoch 8, batch 3300, loss[loss=0.2041, simple_loss=0.2815, pruned_loss=0.06331, over 21169.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3006, pruned_loss=0.07283, over 4278810.89 frames. ], batch size: 159, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:02:22,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1300572.0, ans=0.0 2023-06-25 09:02:26,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1300572.0, ans=0.1 2023-06-25 09:03:58,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1300812.0, ans=0.0 2023-06-25 09:04:11,682 INFO [train.py:996] (1/4) Epoch 8, batch 3350, loss[loss=0.2225, simple_loss=0.2877, pruned_loss=0.07867, over 21360.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3018, pruned_loss=0.0737, over 4280061.67 frames. 
], batch size: 549, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:05:00,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1300992.0, ans=0.125 2023-06-25 09:05:02,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1300992.0, ans=0.1 2023-06-25 09:05:23,410 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 4.026e+02 5.637e+02 8.126e+02 1.843e+03, threshold=1.127e+03, percent-clipped=12.0 2023-06-25 09:05:23,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1301052.0, ans=0.1 2023-06-25 09:05:33,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=1301052.0, ans=15.0 2023-06-25 09:05:34,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1301052.0, ans=0.125 2023-06-25 09:06:01,863 INFO [train.py:996] (1/4) Epoch 8, batch 3400, loss[loss=0.2339, simple_loss=0.3188, pruned_loss=0.07444, over 21711.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3021, pruned_loss=0.07422, over 4283341.71 frames. ], batch size: 414, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:06:04,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1301172.0, ans=0.125 2023-06-25 09:06:11,967 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-25 09:06:31,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1301232.0, ans=0.125 2023-06-25 09:06:39,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1301232.0, ans=0.0 2023-06-25 09:06:54,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1301292.0, ans=0.2 2023-06-25 09:07:07,756 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=22.5 2023-06-25 09:07:48,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1301412.0, ans=0.1 2023-06-25 09:07:55,861 INFO [train.py:996] (1/4) Epoch 8, batch 3450, loss[loss=0.2676, simple_loss=0.3526, pruned_loss=0.09132, over 21827.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2994, pruned_loss=0.07405, over 4283177.85 frames. ], batch size: 317, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:08:30,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1301532.0, ans=0.1 2023-06-25 09:09:13,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 3.588e+02 4.974e+02 7.725e+02 1.763e+03, threshold=9.948e+02, percent-clipped=11.0 2023-06-25 09:09:53,622 INFO [train.py:996] (1/4) Epoch 8, batch 3500, loss[loss=0.2158, simple_loss=0.3094, pruned_loss=0.06113, over 21431.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3081, pruned_loss=0.07807, over 4284355.27 frames. 
], batch size: 211, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:09:57,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1301772.0, ans=0.125 2023-06-25 09:11:02,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1301952.0, ans=0.04949747468305833 2023-06-25 09:11:11,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1301952.0, ans=0.5 2023-06-25 09:11:13,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1301952.0, ans=0.0 2023-06-25 09:11:22,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1302012.0, ans=0.125 2023-06-25 09:11:43,840 INFO [train.py:996] (1/4) Epoch 8, batch 3550, loss[loss=0.1826, simple_loss=0.2546, pruned_loss=0.05525, over 19908.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.311, pruned_loss=0.07902, over 4280084.15 frames. ], batch size: 703, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:12:55,358 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 4.003e+02 5.383e+02 7.230e+02 1.174e+03, threshold=1.077e+03, percent-clipped=7.0 2023-06-25 09:13:19,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1302312.0, ans=0.07 2023-06-25 09:13:28,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1302312.0, ans=0.125 2023-06-25 09:13:35,442 INFO [train.py:996] (1/4) Epoch 8, batch 3600, loss[loss=0.2287, simple_loss=0.3003, pruned_loss=0.07858, over 21265.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3057, pruned_loss=0.07841, over 4275789.07 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:13:37,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1302372.0, ans=0.0 2023-06-25 09:14:27,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=12.0 2023-06-25 09:14:49,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1302552.0, ans=0.1 2023-06-25 09:15:03,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1302612.0, ans=0.2 2023-06-25 09:15:18,542 INFO [train.py:996] (1/4) Epoch 8, batch 3650, loss[loss=0.2336, simple_loss=0.2925, pruned_loss=0.08731, over 21336.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3058, pruned_loss=0.07852, over 4278044.92 frames. 
], batch size: 549, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:15:55,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1302732.0, ans=0.0 2023-06-25 09:16:31,574 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.652e+02 4.088e+02 5.545e+02 7.819e+02 1.547e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-25 09:16:45,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1302912.0, ans=0.125 2023-06-25 09:16:59,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1302912.0, ans=0.1 2023-06-25 09:17:00,324 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-25 09:17:08,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1302972.0, ans=0.125 2023-06-25 09:17:09,593 INFO [train.py:996] (1/4) Epoch 8, batch 3700, loss[loss=0.2534, simple_loss=0.3202, pruned_loss=0.09331, over 21609.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3041, pruned_loss=0.0776, over 4273963.71 frames. ], batch size: 471, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:17:49,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1303032.0, ans=0.0 2023-06-25 09:19:01,402 INFO [train.py:996] (1/4) Epoch 8, batch 3750, loss[loss=0.1782, simple_loss=0.2619, pruned_loss=0.04729, over 21629.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3024, pruned_loss=0.07593, over 4278347.52 frames. ], batch size: 263, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:19:33,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1303332.0, ans=0.125 2023-06-25 09:19:38,313 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:19:38,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=12.0 2023-06-25 09:20:16,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-25 09:20:21,191 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.272e+02 4.501e+02 6.560e+02 9.292e+02, threshold=9.001e+02, percent-clipped=0.0 2023-06-25 09:20:58,382 INFO [train.py:996] (1/4) Epoch 8, batch 3800, loss[loss=0.2711, simple_loss=0.375, pruned_loss=0.08364, over 19870.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2998, pruned_loss=0.07393, over 4275323.14 frames. ], batch size: 702, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:21:15,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1303572.0, ans=0.1 2023-06-25 09:21:38,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1303632.0, ans=0.125 2023-06-25 09:21:45,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. 
limit=15.0 2023-06-25 09:22:19,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1303752.0, ans=0.0 2023-06-25 09:22:40,690 INFO [train.py:996] (1/4) Epoch 8, batch 3850, loss[loss=0.1931, simple_loss=0.2642, pruned_loss=0.06095, over 21424.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2973, pruned_loss=0.07428, over 4276256.74 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:22:51,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1303872.0, ans=0.125 2023-06-25 09:23:06,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1303872.0, ans=0.0 2023-06-25 09:23:28,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1303932.0, ans=0.125 2023-06-25 09:23:59,527 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.585e+02 3.373e+02 4.487e+02 6.167e+02 2.000e+03, threshold=8.974e+02, percent-clipped=6.0 2023-06-25 09:24:02,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1304052.0, ans=0.125 2023-06-25 09:24:10,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1304052.0, ans=0.015 2023-06-25 09:24:17,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1304112.0, ans=0.125 2023-06-25 09:24:31,274 INFO [train.py:996] (1/4) Epoch 8, batch 3900, loss[loss=0.2053, simple_loss=0.283, pruned_loss=0.06383, over 21883.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2946, pruned_loss=0.07431, over 4280749.36 frames. ], batch size: 351, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:25:19,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-25 09:25:40,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1304352.0, ans=0.125 2023-06-25 09:26:05,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1304412.0, ans=0.0 2023-06-25 09:26:12,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1304412.0, ans=0.0 2023-06-25 09:26:27,158 INFO [train.py:996] (1/4) Epoch 8, batch 3950, loss[loss=0.1724, simple_loss=0.2624, pruned_loss=0.04123, over 21638.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2964, pruned_loss=0.0741, over 4284156.74 frames. 
], batch size: 263, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:27:05,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1304532.0, ans=0.2 2023-06-25 09:27:12,380 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:27:32,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1304592.0, ans=0.125 2023-06-25 09:27:38,822 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.574e+02 3.686e+02 5.186e+02 7.402e+02 1.424e+03, threshold=1.037e+03, percent-clipped=9.0 2023-06-25 09:27:49,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.35 vs. limit=15.0 2023-06-25 09:28:16,205 INFO [train.py:996] (1/4) Epoch 8, batch 4000, loss[loss=0.2016, simple_loss=0.2676, pruned_loss=0.06777, over 21442.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2898, pruned_loss=0.07044, over 4284309.06 frames. ], batch size: 389, lr: 3.82e-03, grad_scale: 32.0 2023-06-25 09:28:30,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-25 09:28:59,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.62 vs. limit=15.0 2023-06-25 09:30:03,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1305012.0, ans=0.125 2023-06-25 09:30:11,497 INFO [train.py:996] (1/4) Epoch 8, batch 4050, loss[loss=0.2102, simple_loss=0.2892, pruned_loss=0.06565, over 21387.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2901, pruned_loss=0.06913, over 4278681.89 frames. ], batch size: 194, lr: 3.82e-03, grad_scale: 32.0 2023-06-25 09:30:34,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1305132.0, ans=0.02 2023-06-25 09:30:44,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1305132.0, ans=0.025 2023-06-25 09:31:18,960 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.508e+02 3.803e+02 4.888e+02 6.657e+02 1.371e+03, threshold=9.776e+02, percent-clipped=4.0 2023-06-25 09:31:39,555 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.79 vs. limit=5.0 2023-06-25 09:31:59,981 INFO [train.py:996] (1/4) Epoch 8, batch 4100, loss[loss=0.1892, simple_loss=0.2732, pruned_loss=0.0526, over 21251.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.29, pruned_loss=0.06953, over 4280464.61 frames. 
], batch size: 176, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:32:58,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1305492.0, ans=0.125 2023-06-25 09:33:05,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1305552.0, ans=0.125 2023-06-25 09:33:39,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1305612.0, ans=0.1 2023-06-25 09:33:44,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1305612.0, ans=0.05 2023-06-25 09:33:48,846 INFO [train.py:996] (1/4) Epoch 8, batch 4150, loss[loss=0.1666, simple_loss=0.2579, pruned_loss=0.03762, over 21463.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2902, pruned_loss=0.06662, over 4281408.47 frames. ], batch size: 195, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:33:57,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1305672.0, ans=0.125 2023-06-25 09:34:07,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-25 09:34:39,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1305792.0, ans=0.07 2023-06-25 09:35:00,802 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.172e+02 3.844e+02 5.295e+02 7.953e+02, threshold=7.689e+02, percent-clipped=0.0 2023-06-25 09:35:02,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1305852.0, ans=0.125 2023-06-25 09:35:17,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1305912.0, ans=0.1 2023-06-25 09:35:35,835 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:35:41,057 INFO [train.py:996] (1/4) Epoch 8, batch 4200, loss[loss=0.2615, simple_loss=0.3677, pruned_loss=0.07765, over 21558.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2908, pruned_loss=0.06641, over 4276784.21 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:35:57,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-25 09:36:26,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0 2023-06-25 09:36:49,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1306152.0, ans=0.125 2023-06-25 09:36:53,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1306152.0, ans=0.2 2023-06-25 09:37:12,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.17 vs. limit=12.0 2023-06-25 09:37:38,050 INFO [train.py:996] (1/4) Epoch 8, batch 4250, loss[loss=0.2523, simple_loss=0.3294, pruned_loss=0.08757, over 21746.00 frames. 
], tot_loss[loss=0.2162, simple_loss=0.2972, pruned_loss=0.06757, over 4278709.56 frames. ], batch size: 247, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:37:56,196 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-25 09:38:57,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.607e+02 4.053e+02 6.185e+02 8.917e+02 1.733e+03, threshold=1.237e+03, percent-clipped=33.0 2023-06-25 09:39:27,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1306512.0, ans=0.2 2023-06-25 09:39:27,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1306512.0, ans=0.125 2023-06-25 09:39:27,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1306512.0, ans=0.2 2023-06-25 09:39:38,305 INFO [train.py:996] (1/4) Epoch 8, batch 4300, loss[loss=0.2238, simple_loss=0.3142, pruned_loss=0.06673, over 21828.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3039, pruned_loss=0.07038, over 4280611.95 frames. ], batch size: 282, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:39:50,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2023-06-25 09:41:28,177 INFO [train.py:996] (1/4) Epoch 8, batch 4350, loss[loss=0.253, simple_loss=0.3181, pruned_loss=0.09399, over 21327.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3043, pruned_loss=0.07028, over 4278717.84 frames. ], batch size: 471, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:41:49,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1306932.0, ans=0.125 2023-06-25 09:42:44,510 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.580e+02 4.513e+02 6.539e+02 1.169e+03, threshold=9.025e+02, percent-clipped=0.0 2023-06-25 09:43:19,209 INFO [train.py:996] (1/4) Epoch 8, batch 4400, loss[loss=0.2201, simple_loss=0.3139, pruned_loss=0.06318, over 21724.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2983, pruned_loss=0.07016, over 4273114.65 frames. ], batch size: 298, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:43:21,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1307172.0, ans=0.125 2023-06-25 09:43:38,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-25 09:43:41,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1307232.0, ans=0.125 2023-06-25 09:43:44,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.50 vs. limit=15.0 2023-06-25 09:44:05,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1307292.0, ans=0.125 2023-06-25 09:44:15,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=15.0 2023-06-25 09:44:18,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1307292.0, ans=0.125 2023-06-25 09:44:49,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1307352.0, ans=0.125 2023-06-25 09:44:54,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-25 09:45:16,019 INFO [train.py:996] (1/4) Epoch 8, batch 4450, loss[loss=0.2522, simple_loss=0.3514, pruned_loss=0.0765, over 21733.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3066, pruned_loss=0.07137, over 4276237.19 frames. ], batch size: 298, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:46:28,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-25 09:46:32,170 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 3.788e+02 5.957e+02 8.951e+02 1.705e+03, threshold=1.191e+03, percent-clipped=23.0 2023-06-25 09:46:33,013 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:46:47,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-25 09:47:06,081 INFO [train.py:996] (1/4) Epoch 8, batch 4500, loss[loss=0.1946, simple_loss=0.2968, pruned_loss=0.0462, over 20856.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3081, pruned_loss=0.07375, over 4281848.44 frames. ], batch size: 608, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:47:06,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1307772.0, ans=0.125 2023-06-25 09:47:27,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1307832.0, ans=0.04949747468305833 2023-06-25 09:48:12,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1307892.0, ans=0.0 2023-06-25 09:48:12,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1307892.0, ans=0.2 2023-06-25 09:48:30,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1307952.0, ans=0.125 2023-06-25 09:48:54,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1308072.0, ans=0.0 2023-06-25 09:48:56,010 INFO [train.py:996] (1/4) Epoch 8, batch 4550, loss[loss=0.2858, simple_loss=0.3554, pruned_loss=0.1081, over 21818.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3091, pruned_loss=0.07354, over 4284637.38 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:49:12,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. 
limit=15.0 2023-06-25 09:49:17,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1308072.0, ans=0.0 2023-06-25 09:49:26,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-25 09:49:45,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1308132.0, ans=0.125 2023-06-25 09:50:18,032 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.595e+02 3.343e+02 4.134e+02 5.307e+02 1.038e+03, threshold=8.269e+02, percent-clipped=0.0 2023-06-25 09:50:52,057 INFO [train.py:996] (1/4) Epoch 8, batch 4600, loss[loss=0.2195, simple_loss=0.3026, pruned_loss=0.06824, over 21485.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3126, pruned_loss=0.07579, over 4280080.64 frames. ], batch size: 131, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:51:22,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1308432.0, ans=0.1 2023-06-25 09:51:38,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1308492.0, ans=0.125 2023-06-25 09:51:42,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-25 09:52:28,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1308612.0, ans=0.125 2023-06-25 09:52:42,570 INFO [train.py:996] (1/4) Epoch 8, batch 4650, loss[loss=0.2299, simple_loss=0.2941, pruned_loss=0.0829, over 21755.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3077, pruned_loss=0.07408, over 4275149.09 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:52:42,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1308672.0, ans=0.0 2023-06-25 09:53:18,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1308732.0, ans=0.0 2023-06-25 09:53:23,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1308732.0, ans=0.1 2023-06-25 09:53:27,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-25 09:53:35,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1308792.0, ans=0.0 2023-06-25 09:53:47,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1308792.0, ans=0.125 2023-06-25 09:53:59,081 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.213e+02 3.806e+02 5.357e+02 1.908e+03, threshold=7.612e+02, percent-clipped=10.0 2023-06-25 09:54:02,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-25 09:54:31,183 INFO [train.py:996] (1/4) Epoch 8, batch 4700, loss[loss=0.1897, simple_loss=0.2525, pruned_loss=0.0635, over 21324.00 frames. 
], tot_loss[loss=0.2202, simple_loss=0.2979, pruned_loss=0.07127, over 4275016.13 frames. ], batch size: 131, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:54:31,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1308972.0, ans=0.0 2023-06-25 09:55:07,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309032.0, ans=0.1 2023-06-25 09:55:12,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1309032.0, ans=0.025 2023-06-25 09:55:14,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1309032.0, ans=0.0 2023-06-25 09:55:19,981 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-25 09:55:42,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1309152.0, ans=0.125 2023-06-25 09:56:03,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1309212.0, ans=0.125 2023-06-25 09:56:21,241 INFO [train.py:996] (1/4) Epoch 8, batch 4750, loss[loss=0.2026, simple_loss=0.2675, pruned_loss=0.0689, over 21537.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2913, pruned_loss=0.07071, over 4282634.37 frames. ], batch size: 212, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:56:33,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1309272.0, ans=0.125 2023-06-25 09:57:20,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1309392.0, ans=0.05 2023-06-25 09:57:26,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1309392.0, ans=0.125 2023-06-25 09:57:35,148 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-25 09:57:39,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.746e+02 3.551e+02 4.538e+02 6.106e+02 1.235e+03, threshold=9.075e+02, percent-clipped=15.0 2023-06-25 09:58:17,096 INFO [train.py:996] (1/4) Epoch 8, batch 4800, loss[loss=0.2185, simple_loss=0.335, pruned_loss=0.05097, over 21208.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2911, pruned_loss=0.07101, over 4285148.62 frames. ], batch size: 548, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:58:18,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-06-25 09:58:28,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1309572.0, ans=0.04949747468305833 2023-06-25 09:59:33,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1309752.0, ans=0.0 2023-06-25 09:59:59,467 INFO [train.py:996] (1/4) Epoch 8, batch 4850, loss[loss=0.1993, simple_loss=0.2716, pruned_loss=0.06353, over 20326.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2918, pruned_loss=0.07074, over 4289293.81 frames. 
], batch size: 703, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:00:21,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1309872.0, ans=0.1 2023-06-25 10:01:16,357 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.775e+02 3.669e+02 4.660e+02 6.748e+02 1.065e+03, threshold=9.320e+02, percent-clipped=5.0 2023-06-25 10:01:17,638 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-25 10:01:48,358 INFO [train.py:996] (1/4) Epoch 8, batch 4900, loss[loss=0.2356, simple_loss=0.3303, pruned_loss=0.0704, over 21716.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.294, pruned_loss=0.07166, over 4285445.35 frames. ], batch size: 298, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:02:01,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1310172.0, ans=0.09899494936611666 2023-06-25 10:02:21,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1310232.0, ans=0.1 2023-06-25 10:03:47,913 INFO [train.py:996] (1/4) Epoch 8, batch 4950, loss[loss=0.1822, simple_loss=0.2763, pruned_loss=0.04411, over 19906.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2981, pruned_loss=0.07001, over 4288072.42 frames. ], batch size: 703, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:03:57,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1310472.0, ans=0.0 2023-06-25 10:05:00,801 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.071e+02 4.183e+02 5.786e+02 1.763e+03, threshold=8.366e+02, percent-clipped=8.0 2023-06-25 10:05:37,449 INFO [train.py:996] (1/4) Epoch 8, batch 5000, loss[loss=0.2589, simple_loss=0.3172, pruned_loss=0.1003, over 21700.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2978, pruned_loss=0.06766, over 4281047.86 frames. ], batch size: 507, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:05:43,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1310772.0, ans=0.025 2023-06-25 10:06:10,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1310892.0, ans=0.0 2023-06-25 10:07:19,125 INFO [train.py:996] (1/4) Epoch 8, batch 5050, loss[loss=0.2227, simple_loss=0.3454, pruned_loss=0.05003, over 20844.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2985, pruned_loss=0.06971, over 4290577.66 frames. ], batch size: 607, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:08:25,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1311252.0, ans=0.2 2023-06-25 10:08:30,094 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.598e+02 4.329e+02 6.155e+02 1.761e+03, threshold=8.658e+02, percent-clipped=10.0 2023-06-25 10:09:07,135 INFO [train.py:996] (1/4) Epoch 8, batch 5100, loss[loss=0.2285, simple_loss=0.3158, pruned_loss=0.07065, over 20067.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2965, pruned_loss=0.0705, over 4287253.54 frames. 
], batch size: 703, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:09:54,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1311492.0, ans=0.125 2023-06-25 10:10:52,973 INFO [train.py:996] (1/4) Epoch 8, batch 5150, loss[loss=0.267, simple_loss=0.3116, pruned_loss=0.1112, over 21845.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2945, pruned_loss=0.07165, over 4295272.64 frames. ], batch size: 508, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:11:22,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1311732.0, ans=0.125 2023-06-25 10:11:44,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1311792.0, ans=0.125 2023-06-25 10:11:57,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1311792.0, ans=0.0 2023-06-25 10:12:11,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 3.617e+02 5.481e+02 7.313e+02 1.650e+03, threshold=1.096e+03, percent-clipped=16.0 2023-06-25 10:12:48,514 INFO [train.py:996] (1/4) Epoch 8, batch 5200, loss[loss=0.2209, simple_loss=0.3194, pruned_loss=0.06126, over 21646.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2952, pruned_loss=0.07174, over 4293319.01 frames. ], batch size: 263, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:13:03,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1311972.0, ans=0.125 2023-06-25 10:13:26,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1312032.0, ans=0.0 2023-06-25 10:13:53,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1312152.0, ans=0.125 2023-06-25 10:14:09,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1312152.0, ans=0.125 2023-06-25 10:14:18,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1312212.0, ans=0.125 2023-06-25 10:14:43,358 INFO [train.py:996] (1/4) Epoch 8, batch 5250, loss[loss=0.218, simple_loss=0.3133, pruned_loss=0.06141, over 21637.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2975, pruned_loss=0.06969, over 4290630.90 frames. ], batch size: 389, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:14:58,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-25 10:15:06,234 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:15:53,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.659e+02 3.587e+02 4.772e+02 6.547e+02 1.598e+03, threshold=9.543e+02, percent-clipped=4.0 2023-06-25 10:15:58,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.07 vs. 
limit=15.0 2023-06-25 10:16:28,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1312572.0, ans=0.125 2023-06-25 10:16:29,955 INFO [train.py:996] (1/4) Epoch 8, batch 5300, loss[loss=0.2011, simple_loss=0.2726, pruned_loss=0.06479, over 21687.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2964, pruned_loss=0.0702, over 4297777.05 frames. ], batch size: 263, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:16:56,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1312632.0, ans=0.125 2023-06-25 10:16:58,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1312632.0, ans=0.2 2023-06-25 10:18:17,008 INFO [train.py:996] (1/4) Epoch 8, batch 5350, loss[loss=0.2152, simple_loss=0.2859, pruned_loss=0.07222, over 21363.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2951, pruned_loss=0.07122, over 4300896.08 frames. ], batch size: 131, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:18:42,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-25 10:18:59,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1312992.0, ans=0.0 2023-06-25 10:19:05,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1312992.0, ans=0.0 2023-06-25 10:19:28,374 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.684e+02 3.542e+02 4.424e+02 5.994e+02 1.106e+03, threshold=8.848e+02, percent-clipped=4.0 2023-06-25 10:19:43,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1313112.0, ans=0.2 2023-06-25 10:20:05,536 INFO [train.py:996] (1/4) Epoch 8, batch 5400, loss[loss=0.2169, simple_loss=0.2903, pruned_loss=0.07169, over 21583.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2942, pruned_loss=0.07227, over 4303550.41 frames. ], batch size: 131, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:20:44,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1313292.0, ans=0.1 2023-06-25 10:21:08,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-25 10:21:19,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1313352.0, ans=0.125 2023-06-25 10:21:22,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1313352.0, ans=0.1 2023-06-25 10:21:55,115 INFO [train.py:996] (1/4) Epoch 8, batch 5450, loss[loss=0.1946, simple_loss=0.2846, pruned_loss=0.05235, over 21817.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2962, pruned_loss=0.07063, over 4299464.70 frames. 
], batch size: 124, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:22:18,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1313532.0, ans=0.1 2023-06-25 10:22:27,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1313532.0, ans=0.125 2023-06-25 10:22:32,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1313532.0, ans=0.125 2023-06-25 10:22:52,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1313592.0, ans=0.0 2023-06-25 10:23:15,416 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.381e+02 6.345e+02 1.127e+03 2.400e+03, threshold=1.269e+03, percent-clipped=34.0 2023-06-25 10:23:25,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1313712.0, ans=0.0 2023-06-25 10:23:45,599 INFO [train.py:996] (1/4) Epoch 8, batch 5500, loss[loss=0.2054, simple_loss=0.2982, pruned_loss=0.05627, over 21574.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.3008, pruned_loss=0.06803, over 4297173.58 frames. ], batch size: 230, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:24:00,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1313772.0, ans=0.0 2023-06-25 10:25:16,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1314012.0, ans=0.125 2023-06-25 10:25:18,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1314012.0, ans=0.0 2023-06-25 10:25:23,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1314012.0, ans=0.1 2023-06-25 10:25:35,631 INFO [train.py:996] (1/4) Epoch 8, batch 5550, loss[loss=0.2269, simple_loss=0.3507, pruned_loss=0.05156, over 19834.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.3006, pruned_loss=0.06556, over 4290289.40 frames. ], batch size: 703, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:25:36,931 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-25 10:25:42,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1314072.0, ans=0.125 2023-06-25 10:26:00,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1314132.0, ans=0.0 2023-06-25 10:26:37,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1314192.0, ans=0.2 2023-06-25 10:27:03,714 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.121e+02 4.354e+02 6.729e+02 1.471e+03, threshold=8.708e+02, percent-clipped=1.0 2023-06-25 10:27:20,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-25 10:27:26,990 INFO [train.py:996] (1/4) Epoch 8, batch 5600, loss[loss=0.1914, simple_loss=0.269, pruned_loss=0.05687, over 21164.00 frames. 
], tot_loss[loss=0.2115, simple_loss=0.2974, pruned_loss=0.06285, over 4286323.18 frames. ], batch size: 143, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:28:14,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1314432.0, ans=0.125 2023-06-25 10:29:15,396 INFO [train.py:996] (1/4) Epoch 8, batch 5650, loss[loss=0.2219, simple_loss=0.2958, pruned_loss=0.07399, over 21859.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.3035, pruned_loss=0.06668, over 4284039.25 frames. ], batch size: 124, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:29:46,276 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:30:11,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1314792.0, ans=0.5 2023-06-25 10:30:21,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1314792.0, ans=0.0 2023-06-25 10:30:30,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1314852.0, ans=0.125 2023-06-25 10:30:42,462 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.720e+02 4.225e+02 5.470e+02 8.803e+02 1.575e+03, threshold=1.094e+03, percent-clipped=25.0 2023-06-25 10:31:12,021 INFO [train.py:996] (1/4) Epoch 8, batch 5700, loss[loss=0.2522, simple_loss=0.3386, pruned_loss=0.08289, over 21453.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3026, pruned_loss=0.06723, over 4284660.65 frames. ], batch size: 471, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:31:52,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1315032.0, ans=0.125 2023-06-25 10:32:39,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1315152.0, ans=0.125 2023-06-25 10:32:41,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1315152.0, ans=0.125 2023-06-25 10:32:48,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1315212.0, ans=0.2 2023-06-25 10:33:14,791 INFO [train.py:996] (1/4) Epoch 8, batch 5750, loss[loss=0.1588, simple_loss=0.2433, pruned_loss=0.0372, over 21399.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2974, pruned_loss=0.06425, over 4284607.00 frames. ], batch size: 211, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:33:23,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=8.0 2023-06-25 10:33:27,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1315272.0, ans=0.0 2023-06-25 10:34:31,280 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.956e+02 5.585e+02 8.690e+02 2.193e+03, threshold=1.117e+03, percent-clipped=12.0 2023-06-25 10:34:51,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1315512.0, ans=0.125 2023-06-25 10:35:05,090 INFO [train.py:996] (1/4) Epoch 8, batch 5800, loss[loss=0.2334, simple_loss=0.3287, pruned_loss=0.06903, over 21589.00 frames. 
], tot_loss[loss=0.2123, simple_loss=0.2975, pruned_loss=0.06357, over 4285731.12 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:35:33,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1315632.0, ans=0.0 2023-06-25 10:35:35,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1315632.0, ans=0.015 2023-06-25 10:35:35,912 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 10:35:43,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-25 10:36:31,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1315752.0, ans=0.1 2023-06-25 10:36:52,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1315812.0, ans=0.0 2023-06-25 10:36:55,265 INFO [train.py:996] (1/4) Epoch 8, batch 5850, loss[loss=0.24, simple_loss=0.3058, pruned_loss=0.08711, over 20128.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2961, pruned_loss=0.06029, over 4276358.16 frames. ], batch size: 702, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:37:28,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1315932.0, ans=0.07 2023-06-25 10:37:28,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1315932.0, ans=0.0 2023-06-25 10:38:21,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 3.016e+02 4.169e+02 5.558e+02 1.178e+03, threshold=8.338e+02, percent-clipped=1.0 2023-06-25 10:38:31,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1316112.0, ans=0.125 2023-06-25 10:38:43,380 INFO [train.py:996] (1/4) Epoch 8, batch 5900, loss[loss=0.1702, simple_loss=0.2463, pruned_loss=0.04706, over 21347.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2885, pruned_loss=0.0553, over 4281307.04 frames. ], batch size: 194, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:40:08,342 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-25 10:40:18,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1316412.0, ans=0.0 2023-06-25 10:40:32,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1316412.0, ans=0.1 2023-06-25 10:40:36,584 INFO [train.py:996] (1/4) Epoch 8, batch 5950, loss[loss=0.2011, simple_loss=0.2617, pruned_loss=0.07027, over 21459.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2874, pruned_loss=0.05836, over 4283053.35 frames. 
], batch size: 177, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:40:59,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1316532.0, ans=0.2 2023-06-25 10:41:57,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.705e+02 4.644e+02 6.015e+02 1.261e+03, threshold=9.288e+02, percent-clipped=6.0 2023-06-25 10:42:06,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1316712.0, ans=0.125 2023-06-25 10:42:24,684 INFO [train.py:996] (1/4) Epoch 8, batch 6000, loss[loss=0.1956, simple_loss=0.2635, pruned_loss=0.06379, over 21317.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2831, pruned_loss=0.06179, over 4279422.42 frames. ], batch size: 131, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:42:24,684 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 10:42:35,428 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.8824, 5.0045, 2.4294, 4.5097], device='cuda:1') 2023-06-25 10:42:43,099 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2599, simple_loss=0.3542, pruned_loss=0.08283, over 1796401.00 frames. 2023-06-25 10:42:43,100 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 10:42:52,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1316772.0, ans=0.0 2023-06-25 10:43:29,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1316892.0, ans=0.2 2023-06-25 10:43:53,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1316952.0, ans=0.07 2023-06-25 10:44:32,089 INFO [train.py:996] (1/4) Epoch 8, batch 6050, loss[loss=0.1955, simple_loss=0.26, pruned_loss=0.06548, over 21598.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2775, pruned_loss=0.06241, over 4278838.24 frames. ], batch size: 415, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:45:09,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1317132.0, ans=0.1 2023-06-25 10:46:02,684 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.027e+02 3.543e+02 4.966e+02 9.624e+02, threshold=7.086e+02, percent-clipped=3.0 2023-06-25 10:46:16,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1317312.0, ans=0.125 2023-06-25 10:46:21,238 INFO [train.py:996] (1/4) Epoch 8, batch 6100, loss[loss=0.1815, simple_loss=0.274, pruned_loss=0.04451, over 21742.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2785, pruned_loss=0.06134, over 4267004.92 frames. ], batch size: 351, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:46:32,946 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-25 10:46:43,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.66 vs. 
limit=10.0 2023-06-25 10:47:14,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1317492.0, ans=0.125 2023-06-25 10:48:04,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317612.0, ans=0.1 2023-06-25 10:48:09,188 INFO [train.py:996] (1/4) Epoch 8, batch 6150, loss[loss=0.2131, simple_loss=0.291, pruned_loss=0.06759, over 21528.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2815, pruned_loss=0.06416, over 4264018.55 frames. ], batch size: 389, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:49:04,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1317792.0, ans=0.125 2023-06-25 10:49:07,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1317792.0, ans=0.07 2023-06-25 10:49:11,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317792.0, ans=0.1 2023-06-25 10:49:12,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1317792.0, ans=0.0 2023-06-25 10:49:21,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1317852.0, ans=0.125 2023-06-25 10:49:28,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1317852.0, ans=0.125 2023-06-25 10:49:33,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1317852.0, ans=0.125 2023-06-25 10:49:38,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.673e+02 3.233e+02 3.904e+02 5.485e+02 1.131e+03, threshold=7.808e+02, percent-clipped=12.0 2023-06-25 10:49:40,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317912.0, ans=0.1 2023-06-25 10:49:58,256 INFO [train.py:996] (1/4) Epoch 8, batch 6200, loss[loss=0.2227, simple_loss=0.3457, pruned_loss=0.04983, over 20761.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2842, pruned_loss=0.06375, over 4261500.05 frames. ], batch size: 607, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:50:04,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1317972.0, ans=0.125 2023-06-25 10:51:49,418 INFO [train.py:996] (1/4) Epoch 8, batch 6250, loss[loss=0.1943, simple_loss=0.2906, pruned_loss=0.049, over 21458.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2909, pruned_loss=0.06392, over 4262607.55 frames. ], batch size: 211, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:52:10,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1318272.0, ans=0.125 2023-06-25 10:52:12,934 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. 
limit=15.0 2023-06-25 10:52:44,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1318392.0, ans=0.1 2023-06-25 10:53:17,964 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.870e+02 4.537e+02 6.426e+02 9.551e+02 1.693e+03, threshold=1.285e+03, percent-clipped=41.0 2023-06-25 10:53:23,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1318512.0, ans=0.2 2023-06-25 10:53:42,654 INFO [train.py:996] (1/4) Epoch 8, batch 6300, loss[loss=0.2374, simple_loss=0.3136, pruned_loss=0.0806, over 21758.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2951, pruned_loss=0.063, over 4271441.43 frames. ], batch size: 441, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:54:03,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.86 vs. limit=22.5 2023-06-25 10:54:09,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1318632.0, ans=0.125 2023-06-25 10:55:14,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1318812.0, ans=0.2 2023-06-25 10:55:42,023 INFO [train.py:996] (1/4) Epoch 8, batch 6350, loss[loss=0.2394, simple_loss=0.3082, pruned_loss=0.0853, over 21808.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2971, pruned_loss=0.06749, over 4273297.74 frames. ], batch size: 441, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:56:17,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1318932.0, ans=0.125 2023-06-25 10:56:30,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1318992.0, ans=0.125 2023-06-25 10:56:36,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1318992.0, ans=0.07 2023-06-25 10:57:02,230 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.857e+02 3.830e+02 4.751e+02 5.817e+02 1.226e+03, threshold=9.501e+02, percent-clipped=0.0 2023-06-25 10:57:22,399 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:57:27,549 INFO [train.py:996] (1/4) Epoch 8, batch 6400, loss[loss=0.2351, simple_loss=0.3133, pruned_loss=0.07844, over 21712.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3024, pruned_loss=0.07128, over 4277437.48 frames. 
], batch size: 298, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 10:57:56,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1319232.0, ans=0.125 2023-06-25 10:58:17,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1319292.0, ans=0.2 2023-06-25 10:58:25,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1319352.0, ans=0.2 2023-06-25 10:58:46,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1319352.0, ans=0.2 2023-06-25 10:58:58,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1319412.0, ans=0.1 2023-06-25 10:59:05,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1319412.0, ans=0.125 2023-06-25 10:59:17,471 INFO [train.py:996] (1/4) Epoch 8, batch 6450, loss[loss=0.1926, simple_loss=0.28, pruned_loss=0.05254, over 21340.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3046, pruned_loss=0.07054, over 4275027.11 frames. ], batch size: 211, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 10:59:29,050 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-06-25 10:59:33,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1319472.0, ans=0.1 2023-06-25 11:00:42,536 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 3.948e+02 4.858e+02 6.624e+02 1.248e+03, threshold=9.716e+02, percent-clipped=3.0 2023-06-25 11:00:48,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-25 11:01:06,963 INFO [train.py:996] (1/4) Epoch 8, batch 6500, loss[loss=0.2089, simple_loss=0.3151, pruned_loss=0.05133, over 21239.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2994, pruned_loss=0.06909, over 4271249.95 frames. ], batch size: 549, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:01:17,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-25 11:01:53,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.33 vs. limit=6.0 2023-06-25 11:02:23,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1319952.0, ans=0.2 2023-06-25 11:02:42,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320012.0, ans=0.1 2023-06-25 11:02:57,206 INFO [train.py:996] (1/4) Epoch 8, batch 6550, loss[loss=0.2041, simple_loss=0.278, pruned_loss=0.06512, over 21677.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2989, pruned_loss=0.06814, over 4267760.13 frames. 
], batch size: 263, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:02:57,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320072.0, ans=0.1 2023-06-25 11:03:43,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1320192.0, ans=0.0 2023-06-25 11:04:22,983 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.651e+02 3.584e+02 5.538e+02 7.556e+02 1.701e+03, threshold=1.108e+03, percent-clipped=14.0 2023-06-25 11:04:46,583 INFO [train.py:996] (1/4) Epoch 8, batch 6600, loss[loss=0.2662, simple_loss=0.3312, pruned_loss=0.1007, over 21547.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2927, pruned_loss=0.06799, over 4258439.32 frames. ], batch size: 471, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:05:03,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-25 11:05:51,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1320492.0, ans=0.125 2023-06-25 11:06:36,636 INFO [train.py:996] (1/4) Epoch 8, batch 6650, loss[loss=0.1787, simple_loss=0.2566, pruned_loss=0.05046, over 21615.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2848, pruned_loss=0.06541, over 4257819.45 frames. ], batch size: 298, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:06:41,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-25 11:06:43,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1320672.0, ans=0.2 2023-06-25 11:06:54,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1320732.0, ans=0.125 2023-06-25 11:07:35,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1320792.0, ans=0.125 2023-06-25 11:07:39,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1320792.0, ans=0.125 2023-06-25 11:07:41,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320792.0, ans=0.1 2023-06-25 11:07:51,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1320852.0, ans=0.125 2023-06-25 11:08:03,782 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.315e+02 4.377e+02 5.902e+02 1.210e+03, threshold=8.754e+02, percent-clipped=3.0 2023-06-25 11:08:20,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1320912.0, ans=0.2 2023-06-25 11:08:26,388 INFO [train.py:996] (1/4) Epoch 8, batch 6700, loss[loss=0.1719, simple_loss=0.2408, pruned_loss=0.05151, over 21568.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2792, pruned_loss=0.06555, over 4261498.00 frames. 
], batch size: 263, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:08:56,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1321032.0, ans=0.125 2023-06-25 11:08:56,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1321032.0, ans=0.2 2023-06-25 11:09:09,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1321092.0, ans=0.125 2023-06-25 11:09:43,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1321152.0, ans=0.125 2023-06-25 11:10:10,065 INFO [train.py:996] (1/4) Epoch 8, batch 6750, loss[loss=0.2, simple_loss=0.2784, pruned_loss=0.06086, over 21506.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2761, pruned_loss=0.06534, over 4253402.98 frames. ], batch size: 131, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:11:31,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1321452.0, ans=0.125 2023-06-25 11:11:31,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1321452.0, ans=0.2 2023-06-25 11:11:35,906 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.451e+02 4.455e+02 6.236e+02 1.487e+03, threshold=8.910e+02, percent-clipped=11.0 2023-06-25 11:11:41,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1321512.0, ans=0.0 2023-06-25 11:11:58,610 INFO [train.py:996] (1/4) Epoch 8, batch 6800, loss[loss=0.2466, simple_loss=0.3222, pruned_loss=0.08546, over 21877.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2784, pruned_loss=0.06747, over 4255995.94 frames. ], batch size: 107, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:12:08,188 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=22.5 2023-06-25 11:12:41,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1321692.0, ans=0.125 2023-06-25 11:13:41,724 INFO [train.py:996] (1/4) Epoch 8, batch 6850, loss[loss=0.2465, simple_loss=0.3666, pruned_loss=0.06318, over 19947.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2791, pruned_loss=0.06879, over 4261416.12 frames. ], batch size: 702, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:13:56,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1321872.0, ans=0.035 2023-06-25 11:14:35,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1321992.0, ans=0.125 2023-06-25 11:14:37,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. 
limit=15.0 2023-06-25 11:14:48,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1321992.0, ans=0.125 2023-06-25 11:15:00,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1322052.0, ans=0.125 2023-06-25 11:15:09,319 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.753e+02 5.063e+02 7.364e+02 1.523e+03, threshold=1.013e+03, percent-clipped=16.0 2023-06-25 11:15:32,245 INFO [train.py:996] (1/4) Epoch 8, batch 6900, loss[loss=0.1855, simple_loss=0.2632, pruned_loss=0.0539, over 21856.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2806, pruned_loss=0.06915, over 4268293.55 frames. ], batch size: 107, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:15:42,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1322172.0, ans=0.0 2023-06-25 11:16:37,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1322292.0, ans=0.1 2023-06-25 11:16:41,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1322352.0, ans=0.1 2023-06-25 11:16:48,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1322352.0, ans=0.125 2023-06-25 11:17:17,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1322412.0, ans=0.125 2023-06-25 11:17:23,586 INFO [train.py:996] (1/4) Epoch 8, batch 6950, loss[loss=0.2625, simple_loss=0.3347, pruned_loss=0.09514, over 21552.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2842, pruned_loss=0.0665, over 4268690.84 frames. ], batch size: 414, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:17:33,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1322472.0, ans=0.125 2023-06-25 11:17:59,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1322532.0, ans=0.125 2023-06-25 11:18:27,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1322592.0, ans=0.5 2023-06-25 11:18:54,430 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.544e+02 4.966e+02 6.681e+02 1.694e+03, threshold=9.931e+02, percent-clipped=7.0 2023-06-25 11:18:57,249 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.49 vs. limit=6.0 2023-06-25 11:19:12,240 INFO [train.py:996] (1/4) Epoch 8, batch 7000, loss[loss=0.2136, simple_loss=0.2826, pruned_loss=0.07232, over 21339.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2887, pruned_loss=0.06856, over 4254586.62 frames. ], batch size: 131, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:20:13,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1322892.0, ans=0.1 2023-06-25 11:20:21,269 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. 
limit=15.0 2023-06-25 11:20:56,842 INFO [train.py:996] (1/4) Epoch 8, batch 7050, loss[loss=0.1987, simple_loss=0.2927, pruned_loss=0.05229, over 21609.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2856, pruned_loss=0.06813, over 4247482.35 frames. ], batch size: 414, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:21:48,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1323192.0, ans=0.0 2023-06-25 11:21:48,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-25 11:21:57,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1323192.0, ans=0.1 2023-06-25 11:22:30,811 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.752e+02 3.663e+02 4.659e+02 6.225e+02 9.950e+02, threshold=9.319e+02, percent-clipped=1.0 2023-06-25 11:22:48,492 INFO [train.py:996] (1/4) Epoch 8, batch 7100, loss[loss=0.2495, simple_loss=0.3223, pruned_loss=0.08835, over 21660.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.289, pruned_loss=0.06856, over 4259651.48 frames. ], batch size: 351, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:22:56,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1323372.0, ans=0.0 2023-06-25 11:23:46,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1323492.0, ans=0.125 2023-06-25 11:24:07,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-25 11:24:14,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1323612.0, ans=0.02 2023-06-25 11:24:21,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1323612.0, ans=0.1 2023-06-25 11:24:35,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1323612.0, ans=0.125 2023-06-25 11:24:35,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1323612.0, ans=0.2 2023-06-25 11:24:44,520 INFO [train.py:996] (1/4) Epoch 8, batch 7150, loss[loss=0.2176, simple_loss=0.2963, pruned_loss=0.06943, over 21760.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2883, pruned_loss=0.06744, over 4259654.26 frames. ], batch size: 332, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:24:45,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. 
limit=15.0 2023-06-25 11:25:54,092 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:26:03,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1323852.0, ans=0.1 2023-06-25 11:26:11,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 3.580e+02 4.514e+02 6.175e+02 1.199e+03, threshold=9.027e+02, percent-clipped=4.0 2023-06-25 11:26:12,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1323912.0, ans=0.1 2023-06-25 11:26:14,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1323912.0, ans=0.125 2023-06-25 11:26:40,811 INFO [train.py:996] (1/4) Epoch 8, batch 7200, loss[loss=0.223, simple_loss=0.3406, pruned_loss=0.05272, over 19768.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2907, pruned_loss=0.06974, over 4262879.21 frames. ], batch size: 703, lr: 3.80e-03, grad_scale: 32.0 2023-06-25 11:28:28,901 INFO [train.py:996] (1/4) Epoch 8, batch 7250, loss[loss=0.2033, simple_loss=0.2722, pruned_loss=0.06717, over 21528.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2877, pruned_loss=0.07002, over 4268632.55 frames. ], batch size: 132, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:28:29,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1324272.0, ans=0.0 2023-06-25 11:29:26,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1324452.0, ans=0.125 2023-06-25 11:29:38,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1324452.0, ans=0.0 2023-06-25 11:29:48,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1324512.0, ans=0.0 2023-06-25 11:29:51,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 3.605e+02 4.552e+02 6.343e+02 1.382e+03, threshold=9.103e+02, percent-clipped=6.0 2023-06-25 11:30:17,029 INFO [train.py:996] (1/4) Epoch 8, batch 7300, loss[loss=0.1922, simple_loss=0.2563, pruned_loss=0.06412, over 21861.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2816, pruned_loss=0.06837, over 4265677.92 frames. ], batch size: 373, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:30:27,330 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-25 11:30:28,856 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-25 11:30:34,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-25 11:30:46,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.31 vs. 
limit=15.0 2023-06-25 11:31:15,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1324752.0, ans=0.125 2023-06-25 11:31:24,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1324752.0, ans=0.125 2023-06-25 11:31:26,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1324752.0, ans=0.0 2023-06-25 11:31:39,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1324812.0, ans=0.0 2023-06-25 11:32:07,771 INFO [train.py:996] (1/4) Epoch 8, batch 7350, loss[loss=0.1919, simple_loss=0.2516, pruned_loss=0.06607, over 21562.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2797, pruned_loss=0.06903, over 4260777.94 frames. ], batch size: 442, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:32:14,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-25 11:32:55,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1324992.0, ans=0.0 2023-06-25 11:33:07,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=15.0 2023-06-25 11:33:43,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1325112.0, ans=0.125 2023-06-25 11:33:44,446 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 4.058e+02 5.630e+02 9.164e+02 1.929e+03, threshold=1.126e+03, percent-clipped=26.0 2023-06-25 11:34:01,232 INFO [train.py:996] (1/4) Epoch 8, batch 7400, loss[loss=0.2367, simple_loss=0.3076, pruned_loss=0.08291, over 21631.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2856, pruned_loss=0.07096, over 4266695.56 frames. ], batch size: 263, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:34:02,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1325172.0, ans=0.1 2023-06-25 11:34:40,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1325292.0, ans=0.125 2023-06-25 11:34:43,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1325292.0, ans=0.2 2023-06-25 11:35:11,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=22.5 2023-06-25 11:35:51,162 INFO [train.py:996] (1/4) Epoch 8, batch 7450, loss[loss=0.1791, simple_loss=0.2444, pruned_loss=0.05686, over 21553.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2833, pruned_loss=0.06963, over 4263318.25 frames. 
], batch size: 263, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:36:03,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1325472.0, ans=0.125 2023-06-25 11:36:16,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1325532.0, ans=0.125 2023-06-25 11:37:15,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1325652.0, ans=0.125 2023-06-25 11:37:28,589 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 3.413e+02 4.464e+02 6.199e+02 1.662e+03, threshold=8.927e+02, percent-clipped=2.0 2023-06-25 11:37:50,177 INFO [train.py:996] (1/4) Epoch 8, batch 7500, loss[loss=0.2495, simple_loss=0.3464, pruned_loss=0.07629, over 21614.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2895, pruned_loss=0.07139, over 4269063.27 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:38:01,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1325772.0, ans=0.0 2023-06-25 11:39:37,788 INFO [train.py:996] (1/4) Epoch 8, batch 7550, loss[loss=0.2124, simple_loss=0.3135, pruned_loss=0.05564, over 21677.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2951, pruned_loss=0.07021, over 4277563.58 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:39:38,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1326072.0, ans=0.2 2023-06-25 11:40:10,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1326132.0, ans=0.1 2023-06-25 11:40:43,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1326192.0, ans=0.1 2023-06-25 11:40:56,847 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=22.5 2023-06-25 11:41:00,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-25 11:41:05,653 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.672e+02 5.210e+02 9.088e+02 2.173e+03, threshold=1.042e+03, percent-clipped=24.0 2023-06-25 11:41:26,457 INFO [train.py:996] (1/4) Epoch 8, batch 7600, loss[loss=0.1988, simple_loss=0.2603, pruned_loss=0.06872, over 21176.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2968, pruned_loss=0.06926, over 4274929.51 frames. ], batch size: 608, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 11:41:27,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1326372.0, ans=0.125 2023-06-25 11:41:50,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1326432.0, ans=0.1 2023-06-25 11:42:00,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1326492.0, ans=0.125 2023-06-25 11:43:09,705 INFO [train.py:996] (1/4) Epoch 8, batch 7650, loss[loss=0.2141, simple_loss=0.281, pruned_loss=0.07358, over 21938.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2956, pruned_loss=0.06975, over 4278654.88 frames. 
], batch size: 316, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 11:43:10,861 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-25 11:44:08,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1326792.0, ans=0.1 2023-06-25 11:44:44,978 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.753e+02 3.604e+02 4.352e+02 5.552e+02 1.331e+03, threshold=8.705e+02, percent-clipped=4.0 2023-06-25 11:44:48,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.59 vs. limit=5.0 2023-06-25 11:44:59,559 INFO [train.py:996] (1/4) Epoch 8, batch 7700, loss[loss=0.2681, simple_loss=0.3367, pruned_loss=0.09973, over 21572.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2968, pruned_loss=0.07234, over 4277947.77 frames. ], batch size: 414, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:45:43,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.53 vs. limit=12.0 2023-06-25 11:45:48,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1327092.0, ans=0.125 2023-06-25 11:46:11,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1327152.0, ans=0.125 2023-06-25 11:46:17,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1327152.0, ans=0.125 2023-06-25 11:46:46,200 INFO [train.py:996] (1/4) Epoch 8, batch 7750, loss[loss=0.1459, simple_loss=0.2001, pruned_loss=0.04581, over 17271.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3036, pruned_loss=0.07384, over 4276498.05 frames. ], batch size: 62, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:46:46,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1327272.0, ans=0.0 2023-06-25 11:47:46,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-25 11:48:24,877 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.807e+02 4.127e+02 5.917e+02 8.235e+02 1.345e+03, threshold=1.183e+03, percent-clipped=19.0 2023-06-25 11:48:32,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1327512.0, ans=0.125 2023-06-25 11:48:37,391 INFO [train.py:996] (1/4) Epoch 8, batch 7800, loss[loss=0.199, simple_loss=0.2523, pruned_loss=0.07281, over 21194.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3067, pruned_loss=0.07448, over 4278141.00 frames. ], batch size: 143, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:48:50,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1327572.0, ans=0.0 2023-06-25 11:48:52,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1327572.0, ans=0.2 2023-06-25 11:49:24,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. 
limit=15.0 2023-06-25 11:49:58,262 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-25 11:50:14,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1327812.0, ans=0.125 2023-06-25 11:50:26,034 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-25 11:50:26,498 INFO [train.py:996] (1/4) Epoch 8, batch 7850, loss[loss=0.1899, simple_loss=0.2525, pruned_loss=0.06362, over 21453.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2995, pruned_loss=0.07254, over 4279927.39 frames. ], batch size: 195, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:50:41,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1327872.0, ans=0.0 2023-06-25 11:50:50,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1327932.0, ans=10.0 2023-06-25 11:52:07,048 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 3.555e+02 5.085e+02 7.464e+02 1.705e+03, threshold=1.017e+03, percent-clipped=5.0 2023-06-25 11:52:26,558 INFO [train.py:996] (1/4) Epoch 8, batch 7900, loss[loss=0.2259, simple_loss=0.3145, pruned_loss=0.06862, over 21732.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2937, pruned_loss=0.0711, over 4280173.42 frames. ], batch size: 351, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:52:44,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1328172.0, ans=0.125 2023-06-25 11:54:24,580 INFO [train.py:996] (1/4) Epoch 8, batch 7950, loss[loss=0.2251, simple_loss=0.3004, pruned_loss=0.07487, over 21757.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2981, pruned_loss=0.071, over 4277295.99 frames. ], batch size: 298, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:54:42,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1328472.0, ans=0.125 2023-06-25 11:55:14,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-06-25 11:55:29,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-25 11:56:11,370 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 4.611e+02 6.417e+02 9.938e+02 3.239e+03, threshold=1.283e+03, percent-clipped=22.0 2023-06-25 11:56:17,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1328712.0, ans=0.125 2023-06-25 11:56:24,138 INFO [train.py:996] (1/4) Epoch 8, batch 8000, loss[loss=0.2937, simple_loss=0.3621, pruned_loss=0.1126, over 21428.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2998, pruned_loss=0.07257, over 4260782.14 frames. ], batch size: 471, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:56:51,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. 
limit=22.5 2023-06-25 11:57:13,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1328892.0, ans=0.125 2023-06-25 11:58:01,846 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:58:22,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1329072.0, ans=0.125 2023-06-25 11:58:24,049 INFO [train.py:996] (1/4) Epoch 8, batch 8050, loss[loss=0.2065, simple_loss=0.2505, pruned_loss=0.08125, over 20283.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3031, pruned_loss=0.07321, over 4266221.84 frames. ], batch size: 703, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:59:43,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1329252.0, ans=0.125 2023-06-25 12:00:03,818 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.754e+02 4.648e+02 6.798e+02 1.163e+03 2.924e+03, threshold=1.360e+03, percent-clipped=20.0 2023-06-25 12:00:07,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329312.0, ans=0.1 2023-06-25 12:00:16,696 INFO [train.py:996] (1/4) Epoch 8, batch 8100, loss[loss=0.1928, simple_loss=0.2646, pruned_loss=0.06054, over 21843.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3019, pruned_loss=0.07272, over 4261942.27 frames. ], batch size: 282, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:00:18,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1329372.0, ans=15.0 2023-06-25 12:00:34,173 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:00:52,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1329432.0, ans=0.04949747468305833 2023-06-25 12:02:01,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1329612.0, ans=0.0 2023-06-25 12:02:15,828 INFO [train.py:996] (1/4) Epoch 8, batch 8150, loss[loss=0.2423, simple_loss=0.3431, pruned_loss=0.07075, over 21769.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3084, pruned_loss=0.07386, over 4259530.12 frames. ], batch size: 371, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:02:41,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1329732.0, ans=0.2 2023-06-25 12:03:15,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1329792.0, ans=0.2 2023-06-25 12:03:32,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1329852.0, ans=0.125 2023-06-25 12:03:47,519 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.902e+02 4.317e+02 6.289e+02 1.033e+03 2.172e+03, threshold=1.258e+03, percent-clipped=12.0 2023-06-25 12:04:04,955 INFO [train.py:996] (1/4) Epoch 8, batch 8200, loss[loss=0.1776, simple_loss=0.2464, pruned_loss=0.05442, over 21531.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.302, pruned_loss=0.07189, over 4267486.07 frames. 
], batch size: 230, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:05:21,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1330152.0, ans=0.125 2023-06-25 12:05:24,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-25 12:05:33,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1330212.0, ans=0.125 2023-06-25 12:05:54,986 INFO [train.py:996] (1/4) Epoch 8, batch 8250, loss[loss=0.2541, simple_loss=0.3381, pruned_loss=0.0851, over 21562.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3, pruned_loss=0.07162, over 4255755.08 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:07:12,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.12 vs. limit=22.5 2023-06-25 12:07:21,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1330512.0, ans=0.0 2023-06-25 12:07:27,234 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.445e+02 3.428e+02 4.253e+02 6.741e+02 1.234e+03, threshold=8.505e+02, percent-clipped=0.0 2023-06-25 12:07:50,390 INFO [train.py:996] (1/4) Epoch 8, batch 8300, loss[loss=0.1928, simple_loss=0.2736, pruned_loss=0.05603, over 21226.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2981, pruned_loss=0.06994, over 4259411.25 frames. ], batch size: 159, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:09:39,007 INFO [train.py:996] (1/4) Epoch 8, batch 8350, loss[loss=0.2042, simple_loss=0.3028, pruned_loss=0.05278, over 21779.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2986, pruned_loss=0.06868, over 4252787.18 frames. ], batch size: 282, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:10:05,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-25 12:10:19,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-25 12:10:52,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-25 12:11:10,353 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.535e+02 3.482e+02 5.019e+02 7.188e+02 1.647e+03, threshold=1.004e+03, percent-clipped=15.0 2023-06-25 12:11:26,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1331172.0, ans=0.1 2023-06-25 12:11:27,232 INFO [train.py:996] (1/4) Epoch 8, batch 8400, loss[loss=0.1671, simple_loss=0.2653, pruned_loss=0.03439, over 21747.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2962, pruned_loss=0.06578, over 4261247.59 frames. 
], batch size: 351, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:11:44,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1331172.0, ans=0.015 2023-06-25 12:11:46,172 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:12:05,365 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-25 12:12:10,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0 2023-06-25 12:12:29,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1331352.0, ans=0.125 2023-06-25 12:12:43,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1331352.0, ans=0.0 2023-06-25 12:12:59,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1331412.0, ans=0.1 2023-06-25 12:13:15,377 INFO [train.py:996] (1/4) Epoch 8, batch 8450, loss[loss=0.2059, simple_loss=0.2719, pruned_loss=0.06997, over 21506.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2925, pruned_loss=0.06531, over 4269948.10 frames. ], batch size: 195, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:14:16,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1331652.0, ans=0.04949747468305833 2023-06-25 12:14:45,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.841e+02 5.103e+02 7.112e+02 1.474e+03, threshold=1.021e+03, percent-clipped=11.0 2023-06-25 12:14:55,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1331712.0, ans=0.0 2023-06-25 12:15:04,490 INFO [train.py:996] (1/4) Epoch 8, batch 8500, loss[loss=0.1938, simple_loss=0.2664, pruned_loss=0.06063, over 21819.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2882, pruned_loss=0.06649, over 4266008.54 frames. ], batch size: 118, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:15:34,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-25 12:15:43,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0 2023-06-25 12:16:04,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-25 12:16:42,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1332012.0, ans=0.125 2023-06-25 12:16:56,635 INFO [train.py:996] (1/4) Epoch 8, batch 8550, loss[loss=0.2572, simple_loss=0.3462, pruned_loss=0.08412, over 21609.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2935, pruned_loss=0.06871, over 4270692.53 frames. 
], batch size: 389, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:17:13,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1332072.0, ans=0.125 2023-06-25 12:17:26,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1332132.0, ans=0.2 2023-06-25 12:17:56,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1332192.0, ans=0.125 2023-06-25 12:18:36,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 4.136e+02 5.316e+02 7.631e+02 1.468e+03, threshold=1.063e+03, percent-clipped=11.0 2023-06-25 12:18:52,728 INFO [train.py:996] (1/4) Epoch 8, batch 8600, loss[loss=0.2222, simple_loss=0.316, pruned_loss=0.06426, over 21746.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2996, pruned_loss=0.07076, over 4271895.81 frames. ], batch size: 298, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:18:58,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1332372.0, ans=0.125 2023-06-25 12:19:05,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332372.0, ans=0.1 2023-06-25 12:19:35,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1332492.0, ans=0.125 2023-06-25 12:19:41,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332492.0, ans=0.1 2023-06-25 12:20:01,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1332552.0, ans=0.0 2023-06-25 12:20:43,328 INFO [train.py:996] (1/4) Epoch 8, batch 8650, loss[loss=0.2217, simple_loss=0.2893, pruned_loss=0.077, over 20039.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3057, pruned_loss=0.07159, over 4267947.15 frames. ], batch size: 704, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:20:43,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1332672.0, ans=0.2 2023-06-25 12:20:45,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1332672.0, ans=0.0 2023-06-25 12:20:56,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-25 12:21:13,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1332732.0, ans=0.1 2023-06-25 12:22:12,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-25 12:22:16,159 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.872e+02 5.286e+02 7.583e+02 1.337e+03, threshold=1.057e+03, percent-clipped=5.0 2023-06-25 12:22:29,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1332912.0, ans=0.125 2023-06-25 12:22:32,426 INFO [train.py:996] (1/4) Epoch 8, batch 8700, loss[loss=0.1999, simple_loss=0.2691, pruned_loss=0.06534, over 21446.00 frames. 
], tot_loss[loss=0.2177, simple_loss=0.2986, pruned_loss=0.06845, over 4268089.34 frames. ], batch size: 389, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:22:42,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-25 12:22:48,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1333032.0, ans=0.0 2023-06-25 12:23:24,696 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:23:31,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1333152.0, ans=0.125 2023-06-25 12:23:31,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1333152.0, ans=0.125 2023-06-25 12:24:21,809 INFO [train.py:996] (1/4) Epoch 8, batch 8750, loss[loss=0.2285, simple_loss=0.2941, pruned_loss=0.08139, over 21498.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2948, pruned_loss=0.06897, over 4273187.56 frames. ], batch size: 144, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:24:29,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1333272.0, ans=0.125 2023-06-25 12:24:55,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1333332.0, ans=0.0 2023-06-25 12:24:55,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1333332.0, ans=0.2 2023-06-25 12:25:43,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1333452.0, ans=0.125 2023-06-25 12:25:59,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1333512.0, ans=0.0 2023-06-25 12:25:59,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1333512.0, ans=0.1 2023-06-25 12:26:00,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1333512.0, ans=0.125 2023-06-25 12:26:02,039 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.799e+02 3.947e+02 5.629e+02 7.790e+02 1.713e+03, threshold=1.126e+03, percent-clipped=18.0 2023-06-25 12:26:18,049 INFO [train.py:996] (1/4) Epoch 8, batch 8800, loss[loss=0.2574, simple_loss=0.3432, pruned_loss=0.08582, over 21317.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3031, pruned_loss=0.07157, over 4276604.45 frames. ], batch size: 548, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:26:59,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1333692.0, ans=0.0 2023-06-25 12:27:02,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. 
limit=6.0 2023-06-25 12:27:20,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1333692.0, ans=0.04949747468305833 2023-06-25 12:27:25,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1333752.0, ans=0.2 2023-06-25 12:27:48,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1333812.0, ans=0.0 2023-06-25 12:28:04,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1333812.0, ans=0.1 2023-06-25 12:28:09,020 INFO [train.py:996] (1/4) Epoch 8, batch 8850, loss[loss=0.2115, simple_loss=0.3148, pruned_loss=0.05411, over 21638.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3095, pruned_loss=0.07319, over 4283334.39 frames. ], batch size: 230, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:28:09,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1333872.0, ans=0.125 2023-06-25 12:28:51,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.64 vs. limit=15.0 2023-06-25 12:28:52,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1333992.0, ans=0.0 2023-06-25 12:29:48,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1334112.0, ans=0.0 2023-06-25 12:29:51,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.618e+02 3.568e+02 4.882e+02 6.738e+02 2.080e+03, threshold=9.764e+02, percent-clipped=3.0 2023-06-25 12:30:01,475 INFO [train.py:996] (1/4) Epoch 8, batch 8900, loss[loss=0.2683, simple_loss=0.369, pruned_loss=0.08381, over 20753.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3033, pruned_loss=0.07245, over 4278670.01 frames. 
], batch size: 608, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:30:22,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1334172.0, ans=0.125 2023-06-25 12:30:22,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1334172.0, ans=0.125 2023-06-25 12:31:00,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1334292.0, ans=0.1 2023-06-25 12:31:10,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1334292.0, ans=0.125 2023-06-25 12:31:11,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1334352.0, ans=0.125 2023-06-25 12:31:25,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1334352.0, ans=0.125 2023-06-25 12:31:31,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1334352.0, ans=0.125 2023-06-25 12:31:31,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1334352.0, ans=0.125 2023-06-25 12:31:45,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1334412.0, ans=0.125 2023-06-25 12:31:58,032 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:31:59,179 INFO [train.py:996] (1/4) Epoch 8, batch 8950, loss[loss=0.2225, simple_loss=0.3039, pruned_loss=0.07052, over 21605.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3037, pruned_loss=0.07181, over 4271798.86 frames. ], batch size: 263, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:32:35,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1334532.0, ans=0.2 2023-06-25 12:33:34,162 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.718e+02 4.076e+02 6.080e+02 7.762e+02 1.933e+03, threshold=1.216e+03, percent-clipped=14.0 2023-06-25 12:33:48,742 INFO [train.py:996] (1/4) Epoch 8, batch 9000, loss[loss=0.2007, simple_loss=0.2764, pruned_loss=0.06247, over 21575.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2977, pruned_loss=0.07054, over 4279861.62 frames. ], batch size: 414, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:33:48,742 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 12:34:07,156 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2631, simple_loss=0.3554, pruned_loss=0.08544, over 1796401.00 frames. 2023-06-25 12:34:07,157 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 12:34:12,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.04 vs. 
limit=22.5 2023-06-25 12:34:28,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1334772.0, ans=0.0 2023-06-25 12:34:40,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1334832.0, ans=0.125 2023-06-25 12:35:12,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1334892.0, ans=0.125 2023-06-25 12:35:15,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1334952.0, ans=0.1 2023-06-25 12:35:24,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1334952.0, ans=0.07 2023-06-25 12:35:57,399 INFO [train.py:996] (1/4) Epoch 8, batch 9050, loss[loss=0.2023, simple_loss=0.2874, pruned_loss=0.05858, over 21562.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2939, pruned_loss=0.06798, over 4277764.43 frames. ], batch size: 263, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:36:19,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.39 vs. limit=10.0 2023-06-25 12:37:25,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1335252.0, ans=0.0 2023-06-25 12:37:47,100 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.667e+02 3.976e+02 5.366e+02 7.574e+02 1.688e+03, threshold=1.073e+03, percent-clipped=5.0 2023-06-25 12:37:55,894 INFO [train.py:996] (1/4) Epoch 8, batch 9100, loss[loss=0.2439, simple_loss=0.3356, pruned_loss=0.07611, over 21619.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2988, pruned_loss=0.07028, over 4275017.51 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:38:14,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1335372.0, ans=0.1 2023-06-25 12:38:17,539 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:38:18,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-25 12:38:24,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1335432.0, ans=0.125 2023-06-25 12:38:31,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1335432.0, ans=0.0 2023-06-25 12:39:06,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1335552.0, ans=0.1 2023-06-25 12:39:17,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1335552.0, ans=0.125 2023-06-25 12:39:47,142 INFO [train.py:996] (1/4) Epoch 8, batch 9150, loss[loss=0.2219, simple_loss=0.3118, pruned_loss=0.066, over 21578.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3017, pruned_loss=0.06807, over 4267902.83 frames. 
], batch size: 230, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:40:11,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1335672.0, ans=0.0 2023-06-25 12:40:44,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1335792.0, ans=0.0 2023-06-25 12:41:27,971 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.582e+02 4.285e+02 5.759e+02 1.145e+03, threshold=8.570e+02, percent-clipped=4.0 2023-06-25 12:41:47,516 INFO [train.py:996] (1/4) Epoch 8, batch 9200, loss[loss=0.2219, simple_loss=0.3026, pruned_loss=0.07056, over 21289.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3025, pruned_loss=0.06781, over 4266480.17 frames. ], batch size: 176, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:42:39,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1336092.0, ans=0.0 2023-06-25 12:42:59,722 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-25 12:43:35,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1336272.0, ans=0.0 2023-06-25 12:43:37,056 INFO [train.py:996] (1/4) Epoch 8, batch 9250, loss[loss=0.2026, simple_loss=0.2717, pruned_loss=0.06672, over 21605.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3041, pruned_loss=0.07082, over 4274196.94 frames. ], batch size: 298, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:43:43,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-25 12:43:56,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1336332.0, ans=0.0 2023-06-25 12:44:39,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-25 12:44:40,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1336452.0, ans=0.2 2023-06-25 12:45:21,452 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.801e+02 3.683e+02 5.339e+02 7.868e+02 1.539e+03, threshold=1.068e+03, percent-clipped=20.0 2023-06-25 12:45:28,190 INFO [train.py:996] (1/4) Epoch 8, batch 9300, loss[loss=0.1903, simple_loss=0.2632, pruned_loss=0.05868, over 21891.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3001, pruned_loss=0.07057, over 4271089.82 frames. 
], batch size: 107, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:46:02,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1336632.0, ans=0.125 2023-06-25 12:46:25,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1336692.0, ans=0.1 2023-06-25 12:46:34,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1336752.0, ans=0.125 2023-06-25 12:47:12,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1336812.0, ans=0.125 2023-06-25 12:47:19,155 INFO [train.py:996] (1/4) Epoch 8, batch 9350, loss[loss=0.2411, simple_loss=0.322, pruned_loss=0.08009, over 21409.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3086, pruned_loss=0.07209, over 4276724.45 frames. ], batch size: 176, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:47:43,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1336932.0, ans=0.125 2023-06-25 12:47:51,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2023-06-25 12:48:05,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1336992.0, ans=0.0 2023-06-25 12:49:01,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1337112.0, ans=0.2 2023-06-25 12:49:02,854 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.116e+02 5.791e+02 8.209e+02 2.175e+03, threshold=1.158e+03, percent-clipped=13.0 2023-06-25 12:49:10,232 INFO [train.py:996] (1/4) Epoch 8, batch 9400, loss[loss=0.1966, simple_loss=0.2639, pruned_loss=0.06462, over 21273.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3095, pruned_loss=0.07213, over 4276272.84 frames. ], batch size: 549, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:49:21,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1337172.0, ans=0.1 2023-06-25 12:49:23,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1337172.0, ans=0.1 2023-06-25 12:50:11,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-25 12:50:25,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1337352.0, ans=0.125 2023-06-25 12:50:26,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1337352.0, ans=0.2 2023-06-25 12:51:04,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1337472.0, ans=0.125 2023-06-25 12:51:05,908 INFO [train.py:996] (1/4) Epoch 8, batch 9450, loss[loss=0.1882, simple_loss=0.2554, pruned_loss=0.06053, over 21746.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3026, pruned_loss=0.07139, over 4264310.26 frames. 
], batch size: 300, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:51:08,690 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.66 vs. limit=6.0 2023-06-25 12:51:24,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=22.5 2023-06-25 12:52:09,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-25 12:52:29,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-25 12:52:33,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1337712.0, ans=0.125 2023-06-25 12:52:41,863 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.789e+02 4.276e+02 5.565e+02 7.806e+02 1.820e+03, threshold=1.113e+03, percent-clipped=7.0 2023-06-25 12:52:48,856 INFO [train.py:996] (1/4) Epoch 8, batch 9500, loss[loss=0.1754, simple_loss=0.2681, pruned_loss=0.0414, over 21681.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2966, pruned_loss=0.06897, over 4258365.64 frames. ], batch size: 298, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:53:03,127 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-25 12:53:04,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1337772.0, ans=0.125 2023-06-25 12:54:30,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1338012.0, ans=0.0 2023-06-25 12:54:43,654 INFO [train.py:996] (1/4) Epoch 8, batch 9550, loss[loss=0.2425, simple_loss=0.336, pruned_loss=0.07446, over 21626.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3016, pruned_loss=0.07172, over 4258901.38 frames. ], batch size: 230, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:55:46,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1338192.0, ans=0.125 2023-06-25 12:56:08,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1338312.0, ans=0.125 2023-06-25 12:56:26,043 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.967e+02 4.048e+02 5.374e+02 8.215e+02 1.903e+03, threshold=1.075e+03, percent-clipped=10.0 2023-06-25 12:56:32,907 INFO [train.py:996] (1/4) Epoch 8, batch 9600, loss[loss=0.1643, simple_loss=0.2729, pruned_loss=0.02787, over 20776.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3025, pruned_loss=0.07273, over 4269439.82 frames. ], batch size: 607, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:56:59,442 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0 2023-06-25 12:57:30,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1338492.0, ans=0.125 2023-06-25 12:57:52,738 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. 
limit=12.0 2023-06-25 12:58:24,348 INFO [train.py:996] (1/4) Epoch 8, batch 9650, loss[loss=0.2947, simple_loss=0.3487, pruned_loss=0.1203, over 21401.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3007, pruned_loss=0.07263, over 4274138.79 frames. ], batch size: 508, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:58:28,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1338672.0, ans=0.125 2023-06-25 12:58:31,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1338672.0, ans=0.0 2023-06-25 13:00:07,431 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.726e+02 3.684e+02 4.580e+02 6.595e+02 1.807e+03, threshold=9.160e+02, percent-clipped=4.0 2023-06-25 13:00:20,093 INFO [train.py:996] (1/4) Epoch 8, batch 9700, loss[loss=0.2048, simple_loss=0.2843, pruned_loss=0.06265, over 21754.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3031, pruned_loss=0.0726, over 4280721.79 frames. ], batch size: 247, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 13:00:29,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1338972.0, ans=0.1 2023-06-25 13:00:59,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1339032.0, ans=0.5 2023-06-25 13:01:49,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1339212.0, ans=0.125 2023-06-25 13:02:02,387 INFO [train.py:996] (1/4) Epoch 8, batch 9750, loss[loss=0.199, simple_loss=0.2668, pruned_loss=0.06556, over 21562.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2962, pruned_loss=0.07146, over 4274447.81 frames. ], batch size: 263, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:02:06,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1339272.0, ans=0.0 2023-06-25 13:03:07,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-25 13:03:16,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-25 13:03:19,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1339452.0, ans=0.125 2023-06-25 13:03:22,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1339452.0, ans=0.125 2023-06-25 13:03:25,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1339512.0, ans=0.2 2023-06-25 13:03:28,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1339512.0, ans=0.125 2023-06-25 13:03:42,354 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.775e+02 3.743e+02 5.532e+02 7.768e+02 2.224e+03, threshold=1.106e+03, percent-clipped=14.0 2023-06-25 13:03:49,312 INFO [train.py:996] (1/4) Epoch 8, batch 9800, loss[loss=0.2009, simple_loss=0.273, pruned_loss=0.06443, over 21620.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2967, pruned_loss=0.07189, over 4276246.58 frames. 
], batch size: 263, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:04:09,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1339572.0, ans=0.125 2023-06-25 13:04:16,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1339632.0, ans=0.125 2023-06-25 13:04:31,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1339692.0, ans=0.1 2023-06-25 13:05:37,670 INFO [train.py:996] (1/4) Epoch 8, batch 9850, loss[loss=0.2162, simple_loss=0.2723, pruned_loss=0.08011, over 21576.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2926, pruned_loss=0.07182, over 4271795.52 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:05:42,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0 2023-06-25 13:05:43,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1339872.0, ans=0.0 2023-06-25 13:06:12,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.37 vs. limit=22.5 2023-06-25 13:06:52,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1340052.0, ans=0.125 2023-06-25 13:07:19,911 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 3.728e+02 4.692e+02 6.683e+02 1.521e+03, threshold=9.384e+02, percent-clipped=6.0 2023-06-25 13:07:26,591 INFO [train.py:996] (1/4) Epoch 8, batch 9900, loss[loss=0.2122, simple_loss=0.2877, pruned_loss=0.06832, over 21760.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2904, pruned_loss=0.07201, over 4272122.06 frames. ], batch size: 282, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:07:32,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1340172.0, ans=0.125 2023-06-25 13:07:53,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1340232.0, ans=0.125 2023-06-25 13:09:14,604 INFO [train.py:996] (1/4) Epoch 8, batch 9950, loss[loss=0.2334, simple_loss=0.2889, pruned_loss=0.08894, over 21525.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2954, pruned_loss=0.07474, over 4260490.17 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:09:36,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1340532.0, ans=0.0 2023-06-25 13:10:25,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=22.5 2023-06-25 13:10:58,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1340712.0, ans=0.125 2023-06-25 13:10:59,873 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.700e+02 4.924e+02 7.179e+02 1.701e+03, threshold=9.849e+02, percent-clipped=16.0 2023-06-25 13:11:11,620 INFO [train.py:996] (1/4) Epoch 8, batch 10000, loss[loss=0.203, simple_loss=0.284, pruned_loss=0.061, over 21611.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2903, pruned_loss=0.07307, over 4259429.63 frames. 
], batch size: 389, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:11:24,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1340772.0, ans=0.0 2023-06-25 13:11:28,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1340832.0, ans=0.0 2023-06-25 13:11:40,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1340832.0, ans=0.2 2023-06-25 13:11:58,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1340892.0, ans=0.2 2023-06-25 13:12:28,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1340952.0, ans=0.125 2023-06-25 13:12:40,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1341012.0, ans=0.0 2023-06-25 13:13:02,331 INFO [train.py:996] (1/4) Epoch 8, batch 10050, loss[loss=0.1887, simple_loss=0.2445, pruned_loss=0.06647, over 20762.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2929, pruned_loss=0.07332, over 4257923.05 frames. ], batch size: 609, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:13:43,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1341132.0, ans=0.0 2023-06-25 13:13:52,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1341192.0, ans=0.125 2023-06-25 13:14:00,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-25 13:14:45,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1341312.0, ans=0.0 2023-06-25 13:14:55,226 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 4.346e+02 5.951e+02 7.848e+02 1.633e+03, threshold=1.190e+03, percent-clipped=16.0 2023-06-25 13:14:58,763 INFO [train.py:996] (1/4) Epoch 8, batch 10100, loss[loss=0.2267, simple_loss=0.297, pruned_loss=0.07817, over 21620.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.288, pruned_loss=0.07123, over 4258537.13 frames. ], batch size: 263, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:15:35,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1341432.0, ans=0.0 2023-06-25 13:15:55,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1341492.0, ans=0.0 2023-06-25 13:15:59,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1341492.0, ans=0.2 2023-06-25 13:16:31,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. 
limit=10.0 2023-06-25 13:16:40,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1341612.0, ans=0.04949747468305833 2023-06-25 13:16:43,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1341612.0, ans=0.125 2023-06-25 13:16:48,269 INFO [train.py:996] (1/4) Epoch 8, batch 10150, loss[loss=0.196, simple_loss=0.2643, pruned_loss=0.06384, over 17071.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2944, pruned_loss=0.07419, over 4258866.39 frames. ], batch size: 60, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:16:59,188 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:17:11,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1341732.0, ans=0.1 2023-06-25 13:17:16,703 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:17:18,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1341732.0, ans=0.07 2023-06-25 13:18:12,492 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:18:38,988 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.458e+02 4.384e+02 5.388e+02 1.096e+03, threshold=8.768e+02, percent-clipped=0.0 2023-06-25 13:18:39,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1341912.0, ans=0.125 2023-06-25 13:18:41,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1341972.0, ans=10.0 2023-06-25 13:18:42,756 INFO [train.py:996] (1/4) Epoch 8, batch 10200, loss[loss=0.1783, simple_loss=0.2582, pruned_loss=0.04917, over 16182.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2937, pruned_loss=0.07184, over 4245921.81 frames. ], batch size: 63, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:19:33,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1342092.0, ans=0.07 2023-06-25 13:20:24,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1342212.0, ans=0.0 2023-06-25 13:20:34,609 INFO [train.py:996] (1/4) Epoch 8, batch 10250, loss[loss=0.2115, simple_loss=0.2934, pruned_loss=0.06485, over 21328.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.29, pruned_loss=0.06738, over 4245473.66 frames. ], batch size: 159, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:20:41,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=12.0 2023-06-25 13:21:14,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1342392.0, ans=0.025 2023-06-25 13:22:13,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1342512.0, ans=0.0 2023-06-25 13:22:16,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. 
limit=15.0 2023-06-25 13:22:23,169 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 3.628e+02 5.027e+02 6.947e+02 1.354e+03, threshold=1.005e+03, percent-clipped=10.0 2023-06-25 13:22:26,723 INFO [train.py:996] (1/4) Epoch 8, batch 10300, loss[loss=0.2127, simple_loss=0.2954, pruned_loss=0.06502, over 21289.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2927, pruned_loss=0.067, over 4248607.56 frames. ], batch size: 176, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:23:22,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1342692.0, ans=0.0 2023-06-25 13:23:23,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1342692.0, ans=0.1 2023-06-25 13:23:51,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1342752.0, ans=0.125 2023-06-25 13:24:18,487 INFO [train.py:996] (1/4) Epoch 8, batch 10350, loss[loss=0.1632, simple_loss=0.2179, pruned_loss=0.05428, over 21211.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2945, pruned_loss=0.06729, over 4257780.96 frames. ], batch size: 143, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:24:40,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1342932.0, ans=0.125 2023-06-25 13:25:19,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1342992.0, ans=0.0 2023-06-25 13:25:41,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-25 13:25:49,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1343112.0, ans=0.5 2023-06-25 13:25:53,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1343112.0, ans=0.0 2023-06-25 13:25:55,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1343112.0, ans=0.0 2023-06-25 13:26:05,201 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.961e+02 4.465e+02 6.325e+02 1.027e+03 2.051e+03, threshold=1.265e+03, percent-clipped=26.0 2023-06-25 13:26:15,282 INFO [train.py:996] (1/4) Epoch 8, batch 10400, loss[loss=0.189, simple_loss=0.2556, pruned_loss=0.06121, over 21639.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.29, pruned_loss=0.06655, over 4259916.74 frames. ], batch size: 263, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:26:26,744 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.23 vs. limit=22.5 2023-06-25 13:26:38,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1343232.0, ans=0.0 2023-06-25 13:27:24,011 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=15.0 2023-06-25 13:27:43,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1343412.0, ans=0.125 2023-06-25 13:28:06,150 INFO [train.py:996] (1/4) Epoch 8, batch 10450, loss[loss=0.226, simple_loss=0.2939, pruned_loss=0.07906, over 21349.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2938, pruned_loss=0.06963, over 4266542.88 frames. ], batch size: 176, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:29:52,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.857e+02 4.046e+02 6.081e+02 8.924e+02 1.860e+03, threshold=1.216e+03, percent-clipped=7.0 2023-06-25 13:29:53,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1343772.0, ans=0.0 2023-06-25 13:29:54,310 INFO [train.py:996] (1/4) Epoch 8, batch 10500, loss[loss=0.1879, simple_loss=0.2589, pruned_loss=0.05839, over 21628.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2911, pruned_loss=0.06845, over 4257158.58 frames. ], batch size: 332, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:29:55,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.98 vs. limit=5.0 2023-06-25 13:30:27,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1343832.0, ans=0.125 2023-06-25 13:30:28,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1343832.0, ans=0.015 2023-06-25 13:30:40,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1343832.0, ans=0.07 2023-06-25 13:30:51,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0 2023-06-25 13:30:52,909 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:30:53,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-25 13:31:44,399 INFO [train.py:996] (1/4) Epoch 8, batch 10550, loss[loss=0.2037, simple_loss=0.2642, pruned_loss=0.07164, over 21279.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2866, pruned_loss=0.06751, over 4257545.50 frames. ], batch size: 144, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:33:35,043 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.879e+02 5.008e+02 7.044e+02 1.478e+03, threshold=1.002e+03, percent-clipped=2.0 2023-06-25 13:33:35,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1344372.0, ans=0.125 2023-06-25 13:33:37,117 INFO [train.py:996] (1/4) Epoch 8, batch 10600, loss[loss=0.1745, simple_loss=0.2592, pruned_loss=0.04486, over 21399.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2845, pruned_loss=0.06685, over 4240262.31 frames. ], batch size: 211, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:33:38,071 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:35:08,701 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.19 vs. 
limit=15.0 2023-06-25 13:35:15,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1344612.0, ans=0.125 2023-06-25 13:35:20,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1344612.0, ans=0.125 2023-06-25 13:35:34,395 INFO [train.py:996] (1/4) Epoch 8, batch 10650, loss[loss=0.1206, simple_loss=0.1764, pruned_loss=0.03234, over 16680.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2869, pruned_loss=0.06592, over 4235035.82 frames. ], batch size: 62, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:35:51,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1344672.0, ans=0.0 2023-06-25 13:36:43,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1344852.0, ans=0.125 2023-06-25 13:36:59,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1344912.0, ans=0.2 2023-06-25 13:37:00,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-25 13:37:23,250 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.596e+02 3.773e+02 5.055e+02 6.605e+02 1.042e+03, threshold=1.011e+03, percent-clipped=1.0 2023-06-25 13:37:30,132 INFO [train.py:996] (1/4) Epoch 8, batch 10700, loss[loss=0.2453, simple_loss=0.3247, pruned_loss=0.08292, over 21546.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2847, pruned_loss=0.06539, over 4238319.88 frames. ], batch size: 389, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:37:32,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1344972.0, ans=0.1 2023-06-25 13:37:44,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1344972.0, ans=0.5 2023-06-25 13:38:09,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1345092.0, ans=0.035 2023-06-25 13:38:20,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1345092.0, ans=0.125 2023-06-25 13:38:38,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1345152.0, ans=0.125 2023-06-25 13:39:22,492 INFO [train.py:996] (1/4) Epoch 8, batch 10750, loss[loss=0.2359, simple_loss=0.3346, pruned_loss=0.06864, over 21751.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2956, pruned_loss=0.06992, over 4245132.85 frames. ], batch size: 332, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:39:28,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345272.0, ans=0.1 2023-06-25 13:39:51,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1345332.0, ans=0.2 2023-06-25 13:39:56,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.99 vs. 
limit=15.0 2023-06-25 13:40:31,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1345452.0, ans=0.125 2023-06-25 13:40:51,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1345512.0, ans=0.125 2023-06-25 13:41:05,933 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.724e+02 3.874e+02 4.652e+02 6.783e+02 1.933e+03, threshold=9.304e+02, percent-clipped=9.0 2023-06-25 13:41:08,310 INFO [train.py:996] (1/4) Epoch 8, batch 10800, loss[loss=0.2309, simple_loss=0.3124, pruned_loss=0.07474, over 21745.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3017, pruned_loss=0.07113, over 4251115.58 frames. ], batch size: 332, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:41:28,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345632.0, ans=0.1 2023-06-25 13:41:30,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1345632.0, ans=0.125 2023-06-25 13:41:50,458 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:42:07,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2023-06-25 13:42:23,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1345752.0, ans=0.0 2023-06-25 13:42:31,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1345752.0, ans=0.125 2023-06-25 13:42:33,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-25 13:42:41,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345812.0, ans=0.1 2023-06-25 13:42:52,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-25 13:42:53,467 INFO [train.py:996] (1/4) Epoch 8, batch 10850, loss[loss=0.2296, simple_loss=0.3236, pruned_loss=0.0678, over 20828.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3015, pruned_loss=0.07084, over 4247886.86 frames. ], batch size: 609, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:43:05,754 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0 2023-06-25 13:44:24,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346112.0, ans=0.1 2023-06-25 13:44:40,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1346112.0, ans=0.0 2023-06-25 13:44:43,297 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.154e+02 5.827e+02 8.227e+02 1.341e+03, threshold=1.165e+03, percent-clipped=17.0 2023-06-25 13:44:43,342 INFO [train.py:996] (1/4) Epoch 8, batch 10900, loss[loss=0.2158, simple_loss=0.3136, pruned_loss=0.05895, over 21813.00 frames. 
], tot_loss[loss=0.2163, simple_loss=0.2952, pruned_loss=0.06873, over 4258587.74 frames. ], batch size: 371, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:45:11,715 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-25 13:46:27,636 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-25 13:46:28,393 INFO [train.py:996] (1/4) Epoch 8, batch 10950, loss[loss=0.1931, simple_loss=0.2678, pruned_loss=0.05919, over 21837.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2906, pruned_loss=0.06692, over 4252865.26 frames. ], batch size: 352, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:46:57,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1346532.0, ans=0.125 2023-06-25 13:47:09,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1346532.0, ans=0.125 2023-06-25 13:47:12,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1346592.0, ans=0.07 2023-06-25 13:47:31,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1346592.0, ans=0.125 2023-06-25 13:47:39,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346652.0, ans=0.1 2023-06-25 13:48:10,257 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.690e+02 3.805e+02 5.172e+02 7.672e+02 1.562e+03, threshold=1.034e+03, percent-clipped=4.0 2023-06-25 13:48:10,288 INFO [train.py:996] (1/4) Epoch 8, batch 11000, loss[loss=0.2209, simple_loss=0.2855, pruned_loss=0.07817, over 21860.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2878, pruned_loss=0.06781, over 4263085.81 frames. ], batch size: 332, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:48:39,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-25 13:48:51,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1346832.0, ans=0.0 2023-06-25 13:49:32,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1346952.0, ans=0.125 2023-06-25 13:49:59,326 INFO [train.py:996] (1/4) Epoch 8, batch 11050, loss[loss=0.1987, simple_loss=0.265, pruned_loss=0.06625, over 21743.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2856, pruned_loss=0.06815, over 4263879.38 frames. ], batch size: 371, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:49:59,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1347072.0, ans=0.0 2023-06-25 13:50:20,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.61 vs. limit=22.5 2023-06-25 13:50:27,636 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.40 vs. 
limit=10.0 2023-06-25 13:50:32,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1347132.0, ans=0.09899494936611666 2023-06-25 13:51:20,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1347252.0, ans=0.125 2023-06-25 13:51:24,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1347252.0, ans=0.0 2023-06-25 13:51:49,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.876e+02 3.834e+02 4.608e+02 6.864e+02 1.206e+03, threshold=9.217e+02, percent-clipped=3.0 2023-06-25 13:51:50,019 INFO [train.py:996] (1/4) Epoch 8, batch 11100, loss[loss=0.1879, simple_loss=0.2599, pruned_loss=0.05802, over 21482.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2848, pruned_loss=0.06801, over 4260991.51 frames. ], batch size: 230, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:52:01,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1347372.0, ans=0.0 2023-06-25 13:52:11,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1347432.0, ans=0.2 2023-06-25 13:52:39,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1347492.0, ans=0.1 2023-06-25 13:53:11,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1347552.0, ans=0.1 2023-06-25 13:53:17,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1347552.0, ans=0.125 2023-06-25 13:53:39,296 INFO [train.py:996] (1/4) Epoch 8, batch 11150, loss[loss=0.1892, simple_loss=0.2616, pruned_loss=0.05843, over 21640.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2826, pruned_loss=0.06761, over 4266639.95 frames. ], batch size: 282, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:54:28,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1347792.0, ans=0.125 2023-06-25 13:54:53,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1347852.0, ans=0.04949747468305833 2023-06-25 13:54:58,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1347852.0, ans=0.0 2023-06-25 13:55:19,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1347912.0, ans=0.05 2023-06-25 13:55:22,808 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.920e+02 3.496e+02 4.428e+02 6.433e+02 1.139e+03, threshold=8.857e+02, percent-clipped=2.0 2023-06-25 13:55:22,853 INFO [train.py:996] (1/4) Epoch 8, batch 11200, loss[loss=0.1927, simple_loss=0.2633, pruned_loss=0.06104, over 21635.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2814, pruned_loss=0.06707, over 4254706.90 frames. 
], batch size: 298, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:55:25,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1347972.0, ans=0.2 2023-06-25 13:55:39,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1348032.0, ans=0.125 2023-06-25 13:56:01,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1348032.0, ans=0.0 2023-06-25 13:56:56,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-06-25 13:56:57,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1348212.0, ans=0.125 2023-06-25 13:57:10,454 INFO [train.py:996] (1/4) Epoch 8, batch 11250, loss[loss=0.2083, simple_loss=0.2965, pruned_loss=0.06007, over 21789.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2822, pruned_loss=0.06707, over 4265547.78 frames. ], batch size: 124, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:57:21,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1348272.0, ans=0.2 2023-06-25 13:57:54,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1348392.0, ans=0.125 2023-06-25 13:58:12,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1348452.0, ans=0.125 2023-06-25 13:58:59,633 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 3.512e+02 4.278e+02 5.867e+02 1.075e+03, threshold=8.556e+02, percent-clipped=3.0 2023-06-25 13:58:59,671 INFO [train.py:996] (1/4) Epoch 8, batch 11300, loss[loss=0.2306, simple_loss=0.3093, pruned_loss=0.0759, over 20661.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2837, pruned_loss=0.06764, over 4261384.89 frames. ], batch size: 607, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:00:49,634 INFO [train.py:996] (1/4) Epoch 8, batch 11350, loss[loss=0.1719, simple_loss=0.2312, pruned_loss=0.05627, over 20799.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2853, pruned_loss=0.06779, over 4269066.55 frames. ], batch size: 609, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:01:03,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1348872.0, ans=0.125 2023-06-25 14:02:22,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1349112.0, ans=0.0 2023-06-25 14:02:41,711 INFO [train.py:996] (1/4) Epoch 8, batch 11400, loss[loss=0.2172, simple_loss=0.304, pruned_loss=0.06521, over 21857.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2917, pruned_loss=0.07042, over 4270291.36 frames. 
], batch size: 317, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:02:43,592 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.820e+02 3.968e+02 4.967e+02 6.707e+02 2.156e+03, threshold=9.935e+02, percent-clipped=13.0 2023-06-25 14:02:44,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1349172.0, ans=0.125 2023-06-25 14:02:46,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-25 14:03:10,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1349232.0, ans=0.125 2023-06-25 14:04:17,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1349412.0, ans=0.1 2023-06-25 14:04:26,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1349412.0, ans=0.125 2023-06-25 14:04:36,640 INFO [train.py:996] (1/4) Epoch 8, batch 11450, loss[loss=0.203, simple_loss=0.2811, pruned_loss=0.06241, over 21299.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2922, pruned_loss=0.06933, over 4274611.41 frames. ], batch size: 176, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:05:48,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-25 14:06:13,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.12 vs. limit=15.0 2023-06-25 14:06:33,203 INFO [train.py:996] (1/4) Epoch 8, batch 11500, loss[loss=0.2403, simple_loss=0.3096, pruned_loss=0.08547, over 21187.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2965, pruned_loss=0.07115, over 4274061.61 frames. ], batch size: 143, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:06:34,722 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.573e+02 4.073e+02 4.904e+02 7.356e+02 1.531e+03, threshold=9.808e+02, percent-clipped=13.0 2023-06-25 14:06:42,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1349772.0, ans=0.0 2023-06-25 14:07:21,337 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=22.5 2023-06-25 14:08:17,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1350012.0, ans=0.2 2023-06-25 14:08:19,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1350012.0, ans=0.125 2023-06-25 14:08:29,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1350072.0, ans=0.125 2023-06-25 14:08:30,981 INFO [train.py:996] (1/4) Epoch 8, batch 11550, loss[loss=0.2334, simple_loss=0.3301, pruned_loss=0.06829, over 21835.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3019, pruned_loss=0.07087, over 4272647.06 frames. 
], batch size: 316, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:08:37,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1350072.0, ans=0.125 2023-06-25 14:09:23,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1350192.0, ans=0.125 2023-06-25 14:09:25,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1350192.0, ans=0.125 2023-06-25 14:09:28,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1350192.0, ans=0.125 2023-06-25 14:09:50,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1350252.0, ans=0.125 2023-06-25 14:10:22,636 INFO [train.py:996] (1/4) Epoch 8, batch 11600, loss[loss=0.2649, simple_loss=0.3721, pruned_loss=0.07883, over 21678.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3189, pruned_loss=0.07411, over 4268409.20 frames. ], batch size: 389, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:10:24,357 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.914e+02 4.338e+02 5.534e+02 7.509e+02 2.145e+03, threshold=1.107e+03, percent-clipped=20.0 2023-06-25 14:10:28,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1350372.0, ans=0.0 2023-06-25 14:10:41,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=12.0 2023-06-25 14:12:12,224 INFO [train.py:996] (1/4) Epoch 8, batch 11650, loss[loss=0.2014, simple_loss=0.2834, pruned_loss=0.05972, over 21761.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3233, pruned_loss=0.07427, over 4274393.18 frames. ], batch size: 124, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:12:41,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.73 vs. limit=5.0 2023-06-25 14:13:30,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-25 14:13:49,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1350912.0, ans=0.025 2023-06-25 14:13:55,109 INFO [train.py:996] (1/4) Epoch 8, batch 11700, loss[loss=0.2451, simple_loss=0.2818, pruned_loss=0.1042, over 21307.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3152, pruned_loss=0.0738, over 4261120.18 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:13:58,337 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 3.697e+02 5.318e+02 8.205e+02 1.649e+03, threshold=1.064e+03, percent-clipped=10.0 2023-06-25 14:14:37,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1351092.0, ans=0.2 2023-06-25 14:15:43,631 INFO [train.py:996] (1/4) Epoch 8, batch 11750, loss[loss=0.2283, simple_loss=0.3012, pruned_loss=0.07775, over 21418.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.305, pruned_loss=0.07269, over 4269165.87 frames. 
], batch size: 131, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:16:05,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1351332.0, ans=0.0 2023-06-25 14:16:05,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1351332.0, ans=0.0 2023-06-25 14:16:51,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1351392.0, ans=0.0 2023-06-25 14:17:02,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1351452.0, ans=0.2 2023-06-25 14:17:21,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1351512.0, ans=0.0 2023-06-25 14:17:26,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1351512.0, ans=0.125 2023-06-25 14:17:28,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1351512.0, ans=0.125 2023-06-25 14:17:40,826 INFO [train.py:996] (1/4) Epoch 8, batch 11800, loss[loss=0.2024, simple_loss=0.2992, pruned_loss=0.05275, over 21585.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3062, pruned_loss=0.0745, over 4275280.94 frames. ], batch size: 230, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:17:44,257 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.704e+02 3.917e+02 5.538e+02 7.967e+02 1.804e+03, threshold=1.108e+03, percent-clipped=14.0 2023-06-25 14:18:13,311 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:18:15,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1351692.0, ans=0.125 2023-06-25 14:19:01,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1351752.0, ans=0.125 2023-06-25 14:19:06,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.60 vs. limit=22.5 2023-06-25 14:19:30,724 INFO [train.py:996] (1/4) Epoch 8, batch 11850, loss[loss=0.2071, simple_loss=0.3153, pruned_loss=0.0494, over 20847.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3078, pruned_loss=0.07377, over 4280966.89 frames. ], batch size: 608, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:19:52,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1351932.0, ans=0.125 2023-06-25 14:20:09,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1351932.0, ans=0.2 2023-06-25 14:20:41,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1352052.0, ans=0.0 2023-06-25 14:21:17,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-25 14:21:22,242 INFO [train.py:996] (1/4) Epoch 8, batch 11900, loss[loss=0.2065, simple_loss=0.2796, pruned_loss=0.06671, over 21583.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3073, pruned_loss=0.07142, over 4270436.45 frames. 
], batch size: 230, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:21:23,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1352172.0, ans=0.0 2023-06-25 14:21:25,802 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.770e+02 3.589e+02 4.714e+02 6.474e+02 1.333e+03, threshold=9.428e+02, percent-clipped=3.0 2023-06-25 14:21:37,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1352172.0, ans=0.125 2023-06-25 14:22:13,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1352292.0, ans=0.125 2023-06-25 14:22:13,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1352292.0, ans=0.0 2023-06-25 14:22:14,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1352292.0, ans=0.2 2023-06-25 14:22:16,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1352292.0, ans=0.125 2023-06-25 14:23:10,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1352412.0, ans=0.0 2023-06-25 14:23:16,556 INFO [train.py:996] (1/4) Epoch 8, batch 11950, loss[loss=0.177, simple_loss=0.2673, pruned_loss=0.04334, over 21418.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3088, pruned_loss=0.06944, over 4265276.46 frames. ], batch size: 131, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:23:44,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1352532.0, ans=0.125 2023-06-25 14:24:03,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-25 14:24:08,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-25 14:24:44,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1352652.0, ans=0.1 2023-06-25 14:24:53,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1352712.0, ans=0.1 2023-06-25 14:25:06,512 INFO [train.py:996] (1/4) Epoch 8, batch 12000, loss[loss=0.2173, simple_loss=0.2736, pruned_loss=0.08046, over 21247.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3017, pruned_loss=0.0672, over 4270298.44 frames. ], batch size: 160, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:25:06,513 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 14:25:31,280 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2626, simple_loss=0.3537, pruned_loss=0.08577, over 1796401.00 frames. 
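[Editor's note] The recurring optim.py entries above ("Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...") summarize gradient-norm statistics and how often clipping fired for the logged batches. As a rough illustration only, here is a minimal sketch of quartile-based gradient clipping in PyTorch; it is not the icefall optimizer itself, and the function name, the 200-step history window, and the 2.5x-median factor are assumptions made for the example.

```python
# Hypothetical sketch (assumed helper, not icefall's optim.py): derive a clipping
# threshold from the recent history of total gradient norms and report the same
# kind of statistics that appear in the log lines above.
import torch


def clip_grad_with_quartile_threshold(parameters, norm_history, factor=2.5):
    """Clip gradients using a threshold derived from recent grad norms.

    parameters:   iterable of model parameters (after backward()).
    norm_history: list of recent total grad norms, maintained by the caller.
    factor:       multiple of the median norm used as the threshold (assumed value).
    Returns (total_norm, threshold, clipped) so the caller can log them.
    """
    grads = [p.grad for p in parameters if p.grad is not None]
    # Total L2 norm over all parameter gradients for this batch.
    total_norm = torch.norm(torch.stack([g.detach().norm() for g in grads])).item()
    norm_history.append(total_norm)

    # Quartiles of the recent norms, analogous to the logged
    # "grad-norm quartiles  min q1 median q3 max".
    hist = torch.tensor(norm_history[-200:])
    quartiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = factor * quartiles[2].item()  # e.g. 2.5x the median recent norm

    clipped = total_norm > threshold
    if clipped:
        # Scale all gradients down in place so their total norm equals the threshold.
        scale = threshold / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)
    return total_norm, threshold, clipped
```

A caller would accumulate the returned `clipped` flags over a log interval to report a "percent-clipped" figure like the one printed above; the real training code computes its statistics internally, so this sketch only mirrors the shape of the logged output.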
2023-06-25 14:25:31,282 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 14:25:34,817 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.581e+02 4.444e+02 6.606e+02 1.302e+03, threshold=8.887e+02, percent-clipped=8.0 2023-06-25 14:25:55,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1352832.0, ans=0.125 2023-06-25 14:26:21,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1352892.0, ans=0.125 2023-06-25 14:26:23,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1352892.0, ans=0.125 2023-06-25 14:27:08,713 INFO [train.py:996] (1/4) Epoch 8, batch 12050, loss[loss=0.2335, simple_loss=0.311, pruned_loss=0.07802, over 21814.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2975, pruned_loss=0.06834, over 4271378.33 frames. ], batch size: 124, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:27:25,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1353072.0, ans=0.125 2023-06-25 14:27:26,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353072.0, ans=0.1 2023-06-25 14:27:26,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1353072.0, ans=0.2 2023-06-25 14:27:32,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1353072.0, ans=0.2 2023-06-25 14:27:48,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1353132.0, ans=0.125 2023-06-25 14:28:19,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1353192.0, ans=0.1 2023-06-25 14:28:28,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1353252.0, ans=0.035 2023-06-25 14:28:47,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1353312.0, ans=0.0 2023-06-25 14:29:10,833 INFO [train.py:996] (1/4) Epoch 8, batch 12100, loss[loss=0.231, simple_loss=0.3156, pruned_loss=0.07318, over 21347.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3033, pruned_loss=0.07286, over 4278422.07 frames. ], batch size: 548, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:29:14,306 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.401e+02 6.036e+02 8.453e+02 2.254e+03, threshold=1.207e+03, percent-clipped=23.0 2023-06-25 14:30:19,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-25 14:31:09,971 INFO [train.py:996] (1/4) Epoch 8, batch 12150, loss[loss=0.207, simple_loss=0.2833, pruned_loss=0.06541, over 21202.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3063, pruned_loss=0.07196, over 4274904.89 frames. 
], batch size: 176, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:31:43,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1353732.0, ans=0.125 2023-06-25 14:32:32,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1353852.0, ans=0.125 2023-06-25 14:32:53,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1353912.0, ans=0.0 2023-06-25 14:32:56,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1353912.0, ans=0.2 2023-06-25 14:32:59,807 INFO [train.py:996] (1/4) Epoch 8, batch 12200, loss[loss=0.1953, simple_loss=0.261, pruned_loss=0.06484, over 21630.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3017, pruned_loss=0.07122, over 4270298.12 frames. ], batch size: 231, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:33:03,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.861e+02 3.926e+02 5.745e+02 7.853e+02 1.417e+03, threshold=1.149e+03, percent-clipped=2.0 2023-06-25 14:33:23,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1354032.0, ans=0.1 2023-06-25 14:33:25,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1354032.0, ans=0.125 2023-06-25 14:34:37,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1354212.0, ans=0.125 2023-06-25 14:34:47,635 INFO [train.py:996] (1/4) Epoch 8, batch 12250, loss[loss=0.2049, simple_loss=0.2779, pruned_loss=0.06592, over 21511.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2941, pruned_loss=0.0685, over 4264676.01 frames. ], batch size: 509, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:34:50,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1354272.0, ans=6.0 2023-06-25 14:35:40,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1354392.0, ans=0.07 2023-06-25 14:35:58,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1354452.0, ans=0.125 2023-06-25 14:36:30,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1354512.0, ans=0.0 2023-06-25 14:36:36,591 INFO [train.py:996] (1/4) Epoch 8, batch 12300, loss[loss=0.2086, simple_loss=0.2988, pruned_loss=0.05916, over 21060.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2857, pruned_loss=0.06349, over 4255573.26 frames. ], batch size: 607, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:36:37,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1354572.0, ans=0.0 2023-06-25 14:36:41,782 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 3.509e+02 4.835e+02 7.096e+02 1.534e+03, threshold=9.669e+02, percent-clipped=2.0 2023-06-25 14:36:50,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.28 vs. 
limit=15.0 2023-06-25 14:36:57,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5 2023-06-25 14:37:02,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1354632.0, ans=0.125 2023-06-25 14:37:28,662 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:38:12,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1354812.0, ans=0.125 2023-06-25 14:38:25,381 INFO [train.py:996] (1/4) Epoch 8, batch 12350, loss[loss=0.2261, simple_loss=0.3021, pruned_loss=0.075, over 21653.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2903, pruned_loss=0.06329, over 4262079.51 frames. ], batch size: 230, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:39:38,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1355052.0, ans=0.125 2023-06-25 14:40:06,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-25 14:40:12,701 INFO [train.py:996] (1/4) Epoch 8, batch 12400, loss[loss=0.2456, simple_loss=0.3077, pruned_loss=0.09177, over 21272.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2931, pruned_loss=0.06699, over 4269168.17 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:40:17,816 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.711e+02 4.388e+02 6.020e+02 7.604e+02 1.312e+03, threshold=1.204e+03, percent-clipped=10.0 2023-06-25 14:41:17,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1355352.0, ans=0.1 2023-06-25 14:41:21,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-25 14:41:22,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1355352.0, ans=0.0 2023-06-25 14:41:51,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1355412.0, ans=0.0 2023-06-25 14:42:04,063 INFO [train.py:996] (1/4) Epoch 8, batch 12450, loss[loss=0.2707, simple_loss=0.3415, pruned_loss=0.1, over 21317.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.297, pruned_loss=0.0699, over 4276544.38 frames. ], batch size: 159, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:43:41,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-25 14:43:55,913 INFO [train.py:996] (1/4) Epoch 8, batch 12500, loss[loss=0.2625, simple_loss=0.3642, pruned_loss=0.08037, over 21322.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3055, pruned_loss=0.07296, over 4271779.35 frames. 
], batch size: 549, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:44:02,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 4.292e+02 5.906e+02 9.269e+02 3.047e+03, threshold=1.181e+03, percent-clipped=14.0 2023-06-25 14:44:14,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355832.0, ans=0.1 2023-06-25 14:44:49,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1355892.0, ans=0.1 2023-06-25 14:45:08,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1355952.0, ans=15.0 2023-06-25 14:45:47,021 INFO [train.py:996] (1/4) Epoch 8, batch 12550, loss[loss=0.243, simple_loss=0.3315, pruned_loss=0.07729, over 21645.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3088, pruned_loss=0.07464, over 4275014.17 frames. ], batch size: 389, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:46:34,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1356132.0, ans=0.125 2023-06-25 14:46:36,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1356132.0, ans=0.125 2023-06-25 14:47:20,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1356312.0, ans=0.2 2023-06-25 14:47:42,218 INFO [train.py:996] (1/4) Epoch 8, batch 12600, loss[loss=0.1721, simple_loss=0.2665, pruned_loss=0.03886, over 21722.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3075, pruned_loss=0.07253, over 4273144.88 frames. ], batch size: 332, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:47:43,006 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:47:48,695 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.799e+02 4.195e+02 5.786e+02 8.769e+02 1.751e+03, threshold=1.157e+03, percent-clipped=8.0 2023-06-25 14:48:28,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1356492.0, ans=0.1 2023-06-25 14:49:14,399 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-25 14:49:20,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1356612.0, ans=0.0 2023-06-25 14:49:23,541 INFO [train.py:996] (1/4) Epoch 8, batch 12650, loss[loss=0.2092, simple_loss=0.2828, pruned_loss=0.0678, over 21838.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3006, pruned_loss=0.06917, over 4280048.49 frames. ], batch size: 282, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:51:04,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1356912.0, ans=0.125 2023-06-25 14:51:13,830 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:51:19,758 INFO [train.py:996] (1/4) Epoch 8, batch 12700, loss[loss=0.2492, simple_loss=0.3208, pruned_loss=0.08882, over 21723.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3005, pruned_loss=0.07132, over 4283636.93 frames. 
], batch size: 351, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:51:32,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 4.265e+02 5.595e+02 7.381e+02 1.572e+03, threshold=1.119e+03, percent-clipped=3.0 2023-06-25 14:52:26,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1357152.0, ans=0.2 2023-06-25 14:52:39,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1357212.0, ans=0.1 2023-06-25 14:52:42,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1357212.0, ans=0.2 2023-06-25 14:52:58,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.02 vs. limit=15.0 2023-06-25 14:53:02,632 INFO [train.py:996] (1/4) Epoch 8, batch 12750, loss[loss=0.1907, simple_loss=0.2745, pruned_loss=0.05346, over 21476.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3026, pruned_loss=0.07166, over 4279995.27 frames. ], batch size: 212, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:53:21,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-25 14:54:45,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1357512.0, ans=0.95 2023-06-25 14:54:57,127 INFO [train.py:996] (1/4) Epoch 8, batch 12800, loss[loss=0.2419, simple_loss=0.3053, pruned_loss=0.08926, over 21405.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3025, pruned_loss=0.07261, over 4282828.19 frames. ], batch size: 548, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:55:04,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.881e+02 3.698e+02 4.519e+02 5.409e+02 8.581e+02, threshold=9.039e+02, percent-clipped=0.0 2023-06-25 14:55:40,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1357692.0, ans=0.125 2023-06-25 14:55:54,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1357752.0, ans=0.125 2023-06-25 14:56:16,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1357752.0, ans=0.0 2023-06-25 14:56:37,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1357812.0, ans=0.0 2023-06-25 14:56:41,323 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-25 14:56:47,893 INFO [train.py:996] (1/4) Epoch 8, batch 12850, loss[loss=0.2197, simple_loss=0.315, pruned_loss=0.06218, over 21684.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3033, pruned_loss=0.07354, over 4282143.47 frames. ], batch size: 441, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:56:53,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1357872.0, ans=0.04949747468305833 2023-06-25 14:57:28,438 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.81 vs. 
limit=22.5 2023-06-25 14:58:13,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1358052.0, ans=0.2 2023-06-25 14:58:40,132 INFO [train.py:996] (1/4) Epoch 8, batch 12900, loss[loss=0.165, simple_loss=0.2427, pruned_loss=0.04365, over 21778.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3022, pruned_loss=0.0708, over 4265682.33 frames. ], batch size: 124, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:58:47,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.654e+02 3.588e+02 4.373e+02 7.155e+02 1.857e+03, threshold=8.745e+02, percent-clipped=14.0 2023-06-25 14:58:49,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1358172.0, ans=0.1 2023-06-25 14:58:55,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-25 15:00:00,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1358412.0, ans=0.125 2023-06-25 15:00:24,705 INFO [train.py:996] (1/4) Epoch 8, batch 12950, loss[loss=0.2374, simple_loss=0.3176, pruned_loss=0.07861, over 21371.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3022, pruned_loss=0.06907, over 4262548.35 frames. ], batch size: 549, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:00:36,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1358472.0, ans=0.125 2023-06-25 15:00:44,348 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-25 15:01:12,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1358592.0, ans=0.125 2023-06-25 15:01:25,123 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-25 15:02:14,887 INFO [train.py:996] (1/4) Epoch 8, batch 13000, loss[loss=0.1641, simple_loss=0.2457, pruned_loss=0.04118, over 21818.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3028, pruned_loss=0.06942, over 4260501.86 frames. ], batch size: 118, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:02:23,105 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.843e+02 4.886e+02 6.754e+02 1.173e+03, threshold=9.772e+02, percent-clipped=9.0 2023-06-25 15:02:35,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1358832.0, ans=0.025 2023-06-25 15:02:51,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1358832.0, ans=0.2 2023-06-25 15:03:52,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1359012.0, ans=0.125 2023-06-25 15:03:52,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1359012.0, ans=0.125 2023-06-25 15:03:57,492 INFO [train.py:996] (1/4) Epoch 8, batch 13050, loss[loss=0.2148, simple_loss=0.283, pruned_loss=0.0733, over 21813.00 frames. 
], tot_loss[loss=0.2149, simple_loss=0.2961, pruned_loss=0.06679, over 4267229.28 frames. ], batch size: 247, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:04:15,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1359132.0, ans=0.125 2023-06-25 15:04:18,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1359132.0, ans=0.125 2023-06-25 15:04:20,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1359132.0, ans=0.0 2023-06-25 15:04:53,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1359192.0, ans=0.125 2023-06-25 15:05:18,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1359252.0, ans=0.0 2023-06-25 15:05:20,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1359252.0, ans=0.125 2023-06-25 15:05:41,252 INFO [train.py:996] (1/4) Epoch 8, batch 13100, loss[loss=0.2108, simple_loss=0.2937, pruned_loss=0.06397, over 21831.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.297, pruned_loss=0.06732, over 4279200.53 frames. ], batch size: 282, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:05:50,178 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.856e+02 3.427e+02 4.465e+02 6.179e+02 1.477e+03, threshold=8.931e+02, percent-clipped=2.0 2023-06-25 15:05:50,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1359372.0, ans=0.125 2023-06-25 15:06:26,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1359432.0, ans=0.125 2023-06-25 15:06:34,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1359492.0, ans=0.125 2023-06-25 15:06:41,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1359492.0, ans=0.125 2023-06-25 15:07:10,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1359552.0, ans=0.125 2023-06-25 15:07:14,731 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:07:30,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1359672.0, ans=0.125 2023-06-25 15:07:31,675 INFO [train.py:996] (1/4) Epoch 8, batch 13150, loss[loss=0.2408, simple_loss=0.3093, pruned_loss=0.08619, over 21521.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2996, pruned_loss=0.06994, over 4284957.92 frames. 
], batch size: 389, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:07:34,013 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:08:02,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1359672.0, ans=0.0 2023-06-25 15:08:16,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1359732.0, ans=0.04949747468305833 2023-06-25 15:08:36,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1359792.0, ans=0.125 2023-06-25 15:09:01,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1359852.0, ans=0.125 2023-06-25 15:09:27,373 INFO [train.py:996] (1/4) Epoch 8, batch 13200, loss[loss=0.2382, simple_loss=0.3171, pruned_loss=0.07962, over 21832.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.297, pruned_loss=0.06939, over 4283551.62 frames. ], batch size: 124, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 15:09:46,137 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.744e+02 3.706e+02 4.388e+02 6.661e+02 1.084e+03, threshold=8.775e+02, percent-clipped=9.0 2023-06-25 15:10:28,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1360092.0, ans=0.2 2023-06-25 15:10:34,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1360152.0, ans=0.125 2023-06-25 15:11:21,331 INFO [train.py:996] (1/4) Epoch 8, batch 13250, loss[loss=0.2132, simple_loss=0.3026, pruned_loss=0.06192, over 21868.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2982, pruned_loss=0.07099, over 4278748.28 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:11:35,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1360272.0, ans=0.125 2023-06-25 15:11:46,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1360332.0, ans=0.0 2023-06-25 15:12:30,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1360452.0, ans=0.125 2023-06-25 15:13:18,499 INFO [train.py:996] (1/4) Epoch 8, batch 13300, loss[loss=0.2635, simple_loss=0.3363, pruned_loss=0.09534, over 21730.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3025, pruned_loss=0.07057, over 4274155.06 frames. ], batch size: 441, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:13:34,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.717e+02 5.105e+02 6.593e+02 1.654e+03, threshold=1.021e+03, percent-clipped=11.0 2023-06-25 15:14:02,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-25 15:14:13,959 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. 
limit=15.0 2023-06-25 15:14:24,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1360752.0, ans=0.125 2023-06-25 15:15:08,172 INFO [train.py:996] (1/4) Epoch 8, batch 13350, loss[loss=0.247, simple_loss=0.3242, pruned_loss=0.08491, over 21546.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3064, pruned_loss=0.07322, over 4274642.10 frames. ], batch size: 414, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:16:32,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1361052.0, ans=0.1 2023-06-25 15:16:42,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1361112.0, ans=0.2 2023-06-25 15:17:03,314 INFO [train.py:996] (1/4) Epoch 8, batch 13400, loss[loss=0.2782, simple_loss=0.3428, pruned_loss=0.1069, over 21521.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3087, pruned_loss=0.07605, over 4276391.02 frames. ], batch size: 471, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:17:13,984 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.122e+02 3.939e+02 4.986e+02 7.057e+02 1.760e+03, threshold=9.973e+02, percent-clipped=5.0 2023-06-25 15:17:33,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1361232.0, ans=22.5 2023-06-25 15:18:30,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1361412.0, ans=0.5 2023-06-25 15:18:52,663 INFO [train.py:996] (1/4) Epoch 8, batch 13450, loss[loss=0.2472, simple_loss=0.315, pruned_loss=0.08973, over 21633.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3105, pruned_loss=0.07845, over 4272208.24 frames. ], batch size: 441, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:18:53,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1361472.0, ans=0.125 2023-06-25 15:19:40,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1361592.0, ans=0.125 2023-06-25 15:20:27,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1361712.0, ans=0.0 2023-06-25 15:20:30,798 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:20:42,822 INFO [train.py:996] (1/4) Epoch 8, batch 13500, loss[loss=0.2261, simple_loss=0.2907, pruned_loss=0.08076, over 20687.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3013, pruned_loss=0.07549, over 4261990.86 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:20:53,698 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.704e+02 3.900e+02 4.940e+02 7.289e+02 1.559e+03, threshold=9.879e+02, percent-clipped=7.0 2023-06-25 15:21:38,562 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.31 vs. 
limit=15.0 2023-06-25 15:21:57,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1361952.0, ans=0.0 2023-06-25 15:22:08,835 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:22:33,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1362072.0, ans=0.2 2023-06-25 15:22:34,519 INFO [train.py:996] (1/4) Epoch 8, batch 13550, loss[loss=0.2324, simple_loss=0.359, pruned_loss=0.05294, over 19809.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3051, pruned_loss=0.07484, over 4260191.62 frames. ], batch size: 703, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:22:36,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1362072.0, ans=0.125 2023-06-25 15:22:46,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1362072.0, ans=0.125 2023-06-25 15:23:17,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-25 15:23:26,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1362192.0, ans=0.1 2023-06-25 15:23:41,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1362252.0, ans=0.2 2023-06-25 15:24:12,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-25 15:24:18,228 INFO [train.py:996] (1/4) Epoch 8, batch 13600, loss[loss=0.1964, simple_loss=0.2809, pruned_loss=0.05591, over 21798.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3057, pruned_loss=0.0749, over 4260842.84 frames. ], batch size: 298, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:24:28,504 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.890e+02 3.859e+02 5.232e+02 7.287e+02 1.567e+03, threshold=1.046e+03, percent-clipped=12.0 2023-06-25 15:24:29,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1362372.0, ans=0.0 2023-06-25 15:25:05,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-25 15:25:34,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1362552.0, ans=0.0 2023-06-25 15:26:01,172 INFO [train.py:996] (1/4) Epoch 8, batch 13650, loss[loss=0.2022, simple_loss=0.2708, pruned_loss=0.06682, over 21354.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2996, pruned_loss=0.07183, over 4267261.71 frames. ], batch size: 131, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:26:05,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.10 vs. 
limit=15.0 2023-06-25 15:26:08,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1362672.0, ans=0.0 2023-06-25 15:26:39,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1362732.0, ans=0.0 2023-06-25 15:26:49,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1362792.0, ans=0.1 2023-06-25 15:27:38,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1362912.0, ans=0.1 2023-06-25 15:27:50,065 INFO [train.py:996] (1/4) Epoch 8, batch 13700, loss[loss=0.1872, simple_loss=0.263, pruned_loss=0.05572, over 21689.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2959, pruned_loss=0.07178, over 4266865.67 frames. ], batch size: 247, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:27:54,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1362972.0, ans=0.1 2023-06-25 15:28:08,788 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 3.641e+02 4.705e+02 7.070e+02 1.116e+03, threshold=9.410e+02, percent-clipped=4.0 2023-06-25 15:29:34,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-25 15:29:46,968 INFO [train.py:996] (1/4) Epoch 8, batch 13750, loss[loss=0.2411, simple_loss=0.3235, pruned_loss=0.07935, over 21570.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2923, pruned_loss=0.07094, over 4265262.91 frames. ], batch size: 441, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:30:10,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1363332.0, ans=0.125 2023-06-25 15:30:10,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1363332.0, ans=0.125 2023-06-25 15:30:12,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1363332.0, ans=0.0 2023-06-25 15:31:02,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1363452.0, ans=0.125 2023-06-25 15:31:39,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363512.0, ans=0.1 2023-06-25 15:31:42,858 INFO [train.py:996] (1/4) Epoch 8, batch 13800, loss[loss=0.2309, simple_loss=0.3214, pruned_loss=0.07023, over 21489.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2977, pruned_loss=0.06964, over 4274733.65 frames. 
], batch size: 211, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:32:00,806 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.906e+02 4.517e+02 6.756e+02 9.995e+02 2.111e+03, threshold=1.351e+03, percent-clipped=26.0 2023-06-25 15:32:13,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363632.0, ans=0.1 2023-06-25 15:32:29,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363692.0, ans=0.1 2023-06-25 15:32:36,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1363692.0, ans=0.125 2023-06-25 15:33:33,364 INFO [train.py:996] (1/4) Epoch 8, batch 13850, loss[loss=0.2604, simple_loss=0.3445, pruned_loss=0.08818, over 21767.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3058, pruned_loss=0.07162, over 4273739.35 frames. ], batch size: 332, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:33:42,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1363872.0, ans=0.1 2023-06-25 15:34:03,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1363932.0, ans=0.2 2023-06-25 15:34:39,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-06-25 15:34:46,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1364052.0, ans=0.125 2023-06-25 15:34:57,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1364112.0, ans=0.2 2023-06-25 15:35:03,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1364112.0, ans=0.125 2023-06-25 15:35:16,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1364112.0, ans=0.2 2023-06-25 15:35:16,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1364112.0, ans=0.125 2023-06-25 15:35:20,917 INFO [train.py:996] (1/4) Epoch 8, batch 13900, loss[loss=0.2348, simple_loss=0.3007, pruned_loss=0.08448, over 21374.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3097, pruned_loss=0.07483, over 4273631.20 frames. ], batch size: 211, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:35:33,174 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.067e+02 4.054e+02 4.959e+02 6.399e+02 1.364e+03, threshold=9.918e+02, percent-clipped=1.0 2023-06-25 15:37:04,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1364412.0, ans=0.125 2023-06-25 15:37:09,435 INFO [train.py:996] (1/4) Epoch 8, batch 13950, loss[loss=0.2345, simple_loss=0.3113, pruned_loss=0.07886, over 21282.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3092, pruned_loss=0.07581, over 4281464.80 frames. 
], batch size: 143, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:37:32,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1364532.0, ans=0.125 2023-06-25 15:37:58,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1364592.0, ans=0.2 2023-06-25 15:38:00,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1364592.0, ans=0.0 2023-06-25 15:38:57,950 INFO [train.py:996] (1/4) Epoch 8, batch 14000, loss[loss=0.1942, simple_loss=0.2826, pruned_loss=0.0529, over 21327.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3055, pruned_loss=0.07329, over 4279313.40 frames. ], batch size: 144, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:39:09,930 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.689e+02 3.751e+02 4.894e+02 7.186e+02 1.368e+03, threshold=9.787e+02, percent-clipped=13.0 2023-06-25 15:39:12,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1364772.0, ans=0.125 2023-06-25 15:39:34,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-25 15:40:10,416 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:40:45,624 INFO [train.py:996] (1/4) Epoch 8, batch 14050, loss[loss=0.2102, simple_loss=0.2785, pruned_loss=0.07094, over 21841.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3003, pruned_loss=0.06965, over 4286857.79 frames. ], batch size: 98, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:40:49,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1365072.0, ans=0.0 2023-06-25 15:40:50,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-25 15:41:12,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1365132.0, ans=0.0 2023-06-25 15:41:16,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1365132.0, ans=0.0 2023-06-25 15:41:30,748 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-25 15:42:03,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1365252.0, ans=0.0 2023-06-25 15:42:16,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-25 15:42:33,628 INFO [train.py:996] (1/4) Epoch 8, batch 14100, loss[loss=0.1887, simple_loss=0.2542, pruned_loss=0.06165, over 21844.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.295, pruned_loss=0.06981, over 4283119.18 frames. 
], batch size: 98, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:42:47,591 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.656e+02 3.476e+02 4.443e+02 5.620e+02 1.211e+03, threshold=8.886e+02, percent-clipped=2.0 2023-06-25 15:43:02,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1365432.0, ans=0.125 2023-06-25 15:43:48,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1365552.0, ans=10.0 2023-06-25 15:43:51,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1365552.0, ans=0.1 2023-06-25 15:44:11,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1365612.0, ans=0.0 2023-06-25 15:44:19,905 INFO [train.py:996] (1/4) Epoch 8, batch 14150, loss[loss=0.2159, simple_loss=0.2993, pruned_loss=0.06631, over 21812.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2975, pruned_loss=0.06966, over 4286979.23 frames. ], batch size: 124, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:44:21,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.97 vs. limit=15.0 2023-06-25 15:44:27,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1365672.0, ans=0.1 2023-06-25 15:44:57,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1365792.0, ans=0.125 2023-06-25 15:45:58,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1365912.0, ans=0.1 2023-06-25 15:46:01,174 INFO [train.py:996] (1/4) Epoch 8, batch 14200, loss[loss=0.2104, simple_loss=0.2797, pruned_loss=0.07052, over 21772.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2973, pruned_loss=0.06949, over 4288284.59 frames. ], batch size: 118, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:46:06,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1365972.0, ans=0.125 2023-06-25 15:46:20,190 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 4.879e+02 7.691e+02 1.070e+03 2.190e+03, threshold=1.538e+03, percent-clipped=38.0 2023-06-25 15:46:31,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1366032.0, ans=0.125 2023-06-25 15:46:35,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1366032.0, ans=0.0 2023-06-25 15:47:40,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1366212.0, ans=10.0 2023-06-25 15:47:46,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1366212.0, ans=0.2 2023-06-25 15:47:49,387 INFO [train.py:996] (1/4) Epoch 8, batch 14250, loss[loss=0.1889, simple_loss=0.2541, pruned_loss=0.06185, over 21407.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2923, pruned_loss=0.06911, over 4285809.87 frames. 
], batch size: 195, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:47:59,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-25 15:48:02,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1366272.0, ans=0.125 2023-06-25 15:49:33,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1366512.0, ans=0.1 2023-06-25 15:49:39,557 INFO [train.py:996] (1/4) Epoch 8, batch 14300, loss[loss=0.295, simple_loss=0.3895, pruned_loss=0.1002, over 21778.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2949, pruned_loss=0.06888, over 4274693.46 frames. ], batch size: 332, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:49:59,649 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.382e+02 4.720e+02 7.552e+02 1.673e+03, threshold=9.439e+02, percent-clipped=2.0 2023-06-25 15:50:04,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.04 vs. limit=6.0 2023-06-25 15:50:04,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=12.0 2023-06-25 15:50:16,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1366632.0, ans=0.2 2023-06-25 15:50:45,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1366752.0, ans=0.125 2023-06-25 15:51:23,203 INFO [train.py:996] (1/4) Epoch 8, batch 14350, loss[loss=0.222, simple_loss=0.3033, pruned_loss=0.07037, over 21090.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2987, pruned_loss=0.06833, over 4268669.99 frames. ], batch size: 608, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:52:26,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1367052.0, ans=0.1 2023-06-25 15:53:17,508 INFO [train.py:996] (1/4) Epoch 8, batch 14400, loss[loss=0.2234, simple_loss=0.2891, pruned_loss=0.07885, over 20050.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2975, pruned_loss=0.06962, over 4274518.27 frames. ], batch size: 704, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:53:18,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1367172.0, ans=0.125 2023-06-25 15:53:30,786 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.862e+02 3.850e+02 4.891e+02 6.324e+02 1.594e+03, threshold=9.783e+02, percent-clipped=6.0 2023-06-25 15:53:36,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1367232.0, ans=0.0 2023-06-25 15:53:38,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1367232.0, ans=0.1 2023-06-25 15:53:59,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1367292.0, ans=0.0 2023-06-25 15:54:00,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.89 vs. 
limit=15.0 2023-06-25 15:54:05,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-25 15:54:25,082 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=22.5 2023-06-25 15:54:30,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1367352.0, ans=15.0 2023-06-25 15:54:39,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1367412.0, ans=0.0 2023-06-25 15:54:53,563 INFO [train.py:996] (1/4) Epoch 8, batch 14450, loss[loss=0.1814, simple_loss=0.2494, pruned_loss=0.05669, over 21514.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.292, pruned_loss=0.0695, over 4270555.84 frames. ], batch size: 195, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:56:10,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1367652.0, ans=0.04949747468305833 2023-06-25 15:56:13,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=15.0 2023-06-25 15:56:40,134 INFO [train.py:996] (1/4) Epoch 8, batch 14500, loss[loss=0.1957, simple_loss=0.2602, pruned_loss=0.06556, over 21656.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2879, pruned_loss=0.06963, over 4260867.15 frames. ], batch size: 282, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:57:02,148 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.716e+02 3.452e+02 4.183e+02 6.174e+02 1.088e+03, threshold=8.366e+02, percent-clipped=1.0 2023-06-25 15:57:04,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1367832.0, ans=0.2 2023-06-25 15:57:09,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1367832.0, ans=0.125 2023-06-25 15:57:20,835 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-25 15:57:46,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1367892.0, ans=0.0 2023-06-25 15:57:54,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-06-25 15:57:55,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1367952.0, ans=0.125 2023-06-25 15:58:03,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-25 15:58:14,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1368012.0, ans=0.125 2023-06-25 15:58:33,918 INFO [train.py:996] (1/4) Epoch 8, batch 14550, loss[loss=0.2193, simple_loss=0.2809, pruned_loss=0.07888, over 20116.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2906, pruned_loss=0.07065, over 4263317.32 frames. 
], batch size: 703, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 15:59:20,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1368192.0, ans=0.0 2023-06-25 15:59:28,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1368192.0, ans=0.125 2023-06-25 16:00:21,550 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:00:22,837 INFO [train.py:996] (1/4) Epoch 8, batch 14600, loss[loss=0.2236, simple_loss=0.3143, pruned_loss=0.06644, over 21261.00 frames. ], tot_loss[loss=0.223, simple_loss=0.298, pruned_loss=0.07403, over 4271383.02 frames. ], batch size: 176, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:00:24,214 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-25 16:00:37,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1368372.0, ans=0.0 2023-06-25 16:00:38,178 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.717e+02 6.049e+02 8.556e+02 1.756e+03, threshold=1.210e+03, percent-clipped=27.0 2023-06-25 16:01:31,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.03 vs. limit=5.0 2023-06-25 16:01:39,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5 2023-06-25 16:02:10,908 INFO [train.py:996] (1/4) Epoch 8, batch 14650, loss[loss=0.2081, simple_loss=0.2937, pruned_loss=0.0612, over 21803.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3015, pruned_loss=0.07328, over 4275115.24 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:02:11,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1368672.0, ans=0.125 2023-06-25 16:02:13,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1368672.0, ans=15.0 2023-06-25 16:03:02,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=22.5 2023-06-25 16:03:58,482 INFO [train.py:996] (1/4) Epoch 8, batch 14700, loss[loss=0.1895, simple_loss=0.2646, pruned_loss=0.05722, over 21421.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2984, pruned_loss=0.0691, over 4274120.48 frames. 
], batch size: 131, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:04:14,432 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.677e+02 4.958e+02 7.109e+02 1.155e+03, threshold=9.917e+02, percent-clipped=0.0 2023-06-25 16:04:14,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1369032.0, ans=0.125 2023-06-25 16:04:22,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1369032.0, ans=0.0 2023-06-25 16:05:02,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1369092.0, ans=0.125 2023-06-25 16:05:50,273 INFO [train.py:996] (1/4) Epoch 8, batch 14750, loss[loss=0.2683, simple_loss=0.3492, pruned_loss=0.09369, over 21580.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3019, pruned_loss=0.07102, over 4278761.51 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:05:53,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-25 16:07:29,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1369512.0, ans=0.0 2023-06-25 16:07:47,189 INFO [train.py:996] (1/4) Epoch 8, batch 14800, loss[loss=0.2197, simple_loss=0.2937, pruned_loss=0.07288, over 21370.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3097, pruned_loss=0.07486, over 4274571.97 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:08:12,802 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.189e+02 4.811e+02 6.847e+02 1.023e+03 2.171e+03, threshold=1.369e+03, percent-clipped=26.0 2023-06-25 16:08:18,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1369632.0, ans=0.1 2023-06-25 16:08:27,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1369632.0, ans=0.125 2023-06-25 16:08:29,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1369632.0, ans=0.0 2023-06-25 16:08:56,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-25 16:09:43,085 INFO [train.py:996] (1/4) Epoch 8, batch 14850, loss[loss=0.2418, simple_loss=0.3194, pruned_loss=0.08216, over 19947.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3037, pruned_loss=0.07468, over 4262859.31 frames. ], batch size: 703, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:09:45,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1369872.0, ans=0.0 2023-06-25 16:09:47,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1369872.0, ans=0.1 2023-06-25 16:09:55,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. 
limit=15.0 2023-06-25 16:10:36,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1369992.0, ans=0.125 2023-06-25 16:11:39,842 INFO [train.py:996] (1/4) Epoch 8, batch 14900, loss[loss=0.3099, simple_loss=0.3759, pruned_loss=0.1219, over 21372.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3057, pruned_loss=0.07622, over 4266443.13 frames. ], batch size: 507, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:11:54,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1370172.0, ans=0.0 2023-06-25 16:11:57,469 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.090e+02 4.206e+02 5.469e+02 8.347e+02 1.577e+03, threshold=1.094e+03, percent-clipped=2.0 2023-06-25 16:12:58,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1370352.0, ans=0.125 2023-06-25 16:13:02,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-25 16:13:30,835 INFO [train.py:996] (1/4) Epoch 8, batch 14950, loss[loss=0.2278, simple_loss=0.3085, pruned_loss=0.07358, over 21719.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3065, pruned_loss=0.07577, over 4260165.57 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:14:20,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.31 vs. limit=15.0 2023-06-25 16:15:08,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1370712.0, ans=0.2 2023-06-25 16:15:19,933 INFO [train.py:996] (1/4) Epoch 8, batch 15000, loss[loss=0.2146, simple_loss=0.2856, pruned_loss=0.07183, over 21334.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3089, pruned_loss=0.07719, over 4266457.91 frames. ], batch size: 159, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:15:19,934 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 16:15:40,715 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2554, simple_loss=0.3473, pruned_loss=0.08173, over 1796401.00 frames. 2023-06-25 16:15:40,716 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 16:15:41,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1370772.0, ans=0.125 2023-06-25 16:15:58,825 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.944e+02 3.850e+02 4.977e+02 6.696e+02 1.113e+03, threshold=9.953e+02, percent-clipped=2.0 2023-06-25 16:17:30,919 INFO [train.py:996] (1/4) Epoch 8, batch 15050, loss[loss=0.2705, simple_loss=0.3664, pruned_loss=0.08734, over 21680.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3111, pruned_loss=0.07821, over 4265152.16 frames. ], batch size: 441, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:17:42,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1371072.0, ans=0.1 2023-06-25 16:17:42,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. 
limit=22.5 2023-06-25 16:18:56,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1371312.0, ans=0.09899494936611666 2023-06-25 16:19:05,051 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:19:20,734 INFO [train.py:996] (1/4) Epoch 8, batch 15100, loss[loss=0.2412, simple_loss=0.3155, pruned_loss=0.08348, over 21701.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3146, pruned_loss=0.07824, over 4261100.31 frames. ], batch size: 351, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:19:26,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1371372.0, ans=0.2 2023-06-25 16:19:43,658 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.981e+02 4.480e+02 6.447e+02 8.808e+02 1.442e+03, threshold=1.289e+03, percent-clipped=16.0 2023-06-25 16:19:55,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1371432.0, ans=0.125 2023-06-25 16:19:57,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=1371432.0, ans=15.0 2023-06-25 16:20:00,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1371432.0, ans=0.2 2023-06-25 16:21:03,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1371612.0, ans=0.125 2023-06-25 16:21:09,586 INFO [train.py:996] (1/4) Epoch 8, batch 15150, loss[loss=0.2242, simple_loss=0.2759, pruned_loss=0.08627, over 21477.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.311, pruned_loss=0.07821, over 4257810.18 frames. ], batch size: 441, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:21:28,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-25 16:21:50,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1371732.0, ans=0.125 2023-06-25 16:22:00,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1371792.0, ans=0.125 2023-06-25 16:22:53,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1371912.0, ans=0.125 2023-06-25 16:22:57,835 INFO [train.py:996] (1/4) Epoch 8, batch 15200, loss[loss=0.2424, simple_loss=0.3295, pruned_loss=0.07762, over 20685.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3019, pruned_loss=0.07472, over 4260118.49 frames. ], batch size: 607, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:23:13,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. 
limit=10.0 2023-06-25 16:23:26,364 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.586e+02 3.888e+02 5.742e+02 8.749e+02 1.820e+03, threshold=1.148e+03, percent-clipped=6.0 2023-06-25 16:23:35,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1372032.0, ans=0.125 2023-06-25 16:23:51,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-25 16:24:52,721 INFO [train.py:996] (1/4) Epoch 8, batch 15250, loss[loss=0.2393, simple_loss=0.3116, pruned_loss=0.08348, over 20806.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2982, pruned_loss=0.07297, over 4258398.88 frames. ], batch size: 611, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:25:42,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1372392.0, ans=0.0 2023-06-25 16:26:48,261 INFO [train.py:996] (1/4) Epoch 8, batch 15300, loss[loss=0.2906, simple_loss=0.3451, pruned_loss=0.1181, over 21397.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3004, pruned_loss=0.07502, over 4258902.45 frames. ], batch size: 471, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:27:12,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.814e+02 3.962e+02 5.141e+02 6.603e+02 1.300e+03, threshold=1.028e+03, percent-clipped=5.0 2023-06-25 16:27:18,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1372632.0, ans=0.0 2023-06-25 16:27:47,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-25 16:28:14,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1372812.0, ans=0.0 2023-06-25 16:28:14,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1372812.0, ans=0.2 2023-06-25 16:28:30,831 INFO [train.py:996] (1/4) Epoch 8, batch 15350, loss[loss=0.2326, simple_loss=0.3177, pruned_loss=0.07373, over 21709.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3064, pruned_loss=0.07668, over 4262190.28 frames. ], batch size: 351, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:28:52,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. 
limit=12.0 2023-06-25 16:29:02,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1372932.0, ans=0.125 2023-06-25 16:29:20,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1372992.0, ans=0.125 2023-06-25 16:29:40,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1373052.0, ans=0.125 2023-06-25 16:29:46,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1373052.0, ans=0.1 2023-06-25 16:30:10,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1373112.0, ans=0.0 2023-06-25 16:30:13,112 INFO [train.py:996] (1/4) Epoch 8, batch 15400, loss[loss=0.2105, simple_loss=0.2854, pruned_loss=0.06778, over 21845.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3076, pruned_loss=0.07552, over 4267153.88 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:30:13,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1373172.0, ans=0.125 2023-06-25 16:30:46,928 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.892e+02 4.124e+02 5.602e+02 8.412e+02 1.592e+03, threshold=1.120e+03, percent-clipped=11.0 2023-06-25 16:31:31,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1373352.0, ans=0.0 2023-06-25 16:31:31,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1373352.0, ans=0.125 2023-06-25 16:31:47,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1373412.0, ans=0.125 2023-06-25 16:31:51,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1373412.0, ans=0.2 2023-06-25 16:32:00,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1373472.0, ans=0.1 2023-06-25 16:32:01,323 INFO [train.py:996] (1/4) Epoch 8, batch 15450, loss[loss=0.2193, simple_loss=0.2929, pruned_loss=0.07284, over 21495.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3046, pruned_loss=0.0745, over 4261567.19 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:32:43,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1373532.0, ans=0.0 2023-06-25 16:33:29,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1373652.0, ans=0.125 2023-06-25 16:33:32,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1373712.0, ans=0.125 2023-06-25 16:34:02,541 INFO [train.py:996] (1/4) Epoch 8, batch 15500, loss[loss=0.2508, simple_loss=0.3305, pruned_loss=0.08556, over 21526.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3056, pruned_loss=0.0745, over 4270501.49 frames. 
], batch size: 414, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:34:27,237 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.957e+02 5.678e+02 7.705e+02 1.506e+03, threshold=1.136e+03, percent-clipped=3.0 2023-06-25 16:34:53,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-25 16:35:53,646 INFO [train.py:996] (1/4) Epoch 8, batch 15550, loss[loss=0.2125, simple_loss=0.3017, pruned_loss=0.06162, over 21733.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3026, pruned_loss=0.07193, over 4263816.75 frames. ], batch size: 332, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:36:05,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1374072.0, ans=0.0 2023-06-25 16:36:55,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1374252.0, ans=0.125 2023-06-25 16:37:21,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-25 16:37:42,336 INFO [train.py:996] (1/4) Epoch 8, batch 15600, loss[loss=0.239, simple_loss=0.2808, pruned_loss=0.09862, over 21342.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2943, pruned_loss=0.07048, over 4259118.85 frames. ], batch size: 508, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:37:46,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1374372.0, ans=0.1 2023-06-25 16:38:01,280 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.371e+02 3.943e+02 5.908e+02 1.274e+03, threshold=7.887e+02, percent-clipped=2.0 2023-06-25 16:38:06,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.86 vs. limit=15.0 2023-06-25 16:39:13,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1374612.0, ans=0.125 2023-06-25 16:39:14,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-25 16:39:21,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1374612.0, ans=0.1 2023-06-25 16:39:30,802 INFO [train.py:996] (1/4) Epoch 8, batch 15650, loss[loss=0.1886, simple_loss=0.2591, pruned_loss=0.05903, over 21541.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.293, pruned_loss=0.07006, over 4265779.97 frames. ], batch size: 231, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:39:35,716 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-25 16:40:21,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1374792.0, ans=0.125 2023-06-25 16:40:31,148 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.50 vs. 
limit=15.0 2023-06-25 16:40:33,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1374852.0, ans=0.1 2023-06-25 16:41:19,311 INFO [train.py:996] (1/4) Epoch 8, batch 15700, loss[loss=0.2023, simple_loss=0.2675, pruned_loss=0.0686, over 21242.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2901, pruned_loss=0.0691, over 4263805.76 frames. ], batch size: 144, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:41:25,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1374972.0, ans=0.125 2023-06-25 16:41:33,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1374972.0, ans=0.0 2023-06-25 16:41:40,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.691e+02 3.513e+02 4.156e+02 5.605e+02 1.068e+03, threshold=8.312e+02, percent-clipped=8.0 2023-06-25 16:41:41,824 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-25 16:43:06,886 INFO [train.py:996] (1/4) Epoch 8, batch 15750, loss[loss=0.1997, simple_loss=0.2661, pruned_loss=0.06668, over 21637.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2863, pruned_loss=0.06938, over 4266576.64 frames. ], batch size: 264, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:43:41,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1375332.0, ans=0.0 2023-06-25 16:44:04,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1375452.0, ans=0.125 2023-06-25 16:44:55,704 INFO [train.py:996] (1/4) Epoch 8, batch 15800, loss[loss=0.2164, simple_loss=0.3026, pruned_loss=0.06511, over 16534.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2829, pruned_loss=0.06861, over 4256207.60 frames. ], batch size: 60, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:44:56,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1375572.0, ans=0.1 2023-06-25 16:45:16,665 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.881e+02 4.132e+02 5.788e+02 8.606e+02 2.042e+03, threshold=1.158e+03, percent-clipped=26.0 2023-06-25 16:45:37,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1375692.0, ans=0.125 2023-06-25 16:46:17,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-25 16:46:18,691 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:46:44,090 INFO [train.py:996] (1/4) Epoch 8, batch 15850, loss[loss=0.229, simple_loss=0.3272, pruned_loss=0.06545, over 16173.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2866, pruned_loss=0.07008, over 4255051.35 frames. 
], batch size: 60, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:47:03,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1375932.0, ans=0.2 2023-06-25 16:47:17,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1375932.0, ans=0.125 2023-06-25 16:48:00,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-25 16:48:31,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1376172.0, ans=0.0 2023-06-25 16:48:32,033 INFO [train.py:996] (1/4) Epoch 8, batch 15900, loss[loss=0.1994, simple_loss=0.2799, pruned_loss=0.0594, over 21763.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2865, pruned_loss=0.07074, over 4260037.37 frames. ], batch size: 124, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:48:49,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1376232.0, ans=0.1 2023-06-25 16:48:52,531 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.061e+02 4.407e+02 5.744e+02 8.356e+02 1.559e+03, threshold=1.149e+03, percent-clipped=5.0 2023-06-25 16:49:20,169 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.40 vs. limit=10.0 2023-06-25 16:50:19,044 INFO [train.py:996] (1/4) Epoch 8, batch 15950, loss[loss=0.2262, simple_loss=0.314, pruned_loss=0.06916, over 21749.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2877, pruned_loss=0.06935, over 4251804.26 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:50:34,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.49 vs. limit=15.0 2023-06-25 16:50:38,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1376532.0, ans=0.1 2023-06-25 16:51:00,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-25 16:51:12,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1376592.0, ans=0.1 2023-06-25 16:51:25,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-25 16:52:10,858 INFO [train.py:996] (1/4) Epoch 8, batch 16000, loss[loss=0.2152, simple_loss=0.3074, pruned_loss=0.06149, over 21667.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2877, pruned_loss=0.06677, over 4263806.59 frames. 
], batch size: 441, lr: 3.72e-03, grad_scale: 32.0 2023-06-25 16:52:31,814 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.505e+02 3.799e+02 4.877e+02 8.252e+02 1.708e+03, threshold=9.755e+02, percent-clipped=5.0 2023-06-25 16:52:57,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1376892.0, ans=0.125 2023-06-25 16:53:04,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1376892.0, ans=0.2 2023-06-25 16:53:59,490 INFO [train.py:996] (1/4) Epoch 8, batch 16050, loss[loss=0.2193, simple_loss=0.3177, pruned_loss=0.06045, over 21647.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2885, pruned_loss=0.06531, over 4265480.65 frames. ], batch size: 263, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:54:12,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1377072.0, ans=0.035 2023-06-25 16:54:44,369 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=22.5 2023-06-25 16:54:59,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1377252.0, ans=0.0 2023-06-25 16:55:01,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1377252.0, ans=0.2 2023-06-25 16:55:33,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1377312.0, ans=0.125 2023-06-25 16:55:47,402 INFO [train.py:996] (1/4) Epoch 8, batch 16100, loss[loss=0.2055, simple_loss=0.2905, pruned_loss=0.06024, over 21412.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2917, pruned_loss=0.06596, over 4272126.45 frames. ], batch size: 194, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:55:55,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1377372.0, ans=0.1 2023-06-25 16:56:07,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-25 16:56:10,218 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.825e+02 4.315e+02 5.631e+02 9.006e+02 2.276e+03, threshold=1.126e+03, percent-clipped=22.0 2023-06-25 16:56:23,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1377492.0, ans=0.125 2023-06-25 16:56:31,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-25 16:56:44,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.10 vs. limit=6.0 2023-06-25 16:56:57,409 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.29 vs. limit=15.0 2023-06-25 16:57:23,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1377612.0, ans=0.0 2023-06-25 16:57:35,030 INFO [train.py:996] (1/4) Epoch 8, batch 16150, loss[loss=0.2189, simple_loss=0.3042, pruned_loss=0.06683, over 21539.00 frames. 
], tot_loss[loss=0.2155, simple_loss=0.2925, pruned_loss=0.06919, over 4268071.54 frames. ], batch size: 194, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:57:47,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1377672.0, ans=0.125 2023-06-25 16:58:26,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1377792.0, ans=0.125 2023-06-25 16:59:08,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1377912.0, ans=0.015 2023-06-25 16:59:10,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1377912.0, ans=0.2 2023-06-25 16:59:15,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1377912.0, ans=0.125 2023-06-25 16:59:20,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1377912.0, ans=0.0 2023-06-25 16:59:22,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1377972.0, ans=0.0 2023-06-25 16:59:23,928 INFO [train.py:996] (1/4) Epoch 8, batch 16200, loss[loss=0.2321, simple_loss=0.3222, pruned_loss=0.07099, over 20109.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2963, pruned_loss=0.06989, over 4274828.93 frames. ], batch size: 703, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:59:29,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1377972.0, ans=0.125 2023-06-25 16:59:46,148 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.018e+02 5.082e+02 7.447e+02 1.479e+03, threshold=1.016e+03, percent-clipped=6.0 2023-06-25 17:01:11,823 INFO [train.py:996] (1/4) Epoch 8, batch 16250, loss[loss=0.1581, simple_loss=0.2305, pruned_loss=0.04289, over 21260.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2959, pruned_loss=0.06942, over 4277918.82 frames. ], batch size: 159, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:01:27,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-25 17:01:30,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1378332.0, ans=0.04949747468305833 2023-06-25 17:02:43,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1378512.0, ans=0.0 2023-06-25 17:02:54,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1378512.0, ans=0.95 2023-06-25 17:03:00,429 INFO [train.py:996] (1/4) Epoch 8, batch 16300, loss[loss=0.1702, simple_loss=0.2547, pruned_loss=0.04287, over 21701.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2909, pruned_loss=0.06567, over 4275452.94 frames. 
], batch size: 298, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:03:24,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.309e+02 4.494e+02 6.869e+02 1.781e+03, threshold=8.988e+02, percent-clipped=11.0 2023-06-25 17:04:21,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1378752.0, ans=0.125 2023-06-25 17:04:27,349 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-25 17:04:32,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0 2023-06-25 17:04:46,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1378812.0, ans=0.125 2023-06-25 17:04:50,594 INFO [train.py:996] (1/4) Epoch 8, batch 16350, loss[loss=0.2514, simple_loss=0.3259, pruned_loss=0.08848, over 21480.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2908, pruned_loss=0.06577, over 4281276.47 frames. ], batch size: 131, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:05:04,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-25 17:05:19,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1378932.0, ans=0.05 2023-06-25 17:05:27,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-25 17:06:06,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1379052.0, ans=0.0 2023-06-25 17:06:21,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.83 vs. limit=15.0 2023-06-25 17:06:32,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1379112.0, ans=0.04949747468305833 2023-06-25 17:06:39,428 INFO [train.py:996] (1/4) Epoch 8, batch 16400, loss[loss=0.209, simple_loss=0.2984, pruned_loss=0.05981, over 21028.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2966, pruned_loss=0.06829, over 4276426.08 frames. 
], batch size: 608, lr: 3.72e-03, grad_scale: 32.0 2023-06-25 17:06:57,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1379232.0, ans=0.0 2023-06-25 17:07:09,116 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.348e+02 5.266e+02 7.750e+02 2.110e+03, threshold=1.053e+03, percent-clipped=17.0 2023-06-25 17:07:30,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1379292.0, ans=0.125 2023-06-25 17:07:47,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1379352.0, ans=0.2 2023-06-25 17:07:58,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1379352.0, ans=0.0 2023-06-25 17:08:17,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1379412.0, ans=0.0 2023-06-25 17:08:22,752 INFO [train.py:996] (1/4) Epoch 8, batch 16450, loss[loss=0.2345, simple_loss=0.2905, pruned_loss=0.08924, over 20000.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2977, pruned_loss=0.07055, over 4278702.28 frames. ], batch size: 702, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:08:39,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1379532.0, ans=0.125 2023-06-25 17:09:38,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1379652.0, ans=0.0 2023-06-25 17:10:04,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1379712.0, ans=0.125 2023-06-25 17:10:09,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379712.0, ans=0.1 2023-06-25 17:10:12,902 INFO [train.py:996] (1/4) Epoch 8, batch 16500, loss[loss=0.1975, simple_loss=0.2724, pruned_loss=0.06133, over 21826.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2968, pruned_loss=0.07081, over 4282880.78 frames. ], batch size: 282, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:10:43,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.406e+02 5.923e+02 9.341e+02 2.012e+03, threshold=1.185e+03, percent-clipped=18.0 2023-06-25 17:11:17,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1379892.0, ans=0.125 2023-06-25 17:11:28,166 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:11:35,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1379952.0, ans=0.125 2023-06-25 17:12:03,237 INFO [train.py:996] (1/4) Epoch 8, batch 16550, loss[loss=0.3094, simple_loss=0.3739, pruned_loss=0.1224, over 21375.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2947, pruned_loss=0.06874, over 4285845.35 frames. ], batch size: 507, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:12:15,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. 
limit=15.0 2023-06-25 17:13:30,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.75 vs. limit=6.0 2023-06-25 17:13:35,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1380312.0, ans=0.0 2023-06-25 17:13:39,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1380312.0, ans=0.125 2023-06-25 17:13:52,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1380312.0, ans=0.0 2023-06-25 17:14:05,600 INFO [train.py:996] (1/4) Epoch 8, batch 16600, loss[loss=0.2872, simple_loss=0.3896, pruned_loss=0.09243, over 21650.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3027, pruned_loss=0.07208, over 4281115.67 frames. ], batch size: 389, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:14:17,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0 2023-06-25 17:14:41,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.056e+02 4.921e+02 6.632e+02 9.394e+02 2.372e+03, threshold=1.326e+03, percent-clipped=11.0 2023-06-25 17:14:41,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1380432.0, ans=0.0 2023-06-25 17:14:47,086 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:15:26,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1380552.0, ans=0.2 2023-06-25 17:15:41,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-25 17:15:55,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1380612.0, ans=0.0 2023-06-25 17:16:01,858 INFO [train.py:996] (1/4) Epoch 8, batch 16650, loss[loss=0.2478, simple_loss=0.33, pruned_loss=0.08276, over 21949.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3106, pruned_loss=0.0744, over 4283615.61 frames. ], batch size: 372, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:16:25,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1380732.0, ans=0.125 2023-06-25 17:16:36,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1380732.0, ans=0.1 2023-06-25 17:16:38,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1380732.0, ans=0.0 2023-06-25 17:17:28,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-25 17:18:00,231 INFO [train.py:996] (1/4) Epoch 8, batch 16700, loss[loss=0.2191, simple_loss=0.2959, pruned_loss=0.07116, over 21166.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3111, pruned_loss=0.07552, over 4279302.82 frames. 
], batch size: 608, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:18:26,044 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.158e+02 5.069e+02 7.220e+02 1.088e+03 2.234e+03, threshold=1.444e+03, percent-clipped=12.0 2023-06-25 17:18:44,290 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-25 17:18:45,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1381092.0, ans=0.025 2023-06-25 17:19:10,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1381152.0, ans=0.1 2023-06-25 17:19:10,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1381152.0, ans=0.2 2023-06-25 17:19:10,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0 2023-06-25 17:19:11,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1381152.0, ans=0.04949747468305833 2023-06-25 17:19:54,804 INFO [train.py:996] (1/4) Epoch 8, batch 16750, loss[loss=0.2389, simple_loss=0.3209, pruned_loss=0.07839, over 21583.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.314, pruned_loss=0.07729, over 4271464.29 frames. ], batch size: 263, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:20:15,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1381272.0, ans=0.125 2023-06-25 17:20:37,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1381332.0, ans=0.125 2023-06-25 17:20:55,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-25 17:21:01,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-25 17:21:09,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1381452.0, ans=0.0 2023-06-25 17:21:47,563 INFO [train.py:996] (1/4) Epoch 8, batch 16800, loss[loss=0.238, simple_loss=0.3347, pruned_loss=0.07061, over 21307.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3159, pruned_loss=0.0766, over 4269852.76 frames. 
], batch size: 548, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:21:53,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1381572.0, ans=0.125 2023-06-25 17:22:14,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1381632.0, ans=10.0 2023-06-25 17:22:18,697 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.357e+02 4.342e+02 5.532e+02 7.799e+02 1.934e+03, threshold=1.106e+03, percent-clipped=5.0 2023-06-25 17:22:40,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1381692.0, ans=0.2 2023-06-25 17:23:24,350 INFO [train.py:996] (1/4) Epoch 8, batch 16850, loss[loss=0.2239, simple_loss=0.2899, pruned_loss=0.07892, over 21906.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3119, pruned_loss=0.07724, over 4281521.62 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:23:26,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1381872.0, ans=0.1 2023-06-25 17:23:26,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1381872.0, ans=0.125 2023-06-25 17:23:28,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1381872.0, ans=15.0 2023-06-25 17:24:00,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-06-25 17:24:20,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1381992.0, ans=0.125 2023-06-25 17:24:29,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1381992.0, ans=0.125 2023-06-25 17:24:30,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1381992.0, ans=0.125 2023-06-25 17:24:36,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1382052.0, ans=0.1 2023-06-25 17:24:43,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382052.0, ans=0.1 2023-06-25 17:24:46,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1382052.0, ans=0.0 2023-06-25 17:25:08,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1382112.0, ans=0.125 2023-06-25 17:25:11,471 INFO [train.py:996] (1/4) Epoch 8, batch 16900, loss[loss=0.2409, simple_loss=0.3218, pruned_loss=0.08001, over 20737.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3072, pruned_loss=0.07515, over 4282790.64 frames. 
], batch size: 607, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:25:27,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1382172.0, ans=0.07 2023-06-25 17:25:36,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1382172.0, ans=0.125 2023-06-25 17:25:58,355 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 4.085e+02 5.568e+02 7.476e+02 1.428e+03, threshold=1.114e+03, percent-clipped=3.0 2023-06-25 17:26:26,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1382352.0, ans=0.125 2023-06-25 17:26:35,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1382352.0, ans=0.5 2023-06-25 17:26:59,129 INFO [train.py:996] (1/4) Epoch 8, batch 16950, loss[loss=0.2088, simple_loss=0.2843, pruned_loss=0.06666, over 21854.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3028, pruned_loss=0.07412, over 4285124.17 frames. ], batch size: 98, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:27:06,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1382472.0, ans=0.2 2023-06-25 17:27:49,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1382532.0, ans=0.1 2023-06-25 17:28:04,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-25 17:28:05,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1382592.0, ans=0.1 2023-06-25 17:28:05,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1382592.0, ans=0.125 2023-06-25 17:28:53,782 INFO [train.py:996] (1/4) Epoch 8, batch 17000, loss[loss=0.2121, simple_loss=0.2861, pruned_loss=0.06903, over 21688.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3003, pruned_loss=0.07455, over 4289733.92 frames. ], batch size: 263, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:28:54,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1382772.0, ans=0.125 2023-06-25 17:28:54,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1382772.0, ans=0.0 2023-06-25 17:29:35,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.768e+02 4.400e+02 6.237e+02 1.054e+03 1.925e+03, threshold=1.247e+03, percent-clipped=22.0 2023-06-25 17:30:35,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383012.0, ans=0.1 2023-06-25 17:30:47,371 INFO [train.py:996] (1/4) Epoch 8, batch 17050, loss[loss=0.2468, simple_loss=0.3323, pruned_loss=0.08063, over 21761.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3053, pruned_loss=0.07558, over 4292475.12 frames. 
], batch size: 298, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:31:05,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1383072.0, ans=0.125 2023-06-25 17:31:27,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1383192.0, ans=0.0 2023-06-25 17:32:29,757 INFO [train.py:996] (1/4) Epoch 8, batch 17100, loss[loss=0.2213, simple_loss=0.2895, pruned_loss=0.07654, over 19964.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3041, pruned_loss=0.07637, over 4290537.65 frames. ], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:33:06,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 4.538e+02 6.730e+02 8.383e+02 1.322e+03, threshold=1.346e+03, percent-clipped=2.0 2023-06-25 17:33:14,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1383492.0, ans=0.07 2023-06-25 17:33:58,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1383612.0, ans=0.125 2023-06-25 17:34:10,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1383612.0, ans=0.0 2023-06-25 17:34:23,390 INFO [train.py:996] (1/4) Epoch 8, batch 17150, loss[loss=0.1912, simple_loss=0.265, pruned_loss=0.05868, over 21444.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2999, pruned_loss=0.07538, over 4292466.15 frames. ], batch size: 211, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:34:47,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1383732.0, ans=0.2 2023-06-25 17:35:20,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1383852.0, ans=0.125 2023-06-25 17:35:42,603 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:36:13,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1383912.0, ans=0.125 2023-06-25 17:36:18,567 INFO [train.py:996] (1/4) Epoch 8, batch 17200, loss[loss=0.2484, simple_loss=0.3283, pruned_loss=0.0842, over 21815.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2998, pruned_loss=0.07556, over 4297857.90 frames. ], batch size: 124, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:36:44,524 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.225e+02 5.384e+02 7.580e+02 1.533e+03, threshold=1.077e+03, percent-clipped=1.0 2023-06-25 17:36:48,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1384032.0, ans=0.0 2023-06-25 17:37:15,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1384152.0, ans=0.05 2023-06-25 17:37:34,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1384152.0, ans=0.07 2023-06-25 17:37:47,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1384212.0, ans=0.1 2023-06-25 17:38:07,324 INFO [train.py:996] (1/4) Epoch 8, batch 17250, loss[loss=0.2279, simple_loss=0.3058, pruned_loss=0.07504, over 21669.00 frames. 
], tot_loss[loss=0.2277, simple_loss=0.3022, pruned_loss=0.07658, over 4294834.30 frames. ], batch size: 351, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:38:12,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=22.5 2023-06-25 17:38:15,362 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=22.5 2023-06-25 17:38:19,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1384272.0, ans=0.125 2023-06-25 17:38:22,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1384272.0, ans=0.1 2023-06-25 17:38:34,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1384332.0, ans=0.0 2023-06-25 17:38:45,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1384392.0, ans=0.125 2023-06-25 17:39:50,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1384512.0, ans=0.1 2023-06-25 17:39:57,089 INFO [train.py:996] (1/4) Epoch 8, batch 17300, loss[loss=0.2634, simple_loss=0.341, pruned_loss=0.09289, over 21572.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3099, pruned_loss=0.07939, over 4291278.09 frames. ], batch size: 414, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:40:25,190 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.394e+02 4.465e+02 6.350e+02 1.043e+03 2.141e+03, threshold=1.270e+03, percent-clipped=19.0 2023-06-25 17:40:35,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1384632.0, ans=0.0 2023-06-25 17:41:01,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1384692.0, ans=0.125 2023-06-25 17:41:21,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1384752.0, ans=0.0 2023-06-25 17:41:45,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1384812.0, ans=0.2 2023-06-25 17:41:47,989 INFO [train.py:996] (1/4) Epoch 8, batch 17350, loss[loss=0.2222, simple_loss=0.3156, pruned_loss=0.06443, over 21701.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3107, pruned_loss=0.07951, over 4289926.18 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:42:26,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1384932.0, ans=0.05 2023-06-25 17:42:29,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1384932.0, ans=0.2 2023-06-25 17:43:10,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1385052.0, ans=0.1 2023-06-25 17:43:38,102 INFO [train.py:996] (1/4) Epoch 8, batch 17400, loss[loss=0.1782, simple_loss=0.243, pruned_loss=0.05664, over 21812.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3072, pruned_loss=0.07575, over 4283135.49 frames. 
], batch size: 118, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:44:08,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1385232.0, ans=0.1 2023-06-25 17:44:22,062 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.735e+02 4.973e+02 6.706e+02 2.674e+03, threshold=9.946e+02, percent-clipped=3.0 2023-06-25 17:44:35,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1385292.0, ans=0.2 2023-06-25 17:44:42,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1385292.0, ans=0.2 2023-06-25 17:45:32,727 INFO [train.py:996] (1/4) Epoch 8, batch 17450, loss[loss=0.1985, simple_loss=0.2947, pruned_loss=0.0511, over 21178.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.304, pruned_loss=0.0737, over 4268804.83 frames. ], batch size: 548, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:46:15,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1385532.0, ans=0.125 2023-06-25 17:46:54,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1385652.0, ans=0.125 2023-06-25 17:47:20,324 INFO [train.py:996] (1/4) Epoch 8, batch 17500, loss[loss=0.2168, simple_loss=0.2848, pruned_loss=0.07438, over 21818.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3022, pruned_loss=0.07237, over 4270283.58 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:47:57,720 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.917e+02 3.737e+02 5.034e+02 7.979e+02 1.418e+03, threshold=1.007e+03, percent-clipped=12.0 2023-06-25 17:48:03,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1385892.0, ans=0.035 2023-06-25 17:48:03,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1385892.0, ans=0.125 2023-06-25 17:48:13,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.98 vs. limit=6.0 2023-06-25 17:49:05,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1386072.0, ans=0.125 2023-06-25 17:49:05,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1386072.0, ans=0.0 2023-06-25 17:49:07,118 INFO [train.py:996] (1/4) Epoch 8, batch 17550, loss[loss=0.2204, simple_loss=0.3094, pruned_loss=0.06572, over 21457.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3022, pruned_loss=0.07087, over 4277228.99 frames. ], batch size: 194, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:49:07,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1386072.0, ans=0.0 2023-06-25 17:49:56,127 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:50:03,643 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. 
limit=15.0 2023-06-25 17:50:53,466 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-25 17:50:54,176 INFO [train.py:996] (1/4) Epoch 8, batch 17600, loss[loss=0.2448, simple_loss=0.3212, pruned_loss=0.08419, over 21568.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3044, pruned_loss=0.0716, over 4278700.75 frames. ], batch size: 389, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:51:03,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1386372.0, ans=0.0 2023-06-25 17:51:33,992 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.904e+02 3.918e+02 5.459e+02 7.837e+02 1.902e+03, threshold=1.092e+03, percent-clipped=12.0 2023-06-25 17:51:46,813 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:52:06,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1386552.0, ans=0.0 2023-06-25 17:52:36,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1386612.0, ans=0.2 2023-06-25 17:52:36,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1386612.0, ans=0.2 2023-06-25 17:52:49,583 INFO [train.py:996] (1/4) Epoch 8, batch 17650, loss[loss=0.1795, simple_loss=0.2501, pruned_loss=0.05444, over 21787.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.303, pruned_loss=0.07205, over 4269452.65 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:53:41,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1386792.0, ans=0.0 2023-06-25 17:54:05,905 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:54:07,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1386852.0, ans=0.125 2023-06-25 17:54:39,348 INFO [train.py:996] (1/4) Epoch 8, batch 17700, loss[loss=0.1967, simple_loss=0.3067, pruned_loss=0.04329, over 19916.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2978, pruned_loss=0.06973, over 4272438.79 frames. ], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:55:14,590 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.821e+02 4.539e+02 6.208e+02 9.459e+02 1.772e+03, threshold=1.242e+03, percent-clipped=17.0 2023-06-25 17:55:19,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. 
limit=15.0 2023-06-25 17:55:30,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1387092.0, ans=0.0 2023-06-25 17:55:49,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1387152.0, ans=0.07 2023-06-25 17:56:14,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1387212.0, ans=0.125 2023-06-25 17:56:29,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1387212.0, ans=0.5 2023-06-25 17:56:34,262 INFO [train.py:996] (1/4) Epoch 8, batch 17750, loss[loss=0.2254, simple_loss=0.3179, pruned_loss=0.06645, over 21231.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3035, pruned_loss=0.07194, over 4272797.14 frames. ], batch size: 143, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:56:55,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-25 17:57:38,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.64 vs. limit=22.5 2023-06-25 17:57:52,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1387452.0, ans=0.1 2023-06-25 17:57:56,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-25 17:58:05,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1387452.0, ans=0.0 2023-06-25 17:58:24,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1387572.0, ans=0.125 2023-06-25 17:58:25,860 INFO [train.py:996] (1/4) Epoch 8, batch 17800, loss[loss=0.2013, simple_loss=0.2797, pruned_loss=0.06143, over 21438.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3032, pruned_loss=0.0712, over 4275542.98 frames. ], batch size: 194, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:58:55,811 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 4.160e+02 4.945e+02 7.686e+02 1.227e+03, threshold=9.890e+02, percent-clipped=0.0 2023-06-25 17:59:11,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1387692.0, ans=0.125 2023-06-25 17:59:19,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1387692.0, ans=0.1 2023-06-25 17:59:39,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1387752.0, ans=0.5 2023-06-25 17:59:56,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1387812.0, ans=0.125 2023-06-25 18:00:10,386 INFO [train.py:996] (1/4) Epoch 8, batch 17850, loss[loss=0.2218, simple_loss=0.2992, pruned_loss=0.07214, over 21718.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3039, pruned_loss=0.0719, over 4276813.15 frames. 
], batch size: 247, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:00:34,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1387932.0, ans=0.2 2023-06-25 18:00:51,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-25 18:01:04,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.36 vs. limit=15.0 2023-06-25 18:01:54,652 INFO [train.py:996] (1/4) Epoch 8, batch 17900, loss[loss=0.2215, simple_loss=0.3087, pruned_loss=0.06712, over 21401.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3079, pruned_loss=0.0728, over 4271830.43 frames. ], batch size: 194, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:02:03,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1388172.0, ans=0.0 2023-06-25 18:02:40,618 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.096e+02 4.760e+02 6.226e+02 9.356e+02 2.163e+03, threshold=1.245e+03, percent-clipped=21.0 2023-06-25 18:02:45,537 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=22.5 2023-06-25 18:03:05,098 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-25 18:03:44,498 INFO [train.py:996] (1/4) Epoch 8, batch 17950, loss[loss=0.2073, simple_loss=0.3223, pruned_loss=0.0461, over 19714.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3075, pruned_loss=0.07002, over 4269537.89 frames. ], batch size: 703, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:03:50,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1388472.0, ans=0.125 2023-06-25 18:04:36,572 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-06-25 18:05:03,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1388652.0, ans=0.015 2023-06-25 18:05:27,065 INFO [train.py:996] (1/4) Epoch 8, batch 18000, loss[loss=0.1901, simple_loss=0.2509, pruned_loss=0.06463, over 21493.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3004, pruned_loss=0.06854, over 4261498.97 frames. ], batch size: 195, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 18:05:27,066 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 18:05:48,115 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2638, simple_loss=0.3571, pruned_loss=0.08527, over 1796401.00 frames. 2023-06-25 18:05:48,116 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 18:06:07,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1388772.0, ans=0.07 2023-06-25 18:06:23,292 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 3.503e+02 4.294e+02 6.004e+02 1.457e+03, threshold=8.588e+02, percent-clipped=3.0 2023-06-25 18:06:29,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. 
limit=15.0 2023-06-25 18:06:39,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1388892.0, ans=0.125 2023-06-25 18:06:43,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1388892.0, ans=0.0 2023-06-25 18:06:51,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1388952.0, ans=0.1 2023-06-25 18:07:00,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388952.0, ans=0.1 2023-06-25 18:07:26,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1389012.0, ans=0.125 2023-06-25 18:07:37,516 INFO [train.py:996] (1/4) Epoch 8, batch 18050, loss[loss=0.2049, simple_loss=0.2741, pruned_loss=0.06788, over 21437.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2946, pruned_loss=0.0674, over 4270243.33 frames. ], batch size: 389, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 18:08:11,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1389132.0, ans=0.125 2023-06-25 18:08:32,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1389192.0, ans=0.0 2023-06-25 18:08:58,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1389312.0, ans=0.1 2023-06-25 18:09:16,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.40 vs. limit=22.5 2023-06-25 18:09:33,678 INFO [train.py:996] (1/4) Epoch 8, batch 18100, loss[loss=0.2091, simple_loss=0.2895, pruned_loss=0.06432, over 21912.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2995, pruned_loss=0.06994, over 4272964.99 frames. ], batch size: 98, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:09:34,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1389372.0, ans=0.125 2023-06-25 18:09:45,599 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.70 vs. limit=15.0 2023-06-25 18:09:46,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1389372.0, ans=0.125 2023-06-25 18:09:59,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-25 18:10:00,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1389432.0, ans=0.125 2023-06-25 18:10:05,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.803e+02 3.773e+02 4.901e+02 6.840e+02 2.108e+03, threshold=9.801e+02, percent-clipped=15.0 2023-06-25 18:10:39,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1389552.0, ans=0.125 2023-06-25 18:11:22,662 INFO [train.py:996] (1/4) Epoch 8, batch 18150, loss[loss=0.2011, simple_loss=0.2916, pruned_loss=0.05524, over 21618.00 frames. 
], tot_loss[loss=0.2213, simple_loss=0.3014, pruned_loss=0.07055, over 4271936.46 frames. ], batch size: 263, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:11:36,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1389672.0, ans=0.125 2023-06-25 18:11:37,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-25 18:12:04,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1389792.0, ans=0.0 2023-06-25 18:12:12,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1389852.0, ans=0.0 2023-06-25 18:13:07,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-25 18:13:10,101 INFO [train.py:996] (1/4) Epoch 8, batch 18200, loss[loss=0.198, simple_loss=0.2681, pruned_loss=0.06399, over 21739.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2962, pruned_loss=0.07012, over 4260192.62 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:13:37,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1390032.0, ans=0.1 2023-06-25 18:13:40,433 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.735e+02 4.047e+02 5.658e+02 8.715e+02 2.136e+03, threshold=1.132e+03, percent-clipped=16.0 2023-06-25 18:13:57,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1390092.0, ans=0.0 2023-06-25 18:14:49,914 INFO [train.py:996] (1/4) Epoch 8, batch 18250, loss[loss=0.2023, simple_loss=0.2767, pruned_loss=0.06399, over 21665.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2896, pruned_loss=0.06758, over 4264063.60 frames. ], batch size: 230, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:14:53,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1390272.0, ans=0.125 2023-06-25 18:15:11,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1390332.0, ans=0.2 2023-06-25 18:16:03,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1390512.0, ans=0.1 2023-06-25 18:16:26,075 INFO [train.py:996] (1/4) Epoch 8, batch 18300, loss[loss=0.2306, simple_loss=0.3358, pruned_loss=0.06276, over 21784.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2877, pruned_loss=0.06751, over 4270529.27 frames. 
], batch size: 282, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:16:35,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1390572.0, ans=0.125 2023-06-25 18:17:12,347 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 4.046e+02 5.831e+02 1.006e+03 2.196e+03, threshold=1.166e+03, percent-clipped=19.0 2023-06-25 18:17:16,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1390692.0, ans=0.125 2023-06-25 18:17:40,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1390752.0, ans=0.125 2023-06-25 18:17:47,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-25 18:17:52,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1390812.0, ans=0.0 2023-06-25 18:18:12,735 INFO [train.py:996] (1/4) Epoch 8, batch 18350, loss[loss=0.1924, simple_loss=0.268, pruned_loss=0.05843, over 21505.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2918, pruned_loss=0.06747, over 4273717.77 frames. ], batch size: 230, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:19:14,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1391052.0, ans=0.125 2023-06-25 18:19:56,469 INFO [train.py:996] (1/4) Epoch 8, batch 18400, loss[loss=0.1773, simple_loss=0.2584, pruned_loss=0.04812, over 21126.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2885, pruned_loss=0.06629, over 4260517.05 frames. ], batch size: 159, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:20:38,567 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.733e+02 5.113e+02 7.460e+02 1.718e+03, threshold=1.023e+03, percent-clipped=6.0 2023-06-25 18:20:43,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1391292.0, ans=15.0 2023-06-25 18:21:31,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1391412.0, ans=0.1 2023-06-25 18:21:35,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1391412.0, ans=0.125 2023-06-25 18:21:45,617 INFO [train.py:996] (1/4) Epoch 8, batch 18450, loss[loss=0.2024, simple_loss=0.2838, pruned_loss=0.0605, over 21662.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2858, pruned_loss=0.06322, over 4264135.64 frames. ], batch size: 247, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:22:38,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1391592.0, ans=0.035 2023-06-25 18:23:02,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391712.0, ans=0.125 2023-06-25 18:23:15,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-25 18:23:26,124 INFO [train.py:996] (1/4) Epoch 8, batch 18500, loss[loss=0.1597, simple_loss=0.242, pruned_loss=0.03866, over 21213.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2799, pruned_loss=0.06154, over 4255040.67 frames. 
], batch size: 176, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:24:02,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1391832.0, ans=0.0 2023-06-25 18:24:07,380 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 3.343e+02 4.214e+02 5.911e+02 1.246e+03, threshold=8.429e+02, percent-clipped=4.0 2023-06-25 18:24:23,249 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:24:25,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1391892.0, ans=0.1 2023-06-25 18:24:34,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=8.0 2023-06-25 18:24:45,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1391952.0, ans=0.1 2023-06-25 18:24:49,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1391952.0, ans=0.0 2023-06-25 18:25:03,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1392012.0, ans=0.07 2023-06-25 18:25:09,783 INFO [train.py:996] (1/4) Epoch 8, batch 18550, loss[loss=0.1936, simple_loss=0.2703, pruned_loss=0.05841, over 21767.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2785, pruned_loss=0.06103, over 4248948.24 frames. ], batch size: 351, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:25:55,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1392192.0, ans=0.0 2023-06-25 18:26:41,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-25 18:27:04,679 INFO [train.py:996] (1/4) Epoch 8, batch 18600, loss[loss=0.1692, simple_loss=0.2541, pruned_loss=0.04219, over 21583.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2765, pruned_loss=0.06131, over 4252800.58 frames. ], batch size: 230, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:27:36,463 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 3.804e+02 5.092e+02 7.468e+02 1.783e+03, threshold=1.018e+03, percent-clipped=18.0 2023-06-25 18:27:42,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.09 vs. limit=10.0 2023-06-25 18:28:33,917 INFO [train.py:996] (1/4) Epoch 8, batch 18650, loss[loss=0.2064, simple_loss=0.2686, pruned_loss=0.07213, over 21810.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2751, pruned_loss=0.06141, over 4256461.74 frames. 
], batch size: 102, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:28:59,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1392672.0, ans=0.0 2023-06-25 18:29:24,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1392792.0, ans=0.07 2023-06-25 18:29:32,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1392792.0, ans=0.125 2023-06-25 18:30:16,029 INFO [train.py:996] (1/4) Epoch 8, batch 18700, loss[loss=0.1765, simple_loss=0.2346, pruned_loss=0.0592, over 20711.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2731, pruned_loss=0.06318, over 4247218.03 frames. ], batch size: 608, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:31:04,141 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.953e+02 3.708e+02 4.986e+02 6.996e+02 1.849e+03, threshold=9.973e+02, percent-clipped=6.0 2023-06-25 18:31:04,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1393092.0, ans=0.125 2023-06-25 18:31:11,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1393092.0, ans=0.125 2023-06-25 18:31:20,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1393092.0, ans=0.1 2023-06-25 18:31:26,403 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=22.5 2023-06-25 18:31:43,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1393152.0, ans=0.0 2023-06-25 18:31:58,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1393212.0, ans=0.0 2023-06-25 18:32:03,249 INFO [train.py:996] (1/4) Epoch 8, batch 18750, loss[loss=0.2447, simple_loss=0.323, pruned_loss=0.08318, over 21608.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2761, pruned_loss=0.06572, over 4246991.98 frames. ], batch size: 263, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:32:32,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1393332.0, ans=0.1 2023-06-25 18:32:42,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1393332.0, ans=0.125 2023-06-25 18:32:47,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1393392.0, ans=0.125 2023-06-25 18:32:50,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-06-25 18:33:03,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1393392.0, ans=0.0 2023-06-25 18:33:04,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1393392.0, ans=0.125 2023-06-25 18:33:48,490 INFO [train.py:996] (1/4) Epoch 8, batch 18800, loss[loss=0.2031, simple_loss=0.3, pruned_loss=0.05305, over 21648.00 frames. 
], tot_loss[loss=0.2097, simple_loss=0.2846, pruned_loss=0.0674, over 4246978.46 frames. ], batch size: 414, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:33:49,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-25 18:34:02,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1393572.0, ans=0.125 2023-06-25 18:34:31,895 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.846e+02 4.247e+02 5.340e+02 7.897e+02 1.499e+03, threshold=1.068e+03, percent-clipped=10.0 2023-06-25 18:35:02,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1393752.0, ans=0.125 2023-06-25 18:35:19,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1393812.0, ans=0.125 2023-06-25 18:35:22,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1393812.0, ans=0.125 2023-06-25 18:35:31,327 INFO [train.py:996] (1/4) Epoch 8, batch 18850, loss[loss=0.2238, simple_loss=0.2874, pruned_loss=0.08008, over 21872.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2811, pruned_loss=0.06396, over 4253785.51 frames. ], batch size: 98, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:36:37,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1394052.0, ans=0.0 2023-06-25 18:37:18,483 INFO [train.py:996] (1/4) Epoch 8, batch 18900, loss[loss=0.1815, simple_loss=0.2521, pruned_loss=0.05541, over 21687.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2771, pruned_loss=0.064, over 4259974.18 frames. ], batch size: 282, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:38:08,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1394292.0, ans=0.0 2023-06-25 18:38:09,195 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.589e+02 4.833e+02 6.205e+02 1.384e+03, threshold=9.667e+02, percent-clipped=4.0 2023-06-25 18:38:10,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1394292.0, ans=0.2 2023-06-25 18:38:24,775 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.64 vs. limit=15.0 2023-06-25 18:38:36,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1394352.0, ans=0.125 2023-06-25 18:39:07,633 INFO [train.py:996] (1/4) Epoch 8, batch 18950, loss[loss=0.242, simple_loss=0.295, pruned_loss=0.09446, over 20039.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2785, pruned_loss=0.06614, over 4265482.80 frames. ], batch size: 702, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:39:47,753 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-25 18:39:55,011 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.15 vs. 
limit=22.5 2023-06-25 18:41:08,062 INFO [train.py:996] (1/4) Epoch 8, batch 19000, loss[loss=0.1702, simple_loss=0.2283, pruned_loss=0.05605, over 20790.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2866, pruned_loss=0.06718, over 4271559.86 frames. ], batch size: 609, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:41:17,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1394772.0, ans=0.125 2023-06-25 18:41:33,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-25 18:41:47,837 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.853e+02 4.722e+02 6.033e+02 9.741e+02 2.203e+03, threshold=1.207e+03, percent-clipped=24.0 2023-06-25 18:42:02,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1394892.0, ans=0.0 2023-06-25 18:42:04,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1394952.0, ans=0.5 2023-06-25 18:42:30,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1395012.0, ans=0.125 2023-06-25 18:42:55,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1395072.0, ans=0.0 2023-06-25 18:42:56,841 INFO [train.py:996] (1/4) Epoch 8, batch 19050, loss[loss=0.2787, simple_loss=0.3425, pruned_loss=0.1075, over 21400.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.293, pruned_loss=0.07117, over 4272680.63 frames. ], batch size: 471, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:43:04,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1395072.0, ans=0.0 2023-06-25 18:43:15,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0 2023-06-25 18:43:17,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1395132.0, ans=0.125 2023-06-25 18:44:10,597 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.38 vs. limit=15.0 2023-06-25 18:44:30,024 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:44:44,096 INFO [train.py:996] (1/4) Epoch 8, batch 19100, loss[loss=0.1941, simple_loss=0.2576, pruned_loss=0.06536, over 21168.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2923, pruned_loss=0.07251, over 4269732.82 frames. ], batch size: 176, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:44:52,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-25 18:44:58,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1395372.0, ans=0.125 2023-06-25 18:45:11,917 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. 
limit=10.0 2023-06-25 18:45:19,982 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.078e+02 4.021e+02 4.752e+02 6.454e+02 2.086e+03, threshold=9.504e+02, percent-clipped=4.0 2023-06-25 18:45:38,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1395492.0, ans=0.0 2023-06-25 18:45:43,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1395552.0, ans=0.1 2023-06-25 18:45:57,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-25 18:46:16,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1395612.0, ans=0.0 2023-06-25 18:46:30,729 INFO [train.py:996] (1/4) Epoch 8, batch 19150, loss[loss=0.2272, simple_loss=0.3182, pruned_loss=0.06813, over 21392.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2927, pruned_loss=0.07279, over 4269543.54 frames. ], batch size: 211, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:46:35,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1395672.0, ans=0.0 2023-06-25 18:47:13,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1395792.0, ans=0.0 2023-06-25 18:47:15,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1395792.0, ans=0.125 2023-06-25 18:48:21,120 INFO [train.py:996] (1/4) Epoch 8, batch 19200, loss[loss=0.2065, simple_loss=0.3157, pruned_loss=0.04863, over 21421.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3033, pruned_loss=0.07346, over 4270949.85 frames. ], batch size: 194, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:49:00,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 4.242e+02 5.606e+02 9.141e+02 1.658e+03, threshold=1.121e+03, percent-clipped=22.0 2023-06-25 18:49:01,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1396092.0, ans=0.0 2023-06-25 18:49:05,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1396092.0, ans=22.5 2023-06-25 18:50:01,385 INFO [train.py:996] (1/4) Epoch 8, batch 19250, loss[loss=0.1796, simple_loss=0.2748, pruned_loss=0.04215, over 21330.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3026, pruned_loss=0.06843, over 4275647.32 frames. 
], batch size: 176, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:50:03,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1396272.0, ans=0.125 2023-06-25 18:50:05,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1396272.0, ans=0.2 2023-06-25 18:51:34,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1396512.0, ans=0.125 2023-06-25 18:51:38,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1396512.0, ans=0.125 2023-06-25 18:51:44,986 INFO [train.py:996] (1/4) Epoch 8, batch 19300, loss[loss=0.2319, simple_loss=0.2887, pruned_loss=0.08756, over 21535.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3031, pruned_loss=0.06895, over 4278294.03 frames. ], batch size: 548, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:51:47,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=22.5 2023-06-25 18:51:52,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1396572.0, ans=0.0 2023-06-25 18:52:28,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.708e+02 5.763e+02 8.363e+02 1.771e+03, threshold=1.153e+03, percent-clipped=11.0 2023-06-25 18:52:41,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-25 18:53:29,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1396812.0, ans=0.125 2023-06-25 18:53:37,187 INFO [train.py:996] (1/4) Epoch 8, batch 19350, loss[loss=0.2584, simple_loss=0.3371, pruned_loss=0.08989, over 21541.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2969, pruned_loss=0.06503, over 4279662.12 frames. ], batch size: 509, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:54:58,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1397052.0, ans=0.2 2023-06-25 18:55:05,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1397052.0, ans=0.1 2023-06-25 18:55:08,958 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=22.5 2023-06-25 18:55:25,375 INFO [train.py:996] (1/4) Epoch 8, batch 19400, loss[loss=0.2042, simple_loss=0.2745, pruned_loss=0.06696, over 21287.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2933, pruned_loss=0.06375, over 4275435.61 frames. ], batch size: 159, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:55:36,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=12.0 2023-06-25 18:56:07,068 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.659e+02 3.804e+02 4.878e+02 6.968e+02 1.951e+03, threshold=9.756e+02, percent-clipped=7.0 2023-06-25 18:56:13,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1397292.0, ans=0.125 2023-06-25 18:56:25,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1397292.0, ans=0.125 2023-06-25 18:57:00,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1397412.0, ans=0.125 2023-06-25 18:57:05,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1397412.0, ans=0.0 2023-06-25 18:57:13,642 INFO [train.py:996] (1/4) Epoch 8, batch 19450, loss[loss=0.2207, simple_loss=0.2759, pruned_loss=0.08277, over 21582.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2907, pruned_loss=0.06603, over 4281325.82 frames. ], batch size: 441, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:57:17,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1397472.0, ans=0.125 2023-06-25 18:57:23,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1397472.0, ans=0.0 2023-06-25 18:57:44,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1397532.0, ans=0.0 2023-06-25 18:58:01,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1397592.0, ans=0.035 2023-06-25 18:58:01,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1397592.0, ans=0.0 2023-06-25 18:58:24,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1397592.0, ans=0.125 2023-06-25 18:58:30,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1397652.0, ans=0.125 2023-06-25 18:58:36,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1397652.0, ans=0.0 2023-06-25 18:58:45,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1397712.0, ans=0.1 2023-06-25 18:59:01,910 INFO [train.py:996] (1/4) Epoch 8, batch 19500, loss[loss=0.1744, simple_loss=0.2316, pruned_loss=0.05857, over 21139.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.287, pruned_loss=0.06664, over 4282590.26 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 18:59:17,994 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:59:38,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1397832.0, ans=0.02 2023-06-25 18:59:47,840 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.224e+02 4.180e+02 5.667e+02 7.986e+02 1.317e+03, threshold=1.133e+03, percent-clipped=13.0 2023-06-25 19:00:42,903 INFO [train.py:996] (1/4) Epoch 8, batch 19550, loss[loss=0.1778, simple_loss=0.2763, pruned_loss=0.03967, over 21744.00 frames. 
], tot_loss[loss=0.2058, simple_loss=0.2817, pruned_loss=0.06493, over 4267588.69 frames. ], batch size: 298, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:00:45,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1398072.0, ans=0.0 2023-06-25 19:01:00,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1398072.0, ans=0.0 2023-06-25 19:01:40,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1398192.0, ans=0.0 2023-06-25 19:02:13,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1398312.0, ans=0.125 2023-06-25 19:02:14,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1398312.0, ans=0.125 2023-06-25 19:02:23,099 INFO [train.py:996] (1/4) Epoch 8, batch 19600, loss[loss=0.208, simple_loss=0.2871, pruned_loss=0.06442, over 21782.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2843, pruned_loss=0.06633, over 4271942.63 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:02:23,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1398372.0, ans=0.125 2023-06-25 19:02:30,999 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-25 19:02:43,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.47 vs. limit=15.0 2023-06-25 19:02:46,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1398432.0, ans=22.5 2023-06-25 19:02:58,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1398432.0, ans=0.125 2023-06-25 19:02:59,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1398432.0, ans=0.0 2023-06-25 19:03:10,844 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.701e+02 4.318e+02 6.046e+02 9.838e+02 1.787e+03, threshold=1.209e+03, percent-clipped=19.0 2023-06-25 19:03:41,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-25 19:04:09,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1398672.0, ans=0.0 2023-06-25 19:04:10,733 INFO [train.py:996] (1/4) Epoch 8, batch 19650, loss[loss=0.2069, simple_loss=0.3174, pruned_loss=0.04818, over 19795.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2897, pruned_loss=0.07026, over 4277321.04 frames. 
], batch size: 702, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:04:36,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1398732.0, ans=0.125 2023-06-25 19:04:39,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1398732.0, ans=0.0 2023-06-25 19:05:31,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1398852.0, ans=0.0 2023-06-25 19:05:32,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=22.5 2023-06-25 19:06:01,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-06-25 19:06:07,311 INFO [train.py:996] (1/4) Epoch 8, batch 19700, loss[loss=0.1992, simple_loss=0.2911, pruned_loss=0.05367, over 21671.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2924, pruned_loss=0.0703, over 4277046.09 frames. ], batch size: 298, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:06:08,795 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.19 vs. limit=6.0 2023-06-25 19:06:53,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1399032.0, ans=0.0 2023-06-25 19:07:03,156 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.726e+02 4.245e+02 5.228e+02 6.853e+02 1.147e+03, threshold=1.046e+03, percent-clipped=0.0 2023-06-25 19:07:10,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1399092.0, ans=0.125 2023-06-25 19:08:01,696 INFO [train.py:996] (1/4) Epoch 8, batch 19750, loss[loss=0.2696, simple_loss=0.3698, pruned_loss=0.08472, over 21695.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3039, pruned_loss=0.07274, over 4279091.71 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:08:33,554 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0 2023-06-25 19:08:58,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1399392.0, ans=0.125 2023-06-25 19:09:54,487 INFO [train.py:996] (1/4) Epoch 8, batch 19800, loss[loss=0.1752, simple_loss=0.2482, pruned_loss=0.0511, over 21453.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3048, pruned_loss=0.07362, over 4280163.71 frames. ], batch size: 194, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:09:56,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1399572.0, ans=0.125 2023-06-25 19:10:10,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399632.0, ans=0.1 2023-06-25 19:10:29,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. 
limit=6.0 2023-06-25 19:10:38,211 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.249e+02 4.512e+02 5.932e+02 8.767e+02 2.271e+03, threshold=1.186e+03, percent-clipped=19.0 2023-06-25 19:10:55,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399692.0, ans=0.1 2023-06-25 19:11:42,795 INFO [train.py:996] (1/4) Epoch 8, batch 19850, loss[loss=0.1895, simple_loss=0.2875, pruned_loss=0.04576, over 21169.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2966, pruned_loss=0.06873, over 4278465.86 frames. ], batch size: 548, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:12:41,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1399992.0, ans=0.0 2023-06-25 19:13:02,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1400052.0, ans=0.5 2023-06-25 19:13:19,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.04 vs. limit=12.0 2023-06-25 19:13:28,588 INFO [train.py:996] (1/4) Epoch 8, batch 19900, loss[loss=0.1953, simple_loss=0.2624, pruned_loss=0.06408, over 21552.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2947, pruned_loss=0.06589, over 4276108.61 frames. ], batch size: 263, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:13:34,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1400172.0, ans=0.0 2023-06-25 19:13:53,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1400232.0, ans=0.125 2023-06-25 19:14:17,379 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.623e+02 3.554e+02 4.496e+02 7.903e+02 1.499e+03, threshold=8.992e+02, percent-clipped=4.0 2023-06-25 19:14:59,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1400412.0, ans=0.09899494936611666 2023-06-25 19:15:17,750 INFO [train.py:996] (1/4) Epoch 8, batch 19950, loss[loss=0.2298, simple_loss=0.2915, pruned_loss=0.08405, over 21871.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2892, pruned_loss=0.06551, over 4275768.09 frames. ], batch size: 107, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:15:47,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-25 19:17:12,735 INFO [train.py:996] (1/4) Epoch 8, batch 20000, loss[loss=0.2126, simple_loss=0.2932, pruned_loss=0.06601, over 21115.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2899, pruned_loss=0.06532, over 4279756.21 frames. ], batch size: 608, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:17:44,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1400832.0, ans=0.07 2023-06-25 19:17:55,682 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 3.942e+02 5.343e+02 7.186e+02 1.508e+03, threshold=1.069e+03, percent-clipped=12.0 2023-06-25 19:18:01,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.09 vs. 
limit=15.0 2023-06-25 19:18:58,668 INFO [train.py:996] (1/4) Epoch 8, batch 20050, loss[loss=0.2201, simple_loss=0.293, pruned_loss=0.07362, over 21953.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2913, pruned_loss=0.06761, over 4285322.01 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:19:06,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1401072.0, ans=0.0 2023-06-25 19:19:22,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1401132.0, ans=0.05 2023-06-25 19:19:22,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1401132.0, ans=0.0 2023-06-25 19:19:29,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1401132.0, ans=0.2 2023-06-25 19:19:41,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1401192.0, ans=0.0 2023-06-25 19:20:02,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1401252.0, ans=0.125 2023-06-25 19:20:25,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1401312.0, ans=0.1 2023-06-25 19:20:29,999 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-25 19:20:48,062 INFO [train.py:996] (1/4) Epoch 8, batch 20100, loss[loss=0.2105, simple_loss=0.294, pruned_loss=0.06351, over 21811.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2929, pruned_loss=0.0696, over 4293673.71 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:21:30,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1401492.0, ans=0.0 2023-06-25 19:21:33,499 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.930e+02 3.809e+02 4.961e+02 6.304e+02 1.570e+03, threshold=9.921e+02, percent-clipped=3.0 2023-06-25 19:21:58,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1401552.0, ans=0.125 2023-06-25 19:22:02,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1401552.0, ans=0.0 2023-06-25 19:22:29,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-25 19:22:38,463 INFO [train.py:996] (1/4) Epoch 8, batch 20150, loss[loss=0.2502, simple_loss=0.3276, pruned_loss=0.08637, over 21725.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3022, pruned_loss=0.07237, over 4289877.12 frames. 
], batch size: 298, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:23:07,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1401732.0, ans=0.125 2023-06-25 19:24:23,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1401912.0, ans=0.1 2023-06-25 19:24:35,467 INFO [train.py:996] (1/4) Epoch 8, batch 20200, loss[loss=0.2184, simple_loss=0.3059, pruned_loss=0.06543, over 21639.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3076, pruned_loss=0.07515, over 4287498.67 frames. ], batch size: 263, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:24:37,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1401972.0, ans=0.0 2023-06-25 19:24:52,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-25 19:24:53,755 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-25 19:25:12,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1402032.0, ans=0.125 2023-06-25 19:25:25,535 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.344e+02 4.250e+02 5.853e+02 8.923e+02 1.822e+03, threshold=1.171e+03, percent-clipped=17.0 2023-06-25 19:26:22,998 INFO [train.py:996] (1/4) Epoch 8, batch 20250, loss[loss=0.2042, simple_loss=0.2811, pruned_loss=0.06361, over 21291.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3078, pruned_loss=0.07341, over 4285295.88 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:27:18,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.55 vs. limit=10.0 2023-06-25 19:27:29,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1402452.0, ans=0.1 2023-06-25 19:27:39,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1402452.0, ans=0.2 2023-06-25 19:28:15,777 INFO [train.py:996] (1/4) Epoch 8, batch 20300, loss[loss=0.1933, simple_loss=0.27, pruned_loss=0.05833, over 21340.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3063, pruned_loss=0.07156, over 4281105.88 frames. ], batch size: 131, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:28:58,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.664e+02 3.704e+02 5.083e+02 7.003e+02 2.093e+03, threshold=1.017e+03, percent-clipped=9.0 2023-06-25 19:29:21,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-25 19:29:35,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1402812.0, ans=0.0 2023-06-25 19:29:39,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. 
limit=15.0 2023-06-25 19:29:44,066 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:29:53,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1402812.0, ans=0.02 2023-06-25 19:29:56,322 INFO [train.py:996] (1/4) Epoch 8, batch 20350, loss[loss=0.2469, simple_loss=0.3258, pruned_loss=0.08403, over 21873.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3063, pruned_loss=0.07161, over 4265128.00 frames. ], batch size: 118, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:30:16,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1402872.0, ans=0.125 2023-06-25 19:30:33,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1402932.0, ans=0.2 2023-06-25 19:31:00,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=12.0 2023-06-25 19:31:06,816 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:31:44,573 INFO [train.py:996] (1/4) Epoch 8, batch 20400, loss[loss=0.2346, simple_loss=0.3143, pruned_loss=0.07743, over 21651.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3088, pruned_loss=0.07442, over 4266652.12 frames. ], batch size: 230, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:32:21,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-25 19:32:24,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1403232.0, ans=0.0 2023-06-25 19:32:33,895 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.980e+02 4.155e+02 6.028e+02 7.732e+02 1.561e+03, threshold=1.206e+03, percent-clipped=8.0 2023-06-25 19:33:04,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1403352.0, ans=0.125 2023-06-25 19:33:31,063 INFO [train.py:996] (1/4) Epoch 8, batch 20450, loss[loss=0.2192, simple_loss=0.289, pruned_loss=0.07467, over 21501.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.31, pruned_loss=0.07643, over 4248636.76 frames. ], batch size: 194, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:33:33,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1403472.0, ans=0.125 2023-06-25 19:33:36,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-25 19:33:54,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1403472.0, ans=0.125 2023-06-25 19:33:55,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.63 vs. 
limit=15.0 2023-06-25 19:34:38,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1403652.0, ans=0.125 2023-06-25 19:35:00,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-25 19:35:11,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-25 19:35:16,944 INFO [train.py:996] (1/4) Epoch 8, batch 20500, loss[loss=0.213, simple_loss=0.2814, pruned_loss=0.07225, over 21462.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3052, pruned_loss=0.07643, over 4250885.72 frames. ], batch size: 211, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:35:44,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-25 19:36:06,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1403892.0, ans=0.125 2023-06-25 19:36:07,743 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.072e+02 6.125e+02 8.287e+02 1.348e+03, threshold=1.225e+03, percent-clipped=6.0 2023-06-25 19:37:09,435 INFO [train.py:996] (1/4) Epoch 8, batch 20550, loss[loss=0.1995, simple_loss=0.2848, pruned_loss=0.05709, over 21855.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2986, pruned_loss=0.07498, over 4245431.92 frames. ], batch size: 372, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:37:25,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1404072.0, ans=0.2 2023-06-25 19:38:40,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1404312.0, ans=0.0 2023-06-25 19:38:56,971 INFO [train.py:996] (1/4) Epoch 8, batch 20600, loss[loss=0.2358, simple_loss=0.3147, pruned_loss=0.0785, over 21724.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3015, pruned_loss=0.07346, over 4247697.96 frames. ], batch size: 389, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:39:19,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1404432.0, ans=0.125 2023-06-25 19:39:29,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1404432.0, ans=0.0 2023-06-25 19:39:42,068 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.887e+02 4.920e+02 7.013e+02 1.215e+03 1.791e+03, threshold=1.403e+03, percent-clipped=24.0 2023-06-25 19:40:04,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0 2023-06-25 19:40:22,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1404612.0, ans=0.2 2023-06-25 19:40:42,118 INFO [train.py:996] (1/4) Epoch 8, batch 20650, loss[loss=0.1842, simple_loss=0.2492, pruned_loss=0.05962, over 21541.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2981, pruned_loss=0.07367, over 4261575.93 frames. 
], batch size: 230, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:41:01,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-25 19:41:22,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1404792.0, ans=0.125 2023-06-25 19:42:30,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1404972.0, ans=0.125 2023-06-25 19:42:31,271 INFO [train.py:996] (1/4) Epoch 8, batch 20700, loss[loss=0.1788, simple_loss=0.2621, pruned_loss=0.04774, over 21609.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2893, pruned_loss=0.0706, over 4258178.34 frames. ], batch size: 263, lr: 3.69e-03, grad_scale: 8.0 2023-06-25 19:43:07,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1405032.0, ans=0.1 2023-06-25 19:43:27,033 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.647e+02 4.600e+02 6.617e+02 1.302e+03, threshold=9.199e+02, percent-clipped=0.0 2023-06-25 19:43:34,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=15.0 2023-06-25 19:43:45,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1405152.0, ans=0.2 2023-06-25 19:44:27,712 INFO [train.py:996] (1/4) Epoch 8, batch 20750, loss[loss=0.2467, simple_loss=0.3442, pruned_loss=0.07465, over 21734.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.293, pruned_loss=0.07021, over 4264762.40 frames. ], batch size: 332, lr: 3.69e-03, grad_scale: 8.0 2023-06-25 19:45:02,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1405332.0, ans=10.0 2023-06-25 19:45:05,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1405332.0, ans=0.125 2023-06-25 19:45:09,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1405392.0, ans=0.0 2023-06-25 19:45:26,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1405392.0, ans=0.2 2023-06-25 19:46:16,223 INFO [train.py:996] (1/4) Epoch 8, batch 20800, loss[loss=0.1862, simple_loss=0.2568, pruned_loss=0.05784, over 21598.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2944, pruned_loss=0.07078, over 4266609.24 frames. ], batch size: 298, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:46:25,387 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:46:48,162 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:47:04,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. 
limit=15.0 2023-06-25 19:47:10,349 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.318e+02 7.506e+02 1.059e+03 2.434e+03, threshold=1.501e+03, percent-clipped=34.0 2023-06-25 19:47:27,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1405752.0, ans=0.125 2023-06-25 19:48:02,835 INFO [train.py:996] (1/4) Epoch 8, batch 20850, loss[loss=0.175, simple_loss=0.2418, pruned_loss=0.05408, over 21016.00 frames. ], tot_loss[loss=0.211, simple_loss=0.286, pruned_loss=0.06796, over 4243977.12 frames. ], batch size: 608, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:48:18,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1405932.0, ans=0.0 2023-06-25 19:48:58,755 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-25 19:49:23,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1406052.0, ans=0.2 2023-06-25 19:49:44,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1406112.0, ans=0.125 2023-06-25 19:49:47,591 INFO [train.py:996] (1/4) Epoch 8, batch 20900, loss[loss=0.2172, simple_loss=0.2991, pruned_loss=0.0677, over 21818.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.286, pruned_loss=0.06927, over 4250590.18 frames. ], batch size: 333, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:50:05,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1406232.0, ans=0.125 2023-06-25 19:50:25,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1406292.0, ans=0.125 2023-06-25 19:50:34,299 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.696e+02 3.719e+02 4.943e+02 7.397e+02 1.417e+03, threshold=9.886e+02, percent-clipped=0.0 2023-06-25 19:50:46,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1406352.0, ans=0.0 2023-06-25 19:51:30,970 INFO [train.py:996] (1/4) Epoch 8, batch 20950, loss[loss=0.1867, simple_loss=0.2626, pruned_loss=0.05543, over 21840.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2823, pruned_loss=0.06594, over 4252809.33 frames. ], batch size: 118, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:51:31,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1406472.0, ans=0.1 2023-06-25 19:52:17,547 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:52:26,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. 
limit=15.0 2023-06-25 19:52:38,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1406652.0, ans=0.125 2023-06-25 19:52:42,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1406652.0, ans=0.2 2023-06-25 19:53:01,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1406712.0, ans=0.0 2023-06-25 19:53:11,619 INFO [train.py:996] (1/4) Epoch 8, batch 21000, loss[loss=0.2298, simple_loss=0.3109, pruned_loss=0.0744, over 21833.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2831, pruned_loss=0.06626, over 4256089.61 frames. ], batch size: 112, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:53:11,620 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 19:53:23,540 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.7674, 5.9119, 5.5841, 5.3472], device='cuda:1') 2023-06-25 19:53:31,255 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2635, simple_loss=0.3595, pruned_loss=0.08373, over 1796401.00 frames. 2023-06-25 19:53:31,256 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 19:54:24,844 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.578e+02 4.486e+02 7.087e+02 1.717e+03, threshold=8.972e+02, percent-clipped=7.0 2023-06-25 19:55:17,248 INFO [train.py:996] (1/4) Epoch 8, batch 21050, loss[loss=0.2153, simple_loss=0.2935, pruned_loss=0.06857, over 15895.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2822, pruned_loss=0.06677, over 4256463.59 frames. ], batch size: 63, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:55:19,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1407072.0, ans=0.125 2023-06-25 19:55:29,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1407072.0, ans=0.125 2023-06-25 19:55:57,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.94 vs. limit=22.5 2023-06-25 19:56:54,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1407312.0, ans=0.125 2023-06-25 19:57:03,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1407372.0, ans=0.0 2023-06-25 19:57:05,032 INFO [train.py:996] (1/4) Epoch 8, batch 21100, loss[loss=0.195, simple_loss=0.2636, pruned_loss=0.06317, over 21714.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.279, pruned_loss=0.0662, over 4266409.34 frames. ], batch size: 316, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:57:30,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.91 vs. 
limit=22.5 2023-06-25 19:57:41,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1407432.0, ans=0.0 2023-06-25 19:57:43,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1407492.0, ans=0.0 2023-06-25 19:57:45,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1407492.0, ans=0.125 2023-06-25 19:57:46,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1407492.0, ans=0.125 2023-06-25 19:57:57,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.898e+02 4.201e+02 5.635e+02 7.939e+02 1.482e+03, threshold=1.127e+03, percent-clipped=15.0 2023-06-25 19:58:05,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1407552.0, ans=0.0 2023-06-25 19:58:25,311 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:58:30,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1407612.0, ans=0.125 2023-06-25 19:58:42,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1407612.0, ans=10.0 2023-06-25 19:58:49,891 INFO [train.py:996] (1/4) Epoch 8, batch 21150, loss[loss=0.2467, simple_loss=0.3523, pruned_loss=0.07052, over 19793.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2757, pruned_loss=0.06695, over 4273122.82 frames. ], batch size: 703, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:59:54,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1407852.0, ans=0.0 2023-06-25 20:00:26,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1407912.0, ans=0.125 2023-06-25 20:00:29,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1407912.0, ans=0.125 2023-06-25 20:00:33,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1407912.0, ans=0.0 2023-06-25 20:00:38,488 INFO [train.py:996] (1/4) Epoch 8, batch 21200, loss[loss=0.1909, simple_loss=0.2571, pruned_loss=0.06236, over 21259.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2719, pruned_loss=0.06614, over 4253614.51 frames. ], batch size: 144, lr: 3.68e-03, grad_scale: 32.0 2023-06-25 20:01:34,476 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.940e+02 3.823e+02 4.703e+02 6.840e+02 1.518e+03, threshold=9.406e+02, percent-clipped=1.0 2023-06-25 20:02:26,000 INFO [train.py:996] (1/4) Epoch 8, batch 21250, loss[loss=0.2244, simple_loss=0.2966, pruned_loss=0.07614, over 21530.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2705, pruned_loss=0.06609, over 4257511.77 frames. 
], batch size: 195, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:02:52,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1408332.0, ans=0.125 2023-06-25 20:03:16,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1408392.0, ans=0.2 2023-06-25 20:03:23,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1408392.0, ans=0.1 2023-06-25 20:03:25,861 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-25 20:03:56,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=22.5 2023-06-25 20:04:11,919 INFO [train.py:996] (1/4) Epoch 8, batch 21300, loss[loss=0.2312, simple_loss=0.3097, pruned_loss=0.07637, over 21518.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2777, pruned_loss=0.06844, over 4260805.84 frames. ], batch size: 131, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:05:07,872 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.197e+02 4.370e+02 6.934e+02 9.057e+02 1.727e+03, threshold=1.387e+03, percent-clipped=23.0 2023-06-25 20:05:58,749 INFO [train.py:996] (1/4) Epoch 8, batch 21350, loss[loss=0.2098, simple_loss=0.3007, pruned_loss=0.0595, over 21386.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2823, pruned_loss=0.0693, over 4263270.40 frames. ], batch size: 211, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:06:35,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-25 20:06:39,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1408932.0, ans=0.125 2023-06-25 20:06:46,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1408992.0, ans=0.125 2023-06-25 20:07:45,972 INFO [train.py:996] (1/4) Epoch 8, batch 21400, loss[loss=0.2767, simple_loss=0.3455, pruned_loss=0.104, over 21419.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2864, pruned_loss=0.06931, over 4268940.09 frames. ], batch size: 471, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:07:49,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409172.0, ans=0.1 2023-06-25 20:08:04,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1409172.0, ans=0.125 2023-06-25 20:08:18,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409232.0, ans=0.1 2023-06-25 20:08:22,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1409232.0, ans=0.125 2023-06-25 20:08:46,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.047e+02 3.806e+02 5.030e+02 6.995e+02 1.894e+03, threshold=1.006e+03, percent-clipped=5.0 2023-06-25 20:08:47,828 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. 
limit=6.0 2023-06-25 20:09:13,101 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-25 20:09:31,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1409472.0, ans=0.125 2023-06-25 20:09:32,596 INFO [train.py:996] (1/4) Epoch 8, batch 21450, loss[loss=0.2265, simple_loss=0.291, pruned_loss=0.08105, over 21922.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2889, pruned_loss=0.07012, over 4276284.02 frames. ], batch size: 316, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:09:37,239 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-25 20:10:54,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1409652.0, ans=0.2 2023-06-25 20:11:20,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=22.5 2023-06-25 20:11:20,625 INFO [train.py:996] (1/4) Epoch 8, batch 21500, loss[loss=0.2117, simple_loss=0.2862, pruned_loss=0.0686, over 21682.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.288, pruned_loss=0.07163, over 4276348.43 frames. ], batch size: 230, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:11:32,613 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-25 20:12:25,071 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.002e+02 3.682e+02 4.429e+02 6.594e+02 1.934e+03, threshold=8.857e+02, percent-clipped=12.0 2023-06-25 20:12:29,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409892.0, ans=0.1 2023-06-25 20:12:55,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1410012.0, ans=0.125 2023-06-25 20:12:57,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1410012.0, ans=0.125 2023-06-25 20:13:02,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1410012.0, ans=0.5 2023-06-25 20:13:05,248 INFO [train.py:996] (1/4) Epoch 8, batch 21550, loss[loss=0.1861, simple_loss=0.2501, pruned_loss=0.06104, over 21328.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2812, pruned_loss=0.06886, over 4280565.18 frames. ], batch size: 144, lr: 3.68e-03, grad_scale: 8.0 2023-06-25 20:14:53,573 INFO [train.py:996] (1/4) Epoch 8, batch 21600, loss[loss=0.1932, simple_loss=0.2698, pruned_loss=0.05828, over 21390.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.277, pruned_loss=0.06712, over 4280972.60 frames. ], batch size: 131, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:16:02,290 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.883e+02 3.709e+02 4.996e+02 7.825e+02 2.196e+03, threshold=9.991e+02, percent-clipped=18.0 2023-06-25 20:16:24,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.83 vs. 
limit=22.5 2023-06-25 20:16:46,686 INFO [train.py:996] (1/4) Epoch 8, batch 21650, loss[loss=0.2162, simple_loss=0.3085, pruned_loss=0.06189, over 21585.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2816, pruned_loss=0.06532, over 4281780.70 frames. ], batch size: 230, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:16:57,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1410672.0, ans=0.035 2023-06-25 20:17:23,007 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.61 vs. limit=10.0 2023-06-25 20:17:40,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1410792.0, ans=0.125 2023-06-25 20:18:24,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1410972.0, ans=0.1 2023-06-25 20:18:25,953 INFO [train.py:996] (1/4) Epoch 8, batch 21700, loss[loss=0.1865, simple_loss=0.2614, pruned_loss=0.05581, over 21727.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2818, pruned_loss=0.0636, over 4287305.02 frames. ], batch size: 112, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:18:26,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1410972.0, ans=0.0 2023-06-25 20:19:33,110 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.665e+02 3.626e+02 5.313e+02 7.928e+02 1.804e+03, threshold=1.063e+03, percent-clipped=12.0 2023-06-25 20:19:54,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1411212.0, ans=0.2 2023-06-25 20:20:01,888 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.32 vs. limit=10.0 2023-06-25 20:20:03,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1411212.0, ans=0.025 2023-06-25 20:20:12,839 INFO [train.py:996] (1/4) Epoch 8, batch 21750, loss[loss=0.2206, simple_loss=0.2729, pruned_loss=0.08415, over 21517.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2776, pruned_loss=0.06335, over 4279381.08 frames. ], batch size: 442, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:21:30,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1411452.0, ans=0.125 2023-06-25 20:21:51,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1411512.0, ans=0.2 2023-06-25 20:22:07,448 INFO [train.py:996] (1/4) Epoch 8, batch 21800, loss[loss=0.2313, simple_loss=0.3113, pruned_loss=0.0757, over 21610.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2763, pruned_loss=0.06483, over 4274047.61 frames. 
], batch size: 247, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:22:09,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1411572.0, ans=0.07 2023-06-25 20:22:18,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1411572.0, ans=0.1 2023-06-25 20:22:39,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1411632.0, ans=0.015 2023-06-25 20:23:10,278 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.928e+02 3.851e+02 5.673e+02 8.450e+02 2.187e+03, threshold=1.135e+03, percent-clipped=14.0 2023-06-25 20:23:12,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1411692.0, ans=0.0 2023-06-25 20:23:46,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1411812.0, ans=0.2 2023-06-25 20:23:54,654 INFO [train.py:996] (1/4) Epoch 8, batch 21850, loss[loss=0.2091, simple_loss=0.275, pruned_loss=0.07158, over 21300.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2819, pruned_loss=0.06577, over 4259334.97 frames. ], batch size: 143, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:24:02,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-25 20:25:05,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1412052.0, ans=0.1 2023-06-25 20:25:44,252 INFO [train.py:996] (1/4) Epoch 8, batch 21900, loss[loss=0.2358, simple_loss=0.3224, pruned_loss=0.07463, over 19858.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2829, pruned_loss=0.06666, over 4267266.44 frames. ], batch size: 703, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:26:46,375 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.152e+02 5.797e+02 7.520e+02 1.468e+03, threshold=1.159e+03, percent-clipped=2.0 2023-06-25 20:27:05,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1412352.0, ans=0.0 2023-06-25 20:27:16,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1412412.0, ans=0.1 2023-06-25 20:27:23,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1412412.0, ans=0.025 2023-06-25 20:27:36,365 INFO [train.py:996] (1/4) Epoch 8, batch 21950, loss[loss=0.2013, simple_loss=0.2716, pruned_loss=0.0655, over 21398.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2777, pruned_loss=0.06593, over 4267619.49 frames. ], batch size: 473, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:27:41,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.20 vs. 
limit=15.0 2023-06-25 20:27:43,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1412472.0, ans=0.125 2023-06-25 20:27:50,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1412472.0, ans=0.1 2023-06-25 20:28:35,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1412592.0, ans=0.125 2023-06-25 20:28:50,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1412652.0, ans=0.125 2023-06-25 20:29:20,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1412712.0, ans=0.0 2023-06-25 20:29:25,478 INFO [train.py:996] (1/4) Epoch 8, batch 22000, loss[loss=0.1685, simple_loss=0.2354, pruned_loss=0.05078, over 21196.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2716, pruned_loss=0.06288, over 4251984.45 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 32.0 2023-06-25 20:29:27,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1412772.0, ans=0.0 2023-06-25 20:29:45,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1412772.0, ans=0.125 2023-06-25 20:29:57,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1412832.0, ans=0.125 2023-06-25 20:30:23,049 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.905e+02 5.232e+02 7.810e+02 2.335e+03, threshold=1.046e+03, percent-clipped=14.0 2023-06-25 20:30:53,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1413012.0, ans=0.0 2023-06-25 20:31:07,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1413012.0, ans=0.125 2023-06-25 20:31:13,834 INFO [train.py:996] (1/4) Epoch 8, batch 22050, loss[loss=0.2477, simple_loss=0.3218, pruned_loss=0.08681, over 21477.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2767, pruned_loss=0.06451, over 4252212.37 frames. ], batch size: 131, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 20:31:44,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.72 vs. limit=12.0 2023-06-25 20:32:30,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1413252.0, ans=0.125 2023-06-25 20:32:36,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-25 20:32:56,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1413312.0, ans=10.0 2023-06-25 20:33:02,600 INFO [train.py:996] (1/4) Epoch 8, batch 22100, loss[loss=0.2906, simple_loss=0.3605, pruned_loss=0.1103, over 21774.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2869, pruned_loss=0.06894, over 4247578.69 frames. 
], batch size: 441, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:33:31,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413432.0, ans=0.1 2023-06-25 20:33:49,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1413492.0, ans=22.5 2023-06-25 20:34:00,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.256e+02 4.552e+02 6.727e+02 1.040e+03 2.213e+03, threshold=1.345e+03, percent-clipped=23.0 2023-06-25 20:34:06,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-25 20:34:47,946 INFO [train.py:996] (1/4) Epoch 8, batch 22150, loss[loss=0.2172, simple_loss=0.2913, pruned_loss=0.07154, over 21253.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2908, pruned_loss=0.07074, over 4258492.11 frames. ], batch size: 159, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:35:11,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1413672.0, ans=0.125 2023-06-25 20:35:15,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1413732.0, ans=0.125 2023-06-25 20:35:34,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1413792.0, ans=0.125 2023-06-25 20:35:39,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1413792.0, ans=0.125 2023-06-25 20:36:35,750 INFO [train.py:996] (1/4) Epoch 8, batch 22200, loss[loss=0.2329, simple_loss=0.3248, pruned_loss=0.07051, over 21807.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2931, pruned_loss=0.07177, over 4267015.15 frames. ], batch size: 298, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:36:54,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=15.0 2023-06-25 20:37:29,742 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.047e+02 4.294e+02 5.583e+02 8.306e+02 1.665e+03, threshold=1.117e+03, percent-clipped=3.0 2023-06-25 20:37:30,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1414092.0, ans=0.125 2023-06-25 20:38:08,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1414212.0, ans=0.0 2023-06-25 20:38:23,333 INFO [train.py:996] (1/4) Epoch 8, batch 22250, loss[loss=0.2185, simple_loss=0.2765, pruned_loss=0.08028, over 21254.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2998, pruned_loss=0.07366, over 4274173.36 frames. 
], batch size: 608, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:38:44,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1414332.0, ans=0.0 2023-06-25 20:38:53,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1414332.0, ans=0.125 2023-06-25 20:38:54,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1414332.0, ans=0.0 2023-06-25 20:39:17,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1414392.0, ans=0.125 2023-06-25 20:40:04,016 INFO [train.py:996] (1/4) Epoch 8, batch 22300, loss[loss=0.2327, simple_loss=0.3032, pruned_loss=0.08115, over 21741.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3019, pruned_loss=0.07569, over 4278983.56 frames. ], batch size: 389, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:40:16,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1414572.0, ans=0.02 2023-06-25 20:40:30,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1414632.0, ans=0.1 2023-06-25 20:40:57,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.093e+02 5.360e+02 7.335e+02 1.399e+03, threshold=1.072e+03, percent-clipped=5.0 2023-06-25 20:41:02,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1414752.0, ans=0.2 2023-06-25 20:41:51,930 INFO [train.py:996] (1/4) Epoch 8, batch 22350, loss[loss=0.2077, simple_loss=0.2702, pruned_loss=0.07258, over 21954.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3002, pruned_loss=0.07627, over 4288548.99 frames. ], batch size: 316, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:42:22,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1414932.0, ans=0.125 2023-06-25 20:42:22,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1414932.0, ans=0.07 2023-06-25 20:42:36,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1414992.0, ans=0.0 2023-06-25 20:42:39,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1414992.0, ans=0.125 2023-06-25 20:42:51,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1415052.0, ans=0.0 2023-06-25 20:43:38,702 INFO [train.py:996] (1/4) Epoch 8, batch 22400, loss[loss=0.1841, simple_loss=0.256, pruned_loss=0.05612, over 21784.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2971, pruned_loss=0.07359, over 4278683.25 frames. 
], batch size: 124, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 20:44:20,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1415292.0, ans=0.05 2023-06-25 20:44:34,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.012e+02 4.038e+02 6.138e+02 7.809e+02 1.292e+03, threshold=1.228e+03, percent-clipped=3.0 2023-06-25 20:45:19,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1415412.0, ans=0.2 2023-06-25 20:45:25,832 INFO [train.py:996] (1/4) Epoch 8, batch 22450, loss[loss=0.1895, simple_loss=0.2496, pruned_loss=0.06468, over 21332.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2921, pruned_loss=0.07265, over 4274203.84 frames. ], batch size: 177, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:46:01,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=22.5 2023-06-25 20:46:26,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1415592.0, ans=15.0 2023-06-25 20:46:26,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-25 20:46:27,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1415652.0, ans=0.0 2023-06-25 20:47:12,141 INFO [train.py:996] (1/4) Epoch 8, batch 22500, loss[loss=0.2248, simple_loss=0.322, pruned_loss=0.06378, over 21569.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2861, pruned_loss=0.07118, over 4269115.39 frames. ], batch size: 230, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:47:14,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1415772.0, ans=0.125 2023-06-25 20:47:27,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=12.0 2023-06-25 20:47:28,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1415832.0, ans=0.1 2023-06-25 20:47:52,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1415892.0, ans=0.2 2023-06-25 20:48:14,388 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.949e+02 4.919e+02 7.887e+02 2.030e+03, threshold=9.838e+02, percent-clipped=13.0 2023-06-25 20:49:01,255 INFO [train.py:996] (1/4) Epoch 8, batch 22550, loss[loss=0.2482, simple_loss=0.3176, pruned_loss=0.08945, over 21782.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2903, pruned_loss=0.07138, over 4272534.87 frames. ], batch size: 441, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:49:02,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-25 20:49:21,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.96 vs. 
limit=22.5 2023-06-25 20:49:30,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-25 20:49:51,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1416192.0, ans=0.125 2023-06-25 20:50:38,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1416312.0, ans=0.2 2023-06-25 20:50:52,261 INFO [train.py:996] (1/4) Epoch 8, batch 22600, loss[loss=0.2536, simple_loss=0.3479, pruned_loss=0.07968, over 21633.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2936, pruned_loss=0.07195, over 4274656.52 frames. ], batch size: 441, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:52:04,649 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.075e+02 4.518e+02 6.028e+02 9.364e+02 1.882e+03, threshold=1.206e+03, percent-clipped=21.0 2023-06-25 20:52:13,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1416552.0, ans=0.125 2023-06-25 20:52:38,698 INFO [train.py:996] (1/4) Epoch 8, batch 22650, loss[loss=0.2088, simple_loss=0.2704, pruned_loss=0.07359, over 21415.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2891, pruned_loss=0.07132, over 4277106.75 frames. ], batch size: 389, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:52:59,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1416732.0, ans=0.0 2023-06-25 20:53:40,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1416792.0, ans=0.125 2023-06-25 20:53:50,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1416852.0, ans=0.125 2023-06-25 20:54:20,825 INFO [train.py:996] (1/4) Epoch 8, batch 22700, loss[loss=0.2264, simple_loss=0.2794, pruned_loss=0.08671, over 21261.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2824, pruned_loss=0.06994, over 4284155.14 frames. ], batch size: 471, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:54:22,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=22.5 2023-06-25 20:54:45,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1417032.0, ans=0.125 2023-06-25 20:55:33,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.950e+02 3.999e+02 5.550e+02 8.694e+02 1.659e+03, threshold=1.110e+03, percent-clipped=6.0 2023-06-25 20:55:36,306 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:55:59,609 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:56:08,905 INFO [train.py:996] (1/4) Epoch 8, batch 22750, loss[loss=0.1926, simple_loss=0.2402, pruned_loss=0.07252, over 20733.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2848, pruned_loss=0.07185, over 4269221.69 frames. ], batch size: 609, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:57:52,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. 
limit=22.5 2023-06-25 20:57:55,435 INFO [train.py:996] (1/4) Epoch 8, batch 22800, loss[loss=0.2331, simple_loss=0.307, pruned_loss=0.07961, over 21827.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2883, pruned_loss=0.0736, over 4278008.48 frames. ], batch size: 124, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:57:57,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1417572.0, ans=0.0 2023-06-25 20:58:21,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1417632.0, ans=0.2 2023-06-25 20:58:41,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=15.0 2023-06-25 20:59:06,879 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.216e+02 4.609e+02 5.638e+02 8.633e+02 1.980e+03, threshold=1.128e+03, percent-clipped=10.0 2023-06-25 20:59:22,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1417812.0, ans=0.0 2023-06-25 20:59:27,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.13 vs. limit=22.5 2023-06-25 20:59:38,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1417812.0, ans=0.125 2023-06-25 20:59:41,027 INFO [train.py:996] (1/4) Epoch 8, batch 22850, loss[loss=0.1941, simple_loss=0.2592, pruned_loss=0.06446, over 21704.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2859, pruned_loss=0.07318, over 4269606.24 frames. ], batch size: 333, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:59:55,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1417872.0, ans=0.05 2023-06-25 21:00:59,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.15 vs. limit=15.0 2023-06-25 21:01:00,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1418052.0, ans=0.125 2023-06-25 21:01:11,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1418112.0, ans=0.125 2023-06-25 21:01:17,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1418112.0, ans=0.2 2023-06-25 21:01:18,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1418112.0, ans=0.0 2023-06-25 21:01:30,609 INFO [train.py:996] (1/4) Epoch 8, batch 22900, loss[loss=0.2322, simple_loss=0.3477, pruned_loss=0.05834, over 21169.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2886, pruned_loss=0.07273, over 4257246.76 frames. 
], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:01:36,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1418172.0, ans=0.125 2023-06-25 21:02:38,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1418292.0, ans=15.0 2023-06-25 21:02:45,749 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 4.620e+02 6.877e+02 1.071e+03 2.318e+03, threshold=1.375e+03, percent-clipped=23.0 2023-06-25 21:02:48,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1418352.0, ans=0.2 2023-06-25 21:02:49,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1418352.0, ans=0.125 2023-06-25 21:03:12,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1418412.0, ans=0.2 2023-06-25 21:03:25,356 INFO [train.py:996] (1/4) Epoch 8, batch 22950, loss[loss=0.2063, simple_loss=0.3143, pruned_loss=0.04914, over 21568.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3026, pruned_loss=0.07104, over 4266826.06 frames. ], batch size: 230, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:04:46,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1418652.0, ans=0.0 2023-06-25 21:05:02,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1418712.0, ans=0.125 2023-06-25 21:05:12,909 INFO [train.py:996] (1/4) Epoch 8, batch 23000, loss[loss=0.2408, simple_loss=0.3093, pruned_loss=0.0861, over 21620.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3028, pruned_loss=0.06932, over 4267179.86 frames. ], batch size: 471, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:05:25,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1418772.0, ans=0.2 2023-06-25 21:05:34,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1418832.0, ans=0.2 2023-06-25 21:05:43,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1418832.0, ans=0.2 2023-06-25 21:05:45,623 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-25 21:06:10,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.651e+02 4.060e+02 5.403e+02 8.584e+02 1.736e+03, threshold=1.081e+03, percent-clipped=10.0 2023-06-25 21:06:44,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-25 21:06:55,852 INFO [train.py:996] (1/4) Epoch 8, batch 23050, loss[loss=0.2791, simple_loss=0.3408, pruned_loss=0.1087, over 21479.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3034, pruned_loss=0.07177, over 4273576.83 frames. 
], batch size: 471, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:07:36,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1419132.0, ans=0.0 2023-06-25 21:08:15,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419252.0, ans=0.1 2023-06-25 21:08:42,789 INFO [train.py:996] (1/4) Epoch 8, batch 23100, loss[loss=0.1824, simple_loss=0.2371, pruned_loss=0.06385, over 20685.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2989, pruned_loss=0.07201, over 4273547.94 frames. ], batch size: 607, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:09:00,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419372.0, ans=0.1 2023-06-25 21:09:00,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1419372.0, ans=0.0 2023-06-25 21:09:15,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-25 21:09:17,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1419432.0, ans=0.0 2023-06-25 21:09:34,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1419492.0, ans=0.1 2023-06-25 21:09:44,370 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.818e+02 4.180e+02 5.701e+02 8.990e+02 1.720e+03, threshold=1.140e+03, percent-clipped=10.0 2023-06-25 21:10:15,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1419612.0, ans=0.125 2023-06-25 21:10:30,263 INFO [train.py:996] (1/4) Epoch 8, batch 23150, loss[loss=0.2017, simple_loss=0.2755, pruned_loss=0.06398, over 21520.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2933, pruned_loss=0.07171, over 4274843.15 frames. ], batch size: 131, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:11:28,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1419792.0, ans=0.125 2023-06-25 21:12:17,931 INFO [train.py:996] (1/4) Epoch 8, batch 23200, loss[loss=0.2083, simple_loss=0.2803, pruned_loss=0.06817, over 21366.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2921, pruned_loss=0.07284, over 4278138.12 frames. ], batch size: 176, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 21:12:23,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1419972.0, ans=0.125 2023-06-25 21:13:09,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1420092.0, ans=0.2 2023-06-25 21:13:19,452 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.122e+02 4.151e+02 5.652e+02 8.200e+02 1.593e+03, threshold=1.130e+03, percent-clipped=6.0 2023-06-25 21:13:59,463 INFO [train.py:996] (1/4) Epoch 8, batch 23250, loss[loss=0.225, simple_loss=0.2904, pruned_loss=0.07979, over 21477.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2917, pruned_loss=0.07342, over 4284010.00 frames. 
], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:14:44,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1420392.0, ans=0.125 2023-06-25 21:14:56,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1420392.0, ans=0.02 2023-06-25 21:15:00,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1420452.0, ans=0.125 2023-06-25 21:15:34,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1420512.0, ans=0.07 2023-06-25 21:15:52,841 INFO [train.py:996] (1/4) Epoch 8, batch 23300, loss[loss=0.2208, simple_loss=0.3234, pruned_loss=0.05907, over 21428.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2974, pruned_loss=0.07389, over 4288646.96 frames. ], batch size: 194, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:16:05,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1420572.0, ans=0.125 2023-06-25 21:16:09,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.71 vs. limit=15.0 2023-06-25 21:16:35,809 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-25 21:16:40,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1420692.0, ans=0.125 2023-06-25 21:16:57,815 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.217e+02 4.429e+02 5.607e+02 7.442e+02 1.718e+03, threshold=1.121e+03, percent-clipped=5.0 2023-06-25 21:17:05,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1420752.0, ans=0.2 2023-06-25 21:17:36,838 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:17:36,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1420812.0, ans=0.0 2023-06-25 21:17:41,347 INFO [train.py:996] (1/4) Epoch 8, batch 23350, loss[loss=0.166, simple_loss=0.2413, pruned_loss=0.04539, over 21165.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3013, pruned_loss=0.073, over 4284144.64 frames. ], batch size: 143, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:17:54,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1420872.0, ans=0.125 2023-06-25 21:18:52,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1421052.0, ans=0.1 2023-06-25 21:19:24,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1421112.0, ans=0.0 2023-06-25 21:19:29,484 INFO [train.py:996] (1/4) Epoch 8, batch 23400, loss[loss=0.258, simple_loss=0.3189, pruned_loss=0.09855, over 21780.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2953, pruned_loss=0.06925, over 4267035.57 frames. 
], batch size: 441, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:19:35,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1421172.0, ans=0.2 2023-06-25 21:19:40,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1421172.0, ans=0.125 2023-06-25 21:19:40,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1421172.0, ans=0.125 2023-06-25 21:20:34,180 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.708e+02 4.466e+02 6.262e+02 8.598e+02 1.529e+03, threshold=1.252e+03, percent-clipped=12.0 2023-06-25 21:20:57,857 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-25 21:21:17,394 INFO [train.py:996] (1/4) Epoch 8, batch 23450, loss[loss=0.2194, simple_loss=0.2933, pruned_loss=0.07271, over 21747.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2961, pruned_loss=0.07198, over 4266945.70 frames. ], batch size: 332, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:21:21,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1421472.0, ans=0.0 2023-06-25 21:21:23,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1421472.0, ans=10.0 2023-06-25 21:21:30,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1421472.0, ans=0.0 2023-06-25 21:21:40,113 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:21:53,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1421532.0, ans=0.125 2023-06-25 21:22:53,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1421712.0, ans=0.0 2023-06-25 21:23:04,829 INFO [train.py:996] (1/4) Epoch 8, batch 23500, loss[loss=0.2078, simple_loss=0.2779, pruned_loss=0.06885, over 21421.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2959, pruned_loss=0.07338, over 4273107.43 frames. ], batch size: 211, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:23:21,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-25 21:23:29,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5 2023-06-25 21:24:07,690 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.437e+02 4.437e+02 5.920e+02 8.678e+02 1.556e+03, threshold=1.184e+03, percent-clipped=4.0 2023-06-25 21:24:50,817 INFO [train.py:996] (1/4) Epoch 8, batch 23550, loss[loss=0.1837, simple_loss=0.254, pruned_loss=0.05665, over 21671.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2921, pruned_loss=0.07333, over 4271700.83 frames. 
], batch size: 264, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:24:54,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1422072.0, ans=0.0 2023-06-25 21:26:01,862 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-25 21:26:08,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1422252.0, ans=0.09899494936611666 2023-06-25 21:26:34,227 INFO [train.py:996] (1/4) Epoch 8, batch 23600, loss[loss=0.235, simple_loss=0.3092, pruned_loss=0.08041, over 21669.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2929, pruned_loss=0.07416, over 4273711.60 frames. ], batch size: 351, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:27:44,973 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-25 21:27:45,447 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.385e+02 5.770e+02 8.074e+02 1.431e+03, threshold=1.154e+03, percent-clipped=6.0 2023-06-25 21:28:03,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1422612.0, ans=0.125 2023-06-25 21:28:16,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1422612.0, ans=0.125 2023-06-25 21:28:19,014 INFO [train.py:996] (1/4) Epoch 8, batch 23650, loss[loss=0.2637, simple_loss=0.3387, pruned_loss=0.09434, over 21450.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2937, pruned_loss=0.07284, over 4277882.81 frames. ], batch size: 471, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:28:23,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1422672.0, ans=0.2 2023-06-25 21:28:30,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1422672.0, ans=0.1 2023-06-25 21:29:09,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1422792.0, ans=0.0 2023-06-25 21:29:27,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1422792.0, ans=0.2 2023-06-25 21:30:15,687 INFO [train.py:996] (1/4) Epoch 8, batch 23700, loss[loss=0.2042, simple_loss=0.2854, pruned_loss=0.06151, over 21926.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2948, pruned_loss=0.07222, over 4277242.99 frames. ], batch size: 317, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:30:27,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.03 vs. 
limit=22.5 2023-06-25 21:30:56,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1423032.0, ans=10.0 2023-06-25 21:31:21,922 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.069e+02 4.706e+02 7.567e+02 1.059e+03 2.312e+03, threshold=1.513e+03, percent-clipped=21.0 2023-06-25 21:31:24,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1423152.0, ans=0.035 2023-06-25 21:31:37,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1423152.0, ans=0.125 2023-06-25 21:31:40,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1423152.0, ans=0.125 2023-06-25 21:32:05,911 INFO [train.py:996] (1/4) Epoch 8, batch 23750, loss[loss=0.2272, simple_loss=0.3052, pruned_loss=0.07459, over 21641.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2997, pruned_loss=0.07369, over 4276925.79 frames. ], batch size: 263, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:32:35,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1423332.0, ans=0.0 2023-06-25 21:32:36,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1423332.0, ans=0.125 2023-06-25 21:32:38,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1423332.0, ans=0.0 2023-06-25 21:32:59,163 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:33:33,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1423452.0, ans=0.0 2023-06-25 21:33:54,140 INFO [train.py:996] (1/4) Epoch 8, batch 23800, loss[loss=0.2335, simple_loss=0.3334, pruned_loss=0.06678, over 21775.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2992, pruned_loss=0.07146, over 4283101.66 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:34:05,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-25 21:34:28,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1423632.0, ans=0.0 2023-06-25 21:34:30,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. 
limit=15.0 2023-06-25 21:34:33,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1423632.0, ans=0.125 2023-06-25 21:34:49,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1423692.0, ans=0.0 2023-06-25 21:35:08,034 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.992e+02 4.494e+02 6.635e+02 8.945e+02 1.790e+03, threshold=1.327e+03, percent-clipped=2.0 2023-06-25 21:35:44,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1423872.0, ans=0.0 2023-06-25 21:35:50,940 INFO [train.py:996] (1/4) Epoch 8, batch 23850, loss[loss=0.232, simple_loss=0.3138, pruned_loss=0.07511, over 21660.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3036, pruned_loss=0.07194, over 4278722.35 frames. ], batch size: 230, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:36:08,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423932.0, ans=0.1 2023-06-25 21:36:59,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1424052.0, ans=0.0 2023-06-25 21:37:03,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1424052.0, ans=0.0 2023-06-25 21:37:40,723 INFO [train.py:996] (1/4) Epoch 8, batch 23900, loss[loss=0.2359, simple_loss=0.3221, pruned_loss=0.07483, over 20714.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3093, pruned_loss=0.07394, over 4275664.10 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:38:23,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1424292.0, ans=0.125 2023-06-25 21:38:41,566 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.117e+02 4.954e+02 6.480e+02 8.834e+02 1.664e+03, threshold=1.296e+03, percent-clipped=3.0 2023-06-25 21:38:42,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424352.0, ans=0.1 2023-06-25 21:38:43,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2023-06-25 21:39:16,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1424412.0, ans=0.125 2023-06-25 21:39:23,105 INFO [train.py:996] (1/4) Epoch 8, batch 23950, loss[loss=0.2519, simple_loss=0.3706, pruned_loss=0.0666, over 20804.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3046, pruned_loss=0.07401, over 4274745.16 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:39:44,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1424532.0, ans=0.1 2023-06-25 21:40:13,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.99 vs. 
limit=10.0 2023-06-25 21:40:22,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1424592.0, ans=0.0 2023-06-25 21:40:24,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1424652.0, ans=0.125 2023-06-25 21:40:29,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1424652.0, ans=0.125 2023-06-25 21:40:57,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1424712.0, ans=0.125 2023-06-25 21:41:09,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1424772.0, ans=0.125 2023-06-25 21:41:11,171 INFO [train.py:996] (1/4) Epoch 8, batch 24000, loss[loss=0.2323, simple_loss=0.3045, pruned_loss=0.08003, over 21462.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3057, pruned_loss=0.07644, over 4282766.70 frames. ], batch size: 211, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:41:11,171 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 21:41:29,303 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2655, simple_loss=0.3581, pruned_loss=0.0864, over 1796401.00 frames. 2023-06-25 21:41:29,304 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 21:42:49,030 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.318e+02 4.591e+02 6.093e+02 8.134e+02 1.870e+03, threshold=1.219e+03, percent-clipped=5.0 2023-06-25 21:42:54,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1424952.0, ans=0.2 2023-06-25 21:42:58,305 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:43:18,402 INFO [train.py:996] (1/4) Epoch 8, batch 24050, loss[loss=0.1803, simple_loss=0.2626, pruned_loss=0.04901, over 21244.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3066, pruned_loss=0.07709, over 4283197.29 frames. ], batch size: 159, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:43:21,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-25 21:43:54,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. limit=10.0 2023-06-25 21:44:45,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1425252.0, ans=0.0 2023-06-25 21:45:09,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1425312.0, ans=0.1 2023-06-25 21:45:14,044 INFO [train.py:996] (1/4) Epoch 8, batch 24100, loss[loss=0.2852, simple_loss=0.3555, pruned_loss=0.1074, over 21485.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3054, pruned_loss=0.07479, over 4285209.65 frames. 
], batch size: 471, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:45:18,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1425372.0, ans=0.07 2023-06-25 21:45:37,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1425432.0, ans=0.1 2023-06-25 21:46:27,131 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.209e+02 4.362e+02 5.817e+02 7.695e+02 1.790e+03, threshold=1.163e+03, percent-clipped=6.0 2023-06-25 21:47:01,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1425672.0, ans=10.0 2023-06-25 21:47:02,456 INFO [train.py:996] (1/4) Epoch 8, batch 24150, loss[loss=0.2297, simple_loss=0.2991, pruned_loss=0.08015, over 21746.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3056, pruned_loss=0.07621, over 4292161.39 frames. ], batch size: 389, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:47:24,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1425732.0, ans=0.0 2023-06-25 21:48:58,665 INFO [train.py:996] (1/4) Epoch 8, batch 24200, loss[loss=0.2227, simple_loss=0.2878, pruned_loss=0.07875, over 21142.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3068, pruned_loss=0.07686, over 4293330.30 frames. ], batch size: 143, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:50:13,420 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.946e+02 4.269e+02 5.400e+02 8.843e+02 1.561e+03, threshold=1.080e+03, percent-clipped=7.0 2023-06-25 21:50:32,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1426212.0, ans=0.0 2023-06-25 21:50:34,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1426212.0, ans=0.2 2023-06-25 21:50:45,561 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.29 vs. limit=5.0 2023-06-25 21:50:49,358 INFO [train.py:996] (1/4) Epoch 8, batch 24250, loss[loss=0.1675, simple_loss=0.2675, pruned_loss=0.0338, over 21730.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.304, pruned_loss=0.07132, over 4285304.18 frames. ], batch size: 247, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:51:24,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1426332.0, ans=0.09899494936611666 2023-06-25 21:51:26,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1426332.0, ans=0.125 2023-06-25 21:52:25,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1426512.0, ans=0.1 2023-06-25 21:52:30,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-25 21:52:36,547 INFO [train.py:996] (1/4) Epoch 8, batch 24300, loss[loss=0.1822, simple_loss=0.2521, pruned_loss=0.0561, over 21847.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2975, pruned_loss=0.06545, over 4280981.42 frames. 
], batch size: 118, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:53:04,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1426632.0, ans=0.0 2023-06-25 21:53:48,811 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 3.813e+02 5.438e+02 8.323e+02 1.746e+03, threshold=1.088e+03, percent-clipped=13.0 2023-06-25 21:54:29,438 INFO [train.py:996] (1/4) Epoch 8, batch 24350, loss[loss=0.2085, simple_loss=0.2812, pruned_loss=0.06794, over 21768.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2925, pruned_loss=0.06495, over 4282443.50 frames. ], batch size: 247, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:54:38,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1426872.0, ans=0.1 2023-06-25 21:54:47,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1426932.0, ans=0.125 2023-06-25 21:54:53,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1426932.0, ans=0.02 2023-06-25 21:55:14,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1426992.0, ans=0.0 2023-06-25 21:56:13,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1427112.0, ans=0.05 2023-06-25 21:56:18,881 INFO [train.py:996] (1/4) Epoch 8, batch 24400, loss[loss=0.2586, simple_loss=0.3443, pruned_loss=0.0865, over 21806.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2971, pruned_loss=0.06851, over 4280557.18 frames. ], batch size: 118, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:56:28,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-06-25 21:57:27,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-25 21:57:29,771 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.945e+02 4.612e+02 5.722e+02 8.222e+02 2.006e+03, threshold=1.144e+03, percent-clipped=13.0 2023-06-25 21:58:07,722 INFO [train.py:996] (1/4) Epoch 8, batch 24450, loss[loss=0.2234, simple_loss=0.3169, pruned_loss=0.06492, over 21721.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2981, pruned_loss=0.07003, over 4275120.08 frames. ], batch size: 298, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:58:55,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1427592.0, ans=0.04949747468305833 2023-06-25 21:59:05,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.77 vs. limit=15.0 2023-06-25 21:59:05,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1427592.0, ans=0.0 2023-06-25 21:59:52,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.13 vs. limit=10.0 2023-06-25 21:59:55,399 INFO [train.py:996] (1/4) Epoch 8, batch 24500, loss[loss=0.2161, simple_loss=0.2925, pruned_loss=0.06981, over 21920.00 frames. 
], tot_loss[loss=0.2197, simple_loss=0.2996, pruned_loss=0.06989, over 4282795.56 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:59:59,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1427772.0, ans=0.0 2023-06-25 22:00:14,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1427772.0, ans=0.0 2023-06-25 22:00:18,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1427832.0, ans=0.0 2023-06-25 22:00:38,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1427892.0, ans=0.0 2023-06-25 22:00:46,199 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-25 22:00:49,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1427892.0, ans=0.0 2023-06-25 22:01:04,620 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.043e+02 4.093e+02 5.380e+02 7.688e+02 2.312e+03, threshold=1.076e+03, percent-clipped=10.0 2023-06-25 22:01:47,725 INFO [train.py:996] (1/4) Epoch 8, batch 24550, loss[loss=0.2441, simple_loss=0.3138, pruned_loss=0.08718, over 21472.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3017, pruned_loss=0.07227, over 4286094.45 frames. ], batch size: 194, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:02:40,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1428192.0, ans=0.125 2023-06-25 22:03:27,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1428312.0, ans=0.125 2023-06-25 22:03:34,844 INFO [train.py:996] (1/4) Epoch 8, batch 24600, loss[loss=0.24, simple_loss=0.2906, pruned_loss=0.09471, over 21239.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2984, pruned_loss=0.07281, over 4278093.56 frames. ], batch size: 471, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:04:11,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1428432.0, ans=0.125 2023-06-25 22:04:16,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1428492.0, ans=0.125 2023-06-25 22:04:43,329 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.133e+02 4.316e+02 5.425e+02 7.027e+02 1.651e+03, threshold=1.085e+03, percent-clipped=8.0 2023-06-25 22:05:01,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1428612.0, ans=0.125 2023-06-25 22:05:14,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1428612.0, ans=0.035 2023-06-25 22:05:16,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1428612.0, ans=0.0 2023-06-25 22:05:19,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. 
limit=22.5 2023-06-25 22:05:21,816 INFO [train.py:996] (1/4) Epoch 8, batch 24650, loss[loss=0.2072, simple_loss=0.2799, pruned_loss=0.06731, over 21885.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.292, pruned_loss=0.07167, over 4270654.93 frames. ], batch size: 98, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:06:19,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1428792.0, ans=0.125 2023-06-25 22:06:29,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1428852.0, ans=0.0 2023-06-25 22:06:53,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428912.0, ans=0.1 2023-06-25 22:07:07,941 INFO [train.py:996] (1/4) Epoch 8, batch 24700, loss[loss=0.2391, simple_loss=0.32, pruned_loss=0.07907, over 21395.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2899, pruned_loss=0.07022, over 4274985.75 frames. ], batch size: 471, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:07:24,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1429032.0, ans=0.125 2023-06-25 22:08:09,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1429152.0, ans=0.1 2023-06-25 22:08:17,095 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.664e+02 4.405e+02 6.289e+02 8.929e+02 2.025e+03, threshold=1.258e+03, percent-clipped=12.0 2023-06-25 22:08:33,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1429212.0, ans=0.0 2023-06-25 22:08:40,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. limit=10.0 2023-06-25 22:08:49,436 INFO [train.py:996] (1/4) Epoch 8, batch 24750, loss[loss=0.1695, simple_loss=0.2401, pruned_loss=0.0494, over 21338.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2828, pruned_loss=0.06739, over 4273895.66 frames. ], batch size: 131, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:09:39,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1429392.0, ans=0.025 2023-06-25 22:09:58,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1429452.0, ans=0.1 2023-06-25 22:10:18,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1429452.0, ans=0.2 2023-06-25 22:10:20,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=22.5 2023-06-25 22:10:37,858 INFO [train.py:996] (1/4) Epoch 8, batch 24800, loss[loss=0.2099, simple_loss=0.2785, pruned_loss=0.07065, over 21859.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2782, pruned_loss=0.06746, over 4272890.55 frames. ], batch size: 333, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:11:05,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. 
limit=15.0 2023-06-25 22:11:34,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1429692.0, ans=0.125 2023-06-25 22:11:49,238 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 4.217e+02 5.954e+02 8.314e+02 1.595e+03, threshold=1.191e+03, percent-clipped=9.0 2023-06-25 22:12:04,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.97 vs. limit=12.0 2023-06-25 22:12:05,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429812.0, ans=0.1 2023-06-25 22:12:20,371 INFO [train.py:996] (1/4) Epoch 8, batch 24850, loss[loss=0.2069, simple_loss=0.2884, pruned_loss=0.06273, over 21062.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2801, pruned_loss=0.06925, over 4281388.71 frames. ], batch size: 608, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:13:33,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1430052.0, ans=0.05 2023-06-25 22:13:40,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1430052.0, ans=0.0 2023-06-25 22:13:51,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-25 22:13:55,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5 2023-06-25 22:14:09,934 INFO [train.py:996] (1/4) Epoch 8, batch 24900, loss[loss=0.2343, simple_loss=0.3083, pruned_loss=0.08015, over 21622.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2825, pruned_loss=0.07009, over 4277885.32 frames. ], batch size: 230, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:15:25,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1430352.0, ans=0.125 2023-06-25 22:15:31,582 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.069e+02 4.057e+02 5.546e+02 7.694e+02 2.051e+03, threshold=1.109e+03, percent-clipped=6.0 2023-06-25 22:15:38,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-25 22:15:58,275 INFO [train.py:996] (1/4) Epoch 8, batch 24950, loss[loss=0.2321, simple_loss=0.3103, pruned_loss=0.07694, over 21669.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2904, pruned_loss=0.07448, over 4280443.32 frames. ], batch size: 263, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:16:19,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-25 22:17:28,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1430712.0, ans=0.1 2023-06-25 22:17:46,834 INFO [train.py:996] (1/4) Epoch 8, batch 25000, loss[loss=0.198, simple_loss=0.2678, pruned_loss=0.06414, over 21704.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2973, pruned_loss=0.07577, over 4288048.24 frames. 
], batch size: 282, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:18:20,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1430832.0, ans=0.1 2023-06-25 22:18:31,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1430832.0, ans=15.0 2023-06-25 22:18:44,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1430892.0, ans=0.0 2023-06-25 22:18:53,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-25 22:18:56,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-25 22:19:07,611 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.298e+02 4.363e+02 6.743e+02 9.687e+02 1.962e+03, threshold=1.349e+03, percent-clipped=15.0 2023-06-25 22:19:17,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=22.5 2023-06-25 22:19:23,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1431012.0, ans=0.125 2023-06-25 22:19:24,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1431012.0, ans=0.1 2023-06-25 22:19:32,750 INFO [train.py:996] (1/4) Epoch 8, batch 25050, loss[loss=0.2212, simple_loss=0.2809, pruned_loss=0.08074, over 21159.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2909, pruned_loss=0.07393, over 4274179.47 frames. ], batch size: 143, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:19:47,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1431072.0, ans=0.09899494936611666 2023-06-25 22:19:54,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1431132.0, ans=0.0 2023-06-25 22:21:07,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-25 22:21:19,821 INFO [train.py:996] (1/4) Epoch 8, batch 25100, loss[loss=0.2115, simple_loss=0.2695, pruned_loss=0.07678, over 21794.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2846, pruned_loss=0.07242, over 4275332.98 frames. ], batch size: 102, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:21:47,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1431432.0, ans=0.1 2023-06-25 22:22:41,590 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.362e+02 5.445e+02 8.840e+02 1.769e+03, threshold=1.089e+03, percent-clipped=5.0 2023-06-25 22:23:07,158 INFO [train.py:996] (1/4) Epoch 8, batch 25150, loss[loss=0.2109, simple_loss=0.3029, pruned_loss=0.05946, over 21391.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2892, pruned_loss=0.07086, over 4275113.01 frames. 
], batch size: 194, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:23:14,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1431672.0, ans=0.04949747468305833 2023-06-25 22:23:26,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1431672.0, ans=0.0 2023-06-25 22:24:17,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1431792.0, ans=0.125 2023-06-25 22:24:43,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1431912.0, ans=0.2 2023-06-25 22:24:55,097 INFO [train.py:996] (1/4) Epoch 8, batch 25200, loss[loss=0.1971, simple_loss=0.2929, pruned_loss=0.05066, over 21836.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2892, pruned_loss=0.06862, over 4274764.28 frames. ], batch size: 316, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:25:22,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-25 22:25:44,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1432092.0, ans=0.125 2023-06-25 22:25:57,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-25 22:26:18,384 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.791e+02 3.750e+02 5.347e+02 7.396e+02 1.859e+03, threshold=1.069e+03, percent-clipped=8.0 2023-06-25 22:26:27,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1432212.0, ans=0.125 2023-06-25 22:26:29,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1432212.0, ans=0.125 2023-06-25 22:26:41,762 INFO [train.py:996] (1/4) Epoch 8, batch 25250, loss[loss=0.2044, simple_loss=0.2872, pruned_loss=0.06078, over 15691.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2871, pruned_loss=0.06697, over 4273713.98 frames. ], batch size: 60, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:26:44,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432272.0, ans=0.1 2023-06-25 22:27:15,741 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:27:49,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1432392.0, ans=0.07 2023-06-25 22:28:16,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-06-25 22:28:29,122 INFO [train.py:996] (1/4) Epoch 8, batch 25300, loss[loss=0.2142, simple_loss=0.2906, pruned_loss=0.06888, over 21733.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2846, pruned_loss=0.06633, over 4264819.69 frames. 
], batch size: 298, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:29:24,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1432692.0, ans=0.125 2023-06-25 22:29:27,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2023-06-25 22:29:45,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1432752.0, ans=0.1 2023-06-25 22:29:53,832 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.880e+02 4.048e+02 5.397e+02 7.800e+02 1.560e+03, threshold=1.079e+03, percent-clipped=8.0 2023-06-25 22:30:17,479 INFO [train.py:996] (1/4) Epoch 8, batch 25350, loss[loss=0.225, simple_loss=0.3254, pruned_loss=0.06234, over 21823.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2859, pruned_loss=0.06598, over 4251823.37 frames. ], batch size: 371, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:30:55,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432932.0, ans=0.1 2023-06-25 22:31:19,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432992.0, ans=0.1 2023-06-25 22:31:38,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1433052.0, ans=0.1 2023-06-25 22:31:46,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1433112.0, ans=0.1 2023-06-25 22:31:59,563 INFO [train.py:996] (1/4) Epoch 8, batch 25400, loss[loss=0.2083, simple_loss=0.2816, pruned_loss=0.06745, over 19883.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2813, pruned_loss=0.06529, over 4253528.13 frames. ], batch size: 703, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:32:03,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1433172.0, ans=0.04949747468305833 2023-06-25 22:32:40,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.77 vs. limit=15.0 2023-06-25 22:33:10,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1433352.0, ans=0.2 2023-06-25 22:33:21,728 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.049e+02 4.073e+02 6.227e+02 9.020e+02 1.627e+03, threshold=1.245e+03, percent-clipped=13.0 2023-06-25 22:33:30,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1433412.0, ans=0.125 2023-06-25 22:33:44,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1433472.0, ans=0.1 2023-06-25 22:33:45,981 INFO [train.py:996] (1/4) Epoch 8, batch 25450, loss[loss=0.1809, simple_loss=0.2389, pruned_loss=0.06147, over 17092.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2822, pruned_loss=0.0667, over 4248173.38 frames. 
], batch size: 64, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:33:46,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1433472.0, ans=0.125 2023-06-25 22:34:41,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1433592.0, ans=0.2 2023-06-25 22:34:51,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1433592.0, ans=0.125 2023-06-25 22:35:10,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1433652.0, ans=0.125 2023-06-25 22:35:25,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-25 22:35:31,609 INFO [train.py:996] (1/4) Epoch 8, batch 25500, loss[loss=0.2496, simple_loss=0.3396, pruned_loss=0.07976, over 21655.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2834, pruned_loss=0.06386, over 4250178.58 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:36:12,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1433832.0, ans=0.0 2023-06-25 22:36:19,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1433832.0, ans=0.125 2023-06-25 22:36:46,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1433952.0, ans=0.125 2023-06-25 22:36:46,762 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:36:56,876 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.763e+02 3.870e+02 4.829e+02 7.230e+02 1.638e+03, threshold=9.659e+02, percent-clipped=1.0 2023-06-25 22:36:59,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-25 22:37:21,633 INFO [train.py:996] (1/4) Epoch 8, batch 25550, loss[loss=0.2206, simple_loss=0.322, pruned_loss=0.05962, over 21858.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2906, pruned_loss=0.06471, over 4256500.23 frames. ], batch size: 371, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:37:32,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1434072.0, ans=0.125 2023-06-25 22:37:37,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.36 vs. 
limit=15.0 2023-06-25 22:38:33,823 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:38:42,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1434252.0, ans=0.2 2023-06-25 22:38:49,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1434252.0, ans=0.0 2023-06-25 22:39:19,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1434372.0, ans=0.09899494936611666 2023-06-25 22:39:20,117 INFO [train.py:996] (1/4) Epoch 8, batch 25600, loss[loss=0.2644, simple_loss=0.3401, pruned_loss=0.09437, over 21785.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2949, pruned_loss=0.06536, over 4256964.17 frames. ], batch size: 118, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:39:49,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1434432.0, ans=0.1 2023-06-25 22:40:21,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.80 vs. limit=22.5 2023-06-25 22:40:31,973 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.098e+02 4.217e+02 6.682e+02 9.360e+02 1.950e+03, threshold=1.336e+03, percent-clipped=22.0 2023-06-25 22:40:53,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1434612.0, ans=0.07 2023-06-25 22:40:53,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1434612.0, ans=0.125 2023-06-25 22:41:11,568 INFO [train.py:996] (1/4) Epoch 8, batch 25650, loss[loss=0.2432, simple_loss=0.3, pruned_loss=0.09324, over 21292.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2961, pruned_loss=0.06855, over 4260211.98 frames. ], batch size: 471, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:42:38,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1434912.0, ans=15.0 2023-06-25 22:42:50,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1434912.0, ans=0.0 2023-06-25 22:42:58,589 INFO [train.py:996] (1/4) Epoch 8, batch 25700, loss[loss=0.2422, simple_loss=0.315, pruned_loss=0.08466, over 21857.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2932, pruned_loss=0.06977, over 4253635.26 frames. 
], batch size: 124, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:43:02,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1434972.0, ans=0.125 2023-06-25 22:43:28,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1435032.0, ans=0.04949747468305833 2023-06-25 22:44:06,647 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.951e+02 3.970e+02 5.194e+02 7.142e+02 1.504e+03, threshold=1.039e+03, percent-clipped=1.0 2023-06-25 22:44:31,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1435212.0, ans=0.0 2023-06-25 22:44:40,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-06-25 22:44:52,936 INFO [train.py:996] (1/4) Epoch 8, batch 25750, loss[loss=0.2569, simple_loss=0.3397, pruned_loss=0.08708, over 21580.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2975, pruned_loss=0.07207, over 4257618.95 frames. ], batch size: 230, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:45:25,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435332.0, ans=0.1 2023-06-25 22:46:32,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0 2023-06-25 22:46:45,449 INFO [train.py:996] (1/4) Epoch 8, batch 25800, loss[loss=0.2381, simple_loss=0.3171, pruned_loss=0.07961, over 21731.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3103, pruned_loss=0.07636, over 4254080.69 frames. ], batch size: 332, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:47:47,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1435692.0, ans=0.2 2023-06-25 22:48:11,403 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.235e+02 4.952e+02 6.520e+02 9.122e+02 2.118e+03, threshold=1.304e+03, percent-clipped=17.0 2023-06-25 22:48:11,935 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:48:20,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1435812.0, ans=0.5 2023-06-25 22:48:33,932 INFO [train.py:996] (1/4) Epoch 8, batch 25850, loss[loss=0.2444, simple_loss=0.3083, pruned_loss=0.0903, over 21403.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.311, pruned_loss=0.0756, over 4265483.23 frames. ], batch size: 144, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:48:39,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1435872.0, ans=0.125 2023-06-25 22:48:39,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1435872.0, ans=0.125 2023-06-25 22:49:06,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1435932.0, ans=0.125 2023-06-25 22:49:58,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. 
limit=22.5 2023-06-25 22:50:01,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1436052.0, ans=0.125 2023-06-25 22:50:03,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1436052.0, ans=0.125 2023-06-25 22:50:23,327 INFO [train.py:996] (1/4) Epoch 8, batch 25900, loss[loss=0.2189, simple_loss=0.3111, pruned_loss=0.06334, over 19958.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3124, pruned_loss=0.07654, over 4273959.31 frames. ], batch size: 702, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:50:25,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1436172.0, ans=0.1 2023-06-25 22:51:02,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1436232.0, ans=0.0 2023-06-25 22:51:04,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1436232.0, ans=0.125 2023-06-25 22:51:06,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1436232.0, ans=0.0 2023-06-25 22:51:19,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1436292.0, ans=0.125 2023-06-25 22:51:43,653 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.598e+02 5.216e+02 8.298e+02 1.003e+03 1.891e+03, threshold=1.660e+03, percent-clipped=7.0 2023-06-25 22:52:03,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1436412.0, ans=0.125 2023-06-25 22:52:06,619 INFO [train.py:996] (1/4) Epoch 8, batch 25950, loss[loss=0.3241, simple_loss=0.3722, pruned_loss=0.138, over 21368.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.319, pruned_loss=0.07924, over 4276785.79 frames. ], batch size: 507, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 22:52:56,871 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:53:00,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1436592.0, ans=0.125 2023-06-25 22:53:01,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1436592.0, ans=0.09899494936611666 2023-06-25 22:53:08,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1436592.0, ans=0.1 2023-06-25 22:53:20,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1436652.0, ans=0.0 2023-06-25 22:53:58,762 INFO [train.py:996] (1/4) Epoch 8, batch 26000, loss[loss=0.2227, simple_loss=0.3046, pruned_loss=0.07034, over 21350.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3192, pruned_loss=0.07827, over 4280740.97 frames. 
], batch size: 159, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:54:26,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1436832.0, ans=0.125 2023-06-25 22:55:20,390 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.951e+02 4.119e+02 5.246e+02 6.904e+02 1.299e+03, threshold=1.049e+03, percent-clipped=0.0 2023-06-25 22:55:47,886 INFO [train.py:996] (1/4) Epoch 8, batch 26050, loss[loss=0.2277, simple_loss=0.2945, pruned_loss=0.08042, over 21493.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3185, pruned_loss=0.0792, over 4276412.31 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:56:39,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-25 22:57:04,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437252.0, ans=0.1 2023-06-25 22:57:28,593 INFO [train.py:996] (1/4) Epoch 8, batch 26100, loss[loss=0.207, simple_loss=0.278, pruned_loss=0.06794, over 21950.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3128, pruned_loss=0.07815, over 4275040.93 frames. ], batch size: 316, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:57:42,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1437372.0, ans=0.125 2023-06-25 22:57:45,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437372.0, ans=0.1 2023-06-25 22:57:55,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-25 22:58:09,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1437432.0, ans=0.04949747468305833 2023-06-25 22:58:20,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-25 22:58:44,089 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.195e+02 4.438e+02 5.651e+02 7.112e+02 1.480e+03, threshold=1.130e+03, percent-clipped=4.0 2023-06-25 22:59:22,553 INFO [train.py:996] (1/4) Epoch 8, batch 26150, loss[loss=0.2197, simple_loss=0.2921, pruned_loss=0.07364, over 21734.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3088, pruned_loss=0.07719, over 4271553.35 frames. ], batch size: 332, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:01:12,284 INFO [train.py:996] (1/4) Epoch 8, batch 26200, loss[loss=0.2198, simple_loss=0.3366, pruned_loss=0.05151, over 20844.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3086, pruned_loss=0.07535, over 4269472.12 frames. ], batch size: 608, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:01:15,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.76 vs. 
limit=15.0 2023-06-25 23:01:21,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1437972.0, ans=0.0 2023-06-25 23:01:26,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437972.0, ans=0.1 2023-06-25 23:01:28,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1438032.0, ans=0.0 2023-06-25 23:01:29,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1438032.0, ans=0.025 2023-06-25 23:01:47,766 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-25 23:02:07,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=22.5 2023-06-25 23:02:23,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.227e+02 4.470e+02 5.888e+02 8.750e+02 1.495e+03, threshold=1.178e+03, percent-clipped=8.0 2023-06-25 23:02:43,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1438212.0, ans=0.125 2023-06-25 23:02:55,428 INFO [train.py:996] (1/4) Epoch 8, batch 26250, loss[loss=0.2563, simple_loss=0.336, pruned_loss=0.08834, over 21877.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3117, pruned_loss=0.07433, over 4269710.19 frames. ], batch size: 107, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:03:15,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1438332.0, ans=0.1 2023-06-25 23:03:55,573 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:04:36,263 INFO [train.py:996] (1/4) Epoch 8, batch 26300, loss[loss=0.2448, simple_loss=0.3097, pruned_loss=0.08996, over 21461.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.309, pruned_loss=0.07504, over 4272694.99 frames. ], batch size: 144, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:05:00,387 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.60 vs. limit=15.0 2023-06-25 23:06:03,912 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.391e+02 4.218e+02 5.396e+02 7.440e+02 1.508e+03, threshold=1.079e+03, percent-clipped=2.0 2023-06-25 23:06:18,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1438812.0, ans=0.125 2023-06-25 23:06:24,581 INFO [train.py:996] (1/4) Epoch 8, batch 26350, loss[loss=0.2741, simple_loss=0.3531, pruned_loss=0.0975, over 21783.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3074, pruned_loss=0.07552, over 4273126.60 frames. ], batch size: 124, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:06:27,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-25 23:07:44,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. 
limit=6.0 2023-06-25 23:07:50,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1439052.0, ans=0.125 2023-06-25 23:08:11,343 INFO [train.py:996] (1/4) Epoch 8, batch 26400, loss[loss=0.2035, simple_loss=0.2721, pruned_loss=0.06745, over 21133.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3014, pruned_loss=0.0756, over 4270360.63 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:08:56,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1439292.0, ans=0.1 2023-06-25 23:09:28,636 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-25 23:09:36,048 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.025e+02 5.044e+02 7.451e+02 1.741e+03, threshold=1.009e+03, percent-clipped=9.0 2023-06-25 23:09:36,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1439352.0, ans=0.125 2023-06-25 23:09:57,665 INFO [train.py:996] (1/4) Epoch 8, batch 26450, loss[loss=0.2492, simple_loss=0.3533, pruned_loss=0.07259, over 21772.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3011, pruned_loss=0.07531, over 4267664.67 frames. ], batch size: 351, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:09:58,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1439472.0, ans=0.035 2023-06-25 23:10:08,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1439472.0, ans=0.125 2023-06-25 23:11:07,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-25 23:11:30,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1439712.0, ans=0.1 2023-06-25 23:11:44,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1439712.0, ans=0.025 2023-06-25 23:11:48,698 INFO [train.py:996] (1/4) Epoch 8, batch 26500, loss[loss=0.2541, simple_loss=0.3427, pruned_loss=0.08273, over 21687.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3049, pruned_loss=0.07437, over 4262601.31 frames. ], batch size: 414, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:11:58,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1439772.0, ans=0.0 2023-06-25 23:12:51,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1439892.0, ans=0.125 2023-06-25 23:13:23,155 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.879e+02 4.514e+02 6.896e+02 1.400e+03 2.768e+03, threshold=1.379e+03, percent-clipped=34.0 2023-06-25 23:13:29,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1440012.0, ans=0.1 2023-06-25 23:13:53,896 INFO [train.py:996] (1/4) Epoch 8, batch 26550, loss[loss=0.1831, simple_loss=0.2596, pruned_loss=0.05329, over 21537.00 frames. 
], tot_loss[loss=0.2232, simple_loss=0.3018, pruned_loss=0.07228, over 4260722.56 frames. ], batch size: 195, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:13:56,701 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-25 23:14:57,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1440252.0, ans=0.125 2023-06-25 23:14:58,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.63 vs. limit=12.0 2023-06-25 23:15:12,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1440312.0, ans=0.125 2023-06-25 23:15:15,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-25 23:15:33,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1440312.0, ans=0.125 2023-06-25 23:15:47,292 INFO [train.py:996] (1/4) Epoch 8, batch 26600, loss[loss=0.2011, simple_loss=0.2715, pruned_loss=0.06537, over 21590.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3, pruned_loss=0.06968, over 4249689.27 frames. ], batch size: 247, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:15:58,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1440372.0, ans=0.0 2023-06-25 23:16:30,451 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-25 23:16:31,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1440492.0, ans=0.0 2023-06-25 23:16:31,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1440492.0, ans=0.125 2023-06-25 23:16:57,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1440552.0, ans=0.1 2023-06-25 23:17:00,242 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.847e+02 4.407e+02 5.733e+02 8.512e+02 1.391e+03, threshold=1.147e+03, percent-clipped=1.0 2023-06-25 23:17:00,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1440612.0, ans=0.125 2023-06-25 23:17:35,763 INFO [train.py:996] (1/4) Epoch 8, batch 26650, loss[loss=0.1511, simple_loss=0.2335, pruned_loss=0.03434, over 21417.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2936, pruned_loss=0.06866, over 4259124.76 frames. ], batch size: 194, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:17:36,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1440672.0, ans=0.2 2023-06-25 23:18:00,731 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.99 vs. limit=22.5 2023-06-25 23:18:07,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.15 vs. 
limit=22.5 2023-06-25 23:18:20,510 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:18:27,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-06-25 23:19:14,221 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-25 23:19:15,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1440912.0, ans=0.125 2023-06-25 23:19:18,182 INFO [train.py:996] (1/4) Epoch 8, batch 26700, loss[loss=0.2209, simple_loss=0.2991, pruned_loss=0.07138, over 21753.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2859, pruned_loss=0.06553, over 4262002.60 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:20:37,582 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 3.824e+02 5.567e+02 8.569e+02 1.745e+03, threshold=1.113e+03, percent-clipped=13.0 2023-06-25 23:20:38,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1441212.0, ans=0.125 2023-06-25 23:21:01,556 INFO [train.py:996] (1/4) Epoch 8, batch 26750, loss[loss=0.2678, simple_loss=0.3583, pruned_loss=0.08864, over 21471.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2867, pruned_loss=0.0645, over 4261481.99 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:21:37,477 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-25 23:21:42,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1441392.0, ans=0.125 2023-06-25 23:22:07,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1441452.0, ans=0.125 2023-06-25 23:22:24,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-06-25 23:22:40,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1441512.0, ans=10.0 2023-06-25 23:22:46,501 INFO [train.py:996] (1/4) Epoch 8, batch 26800, loss[loss=0.2348, simple_loss=0.3057, pruned_loss=0.08194, over 20679.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2934, pruned_loss=0.06792, over 4263206.68 frames. ], batch size: 607, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:23:23,048 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:23:54,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2023-06-25 23:24:14,275 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.253e+02 4.422e+02 6.215e+02 9.798e+02 1.990e+03, threshold=1.243e+03, percent-clipped=9.0 2023-06-25 23:24:38,129 INFO [train.py:996] (1/4) Epoch 8, batch 26850, loss[loss=0.1997, simple_loss=0.2806, pruned_loss=0.0594, over 20108.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2936, pruned_loss=0.06983, over 4265150.18 frames. 
], batch size: 703, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:25:06,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-25 23:25:22,671 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.33 vs. limit=22.5 2023-06-25 23:26:15,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1442112.0, ans=0.125 2023-06-25 23:26:20,005 INFO [train.py:996] (1/4) Epoch 8, batch 26900, loss[loss=0.2295, simple_loss=0.2677, pruned_loss=0.09565, over 21553.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2871, pruned_loss=0.07001, over 4258819.71 frames. ], batch size: 512, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:26:27,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1442172.0, ans=0.125 2023-06-25 23:26:27,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1442172.0, ans=0.125 2023-06-25 23:26:39,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1442232.0, ans=0.0 2023-06-25 23:26:43,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442232.0, ans=0.1 2023-06-25 23:27:09,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=15.0 2023-06-25 23:27:40,232 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 3.925e+02 6.896e+02 1.001e+03 2.184e+03, threshold=1.379e+03, percent-clipped=14.0 2023-06-25 23:28:02,743 INFO [train.py:996] (1/4) Epoch 8, batch 26950, loss[loss=0.2476, simple_loss=0.3426, pruned_loss=0.0763, over 21200.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2862, pruned_loss=0.07023, over 4258088.37 frames. ], batch size: 548, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:28:17,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1442472.0, ans=0.2 2023-06-25 23:28:35,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=12.0 2023-06-25 23:29:49,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442712.0, ans=0.1 2023-06-25 23:29:51,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.16 vs. limit=5.0 2023-06-25 23:29:52,148 INFO [train.py:996] (1/4) Epoch 8, batch 27000, loss[loss=0.1967, simple_loss=0.2943, pruned_loss=0.04954, over 21702.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.287, pruned_loss=0.06859, over 4260800.52 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:29:52,148 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 23:30:10,463 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2506, simple_loss=0.341, pruned_loss=0.08006, over 1796401.00 frames. 
2023-06-25 23:30:10,464 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-25 23:31:05,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.03 vs. limit=12.0 2023-06-25 23:31:17,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1442952.0, ans=0.0 2023-06-25 23:31:20,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1442952.0, ans=10.0 2023-06-25 23:31:32,476 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.635e+02 4.043e+02 5.265e+02 7.888e+02 2.132e+03, threshold=1.053e+03, percent-clipped=7.0 2023-06-25 23:31:39,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1443012.0, ans=0.125 2023-06-25 23:31:49,547 INFO [train.py:996] (1/4) Epoch 8, batch 27050, loss[loss=0.2211, simple_loss=0.3084, pruned_loss=0.06694, over 21382.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2898, pruned_loss=0.06574, over 4264510.79 frames. ], batch size: 176, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:32:42,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1443192.0, ans=0.125 2023-06-25 23:33:38,796 INFO [train.py:996] (1/4) Epoch 8, batch 27100, loss[loss=0.2275, simple_loss=0.3283, pruned_loss=0.06331, over 21838.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2926, pruned_loss=0.06578, over 4264404.22 frames. ], batch size: 371, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:34:29,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1443492.0, ans=0.0 2023-06-25 23:34:38,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1443492.0, ans=0.1 2023-06-25 23:34:50,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1443492.0, ans=0.07 2023-06-25 23:35:11,011 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.566e+02 6.448e+02 9.782e+02 2.509e+03, threshold=1.290e+03, percent-clipped=22.0 2023-06-25 23:35:33,786 INFO [train.py:996] (1/4) Epoch 8, batch 27150, loss[loss=0.2237, simple_loss=0.3155, pruned_loss=0.06589, over 21392.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3019, pruned_loss=0.06847, over 4273861.12 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:36:11,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1443732.0, ans=0.0 2023-06-25 23:37:25,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1443912.0, ans=0.0 2023-06-25 23:37:28,319 INFO [train.py:996] (1/4) Epoch 8, batch 27200, loss[loss=0.2825, simple_loss=0.3954, pruned_loss=0.08482, over 20718.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3104, pruned_loss=0.07099, over 4275002.18 frames. 
], batch size: 607, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:38:16,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1444092.0, ans=0.125 2023-06-25 23:38:26,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1444152.0, ans=0.125 2023-06-25 23:39:01,337 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.411e+02 4.854e+02 6.757e+02 9.648e+02 1.735e+03, threshold=1.351e+03, percent-clipped=9.0 2023-06-25 23:39:18,853 INFO [train.py:996] (1/4) Epoch 8, batch 27250, loss[loss=0.2395, simple_loss=0.3131, pruned_loss=0.08297, over 21940.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3127, pruned_loss=0.07455, over 4277876.45 frames. ], batch size: 372, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:39:31,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-25 23:39:45,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1444332.0, ans=0.125 2023-06-25 23:40:00,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1444392.0, ans=0.0 2023-06-25 23:40:13,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1444392.0, ans=0.0 2023-06-25 23:41:13,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1444572.0, ans=0.125 2023-06-25 23:41:14,465 INFO [train.py:996] (1/4) Epoch 8, batch 27300, loss[loss=0.2786, simple_loss=0.3571, pruned_loss=0.1001, over 21486.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3146, pruned_loss=0.07585, over 4273777.09 frames. ], batch size: 471, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:41:18,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-25 23:41:20,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-25 23:41:37,027 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0 2023-06-25 23:41:41,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1444632.0, ans=0.125 2023-06-25 23:42:43,062 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.436e+02 5.757e+02 8.260e+02 1.524e+03, threshold=1.151e+03, percent-clipped=4.0 2023-06-25 23:43:03,201 INFO [train.py:996] (1/4) Epoch 8, batch 27350, loss[loss=0.2153, simple_loss=0.2944, pruned_loss=0.06812, over 21794.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3168, pruned_loss=0.07654, over 4277667.72 frames. 
], batch size: 247, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:43:25,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1444932.0, ans=0.125 2023-06-25 23:43:27,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1444932.0, ans=0.125 2023-06-25 23:44:01,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1444992.0, ans=0.04949747468305833 2023-06-25 23:44:08,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1444992.0, ans=0.125 2023-06-25 23:44:50,140 INFO [train.py:996] (1/4) Epoch 8, batch 27400, loss[loss=0.2519, simple_loss=0.2966, pruned_loss=0.1036, over 21458.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3121, pruned_loss=0.07635, over 4282639.97 frames. ], batch size: 508, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:45:22,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1445232.0, ans=0.1 2023-06-25 23:45:45,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-25 23:45:58,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1445352.0, ans=0.2 2023-06-25 23:46:14,719 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.145e+02 3.925e+02 4.930e+02 6.414e+02 1.207e+03, threshold=9.861e+02, percent-clipped=2.0 2023-06-25 23:46:33,504 INFO [train.py:996] (1/4) Epoch 8, batch 27450, loss[loss=0.2447, simple_loss=0.3256, pruned_loss=0.08189, over 21417.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3059, pruned_loss=0.07473, over 4290728.41 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:46:37,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1445472.0, ans=0.025 2023-06-25 23:47:28,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1445592.0, ans=0.0 2023-06-25 23:48:18,948 INFO [train.py:996] (1/4) Epoch 8, batch 27500, loss[loss=0.2138, simple_loss=0.2926, pruned_loss=0.06752, over 21946.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3043, pruned_loss=0.07466, over 4295917.74 frames. 
], batch size: 316, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:48:38,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1445832.0, ans=0.0 2023-06-25 23:49:05,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1445892.0, ans=0.125 2023-06-25 23:49:08,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1445892.0, ans=0.125 2023-06-25 23:49:33,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1445952.0, ans=0.0 2023-06-25 23:49:42,666 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.981e+02 3.778e+02 4.835e+02 6.283e+02 1.305e+03, threshold=9.670e+02, percent-clipped=1.0 2023-06-25 23:50:01,309 INFO [train.py:996] (1/4) Epoch 8, batch 27550, loss[loss=0.2269, simple_loss=0.2859, pruned_loss=0.0839, over 21519.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2994, pruned_loss=0.07207, over 4286969.27 frames. ], batch size: 441, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:51:34,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1446312.0, ans=0.0 2023-06-25 23:51:44,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1446312.0, ans=0.2 2023-06-25 23:51:49,460 INFO [train.py:996] (1/4) Epoch 8, batch 27600, loss[loss=0.214, simple_loss=0.281, pruned_loss=0.07348, over 21547.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2951, pruned_loss=0.07167, over 4281074.25 frames. ], batch size: 391, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:52:05,973 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-25 23:53:11,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446612.0, ans=0.1 2023-06-25 23:53:16,405 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.997e+02 3.759e+02 4.592e+02 6.391e+02 1.970e+03, threshold=9.184e+02, percent-clipped=8.0 2023-06-25 23:53:34,781 INFO [train.py:996] (1/4) Epoch 8, batch 27650, loss[loss=0.204, simple_loss=0.2768, pruned_loss=0.06561, over 21921.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2892, pruned_loss=0.07133, over 4270326.10 frames. ], batch size: 107, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:53:57,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1446732.0, ans=0.125 2023-06-25 23:54:13,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1446732.0, ans=0.0 2023-06-25 23:54:45,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1446852.0, ans=0.125 2023-06-25 23:54:57,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1446852.0, ans=0.125 2023-06-25 23:55:22,850 INFO [train.py:996] (1/4) Epoch 8, batch 27700, loss[loss=0.2359, simple_loss=0.3241, pruned_loss=0.07388, over 21713.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2905, pruned_loss=0.07007, over 4272328.46 frames. 
], batch size: 298, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:55:27,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.79 vs. limit=15.0 2023-06-25 23:56:02,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1447032.0, ans=0.2 2023-06-25 23:56:13,307 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2023-06-25 23:56:56,236 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.091e+02 3.950e+02 5.187e+02 7.067e+02 1.545e+03, threshold=1.037e+03, percent-clipped=11.0 2023-06-25 23:57:09,965 INFO [train.py:996] (1/4) Epoch 8, batch 27750, loss[loss=0.2005, simple_loss=0.2572, pruned_loss=0.07191, over 20224.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2929, pruned_loss=0.06942, over 4272634.40 frames. ], batch size: 703, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:57:39,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1447332.0, ans=0.125 2023-06-25 23:58:12,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-25 23:58:54,841 INFO [train.py:996] (1/4) Epoch 8, batch 27800, loss[loss=0.2149, simple_loss=0.2835, pruned_loss=0.07314, over 21992.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2907, pruned_loss=0.06918, over 4270519.49 frames. ], batch size: 373, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:59:10,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1447632.0, ans=0.125 2023-06-25 23:59:49,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.29 vs. limit=10.0 2023-06-26 00:00:07,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.83 vs. limit=22.5 2023-06-26 00:00:08,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1447752.0, ans=0.0 2023-06-26 00:00:23,977 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.743e+02 4.274e+02 5.854e+02 7.453e+02 1.495e+03, threshold=1.171e+03, percent-clipped=16.0 2023-06-26 00:00:40,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-26 00:00:42,957 INFO [train.py:996] (1/4) Epoch 8, batch 27850, loss[loss=0.2338, simple_loss=0.2969, pruned_loss=0.0853, over 21338.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2894, pruned_loss=0.07015, over 4280617.65 frames. ], batch size: 159, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:01:49,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1447992.0, ans=0.1 2023-06-26 00:02:38,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=6.0 2023-06-26 00:02:39,165 INFO [train.py:996] (1/4) Epoch 8, batch 27900, loss[loss=0.1834, simple_loss=0.2661, pruned_loss=0.05037, over 16561.00 frames. 
], tot_loss[loss=0.2202, simple_loss=0.2977, pruned_loss=0.07135, over 4274438.34 frames. ], batch size: 60, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:02:58,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1448172.0, ans=0.0 2023-06-26 00:03:14,391 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:03:21,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1448292.0, ans=0.0 2023-06-26 00:03:24,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1448292.0, ans=0.0 2023-06-26 00:03:24,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1448292.0, ans=0.125 2023-06-26 00:03:45,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1448352.0, ans=0.125 2023-06-26 00:03:55,613 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-26 00:03:58,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1448352.0, ans=0.1 2023-06-26 00:04:15,993 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.747e+02 3.981e+02 4.843e+02 6.105e+02 1.501e+03, threshold=9.685e+02, percent-clipped=1.0 2023-06-26 00:04:35,188 INFO [train.py:996] (1/4) Epoch 8, batch 27950, loss[loss=0.2382, simple_loss=0.3482, pruned_loss=0.06408, over 20001.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2985, pruned_loss=0.06848, over 4269610.13 frames. ], batch size: 703, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:04:56,844 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:05:00,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1448532.0, ans=0.0 2023-06-26 00:06:22,308 INFO [train.py:996] (1/4) Epoch 8, batch 28000, loss[loss=0.2042, simple_loss=0.2727, pruned_loss=0.06786, over 21860.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2975, pruned_loss=0.06633, over 4281951.37 frames. ], batch size: 107, lr: 3.63e-03, grad_scale: 32.0 2023-06-26 00:07:47,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1449012.0, ans=0.125 2023-06-26 00:07:58,621 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 4.485e+02 6.487e+02 9.458e+02 1.758e+03, threshold=1.297e+03, percent-clipped=21.0 2023-06-26 00:08:10,950 INFO [train.py:996] (1/4) Epoch 8, batch 28050, loss[loss=0.1864, simple_loss=0.2538, pruned_loss=0.0595, over 21391.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2953, pruned_loss=0.06763, over 4282931.27 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:09:55,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1449312.0, ans=0.0 2023-06-26 00:09:57,867 INFO [train.py:996] (1/4) Epoch 8, batch 28100, loss[loss=0.1836, simple_loss=0.2518, pruned_loss=0.05768, over 21514.00 frames. 
], tot_loss[loss=0.2129, simple_loss=0.2916, pruned_loss=0.06712, over 4280230.76 frames. ], batch size: 195, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:10:31,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-26 00:10:39,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1449492.0, ans=0.2 2023-06-26 00:10:49,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1449492.0, ans=0.2 2023-06-26 00:11:11,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.71 vs. limit=6.0 2023-06-26 00:11:27,656 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 4.530e+02 6.783e+02 9.812e+02 2.062e+03, threshold=1.357e+03, percent-clipped=16.0 2023-06-26 00:11:40,013 INFO [train.py:996] (1/4) Epoch 8, batch 28150, loss[loss=0.2069, simple_loss=0.2689, pruned_loss=0.07239, over 21612.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2861, pruned_loss=0.06762, over 4285894.19 frames. ], batch size: 264, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:11:57,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1449672.0, ans=0.0 2023-06-26 00:12:21,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1449792.0, ans=0.0 2023-06-26 00:13:13,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1449912.0, ans=0.125 2023-06-26 00:13:26,709 INFO [train.py:996] (1/4) Epoch 8, batch 28200, loss[loss=0.2312, simple_loss=0.2967, pruned_loss=0.08283, over 21670.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2831, pruned_loss=0.06874, over 4281749.86 frames. ], batch size: 298, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:13:41,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1449972.0, ans=0.125 2023-06-26 00:13:43,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1449972.0, ans=0.125 2023-06-26 00:13:54,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1450032.0, ans=0.0 2023-06-26 00:14:56,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1450212.0, ans=0.125 2023-06-26 00:15:02,517 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.442e+02 4.547e+02 5.710e+02 8.432e+02 1.923e+03, threshold=1.142e+03, percent-clipped=7.0 2023-06-26 00:15:05,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=22.5 2023-06-26 00:15:14,924 INFO [train.py:996] (1/4) Epoch 8, batch 28250, loss[loss=0.1936, simple_loss=0.263, pruned_loss=0.06209, over 21852.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2867, pruned_loss=0.07178, over 4283067.69 frames. 
], batch size: 107, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:15:28,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1450272.0, ans=0.2 2023-06-26 00:15:45,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1450332.0, ans=0.125 2023-06-26 00:16:15,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1450392.0, ans=0.1 2023-06-26 00:16:24,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1450392.0, ans=0.1 2023-06-26 00:16:33,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=8.0 2023-06-26 00:16:59,716 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:17:04,117 INFO [train.py:996] (1/4) Epoch 8, batch 28300, loss[loss=0.1913, simple_loss=0.2829, pruned_loss=0.04986, over 21837.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2848, pruned_loss=0.06988, over 4272101.73 frames. ], batch size: 371, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:17:47,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1450632.0, ans=0.125 2023-06-26 00:18:19,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1450752.0, ans=0.1 2023-06-26 00:18:36,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1450812.0, ans=0.0 2023-06-26 00:18:39,089 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.826e+02 4.235e+02 6.929e+02 1.082e+03 2.013e+03, threshold=1.386e+03, percent-clipped=23.0 2023-06-26 00:18:56,494 INFO [train.py:996] (1/4) Epoch 8, batch 28350, loss[loss=0.2096, simple_loss=0.3259, pruned_loss=0.04668, over 19818.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2817, pruned_loss=0.06493, over 4269355.00 frames. ], batch size: 703, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:20:14,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1451052.0, ans=0.1 2023-06-26 00:20:14,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1451052.0, ans=0.125 2023-06-26 00:20:18,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1451052.0, ans=0.125 2023-06-26 00:20:23,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1451112.0, ans=0.125 2023-06-26 00:20:30,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1451112.0, ans=0.02 2023-06-26 00:20:43,837 INFO [train.py:996] (1/4) Epoch 8, batch 28400, loss[loss=0.2147, simple_loss=0.284, pruned_loss=0.07276, over 21638.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2793, pruned_loss=0.06477, over 4261923.14 frames. 
], batch size: 298, lr: 3.63e-03, grad_scale: 32.0 2023-06-26 00:21:28,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1451232.0, ans=0.1 2023-06-26 00:21:44,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=12.0 2023-06-26 00:21:47,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1451292.0, ans=0.125 2023-06-26 00:21:50,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1451352.0, ans=0.0 2023-06-26 00:22:20,865 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.356e+02 4.435e+02 6.673e+02 8.870e+02 1.776e+03, threshold=1.335e+03, percent-clipped=3.0 2023-06-26 00:22:31,523 INFO [train.py:996] (1/4) Epoch 8, batch 28450, loss[loss=0.2406, simple_loss=0.3189, pruned_loss=0.08115, over 21409.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2852, pruned_loss=0.06852, over 4264985.66 frames. ], batch size: 131, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:22:51,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-26 00:22:53,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1451472.0, ans=0.0 2023-06-26 00:23:01,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1451532.0, ans=0.2 2023-06-26 00:24:20,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1451712.0, ans=10.0 2023-06-26 00:24:30,330 INFO [train.py:996] (1/4) Epoch 8, batch 28500, loss[loss=0.2631, simple_loss=0.3328, pruned_loss=0.09671, over 21632.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2883, pruned_loss=0.07139, over 4279233.22 frames. ], batch size: 414, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:24:38,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1451772.0, ans=0.125 2023-06-26 00:25:29,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1451952.0, ans=0.0 2023-06-26 00:26:09,287 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.427e+02 4.818e+02 6.676e+02 8.470e+02 2.134e+03, threshold=1.335e+03, percent-clipped=3.0 2023-06-26 00:26:19,580 INFO [train.py:996] (1/4) Epoch 8, batch 28550, loss[loss=0.2523, simple_loss=0.3498, pruned_loss=0.07739, over 21550.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2955, pruned_loss=0.07349, over 4279445.82 frames. ], batch size: 230, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:27:17,344 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:27:49,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1452252.0, ans=0.125 2023-06-26 00:28:15,668 INFO [train.py:996] (1/4) Epoch 8, batch 28600, loss[loss=0.2726, simple_loss=0.3473, pruned_loss=0.09897, over 21545.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3028, pruned_loss=0.0754, over 4279786.49 frames. 
], batch size: 414, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:28:47,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1452432.0, ans=0.125 2023-06-26 00:29:47,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1452612.0, ans=0.1 2023-06-26 00:29:53,323 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.158e+02 4.451e+02 5.957e+02 7.529e+02 1.462e+03, threshold=1.191e+03, percent-clipped=3.0 2023-06-26 00:29:54,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1452612.0, ans=0.125 2023-06-26 00:29:54,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.90 vs. limit=10.0 2023-06-26 00:30:03,855 INFO [train.py:996] (1/4) Epoch 8, batch 28650, loss[loss=0.2322, simple_loss=0.293, pruned_loss=0.08572, over 21586.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2979, pruned_loss=0.07557, over 4272484.46 frames. ], batch size: 415, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:30:40,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0 2023-06-26 00:31:47,840 INFO [train.py:996] (1/4) Epoch 8, batch 28700, loss[loss=0.2441, simple_loss=0.3153, pruned_loss=0.08645, over 21245.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2966, pruned_loss=0.07614, over 4273730.47 frames. ], batch size: 143, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:31:52,378 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-26 00:32:03,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453032.0, ans=0.1 2023-06-26 00:32:26,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1453092.0, ans=0.125 2023-06-26 00:32:35,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-26 00:32:45,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1453092.0, ans=0.125 2023-06-26 00:32:52,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1453152.0, ans=0.125 2023-06-26 00:33:13,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1453212.0, ans=0.125 2023-06-26 00:33:19,798 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.264e+02 4.611e+02 5.755e+02 7.778e+02 1.501e+03, threshold=1.151e+03, percent-clipped=4.0 2023-06-26 00:33:20,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1453212.0, ans=0.0 2023-06-26 00:33:30,629 INFO [train.py:996] (1/4) Epoch 8, batch 28750, loss[loss=0.196, simple_loss=0.2873, pruned_loss=0.05232, over 21891.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2965, pruned_loss=0.07581, over 4286748.63 frames. 
], batch size: 316, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:33:35,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-26 00:35:00,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1453512.0, ans=0.0 2023-06-26 00:35:18,760 INFO [train.py:996] (1/4) Epoch 8, batch 28800, loss[loss=0.2318, simple_loss=0.3105, pruned_loss=0.07658, over 21919.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3, pruned_loss=0.07609, over 4289902.72 frames. ], batch size: 316, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:35:50,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-26 00:36:55,660 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.079e+02 4.504e+02 5.803e+02 7.798e+02 1.715e+03, threshold=1.161e+03, percent-clipped=9.0 2023-06-26 00:37:06,135 INFO [train.py:996] (1/4) Epoch 8, batch 28850, loss[loss=0.2339, simple_loss=0.3128, pruned_loss=0.07749, over 21830.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3017, pruned_loss=0.07792, over 4292530.40 frames. ], batch size: 107, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:37:11,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1453872.0, ans=0.2 2023-06-26 00:37:26,538 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-26 00:37:53,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1453932.0, ans=0.1 2023-06-26 00:38:18,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-26 00:39:02,756 INFO [train.py:996] (1/4) Epoch 8, batch 28900, loss[loss=0.3011, simple_loss=0.4102, pruned_loss=0.09604, over 19840.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3033, pruned_loss=0.07875, over 4289811.35 frames. ], batch size: 702, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:39:43,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1454232.0, ans=0.0 2023-06-26 00:40:22,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1454352.0, ans=0.2 2023-06-26 00:40:36,722 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.510e+02 4.525e+02 6.150e+02 8.317e+02 2.231e+03, threshold=1.230e+03, percent-clipped=10.0 2023-06-26 00:40:42,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1454412.0, ans=0.1 2023-06-26 00:40:57,567 INFO [train.py:996] (1/4) Epoch 8, batch 28950, loss[loss=0.2376, simple_loss=0.3442, pruned_loss=0.06553, over 21668.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3039, pruned_loss=0.07813, over 4286032.05 frames. 
], batch size: 414, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:41:16,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1454472.0, ans=0.2 2023-06-26 00:41:52,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1454592.0, ans=0.125 2023-06-26 00:42:07,840 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-06-26 00:42:52,300 INFO [train.py:996] (1/4) Epoch 8, batch 29000, loss[loss=0.2456, simple_loss=0.326, pruned_loss=0.08266, over 21757.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3071, pruned_loss=0.07699, over 4284518.86 frames. ], batch size: 332, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:44:18,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=8.0 2023-06-26 00:44:19,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1455012.0, ans=0.125 2023-06-26 00:44:25,066 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=12.0 2023-06-26 00:44:25,389 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.229e+02 4.694e+02 5.564e+02 8.456e+02 2.061e+03, threshold=1.113e+03, percent-clipped=6.0 2023-06-26 00:44:28,079 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:44:39,537 INFO [train.py:996] (1/4) Epoch 8, batch 29050, loss[loss=0.2591, simple_loss=0.3175, pruned_loss=0.1004, over 21722.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3059, pruned_loss=0.07783, over 4282740.40 frames. ], batch size: 473, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:44:52,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1455072.0, ans=0.0 2023-06-26 00:44:59,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1455072.0, ans=0.125 2023-06-26 00:45:14,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1455132.0, ans=0.125 2023-06-26 00:45:32,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1455192.0, ans=0.0 2023-06-26 00:46:27,380 INFO [train.py:996] (1/4) Epoch 8, batch 29100, loss[loss=0.1744, simple_loss=0.2275, pruned_loss=0.06067, over 20736.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2972, pruned_loss=0.07516, over 4285671.64 frames. ], batch size: 608, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:47:06,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1455492.0, ans=0.2 2023-06-26 00:48:06,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.913e+02 4.309e+02 6.273e+02 8.461e+02 1.678e+03, threshold=1.255e+03, percent-clipped=7.0 2023-06-26 00:48:15,345 INFO [train.py:996] (1/4) Epoch 8, batch 29150, loss[loss=0.2022, simple_loss=0.268, pruned_loss=0.06826, over 21963.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2967, pruned_loss=0.07321, over 4283672.49 frames. 
], batch size: 103, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:48:35,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.87 vs. limit=22.5 2023-06-26 00:49:26,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1455852.0, ans=0.1 2023-06-26 00:49:41,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1455912.0, ans=0.2 2023-06-26 00:50:08,215 INFO [train.py:996] (1/4) Epoch 8, batch 29200, loss[loss=0.2128, simple_loss=0.2919, pruned_loss=0.0669, over 21741.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2931, pruned_loss=0.0725, over 4281801.39 frames. ], batch size: 333, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:50:42,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1456092.0, ans=0.2 2023-06-26 00:51:08,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1456092.0, ans=0.0 2023-06-26 00:51:42,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.239e+02 4.282e+02 5.514e+02 8.024e+02 1.461e+03, threshold=1.103e+03, percent-clipped=3.0 2023-06-26 00:51:52,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.12 vs. limit=6.0 2023-06-26 00:51:56,521 INFO [train.py:996] (1/4) Epoch 8, batch 29250, loss[loss=0.2112, simple_loss=0.2846, pruned_loss=0.06885, over 21271.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2922, pruned_loss=0.07095, over 4273182.10 frames. ], batch size: 144, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:52:06,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-26 00:52:13,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-26 00:52:37,372 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:52:53,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-26 00:53:44,020 INFO [train.py:996] (1/4) Epoch 8, batch 29300, loss[loss=0.2008, simple_loss=0.2902, pruned_loss=0.05568, over 21528.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2931, pruned_loss=0.07048, over 4266130.52 frames. 
], batch size: 195, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:54:00,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1456632.0, ans=0.125 2023-06-26 00:54:06,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1456632.0, ans=0.0 2023-06-26 00:55:03,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1456752.0, ans=0.1 2023-06-26 00:55:25,848 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.817e+02 4.100e+02 5.558e+02 8.472e+02 2.092e+03, threshold=1.112e+03, percent-clipped=11.0 2023-06-26 00:55:32,607 INFO [train.py:996] (1/4) Epoch 8, batch 29350, loss[loss=0.2049, simple_loss=0.2791, pruned_loss=0.06533, over 21826.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2891, pruned_loss=0.06986, over 4266332.25 frames. ], batch size: 118, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:55:51,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1456932.0, ans=0.125 2023-06-26 00:56:22,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1456992.0, ans=0.125 2023-06-26 00:56:24,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1456992.0, ans=0.125 2023-06-26 00:56:36,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.29 vs. limit=15.0 2023-06-26 00:57:21,102 INFO [train.py:996] (1/4) Epoch 8, batch 29400, loss[loss=0.1311, simple_loss=0.1842, pruned_loss=0.03904, over 21379.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2858, pruned_loss=0.06737, over 4265050.66 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:57:31,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1457172.0, ans=0.1 2023-06-26 00:57:52,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-26 00:58:07,505 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-06-26 00:59:02,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.050e+02 4.516e+02 7.158e+02 1.067e+03 2.108e+03, threshold=1.432e+03, percent-clipped=22.0 2023-06-26 00:59:09,193 INFO [train.py:996] (1/4) Epoch 8, batch 29450, loss[loss=0.2285, simple_loss=0.3043, pruned_loss=0.07633, over 21811.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2857, pruned_loss=0.06674, over 4267485.55 frames. ], batch size: 333, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:59:49,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1457532.0, ans=0.1 2023-06-26 01:00:03,495 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:00:56,319 INFO [train.py:996] (1/4) Epoch 8, batch 29500, loss[loss=0.2327, simple_loss=0.2975, pruned_loss=0.08395, over 21563.00 frames. 
], tot_loss[loss=0.2149, simple_loss=0.2905, pruned_loss=0.06969, over 4267387.31 frames. ], batch size: 548, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:01:20,844 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:01:46,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1457892.0, ans=0.1 2023-06-26 01:01:46,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1457892.0, ans=0.125 2023-06-26 01:01:57,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-26 01:02:28,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1458012.0, ans=0.1 2023-06-26 01:02:36,145 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.293e+02 4.544e+02 5.932e+02 7.825e+02 1.489e+03, threshold=1.186e+03, percent-clipped=1.0 2023-06-26 01:02:38,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1458012.0, ans=0.125 2023-06-26 01:02:42,880 INFO [train.py:996] (1/4) Epoch 8, batch 29550, loss[loss=0.228, simple_loss=0.2959, pruned_loss=0.08004, over 21885.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2892, pruned_loss=0.0707, over 4271460.90 frames. ], batch size: 414, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:02:43,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1458072.0, ans=0.1 2023-06-26 01:03:02,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1458072.0, ans=0.2 2023-06-26 01:04:33,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1458312.0, ans=0.125 2023-06-26 01:04:40,183 INFO [train.py:996] (1/4) Epoch 8, batch 29600, loss[loss=0.2439, simple_loss=0.3346, pruned_loss=0.07663, over 21819.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2958, pruned_loss=0.07328, over 4276058.40 frames. ], batch size: 316, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:05:39,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1458492.0, ans=0.125 2023-06-26 01:05:52,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1458552.0, ans=0.125 2023-06-26 01:05:56,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1458552.0, ans=0.1 2023-06-26 01:06:21,136 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 4.529e+02 7.554e+02 1.096e+03 2.697e+03, threshold=1.511e+03, percent-clipped=19.0 2023-06-26 01:06:27,943 INFO [train.py:996] (1/4) Epoch 8, batch 29650, loss[loss=0.1866, simple_loss=0.2671, pruned_loss=0.053, over 21824.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2955, pruned_loss=0.07072, over 4279532.61 frames. 
], batch size: 298, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:07:16,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1458792.0, ans=0.125 2023-06-26 01:07:46,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1458852.0, ans=0.0 2023-06-26 01:07:48,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1458852.0, ans=0.125 2023-06-26 01:08:17,131 INFO [train.py:996] (1/4) Epoch 8, batch 29700, loss[loss=0.2267, simple_loss=0.302, pruned_loss=0.07571, over 21505.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2977, pruned_loss=0.07117, over 4280480.14 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:08:23,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1458972.0, ans=0.125 2023-06-26 01:08:45,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1459032.0, ans=0.05 2023-06-26 01:09:15,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1459092.0, ans=0.025 2023-06-26 01:09:20,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1459092.0, ans=0.125 2023-06-26 01:09:24,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1459152.0, ans=0.0 2023-06-26 01:09:27,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1459152.0, ans=0.125 2023-06-26 01:09:57,646 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.231e+02 4.516e+02 5.860e+02 9.248e+02 1.775e+03, threshold=1.172e+03, percent-clipped=6.0 2023-06-26 01:09:58,242 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:10:04,571 INFO [train.py:996] (1/4) Epoch 8, batch 29750, loss[loss=0.2128, simple_loss=0.2881, pruned_loss=0.06874, over 21865.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.302, pruned_loss=0.07119, over 4281982.90 frames. ], batch size: 107, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:10:42,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1459332.0, ans=0.2 2023-06-26 01:10:46,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-26 01:11:49,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1459572.0, ans=0.125 2023-06-26 01:11:51,246 INFO [train.py:996] (1/4) Epoch 8, batch 29800, loss[loss=0.2109, simple_loss=0.2908, pruned_loss=0.06546, over 21673.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3032, pruned_loss=0.07154, over 4279702.10 frames. 
], batch size: 230, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:12:13,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1459572.0, ans=0.1 2023-06-26 01:12:17,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1459632.0, ans=0.125 2023-06-26 01:13:26,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1459812.0, ans=0.0 2023-06-26 01:13:32,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.767e+02 3.928e+02 4.572e+02 6.290e+02 1.025e+03, threshold=9.144e+02, percent-clipped=0.0 2023-06-26 01:13:33,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1459812.0, ans=0.2 2023-06-26 01:13:37,440 INFO [train.py:996] (1/4) Epoch 8, batch 29850, loss[loss=0.2114, simple_loss=0.2781, pruned_loss=0.07233, over 21534.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2984, pruned_loss=0.06975, over 4281061.90 frames. ], batch size: 548, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:13:46,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1459872.0, ans=0.125 2023-06-26 01:14:21,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459992.0, ans=0.1 2023-06-26 01:14:49,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1460052.0, ans=0.2 2023-06-26 01:15:00,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1460052.0, ans=0.1 2023-06-26 01:15:20,096 INFO [train.py:996] (1/4) Epoch 8, batch 29900, loss[loss=0.2708, simple_loss=0.3593, pruned_loss=0.09111, over 21434.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2976, pruned_loss=0.07024, over 4281430.63 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:16:00,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-26 01:16:05,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1460292.0, ans=0.1 2023-06-26 01:16:20,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-26 01:16:25,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1460292.0, ans=0.1 2023-06-26 01:16:42,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-26 01:17:10,427 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.335e+02 4.671e+02 6.480e+02 9.712e+02 1.710e+03, threshold=1.296e+03, percent-clipped=28.0 2023-06-26 01:17:15,540 INFO [train.py:996] (1/4) Epoch 8, batch 29950, loss[loss=0.2372, simple_loss=0.3188, pruned_loss=0.07777, over 21434.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3015, pruned_loss=0.07365, over 4287411.55 frames. 
], batch size: 131, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:17:21,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1460472.0, ans=0.125 2023-06-26 01:17:22,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-26 01:19:00,377 INFO [train.py:996] (1/4) Epoch 8, batch 30000, loss[loss=0.2067, simple_loss=0.289, pruned_loss=0.06221, over 21235.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3037, pruned_loss=0.07443, over 4283225.49 frames. ], batch size: 143, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:19:00,377 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 01:19:11,720 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.6577, 3.1589, 3.1135, 3.7461, 2.2200, 3.5245, 3.4564, 2.5216], device='cuda:1') 2023-06-26 01:19:14,973 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.2480, 4.7437, 4.9965, 4.4700], device='cuda:1') 2023-06-26 01:19:18,793 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2464, simple_loss=0.3452, pruned_loss=0.07378, over 1796401.00 frames. 2023-06-26 01:19:18,793 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 01:19:33,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1460772.0, ans=0.125 2023-06-26 01:20:05,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1460892.0, ans=0.125 2023-06-26 01:21:08,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1461012.0, ans=0.0 2023-06-26 01:21:14,674 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.858e+02 4.174e+02 5.657e+02 7.922e+02 1.669e+03, threshold=1.131e+03, percent-clipped=1.0 2023-06-26 01:21:20,155 INFO [train.py:996] (1/4) Epoch 8, batch 30050, loss[loss=0.2375, simple_loss=0.3672, pruned_loss=0.05387, over 20803.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3077, pruned_loss=0.07171, over 4276585.71 frames. ], batch size: 607, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:22:43,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461252.0, ans=0.1 2023-06-26 01:23:13,754 INFO [train.py:996] (1/4) Epoch 8, batch 30100, loss[loss=0.1923, simple_loss=0.2576, pruned_loss=0.06352, over 21894.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3059, pruned_loss=0.07118, over 4273848.12 frames. ], batch size: 113, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:24:08,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1461492.0, ans=0.125 2023-06-26 01:24:25,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1461552.0, ans=0.125 2023-06-26 01:24:53,941 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.090e+02 4.517e+02 6.270e+02 9.720e+02 3.054e+03, threshold=1.254e+03, percent-clipped=16.0 2023-06-26 01:24:57,503 INFO [train.py:996] (1/4) Epoch 8, batch 30150, loss[loss=0.2249, simple_loss=0.2962, pruned_loss=0.07682, over 20000.00 frames. 
], tot_loss[loss=0.2231, simple_loss=0.3014, pruned_loss=0.07242, over 4266965.97 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:25:22,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1461732.0, ans=0.2 2023-06-26 01:26:01,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-26 01:26:23,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1461852.0, ans=0.2 2023-06-26 01:26:53,760 INFO [train.py:996] (1/4) Epoch 8, batch 30200, loss[loss=0.2434, simple_loss=0.3379, pruned_loss=0.07443, over 21474.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3027, pruned_loss=0.0716, over 4260521.00 frames. ], batch size: 471, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:28:00,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1462092.0, ans=0.125 2023-06-26 01:28:45,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.354e+02 5.048e+02 7.227e+02 1.023e+03 2.150e+03, threshold=1.445e+03, percent-clipped=15.0 2023-06-26 01:28:48,921 INFO [train.py:996] (1/4) Epoch 8, batch 30250, loss[loss=0.2298, simple_loss=0.2975, pruned_loss=0.08101, over 19962.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3094, pruned_loss=0.07349, over 4260746.30 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:28:54,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1462272.0, ans=0.2 2023-06-26 01:29:23,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1462332.0, ans=0.125 2023-06-26 01:30:36,863 INFO [train.py:996] (1/4) Epoch 8, batch 30300, loss[loss=0.185, simple_loss=0.2538, pruned_loss=0.05807, over 21610.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3078, pruned_loss=0.07399, over 4259786.31 frames. ], batch size: 282, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:31:26,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1462692.0, ans=0.125 2023-06-26 01:32:09,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1462812.0, ans=0.0 2023-06-26 01:32:31,190 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.196e+02 5.174e+02 6.761e+02 1.021e+03 2.632e+03, threshold=1.352e+03, percent-clipped=10.0 2023-06-26 01:32:34,766 INFO [train.py:996] (1/4) Epoch 8, batch 30350, loss[loss=0.2394, simple_loss=0.3236, pruned_loss=0.07753, over 21733.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3093, pruned_loss=0.07545, over 4264839.49 frames. ], batch size: 332, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:33:56,337 INFO [train.py:996] (1/4) Epoch 8, batch 30400, loss[loss=0.207, simple_loss=0.2556, pruned_loss=0.07922, over 20335.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.304, pruned_loss=0.07407, over 4259331.44 frames. 
], batch size: 703, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:34:19,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1463232.0, ans=0.125 2023-06-26 01:34:24,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1463232.0, ans=0.0 2023-06-26 01:34:35,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1463292.0, ans=0.125 2023-06-26 01:34:38,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1463292.0, ans=0.125 2023-06-26 01:34:46,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1463292.0, ans=0.07 2023-06-26 01:34:53,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1463352.0, ans=0.1 2023-06-26 01:35:24,301 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.064e+02 6.383e+02 1.075e+03 1.632e+03 7.193e+03, threshold=2.149e+03, percent-clipped=36.0 2023-06-26 01:35:25,750 INFO [train.py:996] (1/4) Epoch 8, batch 30450, loss[loss=0.2655, simple_loss=0.3817, pruned_loss=0.07463, over 19889.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3043, pruned_loss=0.07374, over 4200518.03 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:35:26,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1463472.0, ans=0.125 2023-06-26 01:35:35,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1463472.0, ans=0.125 2023-06-26 01:35:47,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1463532.0, ans=0.025 2023-06-26 01:36:01,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-26 01:36:26,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1463652.0, ans=0.0 2023-06-26 01:36:26,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1463652.0, ans=0.125 2023-06-26 01:38:50,986 INFO [train.py:996] (1/4) Epoch 9, batch 0, loss[loss=0.1995, simple_loss=0.2655, pruned_loss=0.06676, over 21754.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2655, pruned_loss=0.06676, over 21754.00 frames. ], batch size: 317, lr: 3.39e-03, grad_scale: 32.0 2023-06-26 01:38:50,986 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 01:39:14,232 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2395, simple_loss=0.3459, pruned_loss=0.06656, over 1796401.00 frames. 
2023-06-26 01:39:14,233 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 01:40:27,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1463922.0, ans=0.125 2023-06-26 01:40:29,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1463922.0, ans=0.0 2023-06-26 01:40:31,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1463922.0, ans=15.0 2023-06-26 01:40:59,310 INFO [train.py:996] (1/4) Epoch 9, batch 50, loss[loss=0.2456, simple_loss=0.3143, pruned_loss=0.08846, over 21906.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3096, pruned_loss=0.07524, over 960969.17 frames. ], batch size: 107, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:41:03,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1464042.0, ans=0.07 2023-06-26 01:41:13,455 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.197e+02 4.855e+02 1.072e+03 2.293e+03 5.497e+03, threshold=2.144e+03, percent-clipped=28.0 2023-06-26 01:41:57,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1464162.0, ans=0.0 2023-06-26 01:42:06,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1464222.0, ans=0.0 2023-06-26 01:42:36,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1464282.0, ans=0.1 2023-06-26 01:42:40,947 INFO [train.py:996] (1/4) Epoch 9, batch 100, loss[loss=0.2375, simple_loss=0.341, pruned_loss=0.06698, over 21313.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3202, pruned_loss=0.07616, over 1694056.28 frames. ], batch size: 176, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:42:59,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1464342.0, ans=0.95 2023-06-26 01:43:14,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1464402.0, ans=0.0 2023-06-26 01:43:37,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1464462.0, ans=0.1 2023-06-26 01:43:38,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1464462.0, ans=0.125 2023-06-26 01:43:46,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1464462.0, ans=0.2 2023-06-26 01:44:15,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1464582.0, ans=0.125 2023-06-26 01:44:26,115 INFO [train.py:996] (1/4) Epoch 9, batch 150, loss[loss=0.2225, simple_loss=0.2902, pruned_loss=0.07737, over 21940.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3228, pruned_loss=0.07567, over 2272418.79 frames. 
], batch size: 316, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:44:40,676 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.415e+02 5.834e+02 7.944e+02 1.480e+03, threshold=1.167e+03, percent-clipped=0.0 2023-06-26 01:45:10,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1464702.0, ans=0.125 2023-06-26 01:45:23,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1464762.0, ans=0.1 2023-06-26 01:46:03,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1464882.0, ans=0.0 2023-06-26 01:46:07,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. limit=6.0 2023-06-26 01:46:13,198 INFO [train.py:996] (1/4) Epoch 9, batch 200, loss[loss=0.2306, simple_loss=0.3183, pruned_loss=0.07149, over 21893.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.318, pruned_loss=0.07357, over 2717335.09 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:48:00,439 INFO [train.py:996] (1/4) Epoch 9, batch 250, loss[loss=0.2214, simple_loss=0.2972, pruned_loss=0.07285, over 21869.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3127, pruned_loss=0.07258, over 3060003.23 frames. ], batch size: 332, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:48:08,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.143e+02 4.378e+02 6.069e+02 8.721e+02 1.562e+03, threshold=1.214e+03, percent-clipped=10.0 2023-06-26 01:48:54,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-26 01:49:09,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.16 vs. limit=10.0 2023-06-26 01:49:14,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1465422.0, ans=0.125 2023-06-26 01:49:29,148 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-26 01:49:50,362 INFO [train.py:996] (1/4) Epoch 9, batch 300, loss[loss=0.2428, simple_loss=0.3077, pruned_loss=0.08895, over 21760.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3087, pruned_loss=0.07335, over 3332442.08 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:49:52,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1465542.0, ans=0.125 2023-06-26 01:50:00,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. 
limit=15.0 2023-06-26 01:50:32,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1465602.0, ans=0.0 2023-06-26 01:50:35,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1465602.0, ans=0.125 2023-06-26 01:50:52,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-26 01:51:16,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-26 01:51:41,271 INFO [train.py:996] (1/4) Epoch 9, batch 350, loss[loss=0.2086, simple_loss=0.2699, pruned_loss=0.0737, over 21431.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3033, pruned_loss=0.07205, over 3540899.52 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:51:50,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.975e+02 4.637e+02 6.282e+02 9.202e+02 1.945e+03, threshold=1.256e+03, percent-clipped=12.0 2023-06-26 01:51:51,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0 2023-06-26 01:52:21,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1465902.0, ans=0.125 2023-06-26 01:52:21,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1465902.0, ans=0.0 2023-06-26 01:52:25,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-26 01:52:53,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0 2023-06-26 01:53:26,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1466082.0, ans=0.0 2023-06-26 01:53:30,988 INFO [train.py:996] (1/4) Epoch 9, batch 400, loss[loss=0.1727, simple_loss=0.2516, pruned_loss=0.04694, over 21628.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2958, pruned_loss=0.07051, over 3706117.31 frames. ], batch size: 247, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:54:05,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1466202.0, ans=0.2 2023-06-26 01:54:47,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1466322.0, ans=0.0 2023-06-26 01:54:49,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-26 01:55:14,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1466382.0, ans=0.125 2023-06-26 01:55:20,922 INFO [train.py:996] (1/4) Epoch 9, batch 450, loss[loss=0.2139, simple_loss=0.2894, pruned_loss=0.06926, over 21873.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.294, pruned_loss=0.0698, over 3835766.24 frames. 
], batch size: 118, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:55:40,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1466442.0, ans=0.125 2023-06-26 01:55:41,297 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.313e+02 4.889e+02 7.953e+02 1.170e+03 2.853e+03, threshold=1.591e+03, percent-clipped=21.0 2023-06-26 01:55:41,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1466442.0, ans=0.025 2023-06-26 01:55:54,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466502.0, ans=0.1 2023-06-26 01:56:14,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-26 01:56:15,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1466562.0, ans=0.125 2023-06-26 01:56:50,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466682.0, ans=0.1 2023-06-26 01:57:11,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1466682.0, ans=0.0 2023-06-26 01:57:13,988 INFO [train.py:996] (1/4) Epoch 9, batch 500, loss[loss=0.2225, simple_loss=0.315, pruned_loss=0.06501, over 21764.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2959, pruned_loss=0.06991, over 3931674.03 frames. ], batch size: 247, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:57:49,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.91 vs. limit=10.0 2023-06-26 01:58:10,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1466862.0, ans=0.2 2023-06-26 01:58:31,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1466922.0, ans=0.2 2023-06-26 01:58:35,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466922.0, ans=0.1 2023-06-26 01:58:53,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466982.0, ans=0.1 2023-06-26 01:59:08,353 INFO [train.py:996] (1/4) Epoch 9, batch 550, loss[loss=0.2069, simple_loss=0.2809, pruned_loss=0.06642, over 21856.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2993, pruned_loss=0.06897, over 4011435.97 frames. 
], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:59:25,222 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.595e+02 7.824e+02 1.104e+03 2.417e+03, threshold=1.565e+03, percent-clipped=11.0 2023-06-26 02:00:16,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1467222.0, ans=0.125 2023-06-26 02:00:31,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1467282.0, ans=0.125 2023-06-26 02:01:03,298 INFO [train.py:996] (1/4) Epoch 9, batch 600, loss[loss=0.2236, simple_loss=0.2953, pruned_loss=0.07594, over 21751.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3006, pruned_loss=0.06886, over 4072516.02 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:01:44,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1467462.0, ans=0.035 2023-06-26 02:01:44,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1467462.0, ans=0.0 2023-06-26 02:01:51,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.22 vs. limit=8.0 2023-06-26 02:02:19,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1467582.0, ans=0.05 2023-06-26 02:02:21,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1467582.0, ans=0.2 2023-06-26 02:02:36,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1467582.0, ans=0.125 2023-06-26 02:02:39,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-26 02:02:47,024 INFO [train.py:996] (1/4) Epoch 9, batch 650, loss[loss=0.2057, simple_loss=0.2624, pruned_loss=0.07446, over 19915.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3005, pruned_loss=0.06911, over 4124833.19 frames. ], batch size: 704, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:03:03,584 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.173e+02 5.371e+02 7.433e+02 1.361e+03 3.228e+03, threshold=1.487e+03, percent-clipped=18.0 2023-06-26 02:03:35,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1467762.0, ans=0.1 2023-06-26 02:03:39,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1467762.0, ans=0.0 2023-06-26 02:03:41,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1467762.0, ans=0.1 2023-06-26 02:03:42,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1467762.0, ans=0.0 2023-06-26 02:03:52,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1467822.0, ans=0.0 2023-06-26 02:04:44,095 INFO [train.py:996] (1/4) Epoch 9, batch 700, loss[loss=0.2216, simple_loss=0.3025, pruned_loss=0.07031, over 21790.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2997, pruned_loss=0.06904, over 4152622.02 frames. 
], batch size: 107, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:05:37,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1468062.0, ans=0.1 2023-06-26 02:05:55,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1468122.0, ans=0.0 2023-06-26 02:06:02,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-26 02:06:04,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1468182.0, ans=0.0 2023-06-26 02:06:31,665 INFO [train.py:996] (1/4) Epoch 9, batch 750, loss[loss=0.2085, simple_loss=0.2727, pruned_loss=0.07219, over 21710.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.299, pruned_loss=0.06918, over 4187246.92 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:06:38,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-26 02:06:42,121 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 4.754e+02 6.417e+02 9.585e+02 1.882e+03, threshold=1.283e+03, percent-clipped=6.0 2023-06-26 02:06:57,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-26 02:07:42,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-26 02:07:45,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1468482.0, ans=0.125 2023-06-26 02:07:47,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2023-06-26 02:07:57,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1468482.0, ans=0.125 2023-06-26 02:08:10,173 INFO [train.py:996] (1/4) Epoch 9, batch 800, loss[loss=0.2049, simple_loss=0.2737, pruned_loss=0.06811, over 21775.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2963, pruned_loss=0.06929, over 4195466.78 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:08:16,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-26 02:08:17,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1468542.0, ans=0.0 2023-06-26 02:08:50,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-26 02:09:22,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. 
limit=22.5 2023-06-26 02:09:45,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1468782.0, ans=0.125 2023-06-26 02:10:10,648 INFO [train.py:996] (1/4) Epoch 9, batch 850, loss[loss=0.212, simple_loss=0.2914, pruned_loss=0.06628, over 21897.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2946, pruned_loss=0.06984, over 4220686.58 frames. ], batch size: 124, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:10:13,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1468842.0, ans=0.0 2023-06-26 02:10:26,284 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.225e+02 7.900e+02 1.161e+03 2.208e+03, threshold=1.580e+03, percent-clipped=19.0 2023-06-26 02:10:28,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1468842.0, ans=0.0 2023-06-26 02:11:06,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1468962.0, ans=0.125 2023-06-26 02:11:11,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1469022.0, ans=0.125 2023-06-26 02:11:13,774 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-26 02:11:59,347 INFO [train.py:996] (1/4) Epoch 9, batch 900, loss[loss=0.181, simple_loss=0.2732, pruned_loss=0.04442, over 21797.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2918, pruned_loss=0.06909, over 4234168.18 frames. ], batch size: 282, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:12:24,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1469202.0, ans=0.0 2023-06-26 02:12:50,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1469262.0, ans=0.125 2023-06-26 02:13:48,845 INFO [train.py:996] (1/4) Epoch 9, batch 950, loss[loss=0.212, simple_loss=0.2905, pruned_loss=0.06677, over 21738.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2918, pruned_loss=0.06892, over 4251641.67 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:14:01,417 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 4.404e+02 7.084e+02 1.100e+03 2.197e+03, threshold=1.417e+03, percent-clipped=5.0 2023-06-26 02:14:17,273 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:14:17,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1469502.0, ans=0.1 2023-06-26 02:14:31,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1469562.0, ans=0.125 2023-06-26 02:14:47,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1469622.0, ans=0.125 2023-06-26 02:15:36,790 INFO [train.py:996] (1/4) Epoch 9, batch 1000, loss[loss=0.2427, simple_loss=0.3203, pruned_loss=0.08253, over 21371.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2906, pruned_loss=0.06887, over 4262713.49 frames. 
], batch size: 131, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:15:37,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1469742.0, ans=0.125 2023-06-26 02:16:16,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1469862.0, ans=0.2 2023-06-26 02:16:52,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1469922.0, ans=0.1 2023-06-26 02:17:15,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1469982.0, ans=0.0 2023-06-26 02:17:18,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1469982.0, ans=0.1 2023-06-26 02:17:27,484 INFO [train.py:996] (1/4) Epoch 9, batch 1050, loss[loss=0.159, simple_loss=0.24, pruned_loss=0.03901, over 21253.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2925, pruned_loss=0.06934, over 4264996.63 frames. ], batch size: 176, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:17:39,392 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 4.347e+02 6.082e+02 9.446e+02 2.534e+03, threshold=1.216e+03, percent-clipped=8.0 2023-06-26 02:17:47,573 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:19:18,789 INFO [train.py:996] (1/4) Epoch 9, batch 1100, loss[loss=0.2359, simple_loss=0.3158, pruned_loss=0.07807, over 21862.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2929, pruned_loss=0.0692, over 4275207.27 frames. ], batch size: 371, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:19:37,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1470402.0, ans=0.0 2023-06-26 02:19:49,551 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:20:44,104 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-26 02:21:09,480 INFO [train.py:996] (1/4) Epoch 9, batch 1150, loss[loss=0.2329, simple_loss=0.3055, pruned_loss=0.08016, over 21776.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2933, pruned_loss=0.06888, over 4275557.54 frames. ], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:21:22,267 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.817e+02 6.167e+02 1.033e+03 2.052e+03, threshold=1.233e+03, percent-clipped=13.0 2023-06-26 02:21:35,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1470702.0, ans=0.125 2023-06-26 02:21:38,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1470702.0, ans=0.125 2023-06-26 02:21:58,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.95 vs. 
limit=15.0 2023-06-26 02:22:12,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1470762.0, ans=0.125 2023-06-26 02:22:28,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1470822.0, ans=0.1 2023-06-26 02:23:00,149 INFO [train.py:996] (1/4) Epoch 9, batch 1200, loss[loss=0.2426, simple_loss=0.3299, pruned_loss=0.07767, over 21755.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2945, pruned_loss=0.06962, over 4276159.80 frames. ], batch size: 391, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:23:01,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-06-26 02:23:12,226 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.26 vs. limit=15.0 2023-06-26 02:24:42,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1471182.0, ans=0.07 2023-06-26 02:24:52,845 INFO [train.py:996] (1/4) Epoch 9, batch 1250, loss[loss=0.2279, simple_loss=0.315, pruned_loss=0.07042, over 21765.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2969, pruned_loss=0.07031, over 4279343.20 frames. ], batch size: 282, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:25:06,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 4.578e+02 6.578e+02 9.426e+02 2.383e+03, threshold=1.316e+03, percent-clipped=14.0 2023-06-26 02:26:38,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1471482.0, ans=0.1 2023-06-26 02:26:42,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1471542.0, ans=0.0 2023-06-26 02:26:43,220 INFO [train.py:996] (1/4) Epoch 9, batch 1300, loss[loss=0.2445, simple_loss=0.3276, pruned_loss=0.08068, over 21748.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2976, pruned_loss=0.07106, over 4281201.97 frames. ], batch size: 414, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:26:50,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1471542.0, ans=0.5 2023-06-26 02:27:20,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1471602.0, ans=0.0 2023-06-26 02:28:14,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1471782.0, ans=0.125 2023-06-26 02:28:25,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1471782.0, ans=0.125 2023-06-26 02:28:32,858 INFO [train.py:996] (1/4) Epoch 9, batch 1350, loss[loss=0.284, simple_loss=0.345, pruned_loss=0.1115, over 21326.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2986, pruned_loss=0.07159, over 4290084.09 frames. 
], batch size: 507, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:28:46,556 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.596e+02 4.887e+02 7.409e+02 1.206e+03 1.964e+03, threshold=1.482e+03, percent-clipped=19.0 2023-06-26 02:28:55,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1471902.0, ans=0.0 2023-06-26 02:28:57,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1471902.0, ans=0.125 2023-06-26 02:30:22,900 INFO [train.py:996] (1/4) Epoch 9, batch 1400, loss[loss=0.1838, simple_loss=0.274, pruned_loss=0.0468, over 21385.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2968, pruned_loss=0.0715, over 4286100.89 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:30:43,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1472202.0, ans=0.125 2023-06-26 02:30:54,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-26 02:30:55,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1472202.0, ans=0.125 2023-06-26 02:31:51,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1472322.0, ans=0.0 2023-06-26 02:31:58,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1472382.0, ans=15.0 2023-06-26 02:32:09,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1472382.0, ans=0.125 2023-06-26 02:32:13,555 INFO [train.py:996] (1/4) Epoch 9, batch 1450, loss[loss=0.3191, simple_loss=0.4022, pruned_loss=0.118, over 21525.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2967, pruned_loss=0.07212, over 4289910.86 frames. ], batch size: 507, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:32:27,134 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.469e+02 8.336e+02 1.169e+03 2.052e+03, threshold=1.667e+03, percent-clipped=11.0 2023-06-26 02:32:37,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1472502.0, ans=0.125 2023-06-26 02:33:57,839 INFO [train.py:996] (1/4) Epoch 9, batch 1500, loss[loss=0.2151, simple_loss=0.2939, pruned_loss=0.0681, over 17592.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.298, pruned_loss=0.07355, over 4293339.78 frames. ], batch size: 60, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:34:41,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1472802.0, ans=0.0 2023-06-26 02:35:27,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-26 02:35:36,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1472982.0, ans=0.125 2023-06-26 02:35:44,372 INFO [train.py:996] (1/4) Epoch 9, batch 1550, loss[loss=0.1726, simple_loss=0.2791, pruned_loss=0.03302, over 20820.00 frames. 
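The recurring "Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..." records above summarize the spread of recent gradient norms and how often clipping fired. The sketch below shows one way such a summary could be produced; the function name and the median-based threshold rule are assumptions for illustration, not the actual optim.py code.

```python
import torch

def grad_norm_summary(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
    """recent_norms: 1-D float tensor of gradient norms from recent batches."""
    # min / 25% / 50% / 75% / max, the five values printed as "quartiles".
    quartiles = torch.quantile(recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    # Assumed rule: clip whenever a norm exceeds clipping_scale times the median.
    threshold = clipping_scale * quartiles[2]
    percent_clipped = 100.0 * (recent_norms > threshold).float().mean()
    return quartiles, threshold, percent_clipped

norms = torch.tensor([320.0, 480.0, 610.0, 930.0, 2100.0, 400.0, 750.0, 1300.0])
q, thr, pct = grad_norm_summary(norms)
print(q.tolist(), float(thr), float(pct))
```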
], tot_loss[loss=0.221, simple_loss=0.2968, pruned_loss=0.07264, over 4293287.13 frames. ], batch size: 607, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:35:46,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1473042.0, ans=0.1 2023-06-26 02:35:58,923 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.360e+02 5.874e+02 7.765e+02 1.799e+03, threshold=1.175e+03, percent-clipped=2.0 2023-06-26 02:36:46,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1473162.0, ans=0.125 2023-06-26 02:36:49,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1473162.0, ans=0.1 2023-06-26 02:37:07,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1473222.0, ans=0.125 2023-06-26 02:37:35,443 INFO [train.py:996] (1/4) Epoch 9, batch 1600, loss[loss=0.2193, simple_loss=0.2988, pruned_loss=0.06996, over 21417.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2942, pruned_loss=0.07118, over 4286865.26 frames. ], batch size: 548, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:37:58,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1473402.0, ans=0.125 2023-06-26 02:39:00,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1473522.0, ans=0.125 2023-06-26 02:39:07,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1473582.0, ans=0.0 2023-06-26 02:39:16,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0 2023-06-26 02:39:22,831 INFO [train.py:996] (1/4) Epoch 9, batch 1650, loss[loss=0.2251, simple_loss=0.2983, pruned_loss=0.07596, over 21336.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.293, pruned_loss=0.0701, over 4276987.43 frames. ], batch size: 143, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:39:56,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.260e+02 4.603e+02 6.235e+02 9.034e+02 1.719e+03, threshold=1.247e+03, percent-clipped=11.0 2023-06-26 02:40:14,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1473762.0, ans=0.1 2023-06-26 02:40:52,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1473882.0, ans=0.0 2023-06-26 02:41:11,346 INFO [train.py:996] (1/4) Epoch 9, batch 1700, loss[loss=0.2037, simple_loss=0.2754, pruned_loss=0.06601, over 21050.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2964, pruned_loss=0.07148, over 4276839.64 frames. ], batch size: 608, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:41:38,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1473942.0, ans=0.2 2023-06-26 02:42:03,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.47 vs. 
limit=15.0 2023-06-26 02:42:12,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-26 02:42:27,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=22.5 2023-06-26 02:42:33,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-26 02:42:42,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1474182.0, ans=0.0 2023-06-26 02:42:43,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1474182.0, ans=0.95 2023-06-26 02:42:47,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1474182.0, ans=0.1 2023-06-26 02:43:10,702 INFO [train.py:996] (1/4) Epoch 9, batch 1750, loss[loss=0.2146, simple_loss=0.2958, pruned_loss=0.06674, over 21538.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2971, pruned_loss=0.06991, over 4276495.77 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:43:26,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.127e+02 4.603e+02 7.165e+02 1.089e+03 2.171e+03, threshold=1.433e+03, percent-clipped=16.0 2023-06-26 02:44:07,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.71 vs. limit=22.5 2023-06-26 02:44:15,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1474422.0, ans=0.95 2023-06-26 02:44:51,694 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-26 02:44:59,053 INFO [train.py:996] (1/4) Epoch 9, batch 1800, loss[loss=0.2508, simple_loss=0.3273, pruned_loss=0.08713, over 21432.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2908, pruned_loss=0.06616, over 4275071.77 frames. ], batch size: 507, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:45:10,181 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:45:11,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1474542.0, ans=0.125 2023-06-26 02:45:17,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=12.0 2023-06-26 02:45:19,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=22.5 2023-06-26 02:45:37,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. 
limit=15.0 2023-06-26 02:46:42,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1474782.0, ans=0.09899494936611666 2023-06-26 02:46:44,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1474782.0, ans=0.125 2023-06-26 02:46:49,565 INFO [train.py:996] (1/4) Epoch 9, batch 1850, loss[loss=0.2283, simple_loss=0.3253, pruned_loss=0.06558, over 21455.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2944, pruned_loss=0.06537, over 4275212.71 frames. ], batch size: 471, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:47:07,111 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.277e+02 4.370e+02 7.147e+02 9.387e+02 1.947e+03, threshold=1.429e+03, percent-clipped=4.0 2023-06-26 02:47:08,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-26 02:47:31,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1474962.0, ans=0.2 2023-06-26 02:47:31,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1474962.0, ans=0.0 2023-06-26 02:47:57,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475022.0, ans=0.1 2023-06-26 02:48:30,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1475082.0, ans=0.2 2023-06-26 02:48:35,241 INFO [train.py:996] (1/4) Epoch 9, batch 1900, loss[loss=0.1921, simple_loss=0.2515, pruned_loss=0.06639, over 20301.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2955, pruned_loss=0.06669, over 4276941.57 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:48:35,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1475142.0, ans=0.125 2023-06-26 02:48:49,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1475142.0, ans=0.125 2023-06-26 02:49:09,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1475262.0, ans=0.0 2023-06-26 02:49:52,956 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.57 vs. limit=22.5 2023-06-26 02:50:03,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-26 02:50:22,000 INFO [train.py:996] (1/4) Epoch 9, batch 1950, loss[loss=0.1935, simple_loss=0.2571, pruned_loss=0.06495, over 21617.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2932, pruned_loss=0.06709, over 4272964.73 frames. 
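The per-batch loss fields in these records are mutually consistent with a simple weighting: in each case the logged loss is very close to 0.5 * simple_loss + pruned_loss (for example, 0.5 * 0.2571 + 0.06495 ≈ 0.1935 for the batch 1950 record above), and tot_loss behaves like a frame-weighted running average over the epoch so far. A minimal sketch of that bookkeeping, assuming the fixed 0.5 weight and the helper names used here:

```python
def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    # Assumed combination; the exact weighting is recipe-specific.
    return simple_loss_scale * simple_loss + pruned_loss

class RunningLoss:
    """Frame-weighted running average, analogous to the logged tot_loss."""
    def __init__(self) -> None:
        self.weighted_sum = 0.0
        self.num_frames = 0.0

    def update(self, loss: float, frames: float) -> float:
        self.weighted_sum += loss * frames
        self.num_frames += frames
        return self.weighted_sum / self.num_frames

print(combine_losses(0.2571, 0.06495))  # ~0.1935, matching the batch 1950 record
```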
], batch size: 298, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:50:39,714 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.119e+02 4.600e+02 6.101e+02 9.329e+02 1.931e+03, threshold=1.220e+03, percent-clipped=7.0 2023-06-26 02:50:40,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1475502.0, ans=0.125 2023-06-26 02:51:42,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1475622.0, ans=0.125 2023-06-26 02:51:44,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1475622.0, ans=0.04949747468305833 2023-06-26 02:52:13,511 INFO [train.py:996] (1/4) Epoch 9, batch 2000, loss[loss=0.1315, simple_loss=0.1934, pruned_loss=0.03477, over 15797.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2892, pruned_loss=0.06571, over 4264960.02 frames. ], batch size: 60, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:52:16,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1475742.0, ans=0.0 2023-06-26 02:53:22,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1475922.0, ans=0.125 2023-06-26 02:53:43,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1475982.0, ans=0.0 2023-06-26 02:53:54,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1475982.0, ans=0.125 2023-06-26 02:53:58,860 INFO [train.py:996] (1/4) Epoch 9, batch 2050, loss[loss=0.2207, simple_loss=0.2968, pruned_loss=0.07234, over 21848.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2886, pruned_loss=0.06531, over 4270700.82 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:54:16,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.128e+02 5.298e+02 7.792e+02 1.006e+03 2.094e+03, threshold=1.558e+03, percent-clipped=16.0 2023-06-26 02:55:32,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1476222.0, ans=0.125 2023-06-26 02:55:53,049 INFO [train.py:996] (1/4) Epoch 9, batch 2100, loss[loss=0.2025, simple_loss=0.2874, pruned_loss=0.05874, over 21827.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2924, pruned_loss=0.06732, over 4275581.64 frames. ], batch size: 102, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:56:04,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1476342.0, ans=0.2 2023-06-26 02:57:05,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1476462.0, ans=0.125 2023-06-26 02:57:09,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1476522.0, ans=0.2 2023-06-26 02:57:11,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476522.0, ans=0.1 2023-06-26 02:57:12,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. 
limit=15.0 2023-06-26 02:57:26,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1476582.0, ans=0.125 2023-06-26 02:57:44,906 INFO [train.py:996] (1/4) Epoch 9, batch 2150, loss[loss=0.2075, simple_loss=0.2822, pruned_loss=0.06634, over 21207.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2926, pruned_loss=0.06852, over 4274547.52 frames. ], batch size: 176, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:58:02,891 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 5.087e+02 7.506e+02 1.094e+03 2.833e+03, threshold=1.501e+03, percent-clipped=11.0 2023-06-26 02:59:31,673 INFO [train.py:996] (1/4) Epoch 9, batch 2200, loss[loss=0.2171, simple_loss=0.3055, pruned_loss=0.06433, over 21735.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2961, pruned_loss=0.06917, over 4270457.15 frames. ], batch size: 298, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:59:40,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1476942.0, ans=0.0 2023-06-26 03:00:03,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1477002.0, ans=0.04949747468305833 2023-06-26 03:01:15,203 INFO [train.py:996] (1/4) Epoch 9, batch 2250, loss[loss=0.1839, simple_loss=0.2484, pruned_loss=0.05968, over 21832.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2918, pruned_loss=0.06707, over 4264822.25 frames. ], batch size: 98, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:01:22,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1477242.0, ans=0.125 2023-06-26 03:01:32,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.139e+02 4.755e+02 7.951e+02 1.208e+03 2.238e+03, threshold=1.590e+03, percent-clipped=7.0 2023-06-26 03:02:26,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1477362.0, ans=0.0 2023-06-26 03:02:38,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1477422.0, ans=0.09899494936611666 2023-06-26 03:02:48,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1477482.0, ans=0.125 2023-06-26 03:03:00,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1477482.0, ans=0.125 2023-06-26 03:03:05,222 INFO [train.py:996] (1/4) Epoch 9, batch 2300, loss[loss=0.2383, simple_loss=0.284, pruned_loss=0.09629, over 21306.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2897, pruned_loss=0.06702, over 4257235.75 frames. ], batch size: 473, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:03:10,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1477542.0, ans=0.125 2023-06-26 03:03:25,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. 
limit=15.0 2023-06-26 03:04:07,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1477662.0, ans=0.125 2023-06-26 03:04:12,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1477662.0, ans=15.0 2023-06-26 03:04:18,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-26 03:04:22,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1477722.0, ans=0.0 2023-06-26 03:04:30,459 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:04:30,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-26 03:04:37,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1477782.0, ans=0.0 2023-06-26 03:04:51,426 INFO [train.py:996] (1/4) Epoch 9, batch 2350, loss[loss=0.2203, simple_loss=0.2878, pruned_loss=0.07641, over 21437.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2867, pruned_loss=0.06708, over 4259790.97 frames. ], batch size: 389, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:05:12,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-26 03:05:15,071 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.161e+02 4.711e+02 6.334e+02 1.025e+03 2.139e+03, threshold=1.267e+03, percent-clipped=9.0 2023-06-26 03:05:26,641 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-26 03:05:53,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1477962.0, ans=0.125 2023-06-26 03:06:44,888 INFO [train.py:996] (1/4) Epoch 9, batch 2400, loss[loss=0.1935, simple_loss=0.2687, pruned_loss=0.05918, over 21878.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2919, pruned_loss=0.06897, over 4261254.42 frames. ], batch size: 98, lr: 3.37e-03, grad_scale: 32.0 2023-06-26 03:07:03,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1478142.0, ans=0.125 2023-06-26 03:07:18,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1478202.0, ans=0.125 2023-06-26 03:08:11,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1478382.0, ans=10.0 2023-06-26 03:08:16,656 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0 2023-06-26 03:08:36,725 INFO [train.py:996] (1/4) Epoch 9, batch 2450, loss[loss=0.2416, simple_loss=0.3225, pruned_loss=0.08033, over 21842.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2963, pruned_loss=0.07113, over 4265619.85 frames. 
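The grad_scale values in these records move among 8.0, 16.0 and 32.0, which is the usual mixed-precision pattern: the loss scale is halved when an overflow is detected and grown again after a stretch of stable steps. A self-contained sketch of that update rule follows; the growth interval and factors are illustrative defaults, not the recipe's settings.

```python
def update_grad_scale(scale: float, found_inf: bool, good_steps: int,
                      growth_interval: int = 2000,
                      growth_factor: float = 2.0,
                      backoff_factor: float = 0.5):
    """Return (new_scale, new_good_steps)."""
    if found_inf:
        return scale * backoff_factor, 0      # halve on overflow, e.g. 32 -> 16
    good_steps += 1
    if good_steps >= growth_interval:
        return scale * growth_factor, 0       # double after a stable stretch
    return scale, good_steps

scale, good = 32.0, 0
scale, good = update_grad_scale(scale, found_inf=True, good_steps=good)
print(scale)  # 16.0
```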
], batch size: 118, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:09:01,663 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.534e+02 5.033e+02 6.854e+02 1.116e+03 2.187e+03, threshold=1.371e+03, percent-clipped=18.0 2023-06-26 03:09:08,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1478502.0, ans=0.09899494936611666 2023-06-26 03:09:08,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1478502.0, ans=0.2 2023-06-26 03:09:22,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1478502.0, ans=0.2 2023-06-26 03:09:23,342 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-26 03:10:19,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1478742.0, ans=0.2 2023-06-26 03:10:20,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1478742.0, ans=0.125 2023-06-26 03:10:21,187 INFO [train.py:996] (1/4) Epoch 9, batch 2500, loss[loss=0.2194, simple_loss=0.2937, pruned_loss=0.07257, over 21111.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2942, pruned_loss=0.07051, over 4267644.20 frames. ], batch size: 159, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:11:05,874 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:11:44,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-26 03:11:59,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-06-26 03:12:06,840 INFO [train.py:996] (1/4) Epoch 9, batch 2550, loss[loss=0.1971, simple_loss=0.268, pruned_loss=0.06309, over 21514.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2931, pruned_loss=0.06915, over 4263164.70 frames. ], batch size: 391, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:12:30,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1479042.0, ans=0.1 2023-06-26 03:12:37,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-26 03:12:37,798 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.177e+02 4.403e+02 6.951e+02 9.882e+02 2.721e+03, threshold=1.390e+03, percent-clipped=12.0 2023-06-26 03:13:02,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1479102.0, ans=0.0 2023-06-26 03:13:47,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1479282.0, ans=0.125 2023-06-26 03:13:57,227 INFO [train.py:996] (1/4) Epoch 9, batch 2600, loss[loss=0.2117, simple_loss=0.2897, pruned_loss=0.06689, over 21786.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2939, pruned_loss=0.06979, over 4263036.25 frames. 
], batch size: 332, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:14:26,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=12.0 2023-06-26 03:14:48,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1479462.0, ans=0.02 2023-06-26 03:14:53,285 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-26 03:15:14,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1479522.0, ans=0.125 2023-06-26 03:15:43,476 INFO [train.py:996] (1/4) Epoch 9, batch 2650, loss[loss=0.2176, simple_loss=0.2998, pruned_loss=0.06767, over 21741.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2936, pruned_loss=0.0703, over 4272666.55 frames. ], batch size: 247, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:15:45,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1479642.0, ans=0.0 2023-06-26 03:16:14,201 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 5.388e+02 7.988e+02 1.143e+03 2.285e+03, threshold=1.598e+03, percent-clipped=12.0 2023-06-26 03:16:18,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1479702.0, ans=0.07 2023-06-26 03:17:08,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1479822.0, ans=0.07 2023-06-26 03:17:29,008 INFO [train.py:996] (1/4) Epoch 9, batch 2700, loss[loss=0.2489, simple_loss=0.3179, pruned_loss=0.08993, over 21315.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2917, pruned_loss=0.06909, over 4261185.95 frames. ], batch size: 549, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:17:29,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-26 03:18:57,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1480122.0, ans=0.2 2023-06-26 03:19:00,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1480182.0, ans=0.0 2023-06-26 03:19:09,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1480182.0, ans=0.125 2023-06-26 03:19:13,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1480182.0, ans=0.125 2023-06-26 03:19:19,994 INFO [train.py:996] (1/4) Epoch 9, batch 2750, loss[loss=0.2293, simple_loss=0.307, pruned_loss=0.07581, over 21471.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2916, pruned_loss=0.06899, over 4262844.90 frames. 
], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:19:51,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.311e+02 4.494e+02 5.812e+02 9.696e+02 2.134e+03, threshold=1.162e+03, percent-clipped=3.0 2023-06-26 03:20:21,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1480362.0, ans=0.125 2023-06-26 03:21:18,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1480542.0, ans=0.0 2023-06-26 03:21:19,576 INFO [train.py:996] (1/4) Epoch 9, batch 2800, loss[loss=0.2519, simple_loss=0.3449, pruned_loss=0.07948, over 21802.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2947, pruned_loss=0.07028, over 4260874.58 frames. ], batch size: 316, lr: 3.37e-03, grad_scale: 32.0 2023-06-26 03:21:20,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1480542.0, ans=0.125 2023-06-26 03:22:21,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-26 03:22:58,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1480782.0, ans=0.1 2023-06-26 03:23:18,685 INFO [train.py:996] (1/4) Epoch 9, batch 2850, loss[loss=0.2074, simple_loss=0.2972, pruned_loss=0.05881, over 21649.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2981, pruned_loss=0.0727, over 4262251.24 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:23:45,691 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.704e+02 5.417e+02 7.792e+02 1.299e+03 2.553e+03, threshold=1.558e+03, percent-clipped=28.0 2023-06-26 03:24:10,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1480962.0, ans=0.09899494936611666 2023-06-26 03:24:26,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1481022.0, ans=0.1 2023-06-26 03:25:03,425 INFO [train.py:996] (1/4) Epoch 9, batch 2900, loss[loss=0.2005, simple_loss=0.2706, pruned_loss=0.0652, over 21263.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2942, pruned_loss=0.07147, over 4269092.69 frames. ], batch size: 176, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:25:39,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1481202.0, ans=0.0 2023-06-26 03:25:48,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1481262.0, ans=0.1 2023-06-26 03:25:52,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1481262.0, ans=0.125 2023-06-26 03:26:07,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1481322.0, ans=0.125 2023-06-26 03:26:09,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1481322.0, ans=0.02 2023-06-26 03:26:24,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.70 vs. 
limit=6.0 2023-06-26 03:26:30,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1481382.0, ans=0.125 2023-06-26 03:26:53,586 INFO [train.py:996] (1/4) Epoch 9, batch 2950, loss[loss=0.2398, simple_loss=0.3359, pruned_loss=0.07181, over 21687.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2965, pruned_loss=0.07232, over 4279016.30 frames. ], batch size: 389, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:27:21,368 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.507e+02 5.801e+02 9.754e+02 1.778e+03, threshold=1.160e+03, percent-clipped=2.0 2023-06-26 03:27:27,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1481502.0, ans=0.0 2023-06-26 03:27:36,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1481562.0, ans=0.125 2023-06-26 03:27:43,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1481562.0, ans=0.09899494936611666 2023-06-26 03:28:01,420 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-26 03:28:29,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1481682.0, ans=0.125 2023-06-26 03:28:29,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1481682.0, ans=0.125 2023-06-26 03:28:38,620 INFO [train.py:996] (1/4) Epoch 9, batch 3000, loss[loss=0.2336, simple_loss=0.3196, pruned_loss=0.07377, over 21479.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3006, pruned_loss=0.0733, over 4281963.77 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:28:38,620 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 03:29:01,195 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2514, simple_loss=0.3427, pruned_loss=0.08003, over 1796401.00 frames. 2023-06-26 03:29:01,196 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 03:30:48,444 INFO [train.py:996] (1/4) Epoch 9, batch 3050, loss[loss=0.2237, simple_loss=0.2973, pruned_loss=0.07503, over 21766.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2998, pruned_loss=0.07183, over 4280022.69 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:31:00,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. 
limit=15.0 2023-06-26 03:31:09,634 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.684e+02 7.478e+02 1.068e+03 1.857e+03, threshold=1.496e+03, percent-clipped=20.0 2023-06-26 03:31:21,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1482102.0, ans=0.0 2023-06-26 03:31:34,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1482162.0, ans=0.125 2023-06-26 03:31:36,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1482162.0, ans=0.125 2023-06-26 03:32:29,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-26 03:32:42,259 INFO [train.py:996] (1/4) Epoch 9, batch 3100, loss[loss=0.215, simple_loss=0.3239, pruned_loss=0.05303, over 19747.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3002, pruned_loss=0.07046, over 4278752.17 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:32:48,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1482342.0, ans=0.125 2023-06-26 03:34:35,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482642.0, ans=0.1 2023-06-26 03:34:36,173 INFO [train.py:996] (1/4) Epoch 9, batch 3150, loss[loss=0.2808, simple_loss=0.388, pruned_loss=0.08685, over 21184.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3013, pruned_loss=0.07003, over 4276456.87 frames. ], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:34:58,314 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.871e+02 4.385e+02 6.208e+02 9.255e+02 2.149e+03, threshold=1.242e+03, percent-clipped=3.0 2023-06-26 03:35:43,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1482762.0, ans=0.07 2023-06-26 03:36:20,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1482882.0, ans=0.0 2023-06-26 03:36:28,382 INFO [train.py:996] (1/4) Epoch 9, batch 3200, loss[loss=0.214, simple_loss=0.2881, pruned_loss=0.07, over 21229.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3029, pruned_loss=0.07143, over 4275854.58 frames. 
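The "Computing validation loss" and "Maximum memory allocated" records just above correspond to a periodic pass over the dev set. A hedged sketch of what such a pass might look like; the model/loader interface here is a placeholder, and only torch.no_grad and torch.cuda.max_memory_allocated are real PyTorch calls.

```python
import torch

def run_validation(model, dev_loader, device):
    """Return (frame-weighted validation loss, peak GPU memory in MB)."""
    model.eval()
    weighted_sum, num_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, frames = model(batch)  # placeholder interface, not the real signature
            weighted_sum += float(loss) * frames
            num_frames += frames
    model.train()
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    return weighted_sum / num_frames, peak_mb
```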
], batch size: 143, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 03:36:36,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482942.0, ans=0.1 2023-06-26 03:37:05,520 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:37:17,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1483062.0, ans=0.0 2023-06-26 03:37:38,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1483122.0, ans=0.2 2023-06-26 03:37:47,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1483122.0, ans=0.0 2023-06-26 03:38:02,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1483182.0, ans=0.0 2023-06-26 03:38:04,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.02 vs. limit=15.0 2023-06-26 03:38:09,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-26 03:38:13,655 INFO [train.py:996] (1/4) Epoch 9, batch 3250, loss[loss=0.2406, simple_loss=0.3126, pruned_loss=0.08432, over 21761.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3047, pruned_loss=0.07345, over 4281267.90 frames. ], batch size: 124, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:38:21,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1483242.0, ans=0.125 2023-06-26 03:38:41,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1483302.0, ans=0.125 2023-06-26 03:38:47,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.805e+02 4.877e+02 6.649e+02 1.271e+03 2.472e+03, threshold=1.330e+03, percent-clipped=27.0 2023-06-26 03:39:09,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1483362.0, ans=0.125 2023-06-26 03:39:19,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1483362.0, ans=0.125 2023-06-26 03:40:05,658 INFO [train.py:996] (1/4) Epoch 9, batch 3300, loss[loss=0.2305, simple_loss=0.3056, pruned_loss=0.07768, over 20787.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2984, pruned_loss=0.07256, over 4272724.02 frames. ], batch size: 611, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:40:48,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-26 03:40:54,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1483602.0, ans=0.1 2023-06-26 03:41:24,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1483722.0, ans=0.0 2023-06-26 03:41:42,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. 
limit=22.5 2023-06-26 03:41:56,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-26 03:42:03,482 INFO [train.py:996] (1/4) Epoch 9, batch 3350, loss[loss=0.2149, simple_loss=0.2852, pruned_loss=0.07236, over 21503.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3009, pruned_loss=0.07248, over 4276117.81 frames. ], batch size: 194, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:42:12,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1483842.0, ans=0.07 2023-06-26 03:42:18,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1483842.0, ans=0.125 2023-06-26 03:42:24,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-26 03:42:34,552 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:42:35,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1483902.0, ans=0.0 2023-06-26 03:42:36,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.432e+02 5.027e+02 7.904e+02 1.051e+03 2.659e+03, threshold=1.581e+03, percent-clipped=15.0 2023-06-26 03:43:26,441 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-06-26 03:43:58,357 INFO [train.py:996] (1/4) Epoch 9, batch 3400, loss[loss=0.1976, simple_loss=0.2681, pruned_loss=0.06357, over 21279.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3011, pruned_loss=0.07346, over 4275770.50 frames. ], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:44:06,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-26 03:44:30,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1484202.0, ans=0.125 2023-06-26 03:44:54,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1484262.0, ans=0.125 2023-06-26 03:45:12,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1484322.0, ans=0.0 2023-06-26 03:45:51,263 INFO [train.py:996] (1/4) Epoch 9, batch 3450, loss[loss=0.2142, simple_loss=0.2795, pruned_loss=0.07442, over 21539.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2966, pruned_loss=0.07301, over 4262208.07 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:46:02,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1484442.0, ans=0.2 2023-06-26 03:46:15,861 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.98 vs. 
limit=15.0 2023-06-26 03:46:19,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 5.080e+02 7.210e+02 9.972e+02 1.993e+03, threshold=1.442e+03, percent-clipped=4.0 2023-06-26 03:46:47,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1484562.0, ans=0.125 2023-06-26 03:46:55,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-26 03:47:39,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1484682.0, ans=0.015 2023-06-26 03:47:47,816 INFO [train.py:996] (1/4) Epoch 9, batch 3500, loss[loss=0.2494, simple_loss=0.3394, pruned_loss=0.07969, over 21607.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3061, pruned_loss=0.07666, over 4271017.00 frames. ], batch size: 263, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:48:10,708 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:48:58,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1484922.0, ans=0.0 2023-06-26 03:49:37,596 INFO [train.py:996] (1/4) Epoch 9, batch 3550, loss[loss=0.1953, simple_loss=0.2658, pruned_loss=0.06239, over 21860.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3077, pruned_loss=0.07753, over 4274718.66 frames. ], batch size: 372, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:50:05,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.388e+02 4.836e+02 6.336e+02 9.493e+02 2.947e+03, threshold=1.267e+03, percent-clipped=8.0 2023-06-26 03:50:31,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-26 03:50:31,732 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.85 vs. limit=15.0 2023-06-26 03:50:36,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=1485222.0, ans=15.0 2023-06-26 03:51:27,687 INFO [train.py:996] (1/4) Epoch 9, batch 3600, loss[loss=0.1996, simple_loss=0.2655, pruned_loss=0.06689, over 21743.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3019, pruned_loss=0.07643, over 4279005.55 frames. 
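Most ScheduledFloat records above report hyperparameters (dropout probabilities, skip rates, scale minimums) whose value is a function of how many batches have been seen. A minimal piecewise-linear schedule keyed on batch_count captures the idea; the breakpoints below are invented for illustration and are not the recipe's schedules.

```python
def scheduled_float(batch_count: float,
                    points=((0.0, 0.3), (20000.0, 0.1))) -> float:
    """Piecewise-linear value over batch_count, clamped at the end points."""
    (x0, y0), (xn, yn) = points[0], points[-1]
    if batch_count <= x0:
        return y0
    if batch_count >= xn:
        return yn
    for (ax, ay), (bx, by) in zip(points, points[1:]):
        if ax <= batch_count <= bx:
            t = (batch_count - ax) / (bx - ax)
            return ay + t * (by - ay)
    return yn

print(scheduled_float(1485000.0))  # far past the last breakpoint -> 0.1
```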
], batch size: 282, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 03:51:51,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485402.0, ans=0.1 2023-06-26 03:51:57,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1485402.0, ans=0.125 2023-06-26 03:52:09,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1485462.0, ans=0.125 2023-06-26 03:52:22,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1485462.0, ans=0.125 2023-06-26 03:52:52,376 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:53:12,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1485582.0, ans=0.0 2023-06-26 03:53:19,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1485582.0, ans=0.125 2023-06-26 03:53:28,639 INFO [train.py:996] (1/4) Epoch 9, batch 3650, loss[loss=0.191, simple_loss=0.2623, pruned_loss=0.05988, over 21848.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3029, pruned_loss=0.07632, over 4278645.22 frames. ], batch size: 107, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:53:35,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.65 vs. limit=10.0 2023-06-26 03:53:45,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1485702.0, ans=0.09899494936611666 2023-06-26 03:53:53,230 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.365e+02 4.857e+02 6.488e+02 1.037e+03 3.171e+03, threshold=1.298e+03, percent-clipped=18.0 2023-06-26 03:54:11,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1485762.0, ans=0.0 2023-06-26 03:55:19,269 INFO [train.py:996] (1/4) Epoch 9, batch 3700, loss[loss=0.2296, simple_loss=0.3057, pruned_loss=0.07673, over 21852.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3023, pruned_loss=0.0758, over 4286489.31 frames. ], batch size: 332, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:55:53,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1486062.0, ans=0.125 2023-06-26 03:55:57,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1486062.0, ans=0.5 2023-06-26 03:56:51,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486182.0, ans=0.1 2023-06-26 03:57:10,165 INFO [train.py:996] (1/4) Epoch 9, batch 3750, loss[loss=0.2181, simple_loss=0.294, pruned_loss=0.07115, over 21858.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3015, pruned_loss=0.0756, over 4290003.77 frames. 
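The Whitening records compare a per-module metric against a limit (e.g. "metric=9.51 vs. limit=15.0"), i.e. a measure of how far a module's activations are from having a white (isotropic) covariance. The exact metric is not shown in the log; the sketch below uses one plausible choice, the ratio of the largest covariance eigenvalue to the mean eigenvalue, purely as an illustration.

```python
import torch

def whiteness_metric(feats: torch.Tensor) -> float:
    """feats: (num_frames, num_channels). Returns largest-eigenvalue /
    mean-eigenvalue of the feature covariance; roughly 2 for white noise,
    large when a few directions dominate."""
    x = feats - feats.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / max(x.shape[0] - 1, 1)
    eigs = torch.linalg.eigvalsh(cov)
    return float(eigs.max() / eigs.mean().clamp(min=1e-20))

torch.manual_seed(0)
print(whiteness_metric(torch.randn(1000, 256)))  # ~2 for white Gaussian noise
```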
], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:57:26,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1486302.0, ans=0.125 2023-06-26 03:57:35,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.387e+02 4.638e+02 6.369e+02 1.007e+03 1.951e+03, threshold=1.274e+03, percent-clipped=10.0 2023-06-26 03:58:16,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1486362.0, ans=0.125 2023-06-26 03:58:18,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1486422.0, ans=0.125 2023-06-26 03:58:38,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1486422.0, ans=0.125 2023-06-26 03:58:51,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1486482.0, ans=0.0 2023-06-26 03:59:00,822 INFO [train.py:996] (1/4) Epoch 9, batch 3800, loss[loss=0.1853, simple_loss=0.2648, pruned_loss=0.0529, over 21617.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2998, pruned_loss=0.07419, over 4291796.76 frames. ], batch size: 263, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:59:01,879 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=12.0 2023-06-26 03:59:03,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1486542.0, ans=0.025 2023-06-26 03:59:37,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1486602.0, ans=0.125 2023-06-26 04:00:19,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-26 04:00:30,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1486782.0, ans=0.125 2023-06-26 04:00:49,607 INFO [train.py:996] (1/4) Epoch 9, batch 3850, loss[loss=0.1965, simple_loss=0.2683, pruned_loss=0.06234, over 21770.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2969, pruned_loss=0.07403, over 4288752.25 frames. ], batch size: 124, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:01:04,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.73 vs. limit=5.0 2023-06-26 04:01:12,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1486902.0, ans=0.125 2023-06-26 04:01:19,302 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.209e+02 4.310e+02 5.472e+02 7.871e+02 1.774e+03, threshold=1.094e+03, percent-clipped=3.0 2023-06-26 04:02:39,290 INFO [train.py:996] (1/4) Epoch 9, batch 3900, loss[loss=0.195, simple_loss=0.2692, pruned_loss=0.06034, over 21466.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2918, pruned_loss=0.07321, over 4294445.72 frames. ], batch size: 211, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:03:01,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. 
limit=15.0 2023-06-26 04:03:41,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1487262.0, ans=0.125 2023-06-26 04:04:28,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1487442.0, ans=0.035 2023-06-26 04:04:29,554 INFO [train.py:996] (1/4) Epoch 9, batch 3950, loss[loss=0.2037, simple_loss=0.2899, pruned_loss=0.05881, over 21446.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2929, pruned_loss=0.07267, over 4292629.07 frames. ], batch size: 211, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:04:59,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.430e+02 5.269e+02 7.379e+02 1.187e+03 2.051e+03, threshold=1.476e+03, percent-clipped=29.0 2023-06-26 04:05:03,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1487502.0, ans=0.0 2023-06-26 04:06:21,599 INFO [train.py:996] (1/4) Epoch 9, batch 4000, loss[loss=0.1767, simple_loss=0.2439, pruned_loss=0.0548, over 21556.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.286, pruned_loss=0.06834, over 4287216.22 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 04:06:57,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.13 vs. limit=15.0 2023-06-26 04:07:19,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1487862.0, ans=0.125 2023-06-26 04:07:28,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1487862.0, ans=0.125 2023-06-26 04:07:31,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1487922.0, ans=0.1 2023-06-26 04:07:41,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.83 vs. limit=15.0 2023-06-26 04:08:15,055 INFO [train.py:996] (1/4) Epoch 9, batch 4050, loss[loss=0.1982, simple_loss=0.2575, pruned_loss=0.06947, over 21414.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.286, pruned_loss=0.06688, over 4275979.31 frames. ], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:08:15,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1488042.0, ans=0.2 2023-06-26 04:08:54,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.408e+02 5.792e+02 1.027e+03 1.957e+03, threshold=1.158e+03, percent-clipped=6.0 2023-06-26 04:08:59,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1488102.0, ans=0.125 2023-06-26 04:09:24,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488222.0, ans=0.1 2023-06-26 04:09:44,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. 
limit=10.0 2023-06-26 04:09:49,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1488282.0, ans=10.0 2023-06-26 04:09:56,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1488282.0, ans=0.125 2023-06-26 04:10:00,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-26 04:10:06,497 INFO [train.py:996] (1/4) Epoch 9, batch 4100, loss[loss=0.1919, simple_loss=0.2716, pruned_loss=0.05606, over 21234.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2885, pruned_loss=0.06715, over 4275115.39 frames. ], batch size: 143, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:10:48,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1488402.0, ans=0.125 2023-06-26 04:10:57,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-26 04:10:57,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.30 vs. limit=10.0 2023-06-26 04:11:04,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1488462.0, ans=0.125 2023-06-26 04:11:09,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1488462.0, ans=0.125 2023-06-26 04:11:58,926 INFO [train.py:996] (1/4) Epoch 9, batch 4150, loss[loss=0.2625, simple_loss=0.32, pruned_loss=0.1025, over 21373.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2887, pruned_loss=0.0652, over 4278794.92 frames. ], batch size: 507, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:12:13,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1488642.0, ans=0.125 2023-06-26 04:12:15,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1488642.0, ans=0.125 2023-06-26 04:12:42,738 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.917e+02 4.750e+02 6.636e+02 9.716e+02 1.939e+03, threshold=1.327e+03, percent-clipped=13.0 2023-06-26 04:13:27,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488822.0, ans=0.1 2023-06-26 04:13:40,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1488882.0, ans=0.0 2023-06-26 04:13:57,435 INFO [train.py:996] (1/4) Epoch 9, batch 4200, loss[loss=0.1873, simple_loss=0.2665, pruned_loss=0.05402, over 21694.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.288, pruned_loss=0.06483, over 4272081.14 frames. ], batch size: 298, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:14:06,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488942.0, ans=0.1 2023-06-26 04:14:31,774 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. 
limit=10.0 2023-06-26 04:14:53,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1489062.0, ans=0.04949747468305833 2023-06-26 04:15:13,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-26 04:15:56,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-26 04:15:56,536 INFO [train.py:996] (1/4) Epoch 9, batch 4250, loss[loss=0.2401, simple_loss=0.3266, pruned_loss=0.07681, over 21768.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2954, pruned_loss=0.0667, over 4275135.14 frames. ], batch size: 332, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:16:34,528 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.173e+02 6.968e+02 9.905e+02 1.425e+03 3.258e+03, threshold=1.981e+03, percent-clipped=30.0 2023-06-26 04:16:42,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1489362.0, ans=0.2 2023-06-26 04:16:44,888 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-26 04:16:47,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1489362.0, ans=0.2 2023-06-26 04:17:55,659 INFO [train.py:996] (1/4) Epoch 9, batch 4300, loss[loss=0.2263, simple_loss=0.3533, pruned_loss=0.04967, over 19702.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3008, pruned_loss=0.06813, over 4270136.41 frames. ], batch size: 702, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:18:09,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1489542.0, ans=0.125 2023-06-26 04:19:24,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1489722.0, ans=0.125 2023-06-26 04:19:26,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1489722.0, ans=0.0 2023-06-26 04:19:51,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1489842.0, ans=0.125 2023-06-26 04:19:52,224 INFO [train.py:996] (1/4) Epoch 9, batch 4350, loss[loss=0.1846, simple_loss=0.2623, pruned_loss=0.05348, over 21674.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.3009, pruned_loss=0.06792, over 4270437.07 frames. ], batch size: 299, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:20:05,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1489842.0, ans=0.0 2023-06-26 04:20:18,896 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.396e+02 4.613e+02 6.929e+02 1.161e+03 2.829e+03, threshold=1.386e+03, percent-clipped=7.0 2023-06-26 04:20:52,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1489962.0, ans=0.125 2023-06-26 04:20:58,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. 
limit=12.0 2023-06-26 04:21:12,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1490022.0, ans=0.0 2023-06-26 04:21:42,254 INFO [train.py:996] (1/4) Epoch 9, batch 4400, loss[loss=0.2076, simple_loss=0.2793, pruned_loss=0.06793, over 21549.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2978, pruned_loss=0.06796, over 4271412.65 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 04:22:03,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-26 04:22:28,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1490262.0, ans=0.125 2023-06-26 04:23:25,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1490382.0, ans=0.0 2023-06-26 04:23:35,722 INFO [train.py:996] (1/4) Epoch 9, batch 4450, loss[loss=0.2932, simple_loss=0.3766, pruned_loss=0.1049, over 21654.00 frames. ], tot_loss[loss=0.225, simple_loss=0.308, pruned_loss=0.07101, over 4275273.80 frames. ], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:23:45,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1490442.0, ans=0.125 2023-06-26 04:23:46,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1490442.0, ans=0.125 2023-06-26 04:24:03,477 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 5.132e+02 7.510e+02 1.153e+03 2.650e+03, threshold=1.502e+03, percent-clipped=12.0 2023-06-26 04:24:22,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1490562.0, ans=0.2 2023-06-26 04:24:24,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1490562.0, ans=0.0 2023-06-26 04:24:37,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1490562.0, ans=0.125 2023-06-26 04:24:37,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-26 04:24:56,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1490622.0, ans=0.2 2023-06-26 04:25:11,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-26 04:25:19,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1490682.0, ans=0.125 2023-06-26 04:25:25,700 INFO [train.py:996] (1/4) Epoch 9, batch 4500, loss[loss=0.2071, simple_loss=0.2879, pruned_loss=0.06318, over 21210.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3082, pruned_loss=0.07263, over 4283182.70 frames. 
], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:26:25,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1490862.0, ans=0.125 2023-06-26 04:26:25,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1490862.0, ans=0.2 2023-06-26 04:26:36,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1490922.0, ans=0.0 2023-06-26 04:27:14,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1491042.0, ans=0.0 2023-06-26 04:27:15,813 INFO [train.py:996] (1/4) Epoch 9, batch 4550, loss[loss=0.264, simple_loss=0.3471, pruned_loss=0.09048, over 21858.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3112, pruned_loss=0.07319, over 4283266.46 frames. ], batch size: 124, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:27:16,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1491042.0, ans=0.125 2023-06-26 04:27:19,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1491042.0, ans=0.1 2023-06-26 04:27:57,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491102.0, ans=0.1 2023-06-26 04:28:00,501 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.475e+02 4.870e+02 6.557e+02 1.171e+03 3.635e+03, threshold=1.311e+03, percent-clipped=15.0 2023-06-26 04:28:33,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-26 04:29:00,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1491282.0, ans=0.125 2023-06-26 04:29:02,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1491282.0, ans=0.125 2023-06-26 04:29:05,593 INFO [train.py:996] (1/4) Epoch 9, batch 4600, loss[loss=0.2204, simple_loss=0.3049, pruned_loss=0.06796, over 21877.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3123, pruned_loss=0.07415, over 4289707.67 frames. ], batch size: 124, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:29:54,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1491402.0, ans=0.125 2023-06-26 04:29:59,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1491462.0, ans=0.2 2023-06-26 04:30:33,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1491522.0, ans=0.125 2023-06-26 04:30:34,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1491522.0, ans=0.125 2023-06-26 04:31:00,437 INFO [train.py:996] (1/4) Epoch 9, batch 4650, loss[loss=0.1567, simple_loss=0.2317, pruned_loss=0.04083, over 21509.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.305, pruned_loss=0.07154, over 4284244.35 frames. 
], batch size: 195, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:31:00,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1491642.0, ans=0.2 2023-06-26 04:31:12,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=22.5 2023-06-26 04:31:28,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1491702.0, ans=0.0 2023-06-26 04:31:38,992 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.118e+02 4.318e+02 5.535e+02 7.322e+02 1.899e+03, threshold=1.107e+03, percent-clipped=2.0 2023-06-26 04:32:55,323 INFO [train.py:996] (1/4) Epoch 9, batch 4700, loss[loss=0.2308, simple_loss=0.3022, pruned_loss=0.07972, over 20689.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2948, pruned_loss=0.06915, over 4279839.79 frames. ], batch size: 608, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:33:28,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1492002.0, ans=0.2 2023-06-26 04:33:29,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-26 04:33:34,579 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0 2023-06-26 04:33:57,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1492122.0, ans=0.07 2023-06-26 04:34:00,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-26 04:34:38,023 INFO [train.py:996] (1/4) Epoch 9, batch 4750, loss[loss=0.2174, simple_loss=0.2825, pruned_loss=0.07616, over 21340.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2886, pruned_loss=0.0688, over 4279829.90 frames. ], batch size: 159, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:35:04,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1492302.0, ans=0.1 2023-06-26 04:35:04,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1492302.0, ans=0.0 2023-06-26 04:35:16,911 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.450e+02 6.729e+02 1.004e+03 1.717e+03, threshold=1.346e+03, percent-clipped=12.0 2023-06-26 04:35:43,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1492362.0, ans=0.0 2023-06-26 04:36:02,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1492422.0, ans=0.2 2023-06-26 04:36:32,933 INFO [train.py:996] (1/4) Epoch 9, batch 4800, loss[loss=0.2096, simple_loss=0.3072, pruned_loss=0.05601, over 21684.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2897, pruned_loss=0.06969, over 4287562.44 frames. 
], batch size: 298, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 04:36:50,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1492542.0, ans=0.2 2023-06-26 04:37:04,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1492602.0, ans=0.125 2023-06-26 04:37:19,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.07 vs. limit=15.0 2023-06-26 04:37:22,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1492662.0, ans=0.125 2023-06-26 04:38:11,595 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:38:21,211 INFO [train.py:996] (1/4) Epoch 9, batch 4850, loss[loss=0.2238, simple_loss=0.298, pruned_loss=0.07482, over 21316.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2884, pruned_loss=0.06912, over 4286262.40 frames. ], batch size: 143, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:38:21,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1492842.0, ans=0.0 2023-06-26 04:38:30,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1492842.0, ans=0.0 2023-06-26 04:38:56,363 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.280e+02 4.162e+02 5.020e+02 8.423e+02 2.243e+03, threshold=1.004e+03, percent-clipped=7.0 2023-06-26 04:39:15,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1492962.0, ans=0.035 2023-06-26 04:39:56,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1493082.0, ans=0.125 2023-06-26 04:40:11,775 INFO [train.py:996] (1/4) Epoch 9, batch 4900, loss[loss=0.267, simple_loss=0.3525, pruned_loss=0.09076, over 21492.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.291, pruned_loss=0.06999, over 4285515.60 frames. ], batch size: 471, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:41:38,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1493382.0, ans=0.2 2023-06-26 04:42:01,686 INFO [train.py:996] (1/4) Epoch 9, batch 4950, loss[loss=0.2136, simple_loss=0.3173, pruned_loss=0.05492, over 21188.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.294, pruned_loss=0.06809, over 4277205.18 frames. 
], batch size: 548, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:42:21,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1493442.0, ans=0.125 2023-06-26 04:42:23,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1493502.0, ans=0.2 2023-06-26 04:42:30,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1493502.0, ans=0.0 2023-06-26 04:42:42,356 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.982e+02 5.004e+02 7.690e+02 1.209e+03 2.410e+03, threshold=1.538e+03, percent-clipped=31.0 2023-06-26 04:42:42,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1493502.0, ans=0.0 2023-06-26 04:42:47,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-26 04:43:05,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1493622.0, ans=0.035 2023-06-26 04:43:49,256 INFO [train.py:996] (1/4) Epoch 9, batch 5000, loss[loss=0.2127, simple_loss=0.299, pruned_loss=0.06318, over 21849.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2934, pruned_loss=0.06551, over 4282872.84 frames. ], batch size: 371, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:43:49,842 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:44:38,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1493862.0, ans=0.125 2023-06-26 04:45:37,533 INFO [train.py:996] (1/4) Epoch 9, batch 5050, loss[loss=0.2219, simple_loss=0.2953, pruned_loss=0.07421, over 21493.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2931, pruned_loss=0.06739, over 4281244.28 frames. ], batch size: 131, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:45:38,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1494042.0, ans=0.125 2023-06-26 04:46:12,687 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 4.718e+02 6.361e+02 8.600e+02 1.640e+03, threshold=1.272e+03, percent-clipped=2.0 2023-06-26 04:47:26,060 INFO [train.py:996] (1/4) Epoch 9, batch 5100, loss[loss=0.1666, simple_loss=0.2454, pruned_loss=0.04392, over 21657.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2925, pruned_loss=0.06762, over 4278389.84 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:47:28,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1494342.0, ans=0.2 2023-06-26 04:47:48,131 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-26 04:47:59,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1494402.0, ans=0.2 2023-06-26 04:49:08,834 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:49:09,871 INFO [train.py:996] (1/4) Epoch 9, batch 5150, loss[loss=0.23, simple_loss=0.303, pruned_loss=0.07845, over 21897.00 frames. 
], tot_loss[loss=0.2139, simple_loss=0.2915, pruned_loss=0.06811, over 4283632.76 frames. ], batch size: 107, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:49:33,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1494642.0, ans=0.125 2023-06-26 04:49:45,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1494702.0, ans=0.1 2023-06-26 04:49:50,340 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.908e+02 4.572e+02 6.344e+02 1.136e+03 2.635e+03, threshold=1.269e+03, percent-clipped=18.0 2023-06-26 04:50:16,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.98 vs. limit=10.0 2023-06-26 04:50:58,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1494882.0, ans=0.0 2023-06-26 04:51:10,734 INFO [train.py:996] (1/4) Epoch 9, batch 5200, loss[loss=0.2026, simple_loss=0.2819, pruned_loss=0.06166, over 21588.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2957, pruned_loss=0.06876, over 4276730.41 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 04:51:23,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1494942.0, ans=0.125 2023-06-26 04:52:16,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1495122.0, ans=0.125 2023-06-26 04:52:17,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1495122.0, ans=0.125 2023-06-26 04:52:40,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1495182.0, ans=0.0 2023-06-26 04:52:45,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1495182.0, ans=0.125 2023-06-26 04:52:58,585 INFO [train.py:996] (1/4) Epoch 9, batch 5250, loss[loss=0.2747, simple_loss=0.3528, pruned_loss=0.09834, over 21560.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2999, pruned_loss=0.06792, over 4278459.37 frames. ], batch size: 471, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:53:16,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1495302.0, ans=0.0 2023-06-26 04:53:35,845 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.142e+02 4.723e+02 6.704e+02 8.682e+02 1.617e+03, threshold=1.341e+03, percent-clipped=7.0 2023-06-26 04:54:04,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1495422.0, ans=0.07 2023-06-26 04:54:26,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1495482.0, ans=0.125 2023-06-26 04:54:43,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1495482.0, ans=0.2 2023-06-26 04:54:50,684 INFO [train.py:996] (1/4) Epoch 9, batch 5300, loss[loss=0.1801, simple_loss=0.2417, pruned_loss=0.05926, over 20730.00 frames. ], tot_loss[loss=0.218, simple_loss=0.299, pruned_loss=0.06852, over 4278437.58 frames. 
], batch size: 607, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:54:54,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1495542.0, ans=0.0 2023-06-26 04:54:55,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-26 04:55:28,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1495602.0, ans=0.1 2023-06-26 04:56:39,196 INFO [train.py:996] (1/4) Epoch 9, batch 5350, loss[loss=0.2248, simple_loss=0.3513, pruned_loss=0.04909, over 19821.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2978, pruned_loss=0.06978, over 4280989.63 frames. ], batch size: 703, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:56:45,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-26 04:57:15,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 4.386e+02 5.571e+02 7.652e+02 1.743e+03, threshold=1.114e+03, percent-clipped=3.0 2023-06-26 04:57:38,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1495962.0, ans=0.125 2023-06-26 04:58:07,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1496082.0, ans=0.125 2023-06-26 04:58:14,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1496082.0, ans=0.2 2023-06-26 04:58:27,550 INFO [train.py:996] (1/4) Epoch 9, batch 5400, loss[loss=0.2583, simple_loss=0.3054, pruned_loss=0.1056, over 21770.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2971, pruned_loss=0.07089, over 4290255.84 frames. ], batch size: 508, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:00:02,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1496382.0, ans=0.07 2023-06-26 05:00:22,762 INFO [train.py:996] (1/4) Epoch 9, batch 5450, loss[loss=0.2738, simple_loss=0.3795, pruned_loss=0.08405, over 21656.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2972, pruned_loss=0.06985, over 4287719.62 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:00:26,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1496442.0, ans=0.125 2023-06-26 05:00:32,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1496442.0, ans=0.125 2023-06-26 05:00:54,978 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.911e+02 4.664e+02 7.291e+02 1.143e+03 2.963e+03, threshold=1.458e+03, percent-clipped=26.0 2023-06-26 05:01:18,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1496562.0, ans=0.0 2023-06-26 05:02:10,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1496742.0, ans=0.125 2023-06-26 05:02:12,196 INFO [train.py:996] (1/4) Epoch 9, batch 5500, loss[loss=0.2118, simple_loss=0.3147, pruned_loss=0.05449, over 21780.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2999, pruned_loss=0.06648, over 4291467.58 frames. 
], batch size: 371, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:03:05,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1496862.0, ans=0.1 2023-06-26 05:04:02,033 INFO [train.py:996] (1/4) Epoch 9, batch 5550, loss[loss=0.2777, simple_loss=0.3695, pruned_loss=0.09298, over 21489.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.3007, pruned_loss=0.06503, over 4281124.00 frames. ], batch size: 471, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:04:04,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1497042.0, ans=0.125 2023-06-26 05:04:30,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1497102.0, ans=0.0 2023-06-26 05:04:44,535 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 5.816e+02 9.061e+02 1.223e+03 2.185e+03, threshold=1.812e+03, percent-clipped=16.0 2023-06-26 05:04:59,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1497162.0, ans=0.125 2023-06-26 05:05:34,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1497282.0, ans=0.125 2023-06-26 05:05:34,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1497282.0, ans=0.1 2023-06-26 05:05:36,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1497282.0, ans=0.125 2023-06-26 05:05:58,678 INFO [train.py:996] (1/4) Epoch 9, batch 5600, loss[loss=0.2754, simple_loss=0.3717, pruned_loss=0.08959, over 21686.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2993, pruned_loss=0.06311, over 4280753.11 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 05:06:49,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1497462.0, ans=0.0 2023-06-26 05:07:18,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1497522.0, ans=0.125 2023-06-26 05:07:45,546 INFO [train.py:996] (1/4) Epoch 9, batch 5650, loss[loss=0.257, simple_loss=0.3199, pruned_loss=0.097, over 21614.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2993, pruned_loss=0.06446, over 4274804.65 frames. ], batch size: 471, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:08:29,191 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 5.175e+02 8.774e+02 1.262e+03 2.376e+03, threshold=1.755e+03, percent-clipped=8.0 2023-06-26 05:09:05,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1497822.0, ans=0.125 2023-06-26 05:09:41,565 INFO [train.py:996] (1/4) Epoch 9, batch 5700, loss[loss=0.2161, simple_loss=0.2876, pruned_loss=0.07228, over 21292.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2986, pruned_loss=0.06587, over 4281712.14 frames. 
], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:10:12,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1498002.0, ans=0.125 2023-06-26 05:10:47,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-26 05:11:13,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1498122.0, ans=0.125 2023-06-26 05:11:39,516 INFO [train.py:996] (1/4) Epoch 9, batch 5750, loss[loss=0.1503, simple_loss=0.2323, pruned_loss=0.03413, over 21330.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2936, pruned_loss=0.06377, over 4275454.05 frames. ], batch size: 131, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:11:47,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1498242.0, ans=0.09899494936611666 2023-06-26 05:11:56,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1498302.0, ans=0.1 2023-06-26 05:12:05,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1498302.0, ans=0.125 2023-06-26 05:12:07,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1498302.0, ans=0.035 2023-06-26 05:12:12,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1498302.0, ans=0.0 2023-06-26 05:12:18,971 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.313e+02 4.582e+02 6.982e+02 1.089e+03 2.466e+03, threshold=1.396e+03, percent-clipped=2.0 2023-06-26 05:13:05,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1498422.0, ans=0.1 2023-06-26 05:13:21,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1498482.0, ans=0.125 2023-06-26 05:13:31,264 INFO [train.py:996] (1/4) Epoch 9, batch 5800, loss[loss=0.2228, simple_loss=0.3274, pruned_loss=0.05909, over 21673.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2933, pruned_loss=0.06226, over 4270912.07 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:13:31,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1498542.0, ans=0.0 2023-06-26 05:13:48,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1498542.0, ans=0.0 2023-06-26 05:14:40,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1498722.0, ans=0.2 2023-06-26 05:15:27,925 INFO [train.py:996] (1/4) Epoch 9, batch 5850, loss[loss=0.1689, simple_loss=0.2754, pruned_loss=0.03119, over 21767.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2919, pruned_loss=0.05828, over 4274633.32 frames. 
], batch size: 332, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:15:28,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1498842.0, ans=0.0 2023-06-26 05:16:05,867 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.859e+02 4.519e+02 6.797e+02 9.504e+02 2.240e+03, threshold=1.359e+03, percent-clipped=6.0 2023-06-26 05:17:15,174 INFO [train.py:996] (1/4) Epoch 9, batch 5900, loss[loss=0.1705, simple_loss=0.2532, pruned_loss=0.04392, over 21769.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2855, pruned_loss=0.05483, over 4275836.36 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:17:28,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1499142.0, ans=0.1 2023-06-26 05:17:39,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1499202.0, ans=0.0 2023-06-26 05:17:45,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-26 05:17:56,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1499202.0, ans=0.125 2023-06-26 05:18:05,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1499262.0, ans=0.0 2023-06-26 05:18:56,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1499382.0, ans=0.125 2023-06-26 05:19:01,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.32 vs. limit=10.0 2023-06-26 05:19:04,242 INFO [train.py:996] (1/4) Epoch 9, batch 5950, loss[loss=0.1922, simple_loss=0.2625, pruned_loss=0.06097, over 22019.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2841, pruned_loss=0.05652, over 4271569.37 frames. ], batch size: 103, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:19:06,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1499442.0, ans=0.09899494936611666 2023-06-26 05:19:29,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1499502.0, ans=0.0 2023-06-26 05:19:47,179 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.880e+02 4.429e+02 6.642e+02 9.511e+02 2.071e+03, threshold=1.328e+03, percent-clipped=8.0 2023-06-26 05:20:46,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1499682.0, ans=0.2 2023-06-26 05:20:50,645 INFO [train.py:996] (1/4) Epoch 9, batch 6000, loss[loss=0.2022, simple_loss=0.265, pruned_loss=0.06975, over 21798.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2809, pruned_loss=0.05942, over 4275013.21 frames. ], batch size: 124, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 05:20:50,646 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 05:21:11,483 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2616, simple_loss=0.3531, pruned_loss=0.08508, over 1796401.00 frames. 
2023-06-26 05:21:11,485 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 05:21:27,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-26 05:22:03,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1499862.0, ans=0.0 2023-06-26 05:22:09,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1499862.0, ans=0.0 2023-06-26 05:23:08,650 INFO [train.py:996] (1/4) Epoch 9, batch 6050, loss[loss=0.256, simple_loss=0.3614, pruned_loss=0.07532, over 20859.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2767, pruned_loss=0.06034, over 4273850.25 frames. ], batch size: 608, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:23:11,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-26 05:23:45,958 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-26 05:23:48,269 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.915e+02 7.181e+02 1.064e+03 2.049e+03, threshold=1.436e+03, percent-clipped=12.0 2023-06-26 05:23:52,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500162.0, ans=0.1 2023-06-26 05:24:20,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1500222.0, ans=0.125 2023-06-26 05:24:27,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1500282.0, ans=0.0 2023-06-26 05:24:55,915 INFO [train.py:996] (1/4) Epoch 9, batch 6100, loss[loss=0.2181, simple_loss=0.288, pruned_loss=0.07415, over 21579.00 frames. ], tot_loss[loss=0.198, simple_loss=0.276, pruned_loss=0.06005, over 4273404.49 frames. ], batch size: 195, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:24:58,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.10 vs. limit=10.0 2023-06-26 05:25:36,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-26 05:25:38,405 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:26:00,095 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.08 vs. limit=15.0 2023-06-26 05:26:01,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-26 05:26:43,539 INFO [train.py:996] (1/4) Epoch 9, batch 6150, loss[loss=0.2266, simple_loss=0.2975, pruned_loss=0.07783, over 21567.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.279, pruned_loss=0.06174, over 4280739.36 frames. 
], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:27:07,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-26 05:27:22,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1500762.0, ans=0.125 2023-06-26 05:27:23,508 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.733e+02 6.899e+02 9.489e+02 3.075e+03, threshold=1.380e+03, percent-clipped=10.0 2023-06-26 05:27:31,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1500762.0, ans=0.125 2023-06-26 05:27:31,850 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.42 vs. limit=15.0 2023-06-26 05:27:59,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500822.0, ans=0.1 2023-06-26 05:28:05,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1500882.0, ans=0.125 2023-06-26 05:28:32,079 INFO [train.py:996] (1/4) Epoch 9, batch 6200, loss[loss=0.2238, simple_loss=0.3035, pruned_loss=0.07207, over 21865.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2806, pruned_loss=0.06227, over 4276464.43 frames. ], batch size: 107, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:28:46,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1500942.0, ans=0.05 2023-06-26 05:28:46,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1500942.0, ans=0.125 2023-06-26 05:28:50,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1501002.0, ans=0.1 2023-06-26 05:29:44,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1501122.0, ans=0.125 2023-06-26 05:29:47,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1501122.0, ans=0.125 2023-06-26 05:30:16,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1501182.0, ans=0.125 2023-06-26 05:30:21,386 INFO [train.py:996] (1/4) Epoch 9, batch 6250, loss[loss=0.2243, simple_loss=0.3308, pruned_loss=0.05897, over 21644.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2872, pruned_loss=0.06262, over 4282958.34 frames. 
], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:30:35,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1501242.0, ans=0.125 2023-06-26 05:30:39,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1501302.0, ans=0.125 2023-06-26 05:31:01,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.531e+02 5.718e+02 9.151e+02 1.565e+03 3.193e+03, threshold=1.830e+03, percent-clipped=32.0 2023-06-26 05:31:04,011 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2023-06-26 05:31:23,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1501422.0, ans=0.0 2023-06-26 05:32:09,864 INFO [train.py:996] (1/4) Epoch 9, batch 6300, loss[loss=0.2172, simple_loss=0.2947, pruned_loss=0.06983, over 21744.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2915, pruned_loss=0.0623, over 4287937.90 frames. ], batch size: 112, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:32:20,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1501542.0, ans=0.125 2023-06-26 05:32:55,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1501662.0, ans=0.0 2023-06-26 05:33:06,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1501662.0, ans=0.2 2023-06-26 05:33:12,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1501722.0, ans=0.0 2023-06-26 05:33:29,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1501722.0, ans=0.125 2023-06-26 05:33:38,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1501722.0, ans=0.125 2023-06-26 05:33:50,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-26 05:34:00,220 INFO [train.py:996] (1/4) Epoch 9, batch 6350, loss[loss=0.2343, simple_loss=0.3073, pruned_loss=0.08064, over 21436.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2947, pruned_loss=0.06569, over 4289734.82 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:34:43,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1501902.0, ans=0.0 2023-06-26 05:34:51,584 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=15.0 2023-06-26 05:34:52,164 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.804e+02 5.467e+02 7.732e+02 1.098e+03 2.787e+03, threshold=1.546e+03, percent-clipped=5.0 2023-06-26 05:35:26,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1502022.0, ans=0.0 2023-06-26 05:35:33,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. 
limit=10.0 2023-06-26 05:35:54,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1502142.0, ans=0.0 2023-06-26 05:35:55,822 INFO [train.py:996] (1/4) Epoch 9, batch 6400, loss[loss=0.2457, simple_loss=0.3239, pruned_loss=0.08379, over 21448.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3009, pruned_loss=0.06902, over 4286163.00 frames. ], batch size: 211, lr: 3.34e-03, grad_scale: 32.0 2023-06-26 05:36:34,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1502202.0, ans=0.125 2023-06-26 05:36:40,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1502202.0, ans=0.04949747468305833 2023-06-26 05:36:51,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-26 05:37:45,708 INFO [train.py:996] (1/4) Epoch 9, batch 6450, loss[loss=0.243, simple_loss=0.3308, pruned_loss=0.07762, over 21329.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3026, pruned_loss=0.06935, over 4279509.12 frames. ], batch size: 549, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:37:46,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1502442.0, ans=0.125 2023-06-26 05:38:32,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1502562.0, ans=0.2 2023-06-26 05:38:33,803 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.613e+02 5.276e+02 6.947e+02 1.153e+03 2.587e+03, threshold=1.389e+03, percent-clipped=9.0 2023-06-26 05:39:19,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1502682.0, ans=0.2 2023-06-26 05:39:21,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1502682.0, ans=0.125 2023-06-26 05:39:23,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1502682.0, ans=0.1 2023-06-26 05:39:25,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=15.0 2023-06-26 05:39:33,073 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-26 05:39:35,596 INFO [train.py:996] (1/4) Epoch 9, batch 6500, loss[loss=0.2314, simple_loss=0.2851, pruned_loss=0.08885, over 21262.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2966, pruned_loss=0.06846, over 4281501.08 frames. ], batch size: 471, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:40:19,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.47 vs. 
limit=15.0 2023-06-26 05:40:34,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1502862.0, ans=0.0 2023-06-26 05:40:37,433 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:41:20,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1502982.0, ans=0.2 2023-06-26 05:41:30,777 INFO [train.py:996] (1/4) Epoch 9, batch 6550, loss[loss=0.2154, simple_loss=0.2819, pruned_loss=0.07442, over 21139.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2966, pruned_loss=0.06798, over 4281579.33 frames. ], batch size: 143, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:41:43,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.57 vs. limit=22.5 2023-06-26 05:42:05,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1503102.0, ans=0.0 2023-06-26 05:42:09,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1503162.0, ans=0.1 2023-06-26 05:42:19,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 4.890e+02 6.578e+02 1.052e+03 2.225e+03, threshold=1.316e+03, percent-clipped=12.0 2023-06-26 05:42:32,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1503222.0, ans=0.2 2023-06-26 05:42:32,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1503222.0, ans=0.2 2023-06-26 05:43:12,676 INFO [train.py:996] (1/4) Epoch 9, batch 6600, loss[loss=0.1721, simple_loss=0.2309, pruned_loss=0.05669, over 21229.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2906, pruned_loss=0.06695, over 4271251.04 frames. ], batch size: 548, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:44:36,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-26 05:44:38,720 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:44:51,678 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0 2023-06-26 05:45:04,849 INFO [train.py:996] (1/4) Epoch 9, batch 6650, loss[loss=0.1722, simple_loss=0.241, pruned_loss=0.05173, over 21538.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2855, pruned_loss=0.06401, over 4274250.07 frames. ], batch size: 230, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:45:14,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-26 05:45:20,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.49 vs. 
limit=10.0 2023-06-26 05:45:50,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1503762.0, ans=0.125 2023-06-26 05:45:53,449 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.932e+02 4.741e+02 6.331e+02 9.151e+02 2.148e+03, threshold=1.266e+03, percent-clipped=9.0 2023-06-26 05:46:30,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1503882.0, ans=0.05 2023-06-26 05:46:51,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1503882.0, ans=10.0 2023-06-26 05:46:54,052 INFO [train.py:996] (1/4) Epoch 9, batch 6700, loss[loss=0.1957, simple_loss=0.2555, pruned_loss=0.06797, over 21273.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2811, pruned_loss=0.06392, over 4275210.30 frames. ], batch size: 144, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:47:46,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1504062.0, ans=0.1 2023-06-26 05:48:26,311 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:48:30,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1504182.0, ans=0.2 2023-06-26 05:48:36,342 INFO [train.py:996] (1/4) Epoch 9, batch 6750, loss[loss=0.2123, simple_loss=0.2855, pruned_loss=0.06957, over 21845.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2793, pruned_loss=0.06472, over 4271918.81 frames. ], batch size: 371, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:48:53,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1504242.0, ans=0.125 2023-06-26 05:48:54,149 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.46 vs. limit=10.0 2023-06-26 05:49:31,082 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.439e+02 4.588e+02 6.610e+02 8.394e+02 1.640e+03, threshold=1.322e+03, percent-clipped=2.0 2023-06-26 05:49:52,652 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.22 vs. limit=6.0 2023-06-26 05:50:04,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1504422.0, ans=0.07 2023-06-26 05:50:25,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1504482.0, ans=0.04949747468305833 2023-06-26 05:50:29,549 INFO [train.py:996] (1/4) Epoch 9, batch 6800, loss[loss=0.2294, simple_loss=0.2958, pruned_loss=0.08152, over 21756.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2823, pruned_loss=0.06685, over 4278768.71 frames. ], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:50:32,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1504542.0, ans=0.04949747468305833 2023-06-26 05:50:42,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.55 vs. 
limit=15.0 2023-06-26 05:50:46,109 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:51:15,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1504662.0, ans=0.05 2023-06-26 05:51:48,753 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:51:49,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=22.5 2023-06-26 05:52:04,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1504782.0, ans=0.1 2023-06-26 05:52:16,595 INFO [train.py:996] (1/4) Epoch 9, batch 6850, loss[loss=0.2677, simple_loss=0.3095, pruned_loss=0.1129, over 21714.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2803, pruned_loss=0.06818, over 4289117.57 frames. ], batch size: 508, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:52:18,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1504842.0, ans=0.0 2023-06-26 05:52:18,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1504842.0, ans=0.2 2023-06-26 05:52:27,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1504842.0, ans=0.125 2023-06-26 05:52:39,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1504902.0, ans=0.125 2023-06-26 05:53:05,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.496e+02 4.685e+02 7.280e+02 1.216e+03 2.418e+03, threshold=1.456e+03, percent-clipped=17.0 2023-06-26 05:53:32,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-26 05:53:46,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1505082.0, ans=0.0 2023-06-26 05:53:53,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-26 05:54:02,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1505082.0, ans=0.09899494936611666 2023-06-26 05:54:03,393 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2023-06-26 05:54:05,601 INFO [train.py:996] (1/4) Epoch 9, batch 6900, loss[loss=0.2037, simple_loss=0.2893, pruned_loss=0.05899, over 21819.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2809, pruned_loss=0.06767, over 4293411.56 frames. 
], batch size: 414, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:54:50,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1505262.0, ans=0.0 2023-06-26 05:54:57,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1505262.0, ans=0.05 2023-06-26 05:54:57,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-26 05:55:44,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1505382.0, ans=10.0 2023-06-26 05:55:54,118 INFO [train.py:996] (1/4) Epoch 9, batch 6950, loss[loss=0.2077, simple_loss=0.2868, pruned_loss=0.06429, over 21252.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2822, pruned_loss=0.06488, over 4290987.29 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:56:09,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-26 05:56:24,893 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:56:28,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1505502.0, ans=0.125 2023-06-26 05:56:43,253 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.242e+02 5.032e+02 6.537e+02 9.718e+02 2.265e+03, threshold=1.307e+03, percent-clipped=8.0 2023-06-26 05:57:18,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1505682.0, ans=0.0 2023-06-26 05:57:42,923 INFO [train.py:996] (1/4) Epoch 9, batch 7000, loss[loss=0.2093, simple_loss=0.2781, pruned_loss=0.07025, over 21214.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2834, pruned_loss=0.06675, over 4286902.26 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:58:17,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1505802.0, ans=0.125 2023-06-26 05:59:04,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1505922.0, ans=0.0 2023-06-26 05:59:23,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.65 vs. limit=10.0 2023-06-26 05:59:38,682 INFO [train.py:996] (1/4) Epoch 9, batch 7050, loss[loss=0.1916, simple_loss=0.278, pruned_loss=0.05259, over 21611.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2823, pruned_loss=0.06569, over 4280418.71 frames. ], batch size: 263, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:00:27,418 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.188e+02 4.829e+02 6.611e+02 8.594e+02 1.864e+03, threshold=1.322e+03, percent-clipped=11.0 2023-06-26 06:01:26,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1506342.0, ans=0.125 2023-06-26 06:01:33,082 INFO [train.py:996] (1/4) Epoch 9, batch 7100, loss[loss=0.1698, simple_loss=0.2513, pruned_loss=0.0441, over 21718.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2867, pruned_loss=0.0671, over 4277492.84 frames. 
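The tot_loss[... over N frames] part of each record grows its frame count from batch to batch (4.27M, 4.28M, ...), so it reads as a running average in which every batch is weighted by its number of acoustic frames. A small sketch of that aggregation; whether the recipe resets the sums periodically or applies a decay is an assumption, not something the log states.

class FrameWeightedAverage:
    """Running frame-weighted average, the assumed bookkeeping behind tot_loss[...]."""

    def __init__(self) -> None:
        self.weighted_sum = 0.0
        self.num_frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.weighted_sum += batch_loss * batch_frames
        self.num_frames += batch_frames

    @property
    def value(self) -> float:
        return self.weighted_sum / max(self.num_frames, 1.0)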
], batch size: 247, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:01:39,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-26 06:01:39,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-26 06:01:50,909 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:02:06,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1506402.0, ans=0.0 2023-06-26 06:02:23,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1506462.0, ans=0.125 2023-06-26 06:02:33,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1506522.0, ans=0.125 2023-06-26 06:03:22,334 INFO [train.py:996] (1/4) Epoch 9, batch 7150, loss[loss=0.2263, simple_loss=0.3102, pruned_loss=0.07115, over 21548.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.287, pruned_loss=0.06613, over 4279372.95 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:03:32,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1506642.0, ans=0.125 2023-06-26 06:03:47,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1506702.0, ans=0.125 2023-06-26 06:04:06,068 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.994e+02 4.588e+02 6.424e+02 8.469e+02 2.110e+03, threshold=1.285e+03, percent-clipped=2.0 2023-06-26 06:04:09,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1506762.0, ans=0.125 2023-06-26 06:04:11,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1506762.0, ans=0.2 2023-06-26 06:05:11,695 INFO [train.py:996] (1/4) Epoch 9, batch 7200, loss[loss=0.2422, simple_loss=0.3025, pruned_loss=0.0909, over 21313.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2883, pruned_loss=0.06739, over 4267856.93 frames. ], batch size: 471, lr: 3.34e-03, grad_scale: 32.0 2023-06-26 06:05:40,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1507002.0, ans=0.0 2023-06-26 06:05:41,001 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.24 vs. 
limit=6.0 2023-06-26 06:05:51,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1507062.0, ans=0.125 2023-06-26 06:05:57,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1507062.0, ans=0.125 2023-06-26 06:05:58,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1507062.0, ans=0.125 2023-06-26 06:06:29,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1507122.0, ans=0.125 2023-06-26 06:06:47,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1507182.0, ans=0.0 2023-06-26 06:07:00,447 INFO [train.py:996] (1/4) Epoch 9, batch 7250, loss[loss=0.2199, simple_loss=0.2864, pruned_loss=0.07677, over 14994.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2832, pruned_loss=0.06761, over 4257735.24 frames. ], batch size: 60, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:07:14,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1507242.0, ans=0.0 2023-06-26 06:07:45,473 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.193e+02 5.249e+02 7.377e+02 1.151e+03 2.707e+03, threshold=1.475e+03, percent-clipped=23.0 2023-06-26 06:08:00,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1507362.0, ans=0.125 2023-06-26 06:08:00,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1507362.0, ans=0.2 2023-06-26 06:08:03,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1507422.0, ans=0.04949747468305833 2023-06-26 06:08:48,820 INFO [train.py:996] (1/4) Epoch 9, batch 7300, loss[loss=0.2015, simple_loss=0.2602, pruned_loss=0.07141, over 21383.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2769, pruned_loss=0.06651, over 4265976.13 frames. ], batch size: 144, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:10:01,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-26 06:10:04,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1507722.0, ans=0.125 2023-06-26 06:10:44,111 INFO [train.py:996] (1/4) Epoch 9, batch 7350, loss[loss=0.2544, simple_loss=0.3323, pruned_loss=0.0883, over 21827.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2772, pruned_loss=0.06796, over 4269766.65 frames. 
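The many scaling.py "ScheduledFloat: name=..., batch_count=..., ans=..." records report hyper-parameters (skip rates, balancer probabilities, dropout rates, bypass scale minima) whose values are scheduled as a function of the global batch count; the ans field is the value in effect at that batch. A minimal piecewise-linear sketch of such a schedule; the breakpoints below are illustrative only and not the values behind any particular parameter in this run.

from typing import Sequence, Tuple

def scheduled_float(batch_count: float,
                    schedule: Sequence[Tuple[float, float]]) -> float:
    """Piecewise-linear value of a scheduled hyper-parameter.

    schedule is a sorted list of (batch_count, value) breakpoints; the value is
    held constant outside the covered range.
    """
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return schedule[-1][1]

# e.g. a skip rate decaying from 0.5 to 0.0 over the first 20k batches has
# long since reached 0.0 by batch_count=1507062:
# scheduled_float(1507062.0, [(0.0, 0.5), (20000.0, 0.0)]) -> 0.0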
], batch size: 118, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:11:08,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1507902.0, ans=15.0 2023-06-26 06:11:08,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1507902.0, ans=10.0 2023-06-26 06:11:30,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.284e+02 4.727e+02 6.627e+02 9.690e+02 1.819e+03, threshold=1.325e+03, percent-clipped=8.0 2023-06-26 06:12:34,171 INFO [train.py:996] (1/4) Epoch 9, batch 7400, loss[loss=0.2085, simple_loss=0.2851, pruned_loss=0.06598, over 21427.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2848, pruned_loss=0.07014, over 4275746.14 frames. ], batch size: 211, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:13:48,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.03 vs. limit=12.0 2023-06-26 06:13:50,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1508322.0, ans=0.125 2023-06-26 06:13:55,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-26 06:14:09,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1508382.0, ans=0.0 2023-06-26 06:14:25,304 INFO [train.py:996] (1/4) Epoch 9, batch 7450, loss[loss=0.2191, simple_loss=0.2895, pruned_loss=0.07432, over 21834.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2827, pruned_loss=0.0686, over 4277824.92 frames. ], batch size: 98, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:15:23,589 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.343e+02 4.976e+02 6.577e+02 1.050e+03 2.324e+03, threshold=1.315e+03, percent-clipped=17.0 2023-06-26 06:15:44,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1508622.0, ans=0.1 2023-06-26 06:15:52,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1508622.0, ans=0.1 2023-06-26 06:16:18,122 INFO [train.py:996] (1/4) Epoch 9, batch 7500, loss[loss=0.2325, simple_loss=0.3132, pruned_loss=0.07593, over 21385.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2886, pruned_loss=0.07017, over 4282751.87 frames. ], batch size: 211, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:16:42,754 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.55 vs. limit=15.0 2023-06-26 06:17:50,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1508982.0, ans=0.0 2023-06-26 06:17:56,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1508982.0, ans=0.125 2023-06-26 06:18:08,922 INFO [train.py:996] (1/4) Epoch 9, batch 7550, loss[loss=0.1712, simple_loss=0.2477, pruned_loss=0.04742, over 16387.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2943, pruned_loss=0.06954, over 4277420.25 frames. 
], batch size: 61, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:19:04,851 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 6.002e+02 8.588e+02 1.350e+03 2.877e+03, threshold=1.718e+03, percent-clipped=25.0 2023-06-26 06:19:07,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1509162.0, ans=0.125 2023-06-26 06:19:13,656 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-26 06:19:19,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1509222.0, ans=0.0 2023-06-26 06:19:56,674 INFO [train.py:996] (1/4) Epoch 9, batch 7600, loss[loss=0.2102, simple_loss=0.287, pruned_loss=0.06672, over 21881.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2938, pruned_loss=0.06948, over 4272897.63 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 06:21:46,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.19 vs. limit=15.0 2023-06-26 06:21:46,200 INFO [train.py:996] (1/4) Epoch 9, batch 7650, loss[loss=0.2229, simple_loss=0.2995, pruned_loss=0.0732, over 21864.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2921, pruned_loss=0.07042, over 4277869.25 frames. ], batch size: 124, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:22:16,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1509702.0, ans=0.125 2023-06-26 06:22:22,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.17 vs. limit=12.0 2023-06-26 06:22:42,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1509762.0, ans=0.0 2023-06-26 06:22:43,950 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 5.104e+02 7.952e+02 1.146e+03 1.972e+03, threshold=1.590e+03, percent-clipped=6.0 2023-06-26 06:22:58,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1509822.0, ans=0.015 2023-06-26 06:22:58,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1509822.0, ans=0.1 2023-06-26 06:23:23,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1509882.0, ans=0.125 2023-06-26 06:23:26,881 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:23:40,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-26 06:23:41,251 INFO [train.py:996] (1/4) Epoch 9, batch 7700, loss[loss=0.1871, simple_loss=0.244, pruned_loss=0.06505, over 21058.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2946, pruned_loss=0.0726, over 4284788.58 frames. 
], batch size: 608, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:23:45,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1509942.0, ans=0.0 2023-06-26 06:24:39,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1510062.0, ans=0.07 2023-06-26 06:24:47,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1510062.0, ans=0.0 2023-06-26 06:25:06,835 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:25:33,211 INFO [train.py:996] (1/4) Epoch 9, batch 7750, loss[loss=0.1994, simple_loss=0.2666, pruned_loss=0.06613, over 20227.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2987, pruned_loss=0.07201, over 4282763.57 frames. ], batch size: 702, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:26:32,278 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 5.408e+02 8.578e+02 1.362e+03 2.742e+03, threshold=1.716e+03, percent-clipped=14.0 2023-06-26 06:27:33,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-26 06:27:34,361 INFO [train.py:996] (1/4) Epoch 9, batch 7800, loss[loss=0.2364, simple_loss=0.315, pruned_loss=0.07892, over 21548.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3008, pruned_loss=0.07252, over 4285151.37 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:27:34,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1510542.0, ans=0.0 2023-06-26 06:28:04,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1510602.0, ans=0.0 2023-06-26 06:28:31,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1510722.0, ans=0.2 2023-06-26 06:28:49,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1510722.0, ans=0.0 2023-06-26 06:29:14,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-26 06:29:20,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1510782.0, ans=0.125 2023-06-26 06:29:23,904 INFO [train.py:996] (1/4) Epoch 9, batch 7850, loss[loss=0.1892, simple_loss=0.255, pruned_loss=0.06173, over 21414.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2942, pruned_loss=0.07121, over 4283897.55 frames. ], batch size: 195, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:29:46,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1510902.0, ans=0.09899494936611666 2023-06-26 06:30:12,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.902e+02 7.462e+02 1.114e+03 2.139e+03, threshold=1.492e+03, percent-clipped=5.0 2023-06-26 06:31:14,990 INFO [train.py:996] (1/4) Epoch 9, batch 7900, loss[loss=0.2645, simple_loss=0.3686, pruned_loss=0.0802, over 21639.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2887, pruned_loss=0.06914, over 4284675.01 frames. 
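The scaling.py "Whitening: name=..., num_groups=..., num_channels=..., metric=X vs. limit=Y" records compare a whiteness statistic of a module's activations against a limit, with a corrective term applied only when the limit is exceeded. One plausible reconstruction of the statistic, offered as a reading aid rather than as the project's actual scaling.py: per channel group, form the covariance C of the activations and report num_channels * sum(C**2) / trace(C)**2, which equals 1.0 for a perfectly isotropic ("white") covariance and approaches the channel count when the energy collapses onto one direction; logged values such as 13.80 vs. limit 15.0 fall in that range.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Assumed whiteness statistic for activations x of shape (num_frames, num_channels)."""
    num_frames, num_channels = x.shape
    group_size = num_channels // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * group_size:(g + 1) * group_size]
        c = xg.t() @ xg / num_frames                  # (group_size, group_size) covariance
        frob_sq = (c * c).sum()                       # sum of squared eigenvalues
        trace_sq = torch.diagonal(c).sum() ** 2       # (sum of eigenvalues) squared
        metrics.append(group_size * frob_sq / (trace_sq + 1e-20))
    return float(torch.stack(metrics).mean())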
], batch size: 414, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:31:48,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1511202.0, ans=0.125 2023-06-26 06:32:55,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1511382.0, ans=0.2 2023-06-26 06:33:07,271 INFO [train.py:996] (1/4) Epoch 9, batch 7950, loss[loss=0.2725, simple_loss=0.3439, pruned_loss=0.1005, over 21756.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2912, pruned_loss=0.06912, over 4282971.12 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:33:11,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1511442.0, ans=0.125 2023-06-26 06:34:02,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.829e+02 6.422e+02 9.281e+02 1.330e+03 3.368e+03, threshold=1.856e+03, percent-clipped=18.0 2023-06-26 06:34:20,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1511622.0, ans=0.0 2023-06-26 06:35:05,371 INFO [train.py:996] (1/4) Epoch 9, batch 8000, loss[loss=0.2465, simple_loss=0.3384, pruned_loss=0.07733, over 21303.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2967, pruned_loss=0.07105, over 4280418.35 frames. ], batch size: 548, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:35:27,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1511802.0, ans=0.0 2023-06-26 06:36:02,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1511862.0, ans=0.025 2023-06-26 06:36:51,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=15.0 2023-06-26 06:37:01,458 INFO [train.py:996] (1/4) Epoch 9, batch 8050, loss[loss=0.19, simple_loss=0.2433, pruned_loss=0.0684, over 21837.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3015, pruned_loss=0.07245, over 4279989.48 frames. ], batch size: 107, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:37:44,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-26 06:38:00,763 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:38:01,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.472e+02 6.276e+02 8.546e+02 1.348e+03 3.651e+03, threshold=1.709e+03, percent-clipped=15.0 2023-06-26 06:38:51,623 INFO [train.py:996] (1/4) Epoch 9, batch 8100, loss[loss=0.1899, simple_loss=0.2615, pruned_loss=0.05911, over 21822.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2995, pruned_loss=0.07286, over 4282084.51 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:40:34,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1512582.0, ans=0.1 2023-06-26 06:40:58,164 INFO [train.py:996] (1/4) Epoch 9, batch 8150, loss[loss=0.2067, simple_loss=0.287, pruned_loss=0.0632, over 19940.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3069, pruned_loss=0.07432, over 4279938.83 frames. 
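The grad_scale value in each batch header (32.0, 16.0, 8.0, ...) is the loss-scaling factor of mixed-precision (use_fp16) training: it is halved when a scaled gradient overflows and allowed to grow back after a run of successful steps, which is why it drifts between powers of two across the log. A generic torch.cuda.amp sketch of that loop; model, optimizer and the forward call are placeholders rather than this recipe's objects.

import torch

def fp16_training_step(model, optimizer, scaler, features, targets):
    """One mixed-precision step; scaler.get_scale() is the grad_scale seen in the log."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(features, targets)   # placeholder forward returning a scalar loss
    scaler.scale(loss).backward()         # backward through the scaled loss
    scaler.step(optimizer)                # skips the update if gradients overflowed
    scaler.update()                       # halves the scale on overflow, grows it otherwise
    return scaler.get_scale()

# scaler = torch.cuda.amp.GradScaler(enabled=True)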
], batch size: 703, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:41:25,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1512702.0, ans=0.0 2023-06-26 06:41:54,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.538e+02 6.819e+02 1.034e+03 1.568e+03 4.387e+03, threshold=2.069e+03, percent-clipped=18.0 2023-06-26 06:42:00,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1512822.0, ans=0.125 2023-06-26 06:42:09,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-26 06:42:21,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1512822.0, ans=0.125 2023-06-26 06:42:49,111 INFO [train.py:996] (1/4) Epoch 9, batch 8200, loss[loss=0.2034, simple_loss=0.2611, pruned_loss=0.07278, over 21665.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2987, pruned_loss=0.07071, over 4274202.56 frames. ], batch size: 248, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:43:08,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1512942.0, ans=0.125 2023-06-26 06:43:20,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0 2023-06-26 06:44:40,641 INFO [train.py:996] (1/4) Epoch 9, batch 8250, loss[loss=0.2581, simple_loss=0.3756, pruned_loss=0.07036, over 20816.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2985, pruned_loss=0.07113, over 4273746.65 frames. ], batch size: 607, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:45:36,648 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.396e+02 4.867e+02 7.289e+02 1.042e+03 1.970e+03, threshold=1.458e+03, percent-clipped=0.0 2023-06-26 06:46:35,301 INFO [train.py:996] (1/4) Epoch 9, batch 8300, loss[loss=0.2427, simple_loss=0.327, pruned_loss=0.07927, over 21654.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2979, pruned_loss=0.06899, over 4281579.65 frames. ], batch size: 414, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:47:14,537 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.90 vs. limit=22.5 2023-06-26 06:47:41,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-26 06:48:04,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513782.0, ans=0.1 2023-06-26 06:48:25,379 INFO [train.py:996] (1/4) Epoch 9, batch 8350, loss[loss=0.2213, simple_loss=0.2925, pruned_loss=0.07503, over 21333.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2961, pruned_loss=0.06738, over 4271088.08 frames. ], batch size: 471, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:49:22,434 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.156e+02 5.177e+02 7.489e+02 1.153e+03 2.858e+03, threshold=1.498e+03, percent-clipped=11.0 2023-06-26 06:50:14,400 INFO [train.py:996] (1/4) Epoch 9, batch 8400, loss[loss=0.1808, simple_loss=0.2704, pruned_loss=0.04566, over 21671.00 frames. 
], tot_loss[loss=0.212, simple_loss=0.2939, pruned_loss=0.0651, over 4263396.56 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:50:51,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1514202.0, ans=0.1 2023-06-26 06:50:59,898 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=2.752e-03 2023-06-26 06:51:40,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1514382.0, ans=0.125 2023-06-26 06:51:55,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1514382.0, ans=0.125 2023-06-26 06:51:55,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1514382.0, ans=0.125 2023-06-26 06:52:01,966 INFO [train.py:996] (1/4) Epoch 9, batch 8450, loss[loss=0.2142, simple_loss=0.2871, pruned_loss=0.07066, over 21898.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2918, pruned_loss=0.06457, over 4273353.86 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:52:06,349 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2023-06-26 06:52:07,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1514442.0, ans=0.1 2023-06-26 06:52:10,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1514442.0, ans=0.0 2023-06-26 06:52:29,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1514502.0, ans=0.0 2023-06-26 06:52:43,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1514562.0, ans=0.125 2023-06-26 06:52:49,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1514562.0, ans=0.0 2023-06-26 06:52:58,302 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.897e+02 4.182e+02 5.654e+02 7.712e+02 3.428e+03, threshold=1.131e+03, percent-clipped=11.0 2023-06-26 06:53:09,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1514622.0, ans=0.0 2023-06-26 06:53:44,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1514682.0, ans=0.0 2023-06-26 06:53:51,632 INFO [train.py:996] (1/4) Epoch 9, batch 8500, loss[loss=0.2257, simple_loss=0.2842, pruned_loss=0.08364, over 21529.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2891, pruned_loss=0.06531, over 4270475.88 frames. 
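The "WithLoss: name=...self_attn_weights, loss-sum=..." records report an auxiliary penalty attached to a tensor (here the self-attention weights); its value contributes to the training objective, and its sum over the logging interval is what appears as loss-sum, which stays at 0 whenever nothing in the batch triggered it (as in most records, versus the 2.752e-03 above). A rough, assumed sketch of such a penalty; the specific form (penalizing magnitudes above a limit) and the limit value are guesses, not taken from the recipe.

import torch

def excess_magnitude_penalty(scores: torch.Tensor, limit: float = 25.0) -> torch.Tensor:
    """Assumed auxiliary penalty on attention scores: zero while every value stays
    inside [-limit, limit], otherwise proportional to the total excess. Summed over
    a logging interval, this would correspond to the loss-sum field."""
    excess = (scores.abs() - limit).clamp(min=0.0)
    return excess.sum()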
], batch size: 441, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:53:52,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1514742.0, ans=0.125 2023-06-26 06:54:05,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1514742.0, ans=0.1 2023-06-26 06:54:12,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1514742.0, ans=0.125 2023-06-26 06:54:19,285 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.86 vs. limit=5.0 2023-06-26 06:55:24,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1514982.0, ans=0.125 2023-06-26 06:55:42,992 INFO [train.py:996] (1/4) Epoch 9, batch 8550, loss[loss=0.2105, simple_loss=0.3017, pruned_loss=0.05961, over 21738.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2925, pruned_loss=0.06749, over 4273012.73 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:56:40,472 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.388e+02 5.673e+02 9.028e+02 1.285e+03 2.973e+03, threshold=1.806e+03, percent-clipped=33.0 2023-06-26 06:57:02,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1515222.0, ans=0.025 2023-06-26 06:57:22,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1515282.0, ans=0.07 2023-06-26 06:57:27,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1515282.0, ans=0.125 2023-06-26 06:57:34,132 INFO [train.py:996] (1/4) Epoch 9, batch 8600, loss[loss=0.2755, simple_loss=0.3526, pruned_loss=0.09915, over 21396.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2995, pruned_loss=0.06918, over 4272515.74 frames. ], batch size: 131, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:57:58,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1515402.0, ans=0.125 2023-06-26 06:59:25,184 INFO [train.py:996] (1/4) Epoch 9, batch 8650, loss[loss=0.1922, simple_loss=0.2914, pruned_loss=0.04649, over 21634.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3053, pruned_loss=0.07038, over 4279918.20 frames. 
], batch size: 263, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:59:51,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1515702.0, ans=0.125 2023-06-26 07:00:10,434 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:00:25,176 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.107e+02 4.849e+02 6.283e+02 8.957e+02 2.012e+03, threshold=1.257e+03, percent-clipped=3.0 2023-06-26 07:00:55,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1515882.0, ans=0.2 2023-06-26 07:00:58,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1515882.0, ans=0.0 2023-06-26 07:01:11,955 INFO [train.py:996] (1/4) Epoch 9, batch 8700, loss[loss=0.1631, simple_loss=0.2289, pruned_loss=0.04866, over 21477.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2952, pruned_loss=0.0675, over 4276172.39 frames. ], batch size: 195, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:02:42,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1516182.0, ans=0.125 2023-06-26 07:02:54,686 INFO [train.py:996] (1/4) Epoch 9, batch 8750, loss[loss=0.2358, simple_loss=0.2992, pruned_loss=0.08615, over 21165.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2907, pruned_loss=0.06785, over 4282462.30 frames. ], batch size: 176, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:03:15,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1516242.0, ans=0.125 2023-06-26 07:04:02,675 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 4.871e+02 5.858e+02 9.020e+02 2.163e+03, threshold=1.172e+03, percent-clipped=9.0 2023-06-26 07:04:17,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1516422.0, ans=0.125 2023-06-26 07:04:51,204 INFO [train.py:996] (1/4) Epoch 9, batch 8800, loss[loss=0.2499, simple_loss=0.3535, pruned_loss=0.0732, over 19822.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3023, pruned_loss=0.0712, over 4282117.38 frames. ], batch size: 702, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 07:06:22,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1516782.0, ans=0.0 2023-06-26 07:06:46,512 INFO [train.py:996] (1/4) Epoch 9, batch 8850, loss[loss=0.2039, simple_loss=0.2874, pruned_loss=0.06017, over 21684.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3092, pruned_loss=0.07309, over 4277523.24 frames. 
], batch size: 298, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 07:06:47,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1516842.0, ans=0.0 2023-06-26 07:06:50,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1516842.0, ans=0.2 2023-06-26 07:07:00,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1516842.0, ans=0.125 2023-06-26 07:07:43,090 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.458e+02 5.018e+02 7.490e+02 1.008e+03 2.036e+03, threshold=1.498e+03, percent-clipped=19.0 2023-06-26 07:07:43,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1516962.0, ans=0.1 2023-06-26 07:07:51,005 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.75 vs. limit=22.5 2023-06-26 07:08:37,006 INFO [train.py:996] (1/4) Epoch 9, batch 8900, loss[loss=0.209, simple_loss=0.2974, pruned_loss=0.06028, over 21734.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3057, pruned_loss=0.07156, over 4272934.26 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:08:47,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1517142.0, ans=0.125 2023-06-26 07:08:47,847 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-06-26 07:09:48,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1517322.0, ans=0.0 2023-06-26 07:09:48,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1517322.0, ans=0.125 2023-06-26 07:10:34,128 INFO [train.py:996] (1/4) Epoch 9, batch 8950, loss[loss=0.206, simple_loss=0.2868, pruned_loss=0.0626, over 21685.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3074, pruned_loss=0.07165, over 4271370.27 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:10:52,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1517442.0, ans=0.04949747468305833 2023-06-26 07:10:56,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1517502.0, ans=0.2 2023-06-26 07:11:31,713 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.717e+02 6.385e+02 1.007e+03 1.831e+03 3.231e+03, threshold=2.014e+03, percent-clipped=34.0 2023-06-26 07:11:51,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1517622.0, ans=0.1 2023-06-26 07:12:29,547 INFO [train.py:996] (1/4) Epoch 9, batch 9000, loss[loss=0.1998, simple_loss=0.2661, pruned_loss=0.06678, over 21584.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3005, pruned_loss=0.07109, over 4270621.85 frames. 
], batch size: 415, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:12:29,548 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 07:12:42,369 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7645, 2.3183, 3.7025, 2.3679], device='cuda:1') 2023-06-26 07:12:47,769 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2687, simple_loss=0.357, pruned_loss=0.09027, over 1796401.00 frames. 2023-06-26 07:12:47,770 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 07:12:54,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1517742.0, ans=0.125 2023-06-26 07:13:03,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-26 07:13:07,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-26 07:13:08,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1517802.0, ans=0.0 2023-06-26 07:13:15,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1517802.0, ans=0.1 2023-06-26 07:13:27,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1517862.0, ans=0.125 2023-06-26 07:13:52,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1517922.0, ans=0.125 2023-06-26 07:13:58,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1517922.0, ans=0.125 2023-06-26 07:14:17,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1517922.0, ans=0.2 2023-06-26 07:14:17,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1517922.0, ans=0.0 2023-06-26 07:14:32,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1517982.0, ans=0.0 2023-06-26 07:14:34,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-26 07:14:38,774 INFO [train.py:996] (1/4) Epoch 9, batch 9050, loss[loss=0.23, simple_loss=0.3506, pruned_loss=0.0547, over 19848.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.297, pruned_loss=0.06823, over 4262854.39 frames. 
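Every valid_interval batches the log interleaves a validation pass ("Computing validation loss", a frame-weighted dev loss "over 1796401.00 frames", and the peak GPU memory), together with a per-head attention-entropy diagnostic from zipformer.py. A sketch of what such a pass typically looks like; compute_loss, the dev dataloader and the assumed attention-weight shape are placeholders, not the recipe's actual signatures.

import torch

@torch.no_grad()
def run_validation(model, dev_loader, compute_loss, device):
    """Frame-weighted validation loss plus the peak-memory figure the log reports."""
    model.eval()
    weighted_sum, num_frames = 0.0, 0.0
    for batch in dev_loader:
        loss, frames = compute_loss(model, batch, device)   # placeholder helper
        weighted_sum += loss.item() * frames
        num_frames += frames
    model.train()
    max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    return weighted_sum / max(num_frames, 1.0), max_mem_mb

def attention_entropy_per_head(attn: torch.Tensor) -> torch.Tensor:
    """Assumed diagnostic behind attn_weights_entropy: attn has shape
    (batch, num_heads, num_queries, num_keys) with probabilities over keys."""
    eps = 1.0e-20
    ent = -(attn * (attn + eps).log()).sum(dim=-1)   # (batch, heads, queries)
    return ent.mean(dim=(0, 2))                      # one average entropy per head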
], batch size: 702, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:14:52,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1518042.0, ans=0.2 2023-06-26 07:14:53,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1518042.0, ans=0.125 2023-06-26 07:15:13,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1518102.0, ans=0.0 2023-06-26 07:15:27,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-26 07:15:33,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1518162.0, ans=0.125 2023-06-26 07:15:38,268 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.242e+02 4.774e+02 6.783e+02 1.195e+03 2.023e+03, threshold=1.357e+03, percent-clipped=1.0 2023-06-26 07:16:30,154 INFO [train.py:996] (1/4) Epoch 9, batch 9100, loss[loss=0.1956, simple_loss=0.2993, pruned_loss=0.04595, over 21791.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3018, pruned_loss=0.07031, over 4267856.82 frames. ], batch size: 316, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:16:52,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1518402.0, ans=0.125 2023-06-26 07:17:01,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1518402.0, ans=0.125 2023-06-26 07:17:06,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1518402.0, ans=0.0 2023-06-26 07:17:08,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1518402.0, ans=0.125 2023-06-26 07:17:39,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-26 07:18:01,968 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:18:20,685 INFO [train.py:996] (1/4) Epoch 9, batch 9150, loss[loss=0.2039, simple_loss=0.2923, pruned_loss=0.05774, over 21413.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3021, pruned_loss=0.06785, over 4270096.10 frames. ], batch size: 194, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:19:29,336 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.947e+02 4.682e+02 7.293e+02 9.875e+02 2.025e+03, threshold=1.459e+03, percent-clipped=11.0 2023-06-26 07:19:37,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1518822.0, ans=0.1 2023-06-26 07:19:41,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1518822.0, ans=0.125 2023-06-26 07:19:49,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1518822.0, ans=0.0 2023-06-26 07:20:14,555 INFO [train.py:996] (1/4) Epoch 9, batch 9200, loss[loss=0.2265, simple_loss=0.3183, pruned_loss=0.06737, over 21642.00 frames. 
], tot_loss[loss=0.2191, simple_loss=0.3046, pruned_loss=0.06677, over 4266408.79 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:20:58,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1519002.0, ans=0.0 2023-06-26 07:21:55,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1519182.0, ans=10.0 2023-06-26 07:21:56,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1519182.0, ans=0.0 2023-06-26 07:22:03,208 INFO [train.py:996] (1/4) Epoch 9, batch 9250, loss[loss=0.2248, simple_loss=0.304, pruned_loss=0.07287, over 21786.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3062, pruned_loss=0.0699, over 4271401.09 frames. ], batch size: 118, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:22:05,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1519242.0, ans=0.1 2023-06-26 07:22:07,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1519242.0, ans=0.0 2023-06-26 07:22:37,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1519302.0, ans=0.0 2023-06-26 07:22:37,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1519302.0, ans=0.2 2023-06-26 07:22:54,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1519362.0, ans=0.2 2023-06-26 07:22:56,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1519362.0, ans=0.5 2023-06-26 07:23:04,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.22 vs. limit=15.0 2023-06-26 07:23:06,657 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.432e+02 5.072e+02 7.125e+02 1.070e+03 2.650e+03, threshold=1.425e+03, percent-clipped=11.0 2023-06-26 07:23:53,074 INFO [train.py:996] (1/4) Epoch 9, batch 9300, loss[loss=0.2427, simple_loss=0.3399, pruned_loss=0.07275, over 21730.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.299, pruned_loss=0.06935, over 4279094.20 frames. ], batch size: 351, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:23:53,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1519542.0, ans=0.125 2023-06-26 07:23:56,242 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.11 vs. limit=15.0 2023-06-26 07:24:59,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1519662.0, ans=0.125 2023-06-26 07:25:35,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1519782.0, ans=0.2 2023-06-26 07:25:42,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1519842.0, ans=0.125 2023-06-26 07:25:43,676 INFO [train.py:996] (1/4) Epoch 9, batch 9350, loss[loss=0.1733, simple_loss=0.25, pruned_loss=0.04829, over 20713.00 frames. 
], tot_loss[loss=0.2227, simple_loss=0.3052, pruned_loss=0.07014, over 4270038.75 frames. ], batch size: 607, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:25:55,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1519842.0, ans=0.0 2023-06-26 07:26:54,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 5.143e+02 7.806e+02 1.433e+03 2.856e+03, threshold=1.561e+03, percent-clipped=26.0 2023-06-26 07:26:58,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1520022.0, ans=0.125 2023-06-26 07:27:17,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1520082.0, ans=0.0 2023-06-26 07:27:23,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-26 07:27:38,712 INFO [train.py:996] (1/4) Epoch 9, batch 9400, loss[loss=0.2416, simple_loss=0.2903, pruned_loss=0.09644, over 21339.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3047, pruned_loss=0.07071, over 4265754.36 frames. ], batch size: 507, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:27:56,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1520142.0, ans=0.125 2023-06-26 07:28:06,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1520202.0, ans=0.1 2023-06-26 07:28:38,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1520262.0, ans=0.125 2023-06-26 07:29:03,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1520382.0, ans=0.125 2023-06-26 07:29:32,004 INFO [train.py:996] (1/4) Epoch 9, batch 9450, loss[loss=0.2584, simple_loss=0.4014, pruned_loss=0.05768, over 19838.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2989, pruned_loss=0.06989, over 4257376.79 frames. ], batch size: 702, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:29:48,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1520442.0, ans=0.5 2023-06-26 07:29:53,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1520502.0, ans=0.0 2023-06-26 07:30:01,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1520502.0, ans=0.0 2023-06-26 07:30:17,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1520562.0, ans=0.2 2023-06-26 07:30:31,555 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 5.776e+02 8.947e+02 1.514e+03 4.644e+03, threshold=1.789e+03, percent-clipped=22.0 2023-06-26 07:30:32,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. 
limit=6.0 2023-06-26 07:30:58,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1520682.0, ans=0.125 2023-06-26 07:31:21,066 INFO [train.py:996] (1/4) Epoch 9, batch 9500, loss[loss=0.2037, simple_loss=0.2723, pruned_loss=0.06761, over 21832.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2909, pruned_loss=0.06782, over 4254525.88 frames. ], batch size: 107, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:31:46,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1520802.0, ans=0.025 2023-06-26 07:32:36,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1520922.0, ans=0.125 2023-06-26 07:33:12,882 INFO [train.py:996] (1/4) Epoch 9, batch 9550, loss[loss=0.2311, simple_loss=0.3255, pruned_loss=0.06831, over 21831.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2964, pruned_loss=0.07014, over 4250902.81 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:34:08,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1521162.0, ans=0.0 2023-06-26 07:34:11,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.380e+02 4.672e+02 5.675e+02 8.285e+02 1.544e+03, threshold=1.135e+03, percent-clipped=0.0 2023-06-26 07:35:01,319 INFO [train.py:996] (1/4) Epoch 9, batch 9600, loss[loss=0.2114, simple_loss=0.2839, pruned_loss=0.06942, over 21485.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2979, pruned_loss=0.07129, over 4258286.83 frames. ], batch size: 548, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:35:35,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-26 07:36:39,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1521582.0, ans=0.125 2023-06-26 07:36:40,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1521582.0, ans=0.125 2023-06-26 07:36:52,869 INFO [train.py:996] (1/4) Epoch 9, batch 9650, loss[loss=0.2204, simple_loss=0.2981, pruned_loss=0.07135, over 21729.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2969, pruned_loss=0.06999, over 4252448.40 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:37:02,623 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-26 07:37:47,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.81 vs. limit=15.0 2023-06-26 07:37:49,601 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.333e+02 4.623e+02 6.972e+02 1.187e+03 2.800e+03, threshold=1.394e+03, percent-clipped=26.0 2023-06-26 07:38:38,188 INFO [train.py:996] (1/4) Epoch 9, batch 9700, loss[loss=0.2242, simple_loss=0.2969, pruned_loss=0.07577, over 21894.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3005, pruned_loss=0.07119, over 4257973.99 frames. 
], batch size: 124, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:38:45,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1521942.0, ans=0.125 2023-06-26 07:39:09,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1522002.0, ans=0.125 2023-06-26 07:39:47,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1522122.0, ans=0.1 2023-06-26 07:39:47,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1522122.0, ans=0.125 2023-06-26 07:39:59,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1522122.0, ans=0.0 2023-06-26 07:40:00,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.03 vs. limit=15.0 2023-06-26 07:40:13,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1522182.0, ans=0.125 2023-06-26 07:40:27,228 INFO [train.py:996] (1/4) Epoch 9, batch 9750, loss[loss=0.2482, simple_loss=0.3281, pruned_loss=0.08413, over 21244.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2952, pruned_loss=0.07057, over 4263711.98 frames. ], batch size: 143, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:40:34,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1522242.0, ans=0.0 2023-06-26 07:40:36,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1522242.0, ans=0.0 2023-06-26 07:41:05,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-26 07:41:17,713 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.60 vs. limit=10.0 2023-06-26 07:41:22,045 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.554e+02 4.804e+02 6.885e+02 8.968e+02 2.424e+03, threshold=1.377e+03, percent-clipped=5.0 2023-06-26 07:42:02,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1522482.0, ans=0.0 2023-06-26 07:42:07,397 INFO [train.py:996] (1/4) Epoch 9, batch 9800, loss[loss=0.2098, simple_loss=0.2863, pruned_loss=0.06665, over 21894.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.296, pruned_loss=0.07089, over 4263986.05 frames. ], batch size: 107, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:42:13,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1522542.0, ans=0.0 2023-06-26 07:42:25,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1522542.0, ans=0.0 2023-06-26 07:43:43,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=10.0 2023-06-26 07:43:57,267 INFO [train.py:996] (1/4) Epoch 9, batch 9850, loss[loss=0.2054, simple_loss=0.2804, pruned_loss=0.0652, over 21794.00 frames. 
], tot_loss[loss=0.2175, simple_loss=0.2927, pruned_loss=0.07118, over 4274001.73 frames. ], batch size: 118, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:44:15,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522842.0, ans=0.1 2023-06-26 07:44:54,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-26 07:44:58,516 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 4.827e+02 6.671e+02 1.006e+03 2.121e+03, threshold=1.334e+03, percent-clipped=9.0 2023-06-26 07:45:52,835 INFO [train.py:996] (1/4) Epoch 9, batch 9900, loss[loss=0.228, simple_loss=0.303, pruned_loss=0.07656, over 21727.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2903, pruned_loss=0.07061, over 4262896.92 frames. ], batch size: 351, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:46:35,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1523262.0, ans=0.125 2023-06-26 07:47:35,632 INFO [train.py:996] (1/4) Epoch 9, batch 9950, loss[loss=0.2133, simple_loss=0.2777, pruned_loss=0.07444, over 21533.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2923, pruned_loss=0.0728, over 4269548.69 frames. ], batch size: 414, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:47:43,568 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:48:13,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1523502.0, ans=0.125 2023-06-26 07:48:13,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1523502.0, ans=0.0 2023-06-26 07:48:38,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.372e+02 4.966e+02 6.562e+02 9.646e+02 1.795e+03, threshold=1.312e+03, percent-clipped=7.0 2023-06-26 07:49:31,805 INFO [train.py:996] (1/4) Epoch 9, batch 10000, loss[loss=0.2308, simple_loss=0.302, pruned_loss=0.07981, over 21226.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2873, pruned_loss=0.07128, over 4267968.35 frames. ], batch size: 143, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:49:54,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1523802.0, ans=0.125 2023-06-26 07:50:01,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-26 07:50:15,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1523862.0, ans=0.125 2023-06-26 07:50:15,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. 
limit=15.0 2023-06-26 07:50:28,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1523862.0, ans=0.1 2023-06-26 07:50:29,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1523862.0, ans=0.0 2023-06-26 07:50:36,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1523922.0, ans=0.125 2023-06-26 07:51:00,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.38 vs. limit=15.0 2023-06-26 07:51:22,410 INFO [train.py:996] (1/4) Epoch 9, batch 10050, loss[loss=0.2, simple_loss=0.2828, pruned_loss=0.05863, over 21598.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.289, pruned_loss=0.07114, over 4275727.47 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:51:43,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1524042.0, ans=0.0 2023-06-26 07:51:52,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-26 07:52:31,225 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.556e+02 5.086e+02 7.732e+02 1.194e+03 2.294e+03, threshold=1.546e+03, percent-clipped=16.0 2023-06-26 07:53:13,020 INFO [train.py:996] (1/4) Epoch 9, batch 10100, loss[loss=0.2134, simple_loss=0.2987, pruned_loss=0.06409, over 21643.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2882, pruned_loss=0.06963, over 4273437.03 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:53:13,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1524342.0, ans=0.1 2023-06-26 07:53:31,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1524342.0, ans=0.0 2023-06-26 07:54:01,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524462.0, ans=0.1 2023-06-26 07:54:01,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1524462.0, ans=0.1 2023-06-26 07:55:06,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-26 07:55:07,057 INFO [train.py:996] (1/4) Epoch 9, batch 10150, loss[loss=0.222, simple_loss=0.2919, pruned_loss=0.07608, over 21345.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2939, pruned_loss=0.07175, over 4273969.54 frames. ], batch size: 548, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:55:26,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524702.0, ans=0.1 2023-06-26 07:56:10,911 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.445e+02 5.423e+02 7.380e+02 1.011e+03 1.635e+03, threshold=1.476e+03, percent-clipped=1.0 2023-06-26 07:56:56,530 INFO [train.py:996] (1/4) Epoch 9, batch 10200, loss[loss=0.1938, simple_loss=0.2879, pruned_loss=0.04989, over 20740.00 frames. 
], tot_loss[loss=0.2149, simple_loss=0.2916, pruned_loss=0.06913, over 4269791.42 frames. ], batch size: 608, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:56:59,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1524942.0, ans=0.0 2023-06-26 07:57:28,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1525002.0, ans=0.2 2023-06-26 07:57:55,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1525062.0, ans=0.125 2023-06-26 07:58:47,139 INFO [train.py:996] (1/4) Epoch 9, batch 10250, loss[loss=0.231, simple_loss=0.3109, pruned_loss=0.07556, over 21315.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2878, pruned_loss=0.06414, over 4271311.73 frames. ], batch size: 159, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:59:44,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1525362.0, ans=0.0 2023-06-26 07:59:58,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.778e+02 4.201e+02 6.167e+02 1.103e+03 3.116e+03, threshold=1.233e+03, percent-clipped=15.0 2023-06-26 08:00:28,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-26 08:00:38,950 INFO [train.py:996] (1/4) Epoch 9, batch 10300, loss[loss=0.2569, simple_loss=0.337, pruned_loss=0.08837, over 21386.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2915, pruned_loss=0.06525, over 4268558.10 frames. ], batch size: 131, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:02:15,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1525782.0, ans=0.0 2023-06-26 08:02:30,553 INFO [train.py:996] (1/4) Epoch 9, batch 10350, loss[loss=0.1902, simple_loss=0.2577, pruned_loss=0.06136, over 21677.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2922, pruned_loss=0.06531, over 4271484.72 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:03:09,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-26 08:03:46,305 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.376e+02 5.119e+02 7.830e+02 1.250e+03 2.539e+03, threshold=1.566e+03, percent-clipped=25.0 2023-06-26 08:03:56,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1526022.0, ans=0.125 2023-06-26 08:04:33,026 INFO [train.py:996] (1/4) Epoch 9, batch 10400, loss[loss=0.2044, simple_loss=0.2887, pruned_loss=0.05999, over 21602.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.285, pruned_loss=0.06442, over 4269533.32 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 08:05:03,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1526202.0, ans=0.0 2023-06-26 08:05:15,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-26 08:05:37,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1526322.0, ans=0.125 2023-06-26 08:05:41,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1526322.0, ans=0.125 2023-06-26 08:05:43,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1526322.0, ans=0.025 2023-06-26 08:05:58,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1526322.0, ans=0.1 2023-06-26 08:06:09,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1526382.0, ans=0.1 2023-06-26 08:06:19,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1526382.0, ans=0.125 2023-06-26 08:06:24,918 INFO [train.py:996] (1/4) Epoch 9, batch 10450, loss[loss=0.2583, simple_loss=0.347, pruned_loss=0.08479, over 21613.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2903, pruned_loss=0.06761, over 4272643.51 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:06:55,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1526502.0, ans=0.0 2023-06-26 08:07:29,715 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.606e+02 5.261e+02 7.908e+02 1.020e+03 2.061e+03, threshold=1.582e+03, percent-clipped=9.0 2023-06-26 08:08:01,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1526682.0, ans=0.125 2023-06-26 08:08:14,049 INFO [train.py:996] (1/4) Epoch 9, batch 10500, loss[loss=0.1927, simple_loss=0.259, pruned_loss=0.06323, over 21208.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2881, pruned_loss=0.06658, over 4268933.36 frames. ], batch size: 548, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:08:29,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.42 vs. limit=10.0 2023-06-26 08:08:57,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1526862.0, ans=0.0 2023-06-26 08:09:28,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1526922.0, ans=0.1 2023-06-26 08:10:02,768 INFO [train.py:996] (1/4) Epoch 9, batch 10550, loss[loss=0.1977, simple_loss=0.2618, pruned_loss=0.06677, over 21580.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2824, pruned_loss=0.06625, over 4275870.57 frames. ], batch size: 414, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:10:47,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.42 vs. 
limit=22.5 2023-06-26 08:11:04,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1527222.0, ans=0.1 2023-06-26 08:11:07,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.351e+02 4.011e+02 5.575e+02 6.702e+02 2.123e+03, threshold=1.115e+03, percent-clipped=3.0 2023-06-26 08:11:38,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1527282.0, ans=0.125 2023-06-26 08:11:47,889 INFO [train.py:996] (1/4) Epoch 9, batch 10600, loss[loss=0.1672, simple_loss=0.2398, pruned_loss=0.04729, over 21069.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2779, pruned_loss=0.06469, over 4267067.06 frames. ], batch size: 143, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:12:27,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1527402.0, ans=0.125 2023-06-26 08:12:42,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1527462.0, ans=0.025 2023-06-26 08:13:09,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1527522.0, ans=0.125 2023-06-26 08:13:31,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=22.5 2023-06-26 08:13:44,652 INFO [train.py:996] (1/4) Epoch 9, batch 10650, loss[loss=0.2031, simple_loss=0.287, pruned_loss=0.05957, over 21560.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2811, pruned_loss=0.06429, over 4261661.36 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:13:45,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1527642.0, ans=0.125 2023-06-26 08:14:04,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1527642.0, ans=0.0 2023-06-26 08:14:18,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1527702.0, ans=0.125 2023-06-26 08:14:49,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.930e+02 8.313e+02 1.262e+03 3.074e+03, threshold=1.663e+03, percent-clipped=34.0 2023-06-26 08:14:59,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-26 08:15:30,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-26 08:15:34,234 INFO [train.py:996] (1/4) Epoch 9, batch 10700, loss[loss=0.2085, simple_loss=0.3209, pruned_loss=0.0481, over 20797.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2814, pruned_loss=0.06386, over 4259263.21 frames. 
], batch size: 608, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:15:53,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1528002.0, ans=0.125 2023-06-26 08:15:55,689 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:16:19,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1528062.0, ans=0.05 2023-06-26 08:16:34,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-26 08:16:48,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1528122.0, ans=0.2 2023-06-26 08:16:59,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1528182.0, ans=0.125 2023-06-26 08:17:20,277 INFO [train.py:996] (1/4) Epoch 9, batch 10750, loss[loss=0.2118, simple_loss=0.3049, pruned_loss=0.05936, over 20843.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2912, pruned_loss=0.06764, over 4260394.91 frames. ], batch size: 608, lr: 3.31e-03, grad_scale: 8.0 2023-06-26 08:18:13,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1528362.0, ans=0.125 2023-06-26 08:18:25,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-26 08:18:33,240 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.303e+02 6.075e+02 7.797e+02 1.997e+03, threshold=1.215e+03, percent-clipped=3.0 2023-06-26 08:18:56,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1528482.0, ans=0.1 2023-06-26 08:19:10,512 INFO [train.py:996] (1/4) Epoch 9, batch 10800, loss[loss=0.2429, simple_loss=0.3185, pruned_loss=0.08363, over 21213.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2951, pruned_loss=0.06791, over 4268869.63 frames. ], batch size: 143, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:20:16,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1528662.0, ans=0.125 2023-06-26 08:21:00,971 INFO [train.py:996] (1/4) Epoch 9, batch 10850, loss[loss=0.2488, simple_loss=0.3175, pruned_loss=0.09001, over 21320.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2962, pruned_loss=0.06857, over 4264887.21 frames. ], batch size: 548, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:21:33,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1528902.0, ans=0.125 2023-06-26 08:22:11,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1528962.0, ans=0.125 2023-06-26 08:22:14,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.93 vs. 
limit=10.0 2023-06-26 08:22:18,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1529022.0, ans=0.2 2023-06-26 08:22:19,321 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.316e+02 4.810e+02 7.791e+02 1.214e+03 2.371e+03, threshold=1.558e+03, percent-clipped=23.0 2023-06-26 08:22:56,735 INFO [train.py:996] (1/4) Epoch 9, batch 10900, loss[loss=0.1908, simple_loss=0.2922, pruned_loss=0.04475, over 21748.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2902, pruned_loss=0.0676, over 4264022.40 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:23:18,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1529202.0, ans=0.125 2023-06-26 08:24:15,412 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:24:44,097 INFO [train.py:996] (1/4) Epoch 9, batch 10950, loss[loss=0.1787, simple_loss=0.2601, pruned_loss=0.04865, over 21774.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2874, pruned_loss=0.06638, over 4264841.97 frames. ], batch size: 124, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:25:03,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1529442.0, ans=0.125 2023-06-26 08:25:26,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1529502.0, ans=0.2 2023-06-26 08:25:47,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1529562.0, ans=0.0 2023-06-26 08:25:54,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1529622.0, ans=0.05 2023-06-26 08:25:55,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.406e+02 4.859e+02 7.093e+02 1.092e+03 2.550e+03, threshold=1.419e+03, percent-clipped=10.0 2023-06-26 08:26:26,615 INFO [train.py:996] (1/4) Epoch 9, batch 11000, loss[loss=0.244, simple_loss=0.3057, pruned_loss=0.09113, over 21634.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2869, pruned_loss=0.0672, over 4266617.01 frames. ], batch size: 471, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:26:41,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1529742.0, ans=0.125 2023-06-26 08:26:47,286 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-26 08:27:46,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1529922.0, ans=0.125 2023-06-26 08:28:11,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529982.0, ans=0.1 2023-06-26 08:28:20,361 INFO [train.py:996] (1/4) Epoch 9, batch 11050, loss[loss=0.1971, simple_loss=0.2682, pruned_loss=0.06303, over 21830.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2839, pruned_loss=0.06768, over 4264146.05 frames. 
], batch size: 107, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:29:14,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1530162.0, ans=0.0 2023-06-26 08:29:32,132 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.865e+02 7.286e+02 1.085e+03 1.953e+03, threshold=1.457e+03, percent-clipped=8.0 2023-06-26 08:29:55,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-26 08:30:03,324 INFO [train.py:996] (1/4) Epoch 9, batch 11100, loss[loss=0.1921, simple_loss=0.2792, pruned_loss=0.05247, over 21385.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2827, pruned_loss=0.06683, over 4250180.97 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:30:15,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-26 08:30:49,912 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:31:00,856 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-26 08:31:12,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1530462.0, ans=0.2 2023-06-26 08:31:31,804 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-26 08:31:57,826 INFO [train.py:996] (1/4) Epoch 9, batch 11150, loss[loss=0.2048, simple_loss=0.2988, pruned_loss=0.0554, over 21799.00 frames. ], tot_loss[loss=0.207, simple_loss=0.281, pruned_loss=0.06652, over 4250151.14 frames. ], batch size: 317, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:32:35,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1530702.0, ans=0.125 2023-06-26 08:32:49,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1530762.0, ans=0.0 2023-06-26 08:32:51,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.47 vs. limit=10.0 2023-06-26 08:32:59,403 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:33:09,617 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 4.594e+02 7.408e+02 1.103e+03 2.164e+03, threshold=1.482e+03, percent-clipped=12.0 2023-06-26 08:33:18,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1530822.0, ans=0.125 2023-06-26 08:33:22,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1530882.0, ans=0.125 2023-06-26 08:33:24,500 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-26 08:33:40,341 INFO [train.py:996] (1/4) Epoch 9, batch 11200, loss[loss=0.2297, simple_loss=0.2896, pruned_loss=0.08492, over 22019.00 frames. 
], tot_loss[loss=0.2069, simple_loss=0.2801, pruned_loss=0.06688, over 4255786.58 frames. ], batch size: 103, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:33:50,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.01 vs. limit=5.0 2023-06-26 08:34:07,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1530942.0, ans=0.05 2023-06-26 08:35:30,886 INFO [train.py:996] (1/4) Epoch 9, batch 11250, loss[loss=0.1936, simple_loss=0.2817, pruned_loss=0.05271, over 21814.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2798, pruned_loss=0.06677, over 4247941.50 frames. ], batch size: 333, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:35:34,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1531242.0, ans=0.125 2023-06-26 08:36:00,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1531302.0, ans=0.125 2023-06-26 08:36:50,832 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 4.914e+02 6.866e+02 9.264e+02 1.730e+03, threshold=1.373e+03, percent-clipped=7.0 2023-06-26 08:37:18,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=12.0 2023-06-26 08:37:20,669 INFO [train.py:996] (1/4) Epoch 9, batch 11300, loss[loss=0.1809, simple_loss=0.2678, pruned_loss=0.04703, over 21868.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2807, pruned_loss=0.06703, over 4257120.65 frames. ], batch size: 316, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:37:53,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. limit=10.0 2023-06-26 08:38:12,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1531662.0, ans=0.1 2023-06-26 08:38:18,575 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-26 08:38:19,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1531662.0, ans=0.05 2023-06-26 08:39:16,336 INFO [train.py:996] (1/4) Epoch 9, batch 11350, loss[loss=0.2192, simple_loss=0.3024, pruned_loss=0.06797, over 21707.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2833, pruned_loss=0.06716, over 4256120.74 frames. ], batch size: 298, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:39:33,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.41 vs. 
limit=15.0 2023-06-26 08:39:55,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1531902.0, ans=0.125 2023-06-26 08:39:58,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1531902.0, ans=0.125 2023-06-26 08:40:10,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1531962.0, ans=0.1 2023-06-26 08:40:31,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.547e+02 4.947e+02 6.813e+02 1.038e+03 3.040e+03, threshold=1.363e+03, percent-clipped=13.0 2023-06-26 08:40:41,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1532022.0, ans=0.125 2023-06-26 08:40:44,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1532082.0, ans=0.125 2023-06-26 08:41:07,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1532142.0, ans=0.125 2023-06-26 08:41:08,346 INFO [train.py:996] (1/4) Epoch 9, batch 11400, loss[loss=0.2686, simple_loss=0.337, pruned_loss=0.1001, over 21432.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2893, pruned_loss=0.0697, over 4258182.63 frames. ], batch size: 471, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:41:22,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1532142.0, ans=0.125 2023-06-26 08:42:01,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-26 08:42:16,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1532322.0, ans=0.0 2023-06-26 08:43:04,957 INFO [train.py:996] (1/4) Epoch 9, batch 11450, loss[loss=0.2253, simple_loss=0.3069, pruned_loss=0.0718, over 21861.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2922, pruned_loss=0.0693, over 4266085.74 frames. ], batch size: 371, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:43:28,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1532502.0, ans=0.125 2023-06-26 08:43:36,705 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.35 vs. limit=15.0 2023-06-26 08:43:52,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1532562.0, ans=0.0 2023-06-26 08:43:55,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1532562.0, ans=0.0 2023-06-26 08:44:14,730 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.477e+02 5.112e+02 7.054e+02 1.112e+03 2.275e+03, threshold=1.411e+03, percent-clipped=15.0 2023-06-26 08:44:37,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. 
limit=10.0 2023-06-26 08:44:45,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1532682.0, ans=0.2 2023-06-26 08:45:01,338 INFO [train.py:996] (1/4) Epoch 9, batch 11500, loss[loss=0.21, simple_loss=0.3078, pruned_loss=0.05608, over 21872.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2943, pruned_loss=0.06932, over 4259565.38 frames. ], batch size: 371, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:45:05,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1532742.0, ans=0.125 2023-06-26 08:45:33,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1532802.0, ans=0.125 2023-06-26 08:46:50,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1532982.0, ans=0.125 2023-06-26 08:46:53,162 INFO [train.py:996] (1/4) Epoch 9, batch 11550, loss[loss=0.315, simple_loss=0.4188, pruned_loss=0.1056, over 21652.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3006, pruned_loss=0.06992, over 4266432.34 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:47:24,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1533102.0, ans=0.125 2023-06-26 08:47:29,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1533102.0, ans=0.2 2023-06-26 08:48:08,420 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.865e+02 8.299e+02 1.163e+03 3.420e+03, threshold=1.660e+03, percent-clipped=18.0 2023-06-26 08:48:11,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-26 08:48:41,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1533282.0, ans=0.2 2023-06-26 08:48:48,929 INFO [train.py:996] (1/4) Epoch 9, batch 11600, loss[loss=0.273, simple_loss=0.3817, pruned_loss=0.08213, over 21673.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3125, pruned_loss=0.07114, over 4255634.73 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:49:33,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1533462.0, ans=15.0 2023-06-26 08:49:54,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-26 08:50:37,990 INFO [train.py:996] (1/4) Epoch 9, batch 11650, loss[loss=0.1999, simple_loss=0.2638, pruned_loss=0.06794, over 20843.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3168, pruned_loss=0.07163, over 4254951.12 frames. ], batch size: 609, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:50:59,726 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-26 08:51:01,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.74 vs. 
limit=15.0 2023-06-26 08:51:04,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1533702.0, ans=0.125 2023-06-26 08:51:28,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1533762.0, ans=0.125 2023-06-26 08:51:48,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1533822.0, ans=0.1 2023-06-26 08:51:52,913 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.214e+02 7.495e+02 1.149e+03 1.864e+03 4.386e+03, threshold=2.298e+03, percent-clipped=28.0 2023-06-26 08:52:26,007 INFO [train.py:996] (1/4) Epoch 9, batch 11700, loss[loss=0.1871, simple_loss=0.2555, pruned_loss=0.05937, over 21603.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3071, pruned_loss=0.07046, over 4238114.04 frames. ], batch size: 332, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:52:39,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1533942.0, ans=0.125 2023-06-26 08:52:56,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1534002.0, ans=0.025 2023-06-26 08:54:13,593 INFO [train.py:996] (1/4) Epoch 9, batch 11750, loss[loss=0.2007, simple_loss=0.2668, pruned_loss=0.06728, over 21861.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2983, pruned_loss=0.06988, over 4242562.75 frames. ], batch size: 98, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:54:21,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1534242.0, ans=0.125 2023-06-26 08:55:01,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-26 08:55:31,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 4.368e+02 6.221e+02 1.023e+03 2.709e+03, threshold=1.244e+03, percent-clipped=2.0 2023-06-26 08:55:39,538 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-26 08:55:53,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1534482.0, ans=0.125 2023-06-26 08:56:03,966 INFO [train.py:996] (1/4) Epoch 9, batch 11800, loss[loss=0.2232, simple_loss=0.3234, pruned_loss=0.06146, over 21884.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3006, pruned_loss=0.07197, over 4254416.39 frames. ], batch size: 372, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:57:12,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-26 08:57:32,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1534722.0, ans=0.0 2023-06-26 08:57:43,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1534782.0, ans=0.1 2023-06-26 08:57:53,770 INFO [train.py:996] (1/4) Epoch 9, batch 11850, loss[loss=0.218, simple_loss=0.3009, pruned_loss=0.06759, over 21620.00 frames. 
], tot_loss[loss=0.2217, simple_loss=0.3019, pruned_loss=0.07072, over 4265157.78 frames. ], batch size: 230, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:59:16,061 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 4.313e+02 5.764e+02 8.343e+02 1.784e+03, threshold=1.153e+03, percent-clipped=5.0 2023-06-26 08:59:19,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-26 08:59:31,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=22.5 2023-06-26 08:59:39,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1535082.0, ans=0.125 2023-06-26 08:59:50,220 INFO [train.py:996] (1/4) Epoch 9, batch 11900, loss[loss=0.2263, simple_loss=0.3006, pruned_loss=0.07596, over 20685.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3018, pruned_loss=0.06898, over 4253074.50 frames. ], batch size: 607, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:00:06,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1535202.0, ans=0.0 2023-06-26 09:01:04,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1535322.0, ans=0.0 2023-06-26 09:01:04,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1535322.0, ans=0.2 2023-06-26 09:01:16,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. limit=12.0 2023-06-26 09:01:26,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1535382.0, ans=0.1 2023-06-26 09:01:36,246 INFO [train.py:996] (1/4) Epoch 9, batch 11950, loss[loss=0.1126, simple_loss=0.1693, pruned_loss=0.02793, over 16260.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3026, pruned_loss=0.06651, over 4248148.78 frames. ], batch size: 60, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:01:36,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1535442.0, ans=0.125 2023-06-26 09:02:43,943 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-26 09:02:50,740 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 4.636e+02 6.640e+02 1.069e+03 2.597e+03, threshold=1.328e+03, percent-clipped=19.0 2023-06-26 09:02:51,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1535622.0, ans=0.0 2023-06-26 09:03:23,551 INFO [train.py:996] (1/4) Epoch 9, batch 12000, loss[loss=0.2072, simple_loss=0.2728, pruned_loss=0.07076, over 21828.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2959, pruned_loss=0.06449, over 4248861.59 frames. ], batch size: 98, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 09:03:23,552 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 09:03:41,732 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2638, simple_loss=0.3517, pruned_loss=0.08798, over 1796401.00 frames. 
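Note on the validation entries above and the memory line that follows: at a fixed batch interval the trainer pauses, scores the dev set without gradient tracking, logs the averaged losses, and reports the peak CUDA memory observed so far on this rank. The sketch below is an illustrative reconstruction of such a validation hook, not the actual train.py code; the helper name run_validation, the criterion callable, and the frame-weighted averaging are assumptions made for the example.

    import torch

    def run_validation(model, dev_loader, criterion, device="cuda:1"):
        # Illustrative sketch only; names and the averaging scheme are
        # assumptions, not the icefall implementation.
        model.eval()
        tot_loss = 0.0
        tot_frames = 0.0
        with torch.no_grad():
            for batch in dev_loader:
                # criterion is assumed to return (summed loss, number of frames)
                loss, num_frames = criterion(model, batch, device)
                tot_loss += loss.item()
                tot_frames += num_frames
        model.train()
        avg_loss = tot_loss / max(tot_frames, 1.0)
        # Peak memory since process start, in MB, as in the log line below.
        peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={avg_loss:.4f}, over {tot_frames:.2f} frames.")
        print(f"Maximum memory allocated so far is {peak_mb}MB")
        return avg_loss

The "Maximum memory allocated" figure reported next is consistent with torch.cuda.max_memory_allocated() converted to MB, though the exact reporting path is not visible in this excerpt of the log.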
2023-06-26 09:03:41,733 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 09:04:36,400 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.83 vs. limit=5.0 2023-06-26 09:05:08,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1535922.0, ans=0.1 2023-06-26 09:05:21,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1535982.0, ans=0.0 2023-06-26 09:05:31,708 INFO [train.py:996] (1/4) Epoch 9, batch 12050, loss[loss=0.1959, simple_loss=0.2691, pruned_loss=0.06138, over 21514.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2932, pruned_loss=0.06609, over 4253695.12 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:05:34,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-26 09:05:46,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1536042.0, ans=0.125 2023-06-26 09:06:54,781 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.979e+02 7.743e+02 1.300e+03 2.733e+03, threshold=1.549e+03, percent-clipped=23.0 2023-06-26 09:07:12,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-26 09:07:34,230 INFO [train.py:996] (1/4) Epoch 9, batch 12100, loss[loss=0.2322, simple_loss=0.3118, pruned_loss=0.07629, over 21805.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2989, pruned_loss=0.07052, over 4264147.62 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:08:21,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1536462.0, ans=0.125 2023-06-26 09:09:27,801 INFO [train.py:996] (1/4) Epoch 9, batch 12150, loss[loss=0.2276, simple_loss=0.3288, pruned_loss=0.06321, over 21854.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3022, pruned_loss=0.06938, over 4259767.39 frames. ], batch size: 371, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:09:35,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.52 vs. 
limit=15.0 2023-06-26 09:09:45,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1536642.0, ans=0.0 2023-06-26 09:09:49,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1536702.0, ans=0.0 2023-06-26 09:10:18,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1536762.0, ans=0.125 2023-06-26 09:10:33,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1536822.0, ans=0.125 2023-06-26 09:10:43,934 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.377e+02 5.272e+02 8.352e+02 1.536e+03 2.585e+03, threshold=1.670e+03, percent-clipped=24.0 2023-06-26 09:10:44,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1536822.0, ans=0.0 2023-06-26 09:11:14,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1536882.0, ans=0.125 2023-06-26 09:11:18,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1536942.0, ans=0.1 2023-06-26 09:11:19,131 INFO [train.py:996] (1/4) Epoch 9, batch 12200, loss[loss=0.2, simple_loss=0.2642, pruned_loss=0.0679, over 21545.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.297, pruned_loss=0.06876, over 4250785.00 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:12:23,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-26 09:13:06,910 INFO [train.py:996] (1/4) Epoch 9, batch 12250, loss[loss=0.1468, simple_loss=0.228, pruned_loss=0.0328, over 21336.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2897, pruned_loss=0.06566, over 4256134.86 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:13:17,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1537242.0, ans=0.125 2023-06-26 09:13:23,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1537302.0, ans=0.0 2023-06-26 09:13:56,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1537362.0, ans=0.2 2023-06-26 09:14:12,895 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.985e+02 4.210e+02 5.762e+02 8.754e+02 2.023e+03, threshold=1.152e+03, percent-clipped=2.0 2023-06-26 09:14:55,442 INFO [train.py:996] (1/4) Epoch 9, batch 12300, loss[loss=0.2282, simple_loss=0.3242, pruned_loss=0.06616, over 20824.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2857, pruned_loss=0.06137, over 4252384.17 frames. ], batch size: 607, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:15:41,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1537662.0, ans=0.125 2023-06-26 09:16:42,684 INFO [train.py:996] (1/4) Epoch 9, batch 12350, loss[loss=0.2157, simple_loss=0.3021, pruned_loss=0.06465, over 21617.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2895, pruned_loss=0.06241, over 4254847.59 frames. 
], batch size: 263, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:16:50,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-26 09:16:52,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1537842.0, ans=0.125 2023-06-26 09:17:03,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1537902.0, ans=0.2 2023-06-26 09:17:33,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-26 09:17:47,949 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.624e+02 9.354e+02 1.463e+03 3.322e+03, threshold=1.871e+03, percent-clipped=32.0 2023-06-26 09:18:29,186 INFO [train.py:996] (1/4) Epoch 9, batch 12400, loss[loss=0.2074, simple_loss=0.2828, pruned_loss=0.06602, over 21891.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2912, pruned_loss=0.06528, over 4269217.01 frames. ], batch size: 118, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:18:39,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-26 09:19:01,652 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0 2023-06-26 09:19:34,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1538322.0, ans=0.2 2023-06-26 09:20:18,911 INFO [train.py:996] (1/4) Epoch 9, batch 12450, loss[loss=0.2468, simple_loss=0.3237, pruned_loss=0.08498, over 21620.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2942, pruned_loss=0.06808, over 4272260.40 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:20:36,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1538442.0, ans=0.125 2023-06-26 09:20:42,187 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:20:47,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0 2023-06-26 09:20:50,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1538502.0, ans=0.125 2023-06-26 09:21:21,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1538562.0, ans=0.125 2023-06-26 09:21:43,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.605e+02 6.014e+02 7.920e+02 1.251e+03 2.737e+03, threshold=1.584e+03, percent-clipped=3.0 2023-06-26 09:22:15,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.24 vs. limit=15.0 2023-06-26 09:22:15,991 INFO [train.py:996] (1/4) Epoch 9, batch 12500, loss[loss=0.2406, simple_loss=0.336, pruned_loss=0.07262, over 21323.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3055, pruned_loss=0.07164, over 4279523.04 frames. 
], batch size: 176, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:22:37,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1538802.0, ans=0.1 2023-06-26 09:22:45,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1538802.0, ans=0.125 2023-06-26 09:22:45,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-26 09:23:00,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1538862.0, ans=0.125 2023-06-26 09:23:08,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1538862.0, ans=0.2 2023-06-26 09:24:07,315 INFO [train.py:996] (1/4) Epoch 9, batch 12550, loss[loss=0.215, simple_loss=0.3409, pruned_loss=0.04456, over 20781.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3098, pruned_loss=0.07376, over 4280797.93 frames. ], batch size: 608, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:24:13,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1539042.0, ans=0.125 2023-06-26 09:24:39,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1539102.0, ans=0.035 2023-06-26 09:24:53,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1539102.0, ans=0.2 2023-06-26 09:25:32,842 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 5.506e+02 7.478e+02 1.164e+03 2.448e+03, threshold=1.496e+03, percent-clipped=9.0 2023-06-26 09:26:02,744 INFO [train.py:996] (1/4) Epoch 9, batch 12600, loss[loss=0.227, simple_loss=0.3296, pruned_loss=0.06223, over 21225.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3086, pruned_loss=0.07176, over 4274660.17 frames. ], batch size: 549, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:26:25,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-26 09:26:52,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1539462.0, ans=0.125 2023-06-26 09:27:03,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1539462.0, ans=0.125 2023-06-26 09:27:09,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1539522.0, ans=0.2 2023-06-26 09:27:50,963 INFO [train.py:996] (1/4) Epoch 9, batch 12650, loss[loss=0.1856, simple_loss=0.2782, pruned_loss=0.04653, over 21640.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2994, pruned_loss=0.06692, over 4269437.88 frames. 
], batch size: 389, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:28:07,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1539642.0, ans=22.5 2023-06-26 09:28:22,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1539702.0, ans=0.125 2023-06-26 09:28:50,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-26 09:28:57,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1539822.0, ans=0.125 2023-06-26 09:29:02,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.56 vs. limit=5.0 2023-06-26 09:29:09,318 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 4.812e+02 9.064e+02 1.405e+03 2.946e+03, threshold=1.813e+03, percent-clipped=21.0 2023-06-26 09:29:44,742 INFO [train.py:996] (1/4) Epoch 9, batch 12700, loss[loss=0.2431, simple_loss=0.3181, pruned_loss=0.08404, over 21452.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3001, pruned_loss=0.06946, over 4275036.26 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:29:50,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1539942.0, ans=0.125 2023-06-26 09:30:02,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1539942.0, ans=0.5 2023-06-26 09:30:28,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1540062.0, ans=0.125 2023-06-26 09:30:54,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1540122.0, ans=0.0 2023-06-26 09:31:32,361 INFO [train.py:996] (1/4) Epoch 9, batch 12750, loss[loss=0.2301, simple_loss=0.3139, pruned_loss=0.07311, over 21361.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.302, pruned_loss=0.07069, over 4271453.20 frames. ], batch size: 548, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:31:36,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1540242.0, ans=0.125 2023-06-26 09:32:18,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1540362.0, ans=0.125 2023-06-26 09:32:42,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1540422.0, ans=0.0 2023-06-26 09:32:45,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.578e+02 5.161e+02 7.205e+02 9.772e+02 1.736e+03, threshold=1.441e+03, percent-clipped=0.0 2023-06-26 09:32:46,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1540422.0, ans=0.04949747468305833 2023-06-26 09:32:59,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1540482.0, ans=0.0 2023-06-26 09:33:13,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.55 vs. 
limit=15.0 2023-06-26 09:33:19,508 INFO [train.py:996] (1/4) Epoch 9, batch 12800, loss[loss=0.2531, simple_loss=0.3122, pruned_loss=0.09698, over 21741.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3011, pruned_loss=0.0716, over 4278044.25 frames. ], batch size: 508, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:33:31,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1540542.0, ans=0.125 2023-06-26 09:33:42,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1540602.0, ans=0.125 2023-06-26 09:33:49,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1540602.0, ans=0.1 2023-06-26 09:33:55,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-26 09:34:05,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1540662.0, ans=0.1 2023-06-26 09:35:13,874 INFO [train.py:996] (1/4) Epoch 9, batch 12850, loss[loss=0.2062, simple_loss=0.3085, pruned_loss=0.05195, over 21218.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3041, pruned_loss=0.07297, over 4278954.33 frames. ], batch size: 548, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:35:23,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1540842.0, ans=0.1 2023-06-26 09:36:36,354 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.020e+02 4.565e+02 5.945e+02 7.206e+02 1.665e+03, threshold=1.189e+03, percent-clipped=1.0 2023-06-26 09:37:00,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.11 vs. limit=15.0 2023-06-26 09:37:04,595 INFO [train.py:996] (1/4) Epoch 9, batch 12900, loss[loss=0.1963, simple_loss=0.2683, pruned_loss=0.06214, over 21222.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3005, pruned_loss=0.06961, over 4282262.73 frames. ], batch size: 159, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:37:17,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1541142.0, ans=0.0 2023-06-26 09:37:25,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-26 09:37:37,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1541202.0, ans=0.2 2023-06-26 09:37:46,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1541262.0, ans=0.125 2023-06-26 09:38:49,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-26 09:38:55,140 INFO [train.py:996] (1/4) Epoch 9, batch 12950, loss[loss=0.2366, simple_loss=0.3151, pruned_loss=0.07909, over 21491.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2982, pruned_loss=0.06737, over 4280101.52 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:38:56,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. 
limit=15.0 2023-06-26 09:40:05,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-26 09:40:21,281 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.550e+02 5.457e+02 7.611e+02 1.240e+03 2.264e+03, threshold=1.522e+03, percent-clipped=25.0 2023-06-26 09:40:43,295 INFO [train.py:996] (1/4) Epoch 9, batch 13000, loss[loss=0.1515, simple_loss=0.2383, pruned_loss=0.03235, over 21341.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3012, pruned_loss=0.06771, over 4272009.54 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:40:54,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.63 vs. limit=6.0 2023-06-26 09:40:55,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1541742.0, ans=0.025 2023-06-26 09:41:29,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1541862.0, ans=0.125 2023-06-26 09:41:55,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1541922.0, ans=0.1 2023-06-26 09:42:30,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1542042.0, ans=0.0 2023-06-26 09:42:31,809 INFO [train.py:996] (1/4) Epoch 9, batch 13050, loss[loss=0.192, simple_loss=0.263, pruned_loss=0.06049, over 21800.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2947, pruned_loss=0.06519, over 4273875.25 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:42:37,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1542042.0, ans=0.125 2023-06-26 09:43:02,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1542102.0, ans=0.125 2023-06-26 09:43:34,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1542162.0, ans=0.2 2023-06-26 09:43:44,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1542222.0, ans=0.125 2023-06-26 09:43:58,313 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.428e+02 4.464e+02 7.205e+02 1.000e+03 2.248e+03, threshold=1.441e+03, percent-clipped=5.0 2023-06-26 09:44:14,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1542282.0, ans=0.07 2023-06-26 09:44:21,915 INFO [train.py:996] (1/4) Epoch 9, batch 13100, loss[loss=0.2267, simple_loss=0.3125, pruned_loss=0.07043, over 21756.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2966, pruned_loss=0.06606, over 4283906.14 frames. ], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:44:53,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-26 09:45:05,481 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.42 vs. 
limit=6.0 2023-06-26 09:45:10,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1542402.0, ans=0.125 2023-06-26 09:45:10,280 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:45:13,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1542462.0, ans=0.0 2023-06-26 09:45:26,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1542462.0, ans=0.025 2023-06-26 09:45:34,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1542462.0, ans=0.0 2023-06-26 09:45:42,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1542522.0, ans=0.125 2023-06-26 09:46:20,395 INFO [train.py:996] (1/4) Epoch 9, batch 13150, loss[loss=0.2018, simple_loss=0.2758, pruned_loss=0.06385, over 21777.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2976, pruned_loss=0.06824, over 4280800.45 frames. ], batch size: 316, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:47:13,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1542762.0, ans=0.0 2023-06-26 09:47:33,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1542822.0, ans=0.0 2023-06-26 09:47:37,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-26 09:47:43,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 6.125e+02 9.524e+02 1.520e+03 3.301e+03, threshold=1.905e+03, percent-clipped=27.0 2023-06-26 09:48:24,344 INFO [train.py:996] (1/4) Epoch 9, batch 13200, loss[loss=0.224, simple_loss=0.3048, pruned_loss=0.07163, over 21566.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2959, pruned_loss=0.06837, over 4278941.11 frames. ], batch size: 389, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:49:12,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1543062.0, ans=0.125 2023-06-26 09:49:24,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-26 09:49:27,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1543122.0, ans=0.125 2023-06-26 09:50:02,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.13 vs. limit=22.5 2023-06-26 09:50:16,134 INFO [train.py:996] (1/4) Epoch 9, batch 13250, loss[loss=0.2334, simple_loss=0.3169, pruned_loss=0.07496, over 21757.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2965, pruned_loss=0.06949, over 4277106.58 frames. 
], batch size: 414, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:50:32,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1543242.0, ans=0.125 2023-06-26 09:50:34,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1543242.0, ans=0.125 2023-06-26 09:50:44,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1543302.0, ans=0.0 2023-06-26 09:50:53,877 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:51:17,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1543422.0, ans=0.0 2023-06-26 09:51:48,023 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.553e+02 4.712e+02 6.598e+02 9.234e+02 1.581e+03, threshold=1.320e+03, percent-clipped=0.0 2023-06-26 09:51:54,208 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:52:13,051 INFO [train.py:996] (1/4) Epoch 9, batch 13300, loss[loss=0.223, simple_loss=0.3051, pruned_loss=0.07047, over 21492.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2992, pruned_loss=0.06949, over 4276109.20 frames. ], batch size: 211, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:52:30,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1543602.0, ans=0.0 2023-06-26 09:52:31,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1543602.0, ans=0.125 2023-06-26 09:52:34,036 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-26 09:52:51,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1543662.0, ans=0.1 2023-06-26 09:52:58,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1543662.0, ans=0.0 2023-06-26 09:54:01,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1543842.0, ans=0.125 2023-06-26 09:54:02,880 INFO [train.py:996] (1/4) Epoch 9, batch 13350, loss[loss=0.2599, simple_loss=0.3411, pruned_loss=0.08938, over 21620.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3045, pruned_loss=0.07214, over 4276788.06 frames. ], batch size: 389, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:54:14,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-26 09:54:24,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1543902.0, ans=0.0 2023-06-26 09:54:26,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1543902.0, ans=0.0 2023-06-26 09:54:54,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. 
limit=10.0 2023-06-26 09:55:27,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.028e+02 5.402e+02 7.933e+02 1.042e+03 2.169e+03, threshold=1.587e+03, percent-clipped=13.0 2023-06-26 09:55:51,790 INFO [train.py:996] (1/4) Epoch 9, batch 13400, loss[loss=0.2134, simple_loss=0.2809, pruned_loss=0.07299, over 21851.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.306, pruned_loss=0.07357, over 4282243.40 frames. ], batch size: 98, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:56:09,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1544202.0, ans=0.125 2023-06-26 09:56:09,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1544202.0, ans=0.125 2023-06-26 09:56:18,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1544202.0, ans=22.5 2023-06-26 09:57:30,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1544382.0, ans=0.0 2023-06-26 09:57:34,850 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.13 vs. limit=12.0 2023-06-26 09:57:39,175 INFO [train.py:996] (1/4) Epoch 9, batch 13450, loss[loss=0.2508, simple_loss=0.3342, pruned_loss=0.08373, over 21414.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3061, pruned_loss=0.07512, over 4285590.86 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:57:56,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1544502.0, ans=0.0 2023-06-26 09:58:16,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1544502.0, ans=0.125 2023-06-26 09:58:32,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1544562.0, ans=0.125 2023-06-26 09:59:10,545 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.398e+02 5.052e+02 6.156e+02 8.765e+02 1.835e+03, threshold=1.231e+03, percent-clipped=4.0 2023-06-26 09:59:30,321 INFO [train.py:996] (1/4) Epoch 9, batch 13500, loss[loss=0.2775, simple_loss=0.346, pruned_loss=0.1045, over 21469.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2963, pruned_loss=0.0721, over 4268954.54 frames. ], batch size: 509, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:59:52,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1544802.0, ans=0.0 2023-06-26 10:01:27,146 INFO [train.py:996] (1/4) Epoch 9, batch 13550, loss[loss=0.2638, simple_loss=0.3734, pruned_loss=0.07713, over 21272.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2985, pruned_loss=0.07073, over 4268003.87 frames. ], batch size: 548, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 10:02:05,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1545102.0, ans=0.2 2023-06-26 10:02:05,940 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. 
limit=15.0 2023-06-26 10:02:51,989 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.799e+02 5.933e+02 9.358e+02 1.476e+03 2.986e+03, threshold=1.872e+03, percent-clipped=34.0 2023-06-26 10:03:00,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1545282.0, ans=0.0 2023-06-26 10:03:16,833 INFO [train.py:996] (1/4) Epoch 9, batch 13600, loss[loss=0.2127, simple_loss=0.2875, pruned_loss=0.06895, over 21250.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3014, pruned_loss=0.07109, over 4272557.86 frames. ], batch size: 143, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:03:19,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1545342.0, ans=0.125 2023-06-26 10:03:41,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1545342.0, ans=0.0 2023-06-26 10:04:17,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-26 10:04:19,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1545462.0, ans=0.2 2023-06-26 10:04:23,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1545522.0, ans=0.2 2023-06-26 10:05:04,147 INFO [train.py:996] (1/4) Epoch 9, batch 13650, loss[loss=0.1689, simple_loss=0.2388, pruned_loss=0.04952, over 21542.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2957, pruned_loss=0.06821, over 4273363.25 frames. ], batch size: 263, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:05:09,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1545642.0, ans=0.125 2023-06-26 10:05:24,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1545642.0, ans=0.2 2023-06-26 10:05:26,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1545702.0, ans=0.125 2023-06-26 10:05:26,745 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-26 10:05:40,499 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:05:47,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1545762.0, ans=0.125 2023-06-26 10:05:49,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1545762.0, ans=0.125 2023-06-26 10:05:58,561 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.39 vs. 
limit=12.0 2023-06-26 10:06:22,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1545822.0, ans=0.0 2023-06-26 10:06:23,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.401e+02 4.955e+02 6.723e+02 8.963e+02 2.035e+03, threshold=1.345e+03, percent-clipped=1.0 2023-06-26 10:06:48,902 INFO [train.py:996] (1/4) Epoch 9, batch 13700, loss[loss=0.2366, simple_loss=0.3189, pruned_loss=0.07712, over 21620.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2929, pruned_loss=0.06837, over 4259692.33 frames. ], batch size: 389, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:07:00,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-26 10:07:14,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1546002.0, ans=0.125 2023-06-26 10:08:04,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1546122.0, ans=0.0 2023-06-26 10:08:23,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1546182.0, ans=0.1 2023-06-26 10:08:45,476 INFO [train.py:996] (1/4) Epoch 9, batch 13750, loss[loss=0.167, simple_loss=0.2186, pruned_loss=0.05764, over 21193.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2924, pruned_loss=0.06835, over 4256062.70 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:10:01,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1546422.0, ans=0.0 2023-06-26 10:10:16,977 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 6.154e+02 1.114e+03 1.508e+03 3.073e+03, threshold=2.228e+03, percent-clipped=34.0 2023-06-26 10:10:31,036 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.66 vs. limit=15.0 2023-06-26 10:10:41,608 INFO [train.py:996] (1/4) Epoch 9, batch 13800, loss[loss=0.2701, simple_loss=0.3839, pruned_loss=0.07815, over 21236.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2977, pruned_loss=0.06842, over 4255221.83 frames. ], batch size: 549, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:10:49,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1546542.0, ans=0.0 2023-06-26 10:11:09,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1546602.0, ans=0.0 2023-06-26 10:12:27,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1546782.0, ans=0.2 2023-06-26 10:12:32,892 INFO [train.py:996] (1/4) Epoch 9, batch 13850, loss[loss=0.2904, simple_loss=0.3657, pruned_loss=0.1076, over 21502.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3044, pruned_loss=0.06869, over 4258414.13 frames. 
], batch size: 471, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:13:11,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1546962.0, ans=0.0 2023-06-26 10:13:51,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1547022.0, ans=0.2 2023-06-26 10:13:57,546 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.808e+02 5.512e+02 9.000e+02 1.173e+03 2.021e+03, threshold=1.800e+03, percent-clipped=1.0 2023-06-26 10:14:08,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1547082.0, ans=0.05 2023-06-26 10:14:13,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1547082.0, ans=0.0 2023-06-26 10:14:18,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-26 10:14:22,456 INFO [train.py:996] (1/4) Epoch 9, batch 13900, loss[loss=0.2418, simple_loss=0.3125, pruned_loss=0.08554, over 21438.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3075, pruned_loss=0.07217, over 4268142.93 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:14:37,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1547142.0, ans=0.125 2023-06-26 10:14:37,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1547142.0, ans=0.125 2023-06-26 10:14:39,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1547202.0, ans=0.2 2023-06-26 10:14:44,175 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:15:51,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1547322.0, ans=0.0 2023-06-26 10:16:11,147 INFO [train.py:996] (1/4) Epoch 9, batch 13950, loss[loss=0.2343, simple_loss=0.3032, pruned_loss=0.08274, over 21923.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3068, pruned_loss=0.07383, over 4279520.39 frames. 
], batch size: 316, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:16:18,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1547442.0, ans=0.125 2023-06-26 10:16:24,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1547442.0, ans=0.2 2023-06-26 10:16:27,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1547502.0, ans=0.2 2023-06-26 10:16:52,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1547562.0, ans=0.125 2023-06-26 10:17:34,993 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.697e+02 5.601e+02 7.890e+02 1.100e+03 2.147e+03, threshold=1.578e+03, percent-clipped=2.0 2023-06-26 10:17:57,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1547742.0, ans=0.0 2023-06-26 10:17:58,867 INFO [train.py:996] (1/4) Epoch 9, batch 14000, loss[loss=0.1862, simple_loss=0.2661, pruned_loss=0.05311, over 21643.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3023, pruned_loss=0.0716, over 4269076.82 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:18:01,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1547742.0, ans=0.1 2023-06-26 10:19:10,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1547922.0, ans=0.0 2023-06-26 10:19:46,316 INFO [train.py:996] (1/4) Epoch 9, batch 14050, loss[loss=0.1737, simple_loss=0.2533, pruned_loss=0.0471, over 16503.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2972, pruned_loss=0.06825, over 4267464.41 frames. ], batch size: 63, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:19:56,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1548042.0, ans=0.125 2023-06-26 10:20:43,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1548162.0, ans=0.125 2023-06-26 10:20:45,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=12.0 2023-06-26 10:21:01,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1548222.0, ans=0.125 2023-06-26 10:21:03,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1548222.0, ans=0.1 2023-06-26 10:21:06,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.796e+02 7.490e+02 1.046e+03 2.202e+03, threshold=1.498e+03, percent-clipped=4.0 2023-06-26 10:21:27,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1548282.0, ans=0.05 2023-06-26 10:21:30,903 INFO [train.py:996] (1/4) Epoch 9, batch 14100, loss[loss=0.2051, simple_loss=0.2696, pruned_loss=0.07031, over 15755.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2905, pruned_loss=0.06765, over 4260814.93 frames. 
], batch size: 60, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:21:51,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1548402.0, ans=0.07 2023-06-26 10:22:08,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1548402.0, ans=0.125 2023-06-26 10:23:18,176 INFO [train.py:996] (1/4) Epoch 9, batch 14150, loss[loss=0.196, simple_loss=0.2886, pruned_loss=0.05167, over 21364.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2953, pruned_loss=0.0687, over 4263906.00 frames. ], batch size: 194, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:24:18,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.06 vs. limit=6.0 2023-06-26 10:24:42,645 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.188e+02 5.825e+02 9.276e+02 1.325e+03 2.479e+03, threshold=1.855e+03, percent-clipped=15.0 2023-06-26 10:24:48,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1548882.0, ans=0.1 2023-06-26 10:24:59,277 INFO [train.py:996] (1/4) Epoch 9, batch 14200, loss[loss=0.2121, simple_loss=0.281, pruned_loss=0.0716, over 21591.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2947, pruned_loss=0.0677, over 4265093.65 frames. ], batch size: 391, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:25:11,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1548942.0, ans=0.125 2023-06-26 10:25:14,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-26 10:25:55,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1549062.0, ans=0.125 2023-06-26 10:26:07,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1549122.0, ans=0.0 2023-06-26 10:26:11,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-26 10:26:47,079 INFO [train.py:996] (1/4) Epoch 9, batch 14250, loss[loss=0.1808, simple_loss=0.2502, pruned_loss=0.05567, over 21202.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2893, pruned_loss=0.06745, over 4248594.78 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:26:54,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1549242.0, ans=0.125 2023-06-26 10:27:13,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1549302.0, ans=0.1 2023-06-26 10:27:40,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1549362.0, ans=0.0 2023-06-26 10:28:22,816 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.193e+02 4.868e+02 6.668e+02 9.362e+02 2.470e+03, threshold=1.334e+03, percent-clipped=6.0 2023-06-26 10:28:43,656 INFO [train.py:996] (1/4) Epoch 9, batch 14300, loss[loss=0.3336, simple_loss=0.422, pruned_loss=0.1226, over 21606.00 frames. 
], tot_loss[loss=0.2131, simple_loss=0.2912, pruned_loss=0.0675, over 4259939.01 frames. ], batch size: 441, lr: 3.29e-03, grad_scale: 8.0 2023-06-26 10:28:51,049 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:28:55,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1549542.0, ans=0.0 2023-06-26 10:28:57,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1549542.0, ans=0.125 2023-06-26 10:29:09,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1549602.0, ans=0.125 2023-06-26 10:29:11,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1549602.0, ans=0.125 2023-06-26 10:29:52,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1549722.0, ans=0.0 2023-06-26 10:30:06,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1549722.0, ans=0.0 2023-06-26 10:30:33,266 INFO [train.py:996] (1/4) Epoch 9, batch 14350, loss[loss=0.2138, simple_loss=0.2995, pruned_loss=0.06401, over 21453.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.294, pruned_loss=0.0667, over 4257217.11 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 8.0 2023-06-26 10:30:44,221 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:30:50,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1549902.0, ans=0.0 2023-06-26 10:31:19,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1549962.0, ans=0.1 2023-06-26 10:31:42,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1550022.0, ans=0.125 2023-06-26 10:32:00,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.713e+02 8.636e+02 1.390e+03 3.076e+03, threshold=1.727e+03, percent-clipped=28.0 2023-06-26 10:32:08,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1550082.0, ans=0.125 2023-06-26 10:32:09,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1550082.0, ans=0.0 2023-06-26 10:32:13,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1550082.0, ans=0.1 2023-06-26 10:32:21,223 INFO [train.py:996] (1/4) Epoch 9, batch 14400, loss[loss=0.2199, simple_loss=0.2904, pruned_loss=0.07467, over 21820.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2914, pruned_loss=0.06754, over 4251560.83 frames. 
], batch size: 351, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:32:44,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1550202.0, ans=0.1 2023-06-26 10:33:00,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1550262.0, ans=0.0 2023-06-26 10:33:05,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1550262.0, ans=0.125 2023-06-26 10:33:16,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1550262.0, ans=0.125 2023-06-26 10:33:35,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=22.5 2023-06-26 10:33:42,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1550322.0, ans=0.0 2023-06-26 10:34:03,154 INFO [train.py:996] (1/4) Epoch 9, batch 14450, loss[loss=0.1938, simple_loss=0.2579, pruned_loss=0.06487, over 21801.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2858, pruned_loss=0.06737, over 4253859.30 frames. ], batch size: 283, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:34:12,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1550442.0, ans=0.025 2023-06-26 10:34:27,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1550502.0, ans=0.0 2023-06-26 10:34:36,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1550502.0, ans=0.2 2023-06-26 10:35:29,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=22.5 2023-06-26 10:35:36,880 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.305e+02 4.619e+02 5.727e+02 8.380e+02 1.480e+03, threshold=1.145e+03, percent-clipped=0.0 2023-06-26 10:35:45,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1550682.0, ans=0.2 2023-06-26 10:35:56,849 INFO [train.py:996] (1/4) Epoch 9, batch 14500, loss[loss=0.1988, simple_loss=0.2752, pruned_loss=0.06122, over 22018.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2833, pruned_loss=0.06759, over 4259391.35 frames. ], batch size: 103, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:37:46,732 INFO [train.py:996] (1/4) Epoch 9, batch 14550, loss[loss=0.2457, simple_loss=0.325, pruned_loss=0.08318, over 21297.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2875, pruned_loss=0.06921, over 4259390.34 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:37:49,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.98 vs. 
limit=12.0 2023-06-26 10:38:19,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1551102.0, ans=0.2 2023-06-26 10:39:00,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1551222.0, ans=0.05 2023-06-26 10:39:20,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.787e+02 5.550e+02 7.546e+02 1.212e+03 2.573e+03, threshold=1.509e+03, percent-clipped=29.0 2023-06-26 10:39:35,747 INFO [train.py:996] (1/4) Epoch 9, batch 14600, loss[loss=0.24, simple_loss=0.314, pruned_loss=0.08298, over 21662.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2963, pruned_loss=0.07304, over 4261681.16 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:39:46,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1551342.0, ans=0.1 2023-06-26 10:40:12,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=12.0 2023-06-26 10:40:14,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0 2023-06-26 10:40:43,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-26 10:40:52,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-26 10:40:57,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-26 10:41:24,129 INFO [train.py:996] (1/4) Epoch 9, batch 14650, loss[loss=0.1611, simple_loss=0.2499, pruned_loss=0.03615, over 21622.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2996, pruned_loss=0.07215, over 4259496.84 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:41:27,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1551642.0, ans=0.1 2023-06-26 10:41:40,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1551702.0, ans=0.125 2023-06-26 10:42:33,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1551822.0, ans=0.125 2023-06-26 10:42:40,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. 
limit=22.5 2023-06-26 10:42:45,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1551882.0, ans=0.0 2023-06-26 10:42:46,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.116e+02 4.456e+02 7.843e+02 1.118e+03 1.924e+03, threshold=1.569e+03, percent-clipped=10.0 2023-06-26 10:42:47,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1551882.0, ans=0.1 2023-06-26 10:42:57,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-26 10:43:07,369 INFO [train.py:996] (1/4) Epoch 9, batch 14700, loss[loss=0.1977, simple_loss=0.2825, pruned_loss=0.0564, over 21233.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2936, pruned_loss=0.067, over 4257131.90 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:43:38,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1552002.0, ans=0.0 2023-06-26 10:44:45,249 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-26 10:44:58,835 INFO [train.py:996] (1/4) Epoch 9, batch 14750, loss[loss=0.2111, simple_loss=0.2911, pruned_loss=0.06555, over 19936.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2977, pruned_loss=0.0691, over 4261597.46 frames. ], batch size: 702, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:45:36,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1552302.0, ans=0.125 2023-06-26 10:45:53,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1552362.0, ans=0.0 2023-06-26 10:46:17,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1552422.0, ans=0.125 2023-06-26 10:46:34,190 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 5.782e+02 7.997e+02 1.225e+03 2.854e+03, threshold=1.599e+03, percent-clipped=14.0 2023-06-26 10:46:55,533 INFO [train.py:996] (1/4) Epoch 9, batch 14800, loss[loss=0.2148, simple_loss=0.2774, pruned_loss=0.07605, over 21299.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3105, pruned_loss=0.07497, over 4267294.50 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:47:26,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1552602.0, ans=0.125 2023-06-26 10:47:31,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1552602.0, ans=0.125 2023-06-26 10:47:41,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1552662.0, ans=0.125 2023-06-26 10:48:59,082 INFO [train.py:996] (1/4) Epoch 9, batch 14850, loss[loss=0.1797, simple_loss=0.2485, pruned_loss=0.0555, over 21258.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3043, pruned_loss=0.07437, over 4266900.23 frames. 
], batch size: 176, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:49:10,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1552842.0, ans=0.1 2023-06-26 10:49:24,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1552902.0, ans=0.125 2023-06-26 10:50:32,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1553082.0, ans=0.125 2023-06-26 10:50:32,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1553082.0, ans=0.0 2023-06-26 10:50:35,325 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.505e+02 5.144e+02 7.174e+02 1.026e+03 2.687e+03, threshold=1.435e+03, percent-clipped=5.0 2023-06-26 10:50:43,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-26 10:50:50,338 INFO [train.py:996] (1/4) Epoch 9, batch 14900, loss[loss=0.2274, simple_loss=0.299, pruned_loss=0.07792, over 21434.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3068, pruned_loss=0.07565, over 4262093.23 frames. ], batch size: 176, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:51:14,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1553202.0, ans=0.0 2023-06-26 10:51:16,396 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:51:21,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1553202.0, ans=0.2 2023-06-26 10:52:46,128 INFO [train.py:996] (1/4) Epoch 9, batch 14950, loss[loss=0.2181, simple_loss=0.3002, pruned_loss=0.06795, over 21295.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3076, pruned_loss=0.07511, over 4261785.48 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:52:46,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1553442.0, ans=0.2 2023-06-26 10:52:52,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1553442.0, ans=0.025 2023-06-26 10:53:20,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1553502.0, ans=0.1 2023-06-26 10:53:51,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1553622.0, ans=0.125 2023-06-26 10:53:56,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1553622.0, ans=0.0 2023-06-26 10:54:17,653 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.700e+02 5.284e+02 7.127e+02 1.003e+03 2.591e+03, threshold=1.425e+03, percent-clipped=12.0 2023-06-26 10:54:37,160 INFO [train.py:996] (1/4) Epoch 9, batch 15000, loss[loss=0.2303, simple_loss=0.3076, pruned_loss=0.07654, over 21382.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3103, pruned_loss=0.07667, over 4266820.84 frames. 
], batch size: 548, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:54:37,161 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 10:54:55,447 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2558, simple_loss=0.3464, pruned_loss=0.08259, over 1796401.00 frames. 2023-06-26 10:54:55,448 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 10:55:02,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1553742.0, ans=0.125 2023-06-26 10:56:11,280 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.42 vs. limit=10.0 2023-06-26 10:56:23,290 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-26 10:56:33,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1553982.0, ans=0.125 2023-06-26 10:56:46,884 INFO [train.py:996] (1/4) Epoch 9, batch 15050, loss[loss=0.2785, simple_loss=0.3665, pruned_loss=0.09528, over 21526.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3111, pruned_loss=0.07745, over 4269893.19 frames. ], batch size: 471, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:56:47,493 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:58:12,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0 2023-06-26 10:58:13,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1554222.0, ans=0.125 2023-06-26 10:58:21,869 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.687e+02 6.658e+02 1.222e+03 1.555e+03 2.780e+03, threshold=2.443e+03, percent-clipped=32.0 2023-06-26 10:58:41,268 INFO [train.py:996] (1/4) Epoch 9, batch 15100, loss[loss=0.2608, simple_loss=0.3335, pruned_loss=0.09403, over 21242.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3128, pruned_loss=0.07649, over 4268861.69 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:59:38,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1554462.0, ans=0.125 2023-06-26 11:00:29,588 INFO [train.py:996] (1/4) Epoch 9, batch 15150, loss[loss=0.2279, simple_loss=0.3096, pruned_loss=0.07308, over 19928.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3086, pruned_loss=0.07679, over 4267946.88 frames. ], batch size: 702, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:00:35,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1554642.0, ans=0.125 2023-06-26 11:01:02,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. 
limit=15.0 2023-06-26 11:01:09,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1554702.0, ans=0.125 2023-06-26 11:01:13,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1554702.0, ans=0.0 2023-06-26 11:01:31,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-26 11:01:46,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1554822.0, ans=0.125 2023-06-26 11:01:57,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=12.0 2023-06-26 11:02:05,236 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 4.649e+02 7.475e+02 1.057e+03 2.217e+03, threshold=1.495e+03, percent-clipped=0.0 2023-06-26 11:02:19,220 INFO [train.py:996] (1/4) Epoch 9, batch 15200, loss[loss=0.1551, simple_loss=0.2313, pruned_loss=0.03941, over 21565.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2999, pruned_loss=0.07297, over 4262652.96 frames. ], batch size: 195, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 11:03:56,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1555182.0, ans=0.125 2023-06-26 11:03:57,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1555182.0, ans=0.125 2023-06-26 11:04:12,988 INFO [train.py:996] (1/4) Epoch 9, batch 15250, loss[loss=0.178, simple_loss=0.2508, pruned_loss=0.05256, over 21634.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2938, pruned_loss=0.07101, over 4264649.76 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:04:45,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1555302.0, ans=0.04949747468305833 2023-06-26 11:04:46,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1555302.0, ans=0.125 2023-06-26 11:05:44,372 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.756e+02 7.926e+02 1.187e+03 2.967e+03, threshold=1.585e+03, percent-clipped=10.0 2023-06-26 11:05:49,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.53 vs. limit=10.0 2023-06-26 11:06:02,508 INFO [train.py:996] (1/4) Epoch 9, batch 15300, loss[loss=0.2437, simple_loss=0.3193, pruned_loss=0.08401, over 21143.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2954, pruned_loss=0.07267, over 4266794.28 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:06:44,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1555602.0, ans=0.0 2023-06-26 11:07:51,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1555842.0, ans=0.125 2023-06-26 11:07:52,694 INFO [train.py:996] (1/4) Epoch 9, batch 15350, loss[loss=0.2223, simple_loss=0.3183, pruned_loss=0.06314, over 21699.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2996, pruned_loss=0.07473, over 4267329.21 frames. 
], batch size: 351, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:08:10,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1555842.0, ans=0.125 2023-06-26 11:08:41,354 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:09:22,283 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.533e+02 5.256e+02 7.334e+02 1.092e+03 2.120e+03, threshold=1.467e+03, percent-clipped=2.0 2023-06-26 11:09:39,842 INFO [train.py:996] (1/4) Epoch 9, batch 15400, loss[loss=0.2031, simple_loss=0.2759, pruned_loss=0.06511, over 21527.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3004, pruned_loss=0.07353, over 4260604.21 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:10:06,738 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-26 11:11:23,627 INFO [train.py:996] (1/4) Epoch 9, batch 15450, loss[loss=0.266, simple_loss=0.3439, pruned_loss=0.09405, over 21590.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2989, pruned_loss=0.07327, over 4269263.61 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:11:45,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1556442.0, ans=0.2 2023-06-26 11:11:59,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1556502.0, ans=0.1 2023-06-26 11:12:09,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1556502.0, ans=0.2 2023-06-26 11:12:19,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1556562.0, ans=0.125 2023-06-26 11:13:01,732 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.285e+02 4.674e+02 6.020e+02 7.889e+02 1.710e+03, threshold=1.204e+03, percent-clipped=2.0 2023-06-26 11:13:07,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1556682.0, ans=0.0 2023-06-26 11:13:20,023 INFO [train.py:996] (1/4) Epoch 9, batch 15500, loss[loss=0.2512, simple_loss=0.3276, pruned_loss=0.08745, over 21300.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.302, pruned_loss=0.07224, over 4259110.31 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:13:30,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1556742.0, ans=10.0 2023-06-26 11:14:47,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1556982.0, ans=0.125 2023-06-26 11:15:09,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-26 11:15:11,408 INFO [train.py:996] (1/4) Epoch 9, batch 15550, loss[loss=0.19, simple_loss=0.254, pruned_loss=0.06298, over 21202.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2992, pruned_loss=0.07027, over 4258161.99 frames. 
], batch size: 608, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:15:13,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557042.0, ans=0.1 2023-06-26 11:15:52,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1557162.0, ans=0.025 2023-06-26 11:16:06,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1557162.0, ans=0.0 2023-06-26 11:16:17,441 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-26 11:16:41,919 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.382e+02 5.062e+02 7.091e+02 1.054e+03 2.391e+03, threshold=1.418e+03, percent-clipped=18.0 2023-06-26 11:16:42,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1557282.0, ans=0.125 2023-06-26 11:16:55,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1557282.0, ans=0.0 2023-06-26 11:16:59,947 INFO [train.py:996] (1/4) Epoch 9, batch 15600, loss[loss=0.2018, simple_loss=0.2681, pruned_loss=0.06778, over 21758.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2928, pruned_loss=0.06855, over 4268988.89 frames. ], batch size: 112, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:18:48,389 INFO [train.py:996] (1/4) Epoch 9, batch 15650, loss[loss=0.2457, simple_loss=0.2914, pruned_loss=0.09999, over 21350.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2923, pruned_loss=0.06822, over 4271435.97 frames. ], batch size: 508, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:19:35,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-26 11:19:39,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1557762.0, ans=0.035 2023-06-26 11:20:25,554 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.281e+02 4.437e+02 5.415e+02 7.572e+02 1.667e+03, threshold=1.083e+03, percent-clipped=3.0 2023-06-26 11:20:43,536 INFO [train.py:996] (1/4) Epoch 9, batch 15700, loss[loss=0.1849, simple_loss=0.2658, pruned_loss=0.05198, over 21612.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2885, pruned_loss=0.06738, over 4274737.67 frames. ], batch size: 247, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:20:56,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-26 11:21:13,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-26 11:21:36,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.51 vs. 
limit=15.0 2023-06-26 11:21:53,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1558122.0, ans=0.05 2023-06-26 11:22:29,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1558242.0, ans=0.0 2023-06-26 11:22:30,872 INFO [train.py:996] (1/4) Epoch 9, batch 15750, loss[loss=0.2625, simple_loss=0.3128, pruned_loss=0.106, over 21373.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2847, pruned_loss=0.06733, over 4269781.04 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:22:43,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1558242.0, ans=0.0 2023-06-26 11:23:24,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1558362.0, ans=0.125 2023-06-26 11:23:46,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1558422.0, ans=0.0 2023-06-26 11:24:00,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1558482.0, ans=0.2 2023-06-26 11:24:01,248 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.153e+02 4.399e+02 6.641e+02 9.028e+02 1.552e+03, threshold=1.328e+03, percent-clipped=11.0 2023-06-26 11:24:18,401 INFO [train.py:996] (1/4) Epoch 9, batch 15800, loss[loss=0.2404, simple_loss=0.2761, pruned_loss=0.1024, over 21514.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2814, pruned_loss=0.06752, over 4264252.38 frames. ], batch size: 512, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:24:19,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.30 vs. limit=15.0 2023-06-26 11:26:06,289 INFO [train.py:996] (1/4) Epoch 9, batch 15850, loss[loss=0.2667, simple_loss=0.3258, pruned_loss=0.1038, over 21411.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2832, pruned_loss=0.06935, over 4263037.27 frames. ], batch size: 510, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:26:13,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1558842.0, ans=0.95 2023-06-26 11:26:29,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1558902.0, ans=0.125 2023-06-26 11:27:30,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=15.0 2023-06-26 11:27:38,948 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 5.055e+02 6.778e+02 9.936e+02 2.216e+03, threshold=1.356e+03, percent-clipped=9.0 2023-06-26 11:27:49,529 INFO [train.py:996] (1/4) Epoch 9, batch 15900, loss[loss=0.2086, simple_loss=0.2974, pruned_loss=0.05993, over 21505.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2819, pruned_loss=0.06968, over 4252637.29 frames. ], batch size: 389, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:28:20,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. 
limit=15.0 2023-06-26 11:28:55,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1559322.0, ans=0.0 2023-06-26 11:29:38,901 INFO [train.py:996] (1/4) Epoch 9, batch 15950, loss[loss=0.1603, simple_loss=0.2638, pruned_loss=0.02837, over 21796.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2837, pruned_loss=0.06706, over 4254819.85 frames. ], batch size: 332, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:29:57,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1559442.0, ans=0.0 2023-06-26 11:30:05,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1559502.0, ans=0.2 2023-06-26 11:30:06,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1559502.0, ans=0.125 2023-06-26 11:30:19,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-26 11:30:26,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1559562.0, ans=0.125 2023-06-26 11:31:08,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1559682.0, ans=0.1 2023-06-26 11:31:17,724 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 4.974e+02 7.474e+02 9.810e+02 2.700e+03, threshold=1.495e+03, percent-clipped=8.0 2023-06-26 11:31:28,097 INFO [train.py:996] (1/4) Epoch 9, batch 16000, loss[loss=0.2192, simple_loss=0.3114, pruned_loss=0.06344, over 21740.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2851, pruned_loss=0.06562, over 4259606.56 frames. ], batch size: 298, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:31:28,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1559742.0, ans=0.125 2023-06-26 11:32:05,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1559802.0, ans=0.125 2023-06-26 11:32:14,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1559862.0, ans=0.1 2023-06-26 11:33:17,711 INFO [train.py:996] (1/4) Epoch 9, batch 16050, loss[loss=0.2331, simple_loss=0.3354, pruned_loss=0.06539, over 21809.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2886, pruned_loss=0.06349, over 4267162.08 frames. 
], batch size: 282, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:33:23,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1560042.0, ans=0.0 2023-06-26 11:33:24,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1560042.0, ans=0.2 2023-06-26 11:33:29,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1560042.0, ans=0.125 2023-06-26 11:34:06,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1560162.0, ans=0.2 2023-06-26 11:34:37,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1560282.0, ans=0.125 2023-06-26 11:34:45,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.405e+02 5.738e+02 8.747e+02 1.434e+03 3.009e+03, threshold=1.749e+03, percent-clipped=21.0 2023-06-26 11:35:05,345 INFO [train.py:996] (1/4) Epoch 9, batch 16100, loss[loss=0.2092, simple_loss=0.282, pruned_loss=0.06818, over 21837.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2951, pruned_loss=0.0654, over 4273334.95 frames. ], batch size: 282, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:35:38,936 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:35:45,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-26 11:35:54,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1560462.0, ans=0.2 2023-06-26 11:36:54,141 INFO [train.py:996] (1/4) Epoch 9, batch 16150, loss[loss=0.238, simple_loss=0.3587, pruned_loss=0.05859, over 20840.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2946, pruned_loss=0.06788, over 4283434.00 frames. ], batch size: 608, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:37:13,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1560642.0, ans=0.2 2023-06-26 11:37:39,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1560762.0, ans=0.0 2023-06-26 11:37:49,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1560762.0, ans=0.125 2023-06-26 11:38:02,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1560822.0, ans=0.0 2023-06-26 11:38:09,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1560822.0, ans=0.09899494936611666 2023-06-26 11:38:33,303 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.369e+02 5.523e+02 8.339e+02 1.289e+03 2.279e+03, threshold=1.668e+03, percent-clipped=10.0 2023-06-26 11:38:46,827 INFO [train.py:996] (1/4) Epoch 9, batch 16200, loss[loss=0.2688, simple_loss=0.3357, pruned_loss=0.101, over 21243.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2963, pruned_loss=0.06892, over 4281672.16 frames. 
], batch size: 143, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:38:55,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1560942.0, ans=0.0 2023-06-26 11:39:05,944 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:39:43,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1561062.0, ans=0.5 2023-06-26 11:40:33,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-26 11:40:38,346 INFO [train.py:996] (1/4) Epoch 9, batch 16250, loss[loss=0.2248, simple_loss=0.2916, pruned_loss=0.07906, over 21460.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2968, pruned_loss=0.06936, over 4285064.92 frames. ], batch size: 509, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:41:28,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1561362.0, ans=0.1 2023-06-26 11:41:38,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1561362.0, ans=0.125 2023-06-26 11:42:17,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.175e+02 4.964e+02 6.149e+02 9.832e+02 2.311e+03, threshold=1.230e+03, percent-clipped=3.0 2023-06-26 11:42:26,958 INFO [train.py:996] (1/4) Epoch 9, batch 16300, loss[loss=0.1836, simple_loss=0.2786, pruned_loss=0.04434, over 21704.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2901, pruned_loss=0.06588, over 4283698.80 frames. ], batch size: 298, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:42:41,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1561542.0, ans=0.125 2023-06-26 11:43:03,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1561602.0, ans=0.125 2023-06-26 11:43:07,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1561662.0, ans=0.0 2023-06-26 11:43:07,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1561662.0, ans=0.2 2023-06-26 11:44:17,118 INFO [train.py:996] (1/4) Epoch 9, batch 16350, loss[loss=0.2142, simple_loss=0.287, pruned_loss=0.0707, over 21419.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2898, pruned_loss=0.0655, over 4287328.23 frames. ], batch size: 194, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:45:56,604 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 4.777e+02 5.847e+02 7.634e+02 1.657e+03, threshold=1.169e+03, percent-clipped=4.0 2023-06-26 11:46:05,028 INFO [train.py:996] (1/4) Epoch 9, batch 16400, loss[loss=0.1996, simple_loss=0.2718, pruned_loss=0.06375, over 21461.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2922, pruned_loss=0.06638, over 4282231.76 frames. ], batch size: 211, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:46:41,472 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.03 vs. 
limit=15.0 2023-06-26 11:46:44,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-26 11:46:51,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1562262.0, ans=0.5 2023-06-26 11:46:58,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1562262.0, ans=0.2 2023-06-26 11:47:00,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1562262.0, ans=0.125 2023-06-26 11:47:19,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1562322.0, ans=0.1 2023-06-26 11:47:54,199 INFO [train.py:996] (1/4) Epoch 9, batch 16450, loss[loss=0.1929, simple_loss=0.273, pruned_loss=0.05642, over 17510.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2931, pruned_loss=0.06689, over 4284148.66 frames. ], batch size: 63, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:48:28,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1562502.0, ans=0.0 2023-06-26 11:48:30,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1562502.0, ans=0.05 2023-06-26 11:48:46,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1562562.0, ans=0.125 2023-06-26 11:49:34,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.87 vs. limit=22.5 2023-06-26 11:49:36,729 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.948e+02 6.287e+02 8.709e+02 1.538e+03, threshold=1.257e+03, percent-clipped=9.0 2023-06-26 11:49:44,330 INFO [train.py:996] (1/4) Epoch 9, batch 16500, loss[loss=0.2202, simple_loss=0.2974, pruned_loss=0.07148, over 21845.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2928, pruned_loss=0.06842, over 4284918.80 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:50:12,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1562802.0, ans=0.125 2023-06-26 11:51:24,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1562982.0, ans=0.125 2023-06-26 11:51:30,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1562982.0, ans=0.125 2023-06-26 11:51:34,680 INFO [train.py:996] (1/4) Epoch 9, batch 16550, loss[loss=0.2239, simple_loss=0.3085, pruned_loss=0.06969, over 21851.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2933, pruned_loss=0.06704, over 4286261.77 frames. 
], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:52:49,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1563222.0, ans=0.1 2023-06-26 11:53:01,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1563222.0, ans=0.125 2023-06-26 11:53:24,959 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.329e+02 6.129e+02 9.986e+02 1.624e+03 3.562e+03, threshold=1.997e+03, percent-clipped=34.0 2023-06-26 11:53:31,909 INFO [train.py:996] (1/4) Epoch 9, batch 16600, loss[loss=0.2698, simple_loss=0.3754, pruned_loss=0.08207, over 21890.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3019, pruned_loss=0.06981, over 4288177.05 frames. ], batch size: 372, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:53:58,280 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=22.5 2023-06-26 11:54:19,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1563402.0, ans=0.125 2023-06-26 11:55:28,531 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-26 11:55:29,109 INFO [train.py:996] (1/4) Epoch 9, batch 16650, loss[loss=0.2217, simple_loss=0.3059, pruned_loss=0.06878, over 21814.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3092, pruned_loss=0.07136, over 4280749.85 frames. ], batch size: 247, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:55:59,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1563702.0, ans=0.0 2023-06-26 11:56:01,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1563702.0, ans=0.125 2023-06-26 11:56:38,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1563762.0, ans=0.0 2023-06-26 11:57:21,318 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.537e+02 4.947e+02 6.891e+02 9.517e+02 1.890e+03, threshold=1.378e+03, percent-clipped=0.0 2023-06-26 11:57:33,727 INFO [train.py:996] (1/4) Epoch 9, batch 16700, loss[loss=0.2419, simple_loss=0.327, pruned_loss=0.0784, over 21670.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3111, pruned_loss=0.07233, over 4282116.67 frames. ], batch size: 389, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:57:48,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1563942.0, ans=0.0 2023-06-26 11:58:26,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1564062.0, ans=15.0 2023-06-26 11:59:09,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=22.5 2023-06-26 11:59:17,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1564182.0, ans=0.07 2023-06-26 11:59:18,611 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.48 vs. 
limit=15.0 2023-06-26 11:59:29,000 INFO [train.py:996] (1/4) Epoch 9, batch 16750, loss[loss=0.2759, simple_loss=0.3475, pruned_loss=0.1022, over 21788.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3132, pruned_loss=0.07538, over 4282829.97 frames. ], batch size: 441, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:59:31,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1564242.0, ans=0.125 2023-06-26 12:00:46,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-26 12:00:50,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-26 12:00:53,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1564422.0, ans=0.07 2023-06-26 12:01:13,729 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.747e+02 5.518e+02 7.489e+02 1.102e+03 1.868e+03, threshold=1.498e+03, percent-clipped=9.0 2023-06-26 12:01:20,315 INFO [train.py:996] (1/4) Epoch 9, batch 16800, loss[loss=0.2189, simple_loss=0.2953, pruned_loss=0.07123, over 21855.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3182, pruned_loss=0.07605, over 4276195.32 frames. ], batch size: 298, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 12:01:36,941 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-06-26 12:01:55,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1564602.0, ans=0.125 2023-06-26 12:02:32,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1564722.0, ans=0.125 2023-06-26 12:02:48,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1564722.0, ans=0.125 2023-06-26 12:02:51,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1564782.0, ans=0.0 2023-06-26 12:03:09,571 INFO [train.py:996] (1/4) Epoch 9, batch 16850, loss[loss=0.2356, simple_loss=0.3698, pruned_loss=0.05063, over 20760.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3134, pruned_loss=0.07552, over 4279073.51 frames. ], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 12:03:43,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. 
limit=15.0 2023-06-26 12:04:13,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1564962.0, ans=0.0 2023-06-26 12:04:15,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1564962.0, ans=0.2 2023-06-26 12:04:52,111 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.430e+02 5.138e+02 7.609e+02 1.062e+03 2.399e+03, threshold=1.522e+03, percent-clipped=7.0 2023-06-26 12:04:59,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1565082.0, ans=0.125 2023-06-26 12:05:02,282 INFO [train.py:996] (1/4) Epoch 9, batch 16900, loss[loss=0.1842, simple_loss=0.2602, pruned_loss=0.05405, over 21753.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3065, pruned_loss=0.07313, over 4275467.06 frames. ], batch size: 316, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:05:09,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-26 12:06:43,790 INFO [train.py:996] (1/4) Epoch 9, batch 16950, loss[loss=0.2113, simple_loss=0.281, pruned_loss=0.07075, over 21770.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.299, pruned_loss=0.07133, over 4280068.17 frames. ], batch size: 247, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:06:55,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1565442.0, ans=0.0 2023-06-26 12:07:10,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1565502.0, ans=0.0 2023-06-26 12:07:15,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1565502.0, ans=0.125 2023-06-26 12:07:22,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1565502.0, ans=0.0 2023-06-26 12:07:57,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1565622.0, ans=0.0 2023-06-26 12:08:16,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1565682.0, ans=0.1 2023-06-26 12:08:27,254 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.850e+02 5.163e+02 6.810e+02 8.799e+02 2.047e+03, threshold=1.362e+03, percent-clipped=3.0 2023-06-26 12:08:32,637 INFO [train.py:996] (1/4) Epoch 9, batch 17000, loss[loss=0.2239, simple_loss=0.2932, pruned_loss=0.07733, over 21856.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2963, pruned_loss=0.07206, over 4286114.13 frames. ], batch size: 391, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:10:18,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1565982.0, ans=0.1 2023-06-26 12:10:29,843 INFO [train.py:996] (1/4) Epoch 9, batch 17050, loss[loss=0.2337, simple_loss=0.315, pruned_loss=0.07617, over 21293.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3035, pruned_loss=0.0746, over 4289384.90 frames. 
], batch size: 176, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:10:30,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1566042.0, ans=0.1 2023-06-26 12:11:34,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-26 12:11:55,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1566282.0, ans=0.125 2023-06-26 12:12:06,988 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 5.718e+02 8.769e+02 1.372e+03 2.605e+03, threshold=1.754e+03, percent-clipped=26.0 2023-06-26 12:12:17,824 INFO [train.py:996] (1/4) Epoch 9, batch 17100, loss[loss=0.2301, simple_loss=0.3033, pruned_loss=0.07847, over 21882.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.302, pruned_loss=0.07532, over 4288183.89 frames. ], batch size: 414, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:12:30,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-26 12:12:53,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1566402.0, ans=0.015 2023-06-26 12:12:56,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1566402.0, ans=0.0 2023-06-26 12:13:22,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1566462.0, ans=0.07 2023-06-26 12:13:28,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1566522.0, ans=0.1 2023-06-26 12:13:39,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566522.0, ans=0.1 2023-06-26 12:14:10,888 INFO [train.py:996] (1/4) Epoch 9, batch 17150, loss[loss=0.2197, simple_loss=0.286, pruned_loss=0.07673, over 21264.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2984, pruned_loss=0.0746, over 4296156.05 frames. ], batch size: 143, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:14:12,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-26 12:14:22,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1566642.0, ans=0.025 2023-06-26 12:15:31,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1566882.0, ans=0.125 2023-06-26 12:15:55,035 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.574e+02 4.824e+02 6.813e+02 1.101e+03 2.342e+03, threshold=1.363e+03, percent-clipped=2.0 2023-06-26 12:16:00,471 INFO [train.py:996] (1/4) Epoch 9, batch 17200, loss[loss=0.2247, simple_loss=0.2973, pruned_loss=0.07611, over 21817.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2997, pruned_loss=0.07509, over 4289766.08 frames. 
], batch size: 247, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 12:16:34,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1567002.0, ans=0.125 2023-06-26 12:16:43,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1567002.0, ans=0.2 2023-06-26 12:17:32,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1567182.0, ans=0.5 2023-06-26 12:17:39,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1567182.0, ans=0.0 2023-06-26 12:18:02,437 INFO [train.py:996] (1/4) Epoch 9, batch 17250, loss[loss=0.236, simple_loss=0.318, pruned_loss=0.07697, over 21634.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.302, pruned_loss=0.07644, over 4287449.45 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:18:48,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1567362.0, ans=0.125 2023-06-26 12:19:14,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1567422.0, ans=0.0 2023-06-26 12:19:16,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1567422.0, ans=0.125 2023-06-26 12:19:48,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.788e+02 5.514e+02 7.810e+02 1.291e+03 2.321e+03, threshold=1.562e+03, percent-clipped=17.0 2023-06-26 12:19:49,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1567482.0, ans=0.2 2023-06-26 12:19:52,296 INFO [train.py:996] (1/4) Epoch 9, batch 17300, loss[loss=0.2796, simple_loss=0.3517, pruned_loss=0.1038, over 21323.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.311, pruned_loss=0.08006, over 4290666.77 frames. ], batch size: 143, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:20:40,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1567662.0, ans=0.125 2023-06-26 12:21:24,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1567782.0, ans=0.0 2023-06-26 12:21:38,669 INFO [train.py:996] (1/4) Epoch 9, batch 17350, loss[loss=0.2811, simple_loss=0.3796, pruned_loss=0.09129, over 20794.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3113, pruned_loss=0.0795, over 4288327.46 frames. ], batch size: 607, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:22:24,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1567962.0, ans=0.0 2023-06-26 12:23:03,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1568082.0, ans=0.0 2023-06-26 12:23:15,910 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.405e+02 5.455e+02 8.630e+02 1.274e+03 2.528e+03, threshold=1.726e+03, percent-clipped=16.0 2023-06-26 12:23:19,220 INFO [train.py:996] (1/4) Epoch 9, batch 17400, loss[loss=0.255, simple_loss=0.34, pruned_loss=0.08503, over 21591.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3081, pruned_loss=0.0766, over 4282796.55 frames. 
], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:23:19,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1568142.0, ans=0.125 2023-06-26 12:23:31,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1568142.0, ans=0.125 2023-06-26 12:24:36,629 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-26 12:25:10,951 INFO [train.py:996] (1/4) Epoch 9, batch 17450, loss[loss=0.177, simple_loss=0.2644, pruned_loss=0.04484, over 21772.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3044, pruned_loss=0.07344, over 4264526.92 frames. ], batch size: 282, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:25:33,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1568502.0, ans=0.0 2023-06-26 12:25:40,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1568502.0, ans=0.125 2023-06-26 12:25:56,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1568562.0, ans=0.125 2023-06-26 12:26:15,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1568562.0, ans=0.0 2023-06-26 12:26:47,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1568682.0, ans=0.0 2023-06-26 12:26:57,093 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 4.686e+02 6.725e+02 1.029e+03 2.928e+03, threshold=1.345e+03, percent-clipped=7.0 2023-06-26 12:26:58,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-26 12:26:58,686 INFO [train.py:996] (1/4) Epoch 9, batch 17500, loss[loss=0.213, simple_loss=0.3268, pruned_loss=0.04962, over 19827.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3012, pruned_loss=0.07116, over 4264040.75 frames. ], batch size: 703, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:27:04,141 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 12:27:16,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1568802.0, ans=0.0 2023-06-26 12:27:54,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1568862.0, ans=0.125 2023-06-26 12:28:08,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1568922.0, ans=0.0 2023-06-26 12:28:08,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-26 12:28:38,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1568982.0, ans=0.125 2023-06-26 12:28:40,999 INFO [train.py:996] (1/4) Epoch 9, batch 17550, loss[loss=0.2101, simple_loss=0.3022, pruned_loss=0.05902, over 21797.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3016, pruned_loss=0.06982, over 4271792.67 frames. 
], batch size: 124, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:29:02,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1569102.0, ans=0.035 2023-06-26 12:29:19,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1569102.0, ans=0.0 2023-06-26 12:29:53,273 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-26 12:29:56,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1569222.0, ans=0.2 2023-06-26 12:30:01,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1569222.0, ans=0.1 2023-06-26 12:30:01,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1569222.0, ans=0.1 2023-06-26 12:30:34,070 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 4.559e+02 6.477e+02 8.639e+02 1.603e+03, threshold=1.295e+03, percent-clipped=2.0 2023-06-26 12:30:35,777 INFO [train.py:996] (1/4) Epoch 9, batch 17600, loss[loss=0.2437, simple_loss=0.322, pruned_loss=0.08271, over 21529.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3035, pruned_loss=0.07035, over 4270684.07 frames. ], batch size: 414, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:30:53,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1569402.0, ans=0.125 2023-06-26 12:31:34,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1569462.0, ans=0.0 2023-06-26 12:31:50,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1569522.0, ans=0.2 2023-06-26 12:31:54,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1569522.0, ans=0.015 2023-06-26 12:31:57,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-26 12:32:20,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1569642.0, ans=0.125 2023-06-26 12:32:21,704 INFO [train.py:996] (1/4) Epoch 9, batch 17650, loss[loss=0.1655, simple_loss=0.2263, pruned_loss=0.0523, over 21275.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2986, pruned_loss=0.0697, over 4263425.45 frames. 
], batch size: 159, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:32:53,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1569702.0, ans=0.0 2023-06-26 12:33:16,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1569762.0, ans=0.0 2023-06-26 12:33:30,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1569822.0, ans=0.0 2023-06-26 12:34:09,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 6.180e+02 8.586e+02 1.472e+03 2.723e+03, threshold=1.717e+03, percent-clipped=31.0 2023-06-26 12:34:10,898 INFO [train.py:996] (1/4) Epoch 9, batch 17700, loss[loss=0.2174, simple_loss=0.3048, pruned_loss=0.065, over 21377.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2945, pruned_loss=0.06748, over 4261425.78 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:34:32,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1570002.0, ans=0.2 2023-06-26 12:36:06,984 INFO [train.py:996] (1/4) Epoch 9, batch 17750, loss[loss=0.2342, simple_loss=0.3164, pruned_loss=0.07594, over 21750.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2997, pruned_loss=0.06954, over 4268645.89 frames. ], batch size: 298, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:36:09,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1570242.0, ans=0.125 2023-06-26 12:36:25,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1570242.0, ans=0.2 2023-06-26 12:36:33,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1570302.0, ans=0.125 2023-06-26 12:36:44,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1570302.0, ans=0.1 2023-06-26 12:36:44,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-26 12:37:23,562 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-26 12:37:32,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1570422.0, ans=0.125 2023-06-26 12:37:44,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1570482.0, ans=0.5 2023-06-26 12:37:55,882 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 12:37:56,887 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.878e+02 5.343e+02 8.043e+02 1.136e+03 2.008e+03, threshold=1.609e+03, percent-clipped=5.0 2023-06-26 12:38:04,117 INFO [train.py:996] (1/4) Epoch 9, batch 17800, loss[loss=0.1952, simple_loss=0.2941, pruned_loss=0.04814, over 19814.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3013, pruned_loss=0.07024, over 4272944.41 frames. 
], batch size: 702, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:38:26,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1570602.0, ans=0.0 2023-06-26 12:38:37,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1570602.0, ans=0.05 2023-06-26 12:38:56,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1570662.0, ans=0.1 2023-06-26 12:39:08,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-26 12:39:55,309 INFO [train.py:996] (1/4) Epoch 9, batch 17850, loss[loss=0.2657, simple_loss=0.3376, pruned_loss=0.09696, over 21764.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3004, pruned_loss=0.07066, over 4259712.76 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:41:42,264 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 5.491e+02 8.059e+02 1.156e+03 1.916e+03, threshold=1.612e+03, percent-clipped=10.0 2023-06-26 12:41:43,903 INFO [train.py:996] (1/4) Epoch 9, batch 17900, loss[loss=0.2233, simple_loss=0.3387, pruned_loss=0.05394, over 20819.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3053, pruned_loss=0.07213, over 4264373.35 frames. ], batch size: 608, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:42:45,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1571262.0, ans=0.0 2023-06-26 12:43:02,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1571322.0, ans=0.07 2023-06-26 12:43:20,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1571382.0, ans=0.125 2023-06-26 12:43:40,952 INFO [train.py:996] (1/4) Epoch 9, batch 17950, loss[loss=0.2343, simple_loss=0.3253, pruned_loss=0.07168, over 21622.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3055, pruned_loss=0.06931, over 4267539.12 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:44:58,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1571622.0, ans=0.125 2023-06-26 12:45:22,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1571682.0, ans=0.125 2023-06-26 12:45:24,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 4.426e+02 5.727e+02 7.254e+02 1.857e+03, threshold=1.145e+03, percent-clipped=1.0 2023-06-26 12:45:26,476 INFO [train.py:996] (1/4) Epoch 9, batch 18000, loss[loss=0.1837, simple_loss=0.2637, pruned_loss=0.05185, over 20759.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2984, pruned_loss=0.06713, over 4262994.72 frames. ], batch size: 607, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 12:45:26,477 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 12:45:38,956 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.1832, 2.6054, 2.5310, 3.1961, 1.8902, 2.9658, 3.0100, 2.2112], device='cuda:1') 2023-06-26 12:45:46,675 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2587, simple_loss=0.3543, pruned_loss=0.08153, over 1796401.00 frames. 
2023-06-26 12:45:46,676 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 12:45:56,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1571742.0, ans=0.125 2023-06-26 12:46:00,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1571742.0, ans=0.2 2023-06-26 12:46:49,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1571922.0, ans=0.125 2023-06-26 12:47:08,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1571922.0, ans=0.0 2023-06-26 12:47:12,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-26 12:47:16,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1571982.0, ans=0.125 2023-06-26 12:47:16,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1571982.0, ans=0.1 2023-06-26 12:47:36,561 INFO [train.py:996] (1/4) Epoch 9, batch 18050, loss[loss=0.2358, simple_loss=0.3058, pruned_loss=0.08295, over 21715.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2929, pruned_loss=0.06632, over 4263373.92 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:47:51,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1572042.0, ans=0.0 2023-06-26 12:49:05,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1572282.0, ans=0.1 2023-06-26 12:49:28,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.634e+02 5.417e+02 6.596e+02 1.071e+03 2.802e+03, threshold=1.319e+03, percent-clipped=21.0 2023-06-26 12:49:28,456 INFO [train.py:996] (1/4) Epoch 9, batch 18100, loss[loss=0.2162, simple_loss=0.3062, pruned_loss=0.06307, over 21253.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2966, pruned_loss=0.06743, over 4263825.40 frames. ], batch size: 159, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:49:49,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1572342.0, ans=0.2 2023-06-26 12:50:44,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-06-26 12:51:18,348 INFO [train.py:996] (1/4) Epoch 9, batch 18150, loss[loss=0.2303, simple_loss=0.2979, pruned_loss=0.08135, over 21660.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2981, pruned_loss=0.06733, over 4264833.94 frames. 
], batch size: 415, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:51:29,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1572642.0, ans=0.125 2023-06-26 12:52:24,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1572762.0, ans=0.125 2023-06-26 12:52:44,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1572882.0, ans=0.1 2023-06-26 12:52:58,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1572882.0, ans=0.1 2023-06-26 12:53:05,711 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.440e+02 4.565e+02 5.741e+02 8.756e+02 1.817e+03, threshold=1.148e+03, percent-clipped=4.0 2023-06-26 12:53:05,744 INFO [train.py:996] (1/4) Epoch 9, batch 18200, loss[loss=0.2135, simple_loss=0.2788, pruned_loss=0.07412, over 21610.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.292, pruned_loss=0.06722, over 4257978.19 frames. ], batch size: 415, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:53:18,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-26 12:53:33,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1573002.0, ans=0.125 2023-06-26 12:53:58,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1573062.0, ans=0.125 2023-06-26 12:54:46,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1573242.0, ans=0.0 2023-06-26 12:54:47,257 INFO [train.py:996] (1/4) Epoch 9, batch 18250, loss[loss=0.1918, simple_loss=0.265, pruned_loss=0.05934, over 21638.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2852, pruned_loss=0.06536, over 4261940.24 frames. ], batch size: 195, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:55:51,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1573422.0, ans=0.0 2023-06-26 12:56:03,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-26 12:56:12,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.77 vs. limit=10.0 2023-06-26 12:56:20,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-26 12:56:42,098 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.384e+02 4.816e+02 6.355e+02 8.859e+02 2.523e+03, threshold=1.271e+03, percent-clipped=14.0 2023-06-26 12:56:42,140 INFO [train.py:996] (1/4) Epoch 9, batch 18300, loss[loss=0.2426, simple_loss=0.3506, pruned_loss=0.06733, over 21823.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.286, pruned_loss=0.06578, over 4269729.71 frames. 
], batch size: 316, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:56:58,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1573602.0, ans=0.0 2023-06-26 12:57:52,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1573722.0, ans=0.2 2023-06-26 12:58:05,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1573782.0, ans=0.07 2023-06-26 12:58:25,482 INFO [train.py:996] (1/4) Epoch 9, batch 18350, loss[loss=0.2269, simple_loss=0.2935, pruned_loss=0.08012, over 21285.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2922, pruned_loss=0.06588, over 4267850.59 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:58:29,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1573842.0, ans=0.0 2023-06-26 13:00:14,624 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.067e+02 5.416e+02 7.058e+02 9.535e+02 2.465e+03, threshold=1.412e+03, percent-clipped=12.0 2023-06-26 13:00:14,656 INFO [train.py:996] (1/4) Epoch 9, batch 18400, loss[loss=0.1717, simple_loss=0.2572, pruned_loss=0.04311, over 21477.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2875, pruned_loss=0.06487, over 4265728.48 frames. ], batch size: 212, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 13:01:56,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1574382.0, ans=0.0 2023-06-26 13:01:59,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1574382.0, ans=0.0 2023-06-26 13:02:01,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1574382.0, ans=22.5 2023-06-26 13:02:04,283 INFO [train.py:996] (1/4) Epoch 9, batch 18450, loss[loss=0.1564, simple_loss=0.2554, pruned_loss=0.02871, over 21826.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2835, pruned_loss=0.0618, over 4257793.34 frames. ], batch size: 317, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 13:02:13,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-26 13:02:36,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1574502.0, ans=0.125 2023-06-26 13:03:30,879 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-26 13:03:42,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1574682.0, ans=0.125 2023-06-26 13:03:45,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1574682.0, ans=0.2 2023-06-26 13:03:52,178 INFO [train.py:996] (1/4) Epoch 9, batch 18500, loss[loss=0.1938, simple_loss=0.2667, pruned_loss=0.06047, over 21743.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.279, pruned_loss=0.0604, over 4255858.64 frames. 
], batch size: 112, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:03:53,904 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.096e+02 4.756e+02 7.398e+02 1.037e+03 4.377e+03, threshold=1.480e+03, percent-clipped=11.0 2023-06-26 13:04:03,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1574742.0, ans=0.125 2023-06-26 13:04:05,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1574742.0, ans=0.0 2023-06-26 13:04:05,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1574742.0, ans=0.125 2023-06-26 13:04:09,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=1574802.0, ans=15.0 2023-06-26 13:04:32,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1574862.0, ans=0.0 2023-06-26 13:04:37,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1574862.0, ans=0.2 2023-06-26 13:04:46,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574862.0, ans=0.1 2023-06-26 13:04:51,705 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2023-06-26 13:05:12,955 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0 2023-06-26 13:05:24,665 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 13:05:26,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-26 13:05:40,083 INFO [train.py:996] (1/4) Epoch 9, batch 18550, loss[loss=0.2007, simple_loss=0.2674, pruned_loss=0.067, over 21778.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2769, pruned_loss=0.05999, over 4248516.29 frames. ], batch size: 371, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:06:11,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1575102.0, ans=0.1 2023-06-26 13:06:40,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=12.0 2023-06-26 13:07:28,415 INFO [train.py:996] (1/4) Epoch 9, batch 18600, loss[loss=0.2062, simple_loss=0.2877, pruned_loss=0.06239, over 21556.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2746, pruned_loss=0.06053, over 4233425.30 frames. 
], batch size: 389, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:07:30,257 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.128e+02 4.633e+02 7.387e+02 1.048e+03 1.831e+03, threshold=1.477e+03, percent-clipped=1.0 2023-06-26 13:07:48,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1575402.0, ans=0.1 2023-06-26 13:08:49,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1575522.0, ans=0.125 2023-06-26 13:09:15,094 INFO [train.py:996] (1/4) Epoch 9, batch 18650, loss[loss=0.1983, simple_loss=0.2522, pruned_loss=0.07224, over 16999.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2744, pruned_loss=0.06066, over 4220686.29 frames. ], batch size: 66, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:09:30,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1575702.0, ans=0.125 2023-06-26 13:10:08,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1575762.0, ans=0.125 2023-06-26 13:10:26,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1575822.0, ans=0.0 2023-06-26 13:11:02,353 INFO [train.py:996] (1/4) Epoch 9, batch 18700, loss[loss=0.2195, simple_loss=0.2782, pruned_loss=0.08041, over 21707.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2723, pruned_loss=0.06177, over 4235177.96 frames. ], batch size: 416, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:11:04,040 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.187e+02 4.395e+02 5.926e+02 8.949e+02 1.374e+03, threshold=1.185e+03, percent-clipped=0.0 2023-06-26 13:11:15,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1575942.0, ans=0.0 2023-06-26 13:12:05,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1576122.0, ans=0.0 2023-06-26 13:12:07,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1576122.0, ans=0.0 2023-06-26 13:12:32,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1576182.0, ans=0.1 2023-06-26 13:12:43,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-26 13:12:49,677 INFO [train.py:996] (1/4) Epoch 9, batch 18750, loss[loss=0.2203, simple_loss=0.2983, pruned_loss=0.07117, over 21628.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2737, pruned_loss=0.06379, over 4255762.04 frames. 
], batch size: 230, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:13:43,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1576362.0, ans=0.0 2023-06-26 13:13:45,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1576422.0, ans=0.125 2023-06-26 13:13:51,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1576422.0, ans=0.2 2023-06-26 13:14:38,312 INFO [train.py:996] (1/4) Epoch 9, batch 18800, loss[loss=0.1873, simple_loss=0.2734, pruned_loss=0.05066, over 21381.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2817, pruned_loss=0.06573, over 4264780.53 frames. ], batch size: 211, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:14:40,091 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 6.038e+02 7.723e+02 1.097e+03 3.023e+03, threshold=1.545e+03, percent-clipped=19.0 2023-06-26 13:15:03,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1576602.0, ans=10.0 2023-06-26 13:16:26,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1576842.0, ans=0.125 2023-06-26 13:16:27,803 INFO [train.py:996] (1/4) Epoch 9, batch 18850, loss[loss=0.1479, simple_loss=0.2349, pruned_loss=0.03041, over 21431.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2826, pruned_loss=0.06311, over 4259439.91 frames. ], batch size: 211, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:17:12,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1576962.0, ans=0.125 2023-06-26 13:17:17,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1576962.0, ans=0.125 2023-06-26 13:17:33,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1577022.0, ans=0.125 2023-06-26 13:18:04,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1577082.0, ans=0.05 2023-06-26 13:18:12,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-26 13:18:14,392 INFO [train.py:996] (1/4) Epoch 9, batch 18900, loss[loss=0.2134, simple_loss=0.2842, pruned_loss=0.07127, over 21419.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2778, pruned_loss=0.06225, over 4258739.06 frames. 
], batch size: 131, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:18:17,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.228e+02 4.531e+02 6.963e+02 9.490e+02 1.932e+03, threshold=1.393e+03, percent-clipped=3.0 2023-06-26 13:18:25,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1577142.0, ans=0.0 2023-06-26 13:18:25,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1577142.0, ans=0.125 2023-06-26 13:19:48,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1577382.0, ans=0.2 2023-06-26 13:20:00,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1577382.0, ans=0.125 2023-06-26 13:20:03,846 INFO [train.py:996] (1/4) Epoch 9, batch 18950, loss[loss=0.2484, simple_loss=0.3373, pruned_loss=0.07968, over 21790.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2779, pruned_loss=0.06391, over 4264805.65 frames. ], batch size: 414, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:20:31,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1577502.0, ans=0.05 2023-06-26 13:21:30,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1577622.0, ans=0.1 2023-06-26 13:21:31,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=22.5 2023-06-26 13:21:51,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1577682.0, ans=0.125 2023-06-26 13:21:54,022 INFO [train.py:996] (1/4) Epoch 9, batch 19000, loss[loss=0.2405, simple_loss=0.3214, pruned_loss=0.07977, over 21691.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2871, pruned_loss=0.06569, over 4258850.06 frames. ], batch size: 332, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:21:58,081 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.501e+02 4.865e+02 6.670e+02 8.887e+02 1.787e+03, threshold=1.334e+03, percent-clipped=6.0 2023-06-26 13:22:03,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1577742.0, ans=0.125 2023-06-26 13:22:27,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1577802.0, ans=0.125 2023-06-26 13:22:34,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1577862.0, ans=0.125 2023-06-26 13:22:36,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1577862.0, ans=0.125 2023-06-26 13:22:57,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1577922.0, ans=0.0 2023-06-26 13:23:29,912 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.77 vs. 
limit=12.0 2023-06-26 13:23:37,507 INFO [train.py:996] (1/4) Epoch 9, batch 19050, loss[loss=0.2216, simple_loss=0.2944, pruned_loss=0.0744, over 21943.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2923, pruned_loss=0.06883, over 4257636.37 frames. ], batch size: 113, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:23:47,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1578042.0, ans=0.2 2023-06-26 13:24:40,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0 2023-06-26 13:25:03,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1578282.0, ans=0.125 2023-06-26 13:25:20,508 INFO [train.py:996] (1/4) Epoch 9, batch 19100, loss[loss=0.1893, simple_loss=0.2611, pruned_loss=0.05879, over 21397.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2907, pruned_loss=0.06991, over 4255635.52 frames. ], batch size: 131, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:25:24,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.781e+02 5.304e+02 7.054e+02 1.099e+03 1.877e+03, threshold=1.411e+03, percent-clipped=10.0 2023-06-26 13:25:26,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1578342.0, ans=0.125 2023-06-26 13:26:51,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1578582.0, ans=0.0 2023-06-26 13:27:11,385 INFO [train.py:996] (1/4) Epoch 9, batch 19150, loss[loss=0.291, simple_loss=0.3787, pruned_loss=0.1017, over 21581.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2933, pruned_loss=0.07066, over 4257717.47 frames. ], batch size: 441, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:28:37,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1578822.0, ans=0.125 2023-06-26 13:29:06,083 INFO [train.py:996] (1/4) Epoch 9, batch 19200, loss[loss=0.226, simple_loss=0.3286, pruned_loss=0.06167, over 21400.00 frames. ], tot_loss[loss=0.222, simple_loss=0.302, pruned_loss=0.07099, over 4265915.43 frames. ], batch size: 194, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:29:10,040 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.893e+02 6.153e+02 9.835e+02 1.321e+03 2.570e+03, threshold=1.967e+03, percent-clipped=19.0 2023-06-26 13:29:49,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1579002.0, ans=0.2 2023-06-26 13:30:12,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579122.0, ans=0.1 2023-06-26 13:30:49,830 INFO [train.py:996] (1/4) Epoch 9, batch 19250, loss[loss=0.17, simple_loss=0.269, pruned_loss=0.03545, over 21690.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3026, pruned_loss=0.06704, over 4275743.28 frames. 
], batch size: 298, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:30:52,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579242.0, ans=0.1 2023-06-26 13:31:14,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1579302.0, ans=0.125 2023-06-26 13:31:26,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1579302.0, ans=0.125 2023-06-26 13:31:57,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1579422.0, ans=0.0 2023-06-26 13:32:24,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1579482.0, ans=0.125 2023-06-26 13:32:38,025 INFO [train.py:996] (1/4) Epoch 9, batch 19300, loss[loss=0.1894, simple_loss=0.2671, pruned_loss=0.05582, over 21645.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2995, pruned_loss=0.06647, over 4285495.25 frames. ], batch size: 230, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:32:41,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.842e+02 4.708e+02 6.632e+02 9.817e+02 2.132e+03, threshold=1.326e+03, percent-clipped=1.0 2023-06-26 13:32:57,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1579602.0, ans=0.0 2023-06-26 13:33:25,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-26 13:33:30,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1579662.0, ans=0.05 2023-06-26 13:33:46,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1579722.0, ans=0.0 2023-06-26 13:33:55,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1579722.0, ans=0.125 2023-06-26 13:34:23,275 INFO [train.py:996] (1/4) Epoch 9, batch 19350, loss[loss=0.2064, simple_loss=0.2991, pruned_loss=0.05688, over 21699.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2956, pruned_loss=0.06352, over 4276278.91 frames. ], batch size: 391, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:34:42,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1579842.0, ans=0.125 2023-06-26 13:35:16,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1579962.0, ans=0.025 2023-06-26 13:35:19,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1579962.0, ans=0.125 2023-06-26 13:35:32,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. 
limit=22.5 2023-06-26 13:35:44,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1580022.0, ans=0.0 2023-06-26 13:36:09,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1580142.0, ans=22.5 2023-06-26 13:36:10,358 INFO [train.py:996] (1/4) Epoch 9, batch 19400, loss[loss=0.2107, simple_loss=0.2772, pruned_loss=0.07212, over 21617.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2925, pruned_loss=0.0627, over 4274742.59 frames. ], batch size: 195, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:36:13,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1580142.0, ans=0.125 2023-06-26 13:36:15,960 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.056e+02 5.043e+02 7.685e+02 1.074e+03 1.940e+03, threshold=1.537e+03, percent-clipped=16.0 2023-06-26 13:36:23,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1580142.0, ans=0.125 2023-06-26 13:37:21,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1580322.0, ans=0.125 2023-06-26 13:37:26,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1580322.0, ans=0.125 2023-06-26 13:37:53,588 INFO [train.py:996] (1/4) Epoch 9, batch 19450, loss[loss=0.2002, simple_loss=0.2671, pruned_loss=0.06663, over 21792.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2899, pruned_loss=0.06457, over 4281040.59 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:38:39,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1580502.0, ans=0.0 2023-06-26 13:38:42,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1580562.0, ans=0.125 2023-06-26 13:38:44,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1580562.0, ans=0.125 2023-06-26 13:38:45,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1580562.0, ans=0.125 2023-06-26 13:39:41,554 INFO [train.py:996] (1/4) Epoch 9, batch 19500, loss[loss=0.2291, simple_loss=0.3141, pruned_loss=0.07203, over 21159.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2874, pruned_loss=0.06594, over 4277693.74 frames. 
], batch size: 548, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:39:46,899 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.457e+02 4.487e+02 6.079e+02 9.287e+02 2.149e+03, threshold=1.216e+03, percent-clipped=7.0 2023-06-26 13:40:01,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1580742.0, ans=0.05 2023-06-26 13:40:09,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1580802.0, ans=0.04949747468305833 2023-06-26 13:41:30,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1581042.0, ans=0.125 2023-06-26 13:41:30,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-26 13:41:31,331 INFO [train.py:996] (1/4) Epoch 9, batch 19550, loss[loss=0.1529, simple_loss=0.2059, pruned_loss=0.04994, over 21847.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2814, pruned_loss=0.06415, over 4278342.72 frames. ], batch size: 98, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:41:46,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-26 13:41:57,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1581102.0, ans=0.2 2023-06-26 13:42:10,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1581102.0, ans=0.0 2023-06-26 13:42:11,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1581102.0, ans=0.0 2023-06-26 13:42:41,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1581222.0, ans=0.125 2023-06-26 13:42:45,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1581222.0, ans=0.125 2023-06-26 13:42:48,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1581222.0, ans=0.125 2023-06-26 13:43:18,831 INFO [train.py:996] (1/4) Epoch 9, batch 19600, loss[loss=0.2505, simple_loss=0.3298, pruned_loss=0.0856, over 21464.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2831, pruned_loss=0.06484, over 4276659.13 frames. ], batch size: 131, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:43:28,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1581342.0, ans=0.0 2023-06-26 13:43:29,296 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.268e+02 5.045e+02 6.281e+02 9.154e+02 2.396e+03, threshold=1.256e+03, percent-clipped=14.0 2023-06-26 13:43:51,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1581402.0, ans=0.125 2023-06-26 13:44:00,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-26 13:44:30,188 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-06-26 13:44:41,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581522.0, ans=0.1 2023-06-26 13:44:50,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1581582.0, ans=0.125 2023-06-26 13:45:13,611 INFO [train.py:996] (1/4) Epoch 9, batch 19650, loss[loss=0.1946, simple_loss=0.2705, pruned_loss=0.0593, over 21425.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2881, pruned_loss=0.06785, over 4280448.87 frames. ], batch size: 211, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:45:14,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1581642.0, ans=0.0 2023-06-26 13:47:15,809 INFO [train.py:996] (1/4) Epoch 9, batch 19700, loss[loss=0.2193, simple_loss=0.3128, pruned_loss=0.06287, over 21713.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.29, pruned_loss=0.06876, over 4271474.05 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:47:22,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.461e+02 6.188e+02 8.447e+02 1.401e+03 2.428e+03, threshold=1.689e+03, percent-clipped=28.0 2023-06-26 13:47:56,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1582062.0, ans=0.1 2023-06-26 13:48:09,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.67 vs. limit=22.5 2023-06-26 13:48:39,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-26 13:49:06,315 INFO [train.py:996] (1/4) Epoch 9, batch 19750, loss[loss=0.2554, simple_loss=0.3598, pruned_loss=0.07552, over 21753.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2998, pruned_loss=0.07014, over 4263088.86 frames. ], batch size: 332, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:49:29,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1582302.0, ans=0.1 2023-06-26 13:50:24,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1582422.0, ans=0.07 2023-06-26 13:50:55,596 INFO [train.py:996] (1/4) Epoch 9, batch 19800, loss[loss=0.1772, simple_loss=0.2529, pruned_loss=0.05068, over 21405.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2978, pruned_loss=0.07049, over 4274470.95 frames. ], batch size: 211, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:51:02,876 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 6.209e+02 8.156e+02 1.271e+03 2.290e+03, threshold=1.631e+03, percent-clipped=8.0 2023-06-26 13:52:40,911 INFO [train.py:996] (1/4) Epoch 9, batch 19850, loss[loss=0.1738, simple_loss=0.251, pruned_loss=0.04828, over 21221.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2924, pruned_loss=0.06645, over 4270555.19 frames. ], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:52:45,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.71 vs. 
limit=15.0 2023-06-26 13:53:07,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1582902.0, ans=0.125 2023-06-26 13:53:19,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1582962.0, ans=0.125 2023-06-26 13:53:45,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1582962.0, ans=0.125 2023-06-26 13:54:07,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1583022.0, ans=0.5 2023-06-26 13:54:26,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1583142.0, ans=0.125 2023-06-26 13:54:27,767 INFO [train.py:996] (1/4) Epoch 9, batch 19900, loss[loss=0.1736, simple_loss=0.2614, pruned_loss=0.04285, over 21832.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2915, pruned_loss=0.0637, over 4264638.47 frames. ], batch size: 107, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:54:34,749 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.163e+02 4.779e+02 6.020e+02 7.987e+02 2.016e+03, threshold=1.204e+03, percent-clipped=5.0 2023-06-26 13:55:05,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1583202.0, ans=0.2 2023-06-26 13:55:27,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-26 13:56:08,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1583382.0, ans=0.125 2023-06-26 13:56:16,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1583382.0, ans=0.125 2023-06-26 13:56:18,951 INFO [train.py:996] (1/4) Epoch 9, batch 19950, loss[loss=0.1855, simple_loss=0.2618, pruned_loss=0.05466, over 21770.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2864, pruned_loss=0.06342, over 4265542.77 frames. ], batch size: 118, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:56:26,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1583442.0, ans=22.5 2023-06-26 13:56:38,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1583442.0, ans=0.125 2023-06-26 13:57:26,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1583562.0, ans=0.125 2023-06-26 13:57:50,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1583682.0, ans=0.125 2023-06-26 13:58:06,814 INFO [train.py:996] (1/4) Epoch 9, batch 20000, loss[loss=0.2109, simple_loss=0.283, pruned_loss=0.06936, over 21345.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2854, pruned_loss=0.06381, over 4252389.92 frames. 
], batch size: 159, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:58:16,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1583742.0, ans=22.5 2023-06-26 13:58:19,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.477e+02 4.524e+02 6.104e+02 8.785e+02 2.084e+03, threshold=1.221e+03, percent-clipped=7.0 2023-06-26 13:58:43,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-26 13:59:36,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.33 vs. limit=10.0 2023-06-26 13:59:56,238 INFO [train.py:996] (1/4) Epoch 9, batch 20050, loss[loss=0.2137, simple_loss=0.2862, pruned_loss=0.07057, over 21797.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2888, pruned_loss=0.06655, over 4269338.78 frames. ], batch size: 247, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 14:00:31,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1584102.0, ans=0.2 2023-06-26 14:00:35,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1584162.0, ans=0.0 2023-06-26 14:01:06,005 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-26 14:01:49,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=22.5 2023-06-26 14:01:51,567 INFO [train.py:996] (1/4) Epoch 9, batch 20100, loss[loss=0.209, simple_loss=0.2886, pruned_loss=0.06473, over 21338.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2908, pruned_loss=0.06825, over 4279973.35 frames. ], batch size: 159, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 14:02:00,507 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.881e+02 4.985e+02 7.812e+02 1.091e+03 2.146e+03, threshold=1.562e+03, percent-clipped=15.0 2023-06-26 14:02:51,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1584462.0, ans=0.125 2023-06-26 14:03:11,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1584522.0, ans=0.125 2023-06-26 14:03:43,024 INFO [train.py:996] (1/4) Epoch 9, batch 20150, loss[loss=0.2283, simple_loss=0.309, pruned_loss=0.07378, over 21786.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2985, pruned_loss=0.0715, over 4274109.18 frames. ], batch size: 247, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:04:05,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1584642.0, ans=0.0 2023-06-26 14:04:23,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1584702.0, ans=0.0 2023-06-26 14:04:55,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1584822.0, ans=0.2 2023-06-26 14:05:53,458 INFO [train.py:996] (1/4) Epoch 9, batch 20200, loss[loss=0.2321, simple_loss=0.3401, pruned_loss=0.062, over 19869.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3055, pruned_loss=0.07419, over 4270731.65 frames. 
], batch size: 702, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:06:02,514 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 6.140e+02 1.031e+03 1.445e+03 3.124e+03, threshold=2.061e+03, percent-clipped=23.0 2023-06-26 14:06:03,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584942.0, ans=0.1 2023-06-26 14:06:16,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1585002.0, ans=0.125 2023-06-26 14:06:17,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1585002.0, ans=0.015 2023-06-26 14:06:27,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.82 vs. limit=22.5 2023-06-26 14:06:47,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1585062.0, ans=0.125 2023-06-26 14:07:03,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1585122.0, ans=0.2 2023-06-26 14:07:43,934 INFO [train.py:996] (1/4) Epoch 9, batch 20250, loss[loss=0.205, simple_loss=0.2768, pruned_loss=0.0666, over 21189.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3072, pruned_loss=0.07302, over 4274289.91 frames. ], batch size: 143, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:07:53,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.79 vs. limit=15.0 2023-06-26 14:08:39,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1585422.0, ans=0.0 2023-06-26 14:09:26,862 INFO [train.py:996] (1/4) Epoch 9, batch 20300, loss[loss=0.2017, simple_loss=0.2785, pruned_loss=0.06249, over 21163.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3056, pruned_loss=0.07037, over 4273991.37 frames. ], batch size: 159, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:09:35,561 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.339e+02 4.853e+02 6.521e+02 1.002e+03 2.689e+03, threshold=1.304e+03, percent-clipped=1.0 2023-06-26 14:10:15,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1585662.0, ans=0.2 2023-06-26 14:10:24,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1585722.0, ans=0.2 2023-06-26 14:10:25,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-26 14:10:25,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1585722.0, ans=10.0 2023-06-26 14:10:31,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1585722.0, ans=0.125 2023-06-26 14:11:15,935 INFO [train.py:996] (1/4) Epoch 9, batch 20350, loss[loss=0.2459, simple_loss=0.3191, pruned_loss=0.08633, over 21900.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.305, pruned_loss=0.07019, over 4264700.55 frames. 
], batch size: 118, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:12:12,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1586022.0, ans=0.125 2023-06-26 14:12:14,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1586022.0, ans=0.125 2023-06-26 14:12:21,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-26 14:12:53,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1586082.0, ans=0.2 2023-06-26 14:13:04,276 INFO [train.py:996] (1/4) Epoch 9, batch 20400, loss[loss=0.2327, simple_loss=0.3131, pruned_loss=0.07619, over 21640.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3085, pruned_loss=0.07298, over 4258853.91 frames. ], batch size: 230, lr: 3.25e-03, grad_scale: 32.0 2023-06-26 14:13:08,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1586142.0, ans=0.0 2023-06-26 14:13:13,305 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.291e+02 5.756e+02 8.261e+02 1.227e+03 2.104e+03, threshold=1.652e+03, percent-clipped=22.0 2023-06-26 14:13:55,837 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:14:02,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1586322.0, ans=0.2 2023-06-26 14:14:19,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-26 14:14:37,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1586382.0, ans=0.0 2023-06-26 14:14:47,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1586382.0, ans=0.125 2023-06-26 14:14:48,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1586382.0, ans=0.95 2023-06-26 14:14:52,320 INFO [train.py:996] (1/4) Epoch 9, batch 20450, loss[loss=0.1857, simple_loss=0.2557, pruned_loss=0.05782, over 21089.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3081, pruned_loss=0.07474, over 4254319.51 frames. ], batch size: 608, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:14:53,453 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.27 vs. limit=10.0 2023-06-26 14:15:06,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1586442.0, ans=0.125 2023-06-26 14:15:09,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1586502.0, ans=0.125 2023-06-26 14:16:20,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-26 14:16:33,713 INFO [train.py:996] (1/4) Epoch 9, batch 20500, loss[loss=0.2346, simple_loss=0.2987, pruned_loss=0.08527, over 21809.00 frames. 
], tot_loss[loss=0.2269, simple_loss=0.3038, pruned_loss=0.07502, over 4253182.24 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:16:44,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.954e+02 5.491e+02 7.367e+02 1.069e+03 2.836e+03, threshold=1.473e+03, percent-clipped=8.0 2023-06-26 14:17:00,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1586802.0, ans=0.125 2023-06-26 14:17:02,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1586802.0, ans=0.125 2023-06-26 14:17:11,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.71 vs. limit=22.5 2023-06-26 14:17:12,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1586862.0, ans=0.0 2023-06-26 14:18:17,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1586982.0, ans=0.125 2023-06-26 14:18:21,699 INFO [train.py:996] (1/4) Epoch 9, batch 20550, loss[loss=0.2063, simple_loss=0.2893, pruned_loss=0.06165, over 21644.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2968, pruned_loss=0.0733, over 4245941.79 frames. ], batch size: 247, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:18:44,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1587102.0, ans=0.1 2023-06-26 14:19:27,255 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-26 14:19:38,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1587222.0, ans=0.125 2023-06-26 14:20:06,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1587282.0, ans=0.125 2023-06-26 14:20:09,329 INFO [train.py:996] (1/4) Epoch 9, batch 20600, loss[loss=0.2495, simple_loss=0.3245, pruned_loss=0.0873, over 21746.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2989, pruned_loss=0.07143, over 4236083.49 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:20:19,843 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.630e+02 4.988e+02 6.640e+02 9.393e+02 1.385e+03, threshold=1.328e+03, percent-clipped=0.0 2023-06-26 14:20:27,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1587402.0, ans=0.125 2023-06-26 14:21:56,977 INFO [train.py:996] (1/4) Epoch 9, batch 20650, loss[loss=0.2162, simple_loss=0.2892, pruned_loss=0.07156, over 17607.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2947, pruned_loss=0.07147, over 4223070.39 frames. 
], batch size: 60, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:22:52,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1587762.0, ans=0.125 2023-06-26 14:22:55,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1587822.0, ans=0.125 2023-06-26 14:23:47,398 INFO [train.py:996] (1/4) Epoch 9, batch 20700, loss[loss=0.2171, simple_loss=0.3215, pruned_loss=0.05628, over 20078.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2878, pruned_loss=0.06886, over 4235122.74 frames. ], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:23:58,482 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.316e+02 4.993e+02 7.910e+02 1.068e+03 1.993e+03, threshold=1.582e+03, percent-clipped=12.0 2023-06-26 14:24:20,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1588002.0, ans=0.2 2023-06-26 14:24:36,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1588062.0, ans=0.0 2023-06-26 14:24:42,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.42 vs. limit=15.0 2023-06-26 14:24:56,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1588122.0, ans=0.0 2023-06-26 14:25:18,401 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-26 14:25:38,411 INFO [train.py:996] (1/4) Epoch 9, batch 20750, loss[loss=0.2782, simple_loss=0.377, pruned_loss=0.08966, over 21655.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2907, pruned_loss=0.06812, over 4235848.48 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:25:39,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1588242.0, ans=0.125 2023-06-26 14:25:47,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1588242.0, ans=0.125 2023-06-26 14:26:00,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-26 14:26:08,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1588302.0, ans=0.125 2023-06-26 14:27:32,210 INFO [train.py:996] (1/4) Epoch 9, batch 20800, loss[loss=0.184, simple_loss=0.2579, pruned_loss=0.05504, over 21185.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2953, pruned_loss=0.06934, over 4244513.51 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 32.0 2023-06-26 14:27:42,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 6.315e+02 8.167e+02 1.529e+03 3.332e+03, threshold=1.633e+03, percent-clipped=23.0 2023-06-26 14:28:06,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1588662.0, ans=0.125 2023-06-26 14:28:14,940 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. 
limit=15.0 2023-06-26 14:28:24,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1588662.0, ans=0.0 2023-06-26 14:28:40,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1588722.0, ans=0.09899494936611666 2023-06-26 14:29:06,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1588782.0, ans=0.2 2023-06-26 14:29:19,944 INFO [train.py:996] (1/4) Epoch 9, batch 20850, loss[loss=0.1598, simple_loss=0.2375, pruned_loss=0.04109, over 21621.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2865, pruned_loss=0.06694, over 4245830.27 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:29:22,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1588842.0, ans=0.0 2023-06-26 14:29:24,157 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-26 14:29:28,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1588842.0, ans=0.2 2023-06-26 14:29:36,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-26 14:29:46,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1588902.0, ans=0.125 2023-06-26 14:30:03,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1588962.0, ans=0.1 2023-06-26 14:30:32,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1589022.0, ans=0.2 2023-06-26 14:30:38,838 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:31:08,538 INFO [train.py:996] (1/4) Epoch 9, batch 20900, loss[loss=0.2092, simple_loss=0.2864, pruned_loss=0.06603, over 21851.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2877, pruned_loss=0.0678, over 4257161.86 frames. ], batch size: 124, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:31:20,484 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.215e+02 4.594e+02 6.029e+02 1.010e+03 2.105e+03, threshold=1.206e+03, percent-clipped=4.0 2023-06-26 14:31:53,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1589262.0, ans=0.125 2023-06-26 14:32:13,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. 
limit=10.0 2023-06-26 14:32:19,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1589322.0, ans=0.1 2023-06-26 14:32:28,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1589322.0, ans=0.0 2023-06-26 14:32:45,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1589382.0, ans=0.1 2023-06-26 14:32:48,497 INFO [train.py:996] (1/4) Epoch 9, batch 20950, loss[loss=0.163, simple_loss=0.2433, pruned_loss=0.04132, over 21450.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2841, pruned_loss=0.06482, over 4247063.22 frames. ], batch size: 211, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:32:54,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1589442.0, ans=0.1 2023-06-26 14:32:56,643 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-26 14:32:59,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1589442.0, ans=0.0 2023-06-26 14:33:09,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1589502.0, ans=0.0 2023-06-26 14:33:16,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1589502.0, ans=10.0 2023-06-26 14:33:20,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1589502.0, ans=0.125 2023-06-26 14:33:40,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1589562.0, ans=0.0 2023-06-26 14:33:44,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1589622.0, ans=0.1 2023-06-26 14:34:36,214 INFO [train.py:996] (1/4) Epoch 9, batch 21000, loss[loss=0.2283, simple_loss=0.3126, pruned_loss=0.07206, over 21877.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2851, pruned_loss=0.06568, over 4254919.24 frames. ], batch size: 124, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:34:36,215 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 14:34:59,716 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2612, simple_loss=0.3587, pruned_loss=0.0819, over 1796401.00 frames. 2023-06-26 14:34:59,717 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 14:35:02,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1589742.0, ans=0.0 2023-06-26 14:35:10,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1589742.0, ans=0.125 2023-06-26 14:35:11,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.378e+02 4.892e+02 7.035e+02 1.069e+03 1.759e+03, threshold=1.407e+03, percent-clipped=17.0 2023-06-26 14:35:22,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.75 vs. 
limit=12.0 2023-06-26 14:36:17,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.32 vs. limit=15.0 2023-06-26 14:36:49,924 INFO [train.py:996] (1/4) Epoch 9, batch 21050, loss[loss=0.1546, simple_loss=0.2251, pruned_loss=0.04204, over 17484.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.282, pruned_loss=0.06547, over 4234196.52 frames. ], batch size: 67, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:37:48,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1590162.0, ans=0.125 2023-06-26 14:38:08,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1590222.0, ans=0.125 2023-06-26 14:38:36,807 INFO [train.py:996] (1/4) Epoch 9, batch 21100, loss[loss=0.2192, simple_loss=0.285, pruned_loss=0.07667, over 21982.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2785, pruned_loss=0.06479, over 4242264.96 frames. ], batch size: 103, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:38:42,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1590342.0, ans=0.1 2023-06-26 14:38:50,943 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.682e+02 5.080e+02 7.538e+02 1.007e+03 2.026e+03, threshold=1.508e+03, percent-clipped=9.0 2023-06-26 14:40:20,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1590582.0, ans=0.0 2023-06-26 14:40:22,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1590582.0, ans=0.1 2023-06-26 14:40:24,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1590642.0, ans=0.0 2023-06-26 14:40:25,043 INFO [train.py:996] (1/4) Epoch 9, batch 21150, loss[loss=0.183, simple_loss=0.232, pruned_loss=0.06697, over 20786.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2748, pruned_loss=0.06585, over 4247682.26 frames. ], batch size: 609, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:40:35,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1590642.0, ans=0.125 2023-06-26 14:40:40,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1590702.0, ans=0.0 2023-06-26 14:41:29,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1590822.0, ans=0.125 2023-06-26 14:41:30,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-26 14:41:40,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1590822.0, ans=0.1 2023-06-26 14:41:53,917 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:42:12,172 INFO [train.py:996] (1/4) Epoch 9, batch 21200, loss[loss=0.1814, simple_loss=0.2311, pruned_loss=0.06586, over 20299.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2706, pruned_loss=0.06455, over 4252589.31 frames. 
], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:42:12,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1590942.0, ans=0.125 2023-06-26 14:42:25,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-26 14:42:26,032 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.202e+02 4.962e+02 6.952e+02 8.758e+02 1.783e+03, threshold=1.390e+03, percent-clipped=2.0 2023-06-26 14:43:07,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1591062.0, ans=0.125 2023-06-26 14:43:25,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1591122.0, ans=0.0 2023-06-26 14:43:56,762 INFO [train.py:996] (1/4) Epoch 9, batch 21250, loss[loss=0.2477, simple_loss=0.3433, pruned_loss=0.07601, over 20760.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2691, pruned_loss=0.06474, over 4252462.93 frames. ], batch size: 609, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:44:49,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1591362.0, ans=0.0 2023-06-26 14:45:01,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1591422.0, ans=0.125 2023-06-26 14:45:33,955 INFO [train.py:996] (1/4) Epoch 9, batch 21300, loss[loss=0.1755, simple_loss=0.2452, pruned_loss=0.05293, over 16408.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2753, pruned_loss=0.06647, over 4248163.00 frames. ], batch size: 62, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:45:48,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-26 14:45:52,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.243e+02 5.570e+02 8.003e+02 1.129e+03 3.066e+03, threshold=1.601e+03, percent-clipped=15.0 2023-06-26 14:46:00,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0 2023-06-26 14:46:57,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1591722.0, ans=0.1 2023-06-26 14:47:23,329 INFO [train.py:996] (1/4) Epoch 9, batch 21350, loss[loss=0.1881, simple_loss=0.2842, pruned_loss=0.04604, over 21738.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2798, pruned_loss=0.06735, over 4259845.25 frames. 
], batch size: 351, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:47:23,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1591842.0, ans=0.0 2023-06-26 14:47:34,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1591842.0, ans=0.2 2023-06-26 14:47:52,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1591902.0, ans=0.125 2023-06-26 14:47:58,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1591902.0, ans=0.1 2023-06-26 14:48:41,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1592022.0, ans=0.125 2023-06-26 14:48:56,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1592082.0, ans=0.125 2023-06-26 14:49:12,005 INFO [train.py:996] (1/4) Epoch 9, batch 21400, loss[loss=0.2237, simple_loss=0.3116, pruned_loss=0.06792, over 21636.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2831, pruned_loss=0.06698, over 4267528.34 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:49:25,997 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.706e+02 6.583e+02 9.880e+02 2.077e+03, threshold=1.317e+03, percent-clipped=4.0 2023-06-26 14:49:30,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-26 14:49:33,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1592202.0, ans=0.125 2023-06-26 14:49:46,020 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-26 14:49:54,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1592262.0, ans=0.125 2023-06-26 14:49:54,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1592262.0, ans=0.2 2023-06-26 14:50:55,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1592382.0, ans=0.1 2023-06-26 14:51:00,486 INFO [train.py:996] (1/4) Epoch 9, batch 21450, loss[loss=0.2176, simple_loss=0.2765, pruned_loss=0.07936, over 20081.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2866, pruned_loss=0.06817, over 4273715.76 frames. ], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:51:27,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1592502.0, ans=0.125 2023-06-26 14:51:48,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1592562.0, ans=0.0 2023-06-26 14:52:14,196 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.38 vs. 
limit=22.5 2023-06-26 14:52:23,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1592622.0, ans=0.125 2023-06-26 14:52:26,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1592682.0, ans=0.125 2023-06-26 14:52:43,809 INFO [train.py:996] (1/4) Epoch 9, batch 21500, loss[loss=0.1857, simple_loss=0.2547, pruned_loss=0.05838, over 21712.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2851, pruned_loss=0.06968, over 4277686.49 frames. ], batch size: 316, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:53:03,330 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 5.893e+02 8.169e+02 1.189e+03 2.218e+03, threshold=1.634e+03, percent-clipped=19.0 2023-06-26 14:53:05,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1592802.0, ans=0.1 2023-06-26 14:53:09,902 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-06-26 14:54:17,019 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:54:22,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1592982.0, ans=0.2 2023-06-26 14:54:32,470 INFO [train.py:996] (1/4) Epoch 9, batch 21550, loss[loss=0.1671, simple_loss=0.238, pruned_loss=0.04812, over 21360.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2792, pruned_loss=0.067, over 4273053.46 frames. ], batch size: 131, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:54:42,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1593042.0, ans=0.125 2023-06-26 14:54:57,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0 2023-06-26 14:55:01,217 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.34 vs. limit=12.0 2023-06-26 14:55:04,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1593102.0, ans=0.2 2023-06-26 14:55:33,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1593162.0, ans=0.125 2023-06-26 14:56:05,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-26 14:56:26,271 INFO [train.py:996] (1/4) Epoch 9, batch 21600, loss[loss=0.2053, simple_loss=0.2739, pruned_loss=0.06838, over 21991.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2753, pruned_loss=0.06598, over 4264933.11 frames. 
], batch size: 103, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:56:43,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1593342.0, ans=0.125 2023-06-26 14:56:44,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1593342.0, ans=0.07 2023-06-26 14:56:53,155 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.217e+02 4.926e+02 7.373e+02 9.794e+02 2.336e+03, threshold=1.475e+03, percent-clipped=12.0 2023-06-26 14:56:56,307 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-26 14:56:57,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593402.0, ans=0.1 2023-06-26 14:56:58,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-26 14:57:36,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1593462.0, ans=0.125 2023-06-26 14:57:38,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1593462.0, ans=0.0 2023-06-26 14:58:03,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593582.0, ans=0.1 2023-06-26 14:58:10,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1593582.0, ans=0.125 2023-06-26 14:58:12,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-26 14:58:15,066 INFO [train.py:996] (1/4) Epoch 9, batch 21650, loss[loss=0.1867, simple_loss=0.2735, pruned_loss=0.04992, over 21757.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2805, pruned_loss=0.06524, over 4266050.14 frames. ], batch size: 112, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:58:51,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593702.0, ans=0.1 2023-06-26 14:59:08,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1593762.0, ans=0.2 2023-06-26 14:59:32,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593822.0, ans=0.1 2023-06-26 14:59:46,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1593882.0, ans=0.0 2023-06-26 14:59:46,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1593882.0, ans=0.0 2023-06-26 15:00:01,540 INFO [train.py:996] (1/4) Epoch 9, batch 21700, loss[loss=0.1902, simple_loss=0.2646, pruned_loss=0.05792, over 21350.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2812, pruned_loss=0.06388, over 4267898.21 frames. 
], batch size: 131, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 15:00:22,136 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.421e+02 4.737e+02 7.563e+02 1.159e+03 3.422e+03, threshold=1.513e+03, percent-clipped=14.0 2023-06-26 15:00:24,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1594002.0, ans=0.2 2023-06-26 15:00:27,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1594002.0, ans=0.0 2023-06-26 15:01:44,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1594182.0, ans=0.0 2023-06-26 15:01:47,599 INFO [train.py:996] (1/4) Epoch 9, batch 21750, loss[loss=0.1868, simple_loss=0.255, pruned_loss=0.05926, over 21442.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2772, pruned_loss=0.06386, over 4262420.56 frames. ], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:02:26,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=12.0 2023-06-26 15:02:39,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1594362.0, ans=0.125 2023-06-26 15:02:48,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1594362.0, ans=0.05 2023-06-26 15:02:50,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1594362.0, ans=0.125 2023-06-26 15:03:03,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-26 15:03:30,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1594482.0, ans=0.125 2023-06-26 15:03:38,351 INFO [train.py:996] (1/4) Epoch 9, batch 21800, loss[loss=0.2163, simple_loss=0.3045, pruned_loss=0.06408, over 21729.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2734, pruned_loss=0.06418, over 4269116.42 frames. ], batch size: 333, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:04:04,208 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.457e+02 4.843e+02 6.619e+02 9.442e+02 2.103e+03, threshold=1.324e+03, percent-clipped=2.0 2023-06-26 15:04:20,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-06-26 15:05:06,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1594782.0, ans=0.09899494936611666 2023-06-26 15:05:25,817 INFO [train.py:996] (1/4) Epoch 9, batch 21850, loss[loss=0.2218, simple_loss=0.3126, pruned_loss=0.06548, over 21577.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2792, pruned_loss=0.06423, over 4258395.28 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:05:41,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1594842.0, ans=0.125 2023-06-26 15:07:12,432 INFO [train.py:996] (1/4) Epoch 9, batch 21900, loss[loss=0.1996, simple_loss=0.2703, pruned_loss=0.06442, over 21744.00 frames. 
], tot_loss[loss=0.2054, simple_loss=0.2803, pruned_loss=0.06528, over 4269573.61 frames. ], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:07:38,241 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 4.571e+02 6.004e+02 8.081e+02 1.811e+03, threshold=1.201e+03, percent-clipped=9.0 2023-06-26 15:07:47,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1595202.0, ans=0.125 2023-06-26 15:07:55,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1595262.0, ans=0.0 2023-06-26 15:08:02,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1595262.0, ans=0.2 2023-06-26 15:08:02,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1595262.0, ans=0.125 2023-06-26 15:08:25,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1595322.0, ans=0.2 2023-06-26 15:09:04,623 INFO [train.py:996] (1/4) Epoch 9, batch 21950, loss[loss=0.1585, simple_loss=0.2452, pruned_loss=0.0359, over 21751.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2751, pruned_loss=0.06443, over 4272582.78 frames. ], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:10:54,230 INFO [train.py:996] (1/4) Epoch 9, batch 22000, loss[loss=0.1484, simple_loss=0.2464, pruned_loss=0.02517, over 20814.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2686, pruned_loss=0.06029, over 4262548.14 frames. ], batch size: 608, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:11:15,750 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 4.453e+02 7.165e+02 9.999e+02 1.931e+03, threshold=1.433e+03, percent-clipped=13.0 2023-06-26 15:12:49,922 INFO [train.py:996] (1/4) Epoch 9, batch 22050, loss[loss=0.3063, simple_loss=0.3783, pruned_loss=0.1171, over 21472.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.273, pruned_loss=0.06226, over 4264019.71 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:13:12,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.06 vs. limit=6.0 2023-06-26 15:13:52,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1596222.0, ans=0.0 2023-06-26 15:13:54,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1596222.0, ans=0.125 2023-06-26 15:14:13,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1596282.0, ans=0.0 2023-06-26 15:14:38,917 INFO [train.py:996] (1/4) Epoch 9, batch 22100, loss[loss=0.1812, simple_loss=0.256, pruned_loss=0.0532, over 16326.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2839, pruned_loss=0.06704, over 4264729.43 frames. ], batch size: 61, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:14:56,635 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.930e+02 6.282e+02 9.612e+02 1.455e+03 3.538e+03, threshold=1.922e+03, percent-clipped=29.0 2023-06-26 15:15:03,442 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.99 vs. 
limit=22.5 2023-06-26 15:15:41,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1596522.0, ans=0.2 2023-06-26 15:15:59,830 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 15:16:04,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1596582.0, ans=0.0 2023-06-26 15:16:25,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1596642.0, ans=0.0 2023-06-26 15:16:26,363 INFO [train.py:996] (1/4) Epoch 9, batch 22150, loss[loss=0.2088, simple_loss=0.2849, pruned_loss=0.06638, over 21849.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2866, pruned_loss=0.06843, over 4275788.04 frames. ], batch size: 351, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:16:30,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-06-26 15:16:48,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1596702.0, ans=0.0 2023-06-26 15:17:13,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1596762.0, ans=0.125 2023-06-26 15:17:18,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1596762.0, ans=0.125 2023-06-26 15:17:22,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1596762.0, ans=0.125 2023-06-26 15:17:43,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1596822.0, ans=0.0 2023-06-26 15:17:47,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1596882.0, ans=0.035 2023-06-26 15:17:50,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1596882.0, ans=0.5 2023-06-26 15:17:50,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1596882.0, ans=0.0 2023-06-26 15:18:02,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-26 15:18:14,955 INFO [train.py:996] (1/4) Epoch 9, batch 22200, loss[loss=0.2832, simple_loss=0.3809, pruned_loss=0.09278, over 19936.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2886, pruned_loss=0.06959, over 4274662.78 frames. 
], batch size: 702, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:18:32,777 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.802e+02 5.060e+02 7.082e+02 1.053e+03 2.242e+03, threshold=1.416e+03, percent-clipped=3.0 2023-06-26 15:19:18,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1597122.0, ans=0.0 2023-06-26 15:19:22,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1597122.0, ans=0.0 2023-06-26 15:20:00,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1597182.0, ans=0.2 2023-06-26 15:20:04,042 INFO [train.py:996] (1/4) Epoch 9, batch 22250, loss[loss=0.2458, simple_loss=0.321, pruned_loss=0.08524, over 21356.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2953, pruned_loss=0.07098, over 4269308.85 frames. ], batch size: 176, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:21:51,047 INFO [train.py:996] (1/4) Epoch 9, batch 22300, loss[loss=0.2018, simple_loss=0.2698, pruned_loss=0.06696, over 21677.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2974, pruned_loss=0.07245, over 4267109.05 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:22:08,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.561e+02 5.384e+02 7.516e+02 1.079e+03 3.010e+03, threshold=1.503e+03, percent-clipped=16.0 2023-06-26 15:22:30,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-26 15:23:04,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=12.0 2023-06-26 15:23:08,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1597722.0, ans=0.0 2023-06-26 15:23:09,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1597782.0, ans=0.125 2023-06-26 15:23:33,877 INFO [train.py:996] (1/4) Epoch 9, batch 22350, loss[loss=0.1986, simple_loss=0.2709, pruned_loss=0.06313, over 21672.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.295, pruned_loss=0.0726, over 4275162.78 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:23:47,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-26 15:23:55,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1597902.0, ans=0.0 2023-06-26 15:24:14,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1597902.0, ans=0.1 2023-06-26 15:25:21,635 INFO [train.py:996] (1/4) Epoch 9, batch 22400, loss[loss=0.1798, simple_loss=0.2499, pruned_loss=0.05484, over 21776.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2916, pruned_loss=0.06992, over 4278532.17 frames. 
], batch size: 102, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:25:22,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1598142.0, ans=0.05 2023-06-26 15:25:49,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.723e+02 5.104e+02 6.690e+02 9.796e+02 2.008e+03, threshold=1.338e+03, percent-clipped=2.0 2023-06-26 15:25:49,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1598202.0, ans=0.125 2023-06-26 15:26:06,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1598262.0, ans=0.07 2023-06-26 15:26:44,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1598322.0, ans=0.125 2023-06-26 15:26:51,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1598382.0, ans=0.125 2023-06-26 15:27:07,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1598442.0, ans=0.125 2023-06-26 15:27:14,428 INFO [train.py:996] (1/4) Epoch 9, batch 22450, loss[loss=0.1909, simple_loss=0.2514, pruned_loss=0.06518, over 21350.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2872, pruned_loss=0.0698, over 4271936.19 frames. ], batch size: 177, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:29:02,868 INFO [train.py:996] (1/4) Epoch 9, batch 22500, loss[loss=0.1923, simple_loss=0.266, pruned_loss=0.05932, over 21413.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2824, pruned_loss=0.06957, over 4263950.11 frames. ], batch size: 211, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:29:21,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1598742.0, ans=0.0 2023-06-26 15:29:22,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1598742.0, ans=0.0 2023-06-26 15:29:26,964 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 5.166e+02 7.858e+02 1.138e+03 3.264e+03, threshold=1.572e+03, percent-clipped=12.0 2023-06-26 15:29:29,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1598802.0, ans=0.0 2023-06-26 15:29:31,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1598802.0, ans=0.125 2023-06-26 15:29:38,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1598802.0, ans=0.125 2023-06-26 15:30:57,517 INFO [train.py:996] (1/4) Epoch 9, batch 22550, loss[loss=0.2305, simple_loss=0.3067, pruned_loss=0.07721, over 21509.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.286, pruned_loss=0.0696, over 4272378.87 frames. ], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:30:58,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.15 vs. 
limit=22.5 2023-06-26 15:31:05,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1599042.0, ans=0.1 2023-06-26 15:32:37,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1599282.0, ans=0.0 2023-06-26 15:32:49,141 INFO [train.py:996] (1/4) Epoch 9, batch 22600, loss[loss=0.1745, simple_loss=0.2387, pruned_loss=0.05517, over 21196.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2893, pruned_loss=0.0694, over 4274816.47 frames. ], batch size: 143, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:33:08,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.130e+02 6.886e+02 1.082e+03 1.570e+03 3.521e+03, threshold=2.164e+03, percent-clipped=24.0 2023-06-26 15:33:47,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1599462.0, ans=0.125 2023-06-26 15:34:20,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1599582.0, ans=0.125 2023-06-26 15:34:27,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1599582.0, ans=0.2 2023-06-26 15:34:37,816 INFO [train.py:996] (1/4) Epoch 9, batch 22650, loss[loss=0.1905, simple_loss=0.2557, pruned_loss=0.06268, over 21767.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2868, pruned_loss=0.06925, over 4283707.63 frames. ], batch size: 300, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:35:51,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=22.5 2023-06-26 15:35:56,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1599822.0, ans=0.125 2023-06-26 15:35:56,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1599822.0, ans=0.0 2023-06-26 15:36:08,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1599882.0, ans=0.1 2023-06-26 15:36:13,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1599882.0, ans=0.0 2023-06-26 15:36:24,813 INFO [train.py:996] (1/4) Epoch 9, batch 22700, loss[loss=0.2081, simple_loss=0.2694, pruned_loss=0.07339, over 14846.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2805, pruned_loss=0.06875, over 4281131.05 frames. ], batch size: 60, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:36:34,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1599942.0, ans=0.2 2023-06-26 15:36:44,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.472e+02 5.506e+02 7.412e+02 1.059e+03 2.032e+03, threshold=1.482e+03, percent-clipped=0.0 2023-06-26 15:36:59,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1600002.0, ans=0.125 2023-06-26 15:37:01,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.81 vs. 
limit=15.0 2023-06-26 15:37:01,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-26 15:37:50,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1600122.0, ans=0.025 2023-06-26 15:38:07,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1600182.0, ans=0.035 2023-06-26 15:38:09,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1600182.0, ans=0.125 2023-06-26 15:38:13,887 INFO [train.py:996] (1/4) Epoch 9, batch 22750, loss[loss=0.2297, simple_loss=0.2988, pruned_loss=0.08027, over 21506.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2812, pruned_loss=0.06978, over 4276117.06 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:39:45,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1600482.0, ans=0.125 2023-06-26 15:39:52,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1600482.0, ans=0.125 2023-06-26 15:40:01,404 INFO [train.py:996] (1/4) Epoch 9, batch 22800, loss[loss=0.1938, simple_loss=0.2649, pruned_loss=0.06135, over 21765.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2862, pruned_loss=0.07184, over 4269997.35 frames. ], batch size: 282, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:40:28,035 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.571e+02 5.339e+02 7.756e+02 1.140e+03 2.355e+03, threshold=1.551e+03, percent-clipped=14.0 2023-06-26 15:40:33,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1600602.0, ans=0.125 2023-06-26 15:41:21,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1600722.0, ans=0.125 2023-06-26 15:41:49,492 INFO [train.py:996] (1/4) Epoch 9, batch 22850, loss[loss=0.2014, simple_loss=0.2659, pruned_loss=0.0685, over 21659.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.284, pruned_loss=0.07125, over 4277671.35 frames. ], batch size: 282, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:42:22,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1600902.0, ans=0.125 2023-06-26 15:42:34,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1600962.0, ans=0.125 2023-06-26 15:42:35,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.40 vs. limit=15.0 2023-06-26 15:42:42,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-26 15:42:49,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. 
limit=15.0 2023-06-26 15:43:20,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1601082.0, ans=0.125 2023-06-26 15:43:25,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1601082.0, ans=0.125 2023-06-26 15:43:33,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1601082.0, ans=0.125 2023-06-26 15:43:37,590 INFO [train.py:996] (1/4) Epoch 9, batch 22900, loss[loss=0.2031, simple_loss=0.2848, pruned_loss=0.06067, over 21285.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2849, pruned_loss=0.07047, over 4277874.70 frames. ], batch size: 176, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:43:41,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1601142.0, ans=0.125 2023-06-26 15:44:04,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.622e+02 6.448e+02 8.997e+02 1.321e+03 2.993e+03, threshold=1.799e+03, percent-clipped=19.0 2023-06-26 15:44:34,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0 2023-06-26 15:44:55,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1601322.0, ans=10.0 2023-06-26 15:44:57,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1601322.0, ans=0.1 2023-06-26 15:45:28,329 INFO [train.py:996] (1/4) Epoch 9, batch 22950, loss[loss=0.2721, simple_loss=0.3939, pruned_loss=0.07517, over 21648.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2972, pruned_loss=0.06915, over 4283366.22 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:45:35,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1601442.0, ans=0.0 2023-06-26 15:45:43,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-26 15:45:46,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1601442.0, ans=0.125 2023-06-26 15:46:03,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1601502.0, ans=0.0 2023-06-26 15:47:08,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1601682.0, ans=0.125 2023-06-26 15:47:10,753 INFO [train.py:996] (1/4) Epoch 9, batch 23000, loss[loss=0.2136, simple_loss=0.2864, pruned_loss=0.07041, over 21912.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2985, pruned_loss=0.06733, over 4281655.28 frames. 
], batch size: 351, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:47:36,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1601742.0, ans=0.0 2023-06-26 15:47:42,629 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.591e+02 4.526e+02 6.178e+02 9.113e+02 2.510e+03, threshold=1.236e+03, percent-clipped=4.0 2023-06-26 15:49:11,850 INFO [train.py:996] (1/4) Epoch 9, batch 23050, loss[loss=0.2204, simple_loss=0.2953, pruned_loss=0.07281, over 21617.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2984, pruned_loss=0.06915, over 4279195.79 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:49:36,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1602102.0, ans=0.0 2023-06-26 15:50:13,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1602222.0, ans=0.125 2023-06-26 15:50:15,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1602222.0, ans=0.125 2023-06-26 15:50:22,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1602222.0, ans=0.1 2023-06-26 15:50:55,766 INFO [train.py:996] (1/4) Epoch 9, batch 23100, loss[loss=0.1828, simple_loss=0.2518, pruned_loss=0.05687, over 21805.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2948, pruned_loss=0.07012, over 4281506.17 frames. ], batch size: 317, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:51:22,037 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.578e+02 4.836e+02 6.120e+02 9.547e+02 2.287e+03, threshold=1.224e+03, percent-clipped=14.0 2023-06-26 15:52:44,512 INFO [train.py:996] (1/4) Epoch 9, batch 23150, loss[loss=0.2325, simple_loss=0.3104, pruned_loss=0.07727, over 21504.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2907, pruned_loss=0.06978, over 4278461.93 frames. ], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:53:07,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1602702.0, ans=0.125 2023-06-26 15:53:40,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=12.0 2023-06-26 15:53:47,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1602822.0, ans=0.0 2023-06-26 15:54:07,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1602882.0, ans=0.125 2023-06-26 15:54:14,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1602882.0, ans=0.125 2023-06-26 15:54:17,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1602882.0, ans=0.125 2023-06-26 15:54:25,643 INFO [train.py:996] (1/4) Epoch 9, batch 23200, loss[loss=0.2033, simple_loss=0.2729, pruned_loss=0.0668, over 21712.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2892, pruned_loss=0.07024, over 4281659.86 frames. 
], batch size: 263, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:54:42,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1602942.0, ans=0.1 2023-06-26 15:54:57,768 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.454e+02 5.096e+02 6.731e+02 1.055e+03 2.311e+03, threshold=1.346e+03, percent-clipped=14.0 2023-06-26 15:55:08,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1603002.0, ans=0.125 2023-06-26 15:55:41,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1603122.0, ans=0.0 2023-06-26 15:56:14,184 INFO [train.py:996] (1/4) Epoch 9, batch 23250, loss[loss=0.2147, simple_loss=0.2902, pruned_loss=0.06964, over 21950.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2886, pruned_loss=0.07072, over 4292086.78 frames. ], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:57:01,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1603302.0, ans=0.0 2023-06-26 15:57:55,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-26 15:58:08,882 INFO [train.py:996] (1/4) Epoch 9, batch 23300, loss[loss=0.2221, simple_loss=0.3243, pruned_loss=0.05997, over 21298.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2944, pruned_loss=0.07209, over 4298024.18 frames. ], batch size: 176, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:58:24,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-26 15:58:37,850 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 6.000e+02 9.033e+02 1.405e+03 3.617e+03, threshold=1.807e+03, percent-clipped=26.0 2023-06-26 15:58:47,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-26 16:00:05,621 INFO [train.py:996] (1/4) Epoch 9, batch 23350, loss[loss=0.2414, simple_loss=0.3198, pruned_loss=0.08152, over 21476.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2991, pruned_loss=0.07131, over 4296330.66 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 16:00:30,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1603902.0, ans=0.125 2023-06-26 16:01:32,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1604082.0, ans=0.125 2023-06-26 16:01:52,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1604142.0, ans=0.1 2023-06-26 16:01:53,511 INFO [train.py:996] (1/4) Epoch 9, batch 23400, loss[loss=0.1956, simple_loss=0.2729, pruned_loss=0.0591, over 21918.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2929, pruned_loss=0.06802, over 4283303.23 frames. 
], batch size: 316, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:02:21,756 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.225e+02 5.479e+02 7.119e+02 1.024e+03 2.077e+03, threshold=1.424e+03, percent-clipped=2.0 2023-06-26 16:02:33,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1604262.0, ans=0.125 2023-06-26 16:02:43,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1604262.0, ans=0.1 2023-06-26 16:03:14,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1604322.0, ans=0.125 2023-06-26 16:03:15,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1604322.0, ans=0.0 2023-06-26 16:03:47,227 INFO [train.py:996] (1/4) Epoch 9, batch 23450, loss[loss=0.2296, simple_loss=0.2948, pruned_loss=0.08218, over 21447.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2927, pruned_loss=0.07014, over 4287283.97 frames. ], batch size: 548, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:04:12,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5 2023-06-26 16:05:18,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=12.0 2023-06-26 16:05:24,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1604682.0, ans=0.125 2023-06-26 16:05:28,871 INFO [train.py:996] (1/4) Epoch 9, batch 23500, loss[loss=0.2214, simple_loss=0.2904, pruned_loss=0.07624, over 21889.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2934, pruned_loss=0.07136, over 4296125.94 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:05:53,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1604802.0, ans=0.1 2023-06-26 16:05:56,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.762e+02 6.548e+02 9.049e+02 1.310e+03 3.325e+03, threshold=1.810e+03, percent-clipped=21.0 2023-06-26 16:06:41,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1604922.0, ans=0.0 2023-06-26 16:06:58,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1604982.0, ans=0.2 2023-06-26 16:07:09,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1604982.0, ans=0.125 2023-06-26 16:07:15,822 INFO [train.py:996] (1/4) Epoch 9, batch 23550, loss[loss=0.1927, simple_loss=0.2583, pruned_loss=0.06352, over 21690.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2884, pruned_loss=0.07104, over 4300686.30 frames. 
], batch size: 333, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:07:36,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1605102.0, ans=0.125 2023-06-26 16:07:48,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1605102.0, ans=0.125 2023-06-26 16:08:36,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1605222.0, ans=0.0 2023-06-26 16:08:40,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1605222.0, ans=0.1 2023-06-26 16:08:45,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1605282.0, ans=0.0 2023-06-26 16:09:04,143 INFO [train.py:996] (1/4) Epoch 9, batch 23600, loss[loss=0.2499, simple_loss=0.3335, pruned_loss=0.08313, over 21835.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2902, pruned_loss=0.07143, over 4287541.04 frames. ], batch size: 118, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:09:30,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-26 16:09:32,653 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.846e+02 5.348e+02 7.374e+02 1.134e+03 2.536e+03, threshold=1.475e+03, percent-clipped=3.0 2023-06-26 16:10:27,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1605522.0, ans=0.0 2023-06-26 16:10:42,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1605582.0, ans=0.125 2023-06-26 16:10:44,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1605582.0, ans=0.0 2023-06-26 16:10:55,357 INFO [train.py:996] (1/4) Epoch 9, batch 23650, loss[loss=0.2289, simple_loss=0.3179, pruned_loss=0.06989, over 21419.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2909, pruned_loss=0.06999, over 4288712.26 frames. ], batch size: 131, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:11:13,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1605702.0, ans=0.1 2023-06-26 16:11:35,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1605702.0, ans=0.0 2023-06-26 16:11:49,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1605762.0, ans=0.125 2023-06-26 16:12:37,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1605882.0, ans=0.125 2023-06-26 16:12:43,721 INFO [train.py:996] (1/4) Epoch 9, batch 23700, loss[loss=0.19, simple_loss=0.2715, pruned_loss=0.0543, over 21365.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2936, pruned_loss=0.07035, over 4285622.69 frames. 
], batch size: 211, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:12:49,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1605942.0, ans=0.1 2023-06-26 16:13:03,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1605942.0, ans=0.125 2023-06-26 16:13:18,786 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.291e+02 4.703e+02 6.208e+02 8.925e+02 2.253e+03, threshold=1.242e+03, percent-clipped=5.0 2023-06-26 16:14:08,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1606122.0, ans=0.125 2023-06-26 16:14:27,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1606182.0, ans=0.0 2023-06-26 16:14:33,484 INFO [train.py:996] (1/4) Epoch 9, batch 23750, loss[loss=0.2022, simple_loss=0.3032, pruned_loss=0.05059, over 21695.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2957, pruned_loss=0.07048, over 4282135.27 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:14:53,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1606242.0, ans=0.125 2023-06-26 16:14:55,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-26 16:15:11,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1606302.0, ans=0.0 2023-06-26 16:15:23,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1606362.0, ans=0.07 2023-06-26 16:15:58,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1606422.0, ans=0.125 2023-06-26 16:16:01,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1606422.0, ans=0.1 2023-06-26 16:16:14,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1606482.0, ans=0.125 2023-06-26 16:16:24,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1606482.0, ans=0.1 2023-06-26 16:16:27,406 INFO [train.py:996] (1/4) Epoch 9, batch 23800, loss[loss=0.2077, simple_loss=0.2814, pruned_loss=0.06698, over 21762.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2925, pruned_loss=0.06788, over 4279519.07 frames. 
], batch size: 112, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:16:37,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1606542.0, ans=0.0 2023-06-26 16:16:42,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1606542.0, ans=0.0 2023-06-26 16:16:59,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1606602.0, ans=0.0 2023-06-26 16:17:01,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1606602.0, ans=0.125 2023-06-26 16:17:04,096 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 5.235e+02 7.849e+02 1.092e+03 2.188e+03, threshold=1.570e+03, percent-clipped=19.0 2023-06-26 16:17:07,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1606602.0, ans=0.125 2023-06-26 16:18:16,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1606782.0, ans=0.04949747468305833 2023-06-26 16:18:29,084 INFO [train.py:996] (1/4) Epoch 9, batch 23850, loss[loss=0.2354, simple_loss=0.3091, pruned_loss=0.08085, over 21335.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3021, pruned_loss=0.07058, over 4277439.60 frames. ], batch size: 159, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:18:31,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1606842.0, ans=0.125 2023-06-26 16:18:42,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1606842.0, ans=0.1 2023-06-26 16:19:02,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1606902.0, ans=0.0 2023-06-26 16:20:08,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-26 16:20:08,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.06 vs. limit=10.0 2023-06-26 16:20:16,508 INFO [train.py:996] (1/4) Epoch 9, batch 23900, loss[loss=0.2237, simple_loss=0.3038, pruned_loss=0.07178, over 21750.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3088, pruned_loss=0.07204, over 4276300.20 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:20:45,481 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.978e+02 7.206e+02 9.928e+02 1.468e+03 4.059e+03, threshold=1.986e+03, percent-clipped=20.0 2023-06-26 16:20:59,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1607262.0, ans=0.125 2023-06-26 16:21:13,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-26 16:21:24,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1607322.0, ans=0.2 2023-06-26 16:22:02,557 INFO [train.py:996] (1/4) Epoch 9, batch 23950, loss[loss=0.206, simple_loss=0.2586, pruned_loss=0.07674, over 20062.00 frames. 
], tot_loss[loss=0.223, simple_loss=0.3026, pruned_loss=0.07167, over 4272991.64 frames. ], batch size: 702, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:23:43,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=12.0 2023-06-26 16:23:48,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-26 16:23:50,727 INFO [train.py:996] (1/4) Epoch 9, batch 24000, loss[loss=0.2546, simple_loss=0.3286, pruned_loss=0.09029, over 21362.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3041, pruned_loss=0.07488, over 4272572.55 frames. ], batch size: 159, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:23:50,728 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 16:24:08,191 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.8422, 4.4407, 4.5488, 3.6481], device='cuda:1') 2023-06-26 16:24:10,700 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2632, simple_loss=0.3589, pruned_loss=0.0837, over 1796401.00 frames. 2023-06-26 16:24:10,700 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 16:24:36,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.989e+02 5.715e+02 7.802e+02 1.213e+03 2.324e+03, threshold=1.560e+03, percent-clipped=4.0 2023-06-26 16:25:57,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1607982.0, ans=0.2 2023-06-26 16:26:00,888 INFO [train.py:996] (1/4) Epoch 9, batch 24050, loss[loss=0.189, simple_loss=0.2752, pruned_loss=0.0514, over 21284.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3041, pruned_loss=0.07439, over 4272503.21 frames. ], batch size: 176, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:26:22,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1608102.0, ans=0.0 2023-06-26 16:27:15,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1608222.0, ans=0.125 2023-06-26 16:27:38,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1608282.0, ans=0.1 2023-06-26 16:27:49,845 INFO [train.py:996] (1/4) Epoch 9, batch 24100, loss[loss=0.2322, simple_loss=0.3167, pruned_loss=0.0739, over 21576.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3041, pruned_loss=0.07242, over 4268087.68 frames. ], batch size: 263, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:28:27,566 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.963e+02 5.200e+02 7.145e+02 1.046e+03 2.381e+03, threshold=1.429e+03, percent-clipped=3.0 2023-06-26 16:28:34,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1608462.0, ans=0.0 2023-06-26 16:29:27,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1608582.0, ans=0.0 2023-06-26 16:29:28,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1608582.0, ans=0.125 2023-06-26 16:29:39,186 INFO [train.py:996] (1/4) Epoch 9, batch 24150, loss[loss=0.241, simple_loss=0.3076, pruned_loss=0.0872, over 21307.00 frames. 
], tot_loss[loss=0.2268, simple_loss=0.3044, pruned_loss=0.07455, over 4277378.29 frames. ], batch size: 176, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:31:28,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1608942.0, ans=0.125 2023-06-26 16:31:29,815 INFO [train.py:996] (1/4) Epoch 9, batch 24200, loss[loss=0.2528, simple_loss=0.3357, pruned_loss=0.0849, over 21738.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.306, pruned_loss=0.07585, over 4277133.09 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:31:52,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-26 16:32:01,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1609002.0, ans=0.2 2023-06-26 16:32:12,919 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.636e+02 5.699e+02 8.049e+02 1.259e+03 2.421e+03, threshold=1.610e+03, percent-clipped=17.0 2023-06-26 16:32:45,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-26 16:33:31,012 INFO [train.py:996] (1/4) Epoch 9, batch 24250, loss[loss=0.1794, simple_loss=0.2819, pruned_loss=0.0384, over 21651.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3036, pruned_loss=0.0699, over 4277632.42 frames. ], batch size: 263, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:34:31,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-26 16:35:18,866 INFO [train.py:996] (1/4) Epoch 9, batch 24300, loss[loss=0.1349, simple_loss=0.221, pruned_loss=0.02445, over 21686.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2976, pruned_loss=0.06515, over 4273487.47 frames. ], batch size: 247, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:35:50,233 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.136e+02 4.084e+02 7.233e+02 1.324e+03 4.143e+03, threshold=1.447e+03, percent-clipped=16.0 2023-06-26 16:35:57,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1609662.0, ans=0.1 2023-06-26 16:37:04,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1609782.0, ans=0.2 2023-06-26 16:37:07,350 INFO [train.py:996] (1/4) Epoch 9, batch 24350, loss[loss=0.2391, simple_loss=0.311, pruned_loss=0.08359, over 21511.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2928, pruned_loss=0.06496, over 4278894.50 frames. ], batch size: 548, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:37:24,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=12.0 2023-06-26 16:38:20,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1610022.0, ans=0.125 2023-06-26 16:39:02,416 INFO [train.py:996] (1/4) Epoch 9, batch 24400, loss[loss=0.2282, simple_loss=0.3042, pruned_loss=0.07614, over 21889.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2983, pruned_loss=0.06767, over 4276275.33 frames. 
], batch size: 373, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:39:06,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1610142.0, ans=0.0 2023-06-26 16:39:34,017 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.576e+02 5.148e+02 6.716e+02 1.029e+03 2.743e+03, threshold=1.343e+03, percent-clipped=7.0 2023-06-26 16:39:50,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1610262.0, ans=0.125 2023-06-26 16:39:50,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1610262.0, ans=0.125 2023-06-26 16:40:46,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1610382.0, ans=0.1 2023-06-26 16:40:52,862 INFO [train.py:996] (1/4) Epoch 9, batch 24450, loss[loss=0.3349, simple_loss=0.4045, pruned_loss=0.1327, over 21438.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2983, pruned_loss=0.06845, over 4274767.26 frames. ], batch size: 507, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:41:00,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1610442.0, ans=0.2 2023-06-26 16:41:21,648 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-26 16:42:40,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1610742.0, ans=0.0 2023-06-26 16:42:41,528 INFO [train.py:996] (1/4) Epoch 9, batch 24500, loss[loss=0.2302, simple_loss=0.3019, pruned_loss=0.07921, over 21742.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3001, pruned_loss=0.06926, over 4282363.14 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:42:59,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1610742.0, ans=0.125 2023-06-26 16:43:14,698 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.443e+02 5.135e+02 6.610e+02 1.095e+03 2.710e+03, threshold=1.322e+03, percent-clipped=12.0 2023-06-26 16:43:23,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1610862.0, ans=0.125 2023-06-26 16:44:22,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1610982.0, ans=0.0 2023-06-26 16:44:35,193 INFO [train.py:996] (1/4) Epoch 9, batch 24550, loss[loss=0.2204, simple_loss=0.3037, pruned_loss=0.06857, over 21900.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3025, pruned_loss=0.07173, over 4283667.49 frames. ], batch size: 316, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:44:58,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1611102.0, ans=0.125 2023-06-26 16:45:17,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1611162.0, ans=0.125 2023-06-26 16:45:29,937 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. 
limit=15.0 2023-06-26 16:45:55,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1611222.0, ans=0.0 2023-06-26 16:46:07,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1611282.0, ans=0.125 2023-06-26 16:46:16,621 INFO [train.py:996] (1/4) Epoch 9, batch 24600, loss[loss=0.203, simple_loss=0.2784, pruned_loss=0.06379, over 21763.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2981, pruned_loss=0.07177, over 4288441.85 frames. ], batch size: 333, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:46:48,953 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.731e+02 5.495e+02 6.731e+02 9.246e+02 1.741e+03, threshold=1.346e+03, percent-clipped=6.0 2023-06-26 16:48:05,288 INFO [train.py:996] (1/4) Epoch 9, batch 24650, loss[loss=0.2044, simple_loss=0.2629, pruned_loss=0.07293, over 21512.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2896, pruned_loss=0.07046, over 4277499.53 frames. ], batch size: 441, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:48:05,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1611642.0, ans=0.0 2023-06-26 16:48:13,230 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-26 16:49:05,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1611762.0, ans=0.0 2023-06-26 16:49:06,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1611762.0, ans=0.125 2023-06-26 16:49:17,526 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 16:49:58,473 INFO [train.py:996] (1/4) Epoch 9, batch 24700, loss[loss=0.1909, simple_loss=0.2646, pruned_loss=0.05857, over 21229.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2891, pruned_loss=0.06925, over 4261993.25 frames. ], batch size: 176, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:50:03,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1611942.0, ans=0.2 2023-06-26 16:50:11,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1611942.0, ans=0.1 2023-06-26 16:50:31,842 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.355e+02 4.942e+02 6.984e+02 9.406e+02 2.267e+03, threshold=1.397e+03, percent-clipped=8.0 2023-06-26 16:50:50,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-26 16:51:24,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1612182.0, ans=0.0 2023-06-26 16:51:37,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1612182.0, ans=0.1 2023-06-26 16:51:46,657 INFO [train.py:996] (1/4) Epoch 9, batch 24750, loss[loss=0.2278, simple_loss=0.2807, pruned_loss=0.08745, over 21435.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2816, pruned_loss=0.06635, over 4258634.86 frames. 
], batch size: 509, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:51:52,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1612242.0, ans=0.125 2023-06-26 16:51:54,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1612242.0, ans=0.0 2023-06-26 16:51:56,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1612242.0, ans=0.125 2023-06-26 16:52:52,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1612362.0, ans=0.1 2023-06-26 16:52:54,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1612422.0, ans=0.0 2023-06-26 16:53:29,863 INFO [train.py:996] (1/4) Epoch 9, batch 24800, loss[loss=0.2184, simple_loss=0.2853, pruned_loss=0.07578, over 21474.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2775, pruned_loss=0.06619, over 4262221.78 frames. ], batch size: 212, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:53:56,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1612602.0, ans=0.035 2023-06-26 16:54:10,073 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.499e+02 5.335e+02 8.218e+02 1.489e+03 3.682e+03, threshold=1.644e+03, percent-clipped=29.0 2023-06-26 16:54:43,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1612722.0, ans=0.125 2023-06-26 16:55:06,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1612782.0, ans=0.0 2023-06-26 16:55:06,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1612782.0, ans=0.1 2023-06-26 16:55:14,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-26 16:55:20,288 INFO [train.py:996] (1/4) Epoch 9, batch 24850, loss[loss=0.1793, simple_loss=0.248, pruned_loss=0.05534, over 21863.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2782, pruned_loss=0.06772, over 4267192.20 frames. ], batch size: 107, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:55:29,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1612842.0, ans=0.0 2023-06-26 16:55:43,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1612902.0, ans=0.0 2023-06-26 16:56:07,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0 2023-06-26 16:56:24,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1612962.0, ans=0.125 2023-06-26 16:56:32,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. 
limit=6.0 2023-06-26 16:56:54,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1613082.0, ans=0.05 2023-06-26 16:57:14,589 INFO [train.py:996] (1/4) Epoch 9, batch 24900, loss[loss=0.25, simple_loss=0.3187, pruned_loss=0.09069, over 21362.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2817, pruned_loss=0.06862, over 4272421.54 frames. ], batch size: 548, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:57:52,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1613202.0, ans=0.09899494936611666 2023-06-26 16:57:54,892 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.038e+02 5.408e+02 8.463e+02 1.347e+03 2.375e+03, threshold=1.693e+03, percent-clipped=14.0 2023-06-26 16:58:19,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1613262.0, ans=0.125 2023-06-26 16:58:51,024 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.45 vs. limit=15.0 2023-06-26 16:59:11,114 INFO [train.py:996] (1/4) Epoch 9, batch 24950, loss[loss=0.2387, simple_loss=0.3114, pruned_loss=0.083, over 21464.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2883, pruned_loss=0.07214, over 4267000.75 frames. ], batch size: 211, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:59:51,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.88 vs. limit=10.0 2023-06-26 17:00:31,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1613622.0, ans=0.07 2023-06-26 17:00:55,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613682.0, ans=0.1 2023-06-26 17:01:05,690 INFO [train.py:996] (1/4) Epoch 9, batch 25000, loss[loss=0.2021, simple_loss=0.2831, pruned_loss=0.06054, over 21494.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2936, pruned_loss=0.07339, over 4269822.89 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 17:01:16,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1613742.0, ans=0.05 2023-06-26 17:01:21,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1613802.0, ans=0.5 2023-06-26 17:01:40,150 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.080e+02 5.287e+02 8.385e+02 1.349e+03 3.356e+03, threshold=1.677e+03, percent-clipped=10.0 2023-06-26 17:01:50,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-26 17:02:05,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1613862.0, ans=0.125 2023-06-26 17:02:07,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1613922.0, ans=0.0 2023-06-26 17:02:52,725 INFO [train.py:996] (1/4) Epoch 9, batch 25050, loss[loss=0.1999, simple_loss=0.2679, pruned_loss=0.06599, over 21655.00 frames. 
], tot_loss[loss=0.2158, simple_loss=0.2875, pruned_loss=0.07205, over 4271355.13 frames. ], batch size: 333, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:03:28,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1614102.0, ans=0.0 2023-06-26 17:04:40,859 INFO [train.py:996] (1/4) Epoch 9, batch 25100, loss[loss=0.1877, simple_loss=0.2595, pruned_loss=0.058, over 21599.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2833, pruned_loss=0.0705, over 4264081.73 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:04:54,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1614342.0, ans=0.0 2023-06-26 17:05:15,428 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.352e+02 5.789e+02 8.437e+02 1.364e+03 2.592e+03, threshold=1.687e+03, percent-clipped=13.0 2023-06-26 17:05:19,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1614462.0, ans=0.125 2023-06-26 17:06:16,720 INFO [train.py:996] (1/4) Epoch 9, batch 25150, loss[loss=0.219, simple_loss=0.3077, pruned_loss=0.06512, over 21766.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2856, pruned_loss=0.0687, over 4251387.88 frames. ], batch size: 112, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:07:10,180 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:07:27,879 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:07:48,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1614882.0, ans=0.125 2023-06-26 17:08:05,179 INFO [train.py:996] (1/4) Epoch 9, batch 25200, loss[loss=0.1897, simple_loss=0.2895, pruned_loss=0.04491, over 21678.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2864, pruned_loss=0.06705, over 4242427.52 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:08:50,508 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.256e+02 4.732e+02 7.162e+02 1.048e+03 3.410e+03, threshold=1.432e+03, percent-clipped=11.0 2023-06-26 17:08:59,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1615062.0, ans=0.2 2023-06-26 17:09:04,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1615062.0, ans=0.125 2023-06-26 17:09:16,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1615122.0, ans=0.1 2023-06-26 17:09:37,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1615182.0, ans=0.2 2023-06-26 17:09:48,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1615182.0, ans=0.0 2023-06-26 17:09:52,310 INFO [train.py:996] (1/4) Epoch 9, batch 25250, loss[loss=0.1962, simple_loss=0.2674, pruned_loss=0.0625, over 21779.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2848, pruned_loss=0.06586, over 4251296.89 frames. 
], batch size: 371, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:11:30,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1615482.0, ans=0.2 2023-06-26 17:11:38,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1615542.0, ans=0.2 2023-06-26 17:11:39,514 INFO [train.py:996] (1/4) Epoch 9, batch 25300, loss[loss=0.218, simple_loss=0.2694, pruned_loss=0.08332, over 20110.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2819, pruned_loss=0.06551, over 4247788.13 frames. ], batch size: 703, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:11:43,548 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:12:22,272 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.284e+02 5.811e+02 7.982e+02 1.248e+03 2.930e+03, threshold=1.596e+03, percent-clipped=17.0 2023-06-26 17:12:41,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1615662.0, ans=0.2 2023-06-26 17:12:50,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1615722.0, ans=0.125 2023-06-26 17:13:14,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1615782.0, ans=0.1 2023-06-26 17:13:29,737 INFO [train.py:996] (1/4) Epoch 9, batch 25350, loss[loss=0.1872, simple_loss=0.2812, pruned_loss=0.04665, over 21748.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2842, pruned_loss=0.06516, over 4235033.54 frames. ], batch size: 351, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:13:54,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1615842.0, ans=0.125 2023-06-26 17:14:23,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1615962.0, ans=0.125 2023-06-26 17:14:29,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-26 17:15:09,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1616082.0, ans=0.04949747468305833 2023-06-26 17:15:17,104 INFO [train.py:996] (1/4) Epoch 9, batch 25400, loss[loss=0.1904, simple_loss=0.2576, pruned_loss=0.06162, over 21604.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2815, pruned_loss=0.06458, over 4238487.98 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:15:20,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.65 vs. 
limit=22.5 2023-06-26 17:15:26,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1616142.0, ans=0.04949747468305833 2023-06-26 17:15:58,563 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 5.059e+02 8.454e+02 1.158e+03 2.444e+03, threshold=1.691e+03, percent-clipped=8.0 2023-06-26 17:16:19,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1616262.0, ans=0.125 2023-06-26 17:16:26,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1616322.0, ans=0.125 2023-06-26 17:16:50,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1616382.0, ans=0.025 2023-06-26 17:17:05,781 INFO [train.py:996] (1/4) Epoch 9, batch 25450, loss[loss=0.2233, simple_loss=0.3224, pruned_loss=0.06214, over 21774.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2822, pruned_loss=0.06642, over 4244159.84 frames. ], batch size: 414, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:17:34,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1616502.0, ans=0.95 2023-06-26 17:17:38,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1616502.0, ans=0.125 2023-06-26 17:17:53,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1616502.0, ans=0.1 2023-06-26 17:18:17,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1616562.0, ans=0.125 2023-06-26 17:18:17,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1616562.0, ans=0.1 2023-06-26 17:18:32,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1616622.0, ans=0.125 2023-06-26 17:18:42,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.14 vs. limit=15.0 2023-06-26 17:18:48,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-26 17:18:55,043 INFO [train.py:996] (1/4) Epoch 9, batch 25500, loss[loss=0.2168, simple_loss=0.2931, pruned_loss=0.07026, over 21306.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2829, pruned_loss=0.06396, over 4249007.67 frames. 
], batch size: 176, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:19:43,309 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.118e+02 5.222e+02 7.710e+02 1.108e+03 2.263e+03, threshold=1.542e+03, percent-clipped=6.0 2023-06-26 17:20:03,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1616862.0, ans=10.0 2023-06-26 17:20:14,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1616922.0, ans=0.1 2023-06-26 17:20:23,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1616922.0, ans=0.0 2023-06-26 17:20:31,457 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-26 17:20:56,325 INFO [train.py:996] (1/4) Epoch 9, batch 25550, loss[loss=0.2591, simple_loss=0.3551, pruned_loss=0.08159, over 21528.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2894, pruned_loss=0.06374, over 4251359.37 frames. ], batch size: 471, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:21:06,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1617042.0, ans=0.125 2023-06-26 17:21:23,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1617102.0, ans=0.5 2023-06-26 17:22:01,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1617222.0, ans=0.125 2023-06-26 17:22:18,572 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-26 17:22:26,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1617282.0, ans=0.125 2023-06-26 17:22:46,576 INFO [train.py:996] (1/4) Epoch 9, batch 25600, loss[loss=0.2922, simple_loss=0.3475, pruned_loss=0.1184, over 21445.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2943, pruned_loss=0.06458, over 4264930.58 frames. ], batch size: 471, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:22:53,472 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-26 17:22:54,459 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:23:11,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.13 vs. limit=10.0 2023-06-26 17:23:29,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.761e+02 5.184e+02 7.757e+02 1.041e+03 2.426e+03, threshold=1.551e+03, percent-clipped=8.0 2023-06-26 17:23:30,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1617402.0, ans=0.0 2023-06-26 17:23:34,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. 
limit=15.0 2023-06-26 17:23:38,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1617462.0, ans=0.0 2023-06-26 17:24:36,583 INFO [train.py:996] (1/4) Epoch 9, batch 25650, loss[loss=0.2021, simple_loss=0.2622, pruned_loss=0.07094, over 21674.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2946, pruned_loss=0.06717, over 4257587.89 frames. ], batch size: 247, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:24:43,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1617642.0, ans=0.125 2023-06-26 17:24:56,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1617642.0, ans=0.0 2023-06-26 17:25:39,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-26 17:25:43,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1617822.0, ans=0.2 2023-06-26 17:26:06,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.89 vs. limit=15.0 2023-06-26 17:26:12,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1617882.0, ans=0.0 2023-06-26 17:26:15,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1617882.0, ans=0.2 2023-06-26 17:26:21,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1617882.0, ans=0.125 2023-06-26 17:26:24,641 INFO [train.py:996] (1/4) Epoch 9, batch 25700, loss[loss=0.1976, simple_loss=0.2801, pruned_loss=0.05752, over 21489.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2922, pruned_loss=0.06832, over 4259592.11 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:27:00,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1618002.0, ans=0.125 2023-06-26 17:27:08,740 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.793e+02 5.331e+02 7.573e+02 1.078e+03 3.200e+03, threshold=1.515e+03, percent-clipped=12.0 2023-06-26 17:27:25,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1618062.0, ans=0.2 2023-06-26 17:27:56,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618182.0, ans=0.1 2023-06-26 17:28:00,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1618182.0, ans=0.125 2023-06-26 17:28:21,568 INFO [train.py:996] (1/4) Epoch 9, batch 25750, loss[loss=0.2849, simple_loss=0.3624, pruned_loss=0.1037, over 21602.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2954, pruned_loss=0.07024, over 4268449.15 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:28:33,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. 
limit=15.0 2023-06-26 17:29:14,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1618362.0, ans=0.125 2023-06-26 17:29:31,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1618422.0, ans=0.125 2023-06-26 17:29:32,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1618422.0, ans=0.0 2023-06-26 17:29:33,237 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-26 17:29:41,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1618422.0, ans=0.125 2023-06-26 17:30:14,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1618482.0, ans=0.2 2023-06-26 17:30:18,700 INFO [train.py:996] (1/4) Epoch 9, batch 25800, loss[loss=0.2273, simple_loss=0.3108, pruned_loss=0.07186, over 21931.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3091, pruned_loss=0.07508, over 4263423.91 frames. ], batch size: 316, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:31:03,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 5.908e+02 7.803e+02 1.133e+03 2.789e+03, threshold=1.561e+03, percent-clipped=14.0 2023-06-26 17:31:42,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1618722.0, ans=0.125 2023-06-26 17:31:57,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1618782.0, ans=0.2 2023-06-26 17:32:08,649 INFO [train.py:996] (1/4) Epoch 9, batch 25850, loss[loss=0.2294, simple_loss=0.3093, pruned_loss=0.07474, over 21763.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3096, pruned_loss=0.07455, over 4272497.90 frames. ], batch size: 389, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:32:51,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618902.0, ans=0.1 2023-06-26 17:32:58,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1618962.0, ans=0.125 2023-06-26 17:33:11,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-26 17:33:23,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1619022.0, ans=0.0 2023-06-26 17:34:03,376 INFO [train.py:996] (1/4) Epoch 9, batch 25900, loss[loss=0.2316, simple_loss=0.321, pruned_loss=0.07108, over 21563.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3113, pruned_loss=0.07591, over 4281839.54 frames. 
], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:34:22,220 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:34:36,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1619202.0, ans=0.2 2023-06-26 17:34:47,600 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.581e+02 5.400e+02 8.685e+02 1.109e+03 2.488e+03, threshold=1.737e+03, percent-clipped=11.0 2023-06-26 17:34:59,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1619262.0, ans=0.125 2023-06-26 17:35:54,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1619382.0, ans=0.125 2023-06-26 17:35:55,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1619382.0, ans=0.125 2023-06-26 17:35:58,965 INFO [train.py:996] (1/4) Epoch 9, batch 25950, loss[loss=0.2345, simple_loss=0.3175, pruned_loss=0.07572, over 21579.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3181, pruned_loss=0.0784, over 4285202.86 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:36:01,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1619442.0, ans=0.0 2023-06-26 17:37:46,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1619682.0, ans=0.125 2023-06-26 17:37:49,200 INFO [train.py:996] (1/4) Epoch 9, batch 26000, loss[loss=0.2335, simple_loss=0.3222, pruned_loss=0.07243, over 21685.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3156, pruned_loss=0.07635, over 4280998.29 frames. ], batch size: 351, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:38:33,557 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.565e+02 5.045e+02 5.850e+02 7.861e+02 1.944e+03, threshold=1.170e+03, percent-clipped=2.0 2023-06-26 17:38:35,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1619862.0, ans=0.0 2023-06-26 17:38:46,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1619862.0, ans=0.125 2023-06-26 17:39:20,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1619982.0, ans=0.125 2023-06-26 17:39:27,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1619982.0, ans=0.125 2023-06-26 17:39:31,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1619982.0, ans=0.0 2023-06-26 17:39:37,968 INFO [train.py:996] (1/4) Epoch 9, batch 26050, loss[loss=0.2142, simple_loss=0.2929, pruned_loss=0.06781, over 21839.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3162, pruned_loss=0.07753, over 4274409.87 frames. 
], batch size: 107, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:39:58,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1620042.0, ans=0.0 2023-06-26 17:40:20,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1620102.0, ans=0.125 2023-06-26 17:40:46,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1620222.0, ans=0.95 2023-06-26 17:41:09,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1620282.0, ans=0.125 2023-06-26 17:41:18,731 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-26 17:41:21,073 INFO [train.py:996] (1/4) Epoch 9, batch 26100, loss[loss=0.1811, simple_loss=0.2446, pruned_loss=0.05876, over 21240.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3104, pruned_loss=0.07582, over 4273991.41 frames. ], batch size: 608, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:41:51,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1620402.0, ans=0.0 2023-06-26 17:42:05,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1620462.0, ans=0.125 2023-06-26 17:42:06,463 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.994e+02 5.585e+02 7.440e+02 1.140e+03 2.110e+03, threshold=1.488e+03, percent-clipped=23.0 2023-06-26 17:42:06,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1620462.0, ans=0.0 2023-06-26 17:42:27,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1620522.0, ans=0.2 2023-06-26 17:42:51,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1620582.0, ans=0.0 2023-06-26 17:42:51,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1620582.0, ans=0.1 2023-06-26 17:43:04,900 INFO [train.py:996] (1/4) Epoch 9, batch 26150, loss[loss=0.2439, simple_loss=0.3191, pruned_loss=0.08436, over 21563.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.308, pruned_loss=0.07575, over 4276622.08 frames. ], batch size: 414, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:43:17,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1620642.0, ans=0.125 2023-06-26 17:43:37,248 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.14 vs. 
limit=22.5 2023-06-26 17:43:45,187 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:44:22,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1620822.0, ans=0.0 2023-06-26 17:44:22,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1620822.0, ans=0.125 2023-06-26 17:44:25,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1620822.0, ans=0.035 2023-06-26 17:45:00,356 INFO [train.py:996] (1/4) Epoch 9, batch 26200, loss[loss=0.2214, simple_loss=0.3034, pruned_loss=0.06975, over 20644.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3085, pruned_loss=0.07403, over 4277767.73 frames. ], batch size: 608, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:45:41,650 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.840e+02 5.161e+02 8.097e+02 1.241e+03 2.329e+03, threshold=1.619e+03, percent-clipped=17.0 2023-06-26 17:46:07,406 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:46:56,620 INFO [train.py:996] (1/4) Epoch 9, batch 26250, loss[loss=0.2488, simple_loss=0.3353, pruned_loss=0.08118, over 21807.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3124, pruned_loss=0.07294, over 4279091.72 frames. ], batch size: 414, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:47:09,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1621242.0, ans=0.025 2023-06-26 17:47:23,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1621302.0, ans=0.2 2023-06-26 17:48:44,889 INFO [train.py:996] (1/4) Epoch 9, batch 26300, loss[loss=0.2285, simple_loss=0.2965, pruned_loss=0.0802, over 21505.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3089, pruned_loss=0.07389, over 4285769.71 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:49:25,499 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.773e+02 5.088e+02 7.206e+02 1.171e+03 1.823e+03, threshold=1.441e+03, percent-clipped=7.0 2023-06-26 17:49:30,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=15.0 2023-06-26 17:49:51,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1621722.0, ans=0.0 2023-06-26 17:49:52,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1621722.0, ans=10.0 2023-06-26 17:50:21,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1621782.0, ans=0.09899494936611666 2023-06-26 17:50:30,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1621782.0, ans=0.1 2023-06-26 17:50:33,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1621842.0, ans=0.0 2023-06-26 17:50:34,484 INFO [train.py:996] (1/4) Epoch 9, batch 26350, loss[loss=0.2124, simple_loss=0.2932, pruned_loss=0.06576, over 20790.00 frames. 
], tot_loss[loss=0.2287, simple_loss=0.3072, pruned_loss=0.07509, over 4288489.69 frames. ], batch size: 607, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:50:39,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.86 vs. limit=10.0 2023-06-26 17:52:23,818 INFO [train.py:996] (1/4) Epoch 9, batch 26400, loss[loss=0.2043, simple_loss=0.2749, pruned_loss=0.06686, over 21997.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3015, pruned_loss=0.07479, over 4282549.25 frames. ], batch size: 103, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:52:44,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1622202.0, ans=0.2 2023-06-26 17:53:10,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622262.0, ans=0.1 2023-06-26 17:53:12,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5 2023-06-26 17:53:12,756 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 5.037e+02 6.959e+02 9.647e+02 1.675e+03, threshold=1.392e+03, percent-clipped=4.0 2023-06-26 17:53:56,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=22.5 2023-06-26 17:53:59,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-26 17:54:16,743 INFO [train.py:996] (1/4) Epoch 9, batch 26450, loss[loss=0.2606, simple_loss=0.3837, pruned_loss=0.06871, over 21165.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3014, pruned_loss=0.07443, over 4284251.70 frames. ], batch size: 549, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:55:27,330 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=22.5 2023-06-26 17:55:28,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1622562.0, ans=0.125 2023-06-26 17:55:52,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1622682.0, ans=0.0 2023-06-26 17:56:04,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1622682.0, ans=0.0 2023-06-26 17:56:13,593 INFO [train.py:996] (1/4) Epoch 9, batch 26500, loss[loss=0.2381, simple_loss=0.3251, pruned_loss=0.07555, over 21679.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3041, pruned_loss=0.07312, over 4281768.17 frames. ], batch size: 389, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:56:16,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1622742.0, ans=0.0 2023-06-26 17:57:07,194 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.802e+02 5.662e+02 1.052e+03 1.637e+03 4.186e+03, threshold=2.103e+03, percent-clipped=36.0 2023-06-26 17:57:08,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.48 vs. 
limit=15.0 2023-06-26 17:57:28,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1622922.0, ans=0.0 2023-06-26 17:58:11,238 INFO [train.py:996] (1/4) Epoch 9, batch 26550, loss[loss=0.1646, simple_loss=0.2342, pruned_loss=0.04753, over 21282.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3012, pruned_loss=0.07136, over 4262904.07 frames. ], batch size: 176, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:58:11,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1623042.0, ans=0.125 2023-06-26 17:59:06,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1623162.0, ans=0.07 2023-06-26 17:59:57,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1623282.0, ans=0.0 2023-06-26 18:00:05,318 INFO [train.py:996] (1/4) Epoch 9, batch 26600, loss[loss=0.235, simple_loss=0.2919, pruned_loss=0.08904, over 20076.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3011, pruned_loss=0.06882, over 4255617.49 frames. ], batch size: 703, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:00:47,578 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 5.073e+02 7.169e+02 1.139e+03 3.123e+03, threshold=1.434e+03, percent-clipped=9.0 2023-06-26 18:01:14,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1623522.0, ans=0.125 2023-06-26 18:01:23,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1623522.0, ans=0.125 2023-06-26 18:01:48,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1623582.0, ans=15.0 2023-06-26 18:01:59,714 INFO [train.py:996] (1/4) Epoch 9, batch 26650, loss[loss=0.1614, simple_loss=0.2544, pruned_loss=0.03419, over 21669.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2935, pruned_loss=0.06723, over 4252541.29 frames. ], batch size: 391, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:02:15,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1623702.0, ans=0.0 2023-06-26 18:02:16,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1623702.0, ans=0.0 2023-06-26 18:02:30,936 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=18.97 vs. limit=15.0 2023-06-26 18:03:17,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1623882.0, ans=0.125 2023-06-26 18:03:17,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1623882.0, ans=0.0 2023-06-26 18:03:38,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1623882.0, ans=0.0 2023-06-26 18:03:40,945 INFO [train.py:996] (1/4) Epoch 9, batch 26700, loss[loss=0.2203, simple_loss=0.3008, pruned_loss=0.06988, over 21883.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2861, pruned_loss=0.06416, over 4262805.59 frames. 
], batch size: 107, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:04:27,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1624062.0, ans=0.1 2023-06-26 18:04:29,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.080e+02 5.616e+02 9.381e+02 2.662e+03, threshold=1.123e+03, percent-clipped=11.0 2023-06-26 18:04:30,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1624062.0, ans=0.07 2023-06-26 18:04:44,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1624062.0, ans=0.125 2023-06-26 18:05:36,277 INFO [train.py:996] (1/4) Epoch 9, batch 26750, loss[loss=0.2262, simple_loss=0.3064, pruned_loss=0.07297, over 21566.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2866, pruned_loss=0.06365, over 4265886.56 frames. ], batch size: 507, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:06:25,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1624362.0, ans=0.0 2023-06-26 18:06:46,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1624422.0, ans=0.0 2023-06-26 18:07:27,061 INFO [train.py:996] (1/4) Epoch 9, batch 26800, loss[loss=0.2573, simple_loss=0.3355, pruned_loss=0.08951, over 21544.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2935, pruned_loss=0.06726, over 4262306.85 frames. ], batch size: 414, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 18:07:33,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=15.0 2023-06-26 18:07:51,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1624602.0, ans=0.0 2023-06-26 18:08:15,092 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.607e+02 5.810e+02 7.473e+02 1.088e+03 2.811e+03, threshold=1.495e+03, percent-clipped=19.0 2023-06-26 18:09:22,005 INFO [train.py:996] (1/4) Epoch 9, batch 26850, loss[loss=0.1962, simple_loss=0.2674, pruned_loss=0.06254, over 21798.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2936, pruned_loss=0.06927, over 4259022.68 frames. ], batch size: 124, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:09:30,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1624842.0, ans=0.125 2023-06-26 18:09:51,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1624902.0, ans=0.125 2023-06-26 18:10:52,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-26 18:10:56,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-26 18:11:09,549 INFO [train.py:996] (1/4) Epoch 9, batch 26900, loss[loss=0.1753, simple_loss=0.2359, pruned_loss=0.0574, over 21604.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2858, pruned_loss=0.06843, over 4260986.32 frames. 
], batch size: 247, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:11:42,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1625202.0, ans=0.1 2023-06-26 18:11:52,551 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.498e+02 4.462e+02 5.999e+02 9.238e+02 1.607e+03, threshold=1.200e+03, percent-clipped=3.0 2023-06-26 18:11:53,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1625262.0, ans=0.1 2023-06-26 18:11:54,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1625262.0, ans=0.0 2023-06-26 18:12:11,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1625322.0, ans=0.1 2023-06-26 18:12:28,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1625322.0, ans=0.125 2023-06-26 18:12:39,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1625382.0, ans=0.125 2023-06-26 18:12:57,976 INFO [train.py:996] (1/4) Epoch 9, batch 26950, loss[loss=0.3071, simple_loss=0.3785, pruned_loss=0.1178, over 21466.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2856, pruned_loss=0.06839, over 4261845.50 frames. ], batch size: 508, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:13:00,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1625442.0, ans=0.5 2023-06-26 18:13:08,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1625442.0, ans=0.035 2023-06-26 18:13:21,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1625502.0, ans=0.125 2023-06-26 18:13:30,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1625502.0, ans=0.2 2023-06-26 18:13:57,019 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.18 vs. limit=15.0 2023-06-26 18:14:16,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1625622.0, ans=0.0 2023-06-26 18:14:19,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1625622.0, ans=0.125 2023-06-26 18:14:43,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1625682.0, ans=0.04949747468305833 2023-06-26 18:14:46,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1625742.0, ans=0.2 2023-06-26 18:14:47,904 INFO [train.py:996] (1/4) Epoch 9, batch 27000, loss[loss=0.1846, simple_loss=0.2604, pruned_loss=0.05444, over 21517.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2866, pruned_loss=0.06709, over 4264208.66 frames. ], batch size: 212, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:14:47,904 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 18:15:07,477 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2501, simple_loss=0.3419, pruned_loss=0.07919, over 1796401.00 frames. 
2023-06-26 18:15:07,478 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 18:15:34,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1625802.0, ans=0.125 2023-06-26 18:15:59,909 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.455e+02 5.551e+02 8.937e+02 1.384e+03 3.879e+03, threshold=1.787e+03, percent-clipped=32.0 2023-06-26 18:16:56,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1626042.0, ans=0.1 2023-06-26 18:16:57,921 INFO [train.py:996] (1/4) Epoch 9, batch 27050, loss[loss=0.2202, simple_loss=0.3033, pruned_loss=0.06854, over 21747.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.289, pruned_loss=0.06518, over 4266992.57 frames. ], batch size: 389, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:16:58,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1626042.0, ans=0.125 2023-06-26 18:17:21,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1626102.0, ans=0.125 2023-06-26 18:17:34,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1626102.0, ans=0.2 2023-06-26 18:18:12,789 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-26 18:18:21,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=22.5 2023-06-26 18:18:32,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1626282.0, ans=0.125 2023-06-26 18:18:49,374 INFO [train.py:996] (1/4) Epoch 9, batch 27100, loss[loss=0.1953, simple_loss=0.2941, pruned_loss=0.04824, over 21370.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2898, pruned_loss=0.06559, over 4275256.89 frames. ], batch size: 131, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:19:42,256 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.471e+02 6.179e+02 8.599e+02 1.265e+03 2.717e+03, threshold=1.720e+03, percent-clipped=9.0 2023-06-26 18:20:07,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1626522.0, ans=0.0 2023-06-26 18:20:45,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1626642.0, ans=0.125 2023-06-26 18:20:46,686 INFO [train.py:996] (1/4) Epoch 9, batch 27150, loss[loss=0.2092, simple_loss=0.294, pruned_loss=0.06226, over 21308.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3013, pruned_loss=0.06844, over 4277919.56 frames. ], batch size: 131, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:21:59,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1626822.0, ans=0.125 2023-06-26 18:22:34,985 INFO [train.py:996] (1/4) Epoch 9, batch 27200, loss[loss=0.2416, simple_loss=0.3259, pruned_loss=0.07865, over 21283.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3087, pruned_loss=0.07087, over 4277188.33 frames. 
], batch size: 548, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:22:41,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-26 18:23:25,801 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.290e+02 5.594e+02 8.054e+02 1.283e+03 2.318e+03, threshold=1.611e+03, percent-clipped=7.0 2023-06-26 18:23:51,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1627122.0, ans=0.125 2023-06-26 18:24:30,170 INFO [train.py:996] (1/4) Epoch 9, batch 27250, loss[loss=0.2205, simple_loss=0.2982, pruned_loss=0.07141, over 21724.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3101, pruned_loss=0.07391, over 4272506.04 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:24:37,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1627242.0, ans=0.125 2023-06-26 18:24:44,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1627242.0, ans=0.125 2023-06-26 18:24:59,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.43 vs. limit=15.0 2023-06-26 18:25:06,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1627302.0, ans=0.1 2023-06-26 18:25:54,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-26 18:26:20,979 INFO [train.py:996] (1/4) Epoch 9, batch 27300, loss[loss=0.2531, simple_loss=0.3481, pruned_loss=0.07912, over 21711.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3137, pruned_loss=0.07568, over 4277399.15 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:26:38,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1627542.0, ans=0.0 2023-06-26 18:26:59,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1627602.0, ans=0.125 2023-06-26 18:27:18,626 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.468e+02 5.640e+02 6.768e+02 9.000e+02 1.859e+03, threshold=1.354e+03, percent-clipped=2.0 2023-06-26 18:27:24,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1627662.0, ans=0.0 2023-06-26 18:27:42,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-26 18:27:47,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.77 vs. limit=22.5 2023-06-26 18:28:06,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1627782.0, ans=0.2 2023-06-26 18:28:17,659 INFO [train.py:996] (1/4) Epoch 9, batch 27350, loss[loss=0.2393, simple_loss=0.3229, pruned_loss=0.07782, over 21255.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3163, pruned_loss=0.077, over 4280244.80 frames. 
], batch size: 143, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:28:42,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1627902.0, ans=0.0 2023-06-26 18:28:53,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1627902.0, ans=0.0 2023-06-26 18:30:04,059 INFO [train.py:996] (1/4) Epoch 9, batch 27400, loss[loss=0.2045, simple_loss=0.2724, pruned_loss=0.06827, over 21652.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3108, pruned_loss=0.07611, over 4283268.31 frames. ], batch size: 391, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:30:32,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1628202.0, ans=0.035 2023-06-26 18:30:54,137 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.725e+02 5.126e+02 6.894e+02 1.011e+03 2.169e+03, threshold=1.379e+03, percent-clipped=11.0 2023-06-26 18:31:45,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1628382.0, ans=0.125 2023-06-26 18:31:47,416 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:31:52,292 INFO [train.py:996] (1/4) Epoch 9, batch 27450, loss[loss=0.2206, simple_loss=0.3009, pruned_loss=0.07019, over 20702.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3049, pruned_loss=0.07478, over 4281629.99 frames. ], batch size: 607, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:31:53,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-26 18:32:16,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1628502.0, ans=0.125 2023-06-26 18:32:40,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-06-26 18:33:05,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1628622.0, ans=0.0 2023-06-26 18:33:21,094 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-26 18:33:38,598 INFO [train.py:996] (1/4) Epoch 9, batch 27500, loss[loss=0.2172, simple_loss=0.2877, pruned_loss=0.07332, over 21902.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3032, pruned_loss=0.07423, over 4286875.18 frames. ], batch size: 351, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:34:29,852 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.928e+02 5.202e+02 7.866e+02 1.174e+03 2.313e+03, threshold=1.573e+03, percent-clipped=14.0 2023-06-26 18:34:32,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-26 18:35:26,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1629042.0, ans=0.125 2023-06-26 18:35:26,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.12 vs. 
limit=10.0 2023-06-26 18:35:27,112 INFO [train.py:996] (1/4) Epoch 9, batch 27550, loss[loss=0.1739, simple_loss=0.247, pruned_loss=0.05041, over 21456.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2974, pruned_loss=0.0709, over 4291302.04 frames. ], batch size: 194, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:36:06,563 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.80 vs. limit=10.0 2023-06-26 18:36:07,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1629102.0, ans=0.1 2023-06-26 18:36:47,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1629222.0, ans=0.125 2023-06-26 18:37:21,053 INFO [train.py:996] (1/4) Epoch 9, batch 27600, loss[loss=0.1959, simple_loss=0.2669, pruned_loss=0.06244, over 21293.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2919, pruned_loss=0.06978, over 4280439.59 frames. ], batch size: 144, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 18:37:38,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1629402.0, ans=0.125 2023-06-26 18:38:11,877 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 6.372e+02 8.382e+02 1.316e+03 3.069e+03, threshold=1.676e+03, percent-clipped=15.0 2023-06-26 18:38:25,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1629522.0, ans=0.125 2023-06-26 18:38:27,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1629522.0, ans=0.125 2023-06-26 18:38:33,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-26 18:38:43,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1629582.0, ans=0.125 2023-06-26 18:38:50,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1629582.0, ans=0.0 2023-06-26 18:38:51,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1629582.0, ans=0.0 2023-06-26 18:38:59,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1629582.0, ans=10.0 2023-06-26 18:39:07,992 INFO [train.py:996] (1/4) Epoch 9, batch 27650, loss[loss=0.1796, simple_loss=0.2404, pruned_loss=0.05935, over 21041.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2861, pruned_loss=0.06901, over 4285407.26 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:39:08,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1629642.0, ans=0.0 2023-06-26 18:39:24,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1629702.0, ans=0.2 2023-06-26 18:39:52,019 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. 
limit=15.0 2023-06-26 18:40:07,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1629762.0, ans=0.125 2023-06-26 18:40:08,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=1629762.0, ans=12.0 2023-06-26 18:40:37,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1629882.0, ans=0.125 2023-06-26 18:40:55,819 INFO [train.py:996] (1/4) Epoch 9, batch 27700, loss[loss=0.2496, simple_loss=0.336, pruned_loss=0.08156, over 21307.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2869, pruned_loss=0.06745, over 4283983.38 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:41:47,709 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.763e+02 6.253e+02 8.900e+02 1.966e+03, threshold=1.251e+03, percent-clipped=3.0 2023-06-26 18:42:02,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1630122.0, ans=0.125 2023-06-26 18:42:11,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=1630122.0, ans=12.0 2023-06-26 18:42:43,154 INFO [train.py:996] (1/4) Epoch 9, batch 27750, loss[loss=0.2256, simple_loss=0.3324, pruned_loss=0.05943, over 20860.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2916, pruned_loss=0.06768, over 4287266.48 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:43:30,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1630362.0, ans=0.125 2023-06-26 18:43:42,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1630362.0, ans=0.0 2023-06-26 18:44:11,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1630482.0, ans=0.125 2023-06-26 18:44:23,841 INFO [train.py:996] (1/4) Epoch 9, batch 27800, loss[loss=0.2045, simple_loss=0.2784, pruned_loss=0.06536, over 21893.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2903, pruned_loss=0.06825, over 4288888.87 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:45:23,038 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.595e+02 5.099e+02 6.470e+02 1.005e+03 1.791e+03, threshold=1.294e+03, percent-clipped=14.0 2023-06-26 18:45:41,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1630722.0, ans=0.0 2023-06-26 18:45:41,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1630722.0, ans=0.125 2023-06-26 18:45:43,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1630722.0, ans=0.125 2023-06-26 18:46:00,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1630782.0, ans=0.125 2023-06-26 18:46:18,784 INFO [train.py:996] (1/4) Epoch 9, batch 27850, loss[loss=0.228, simple_loss=0.2995, pruned_loss=0.07832, over 21867.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.289, pruned_loss=0.06903, over 4287457.17 frames. 
], batch size: 118, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:46:41,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1630902.0, ans=0.0 2023-06-26 18:46:46,685 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:47:03,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-26 18:47:41,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1631022.0, ans=0.02 2023-06-26 18:48:03,205 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0 2023-06-26 18:48:11,027 INFO [train.py:996] (1/4) Epoch 9, batch 27900, loss[loss=0.2244, simple_loss=0.3238, pruned_loss=0.06248, over 21793.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2956, pruned_loss=0.06958, over 4287565.40 frames. ], batch size: 316, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:49:12,753 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.641e+02 5.533e+02 7.337e+02 1.067e+03 2.093e+03, threshold=1.467e+03, percent-clipped=13.0 2023-06-26 18:49:46,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1631382.0, ans=0.0 2023-06-26 18:50:02,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1631382.0, ans=0.0 2023-06-26 18:50:09,119 INFO [train.py:996] (1/4) Epoch 9, batch 27950, loss[loss=0.1587, simple_loss=0.2288, pruned_loss=0.04429, over 17080.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2959, pruned_loss=0.06705, over 4276380.03 frames. ], batch size: 61, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:50:23,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1631442.0, ans=0.125 2023-06-26 18:50:51,817 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:50:55,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1631562.0, ans=0.07 2023-06-26 18:51:18,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1631622.0, ans=0.1 2023-06-26 18:51:23,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1631622.0, ans=0.2 2023-06-26 18:51:30,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1631622.0, ans=0.2 2023-06-26 18:51:58,508 INFO [train.py:996] (1/4) Epoch 9, batch 28000, loss[loss=0.2737, simple_loss=0.3233, pruned_loss=0.1121, over 21764.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2946, pruned_loss=0.06538, over 4281550.39 frames. ], batch size: 508, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:52:08,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. 
limit=10.0 2023-06-26 18:52:53,930 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 5.535e+02 9.213e+02 1.280e+03 3.629e+03, threshold=1.843e+03, percent-clipped=20.0 2023-06-26 18:53:20,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1631922.0, ans=0.0 2023-06-26 18:53:56,218 INFO [train.py:996] (1/4) Epoch 9, batch 28050, loss[loss=0.2569, simple_loss=0.3298, pruned_loss=0.09203, over 21589.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2922, pruned_loss=0.06577, over 4284904.94 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:54:42,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1632162.0, ans=0.0 2023-06-26 18:54:57,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1632222.0, ans=0.0 2023-06-26 18:55:40,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1632282.0, ans=0.2 2023-06-26 18:55:43,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1632342.0, ans=0.125 2023-06-26 18:55:44,940 INFO [train.py:996] (1/4) Epoch 9, batch 28100, loss[loss=0.1688, simple_loss=0.2441, pruned_loss=0.04675, over 21388.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2903, pruned_loss=0.06592, over 4277209.13 frames. ], batch size: 131, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:56:10,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1632402.0, ans=0.2 2023-06-26 18:56:25,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1632462.0, ans=0.125 2023-06-26 18:56:37,149 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.724e+02 5.263e+02 6.694e+02 1.046e+03 2.729e+03, threshold=1.339e+03, percent-clipped=5.0 2023-06-26 18:57:25,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1632582.0, ans=0.0 2023-06-26 18:57:29,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-26 18:57:29,617 INFO [train.py:996] (1/4) Epoch 9, batch 28150, loss[loss=0.2013, simple_loss=0.2682, pruned_loss=0.06721, over 21894.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2858, pruned_loss=0.06577, over 4279783.92 frames. 
], batch size: 373, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:57:57,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1632702.0, ans=0.04949747468305833 2023-06-26 18:58:11,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1632762.0, ans=0.5 2023-06-26 18:58:23,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1632762.0, ans=0.1 2023-06-26 18:58:28,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1632762.0, ans=0.0 2023-06-26 18:58:48,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1632822.0, ans=0.125 2023-06-26 18:58:51,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1632822.0, ans=0.0 2023-06-26 18:59:15,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1632882.0, ans=0.125 2023-06-26 18:59:16,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-26 18:59:18,579 INFO [train.py:996] (1/4) Epoch 9, batch 28200, loss[loss=0.2659, simple_loss=0.3739, pruned_loss=0.07899, over 19754.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2857, pruned_loss=0.06747, over 4274176.74 frames. ], batch size: 702, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:59:35,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1632942.0, ans=0.125 2023-06-26 18:59:45,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1633002.0, ans=10.0 2023-06-26 19:00:03,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1633062.0, ans=0.125 2023-06-26 19:00:13,415 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.542e+02 6.148e+02 9.394e+02 1.401e+03 3.381e+03, threshold=1.879e+03, percent-clipped=30.0 2023-06-26 19:00:22,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1633062.0, ans=0.025 2023-06-26 19:01:07,256 INFO [train.py:996] (1/4) Epoch 9, batch 28250, loss[loss=0.1741, simple_loss=0.2202, pruned_loss=0.064, over 20740.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2867, pruned_loss=0.06976, over 4271473.69 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:01:07,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1633242.0, ans=0.2 2023-06-26 19:01:07,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1633242.0, ans=0.05 2023-06-26 19:01:11,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1633242.0, ans=0.125 2023-06-26 19:01:26,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.96 vs. 
limit=5.0 2023-06-26 19:01:50,769 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2023-06-26 19:02:09,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1633362.0, ans=0.0 2023-06-26 19:02:16,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1633422.0, ans=0.0 2023-06-26 19:02:19,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1633422.0, ans=0.125 2023-06-26 19:02:33,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1633422.0, ans=0.125 2023-06-26 19:02:46,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1633482.0, ans=0.0 2023-06-26 19:03:03,816 INFO [train.py:996] (1/4) Epoch 9, batch 28300, loss[loss=0.1988, simple_loss=0.2916, pruned_loss=0.05297, over 21517.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2855, pruned_loss=0.06766, over 4262092.19 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:03:31,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1633602.0, ans=0.1 2023-06-26 19:03:48,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1633662.0, ans=0.2 2023-06-26 19:03:58,617 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.505e+02 4.596e+02 7.876e+02 1.186e+03 2.671e+03, threshold=1.575e+03, percent-clipped=4.0 2023-06-26 19:04:13,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1633722.0, ans=0.0 2023-06-26 19:04:46,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=12.0 2023-06-26 19:04:53,325 INFO [train.py:996] (1/4) Epoch 9, batch 28350, loss[loss=0.1845, simple_loss=0.2473, pruned_loss=0.06087, over 21247.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2845, pruned_loss=0.06292, over 4258110.96 frames. ], batch size: 160, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:04:55,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1633842.0, ans=0.125 2023-06-26 19:05:07,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1633842.0, ans=0.125 2023-06-26 19:05:09,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.87 vs. limit=15.0 2023-06-26 19:05:20,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1633902.0, ans=0.125 2023-06-26 19:06:15,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-26 19:06:38,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. 
limit=22.5 2023-06-26 19:06:46,259 INFO [train.py:996] (1/4) Epoch 9, batch 28400, loss[loss=0.1792, simple_loss=0.2492, pruned_loss=0.05458, over 21627.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2803, pruned_loss=0.06287, over 4254433.44 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 19:07:09,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1634202.0, ans=0.1 2023-06-26 19:07:34,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-06-26 19:07:36,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1634262.0, ans=0.0 2023-06-26 19:07:41,786 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.757e+02 5.507e+02 7.639e+02 1.116e+03 2.582e+03, threshold=1.528e+03, percent-clipped=10.0 2023-06-26 19:08:24,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1634382.0, ans=0.125 2023-06-26 19:08:33,701 INFO [train.py:996] (1/4) Epoch 9, batch 28450, loss[loss=0.1995, simple_loss=0.2721, pruned_loss=0.06344, over 15437.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2841, pruned_loss=0.06589, over 4256523.53 frames. ], batch size: 60, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:08:52,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1634442.0, ans=0.2 2023-06-26 19:09:39,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1634622.0, ans=0.1 2023-06-26 19:09:40,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1634622.0, ans=0.95 2023-06-26 19:09:48,766 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=12.0 2023-06-26 19:10:22,630 INFO [train.py:996] (1/4) Epoch 9, batch 28500, loss[loss=0.2712, simple_loss=0.3377, pruned_loss=0.1023, over 21482.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2863, pruned_loss=0.0682, over 4260116.14 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:11:20,219 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 5.079e+02 6.899e+02 9.776e+02 2.125e+03, threshold=1.380e+03, percent-clipped=6.0 2023-06-26 19:11:41,962 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:11:45,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1634922.0, ans=0.125 2023-06-26 19:12:18,033 INFO [train.py:996] (1/4) Epoch 9, batch 28550, loss[loss=0.2674, simple_loss=0.3722, pruned_loss=0.08128, over 21272.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2952, pruned_loss=0.0716, over 4268030.09 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:12:24,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.59 vs. 
limit=22.5 2023-06-26 19:14:06,437 INFO [train.py:996] (1/4) Epoch 9, batch 28600, loss[loss=0.2411, simple_loss=0.3153, pruned_loss=0.08342, over 21586.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3008, pruned_loss=0.07318, over 4274261.79 frames. ], batch size: 389, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:14:52,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1635462.0, ans=0.125 2023-06-26 19:15:00,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1635462.0, ans=0.04949747468305833 2023-06-26 19:15:10,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 5.299e+02 6.853e+02 1.013e+03 2.004e+03, threshold=1.371e+03, percent-clipped=8.0 2023-06-26 19:15:51,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1635582.0, ans=0.125 2023-06-26 19:16:02,348 INFO [train.py:996] (1/4) Epoch 9, batch 28650, loss[loss=0.2029, simple_loss=0.2606, pruned_loss=0.07264, over 21319.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2954, pruned_loss=0.07241, over 4269563.18 frames. ], batch size: 144, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:16:04,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1635642.0, ans=0.0 2023-06-26 19:16:31,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0 2023-06-26 19:17:08,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1635822.0, ans=0.0 2023-06-26 19:17:29,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1635882.0, ans=0.125 2023-06-26 19:17:50,883 INFO [train.py:996] (1/4) Epoch 9, batch 28700, loss[loss=0.2226, simple_loss=0.2956, pruned_loss=0.07479, over 21460.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2942, pruned_loss=0.07374, over 4268300.18 frames. ], batch size: 194, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:18:48,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.582e+02 5.345e+02 7.889e+02 1.390e+03 2.918e+03, threshold=1.578e+03, percent-clipped=26.0 2023-06-26 19:19:08,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-26 19:19:40,173 INFO [train.py:996] (1/4) Epoch 9, batch 28750, loss[loss=0.1932, simple_loss=0.2846, pruned_loss=0.05087, over 21800.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2926, pruned_loss=0.07323, over 4276940.74 frames. 
], batch size: 282, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:20:20,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1636302.0, ans=0.0 2023-06-26 19:20:33,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1636362.0, ans=0.125 2023-06-26 19:20:35,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1636362.0, ans=0.0 2023-06-26 19:20:56,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1636422.0, ans=0.125 2023-06-26 19:21:19,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1636482.0, ans=0.0 2023-06-26 19:21:31,181 INFO [train.py:996] (1/4) Epoch 9, batch 28800, loss[loss=0.2828, simple_loss=0.3638, pruned_loss=0.1009, over 21801.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2961, pruned_loss=0.07272, over 4271121.12 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:21:55,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1636602.0, ans=0.125 2023-06-26 19:21:58,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1636602.0, ans=0.1 2023-06-26 19:22:22,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-26 19:22:33,372 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 5.031e+02 6.250e+02 8.713e+02 2.260e+03, threshold=1.250e+03, percent-clipped=3.0 2023-06-26 19:22:35,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636662.0, ans=0.1 2023-06-26 19:22:56,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1636722.0, ans=0.0 2023-06-26 19:23:25,613 INFO [train.py:996] (1/4) Epoch 9, batch 28850, loss[loss=0.2166, simple_loss=0.2827, pruned_loss=0.0752, over 21454.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2975, pruned_loss=0.07429, over 4275706.77 frames. ], batch size: 194, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:23:32,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1636842.0, ans=0.125 2023-06-26 19:24:40,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1637022.0, ans=0.125 2023-06-26 19:24:59,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1637082.0, ans=0.2 2023-06-26 19:25:14,984 INFO [train.py:996] (1/4) Epoch 9, batch 28900, loss[loss=0.2772, simple_loss=0.3507, pruned_loss=0.1019, over 21747.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.303, pruned_loss=0.07697, over 4275834.73 frames. 
], batch size: 441, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:26:18,358 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.963e+02 5.862e+02 9.485e+02 1.263e+03 2.647e+03, threshold=1.897e+03, percent-clipped=25.0 2023-06-26 19:26:22,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1637322.0, ans=0.2 2023-06-26 19:27:10,765 INFO [train.py:996] (1/4) Epoch 9, batch 28950, loss[loss=0.2545, simple_loss=0.3252, pruned_loss=0.09195, over 19959.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3037, pruned_loss=0.07702, over 4273091.34 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:27:50,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1637502.0, ans=0.2 2023-06-26 19:28:18,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1637622.0, ans=0.125 2023-06-26 19:28:41,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1637622.0, ans=0.0 2023-06-26 19:29:07,344 INFO [train.py:996] (1/4) Epoch 9, batch 29000, loss[loss=0.2246, simple_loss=0.2988, pruned_loss=0.07519, over 21780.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3052, pruned_loss=0.07591, over 4271031.60 frames. ], batch size: 247, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:29:35,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1637802.0, ans=0.125 2023-06-26 19:29:41,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1637802.0, ans=0.035 2023-06-26 19:30:02,147 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.853e+02 8.491e+02 1.284e+03 2.472e+03, threshold=1.698e+03, percent-clipped=8.0 2023-06-26 19:30:44,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1637982.0, ans=0.125 2023-06-26 19:30:44,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1637982.0, ans=0.0 2023-06-26 19:30:57,307 INFO [train.py:996] (1/4) Epoch 9, batch 29050, loss[loss=0.2279, simple_loss=0.2924, pruned_loss=0.08167, over 21587.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3046, pruned_loss=0.07602, over 4272279.26 frames. ], batch size: 212, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:31:22,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.58 vs. 
limit=15.0 2023-06-26 19:31:35,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1638102.0, ans=0.125 2023-06-26 19:31:57,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1638162.0, ans=0.0 2023-06-26 19:32:17,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1638222.0, ans=0.125 2023-06-26 19:32:19,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1638222.0, ans=0.1 2023-06-26 19:32:24,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1638282.0, ans=0.125 2023-06-26 19:32:46,716 INFO [train.py:996] (1/4) Epoch 9, batch 29100, loss[loss=0.1826, simple_loss=0.2513, pruned_loss=0.05694, over 21787.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2965, pruned_loss=0.07374, over 4272203.61 frames. ], batch size: 371, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:33:10,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1638402.0, ans=0.0 2023-06-26 19:33:44,925 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.318e+02 7.274e+02 9.701e+02 2.233e+03, threshold=1.455e+03, percent-clipped=4.0 2023-06-26 19:34:16,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1638582.0, ans=0.1 2023-06-26 19:34:35,014 INFO [train.py:996] (1/4) Epoch 9, batch 29150, loss[loss=0.1896, simple_loss=0.2617, pruned_loss=0.05871, over 21313.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2948, pruned_loss=0.07229, over 4271719.51 frames. ], batch size: 608, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:34:54,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1638642.0, ans=0.125 2023-06-26 19:35:01,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1638702.0, ans=0.1 2023-06-26 19:35:01,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1638702.0, ans=0.0 2023-06-26 19:35:15,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1638762.0, ans=0.0 2023-06-26 19:35:26,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.72 vs. limit=15.0 2023-06-26 19:35:48,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1638822.0, ans=0.0 2023-06-26 19:36:20,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1638882.0, ans=0.125 2023-06-26 19:36:23,259 INFO [train.py:996] (1/4) Epoch 9, batch 29200, loss[loss=0.2674, simple_loss=0.3229, pruned_loss=0.106, over 21446.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2897, pruned_loss=0.07099, over 4258626.78 frames. 
], batch size: 509, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 19:37:28,581 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 5.329e+02 8.203e+02 1.175e+03 2.946e+03, threshold=1.641e+03, percent-clipped=12.0 2023-06-26 19:37:40,012 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-26 19:38:11,683 INFO [train.py:996] (1/4) Epoch 9, batch 29250, loss[loss=0.199, simple_loss=0.2835, pruned_loss=0.05722, over 21546.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.288, pruned_loss=0.06858, over 4260485.72 frames. ], batch size: 230, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:38:24,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=15.0 2023-06-26 19:38:30,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-26 19:38:59,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1639362.0, ans=0.125 2023-06-26 19:39:27,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1639422.0, ans=0.2 2023-06-26 19:40:05,078 INFO [train.py:996] (1/4) Epoch 9, batch 29300, loss[loss=0.1885, simple_loss=0.2646, pruned_loss=0.0562, over 21631.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2894, pruned_loss=0.06788, over 4267108.92 frames. ], batch size: 298, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:40:37,114 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:40:38,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1639602.0, ans=0.125 2023-06-26 19:41:01,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1639662.0, ans=0.0 2023-06-26 19:41:03,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.751e+02 5.530e+02 7.690e+02 1.193e+03 2.293e+03, threshold=1.538e+03, percent-clipped=8.0 2023-06-26 19:41:10,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1639722.0, ans=0.1 2023-06-26 19:41:30,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1639782.0, ans=0.125 2023-06-26 19:41:55,412 INFO [train.py:996] (1/4) Epoch 9, batch 29350, loss[loss=0.2096, simple_loss=0.2719, pruned_loss=0.07368, over 21235.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2861, pruned_loss=0.06742, over 4262503.59 frames. ], batch size: 144, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:41:56,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1639842.0, ans=0.125 2023-06-26 19:42:24,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1639902.0, ans=0.1 2023-06-26 19:42:33,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=17.77 vs. 
limit=15.0 2023-06-26 19:42:38,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1639962.0, ans=0.1 2023-06-26 19:42:56,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1639962.0, ans=0.0 2023-06-26 19:43:47,605 INFO [train.py:996] (1/4) Epoch 9, batch 29400, loss[loss=0.1811, simple_loss=0.2613, pruned_loss=0.05045, over 21682.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2867, pruned_loss=0.06563, over 4256141.04 frames. ], batch size: 298, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:44:52,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1640262.0, ans=0.0 2023-06-26 19:44:53,333 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.689e+02 1.066e+03 1.595e+03 4.259e+03, threshold=2.132e+03, percent-clipped=27.0 2023-06-26 19:44:55,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1640322.0, ans=0.125 2023-06-26 19:45:44,116 INFO [train.py:996] (1/4) Epoch 9, batch 29450, loss[loss=0.1753, simple_loss=0.2455, pruned_loss=0.05258, over 21629.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.285, pruned_loss=0.06457, over 4261049.75 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:45:59,348 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-26 19:46:56,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1640622.0, ans=0.125 2023-06-26 19:47:10,021 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:47:22,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1640682.0, ans=0.125 2023-06-26 19:47:26,976 INFO [train.py:996] (1/4) Epoch 9, batch 29500, loss[loss=0.2051, simple_loss=0.281, pruned_loss=0.06464, over 21869.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2888, pruned_loss=0.06692, over 4270433.91 frames. ], batch size: 351, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:47:37,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1640742.0, ans=0.2 2023-06-26 19:48:22,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1640862.0, ans=0.1 2023-06-26 19:48:30,389 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.856e+02 6.070e+02 8.083e+02 1.104e+03 1.958e+03, threshold=1.617e+03, percent-clipped=0.0 2023-06-26 19:49:14,777 INFO [train.py:996] (1/4) Epoch 9, batch 29550, loss[loss=0.2158, simple_loss=0.2906, pruned_loss=0.07051, over 21629.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2873, pruned_loss=0.06826, over 4279679.72 frames. ], batch size: 131, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:50:06,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.22 vs. 
limit=10.0 2023-06-26 19:50:08,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1641162.0, ans=0.0 2023-06-26 19:51:11,602 INFO [train.py:996] (1/4) Epoch 9, batch 29600, loss[loss=0.2454, simple_loss=0.3322, pruned_loss=0.07927, over 21659.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2955, pruned_loss=0.07107, over 4276532.64 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 19:51:33,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-26 19:52:16,244 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 6.278e+02 9.739e+02 1.305e+03 2.412e+03, threshold=1.948e+03, percent-clipped=12.0 2023-06-26 19:53:00,019 INFO [train.py:996] (1/4) Epoch 9, batch 29650, loss[loss=0.2001, simple_loss=0.2787, pruned_loss=0.06071, over 21410.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2933, pruned_loss=0.06825, over 4275684.37 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:53:25,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1641702.0, ans=0.125 2023-06-26 19:54:00,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1641762.0, ans=0.2 2023-06-26 19:54:49,514 INFO [train.py:996] (1/4) Epoch 9, batch 29700, loss[loss=0.231, simple_loss=0.3166, pruned_loss=0.07273, over 21823.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2938, pruned_loss=0.06863, over 4278971.13 frames. ], batch size: 351, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:54:58,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-26 19:55:05,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1641942.0, ans=0.2 2023-06-26 19:55:36,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1642062.0, ans=0.125 2023-06-26 19:55:43,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1642062.0, ans=0.0 2023-06-26 19:55:55,268 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 4.988e+02 7.625e+02 1.121e+03 2.201e+03, threshold=1.525e+03, percent-clipped=1.0 2023-06-26 19:55:57,953 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:56:03,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1642122.0, ans=0.0 2023-06-26 19:56:38,117 INFO [train.py:996] (1/4) Epoch 9, batch 29750, loss[loss=0.2819, simple_loss=0.3511, pruned_loss=0.1063, over 21588.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2979, pruned_loss=0.06896, over 4281328.32 frames. ], batch size: 507, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:56:47,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. 
limit=10.0 2023-06-26 19:56:50,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1642242.0, ans=0.2 2023-06-26 19:56:56,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. limit=10.0 2023-06-26 19:57:26,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1642362.0, ans=0.1 2023-06-26 19:58:26,737 INFO [train.py:996] (1/4) Epoch 9, batch 29800, loss[loss=0.2301, simple_loss=0.3012, pruned_loss=0.07955, over 21303.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2992, pruned_loss=0.06973, over 4282093.17 frames. ], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:59:16,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1642662.0, ans=0.0 2023-06-26 19:59:33,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.208e+02 7.577e+02 1.107e+03 1.626e+03 2.906e+03, threshold=2.213e+03, percent-clipped=30.0 2023-06-26 19:59:37,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1642722.0, ans=0.0 2023-06-26 19:59:46,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1642722.0, ans=0.0 2023-06-26 20:00:15,160 INFO [train.py:996] (1/4) Epoch 9, batch 29850, loss[loss=0.2059, simple_loss=0.2834, pruned_loss=0.0642, over 21906.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2962, pruned_loss=0.06835, over 4276236.94 frames. ], batch size: 124, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:00:42,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1642902.0, ans=0.125 2023-06-26 20:01:20,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1642962.0, ans=0.125 2023-06-26 20:01:23,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1643022.0, ans=0.125 2023-06-26 20:01:39,984 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-26 20:01:52,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1643082.0, ans=0.125 2023-06-26 20:02:08,054 INFO [train.py:996] (1/4) Epoch 9, batch 29900, loss[loss=0.2248, simple_loss=0.2963, pruned_loss=0.07661, over 21381.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2939, pruned_loss=0.06923, over 4281077.98 frames. 
], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:02:11,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1643142.0, ans=0.125 2023-06-26 20:02:18,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1643142.0, ans=0.0 2023-06-26 20:02:27,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1643202.0, ans=0.0 2023-06-26 20:03:09,778 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.950e+02 5.576e+02 8.031e+02 1.172e+03 2.675e+03, threshold=1.606e+03, percent-clipped=3.0 2023-06-26 20:03:46,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-26 20:03:57,899 INFO [train.py:996] (1/4) Epoch 9, batch 29950, loss[loss=0.2226, simple_loss=0.2948, pruned_loss=0.0752, over 21631.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2983, pruned_loss=0.0731, over 4279444.34 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:04:16,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1643442.0, ans=0.0 2023-06-26 20:04:37,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1643502.0, ans=0.125 2023-06-26 20:04:53,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1643562.0, ans=0.07 2023-06-26 20:05:38,157 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:05:54,893 INFO [train.py:996] (1/4) Epoch 9, batch 30000, loss[loss=0.1841, simple_loss=0.2843, pruned_loss=0.0419, over 21725.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3005, pruned_loss=0.07302, over 4275215.53 frames. ], batch size: 298, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 20:05:54,893 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 20:06:10,822 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.8793, 2.1187, 4.1880, 2.1706], device='cuda:1') 2023-06-26 20:06:15,949 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2518, simple_loss=0.3443, pruned_loss=0.07961, over 1796401.00 frames. 2023-06-26 20:06:15,951 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 20:06:42,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1643802.0, ans=0.125 2023-06-26 20:06:42,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. 
limit=22.5 2023-06-26 20:06:44,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1643802.0, ans=0.125 2023-06-26 20:07:19,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1643862.0, ans=0.1 2023-06-26 20:07:22,150 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.541e+02 6.693e+02 9.863e+02 1.324e+03 2.517e+03, threshold=1.973e+03, percent-clipped=14.0 2023-06-26 20:07:33,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1643922.0, ans=0.0 2023-06-26 20:07:48,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1643982.0, ans=0.0 2023-06-26 20:08:08,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1644042.0, ans=0.0 2023-06-26 20:08:09,887 INFO [train.py:996] (1/4) Epoch 9, batch 30050, loss[loss=0.2872, simple_loss=0.3953, pruned_loss=0.08956, over 21498.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3059, pruned_loss=0.07104, over 4273968.08 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:08:26,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-26 20:08:53,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-26 20:09:24,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1644222.0, ans=0.125 2023-06-26 20:10:03,804 INFO [train.py:996] (1/4) Epoch 9, batch 30100, loss[loss=0.2005, simple_loss=0.2662, pruned_loss=0.06743, over 21787.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3014, pruned_loss=0.07002, over 4267304.03 frames. ], batch size: 317, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:10:36,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1644402.0, ans=0.1 2023-06-26 20:11:07,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.020e+02 5.633e+02 9.341e+02 1.482e+03 2.871e+03, threshold=1.868e+03, percent-clipped=12.0 2023-06-26 20:11:43,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1644582.0, ans=0.125 2023-06-26 20:11:53,759 INFO [train.py:996] (1/4) Epoch 9, batch 30150, loss[loss=0.2181, simple_loss=0.2922, pruned_loss=0.07197, over 21697.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.299, pruned_loss=0.07163, over 4250792.39 frames. ], batch size: 298, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:13:01,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1644762.0, ans=0.125 2023-06-26 20:13:02,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.81 vs. 
limit=15.0 2023-06-26 20:13:14,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1644822.0, ans=0.2 2023-06-26 20:13:26,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1644882.0, ans=0.07 2023-06-26 20:13:26,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1644882.0, ans=0.125 2023-06-26 20:13:50,899 INFO [train.py:996] (1/4) Epoch 9, batch 30200, loss[loss=0.2056, simple_loss=0.3136, pruned_loss=0.04881, over 21662.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3016, pruned_loss=0.07031, over 4259932.78 frames. ], batch size: 389, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:15:00,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1645122.0, ans=0.2 2023-06-26 20:15:01,602 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.601e+02 6.016e+02 8.945e+02 1.496e+03 2.296e+03, threshold=1.789e+03, percent-clipped=11.0 2023-06-26 20:15:08,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-26 20:15:34,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1645182.0, ans=0.125 2023-06-26 20:15:42,555 INFO [train.py:996] (1/4) Epoch 9, batch 30250, loss[loss=0.2323, simple_loss=0.302, pruned_loss=0.08128, over 19983.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3076, pruned_loss=0.07198, over 4260207.53 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:15:56,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1645242.0, ans=0.0 2023-06-26 20:17:37,239 INFO [train.py:996] (1/4) Epoch 9, batch 30300, loss[loss=0.2048, simple_loss=0.2674, pruned_loss=0.07111, over 20123.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3044, pruned_loss=0.07172, over 4262340.53 frames. ], batch size: 704, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:18:47,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.854e+02 6.241e+02 9.150e+02 1.357e+03 2.520e+03, threshold=1.830e+03, percent-clipped=12.0 2023-06-26 20:19:07,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.79 vs. limit=22.5 2023-06-26 20:19:35,121 INFO [train.py:996] (1/4) Epoch 9, batch 30350, loss[loss=0.288, simple_loss=0.3675, pruned_loss=0.1043, over 21537.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3027, pruned_loss=0.072, over 4264837.03 frames. ], batch size: 473, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:19:54,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1645902.0, ans=0.0 2023-06-26 20:19:58,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1645902.0, ans=0.125 2023-06-26 20:20:10,313 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.05 vs. 
limit=15.0 2023-06-26 20:20:36,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1646022.0, ans=0.125 2023-06-26 20:20:43,681 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:20:46,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1646082.0, ans=0.0 2023-06-26 20:20:58,734 INFO [train.py:996] (1/4) Epoch 9, batch 30400, loss[loss=0.2039, simple_loss=0.2554, pruned_loss=0.07616, over 20295.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2961, pruned_loss=0.07066, over 4257004.96 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 32.0 2023-06-26 20:21:19,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1646202.0, ans=0.04949747468305833 2023-06-26 20:21:55,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.123e+02 6.385e+02 9.749e+02 1.472e+03 9.200e+03, threshold=1.950e+03, percent-clipped=15.0 2023-06-26 20:22:15,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1646382.0, ans=0.125 2023-06-26 20:22:17,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1646382.0, ans=0.125 2023-06-26 20:22:29,101 INFO [train.py:996] (1/4) Epoch 9, batch 30450, loss[loss=0.2523, simple_loss=0.3575, pruned_loss=0.07355, over 19791.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2969, pruned_loss=0.07039, over 4198872.32 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:22:30,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-26 20:22:44,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1646502.0, ans=0.1 2023-06-26 20:23:26,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-26 20:23:36,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1646682.0, ans=0.125 2023-06-26 20:25:55,248 INFO [train.py:996] (1/4) Epoch 10, batch 0, loss[loss=0.2087, simple_loss=0.2715, pruned_loss=0.07291, over 21344.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2715, pruned_loss=0.07291, over 21344.00 frames. ], batch size: 177, lr: 3.02e-03, grad_scale: 32.0 2023-06-26 20:25:55,249 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 20:26:11,819 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2437, simple_loss=0.3472, pruned_loss=0.0701, over 1796401.00 frames. 
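Two arithmetic relationships can be read directly off the numbers logged in this section, and the short Python sketch below checks them against values copied from the entries above. It is an illustrative back-of-the-envelope script, not the project's training code: the function names, and the reading of the five grad-norm figures as (min, 25%, median, 75%, max), are assumptions made here. What the log itself shows is that every reported 'loss' value is consistent with 0.5 * simple_loss + pruned_loss, and that each 'threshold' in the optim.py lines equals Clipping_scale (2.0) times the middle grad-norm figure.

# Back-of-the-envelope check of two relationships the logged numbers above appear
# to follow. Illustrative sketch only; function names and the interpretation of
# the five grad-norm figures as (min, 25%, median, 75%, max) are assumptions,
# not the project's actual implementation.

def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    # The logged 'loss' values are consistent with
    # loss = simple_loss_scale * simple_loss + pruned_loss (scale inferred as 0.5
    # from the numbers in this log, not read from code).
    return simple_loss_scale * simple_loss + pruned_loss

def clip_threshold(grad_norm_quartiles, clipping_scale=2.0):
    # Each 'threshold' in the optim.py lines matches Clipping_scale times the
    # middle (median) of the five logged grad-norm quartile values.
    return clipping_scale * grad_norm_quartiles[2]

if __name__ == "__main__":
    # Epoch 10, batch 0 validation entry above: simple_loss=0.3472, pruned_loss=0.0701
    print(combined_loss(0.3472, 0.0701))                          # ~0.2437, as logged
    # Grad-norm quartiles 3.809e+02 5.329e+02 8.203e+02 1.175e+03 2.946e+03 above
    print(clip_threshold([380.9, 532.9, 820.3, 1175.0, 2946.0]))  # ~1641, as logged

The same check holds for the training-batch entries (for example, 0.5 * 0.3312 + 0.08359 ~= 0.2492 in the Epoch 10, batch 1200 line), so it is not specific to the validation passes.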
2023-06-26 20:26:11,820 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 20:26:29,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1646712.0, ans=0.125 2023-06-26 20:26:42,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1646772.0, ans=0.05 2023-06-26 20:26:48,205 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-26 20:27:01,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1646832.0, ans=0.125 2023-06-26 20:27:35,022 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.973e+02 1.183e+03 2.082e+03 3.728e+03 9.226e+03, threshold=4.165e+03, percent-clipped=55.0 2023-06-26 20:27:57,612 INFO [train.py:996] (1/4) Epoch 10, batch 50, loss[loss=0.2617, simple_loss=0.3419, pruned_loss=0.09071, over 21844.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3069, pruned_loss=0.07067, over 946845.01 frames. ], batch size: 118, lr: 3.02e-03, grad_scale: 16.0 2023-06-26 20:29:24,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1647192.0, ans=0.125 2023-06-26 20:29:39,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1647252.0, ans=0.0 2023-06-26 20:29:41,861 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-26 20:29:44,272 INFO [train.py:996] (1/4) Epoch 10, batch 100, loss[loss=0.3164, simple_loss=0.3737, pruned_loss=0.1296, over 21347.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3189, pruned_loss=0.07354, over 1677237.00 frames. ], batch size: 507, lr: 3.02e-03, grad_scale: 16.0 2023-06-26 20:30:08,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1647372.0, ans=0.125 2023-06-26 20:30:15,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1647372.0, ans=0.2 2023-06-26 20:31:06,601 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.835e+02 5.191e+02 6.971e+02 9.608e+02 1.975e+03, threshold=1.394e+03, percent-clipped=0.0 2023-06-26 20:31:12,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1647552.0, ans=0.125 2023-06-26 20:31:28,450 INFO [train.py:996] (1/4) Epoch 10, batch 150, loss[loss=0.2421, simple_loss=0.3364, pruned_loss=0.07392, over 21739.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3209, pruned_loss=0.07325, over 2252833.25 frames. 
], batch size: 351, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:31:29,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1647612.0, ans=0.2 2023-06-26 20:31:53,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1647672.0, ans=0.0 2023-06-26 20:32:45,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1647792.0, ans=0.125 2023-06-26 20:32:50,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1647792.0, ans=0.0 2023-06-26 20:33:14,187 INFO [train.py:996] (1/4) Epoch 10, batch 200, loss[loss=0.2321, simple_loss=0.3188, pruned_loss=0.07273, over 21412.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3191, pruned_loss=0.07154, over 2706220.41 frames. ], batch size: 131, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:33:18,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1647912.0, ans=0.05 2023-06-26 20:33:52,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1647972.0, ans=0.125 2023-06-26 20:33:53,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1647972.0, ans=0.2 2023-06-26 20:33:54,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.67 vs. limit=10.0 2023-06-26 20:34:02,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-26 20:34:35,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1648092.0, ans=0.1 2023-06-26 20:34:39,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.906e+02 5.339e+02 8.333e+02 1.175e+03 2.265e+03, threshold=1.667e+03, percent-clipped=16.0 2023-06-26 20:35:01,891 INFO [train.py:996] (1/4) Epoch 10, batch 250, loss[loss=0.2135, simple_loss=0.2928, pruned_loss=0.06709, over 21767.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3122, pruned_loss=0.07077, over 3054562.90 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:35:23,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.92 vs. 
limit=22.5 2023-06-26 20:35:46,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1648332.0, ans=0.0 2023-06-26 20:35:48,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1648332.0, ans=0.1 2023-06-26 20:35:48,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1648332.0, ans=0.5 2023-06-26 20:36:14,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1648392.0, ans=0.0 2023-06-26 20:36:28,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1648392.0, ans=0.125 2023-06-26 20:36:33,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1648392.0, ans=0.125 2023-06-26 20:36:54,054 INFO [train.py:996] (1/4) Epoch 10, batch 300, loss[loss=0.213, simple_loss=0.3092, pruned_loss=0.05844, over 21829.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3087, pruned_loss=0.07143, over 3327125.90 frames. ], batch size: 332, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:36:54,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1648512.0, ans=0.0 2023-06-26 20:37:07,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1648512.0, ans=0.1 2023-06-26 20:37:10,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1648512.0, ans=0.0 2023-06-26 20:37:50,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1648632.0, ans=0.125 2023-06-26 20:38:03,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1648692.0, ans=0.2 2023-06-26 20:38:17,611 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.698e+02 5.791e+02 8.130e+02 1.304e+03 2.175e+03, threshold=1.626e+03, percent-clipped=9.0 2023-06-26 20:38:35,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1648752.0, ans=0.125 2023-06-26 20:38:40,488 INFO [train.py:996] (1/4) Epoch 10, batch 350, loss[loss=0.2121, simple_loss=0.2806, pruned_loss=0.07173, over 21491.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3005, pruned_loss=0.07052, over 3529990.31 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:38:59,504 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:39:47,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. 
limit=15.0 2023-06-26 20:39:52,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1648992.0, ans=0.125 2023-06-26 20:40:11,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1649052.0, ans=0.0 2023-06-26 20:40:24,650 INFO [train.py:996] (1/4) Epoch 10, batch 400, loss[loss=0.1801, simple_loss=0.271, pruned_loss=0.04455, over 21208.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2953, pruned_loss=0.06849, over 3695301.51 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 20:41:38,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-26 20:41:53,086 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 7.996e+02 1.335e+03 1.838e+03 3.332e+03, threshold=2.670e+03, percent-clipped=35.0 2023-06-26 20:42:07,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1649352.0, ans=0.125 2023-06-26 20:42:11,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1649352.0, ans=0.04949747468305833 2023-06-26 20:42:14,172 INFO [train.py:996] (1/4) Epoch 10, batch 450, loss[loss=0.2076, simple_loss=0.2728, pruned_loss=0.07118, over 21578.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2916, pruned_loss=0.06749, over 3826303.29 frames. ], batch size: 391, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:42:46,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1649472.0, ans=0.04949747468305833 2023-06-26 20:43:00,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1649532.0, ans=0.125 2023-06-26 20:43:34,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1649592.0, ans=0.2 2023-06-26 20:43:41,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1649652.0, ans=0.0 2023-06-26 20:43:59,397 INFO [train.py:996] (1/4) Epoch 10, batch 500, loss[loss=0.2424, simple_loss=0.3525, pruned_loss=0.0662, over 21666.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2919, pruned_loss=0.06696, over 3925824.17 frames. ], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:44:17,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1649712.0, ans=0.0 2023-06-26 20:45:06,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1649832.0, ans=0.1 2023-06-26 20:45:24,854 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 9.001e+02 1.327e+03 2.089e+03 4.282e+03, threshold=2.653e+03, percent-clipped=10.0 2023-06-26 20:45:37,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1649952.0, ans=0.1 2023-06-26 20:45:43,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. 
limit=15.0 2023-06-26 20:45:51,432 INFO [train.py:996] (1/4) Epoch 10, batch 550, loss[loss=0.1863, simple_loss=0.2779, pruned_loss=0.04736, over 21738.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2934, pruned_loss=0.06627, over 4004361.87 frames. ], batch size: 351, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:45:55,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1650012.0, ans=0.0 2023-06-26 20:45:55,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1650012.0, ans=0.2 2023-06-26 20:46:03,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1650012.0, ans=0.125 2023-06-26 20:46:20,349 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-26 20:46:20,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. limit=6.0 2023-06-26 20:47:17,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1650252.0, ans=22.5 2023-06-26 20:47:28,807 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:47:33,161 INFO [train.py:996] (1/4) Epoch 10, batch 600, loss[loss=0.1942, simple_loss=0.2691, pruned_loss=0.05964, over 21858.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2958, pruned_loss=0.06545, over 4069666.22 frames. ], batch size: 107, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:47:44,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1650312.0, ans=0.125 2023-06-26 20:48:03,396 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-26 20:48:50,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-26 20:48:56,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1650492.0, ans=0.125 2023-06-26 20:48:58,840 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.973e+02 6.857e+02 1.039e+03 1.439e+03 2.641e+03, threshold=2.079e+03, percent-clipped=0.0 2023-06-26 20:49:06,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1650552.0, ans=0.125 2023-06-26 20:49:09,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1650552.0, ans=0.2 2023-06-26 20:49:19,454 INFO [train.py:996] (1/4) Epoch 10, batch 650, loss[loss=0.2056, simple_loss=0.3027, pruned_loss=0.05423, over 21644.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2989, pruned_loss=0.06561, over 4117450.40 frames. 
], batch size: 263, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:49:33,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1650612.0, ans=0.1 2023-06-26 20:50:11,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1650732.0, ans=0.125 2023-06-26 20:50:40,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1650792.0, ans=0.125 2023-06-26 20:50:47,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1650852.0, ans=0.125 2023-06-26 20:50:57,830 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:51:00,882 INFO [train.py:996] (1/4) Epoch 10, batch 700, loss[loss=0.2789, simple_loss=0.378, pruned_loss=0.08995, over 21541.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.3026, pruned_loss=0.0668, over 4158713.53 frames. ], batch size: 471, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:51:37,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1650972.0, ans=0.07 2023-06-26 20:52:25,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1651092.0, ans=0.2 2023-06-26 20:52:26,593 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 6.227e+02 9.890e+02 1.482e+03 2.866e+03, threshold=1.978e+03, percent-clipped=9.0 2023-06-26 20:52:27,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1651092.0, ans=0.5 2023-06-26 20:52:39,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-26 20:52:40,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1651152.0, ans=0.125 2023-06-26 20:52:47,462 INFO [train.py:996] (1/4) Epoch 10, batch 750, loss[loss=0.2007, simple_loss=0.2706, pruned_loss=0.06546, over 21825.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.3014, pruned_loss=0.06719, over 4183781.38 frames. ], batch size: 282, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:53:34,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1651332.0, ans=0.125 2023-06-26 20:53:39,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1651332.0, ans=0.1 2023-06-26 20:54:21,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1651452.0, ans=0.0 2023-06-26 20:54:35,024 INFO [train.py:996] (1/4) Epoch 10, batch 800, loss[loss=0.2157, simple_loss=0.2853, pruned_loss=0.07303, over 21712.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2971, pruned_loss=0.06732, over 4193901.25 frames. 
], batch size: 414, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 20:54:52,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1651512.0, ans=0.2 2023-06-26 20:55:00,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1651572.0, ans=0.125 2023-06-26 20:55:57,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-26 20:56:04,634 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.663e+02 5.824e+02 9.070e+02 1.319e+03 2.505e+03, threshold=1.814e+03, percent-clipped=4.0 2023-06-26 20:56:05,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1651752.0, ans=0.125 2023-06-26 20:56:13,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1651752.0, ans=0.0 2023-06-26 20:56:13,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1651752.0, ans=0.125 2023-06-26 20:56:23,622 INFO [train.py:996] (1/4) Epoch 10, batch 850, loss[loss=0.225, simple_loss=0.3, pruned_loss=0.07504, over 21871.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2924, pruned_loss=0.06775, over 4220967.37 frames. ], batch size: 124, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:57:20,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651932.0, ans=0.1 2023-06-26 20:57:22,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-26 20:57:54,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1652052.0, ans=0.125 2023-06-26 20:58:03,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-26 20:58:18,436 INFO [train.py:996] (1/4) Epoch 10, batch 900, loss[loss=0.1846, simple_loss=0.2745, pruned_loss=0.04735, over 21840.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2889, pruned_loss=0.0667, over 4242938.56 frames. ], batch size: 316, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:58:19,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1652112.0, ans=0.1 2023-06-26 20:58:19,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-26 20:58:23,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1652112.0, ans=0.0 2023-06-26 20:59:42,412 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.674e+02 4.955e+02 6.528e+02 1.022e+03 3.124e+03, threshold=1.306e+03, percent-clipped=4.0 2023-06-26 20:59:58,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. 
limit=15.0 2023-06-26 21:00:07,561 INFO [train.py:996] (1/4) Epoch 10, batch 950, loss[loss=0.2008, simple_loss=0.2611, pruned_loss=0.07028, over 21577.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2859, pruned_loss=0.06676, over 4257542.16 frames. ], batch size: 263, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:00:29,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1652472.0, ans=0.2 2023-06-26 21:00:31,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-26 21:01:46,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1652652.0, ans=0.2 2023-06-26 21:01:48,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1652652.0, ans=0.2 2023-06-26 21:01:52,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1652652.0, ans=0.0 2023-06-26 21:01:56,948 INFO [train.py:996] (1/4) Epoch 10, batch 1000, loss[loss=0.2119, simple_loss=0.2761, pruned_loss=0.07387, over 21428.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2863, pruned_loss=0.06671, over 4265082.99 frames. ], batch size: 389, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:03:23,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1652892.0, ans=0.0 2023-06-26 21:03:31,697 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.126e+02 7.237e+02 1.217e+03 1.852e+03 3.276e+03, threshold=2.433e+03, percent-clipped=47.0 2023-06-26 21:03:56,357 INFO [train.py:996] (1/4) Epoch 10, batch 1050, loss[loss=0.1645, simple_loss=0.2594, pruned_loss=0.03484, over 21779.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2864, pruned_loss=0.06667, over 4266966.86 frames. ], batch size: 282, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:04:28,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-26 21:05:20,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1653192.0, ans=0.125 2023-06-26 21:05:33,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.05 vs. limit=22.5 2023-06-26 21:05:46,786 INFO [train.py:996] (1/4) Epoch 10, batch 1100, loss[loss=0.2065, simple_loss=0.2751, pruned_loss=0.0689, over 21278.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2859, pruned_loss=0.06611, over 4275339.09 frames. ], batch size: 159, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:06:21,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1653372.0, ans=10.0 2023-06-26 21:06:44,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1653432.0, ans=0.0 2023-06-26 21:07:14,101 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.610e+02 5.858e+02 8.624e+02 1.218e+03 2.996e+03, threshold=1.725e+03, percent-clipped=2.0 2023-06-26 21:07:38,290 INFO [train.py:996] (1/4) Epoch 10, batch 1150, loss[loss=0.196, simple_loss=0.2786, pruned_loss=0.0567, over 21686.00 frames. 
], tot_loss[loss=0.2102, simple_loss=0.2871, pruned_loss=0.06661, over 4279869.58 frames. ], batch size: 230, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:07:51,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1653612.0, ans=0.0 2023-06-26 21:08:47,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1653732.0, ans=0.2 2023-06-26 21:09:05,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1653792.0, ans=0.2 2023-06-26 21:09:06,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-26 21:09:20,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-26 21:09:36,629 INFO [train.py:996] (1/4) Epoch 10, batch 1200, loss[loss=0.2492, simple_loss=0.3312, pruned_loss=0.08359, over 21804.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2877, pruned_loss=0.0668, over 4270629.91 frames. ], batch size: 124, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:09:42,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1653912.0, ans=0.2 2023-06-26 21:10:51,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1654092.0, ans=0.1 2023-06-26 21:11:00,197 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.823e+02 5.719e+02 8.661e+02 1.239e+03 3.080e+03, threshold=1.732e+03, percent-clipped=10.0 2023-06-26 21:11:25,980 INFO [train.py:996] (1/4) Epoch 10, batch 1250, loss[loss=0.2572, simple_loss=0.3409, pruned_loss=0.08678, over 21472.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.292, pruned_loss=0.06793, over 4272125.69 frames. ], batch size: 131, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:12:21,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=22.5 2023-06-26 21:13:16,688 INFO [train.py:996] (1/4) Epoch 10, batch 1300, loss[loss=0.2079, simple_loss=0.2926, pruned_loss=0.06164, over 21822.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2932, pruned_loss=0.06785, over 4271320.83 frames. ], batch size: 282, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:14:25,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.94 vs. limit=10.0 2023-06-26 21:14:43,544 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.194e+02 7.398e+02 1.015e+03 1.513e+03 3.841e+03, threshold=2.029e+03, percent-clipped=13.0 2023-06-26 21:14:45,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1654752.0, ans=0.0 2023-06-26 21:15:06,108 INFO [train.py:996] (1/4) Epoch 10, batch 1350, loss[loss=0.2267, simple_loss=0.2979, pruned_loss=0.07774, over 21436.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2942, pruned_loss=0.06882, over 4278461.43 frames. 
], batch size: 211, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:15:23,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=22.5 2023-06-26 21:15:28,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=1654872.0, ans=15.0 2023-06-26 21:15:29,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1654872.0, ans=0.05 2023-06-26 21:15:33,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1654872.0, ans=0.125 2023-06-26 21:16:14,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1654992.0, ans=0.0 2023-06-26 21:16:25,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1654992.0, ans=0.2 2023-06-26 21:16:44,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=22.5 2023-06-26 21:17:00,114 INFO [train.py:996] (1/4) Epoch 10, batch 1400, loss[loss=0.1956, simple_loss=0.2779, pruned_loss=0.05662, over 21792.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2934, pruned_loss=0.06876, over 4282944.03 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:17:10,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=15.0 2023-06-26 21:17:26,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1655172.0, ans=0.2 2023-06-26 21:17:37,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1655172.0, ans=0.0 2023-06-26 21:17:47,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1655232.0, ans=0.125 2023-06-26 21:18:25,053 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.941e+02 5.863e+02 9.944e+02 1.473e+03 3.016e+03, threshold=1.989e+03, percent-clipped=13.0 2023-06-26 21:18:31,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1655352.0, ans=0.1 2023-06-26 21:18:48,212 INFO [train.py:996] (1/4) Epoch 10, batch 1450, loss[loss=0.2047, simple_loss=0.2882, pruned_loss=0.06053, over 21378.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2952, pruned_loss=0.0697, over 4285969.65 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:19:24,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-26 21:20:03,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1655592.0, ans=0.1 2023-06-26 21:20:34,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1655652.0, ans=0.2 2023-06-26 21:20:36,874 INFO [train.py:996] (1/4) Epoch 10, batch 1500, loss[loss=0.2038, simple_loss=0.2842, pruned_loss=0.06169, over 21097.00 frames. 
], tot_loss[loss=0.2172, simple_loss=0.2949, pruned_loss=0.06981, over 4287714.22 frames. ], batch size: 607, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:20:42,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1655712.0, ans=0.0 2023-06-26 21:22:03,792 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.679e+02 5.579e+02 7.007e+02 1.027e+03 2.656e+03, threshold=1.401e+03, percent-clipped=4.0 2023-06-26 21:22:29,781 INFO [train.py:996] (1/4) Epoch 10, batch 1550, loss[loss=0.1917, simple_loss=0.2662, pruned_loss=0.05856, over 21440.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.293, pruned_loss=0.06928, over 4278168.33 frames. ], batch size: 131, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:22:34,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1656012.0, ans=0.0 2023-06-26 21:23:52,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1656192.0, ans=0.125 2023-06-26 21:24:18,730 INFO [train.py:996] (1/4) Epoch 10, batch 1600, loss[loss=0.2128, simple_loss=0.2895, pruned_loss=0.06804, over 21576.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2935, pruned_loss=0.07012, over 4275271.98 frames. ], batch size: 212, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:25:07,059 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:25:50,450 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.007e+02 6.112e+02 1.058e+03 1.502e+03 3.121e+03, threshold=2.116e+03, percent-clipped=30.0 2023-06-26 21:26:07,888 INFO [train.py:996] (1/4) Epoch 10, batch 1650, loss[loss=0.2272, simple_loss=0.2903, pruned_loss=0.082, over 21351.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2925, pruned_loss=0.06983, over 4277766.33 frames. ], batch size: 176, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:26:16,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-26 21:26:26,869 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-26 21:26:33,555 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-26 21:26:52,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=12.0 2023-06-26 21:27:03,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1656732.0, ans=0.125 2023-06-26 21:27:04,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656732.0, ans=0.1 2023-06-26 21:27:13,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1656732.0, ans=0.125 2023-06-26 21:27:20,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1656792.0, ans=0.0 2023-06-26 21:27:32,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-26 21:27:51,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1656852.0, ans=0.0 2023-06-26 21:27:55,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1656852.0, ans=0.1 2023-06-26 21:28:03,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1656912.0, ans=0.0 2023-06-26 21:28:04,345 INFO [train.py:996] (1/4) Epoch 10, batch 1700, loss[loss=0.209, simple_loss=0.3056, pruned_loss=0.05621, over 19896.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.292, pruned_loss=0.06849, over 4272479.28 frames. ], batch size: 702, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:28:35,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-26 21:28:40,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1656972.0, ans=0.125 2023-06-26 21:29:00,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-26 21:29:11,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-26 21:29:24,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1657092.0, ans=0.09899494936611666 2023-06-26 21:29:40,235 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.858e+02 6.519e+02 9.043e+02 1.348e+03 2.914e+03, threshold=1.809e+03, percent-clipped=3.0 2023-06-26 21:29:56,217 INFO [train.py:996] (1/4) Epoch 10, batch 1750, loss[loss=0.2207, simple_loss=0.297, pruned_loss=0.07217, over 21432.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2931, pruned_loss=0.06855, over 4263952.85 frames. 
], batch size: 548, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:30:40,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1657272.0, ans=0.125 2023-06-26 21:31:11,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1657392.0, ans=0.125 2023-06-26 21:31:35,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1657452.0, ans=0.1 2023-06-26 21:31:45,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1657452.0, ans=0.125 2023-06-26 21:31:45,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1657452.0, ans=0.125 2023-06-26 21:31:54,521 INFO [train.py:996] (1/4) Epoch 10, batch 1800, loss[loss=0.2163, simple_loss=0.2869, pruned_loss=0.07281, over 21338.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2928, pruned_loss=0.06752, over 4262233.83 frames. ], batch size: 176, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:31:55,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1657512.0, ans=0.125 2023-06-26 21:32:03,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1657512.0, ans=0.125 2023-06-26 21:32:47,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1657632.0, ans=0.125 2023-06-26 21:33:24,906 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.658e+02 9.190e+02 1.767e+03 4.020e+03, threshold=1.838e+03, percent-clipped=23.0 2023-06-26 21:33:43,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1657812.0, ans=0.125 2023-06-26 21:33:44,330 INFO [train.py:996] (1/4) Epoch 10, batch 1850, loss[loss=0.1778, simple_loss=0.2426, pruned_loss=0.0565, over 16469.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.294, pruned_loss=0.06596, over 4264861.04 frames. ], batch size: 64, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:34:26,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1657872.0, ans=0.125 2023-06-26 21:34:38,072 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.13 vs. limit=10.0 2023-06-26 21:35:01,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1657992.0, ans=0.1 2023-06-26 21:35:27,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1658052.0, ans=0.125 2023-06-26 21:35:32,234 INFO [train.py:996] (1/4) Epoch 10, batch 1900, loss[loss=0.1815, simple_loss=0.2531, pruned_loss=0.05491, over 21666.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2943, pruned_loss=0.06534, over 4274141.80 frames. 
], batch size: 247, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:35:39,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1658112.0, ans=0.0 2023-06-26 21:36:15,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1658172.0, ans=0.125 2023-06-26 21:36:38,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1658232.0, ans=0.125 2023-06-26 21:36:38,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1658232.0, ans=0.125 2023-06-26 21:37:08,215 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.968e+02 6.601e+02 8.691e+02 1.330e+03 2.480e+03, threshold=1.738e+03, percent-clipped=9.0 2023-06-26 21:37:22,014 INFO [train.py:996] (1/4) Epoch 10, batch 1950, loss[loss=0.1801, simple_loss=0.2641, pruned_loss=0.04808, over 21085.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2907, pruned_loss=0.06559, over 4261630.57 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:37:22,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1658412.0, ans=0.1 2023-06-26 21:37:34,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.27 vs. limit=10.0 2023-06-26 21:37:44,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1658472.0, ans=0.0 2023-06-26 21:38:21,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1658532.0, ans=0.125 2023-06-26 21:38:37,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1658592.0, ans=0.125 2023-06-26 21:39:11,007 INFO [train.py:996] (1/4) Epoch 10, batch 2000, loss[loss=0.2079, simple_loss=0.2691, pruned_loss=0.07334, over 21562.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2849, pruned_loss=0.06391, over 4262980.55 frames. ], batch size: 414, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:39:37,129 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.60 vs. limit=15.0 2023-06-26 21:40:23,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1658892.0, ans=0.2 2023-06-26 21:40:46,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 7.434e+02 1.051e+03 1.825e+03 4.116e+03, threshold=2.102e+03, percent-clipped=26.0 2023-06-26 21:40:48,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1658952.0, ans=0.2 2023-06-26 21:41:00,341 INFO [train.py:996] (1/4) Epoch 10, batch 2050, loss[loss=0.2235, simple_loss=0.2996, pruned_loss=0.07365, over 21852.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2875, pruned_loss=0.06371, over 4272996.51 frames. 
], batch size: 332, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:41:11,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1659012.0, ans=0.0 2023-06-26 21:41:37,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1659072.0, ans=0.025 2023-06-26 21:41:37,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659072.0, ans=0.1 2023-06-26 21:41:46,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1659072.0, ans=0.125 2023-06-26 21:41:55,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1659132.0, ans=0.2 2023-06-26 21:42:07,451 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-26 21:42:26,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1659192.0, ans=0.125 2023-06-26 21:42:31,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1659252.0, ans=0.0 2023-06-26 21:42:49,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1659252.0, ans=0.0 2023-06-26 21:42:50,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1659252.0, ans=0.2 2023-06-26 21:42:53,055 INFO [train.py:996] (1/4) Epoch 10, batch 2100, loss[loss=0.2146, simple_loss=0.2895, pruned_loss=0.06984, over 21795.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.291, pruned_loss=0.06419, over 4270636.27 frames. ], batch size: 371, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:44:22,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.975e+02 6.453e+02 1.021e+03 1.329e+03 2.280e+03, threshold=2.042e+03, percent-clipped=5.0 2023-06-26 21:44:41,180 INFO [train.py:996] (1/4) Epoch 10, batch 2150, loss[loss=0.2276, simple_loss=0.2925, pruned_loss=0.08137, over 21572.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2911, pruned_loss=0.06502, over 4250834.59 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:44:50,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1659612.0, ans=0.125 2023-06-26 21:45:10,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1659672.0, ans=0.0 2023-06-26 21:45:48,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-06-26 21:46:29,853 INFO [train.py:996] (1/4) Epoch 10, batch 2200, loss[loss=0.1701, simple_loss=0.2331, pruned_loss=0.05357, over 15912.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2937, pruned_loss=0.06595, over 4256144.02 frames. 
], batch size: 60, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:47:28,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1660032.0, ans=0.125 2023-06-26 21:48:00,546 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.021e+02 5.718e+02 8.930e+02 1.284e+03 2.710e+03, threshold=1.786e+03, percent-clipped=5.0 2023-06-26 21:48:01,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1660152.0, ans=0.125 2023-06-26 21:48:09,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1660152.0, ans=0.0 2023-06-26 21:48:17,692 INFO [train.py:996] (1/4) Epoch 10, batch 2250, loss[loss=0.2057, simple_loss=0.2828, pruned_loss=0.06428, over 21532.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2894, pruned_loss=0.06525, over 4257653.47 frames. ], batch size: 389, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:48:36,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1660212.0, ans=0.2 2023-06-26 21:49:04,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1660332.0, ans=0.1 2023-06-26 21:49:24,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-26 21:49:28,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1660392.0, ans=0.0 2023-06-26 21:50:04,941 INFO [train.py:996] (1/4) Epoch 10, batch 2300, loss[loss=0.1891, simple_loss=0.2645, pruned_loss=0.05681, over 22037.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2861, pruned_loss=0.06506, over 4267374.78 frames. ], batch size: 119, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:50:05,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1660512.0, ans=0.125 2023-06-26 21:50:14,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1660512.0, ans=0.125 2023-06-26 21:50:52,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1660632.0, ans=0.2 2023-06-26 21:51:12,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.77 vs. limit=15.0 2023-06-26 21:51:13,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1660692.0, ans=0.0 2023-06-26 21:51:16,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1660692.0, ans=0.125 2023-06-26 21:51:40,837 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 6.340e+02 1.061e+03 1.425e+03 3.450e+03, threshold=2.122e+03, percent-clipped=15.0 2023-06-26 21:51:52,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1660812.0, ans=0.0 2023-06-26 21:51:52,954 INFO [train.py:996] (1/4) Epoch 10, batch 2350, loss[loss=0.2193, simple_loss=0.2927, pruned_loss=0.073, over 21683.00 frames. 
], tot_loss[loss=0.2093, simple_loss=0.2854, pruned_loss=0.06664, over 4267053.00 frames. ], batch size: 332, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:52:43,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1660932.0, ans=0.125 2023-06-26 21:53:04,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-26 21:53:08,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1660992.0, ans=0.125 2023-06-26 21:53:10,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.01 vs. limit=5.0 2023-06-26 21:53:43,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1661052.0, ans=0.0 2023-06-26 21:53:46,716 INFO [train.py:996] (1/4) Epoch 10, batch 2400, loss[loss=0.2378, simple_loss=0.3254, pruned_loss=0.0751, over 21828.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2867, pruned_loss=0.06835, over 4260789.17 frames. ], batch size: 124, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:55:11,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1661352.0, ans=0.125 2023-06-26 21:55:12,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1661352.0, ans=0.2 2023-06-26 21:55:17,525 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.139e+02 8.857e+02 1.254e+03 1.714e+03 3.828e+03, threshold=2.507e+03, percent-clipped=13.0 2023-06-26 21:55:35,107 INFO [train.py:996] (1/4) Epoch 10, batch 2450, loss[loss=0.1852, simple_loss=0.2547, pruned_loss=0.05782, over 21729.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2906, pruned_loss=0.07012, over 4262748.42 frames. ], batch size: 112, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:55:35,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1661412.0, ans=0.1 2023-06-26 21:55:57,215 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:56:27,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1661532.0, ans=0.1 2023-06-26 21:56:55,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1661592.0, ans=0.125 2023-06-26 21:57:22,908 INFO [train.py:996] (1/4) Epoch 10, batch 2500, loss[loss=0.2301, simple_loss=0.3269, pruned_loss=0.06669, over 21317.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2889, pruned_loss=0.0692, over 4261671.97 frames. 
], batch size: 548, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:58:15,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1661832.0, ans=0.2 2023-06-26 21:58:39,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1661892.0, ans=0.125 2023-06-26 21:58:52,889 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.949e+02 5.442e+02 7.727e+02 1.360e+03 2.872e+03, threshold=1.545e+03, percent-clipped=3.0 2023-06-26 21:59:16,975 INFO [train.py:996] (1/4) Epoch 10, batch 2550, loss[loss=0.1977, simple_loss=0.265, pruned_loss=0.06518, over 21271.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2893, pruned_loss=0.06894, over 4249603.73 frames. ], batch size: 176, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:59:21,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1662012.0, ans=0.2 2023-06-26 21:59:34,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1662072.0, ans=0.1 2023-06-26 21:59:36,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1662072.0, ans=0.2 2023-06-26 21:59:38,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1662072.0, ans=0.0 2023-06-26 22:00:36,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1662252.0, ans=0.0 2023-06-26 22:00:38,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662252.0, ans=0.1 2023-06-26 22:00:38,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-26 22:00:43,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1662252.0, ans=0.0 2023-06-26 22:00:58,655 INFO [train.py:996] (1/4) Epoch 10, batch 2600, loss[loss=0.2041, simple_loss=0.2784, pruned_loss=0.06483, over 21298.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2911, pruned_loss=0.07048, over 4264750.65 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:00:59,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1662312.0, ans=0.125 2023-06-26 22:01:24,671 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. 
limit=15.0 2023-06-26 22:01:29,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1662372.0, ans=0.125 2023-06-26 22:02:05,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1662492.0, ans=0.125 2023-06-26 22:02:30,506 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.653e+02 5.932e+02 7.910e+02 1.183e+03 2.273e+03, threshold=1.582e+03, percent-clipped=10.0 2023-06-26 22:02:45,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1662552.0, ans=0.2 2023-06-26 22:02:48,660 INFO [train.py:996] (1/4) Epoch 10, batch 2650, loss[loss=0.2238, simple_loss=0.3013, pruned_loss=0.07317, over 21342.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2916, pruned_loss=0.07095, over 4275158.33 frames. ], batch size: 549, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:02:49,322 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:03:01,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=15.0 2023-06-26 22:03:59,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1662792.0, ans=0.0 2023-06-26 22:04:42,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662912.0, ans=0.1 2023-06-26 22:04:43,341 INFO [train.py:996] (1/4) Epoch 10, batch 2700, loss[loss=0.2132, simple_loss=0.2862, pruned_loss=0.07008, over 21444.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2919, pruned_loss=0.07136, over 4272012.24 frames. ], batch size: 211, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:04:52,395 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:04:54,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1662912.0, ans=0.2 2023-06-26 22:04:57,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1662912.0, ans=0.125 2023-06-26 22:04:57,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1662912.0, ans=0.125 2023-06-26 22:05:31,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1663032.0, ans=0.2 2023-06-26 22:06:09,114 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 5.804e+02 8.533e+02 1.371e+03 2.390e+03, threshold=1.707e+03, percent-clipped=16.0 2023-06-26 22:06:31,057 INFO [train.py:996] (1/4) Epoch 10, batch 2750, loss[loss=0.1971, simple_loss=0.263, pruned_loss=0.0656, over 21558.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2912, pruned_loss=0.07102, over 4282113.04 frames. 
], batch size: 212, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:06:35,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1663212.0, ans=0.1 2023-06-26 22:07:51,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1663452.0, ans=0.125 2023-06-26 22:08:21,200 INFO [train.py:996] (1/4) Epoch 10, batch 2800, loss[loss=0.3344, simple_loss=0.4055, pruned_loss=0.1317, over 21511.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2955, pruned_loss=0.0707, over 4270872.76 frames. ], batch size: 507, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:08:45,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=1663572.0, ans=15.0 2023-06-26 22:09:11,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1663632.0, ans=0.125 2023-06-26 22:09:17,247 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:09:31,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.93 vs. limit=10.0 2023-06-26 22:10:00,634 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 7.435e+02 1.264e+03 2.282e+03 6.620e+03, threshold=2.529e+03, percent-clipped=31.0 2023-06-26 22:10:11,258 INFO [train.py:996] (1/4) Epoch 10, batch 2850, loss[loss=0.1574, simple_loss=0.2136, pruned_loss=0.05062, over 21377.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2953, pruned_loss=0.07078, over 4268968.83 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:10:49,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1663932.0, ans=0.2 2023-06-26 22:11:19,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1663992.0, ans=10.0 2023-06-26 22:11:26,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1663992.0, ans=0.0 2023-06-26 22:11:33,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1663992.0, ans=0.125 2023-06-26 22:11:51,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1664052.0, ans=0.125 2023-06-26 22:11:59,690 INFO [train.py:996] (1/4) Epoch 10, batch 2900, loss[loss=0.2142, simple_loss=0.2891, pruned_loss=0.06969, over 21844.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2926, pruned_loss=0.07038, over 4263116.77 frames. ], batch size: 351, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:12:16,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. 
limit=15.0 2023-06-26 22:12:21,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1664172.0, ans=0.125 2023-06-26 22:12:31,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1664172.0, ans=0.125 2023-06-26 22:12:46,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1664232.0, ans=0.0 2023-06-26 22:13:18,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-26 22:13:30,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1664352.0, ans=0.1 2023-06-26 22:13:38,074 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.904e+02 5.286e+02 7.202e+02 1.145e+03 2.929e+03, threshold=1.440e+03, percent-clipped=1.0 2023-06-26 22:13:38,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1664352.0, ans=0.0 2023-06-26 22:13:46,799 INFO [train.py:996] (1/4) Epoch 10, batch 2950, loss[loss=0.2087, simple_loss=0.3082, pruned_loss=0.05458, over 21797.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2943, pruned_loss=0.07015, over 4275152.68 frames. ], batch size: 282, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:14:01,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1664412.0, ans=0.125 2023-06-26 22:14:12,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1664472.0, ans=0.2 2023-06-26 22:14:40,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1664532.0, ans=0.0 2023-06-26 22:15:34,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1664712.0, ans=0.125 2023-06-26 22:15:40,775 INFO [train.py:996] (1/4) Epoch 10, batch 3000, loss[loss=0.2444, simple_loss=0.3272, pruned_loss=0.08078, over 21553.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2962, pruned_loss=0.07083, over 4273578.33 frames. ], batch size: 414, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:15:40,776 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 22:15:51,669 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.9046, 6.0563, 5.7343, 5.5663], device='cuda:1') 2023-06-26 22:15:58,647 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2517, simple_loss=0.3411, pruned_loss=0.08118, over 1796401.00 frames. 
2023-06-26 22:15:58,648 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-26 22:16:29,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1664772.0, ans=0.2 2023-06-26 22:16:58,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1664832.0, ans=0.2 2023-06-26 22:17:35,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1664952.0, ans=0.0 2023-06-26 22:17:39,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.020e+02 5.823e+02 1.007e+03 1.425e+03 2.943e+03, threshold=2.014e+03, percent-clipped=25.0 2023-06-26 22:17:47,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1665012.0, ans=0.1 2023-06-26 22:17:48,235 INFO [train.py:996] (1/4) Epoch 10, batch 3050, loss[loss=0.1964, simple_loss=0.2947, pruned_loss=0.04906, over 21456.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2983, pruned_loss=0.07052, over 4278992.37 frames. ], batch size: 211, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:18:02,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1665012.0, ans=0.2 2023-06-26 22:18:02,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1665012.0, ans=0.09899494936611666 2023-06-26 22:18:18,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1665072.0, ans=0.0 2023-06-26 22:18:24,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-26 22:18:34,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1665132.0, ans=0.04949747468305833 2023-06-26 22:19:01,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1665192.0, ans=0.0 2023-06-26 22:19:37,781 INFO [train.py:996] (1/4) Epoch 10, batch 3100, loss[loss=0.1874, simple_loss=0.2771, pruned_loss=0.04888, over 21585.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2985, pruned_loss=0.06962, over 4281960.72 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:21:16,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1665552.0, ans=0.1 2023-06-26 22:21:17,185 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.635e+02 5.384e+02 7.508e+02 1.175e+03 3.644e+03, threshold=1.502e+03, percent-clipped=4.0 2023-06-26 22:21:26,438 INFO [train.py:996] (1/4) Epoch 10, batch 3150, loss[loss=0.2533, simple_loss=0.3279, pruned_loss=0.08937, over 21318.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3, pruned_loss=0.07017, over 4284823.73 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:22:46,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1665792.0, ans=10.0 2023-06-26 22:23:22,058 INFO [train.py:996] (1/4) Epoch 10, batch 3200, loss[loss=0.2024, simple_loss=0.2697, pruned_loss=0.06755, over 21187.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3008, pruned_loss=0.07046, over 4276817.26 frames. 
], batch size: 608, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:23:59,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1665972.0, ans=0.0 2023-06-26 22:24:16,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1666032.0, ans=0.0 2023-06-26 22:24:18,917 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=15.0 2023-06-26 22:24:26,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1666032.0, ans=0.1 2023-06-26 22:24:59,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1666152.0, ans=0.2 2023-06-26 22:25:01,069 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.433e+02 6.467e+02 1.041e+03 1.408e+03 2.668e+03, threshold=2.081e+03, percent-clipped=19.0 2023-06-26 22:25:14,966 INFO [train.py:996] (1/4) Epoch 10, batch 3250, loss[loss=0.2263, simple_loss=0.3251, pruned_loss=0.0638, over 21493.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3038, pruned_loss=0.07186, over 4280611.75 frames. ], batch size: 471, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:25:25,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1666212.0, ans=0.025 2023-06-26 22:25:47,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-26 22:26:21,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1666392.0, ans=0.125 2023-06-26 22:26:21,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1666392.0, ans=0.2 2023-06-26 22:26:21,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1666392.0, ans=0.125 2023-06-26 22:26:59,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1666452.0, ans=0.0 2023-06-26 22:27:04,048 INFO [train.py:996] (1/4) Epoch 10, batch 3300, loss[loss=0.189, simple_loss=0.2587, pruned_loss=0.05968, over 21199.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3005, pruned_loss=0.07134, over 4272301.40 frames. ], batch size: 176, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:28:08,034 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-26 22:28:09,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1666692.0, ans=0.0 2023-06-26 22:28:42,867 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 7.426e+02 1.088e+03 1.707e+03 4.708e+03, threshold=2.176e+03, percent-clipped=17.0 2023-06-26 22:28:51,836 INFO [train.py:996] (1/4) Epoch 10, batch 3350, loss[loss=0.2448, simple_loss=0.3162, pruned_loss=0.08669, over 21373.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3012, pruned_loss=0.0721, over 4271020.46 frames. 
], batch size: 549, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:29:17,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1666872.0, ans=0.125 2023-06-26 22:29:39,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-26 22:29:44,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1666932.0, ans=0.125 2023-06-26 22:29:48,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1666932.0, ans=0.1 2023-06-26 22:29:50,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1666932.0, ans=0.0 2023-06-26 22:30:09,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1666992.0, ans=0.04949747468305833 2023-06-26 22:30:39,086 INFO [train.py:996] (1/4) Epoch 10, batch 3400, loss[loss=0.2502, simple_loss=0.3476, pruned_loss=0.07637, over 20724.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3022, pruned_loss=0.0725, over 4278691.12 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:30:59,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1667112.0, ans=0.1 2023-06-26 22:31:07,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1667172.0, ans=0.1 2023-06-26 22:32:20,103 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 6.513e+02 9.750e+02 1.536e+03 3.496e+03, threshold=1.950e+03, percent-clipped=9.0 2023-06-26 22:32:34,440 INFO [train.py:996] (1/4) Epoch 10, batch 3450, loss[loss=0.2033, simple_loss=0.2646, pruned_loss=0.07103, over 21453.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.297, pruned_loss=0.0722, over 4276865.94 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:32:47,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1667412.0, ans=0.1 2023-06-26 22:33:56,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1667592.0, ans=0.125 2023-06-26 22:34:11,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1667652.0, ans=0.0 2023-06-26 22:34:17,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1667652.0, ans=0.125 2023-06-26 22:34:24,146 INFO [train.py:996] (1/4) Epoch 10, batch 3500, loss[loss=0.2468, simple_loss=0.3286, pruned_loss=0.08249, over 21827.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3044, pruned_loss=0.0751, over 4278780.87 frames. 
], batch size: 124, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:34:26,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1667712.0, ans=0.1 2023-06-26 22:34:31,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1667712.0, ans=0.1 2023-06-26 22:36:04,387 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.012e+02 7.162e+02 1.009e+03 1.814e+03 3.226e+03, threshold=2.018e+03, percent-clipped=21.0 2023-06-26 22:36:06,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1667952.0, ans=0.2 2023-06-26 22:36:13,109 INFO [train.py:996] (1/4) Epoch 10, batch 3550, loss[loss=0.2078, simple_loss=0.2717, pruned_loss=0.07194, over 21852.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3067, pruned_loss=0.07662, over 4280093.35 frames. ], batch size: 107, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:36:25,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1668012.0, ans=0.125 2023-06-26 22:36:35,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1668072.0, ans=0.1 2023-06-26 22:36:39,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-26 22:36:54,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1668072.0, ans=0.125 2023-06-26 22:36:57,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1668132.0, ans=0.125 2023-06-26 22:37:22,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1668192.0, ans=0.0 2023-06-26 22:38:06,104 INFO [train.py:996] (1/4) Epoch 10, batch 3600, loss[loss=0.2201, simple_loss=0.308, pruned_loss=0.06605, over 20760.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3027, pruned_loss=0.07646, over 4275349.64 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 32.0 2023-06-26 22:38:47,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1668432.0, ans=0.125 2023-06-26 22:39:04,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1668432.0, ans=0.0 2023-06-26 22:39:10,378 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-26 22:39:42,548 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.944e+02 5.183e+02 6.801e+02 1.024e+03 2.371e+03, threshold=1.360e+03, percent-clipped=4.0 2023-06-26 22:39:54,939 INFO [train.py:996] (1/4) Epoch 10, batch 3650, loss[loss=0.1965, simple_loss=0.2755, pruned_loss=0.0588, over 21482.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3025, pruned_loss=0.07603, over 4274387.80 frames. 
], batch size: 211, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:40:04,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1668612.0, ans=0.125 2023-06-26 22:40:04,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1668612.0, ans=0.05 2023-06-26 22:40:24,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1668672.0, ans=0.05 2023-06-26 22:40:34,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1668732.0, ans=0.04949747468305833 2023-06-26 22:40:45,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1668732.0, ans=0.04949747468305833 2023-06-26 22:41:01,231 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:41:36,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1668852.0, ans=0.0 2023-06-26 22:41:41,210 INFO [train.py:996] (1/4) Epoch 10, batch 3700, loss[loss=0.2179, simple_loss=0.3033, pruned_loss=0.06625, over 21864.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2998, pruned_loss=0.07483, over 4271424.35 frames. ], batch size: 371, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:43:23,446 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.784e+02 6.221e+02 8.574e+02 1.297e+03 2.866e+03, threshold=1.715e+03, percent-clipped=21.0 2023-06-26 22:43:30,702 INFO [train.py:996] (1/4) Epoch 10, batch 3750, loss[loss=0.2374, simple_loss=0.3077, pruned_loss=0.08358, over 21667.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2983, pruned_loss=0.07367, over 4281643.68 frames. ], batch size: 508, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:43:50,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1669212.0, ans=0.1 2023-06-26 22:43:56,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. limit=10.0 2023-06-26 22:43:57,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1669272.0, ans=0.0 2023-06-26 22:44:06,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1669272.0, ans=0.0 2023-06-26 22:44:57,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.43 vs. limit=12.0 2023-06-26 22:45:02,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1669452.0, ans=0.1 2023-06-26 22:45:18,859 INFO [train.py:996] (1/4) Epoch 10, batch 3800, loss[loss=0.1963, simple_loss=0.2717, pruned_loss=0.06041, over 20115.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2958, pruned_loss=0.07183, over 4277459.58 frames. 
], batch size: 703, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:45:24,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1669512.0, ans=0.125 2023-06-26 22:45:57,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1669572.0, ans=0.0 2023-06-26 22:46:40,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1669692.0, ans=0.125 2023-06-26 22:46:58,176 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.737e+02 5.847e+02 8.030e+02 1.160e+03 2.493e+03, threshold=1.606e+03, percent-clipped=8.0 2023-06-26 22:47:10,254 INFO [train.py:996] (1/4) Epoch 10, batch 3850, loss[loss=0.1987, simple_loss=0.2826, pruned_loss=0.05738, over 21406.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2935, pruned_loss=0.07201, over 4277842.75 frames. ], batch size: 548, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:47:12,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1669812.0, ans=0.1 2023-06-26 22:47:17,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1669812.0, ans=0.125 2023-06-26 22:47:36,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1669872.0, ans=0.0 2023-06-26 22:48:15,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1669992.0, ans=0.0 2023-06-26 22:48:15,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1669992.0, ans=0.0 2023-06-26 22:48:30,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1669992.0, ans=0.125 2023-06-26 22:48:51,924 INFO [train.py:996] (1/4) Epoch 10, batch 3900, loss[loss=0.2135, simple_loss=0.288, pruned_loss=0.06948, over 21899.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2894, pruned_loss=0.07155, over 4282655.30 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:49:13,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1670172.0, ans=0.125 2023-06-26 22:49:26,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1670172.0, ans=0.035 2023-06-26 22:49:47,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1670232.0, ans=0.2 2023-06-26 22:50:23,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1670352.0, ans=0.125 2023-06-26 22:50:40,116 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 6.738e+02 9.125e+02 1.558e+03 3.098e+03, threshold=1.825e+03, percent-clipped=22.0 2023-06-26 22:50:47,228 INFO [train.py:996] (1/4) Epoch 10, batch 3950, loss[loss=0.1661, simple_loss=0.2544, pruned_loss=0.03896, over 21793.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.291, pruned_loss=0.07036, over 4285343.99 frames. 
], batch size: 282, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:52:07,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1670592.0, ans=0.2 2023-06-26 22:52:35,953 INFO [train.py:996] (1/4) Epoch 10, batch 4000, loss[loss=0.2022, simple_loss=0.2762, pruned_loss=0.06411, over 22015.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2854, pruned_loss=0.06705, over 4285306.31 frames. ], batch size: 103, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 22:52:55,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1670712.0, ans=0.1 2023-06-26 22:53:12,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-26 22:53:29,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1670832.0, ans=0.125 2023-06-26 22:53:59,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-26 22:54:19,972 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.363e+02 6.033e+02 8.423e+02 1.568e+03 3.555e+03, threshold=1.685e+03, percent-clipped=19.0 2023-06-26 22:54:31,310 INFO [train.py:996] (1/4) Epoch 10, batch 4050, loss[loss=0.1847, simple_loss=0.2374, pruned_loss=0.06601, over 20672.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2833, pruned_loss=0.06566, over 4283013.58 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:54:45,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1671012.0, ans=0.125 2023-06-26 22:54:52,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1671072.0, ans=0.1 2023-06-26 22:55:08,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-26 22:55:10,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-26 22:56:20,348 INFO [train.py:996] (1/4) Epoch 10, batch 4100, loss[loss=0.2035, simple_loss=0.2882, pruned_loss=0.05943, over 21817.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2868, pruned_loss=0.06679, over 4291613.44 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:57:57,625 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.779e+02 5.678e+02 9.516e+02 1.395e+03 3.425e+03, threshold=1.903e+03, percent-clipped=17.0 2023-06-26 22:58:00,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-26 22:58:02,736 INFO [train.py:996] (1/4) Epoch 10, batch 4150, loss[loss=0.1739, simple_loss=0.2626, pruned_loss=0.04257, over 21674.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2875, pruned_loss=0.06497, over 4282439.85 frames. 
], batch size: 247, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:58:08,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1671612.0, ans=0.0 2023-06-26 22:58:10,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1671612.0, ans=10.0 2023-06-26 22:58:23,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1671672.0, ans=0.125 2023-06-26 22:58:54,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1671732.0, ans=0.125 2023-06-26 22:59:16,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1671792.0, ans=0.0 2023-06-26 22:59:29,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1671852.0, ans=0.2 2023-06-26 22:59:40,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-26 22:59:48,091 INFO [train.py:996] (1/4) Epoch 10, batch 4200, loss[loss=0.2117, simple_loss=0.3149, pruned_loss=0.05421, over 19837.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2874, pruned_loss=0.06418, over 4273903.86 frames. ], batch size: 703, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:00:15,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1671972.0, ans=0.2 2023-06-26 23:00:17,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1671972.0, ans=0.09899494936611666 2023-06-26 23:01:07,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1672092.0, ans=0.0 2023-06-26 23:01:29,526 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.633e+02 4.957e+02 6.956e+02 1.176e+03 3.842e+03, threshold=1.391e+03, percent-clipped=7.0 2023-06-26 23:01:33,266 INFO [train.py:996] (1/4) Epoch 10, batch 4250, loss[loss=0.2391, simple_loss=0.3186, pruned_loss=0.07986, over 21342.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2918, pruned_loss=0.06537, over 4275829.70 frames. ], batch size: 143, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:02:01,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1672272.0, ans=0.95 2023-06-26 23:02:52,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1672392.0, ans=0.5 2023-06-26 23:03:01,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1672452.0, ans=0.125 2023-06-26 23:03:30,181 INFO [train.py:996] (1/4) Epoch 10, batch 4300, loss[loss=0.2072, simple_loss=0.3098, pruned_loss=0.05228, over 21843.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2968, pruned_loss=0.06697, over 4277506.15 frames. 
], batch size: 371, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:03:32,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1672512.0, ans=0.125 2023-06-26 23:03:33,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.73 vs. limit=12.0 2023-06-26 23:04:29,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1672632.0, ans=0.0 2023-06-26 23:04:42,411 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:05:03,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1672752.0, ans=0.0 2023-06-26 23:05:06,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1672752.0, ans=0.125 2023-06-26 23:05:12,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1672752.0, ans=0.1 2023-06-26 23:05:14,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1672752.0, ans=0.1 2023-06-26 23:05:15,310 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 6.207e+02 8.849e+02 1.440e+03 4.327e+03, threshold=1.770e+03, percent-clipped=25.0 2023-06-26 23:05:18,710 INFO [train.py:996] (1/4) Epoch 10, batch 4350, loss[loss=0.223, simple_loss=0.2875, pruned_loss=0.07925, over 21449.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2964, pruned_loss=0.06717, over 4274306.72 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:07:07,244 INFO [train.py:996] (1/4) Epoch 10, batch 4400, loss[loss=0.1797, simple_loss=0.2638, pruned_loss=0.04778, over 21752.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2926, pruned_loss=0.06643, over 4266316.72 frames. ], batch size: 124, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:07:16,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1673112.0, ans=0.0 2023-06-26 23:07:27,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1673112.0, ans=0.2 2023-06-26 23:08:52,416 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.985e+02 5.856e+02 8.779e+02 1.198e+03 2.482e+03, threshold=1.756e+03, percent-clipped=8.0 2023-06-26 23:08:56,191 INFO [train.py:996] (1/4) Epoch 10, batch 4450, loss[loss=0.2592, simple_loss=0.3536, pruned_loss=0.08245, over 21844.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3021, pruned_loss=0.06889, over 4266838.99 frames. 
], batch size: 316, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:09:09,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1673412.0, ans=0.125 2023-06-26 23:09:23,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1673472.0, ans=0.125 2023-06-26 23:09:35,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1673472.0, ans=0.125 2023-06-26 23:09:54,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1673532.0, ans=0.2 2023-06-26 23:10:22,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1673592.0, ans=0.05 2023-06-26 23:10:28,040 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:10:45,071 INFO [train.py:996] (1/4) Epoch 10, batch 4500, loss[loss=0.2047, simple_loss=0.2777, pruned_loss=0.06588, over 21701.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3001, pruned_loss=0.0695, over 4275012.38 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:11:13,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-26 23:11:21,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1673772.0, ans=0.1 2023-06-26 23:11:26,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1673772.0, ans=0.125 2023-06-26 23:12:21,949 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=8.0 2023-06-26 23:12:28,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1673952.0, ans=0.0 2023-06-26 23:12:31,379 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.557e+02 6.437e+02 9.027e+02 1.407e+03 3.220e+03, threshold=1.805e+03, percent-clipped=13.0 2023-06-26 23:12:46,668 INFO [train.py:996] (1/4) Epoch 10, batch 4550, loss[loss=0.2474, simple_loss=0.3376, pruned_loss=0.0786, over 21813.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3022, pruned_loss=0.06948, over 4279778.66 frames. ], batch size: 124, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:13:06,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1674072.0, ans=0.0 2023-06-26 23:13:50,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1674192.0, ans=0.125 2023-06-26 23:13:55,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1674192.0, ans=0.125 2023-06-26 23:14:01,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. 
limit=6.0 2023-06-26 23:14:17,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1674252.0, ans=0.0 2023-06-26 23:14:30,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1674252.0, ans=0.0 2023-06-26 23:14:34,567 INFO [train.py:996] (1/4) Epoch 10, batch 4600, loss[loss=0.2171, simple_loss=0.292, pruned_loss=0.07104, over 21221.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3066, pruned_loss=0.07151, over 4281172.06 frames. ], batch size: 143, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:14:35,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1674312.0, ans=0.0 2023-06-26 23:14:52,249 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:15:21,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-26 23:15:41,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1674492.0, ans=0.0 2023-06-26 23:15:41,212 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:15:46,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-26 23:16:17,984 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.163e+02 6.181e+02 9.452e+02 1.480e+03 3.323e+03, threshold=1.890e+03, percent-clipped=16.0 2023-06-26 23:16:21,524 INFO [train.py:996] (1/4) Epoch 10, batch 4650, loss[loss=0.1714, simple_loss=0.2472, pruned_loss=0.04778, over 21881.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.301, pruned_loss=0.07021, over 4286287.43 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:17:11,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1674732.0, ans=0.2 2023-06-26 23:17:34,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1674792.0, ans=0.0 2023-06-26 23:17:59,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1674852.0, ans=0.0 2023-06-26 23:17:59,837 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:18:07,797 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.40 vs. limit=12.0 2023-06-26 23:18:08,104 INFO [train.py:996] (1/4) Epoch 10, batch 4700, loss[loss=0.2016, simple_loss=0.2601, pruned_loss=0.07158, over 21591.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2916, pruned_loss=0.06794, over 4287628.49 frames. ], batch size: 415, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:18:31,389 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. 
limit=12.0 2023-06-26 23:19:25,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675152.0, ans=0.1 2023-06-26 23:19:47,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1675152.0, ans=0.125 2023-06-26 23:19:50,877 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.765e+02 4.747e+02 5.523e+02 7.889e+02 1.677e+03, threshold=1.105e+03, percent-clipped=0.0 2023-06-26 23:19:54,038 INFO [train.py:996] (1/4) Epoch 10, batch 4750, loss[loss=0.2354, simple_loss=0.3157, pruned_loss=0.07756, over 21890.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2866, pruned_loss=0.06822, over 4287189.75 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:19:54,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1675212.0, ans=0.125 2023-06-26 23:19:58,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1675212.0, ans=0.125 2023-06-26 23:20:01,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0 2023-06-26 23:20:06,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1675212.0, ans=0.0 2023-06-26 23:20:50,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1675332.0, ans=0.125 2023-06-26 23:21:18,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1675452.0, ans=0.0 2023-06-26 23:21:30,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1675452.0, ans=0.05 2023-06-26 23:21:41,787 INFO [train.py:996] (1/4) Epoch 10, batch 4800, loss[loss=0.1727, simple_loss=0.2346, pruned_loss=0.05543, over 21192.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2865, pruned_loss=0.06777, over 4289616.73 frames. ], batch size: 548, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:21:47,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1675512.0, ans=0.125 2023-06-26 23:22:25,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1675632.0, ans=0.1 2023-06-26 23:22:44,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1675692.0, ans=0.015 2023-06-26 23:23:25,458 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.997e+02 5.704e+02 8.592e+02 1.252e+03 2.093e+03, threshold=1.718e+03, percent-clipped=31.0 2023-06-26 23:23:27,162 INFO [train.py:996] (1/4) Epoch 10, batch 4850, loss[loss=0.2159, simple_loss=0.2964, pruned_loss=0.06769, over 21631.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2851, pruned_loss=0.06762, over 4284851.98 frames. 
], batch size: 389, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:23:34,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675812.0, ans=0.1 2023-06-26 23:23:36,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-26 23:24:31,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1675992.0, ans=0.125 2023-06-26 23:24:42,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1675992.0, ans=0.05 2023-06-26 23:24:44,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1675992.0, ans=0.125 2023-06-26 23:25:15,482 INFO [train.py:996] (1/4) Epoch 10, batch 4900, loss[loss=0.2276, simple_loss=0.3192, pruned_loss=0.06796, over 21498.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2878, pruned_loss=0.06801, over 4280038.72 frames. ], batch size: 194, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:25:59,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676232.0, ans=0.1 2023-06-26 23:26:06,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1676232.0, ans=0.125 2023-06-26 23:26:53,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1676352.0, ans=15.0 2023-06-26 23:27:07,353 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.910e+02 6.746e+02 9.232e+02 1.272e+03 2.922e+03, threshold=1.846e+03, percent-clipped=7.0 2023-06-26 23:27:08,935 INFO [train.py:996] (1/4) Epoch 10, batch 4950, loss[loss=0.2399, simple_loss=0.3242, pruned_loss=0.07781, over 20642.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2913, pruned_loss=0.06643, over 4277158.03 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:27:31,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1676472.0, ans=0.1 2023-06-26 23:27:40,973 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-26 23:27:45,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1676472.0, ans=0.125 2023-06-26 23:27:54,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0 2023-06-26 23:28:00,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1676532.0, ans=0.125 2023-06-26 23:28:07,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1676592.0, ans=0.125 2023-06-26 23:28:50,822 INFO [train.py:996] (1/4) Epoch 10, batch 5000, loss[loss=0.244, simple_loss=0.3431, pruned_loss=0.07246, over 21282.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2934, pruned_loss=0.06415, over 4281807.94 frames. 
], batch size: 548, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:29:46,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1676832.0, ans=0.0 2023-06-26 23:30:07,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676892.0, ans=0.1 2023-06-26 23:30:35,685 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.540e+02 5.923e+02 8.910e+02 1.386e+03 2.915e+03, threshold=1.782e+03, percent-clipped=9.0 2023-06-26 23:30:37,445 INFO [train.py:996] (1/4) Epoch 10, batch 5050, loss[loss=0.2186, simple_loss=0.2865, pruned_loss=0.07531, over 21722.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2945, pruned_loss=0.06561, over 4288586.35 frames. ], batch size: 230, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:31:00,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1677012.0, ans=0.125 2023-06-26 23:31:14,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1677072.0, ans=0.125 2023-06-26 23:31:40,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1677192.0, ans=0.2 2023-06-26 23:32:22,440 INFO [train.py:996] (1/4) Epoch 10, batch 5100, loss[loss=0.1969, simple_loss=0.2799, pruned_loss=0.05695, over 21788.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2922, pruned_loss=0.06653, over 4290302.86 frames. ], batch size: 414, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:33:09,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1677432.0, ans=0.125 2023-06-26 23:33:10,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1677432.0, ans=0.125 2023-06-26 23:34:07,922 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.053e+02 6.342e+02 8.169e+02 1.053e+03 2.713e+03, threshold=1.634e+03, percent-clipped=6.0 2023-06-26 23:34:09,481 INFO [train.py:996] (1/4) Epoch 10, batch 5150, loss[loss=0.247, simple_loss=0.3229, pruned_loss=0.08561, over 21710.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.291, pruned_loss=0.0668, over 4287676.28 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:34:11,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-26 23:34:42,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1677672.0, ans=0.125 2023-06-26 23:34:50,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1677672.0, ans=10.0 2023-06-26 23:35:45,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1677852.0, ans=0.0 2023-06-26 23:36:03,537 INFO [train.py:996] (1/4) Epoch 10, batch 5200, loss[loss=0.2149, simple_loss=0.3176, pruned_loss=0.05612, over 21663.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2926, pruned_loss=0.06759, over 4292423.25 frames. 
], batch size: 263, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:36:28,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1677972.0, ans=0.1 2023-06-26 23:37:22,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1678092.0, ans=0.125 2023-06-26 23:37:50,414 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.940e+02 5.817e+02 8.011e+02 1.324e+03 3.418e+03, threshold=1.602e+03, percent-clipped=14.0 2023-06-26 23:37:50,446 INFO [train.py:996] (1/4) Epoch 10, batch 5250, loss[loss=0.2149, simple_loss=0.3002, pruned_loss=0.06476, over 21838.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2973, pruned_loss=0.06658, over 4289974.82 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:38:19,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1678272.0, ans=0.125 2023-06-26 23:39:35,326 INFO [train.py:996] (1/4) Epoch 10, batch 5300, loss[loss=0.2183, simple_loss=0.2853, pruned_loss=0.07567, over 21841.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2954, pruned_loss=0.06726, over 4296368.97 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:40:04,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1678572.0, ans=15.0 2023-06-26 23:41:21,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.821e+02 5.421e+02 7.005e+02 9.056e+02 1.380e+03, threshold=1.401e+03, percent-clipped=0.0 2023-06-26 23:41:21,239 INFO [train.py:996] (1/4) Epoch 10, batch 5350, loss[loss=0.2452, simple_loss=0.3032, pruned_loss=0.09364, over 21833.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2937, pruned_loss=0.06863, over 4303783.24 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:41:25,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678812.0, ans=0.1 2023-06-26 23:41:31,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678812.0, ans=0.1 2023-06-26 23:41:53,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1678872.0, ans=0.0 2023-06-26 23:41:59,155 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.76 vs. 
limit=10.0 2023-06-26 23:42:05,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1678932.0, ans=0.0 2023-06-26 23:42:07,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1678932.0, ans=0.125 2023-06-26 23:42:16,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1678932.0, ans=0.125 2023-06-26 23:42:16,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1678932.0, ans=0.125 2023-06-26 23:42:32,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1678992.0, ans=0.2 2023-06-26 23:42:46,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1679052.0, ans=0.0 2023-06-26 23:42:59,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1679052.0, ans=0.125 2023-06-26 23:43:05,922 INFO [train.py:996] (1/4) Epoch 10, batch 5400, loss[loss=0.228, simple_loss=0.3101, pruned_loss=0.07296, over 21770.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2926, pruned_loss=0.06905, over 4297175.80 frames. ], batch size: 112, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:43:41,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1679172.0, ans=0.0 2023-06-26 23:44:12,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1679292.0, ans=0.015 2023-06-26 23:44:38,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1679352.0, ans=0.125 2023-06-26 23:44:53,961 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.666e+02 6.862e+02 1.175e+03 1.926e+03 4.033e+03, threshold=2.351e+03, percent-clipped=41.0 2023-06-26 23:44:53,992 INFO [train.py:996] (1/4) Epoch 10, batch 5450, loss[loss=0.217, simple_loss=0.3269, pruned_loss=0.05349, over 21359.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.293, pruned_loss=0.06666, over 4293332.47 frames. ], batch size: 194, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:46:08,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-26 23:46:50,760 INFO [train.py:996] (1/4) Epoch 10, batch 5500, loss[loss=0.2155, simple_loss=0.3113, pruned_loss=0.05986, over 21750.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2982, pruned_loss=0.06439, over 4275826.23 frames. 
], batch size: 332, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:46:56,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1679712.0, ans=0.1 2023-06-26 23:47:51,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1679832.0, ans=0.125 2023-06-26 23:48:18,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1679892.0, ans=0.125 2023-06-26 23:48:48,468 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.727e+02 5.357e+02 7.450e+02 1.317e+03 3.051e+03, threshold=1.490e+03, percent-clipped=6.0 2023-06-26 23:48:48,500 INFO [train.py:996] (1/4) Epoch 10, batch 5550, loss[loss=0.1888, simple_loss=0.2853, pruned_loss=0.04612, over 21648.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2985, pruned_loss=0.06282, over 4270369.89 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:48:59,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-26 23:49:39,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1680132.0, ans=0.1 2023-06-26 23:50:07,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1680192.0, ans=0.0 2023-06-26 23:50:34,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-06-26 23:50:38,655 INFO [train.py:996] (1/4) Epoch 10, batch 5600, loss[loss=0.2167, simple_loss=0.344, pruned_loss=0.04471, over 19797.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2983, pruned_loss=0.06079, over 4269979.50 frames. ], batch size: 703, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:50:39,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1680312.0, ans=0.04949747468305833 2023-06-26 23:50:42,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1680312.0, ans=0.0 2023-06-26 23:50:45,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=15.0 2023-06-26 23:50:46,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1680312.0, ans=0.0 2023-06-26 23:50:52,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-26 23:51:05,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1680372.0, ans=0.125 2023-06-26 23:51:46,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.30 vs. 
limit=15.0 2023-06-26 23:52:12,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1680552.0, ans=0.0 2023-06-26 23:52:13,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1680552.0, ans=0.0 2023-06-26 23:52:25,069 INFO [train.py:996] (1/4) Epoch 10, batch 5650, loss[loss=0.1963, simple_loss=0.2698, pruned_loss=0.06137, over 21351.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.3007, pruned_loss=0.06252, over 4272030.42 frames. ], batch size: 176, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:52:27,136 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.741e+02 5.468e+02 7.224e+02 1.167e+03 2.877e+03, threshold=1.445e+03, percent-clipped=12.0 2023-06-26 23:53:14,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1680732.0, ans=0.125 2023-06-26 23:53:41,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-26 23:54:13,503 INFO [train.py:996] (1/4) Epoch 10, batch 5700, loss[loss=0.1885, simple_loss=0.2698, pruned_loss=0.0536, over 21554.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2995, pruned_loss=0.06434, over 4275551.55 frames. ], batch size: 195, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:54:34,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2023-06-26 23:55:43,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1681092.0, ans=10.0 2023-06-26 23:56:03,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-26 23:56:09,515 INFO [train.py:996] (1/4) Epoch 10, batch 5750, loss[loss=0.1631, simple_loss=0.2592, pruned_loss=0.03351, over 21444.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2938, pruned_loss=0.06166, over 4276488.99 frames. ], batch size: 212, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:56:11,423 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 6.670e+02 9.043e+02 1.357e+03 3.417e+03, threshold=1.809e+03, percent-clipped=19.0 2023-06-26 23:56:32,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1681272.0, ans=0.09899494936611666 2023-06-26 23:56:47,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1681272.0, ans=0.0 2023-06-26 23:56:48,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.08 vs. limit=5.0 2023-06-26 23:56:58,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1681332.0, ans=0.125 2023-06-26 23:57:14,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. 
limit=15.0 2023-06-26 23:57:34,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1681392.0, ans=0.0 2023-06-26 23:57:53,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1681452.0, ans=0.125 2023-06-26 23:57:58,048 INFO [train.py:996] (1/4) Epoch 10, batch 5800, loss[loss=0.2287, simple_loss=0.3378, pruned_loss=0.05982, over 19961.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.293, pruned_loss=0.06032, over 4267487.92 frames. ], batch size: 702, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:58:44,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1681572.0, ans=0.0 2023-06-26 23:59:15,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1681692.0, ans=0.04949747468305833 2023-06-26 23:59:38,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1681752.0, ans=0.125 2023-06-26 23:59:46,308 INFO [train.py:996] (1/4) Epoch 10, batch 5850, loss[loss=0.1696, simple_loss=0.2754, pruned_loss=0.03192, over 21657.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2933, pruned_loss=0.05784, over 4273578.28 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:59:47,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-26 23:59:53,508 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.721e+02 4.995e+02 7.881e+02 1.168e+03 2.434e+03, threshold=1.576e+03, percent-clipped=1.0 2023-06-27 00:00:06,096 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:00:06,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-27 00:00:09,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1681872.0, ans=0.125 2023-06-27 00:01:37,802 INFO [train.py:996] (1/4) Epoch 10, batch 5900, loss[loss=0.1738, simple_loss=0.2584, pruned_loss=0.04461, over 21693.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2867, pruned_loss=0.054, over 4280517.30 frames. ], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:02:07,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1682172.0, ans=0.125 2023-06-27 00:02:17,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1682172.0, ans=0.125 2023-06-27 00:02:29,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1682232.0, ans=0.125 2023-06-27 00:03:01,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1682352.0, ans=0.125 2023-06-27 00:03:24,112 INFO [train.py:996] (1/4) Epoch 10, batch 5950, loss[loss=0.198, simple_loss=0.2559, pruned_loss=0.07009, over 21269.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2848, pruned_loss=0.05663, over 4278849.20 frames. 
], batch size: 608, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:03:25,853 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.299e+02 4.862e+02 7.145e+02 9.461e+02 2.592e+03, threshold=1.429e+03, percent-clipped=2.0 2023-06-27 00:03:49,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-27 00:05:08,657 INFO [train.py:996] (1/4) Epoch 10, batch 6000, loss[loss=0.2042, simple_loss=0.2669, pruned_loss=0.07075, over 21736.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.281, pruned_loss=0.05892, over 4277079.90 frames. ], batch size: 351, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:05:08,657 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 00:05:29,813 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2604, simple_loss=0.3533, pruned_loss=0.08374, over 1796401.00 frames. 2023-06-27 00:05:29,814 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 00:05:48,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.79 vs. limit=10.0 2023-06-27 00:06:41,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1682892.0, ans=0.0 2023-06-27 00:06:43,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1682892.0, ans=0.0 2023-06-27 00:07:18,978 INFO [train.py:996] (1/4) Epoch 10, batch 6050, loss[loss=0.2383, simple_loss=0.3715, pruned_loss=0.05254, over 20806.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2759, pruned_loss=0.05956, over 4279757.73 frames. ], batch size: 607, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:07:21,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1683012.0, ans=0.2 2023-06-27 00:07:24,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.965e+02 5.435e+02 7.983e+02 1.281e+03 2.662e+03, threshold=1.597e+03, percent-clipped=18.0 2023-06-27 00:09:06,572 INFO [train.py:996] (1/4) Epoch 10, batch 6100, loss[loss=0.2314, simple_loss=0.3061, pruned_loss=0.07832, over 21801.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2747, pruned_loss=0.05843, over 4281553.75 frames. ], batch size: 112, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:09:47,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1683432.0, ans=0.0 2023-06-27 00:09:54,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1683432.0, ans=0.1 2023-06-27 00:10:24,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1683552.0, ans=0.2 2023-06-27 00:10:33,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1683552.0, ans=0.0 2023-06-27 00:10:42,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-27 00:10:53,278 INFO [train.py:996] (1/4) Epoch 10, batch 6150, loss[loss=0.2377, simple_loss=0.3107, pruned_loss=0.08237, over 21726.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2789, pruned_loss=0.06103, over 4273915.35 frames. 
], batch size: 415, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:10:58,721 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.616e+02 5.589e+02 9.647e+02 1.302e+03 3.090e+03, threshold=1.929e+03, percent-clipped=16.0 2023-06-27 00:11:51,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1683732.0, ans=0.0 2023-06-27 00:12:02,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-27 00:12:35,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1683852.0, ans=0.09899494936611666 2023-06-27 00:12:42,242 INFO [train.py:996] (1/4) Epoch 10, batch 6200, loss[loss=0.2746, simple_loss=0.3979, pruned_loss=0.07561, over 20770.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2827, pruned_loss=0.06184, over 4277152.26 frames. ], batch size: 607, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:12:44,814 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:13:03,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1683972.0, ans=0.0 2023-06-27 00:13:49,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1684092.0, ans=0.125 2023-06-27 00:14:00,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1684092.0, ans=0.0 2023-06-27 00:14:31,212 INFO [train.py:996] (1/4) Epoch 10, batch 6250, loss[loss=0.1861, simple_loss=0.2917, pruned_loss=0.04024, over 21639.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2866, pruned_loss=0.0617, over 4279330.16 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:14:36,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.907e+02 5.995e+02 9.540e+02 1.636e+03 4.135e+03, threshold=1.908e+03, percent-clipped=20.0 2023-06-27 00:15:14,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1684332.0, ans=0.1 2023-06-27 00:15:19,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1684332.0, ans=15.0 2023-06-27 00:15:29,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1684332.0, ans=0.125 2023-06-27 00:15:32,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1684392.0, ans=0.07 2023-06-27 00:16:03,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1684452.0, ans=0.125 2023-06-27 00:16:16,335 INFO [train.py:996] (1/4) Epoch 10, batch 6300, loss[loss=0.2185, simple_loss=0.2869, pruned_loss=0.07507, over 21606.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2894, pruned_loss=0.06091, over 4281885.21 frames. 
], batch size: 548, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:17:09,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1684632.0, ans=0.05 2023-06-27 00:17:20,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.49 vs. limit=15.0 2023-06-27 00:17:23,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.47 vs. limit=10.0 2023-06-27 00:17:24,553 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:18:08,447 INFO [train.py:996] (1/4) Epoch 10, batch 6350, loss[loss=0.2143, simple_loss=0.2904, pruned_loss=0.06915, over 21625.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2925, pruned_loss=0.06383, over 4286583.82 frames. ], batch size: 230, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:18:13,741 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.276e+02 6.494e+02 9.126e+02 1.517e+03, threshold=1.299e+03, percent-clipped=0.0 2023-06-27 00:19:57,959 INFO [train.py:996] (1/4) Epoch 10, batch 6400, loss[loss=0.2297, simple_loss=0.3018, pruned_loss=0.07881, over 22016.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.298, pruned_loss=0.06788, over 4292379.09 frames. ], batch size: 317, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:20:18,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1685112.0, ans=0.05 2023-06-27 00:20:25,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1685172.0, ans=0.025 2023-06-27 00:20:49,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1685232.0, ans=0.125 2023-06-27 00:21:01,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1685232.0, ans=0.125 2023-06-27 00:21:19,867 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:21:50,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-27 00:21:51,034 INFO [train.py:996] (1/4) Epoch 10, batch 6450, loss[loss=0.2121, simple_loss=0.3058, pruned_loss=0.05916, over 21201.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.3004, pruned_loss=0.06735, over 4291016.14 frames. ], batch size: 548, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:21:55,952 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 6.943e+02 1.024e+03 1.521e+03 2.741e+03, threshold=2.048e+03, percent-clipped=32.0 2023-06-27 00:22:53,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-27 00:23:05,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.03 vs. 
limit=15.0 2023-06-27 00:23:11,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1685592.0, ans=0.125 2023-06-27 00:23:37,624 INFO [train.py:996] (1/4) Epoch 10, batch 6500, loss[loss=0.203, simple_loss=0.2736, pruned_loss=0.06618, over 21785.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2938, pruned_loss=0.06638, over 4289598.81 frames. ], batch size: 102, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:23:46,545 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:23:48,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1685712.0, ans=0.05 2023-06-27 00:23:56,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1685712.0, ans=0.0 2023-06-27 00:24:08,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1685772.0, ans=0.0 2023-06-27 00:24:38,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1685892.0, ans=0.0 2023-06-27 00:25:01,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1685952.0, ans=0.125 2023-06-27 00:25:23,187 INFO [train.py:996] (1/4) Epoch 10, batch 6550, loss[loss=0.1955, simple_loss=0.2823, pruned_loss=0.05431, over 21819.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2915, pruned_loss=0.06568, over 4292553.18 frames. ], batch size: 351, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:25:28,443 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.027e+02 5.505e+02 8.547e+02 1.330e+03 2.902e+03, threshold=1.709e+03, percent-clipped=6.0 2023-06-27 00:25:52,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-27 00:26:34,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1686192.0, ans=0.2 2023-06-27 00:26:38,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-27 00:27:10,170 INFO [train.py:996] (1/4) Epoch 10, batch 6600, loss[loss=0.2195, simple_loss=0.2784, pruned_loss=0.08034, over 21491.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2871, pruned_loss=0.0655, over 4283871.81 frames. ], batch size: 441, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:27:33,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1686372.0, ans=0.0 2023-06-27 00:28:57,099 INFO [train.py:996] (1/4) Epoch 10, batch 6650, loss[loss=0.1977, simple_loss=0.2776, pruned_loss=0.05885, over 21563.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2806, pruned_loss=0.06238, over 4278932.38 frames. 
], batch size: 442, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:29:09,398 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.441e+02 5.556e+02 7.751e+02 1.155e+03 2.381e+03, threshold=1.550e+03, percent-clipped=8.0 2023-06-27 00:29:15,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1686612.0, ans=0.025 2023-06-27 00:29:23,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1686672.0, ans=0.0 2023-06-27 00:30:08,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1686792.0, ans=0.125 2023-06-27 00:30:48,138 INFO [train.py:996] (1/4) Epoch 10, batch 6700, loss[loss=0.1928, simple_loss=0.2705, pruned_loss=0.05756, over 21640.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2761, pruned_loss=0.06271, over 4278462.96 frames. ], batch size: 391, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:31:04,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-27 00:31:49,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1687092.0, ans=0.2 2023-06-27 00:32:29,087 INFO [train.py:996] (1/4) Epoch 10, batch 6750, loss[loss=0.1925, simple_loss=0.2451, pruned_loss=0.06991, over 20304.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2741, pruned_loss=0.06348, over 4281971.29 frames. ], batch size: 703, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:32:41,011 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.839e+02 5.646e+02 8.043e+02 1.106e+03 2.898e+03, threshold=1.609e+03, percent-clipped=7.0 2023-06-27 00:32:49,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1687272.0, ans=0.2 2023-06-27 00:32:51,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-27 00:33:07,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-27 00:33:28,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1687332.0, ans=0.95 2023-06-27 00:33:47,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1687392.0, ans=0.0 2023-06-27 00:33:47,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1687392.0, ans=0.125 2023-06-27 00:34:06,674 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.87 vs. limit=15.0 2023-06-27 00:34:13,854 INFO [train.py:996] (1/4) Epoch 10, batch 6800, loss[loss=0.1949, simple_loss=0.2638, pruned_loss=0.06303, over 21767.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2756, pruned_loss=0.06514, over 4291321.23 frames. ], batch size: 333, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:34:16,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. 
limit=12.0 2023-06-27 00:34:29,493 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:34:34,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1687572.0, ans=0.1 2023-06-27 00:35:00,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-27 00:35:16,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1687632.0, ans=0.125 2023-06-27 00:35:27,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1687692.0, ans=0.125 2023-06-27 00:35:54,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1687752.0, ans=0.0 2023-06-27 00:35:54,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1687752.0, ans=0.0 2023-06-27 00:36:00,655 INFO [train.py:996] (1/4) Epoch 10, batch 6850, loss[loss=0.1964, simple_loss=0.3296, pruned_loss=0.03164, over 20771.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2738, pruned_loss=0.06586, over 4275831.33 frames. ], batch size: 607, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:36:07,583 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 5.578e+02 7.964e+02 1.217e+03 2.059e+03, threshold=1.593e+03, percent-clipped=9.0 2023-06-27 00:37:00,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1687932.0, ans=0.0 2023-06-27 00:37:01,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1687932.0, ans=0.2 2023-06-27 00:37:02,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-27 00:37:18,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1687992.0, ans=0.125 2023-06-27 00:37:47,332 INFO [train.py:996] (1/4) Epoch 10, batch 6900, loss[loss=0.1972, simple_loss=0.2956, pruned_loss=0.04942, over 21742.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2758, pruned_loss=0.06601, over 4283126.15 frames. ], batch size: 441, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:37:50,200 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-06-27 00:37:54,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1688112.0, ans=0.125 2023-06-27 00:38:12,840 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.11 vs. limit=12.0 2023-06-27 00:38:27,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1688172.0, ans=0.125 2023-06-27 00:38:57,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1688292.0, ans=0.2 2023-06-27 00:39:06,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. 
limit=15.0 2023-06-27 00:39:19,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-27 00:39:24,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1688352.0, ans=0.125 2023-06-27 00:39:41,213 INFO [train.py:996] (1/4) Epoch 10, batch 6950, loss[loss=0.2579, simple_loss=0.3221, pruned_loss=0.09681, over 21450.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2774, pruned_loss=0.06342, over 4285048.38 frames. ], batch size: 471, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:39:43,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1688412.0, ans=0.1 2023-06-27 00:39:45,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-27 00:39:47,973 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.023e+02 6.673e+02 8.913e+02 1.216e+03 2.486e+03, threshold=1.783e+03, percent-clipped=9.0 2023-06-27 00:40:04,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1688472.0, ans=0.2 2023-06-27 00:41:27,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1688712.0, ans=0.0 2023-06-27 00:41:28,517 INFO [train.py:996] (1/4) Epoch 10, batch 7000, loss[loss=0.1872, simple_loss=0.2552, pruned_loss=0.05961, over 21374.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2817, pruned_loss=0.06585, over 4278469.46 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:41:40,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1688712.0, ans=0.125 2023-06-27 00:41:49,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1688772.0, ans=0.125 2023-06-27 00:42:15,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1688832.0, ans=0.07 2023-06-27 00:42:31,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1688892.0, ans=0.125 2023-06-27 00:42:50,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1688892.0, ans=0.125 2023-06-27 00:43:15,459 INFO [train.py:996] (1/4) Epoch 10, batch 7050, loss[loss=0.1658, simple_loss=0.2542, pruned_loss=0.03866, over 21558.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2792, pruned_loss=0.0641, over 4277303.44 frames. 
], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:43:19,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1689012.0, ans=0.125 2023-06-27 00:43:27,748 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 6.761e+02 1.057e+03 1.502e+03 3.144e+03, threshold=2.115e+03, percent-clipped=16.0 2023-06-27 00:45:05,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1689252.0, ans=0.125 2023-06-27 00:45:09,738 INFO [train.py:996] (1/4) Epoch 10, batch 7100, loss[loss=0.1926, simple_loss=0.2812, pruned_loss=0.05197, over 21822.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2875, pruned_loss=0.06583, over 4268130.20 frames. ], batch size: 333, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:45:57,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1689432.0, ans=0.125 2023-06-27 00:46:07,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1689432.0, ans=0.025 2023-06-27 00:46:16,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1689492.0, ans=0.125 2023-06-27 00:46:20,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1689492.0, ans=0.0 2023-06-27 00:46:26,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1689492.0, ans=0.125 2023-06-27 00:46:47,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-27 00:47:02,310 INFO [train.py:996] (1/4) Epoch 10, batch 7150, loss[loss=0.2181, simple_loss=0.2984, pruned_loss=0.06893, over 21711.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2845, pruned_loss=0.06403, over 4269126.81 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:47:09,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.861e+02 6.064e+02 8.725e+02 1.357e+03 2.823e+03, threshold=1.745e+03, percent-clipped=6.0 2023-06-27 00:47:09,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1689612.0, ans=0.2 2023-06-27 00:47:18,973 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.64 vs. limit=12.0 2023-06-27 00:47:23,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1689672.0, ans=0.125 2023-06-27 00:47:41,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0 2023-06-27 00:48:27,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-27 00:48:48,713 INFO [train.py:996] (1/4) Epoch 10, batch 7200, loss[loss=0.2208, simple_loss=0.2902, pruned_loss=0.07567, over 21411.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2864, pruned_loss=0.06578, over 4264945.88 frames. 
], batch size: 194, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:48:59,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1689912.0, ans=0.0 2023-06-27 00:49:00,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-27 00:49:04,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1689972.0, ans=0.125 2023-06-27 00:49:19,599 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-27 00:50:05,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-27 00:50:15,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.81 vs. limit=10.0 2023-06-27 00:50:34,351 INFO [train.py:996] (1/4) Epoch 10, batch 7250, loss[loss=0.176, simple_loss=0.2401, pruned_loss=0.05597, over 21468.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2827, pruned_loss=0.06572, over 4262110.78 frames. ], batch size: 212, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:50:38,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1690212.0, ans=0.0 2023-06-27 00:50:40,768 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 6.230e+02 8.378e+02 1.198e+03 2.214e+03, threshold=1.676e+03, percent-clipped=4.0 2023-06-27 00:51:03,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1690272.0, ans=0.0 2023-06-27 00:51:49,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-27 00:52:18,852 INFO [train.py:996] (1/4) Epoch 10, batch 7300, loss[loss=0.199, simple_loss=0.2696, pruned_loss=0.06421, over 21731.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2769, pruned_loss=0.06534, over 4265924.97 frames. ], batch size: 112, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:53:13,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1690632.0, ans=0.0 2023-06-27 00:53:13,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.91 vs. limit=22.5 2023-06-27 00:53:33,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1690692.0, ans=0.1 2023-06-27 00:53:56,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1690752.0, ans=0.125 2023-06-27 00:53:58,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690752.0, ans=0.1 2023-06-27 00:54:06,774 INFO [train.py:996] (1/4) Epoch 10, batch 7350, loss[loss=0.1907, simple_loss=0.2729, pruned_loss=0.05426, over 16336.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2752, pruned_loss=0.06589, over 4261444.54 frames. 
], batch size: 60, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:54:15,737 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.980e+02 5.910e+02 7.871e+02 1.338e+03 3.655e+03, threshold=1.574e+03, percent-clipped=15.0 2023-06-27 00:55:01,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690932.0, ans=0.1 2023-06-27 00:55:28,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1690992.0, ans=15.0 2023-06-27 00:55:38,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1691052.0, ans=0.125 2023-06-27 00:55:38,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1691052.0, ans=0.05 2023-06-27 00:55:49,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2023-06-27 00:55:50,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1691052.0, ans=0.0 2023-06-27 00:55:56,534 INFO [train.py:996] (1/4) Epoch 10, batch 7400, loss[loss=0.2449, simple_loss=0.3241, pruned_loss=0.08286, over 21423.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2816, pruned_loss=0.06723, over 4251566.36 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:55:58,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1691112.0, ans=0.0 2023-06-27 00:56:31,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1691172.0, ans=0.015 2023-06-27 00:56:32,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1691172.0, ans=0.0 2023-06-27 00:57:13,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1691292.0, ans=0.1 2023-06-27 00:57:15,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1691292.0, ans=0.0 2023-06-27 00:57:42,522 INFO [train.py:996] (1/4) Epoch 10, batch 7450, loss[loss=0.199, simple_loss=0.2651, pruned_loss=0.06649, over 21616.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2803, pruned_loss=0.06648, over 4258054.66 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:57:56,773 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.982e+02 5.896e+02 9.357e+02 1.491e+03 2.777e+03, threshold=1.871e+03, percent-clipped=18.0 2023-06-27 00:58:11,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.08 vs. 
limit=15.0 2023-06-27 00:58:35,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1691532.0, ans=0.0 2023-06-27 00:59:06,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691592.0, ans=0.1 2023-06-27 00:59:24,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1691652.0, ans=0.125 2023-06-27 00:59:37,955 INFO [train.py:996] (1/4) Epoch 10, batch 7500, loss[loss=0.2149, simple_loss=0.3068, pruned_loss=0.06143, over 21295.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2858, pruned_loss=0.06719, over 4265516.85 frames. ], batch size: 176, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:59:40,079 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:00:48,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1691892.0, ans=10.0 2023-06-27 01:01:05,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1691952.0, ans=0.025 2023-06-27 01:01:31,387 INFO [train.py:996] (1/4) Epoch 10, batch 7550, loss[loss=0.271, simple_loss=0.3624, pruned_loss=0.08984, over 21471.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2951, pruned_loss=0.06744, over 4272304.81 frames. ], batch size: 507, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 01:01:38,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1692012.0, ans=0.0 2023-06-27 01:01:39,834 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.177e+02 6.369e+02 9.874e+02 1.839e+03 3.635e+03, threshold=1.975e+03, percent-clipped=22.0 2023-06-27 01:02:02,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1692072.0, ans=0.125 2023-06-27 01:02:18,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1692132.0, ans=0.2 2023-06-27 01:03:11,966 INFO [train.py:996] (1/4) Epoch 10, batch 7600, loss[loss=0.2279, simple_loss=0.3022, pruned_loss=0.07677, over 22076.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2918, pruned_loss=0.06672, over 4275741.85 frames. 
], batch size: 119, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:03:25,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1692312.0, ans=0.125 2023-06-27 01:03:48,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1692372.0, ans=0.125 2023-06-27 01:04:17,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1692492.0, ans=0.2 2023-06-27 01:04:24,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1692492.0, ans=0.0 2023-06-27 01:04:31,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1692492.0, ans=0.125 2023-06-27 01:04:34,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1692552.0, ans=0.1 2023-06-27 01:05:03,897 INFO [train.py:996] (1/4) Epoch 10, batch 7650, loss[loss=0.1841, simple_loss=0.2776, pruned_loss=0.04526, over 20783.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2902, pruned_loss=0.06706, over 4275529.87 frames. ], batch size: 609, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:05:06,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1692612.0, ans=0.0 2023-06-27 01:05:12,456 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.938e+02 5.695e+02 7.737e+02 9.992e+02 2.893e+03, threshold=1.547e+03, percent-clipped=4.0 2023-06-27 01:05:19,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1692672.0, ans=0.0 2023-06-27 01:06:12,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1692792.0, ans=0.125 2023-06-27 01:06:14,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1692792.0, ans=0.125 2023-06-27 01:06:23,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-27 01:06:52,693 INFO [train.py:996] (1/4) Epoch 10, batch 7700, loss[loss=0.2339, simple_loss=0.3072, pruned_loss=0.0803, over 21375.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2931, pruned_loss=0.07005, over 4282049.82 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:08:42,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1693212.0, ans=0.125 2023-06-27 01:08:43,837 INFO [train.py:996] (1/4) Epoch 10, batch 7750, loss[loss=0.2823, simple_loss=0.3816, pruned_loss=0.09152, over 21641.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2959, pruned_loss=0.0694, over 4270746.72 frames. 
], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:08:49,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1693212.0, ans=0.125 2023-06-27 01:08:51,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1693212.0, ans=0.125 2023-06-27 01:09:05,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.135e+02 8.248e+02 1.279e+03 1.795e+03 4.947e+03, threshold=2.557e+03, percent-clipped=28.0 2023-06-27 01:09:05,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1693212.0, ans=0.125 2023-06-27 01:09:42,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1693332.0, ans=0.2 2023-06-27 01:10:42,231 INFO [train.py:996] (1/4) Epoch 10, batch 7800, loss[loss=0.2422, simple_loss=0.3222, pruned_loss=0.08104, over 21851.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2985, pruned_loss=0.07047, over 4257854.95 frames. ], batch size: 372, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:11:30,879 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.35 vs. limit=12.0 2023-06-27 01:11:33,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1693692.0, ans=0.125 2023-06-27 01:12:12,614 INFO [train.py:996] (1/4) Epoch 10, batch 7850, loss[loss=0.1974, simple_loss=0.2625, pruned_loss=0.06612, over 21286.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2919, pruned_loss=0.06972, over 4257940.64 frames. ], batch size: 177, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:12:31,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1693812.0, ans=0.1 2023-06-27 01:12:32,517 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.059e+02 5.917e+02 8.514e+02 1.468e+03 3.815e+03, threshold=1.703e+03, percent-clipped=5.0 2023-06-27 01:12:45,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1693872.0, ans=0.125 2023-06-27 01:13:10,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-27 01:13:37,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1693992.0, ans=0.0 2023-06-27 01:14:08,065 INFO [train.py:996] (1/4) Epoch 10, batch 7900, loss[loss=0.3009, simple_loss=0.398, pruned_loss=0.102, over 21490.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2871, pruned_loss=0.06889, over 4260399.70 frames. 
], batch size: 471, lr: 2.97e-03, grad_scale: 8.0 2023-06-27 01:14:08,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1694112.0, ans=0.125 2023-06-27 01:14:34,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1694172.0, ans=0.04949747468305833 2023-06-27 01:14:38,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1694172.0, ans=0.035 2023-06-27 01:14:40,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-27 01:14:42,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.24 vs. limit=22.5 2023-06-27 01:15:17,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-27 01:16:04,750 INFO [train.py:996] (1/4) Epoch 10, batch 7950, loss[loss=0.2053, simple_loss=0.2954, pruned_loss=0.0576, over 21909.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2934, pruned_loss=0.06793, over 4259900.47 frames. ], batch size: 316, lr: 2.97e-03, grad_scale: 8.0 2023-06-27 01:16:12,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1694412.0, ans=0.125 2023-06-27 01:16:16,929 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.966e+02 5.576e+02 7.742e+02 1.234e+03 3.670e+03, threshold=1.548e+03, percent-clipped=16.0 2023-06-27 01:16:51,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1694532.0, ans=0.02 2023-06-27 01:17:22,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1694592.0, ans=0.125 2023-06-27 01:17:49,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1694652.0, ans=0.0 2023-06-27 01:17:55,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1694712.0, ans=0.125 2023-06-27 01:17:56,291 INFO [train.py:996] (1/4) Epoch 10, batch 8000, loss[loss=0.2255, simple_loss=0.3082, pruned_loss=0.07135, over 21764.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2988, pruned_loss=0.07064, over 4260876.75 frames. ], batch size: 332, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:18:36,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1694772.0, ans=0.125 2023-06-27 01:18:46,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-27 01:19:57,656 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:20:02,338 INFO [train.py:996] (1/4) Epoch 10, batch 8050, loss[loss=0.2266, simple_loss=0.3129, pruned_loss=0.07016, over 21881.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.302, pruned_loss=0.07023, over 4262844.20 frames. 
], batch size: 317, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:20:08,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1695012.0, ans=0.05 2023-06-27 01:20:14,618 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 7.082e+02 1.044e+03 1.392e+03 2.627e+03, threshold=2.088e+03, percent-clipped=20.0 2023-06-27 01:20:41,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-27 01:20:50,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1695132.0, ans=0.125 2023-06-27 01:21:48,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1695252.0, ans=10.0 2023-06-27 01:21:51,394 INFO [train.py:996] (1/4) Epoch 10, batch 8100, loss[loss=0.2164, simple_loss=0.2954, pruned_loss=0.06871, over 21862.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2992, pruned_loss=0.07075, over 4268993.65 frames. ], batch size: 107, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:22:07,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1695312.0, ans=0.125 2023-06-27 01:22:36,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1695372.0, ans=0.125 2023-06-27 01:23:24,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1695492.0, ans=10.0 2023-06-27 01:23:49,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.07 vs. limit=6.0 2023-06-27 01:23:50,268 INFO [train.py:996] (1/4) Epoch 10, batch 8150, loss[loss=0.3114, simple_loss=0.4028, pruned_loss=0.11, over 21529.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3069, pruned_loss=0.07265, over 4270129.83 frames. ], batch size: 509, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:24:07,914 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.023e+02 5.816e+02 8.551e+02 1.587e+03 5.169e+03, threshold=1.710e+03, percent-clipped=17.0 2023-06-27 01:24:15,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1695672.0, ans=0.125 2023-06-27 01:24:17,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-27 01:25:11,268 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:25:14,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1695852.0, ans=0.125 2023-06-27 01:25:29,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.73 vs. 
limit=10.0 2023-06-27 01:25:35,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1695852.0, ans=0.125 2023-06-27 01:25:38,328 INFO [train.py:996] (1/4) Epoch 10, batch 8200, loss[loss=0.1836, simple_loss=0.2456, pruned_loss=0.06078, over 21514.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.302, pruned_loss=0.07119, over 4272045.62 frames. ], batch size: 195, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:26:03,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=15.0 2023-06-27 01:26:04,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1695972.0, ans=0.125 2023-06-27 01:26:04,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1695972.0, ans=0.0 2023-06-27 01:26:06,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1695972.0, ans=0.125 2023-06-27 01:27:32,702 INFO [train.py:996] (1/4) Epoch 10, batch 8250, loss[loss=0.2196, simple_loss=0.3146, pruned_loss=0.06228, over 21713.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2999, pruned_loss=0.07027, over 4277994.07 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:27:44,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.40 vs. limit=15.0 2023-06-27 01:27:44,586 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.725e+02 5.485e+02 7.641e+02 1.335e+03 2.771e+03, threshold=1.528e+03, percent-clipped=11.0 2023-06-27 01:27:53,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1696272.0, ans=0.2 2023-06-27 01:28:14,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1696332.0, ans=0.0 2023-06-27 01:28:58,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696452.0, ans=0.1 2023-06-27 01:29:11,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696452.0, ans=0.1 2023-06-27 01:29:21,568 INFO [train.py:996] (1/4) Epoch 10, batch 8300, loss[loss=0.2482, simple_loss=0.3294, pruned_loss=0.08347, over 21607.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2968, pruned_loss=0.06699, over 4274475.07 frames. ], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:31:11,522 INFO [train.py:996] (1/4) Epoch 10, batch 8350, loss[loss=0.1845, simple_loss=0.2694, pruned_loss=0.04979, over 21597.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2961, pruned_loss=0.06531, over 4263989.63 frames. 
], batch size: 263, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:31:19,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1696812.0, ans=0.125 2023-06-27 01:31:22,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1696812.0, ans=0.04949747468305833 2023-06-27 01:31:22,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1696812.0, ans=0.0 2023-06-27 01:31:23,480 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 5.774e+02 7.528e+02 1.140e+03 3.100e+03, threshold=1.506e+03, percent-clipped=11.0 2023-06-27 01:32:20,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1696992.0, ans=0.0 2023-06-27 01:33:01,164 INFO [train.py:996] (1/4) Epoch 10, batch 8400, loss[loss=0.1826, simple_loss=0.2779, pruned_loss=0.04367, over 21701.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2928, pruned_loss=0.06288, over 4261025.64 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:33:10,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1697112.0, ans=0.125 2023-06-27 01:33:37,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1697172.0, ans=0.1 2023-06-27 01:33:44,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1697232.0, ans=0.2 2023-06-27 01:33:55,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1697232.0, ans=0.025 2023-06-27 01:34:16,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1697292.0, ans=0.0 2023-06-27 01:34:48,801 INFO [train.py:996] (1/4) Epoch 10, batch 8450, loss[loss=0.2346, simple_loss=0.3006, pruned_loss=0.08429, over 21811.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2908, pruned_loss=0.0626, over 4271106.01 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:35:02,438 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.213e+02 7.215e+02 1.072e+03 1.642e+03 3.949e+03, threshold=2.143e+03, percent-clipped=30.0 2023-06-27 01:35:26,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1697472.0, ans=0.1 2023-06-27 01:35:46,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1697532.0, ans=0.07 2023-06-27 01:36:33,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-27 01:36:35,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=22.5 2023-06-27 01:36:37,993 INFO [train.py:996] (1/4) Epoch 10, batch 8500, loss[loss=0.1889, simple_loss=0.251, pruned_loss=0.06335, over 21641.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2872, pruned_loss=0.06353, over 4264936.08 frames. 
], batch size: 247, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:38:27,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1698012.0, ans=0.05 2023-06-27 01:38:28,159 INFO [train.py:996] (1/4) Epoch 10, batch 8550, loss[loss=0.2066, simple_loss=0.2737, pruned_loss=0.06974, over 21973.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2905, pruned_loss=0.06594, over 4268677.30 frames. ], batch size: 103, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:38:41,904 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.171e+02 1.011e+03 1.607e+03 3.555e+03, threshold=2.023e+03, percent-clipped=12.0 2023-06-27 01:38:44,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-27 01:38:47,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1698072.0, ans=0.125 2023-06-27 01:39:12,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1698132.0, ans=0.1 2023-06-27 01:39:23,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1698132.0, ans=0.125 2023-06-27 01:39:27,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-27 01:40:09,449 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:40:17,115 INFO [train.py:996] (1/4) Epoch 10, batch 8600, loss[loss=0.1745, simple_loss=0.2302, pruned_loss=0.05939, over 20775.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2968, pruned_loss=0.06848, over 4278904.47 frames. ], batch size: 609, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:40:21,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1698312.0, ans=0.0 2023-06-27 01:40:37,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1698312.0, ans=0.125 2023-06-27 01:41:42,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1698492.0, ans=0.2 2023-06-27 01:42:05,295 INFO [train.py:996] (1/4) Epoch 10, batch 8650, loss[loss=0.2365, simple_loss=0.3397, pruned_loss=0.06663, over 21607.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3026, pruned_loss=0.07012, over 4278603.30 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:42:24,799 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.451e+02 5.765e+02 7.630e+02 1.183e+03 2.009e+03, threshold=1.526e+03, percent-clipped=0.0 2023-06-27 01:42:41,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1698672.0, ans=0.125 2023-06-27 01:42:44,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1698672.0, ans=0.1 2023-06-27 01:43:14,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.48 vs. 
limit=10.0 2023-06-27 01:43:34,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1698852.0, ans=0.125 2023-06-27 01:43:50,336 INFO [train.py:996] (1/4) Epoch 10, batch 8700, loss[loss=0.2254, simple_loss=0.2797, pruned_loss=0.08557, over 21235.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2942, pruned_loss=0.06695, over 4284139.96 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:45:06,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1699092.0, ans=0.125 2023-06-27 01:45:29,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-27 01:45:39,142 INFO [train.py:996] (1/4) Epoch 10, batch 8750, loss[loss=0.2008, simple_loss=0.2723, pruned_loss=0.06462, over 21727.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2885, pruned_loss=0.06677, over 4280762.55 frames. ], batch size: 230, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:45:59,218 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.645e+02 6.087e+02 8.152e+02 1.140e+03 2.309e+03, threshold=1.630e+03, percent-clipped=11.0 2023-06-27 01:46:25,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1699332.0, ans=0.125 2023-06-27 01:46:31,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.22 vs. limit=15.0 2023-06-27 01:47:16,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699452.0, ans=0.1 2023-06-27 01:47:34,982 INFO [train.py:996] (1/4) Epoch 10, batch 8800, loss[loss=0.2864, simple_loss=0.3564, pruned_loss=0.1082, over 21774.00 frames. ], tot_loss[loss=0.217, simple_loss=0.296, pruned_loss=0.06903, over 4277567.84 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:48:03,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-27 01:48:34,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1699632.0, ans=0.2 2023-06-27 01:48:38,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1699632.0, ans=0.95 2023-06-27 01:48:50,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1699692.0, ans=0.125 2023-06-27 01:49:33,052 INFO [train.py:996] (1/4) Epoch 10, batch 8850, loss[loss=0.2178, simple_loss=0.3264, pruned_loss=0.05459, over 15796.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3022, pruned_loss=0.07046, over 4266868.06 frames. 
], batch size: 61, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:49:48,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.063e+02 5.642e+02 7.591e+02 1.245e+03 2.739e+03, threshold=1.518e+03, percent-clipped=14.0 2023-06-27 01:50:57,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1700052.0, ans=0.1 2023-06-27 01:50:58,313 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.16 vs. limit=15.0 2023-06-27 01:51:01,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1700052.0, ans=0.125 2023-06-27 01:51:22,892 INFO [train.py:996] (1/4) Epoch 10, batch 8900, loss[loss=0.2131, simple_loss=0.2829, pruned_loss=0.0716, over 15410.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2963, pruned_loss=0.06925, over 4259584.07 frames. ], batch size: 62, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:51:25,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1700112.0, ans=0.2 2023-06-27 01:51:42,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1700112.0, ans=0.2 2023-06-27 01:51:44,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1700112.0, ans=0.125 2023-06-27 01:51:51,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1700172.0, ans=0.125 2023-06-27 01:52:11,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1700232.0, ans=0.125 2023-06-27 01:52:21,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1700232.0, ans=0.125 2023-06-27 01:52:39,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1700292.0, ans=0.125 2023-06-27 01:52:55,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-27 01:52:56,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1700292.0, ans=0.2 2023-06-27 01:53:21,317 INFO [train.py:996] (1/4) Epoch 10, batch 8950, loss[loss=0.1903, simple_loss=0.3109, pruned_loss=0.03485, over 19791.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2968, pruned_loss=0.06797, over 4253746.27 frames. ], batch size: 702, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:53:27,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1700412.0, ans=0.125 2023-06-27 01:53:42,450 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.709e+02 6.064e+02 9.607e+02 1.976e+03 3.801e+03, threshold=1.921e+03, percent-clipped=34.0 2023-06-27 01:54:38,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1700592.0, ans=0.0 2023-06-27 01:55:09,679 INFO [train.py:996] (1/4) Epoch 10, batch 9000, loss[loss=0.2192, simple_loss=0.2899, pruned_loss=0.07421, over 21879.00 frames. 
], tot_loss[loss=0.2148, simple_loss=0.293, pruned_loss=0.06827, over 4250451.18 frames. ], batch size: 373, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:55:09,680 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 01:55:22,753 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.8559, 5.9903, 5.6648, 5.4702], device='cuda:1') 2023-06-27 01:55:24,545 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.0673, 5.2079, 2.3831, 4.7021], device='cuda:1') 2023-06-27 01:55:27,996 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2678, simple_loss=0.3533, pruned_loss=0.09113, over 1796401.00 frames. 2023-06-27 01:55:27,997 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 01:55:39,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1700712.0, ans=0.0 2023-06-27 01:55:44,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1700712.0, ans=0.1 2023-06-27 01:55:50,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1700772.0, ans=0.09899494936611666 2023-06-27 01:56:38,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-27 01:57:23,145 INFO [train.py:996] (1/4) Epoch 10, batch 9050, loss[loss=0.2204, simple_loss=0.3019, pruned_loss=0.06948, over 21732.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2876, pruned_loss=0.06511, over 4261114.48 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:57:45,755 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.665e+02 7.496e+02 1.289e+03 1.830e+03 3.310e+03, threshold=2.578e+03, percent-clipped=22.0 2023-06-27 01:58:02,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-27 01:58:18,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1701132.0, ans=0.015 2023-06-27 01:58:43,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1701192.0, ans=0.0 2023-06-27 01:59:07,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1701252.0, ans=0.125 2023-06-27 01:59:10,863 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:59:13,805 INFO [train.py:996] (1/4) Epoch 10, batch 9100, loss[loss=0.1732, simple_loss=0.2506, pruned_loss=0.04786, over 15597.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2922, pruned_loss=0.0675, over 4259324.29 frames. ], batch size: 60, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:59:38,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. 
limit=15.0 2023-06-27 01:59:45,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1701372.0, ans=0.0 2023-06-27 01:59:47,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-27 02:00:01,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1701432.0, ans=0.0 2023-06-27 02:00:03,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701432.0, ans=0.1 2023-06-27 02:00:50,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-27 02:01:09,262 INFO [train.py:996] (1/4) Epoch 10, batch 9150, loss[loss=0.2058, simple_loss=0.3019, pruned_loss=0.05483, over 21820.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.297, pruned_loss=0.06616, over 4265977.34 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:01:24,809 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.482e+02 5.209e+02 7.364e+02 1.147e+03 3.350e+03, threshold=1.473e+03, percent-clipped=3.0 2023-06-27 02:01:29,693 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2023-06-27 02:02:14,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701792.0, ans=0.1 2023-06-27 02:02:59,002 INFO [train.py:996] (1/4) Epoch 10, batch 9200, loss[loss=0.3028, simple_loss=0.3719, pruned_loss=0.1168, over 21467.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2983, pruned_loss=0.06522, over 4269358.20 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 02:03:14,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1701972.0, ans=0.125 2023-06-27 02:03:44,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-27 02:04:18,946 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-27 02:04:45,292 INFO [train.py:996] (1/4) Epoch 10, batch 9250, loss[loss=0.2547, simple_loss=0.3428, pruned_loss=0.08328, over 19785.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.3008, pruned_loss=0.06673, over 4267643.93 frames. 
], batch size: 702, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:05:02,708 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 6.299e+02 8.423e+02 1.393e+03 3.022e+03, threshold=1.685e+03, percent-clipped=24.0 2023-06-27 02:05:30,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1702332.0, ans=0.125 2023-06-27 02:05:40,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1702332.0, ans=0.125 2023-06-27 02:06:11,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1702392.0, ans=0.2 2023-06-27 02:06:23,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.80 vs. limit=10.0 2023-06-27 02:06:35,089 INFO [train.py:996] (1/4) Epoch 10, batch 9300, loss[loss=0.259, simple_loss=0.3491, pruned_loss=0.08443, over 21612.00 frames. ], tot_loss[loss=0.214, simple_loss=0.295, pruned_loss=0.06653, over 4260507.49 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:06:39,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1702512.0, ans=0.07 2023-06-27 02:06:57,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702572.0, ans=0.1 2023-06-27 02:07:16,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-27 02:07:54,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702692.0, ans=0.1 2023-06-27 02:08:10,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1702752.0, ans=0.0 2023-06-27 02:08:19,215 INFO [train.py:996] (1/4) Epoch 10, batch 9350, loss[loss=0.2201, simple_loss=0.3083, pruned_loss=0.06599, over 21316.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3028, pruned_loss=0.06844, over 4253800.58 frames. ], batch size: 176, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:08:47,198 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.895e+02 6.669e+02 9.528e+02 1.719e+03 4.361e+03, threshold=1.906e+03, percent-clipped=26.0 2023-06-27 02:09:05,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1702872.0, ans=0.0 2023-06-27 02:09:28,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1702932.0, ans=0.0 2023-06-27 02:09:30,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1702932.0, ans=0.125 2023-06-27 02:09:46,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1702992.0, ans=0.0 2023-06-27 02:09:57,184 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. 
limit=15.0 2023-06-27 02:10:18,915 INFO [train.py:996] (1/4) Epoch 10, batch 9400, loss[loss=0.2093, simple_loss=0.2736, pruned_loss=0.07248, over 21278.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3035, pruned_loss=0.0691, over 4262036.36 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:10:24,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1703112.0, ans=0.2 2023-06-27 02:10:57,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1703232.0, ans=0.04949747468305833 2023-06-27 02:11:56,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1703352.0, ans=0.125 2023-06-27 02:12:05,133 INFO [train.py:996] (1/4) Epoch 10, batch 9450, loss[loss=0.1899, simple_loss=0.2585, pruned_loss=0.06062, over 21644.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2952, pruned_loss=0.06773, over 4257075.26 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:12:09,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1703412.0, ans=0.125 2023-06-27 02:12:22,346 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.110e+02 5.502e+02 7.576e+02 1.129e+03 2.324e+03, threshold=1.515e+03, percent-clipped=5.0 2023-06-27 02:12:44,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-27 02:13:22,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1703592.0, ans=0.1 2023-06-27 02:13:52,554 INFO [train.py:996] (1/4) Epoch 10, batch 9500, loss[loss=0.2285, simple_loss=0.2897, pruned_loss=0.08359, over 21417.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.288, pruned_loss=0.06649, over 4251398.50 frames. ], batch size: 508, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:14:00,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1703712.0, ans=0.125 2023-06-27 02:14:55,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1703892.0, ans=0.1 2023-06-27 02:15:20,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-27 02:15:42,660 INFO [train.py:996] (1/4) Epoch 10, batch 9550, loss[loss=0.2868, simple_loss=0.3458, pruned_loss=0.1139, over 21441.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2935, pruned_loss=0.06884, over 4248441.29 frames. ], batch size: 471, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:16:04,745 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.617e+02 9.297e+02 1.429e+03 3.226e+03, threshold=1.859e+03, percent-clipped=22.0 2023-06-27 02:16:30,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1704132.0, ans=0.0 2023-06-27 02:16:37,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=15.0 2023-06-27 02:17:14,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1704252.0, ans=0.125 2023-06-27 02:17:29,878 INFO [train.py:996] (1/4) Epoch 10, batch 9600, loss[loss=0.2162, simple_loss=0.2913, pruned_loss=0.0706, over 21878.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2987, pruned_loss=0.07031, over 4248258.78 frames. ], batch size: 118, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:17:39,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1704312.0, ans=0.0 2023-06-27 02:19:26,583 INFO [train.py:996] (1/4) Epoch 10, batch 9650, loss[loss=0.2525, simple_loss=0.3279, pruned_loss=0.08857, over 21743.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2978, pruned_loss=0.06997, over 4255302.42 frames. ], batch size: 332, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:19:34,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704612.0, ans=0.1 2023-06-27 02:19:45,810 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 6.257e+02 8.564e+02 1.301e+03 2.812e+03, threshold=1.713e+03, percent-clipped=7.0 2023-06-27 02:20:47,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-06-27 02:20:50,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-27 02:20:59,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-27 02:21:15,572 INFO [train.py:996] (1/4) Epoch 10, batch 9700, loss[loss=0.2088, simple_loss=0.2868, pruned_loss=0.06534, over 21911.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3022, pruned_loss=0.07015, over 4257426.31 frames. ], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:21:43,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704972.0, ans=0.1 2023-06-27 02:21:55,567 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-27 02:21:58,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1705032.0, ans=0.125 2023-06-27 02:22:18,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=15.0 2023-06-27 02:22:40,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705152.0, ans=0.1 2023-06-27 02:23:03,762 INFO [train.py:996] (1/4) Epoch 10, batch 9750, loss[loss=0.1842, simple_loss=0.2479, pruned_loss=0.06025, over 21128.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2969, pruned_loss=0.06921, over 4258859.33 frames. 
], batch size: 159, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:23:27,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.191e+02 6.700e+02 1.068e+03 1.546e+03 3.673e+03, threshold=2.135e+03, percent-clipped=19.0 2023-06-27 02:24:35,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1705452.0, ans=0.125 2023-06-27 02:24:45,102 INFO [train.py:996] (1/4) Epoch 10, batch 9800, loss[loss=0.1982, simple_loss=0.2719, pruned_loss=0.06223, over 21671.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2948, pruned_loss=0.0689, over 4259158.34 frames. ], batch size: 263, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:24:50,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705512.0, ans=0.1 2023-06-27 02:24:52,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1705512.0, ans=0.125 2023-06-27 02:24:57,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0 2023-06-27 02:25:55,004 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-27 02:26:22,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1705752.0, ans=0.125 2023-06-27 02:26:38,261 INFO [train.py:996] (1/4) Epoch 10, batch 9850, loss[loss=0.1929, simple_loss=0.2623, pruned_loss=0.06174, over 21732.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2928, pruned_loss=0.06947, over 4268377.03 frames. ], batch size: 264, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:26:47,955 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.56 vs. limit=6.0 2023-06-27 02:27:02,350 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.847e+02 5.295e+02 7.367e+02 1.134e+03 2.701e+03, threshold=1.473e+03, percent-clipped=3.0 2023-06-27 02:27:20,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1705932.0, ans=0.125 2023-06-27 02:27:31,732 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.11 vs. limit=22.5 2023-06-27 02:27:36,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1705992.0, ans=0.05 2023-06-27 02:28:08,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1706052.0, ans=0.125 2023-06-27 02:28:26,485 INFO [train.py:996] (1/4) Epoch 10, batch 9900, loss[loss=0.2032, simple_loss=0.2853, pruned_loss=0.06058, over 19884.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2895, pruned_loss=0.06921, over 4245238.90 frames. 
], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:28:46,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1706112.0, ans=0.125 2023-06-27 02:29:03,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1706172.0, ans=0.035 2023-06-27 02:29:43,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1706292.0, ans=0.2 2023-06-27 02:30:14,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1706412.0, ans=0.125 2023-06-27 02:30:15,217 INFO [train.py:996] (1/4) Epoch 10, batch 9950, loss[loss=0.208, simple_loss=0.2696, pruned_loss=0.07325, over 21925.00 frames. ], tot_loss[loss=0.215, simple_loss=0.29, pruned_loss=0.07, over 4242220.54 frames. ], batch size: 373, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:30:39,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.546e+02 9.078e+02 1.320e+03 2.583e+03, threshold=1.816e+03, percent-clipped=18.0 2023-06-27 02:31:21,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1706592.0, ans=0.125 2023-06-27 02:31:47,658 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:31:49,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=15.0 2023-06-27 02:31:51,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1706652.0, ans=0.125 2023-06-27 02:31:59,302 INFO [train.py:996] (1/4) Epoch 10, batch 10000, loss[loss=0.2548, simple_loss=0.3278, pruned_loss=0.09088, over 21788.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.286, pruned_loss=0.06931, over 4245444.18 frames. ], batch size: 124, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:32:06,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1706712.0, ans=0.125 2023-06-27 02:32:42,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1706832.0, ans=0.1 2023-06-27 02:33:22,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1706892.0, ans=0.1 2023-06-27 02:33:57,199 INFO [train.py:996] (1/4) Epoch 10, batch 10050, loss[loss=0.2127, simple_loss=0.2962, pruned_loss=0.0646, over 21367.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.288, pruned_loss=0.06941, over 4249770.54 frames. ], batch size: 131, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:34:06,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.53 vs. 
limit=15.0 2023-06-27 02:34:16,292 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 5.853e+02 8.209e+02 1.305e+03 2.955e+03, threshold=1.642e+03, percent-clipped=12.0 2023-06-27 02:35:14,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1707192.0, ans=0.0 2023-06-27 02:35:21,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1707252.0, ans=0.125 2023-06-27 02:35:45,573 INFO [train.py:996] (1/4) Epoch 10, batch 10100, loss[loss=0.174, simple_loss=0.2288, pruned_loss=0.05965, over 20778.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2867, pruned_loss=0.0684, over 4258317.53 frames. ], batch size: 608, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:35:53,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1707312.0, ans=0.0 2023-06-27 02:36:49,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1707432.0, ans=0.2 2023-06-27 02:36:55,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-27 02:36:58,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-27 02:37:33,971 INFO [train.py:996] (1/4) Epoch 10, batch 10150, loss[loss=0.2466, simple_loss=0.3182, pruned_loss=0.08746, over 21384.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2911, pruned_loss=0.07048, over 4261546.71 frames. ], batch size: 471, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:38:02,106 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.860e+02 5.691e+02 7.969e+02 1.243e+03 2.132e+03, threshold=1.594e+03, percent-clipped=9.0 2023-06-27 02:38:17,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-27 02:38:32,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1707732.0, ans=0.015 2023-06-27 02:38:57,505 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.93 vs. limit=15.0 2023-06-27 02:39:22,073 INFO [train.py:996] (1/4) Epoch 10, batch 10200, loss[loss=0.1796, simple_loss=0.2679, pruned_loss=0.04563, over 21223.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2902, pruned_loss=0.06816, over 4263242.42 frames. 
], batch size: 176, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:39:24,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1707912.0, ans=0.125 2023-06-27 02:39:28,167 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:40:00,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1707972.0, ans=0.125 2023-06-27 02:40:08,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1708032.0, ans=0.125 2023-06-27 02:40:32,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1708092.0, ans=0.2 2023-06-27 02:41:10,224 INFO [train.py:996] (1/4) Epoch 10, batch 10250, loss[loss=0.2202, simple_loss=0.3044, pruned_loss=0.06803, over 21210.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2861, pruned_loss=0.0632, over 4272267.24 frames. ], batch size: 143, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:41:12,568 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:41:44,122 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.003e+02 5.121e+02 6.832e+02 1.019e+03 2.987e+03, threshold=1.366e+03, percent-clipped=4.0 2023-06-27 02:41:48,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1708272.0, ans=0.0 2023-06-27 02:41:55,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1708332.0, ans=0.0 2023-06-27 02:42:27,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1708392.0, ans=0.0 2023-06-27 02:42:43,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=22.5 2023-06-27 02:42:53,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1708452.0, ans=0.1 2023-06-27 02:43:03,416 INFO [train.py:996] (1/4) Epoch 10, batch 10300, loss[loss=0.2129, simple_loss=0.3068, pruned_loss=0.05946, over 21762.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2879, pruned_loss=0.06405, over 4274905.69 frames. ], batch size: 247, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:43:28,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1708512.0, ans=0.125 2023-06-27 02:44:03,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1708632.0, ans=0.1 2023-06-27 02:44:08,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-27 02:44:20,731 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-27 02:45:06,405 INFO [train.py:996] (1/4) Epoch 10, batch 10350, loss[loss=0.2787, simple_loss=0.3704, pruned_loss=0.0935, over 21475.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2929, pruned_loss=0.06524, over 4272451.94 frames. 
], batch size: 471, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:45:35,571 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.505e+02 7.876e+02 1.206e+03 1.704e+03 3.503e+03, threshold=2.411e+03, percent-clipped=40.0 2023-06-27 02:45:51,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1708932.0, ans=0.0 2023-06-27 02:45:53,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1708932.0, ans=0.125 2023-06-27 02:46:36,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1709052.0, ans=0.0 2023-06-27 02:46:51,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1709052.0, ans=0.125 2023-06-27 02:46:57,627 INFO [train.py:996] (1/4) Epoch 10, batch 10400, loss[loss=0.1467, simple_loss=0.2003, pruned_loss=0.04658, over 21414.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2864, pruned_loss=0.06432, over 4261758.48 frames. ], batch size: 131, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:47:48,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1709232.0, ans=0.0 2023-06-27 02:47:55,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1709232.0, ans=0.125 2023-06-27 02:48:52,945 INFO [train.py:996] (1/4) Epoch 10, batch 10450, loss[loss=0.2358, simple_loss=0.3313, pruned_loss=0.07022, over 20757.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2905, pruned_loss=0.06742, over 4265257.15 frames. ], batch size: 608, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:49:21,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.083e+02 7.279e+02 1.026e+03 1.542e+03 3.103e+03, threshold=2.052e+03, percent-clipped=9.0 2023-06-27 02:49:32,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1709532.0, ans=0.2 2023-06-27 02:49:39,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1709532.0, ans=0.1 2023-06-27 02:50:08,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1709592.0, ans=0.1 2023-06-27 02:50:41,343 INFO [train.py:996] (1/4) Epoch 10, batch 10500, loss[loss=0.208, simple_loss=0.2752, pruned_loss=0.07038, over 21749.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2892, pruned_loss=0.06562, over 4263012.62 frames. 
], batch size: 351, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:50:56,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1709712.0, ans=0.2 2023-06-27 02:51:30,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1709832.0, ans=0.125 2023-06-27 02:52:02,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1709892.0, ans=0.2 2023-06-27 02:52:08,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1709952.0, ans=0.0 2023-06-27 02:52:28,654 INFO [train.py:996] (1/4) Epoch 10, batch 10550, loss[loss=0.1735, simple_loss=0.2407, pruned_loss=0.05317, over 21632.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2855, pruned_loss=0.06447, over 4239699.67 frames. ], batch size: 231, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:52:41,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-27 02:52:55,939 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.583e+02 5.517e+02 8.817e+02 1.298e+03 2.428e+03, threshold=1.763e+03, percent-clipped=4.0 2023-06-27 02:53:02,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1710072.0, ans=0.2 2023-06-27 02:53:43,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1710192.0, ans=0.2 2023-06-27 02:54:16,526 INFO [train.py:996] (1/4) Epoch 10, batch 10600, loss[loss=0.1813, simple_loss=0.2676, pruned_loss=0.04747, over 21464.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2792, pruned_loss=0.06301, over 4252260.68 frames. ], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:54:29,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1710312.0, ans=0.125 2023-06-27 02:54:48,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1710372.0, ans=0.0 2023-06-27 02:55:47,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1710492.0, ans=0.125 2023-06-27 02:55:57,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1710552.0, ans=0.125 2023-06-27 02:56:13,019 INFO [train.py:996] (1/4) Epoch 10, batch 10650, loss[loss=0.1662, simple_loss=0.2475, pruned_loss=0.04245, over 21680.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2806, pruned_loss=0.06222, over 4256783.16 frames. 
], batch size: 247, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:56:29,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1710672.0, ans=0.125 2023-06-27 02:56:35,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 6.303e+02 9.847e+02 1.673e+03 3.050e+03, threshold=1.969e+03, percent-clipped=22.0 2023-06-27 02:56:38,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1710672.0, ans=0.1 2023-06-27 02:56:58,841 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:57:18,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1710792.0, ans=0.125 2023-06-27 02:57:51,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.36 vs. limit=15.0 2023-06-27 02:58:01,506 INFO [train.py:996] (1/4) Epoch 10, batch 10700, loss[loss=0.2253, simple_loss=0.2996, pruned_loss=0.0755, over 21637.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2793, pruned_loss=0.06193, over 4264444.36 frames. ], batch size: 263, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:59:02,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-27 02:59:51,989 INFO [train.py:996] (1/4) Epoch 10, batch 10750, loss[loss=0.2562, simple_loss=0.3483, pruned_loss=0.08204, over 20706.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2913, pruned_loss=0.06629, over 4264034.92 frames. ], batch size: 607, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:59:59,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1711212.0, ans=0.125 2023-06-27 03:00:21,240 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.422e+02 6.069e+02 8.010e+02 1.380e+03 3.013e+03, threshold=1.602e+03, percent-clipped=10.0 2023-06-27 03:00:21,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1711272.0, ans=0.125 2023-06-27 03:00:52,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1711332.0, ans=0.125 2023-06-27 03:01:19,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-27 03:01:41,465 INFO [train.py:996] (1/4) Epoch 10, batch 10800, loss[loss=0.2347, simple_loss=0.313, pruned_loss=0.07822, over 21727.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2963, pruned_loss=0.06712, over 4266104.26 frames. 
], batch size: 332, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:01:58,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711512.0, ans=0.1 2023-06-27 03:02:01,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711512.0, ans=0.1 2023-06-27 03:02:17,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1711572.0, ans=15.0 2023-06-27 03:02:27,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1711572.0, ans=15.0 2023-06-27 03:03:14,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1711752.0, ans=0.0 2023-06-27 03:03:23,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1711752.0, ans=0.0 2023-06-27 03:03:30,045 INFO [train.py:996] (1/4) Epoch 10, batch 10850, loss[loss=0.1796, simple_loss=0.2519, pruned_loss=0.05365, over 21537.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2957, pruned_loss=0.06787, over 4268661.90 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:04:05,452 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.119e+02 5.251e+02 7.747e+02 1.275e+03 2.663e+03, threshold=1.549e+03, percent-clipped=11.0 2023-06-27 03:04:19,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1711932.0, ans=0.125 2023-06-27 03:04:42,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1711992.0, ans=0.2 2023-06-27 03:05:06,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1712052.0, ans=0.05 2023-06-27 03:05:08,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1712052.0, ans=0.125 2023-06-27 03:05:23,779 INFO [train.py:996] (1/4) Epoch 10, batch 10900, loss[loss=0.2367, simple_loss=0.3596, pruned_loss=0.05692, over 20802.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2906, pruned_loss=0.066, over 4269751.70 frames. ], batch size: 607, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:06:17,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1712232.0, ans=0.125 2023-06-27 03:06:35,188 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-27 03:06:42,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-27 03:07:12,334 INFO [train.py:996] (1/4) Epoch 10, batch 10950, loss[loss=0.215, simple_loss=0.2736, pruned_loss=0.07822, over 21242.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2869, pruned_loss=0.06401, over 4267374.21 frames. 
], batch size: 471, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:07:27,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1712412.0, ans=0.125 2023-06-27 03:07:27,538 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:07:46,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1712472.0, ans=0.2 2023-06-27 03:07:48,602 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.904e+02 6.171e+02 9.007e+02 1.291e+03 2.424e+03, threshold=1.801e+03, percent-clipped=14.0 2023-06-27 03:08:58,756 INFO [train.py:996] (1/4) Epoch 10, batch 11000, loss[loss=0.222, simple_loss=0.2929, pruned_loss=0.07559, over 21838.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2848, pruned_loss=0.06467, over 4275513.92 frames. ], batch size: 107, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:09:01,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1712712.0, ans=0.2 2023-06-27 03:09:34,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1712772.0, ans=10.0 2023-06-27 03:09:49,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1712832.0, ans=0.2 2023-06-27 03:10:11,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1712892.0, ans=0.125 2023-06-27 03:10:16,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1712892.0, ans=0.0 2023-06-27 03:10:36,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-27 03:10:46,678 INFO [train.py:996] (1/4) Epoch 10, batch 11050, loss[loss=0.2022, simple_loss=0.2608, pruned_loss=0.07175, over 21677.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2817, pruned_loss=0.06598, over 4269824.57 frames. ], batch size: 416, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:10:49,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1713012.0, ans=0.125 2023-06-27 03:11:22,036 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.001e+02 5.814e+02 8.503e+02 1.206e+03 2.810e+03, threshold=1.701e+03, percent-clipped=7.0 2023-06-27 03:11:45,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.04 vs. limit=15.0 2023-06-27 03:12:33,196 INFO [train.py:996] (1/4) Epoch 10, batch 11100, loss[loss=0.1966, simple_loss=0.2719, pruned_loss=0.06071, over 21666.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2801, pruned_loss=0.06557, over 4258477.04 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:12:57,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-27 03:13:30,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. 
limit=15.0 2023-06-27 03:13:36,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1713432.0, ans=0.0 2023-06-27 03:14:17,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1713552.0, ans=0.1 2023-06-27 03:14:22,307 INFO [train.py:996] (1/4) Epoch 10, batch 11150, loss[loss=0.224, simple_loss=0.3019, pruned_loss=0.07301, over 20690.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2792, pruned_loss=0.06577, over 4265287.83 frames. ], batch size: 608, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:14:40,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-27 03:14:58,532 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.769e+02 5.768e+02 8.894e+02 1.400e+03 2.503e+03, threshold=1.779e+03, percent-clipped=10.0 2023-06-27 03:15:14,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1713732.0, ans=0.0 2023-06-27 03:15:19,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1713732.0, ans=0.035 2023-06-27 03:15:26,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-27 03:15:46,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1713852.0, ans=0.07 2023-06-27 03:16:00,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1713852.0, ans=0.04949747468305833 2023-06-27 03:16:08,617 INFO [train.py:996] (1/4) Epoch 10, batch 11200, loss[loss=0.1902, simple_loss=0.2411, pruned_loss=0.06962, over 20244.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2775, pruned_loss=0.06499, over 4265610.35 frames. ], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:16:11,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.30 vs. limit=10.0 2023-06-27 03:16:37,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-27 03:17:03,988 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-27 03:17:55,858 INFO [train.py:996] (1/4) Epoch 10, batch 11250, loss[loss=0.2121, simple_loss=0.2953, pruned_loss=0.06446, over 21659.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2759, pruned_loss=0.06518, over 4268726.37 frames. 
], batch size: 389, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:18:26,749 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 5.382e+02 8.145e+02 1.130e+03 2.477e+03, threshold=1.629e+03, percent-clipped=7.0 2023-06-27 03:18:52,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1714332.0, ans=0.125 2023-06-27 03:18:54,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1714332.0, ans=0.125 2023-06-27 03:19:15,393 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=12.0 2023-06-27 03:19:18,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1714452.0, ans=0.0 2023-06-27 03:19:35,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1714452.0, ans=0.125 2023-06-27 03:19:37,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1714512.0, ans=0.0 2023-06-27 03:19:38,926 INFO [train.py:996] (1/4) Epoch 10, batch 11300, loss[loss=0.2031, simple_loss=0.2803, pruned_loss=0.06291, over 21700.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2775, pruned_loss=0.06487, over 4272586.88 frames. ], batch size: 389, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:20:06,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-27 03:20:16,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1714572.0, ans=0.125 2023-06-27 03:20:29,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1714632.0, ans=0.125 2023-06-27 03:21:11,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1714752.0, ans=0.0 2023-06-27 03:21:22,946 INFO [train.py:996] (1/4) Epoch 10, batch 11350, loss[loss=0.1936, simple_loss=0.2679, pruned_loss=0.05967, over 21291.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2792, pruned_loss=0.06435, over 4275816.87 frames. ], batch size: 143, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:22:00,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 5.912e+02 7.672e+02 1.183e+03 2.053e+03, threshold=1.534e+03, percent-clipped=10.0 2023-06-27 03:22:15,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-27 03:22:46,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1714992.0, ans=0.125 2023-06-27 03:23:12,852 INFO [train.py:996] (1/4) Epoch 10, batch 11400, loss[loss=0.1872, simple_loss=0.2698, pruned_loss=0.05228, over 21327.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2845, pruned_loss=0.06704, over 4273572.92 frames. 
], batch size: 159, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:23:34,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1715112.0, ans=0.0 2023-06-27 03:23:45,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1715172.0, ans=10.0 2023-06-27 03:24:02,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1715232.0, ans=0.0 2023-06-27 03:24:09,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715232.0, ans=0.1 2023-06-27 03:24:11,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1715232.0, ans=0.125 2023-06-27 03:24:22,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1715292.0, ans=0.0 2023-06-27 03:24:58,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.81 vs. limit=6.0 2023-06-27 03:25:07,605 INFO [train.py:996] (1/4) Epoch 10, batch 11450, loss[loss=0.2181, simple_loss=0.2828, pruned_loss=0.07671, over 20061.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2861, pruned_loss=0.06595, over 4278671.16 frames. ], batch size: 707, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:25:18,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1715412.0, ans=0.125 2023-06-27 03:25:33,690 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 7.490e+02 1.068e+03 1.427e+03 2.700e+03, threshold=2.136e+03, percent-clipped=19.0 2023-06-27 03:26:19,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1715592.0, ans=0.0 2023-06-27 03:26:35,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1715652.0, ans=0.0 2023-06-27 03:26:50,406 INFO [train.py:996] (1/4) Epoch 10, batch 11500, loss[loss=0.2117, simple_loss=0.3126, pruned_loss=0.05545, over 21853.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.289, pruned_loss=0.0666, over 4283086.11 frames. ], batch size: 371, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:27:12,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1715772.0, ans=0.125 2023-06-27 03:27:22,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1715772.0, ans=0.0 2023-06-27 03:27:46,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-27 03:28:01,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1715892.0, ans=0.125 2023-06-27 03:28:11,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1715892.0, ans=0.035 2023-06-27 03:28:45,002 INFO [train.py:996] (1/4) Epoch 10, batch 11550, loss[loss=0.2819, simple_loss=0.385, pruned_loss=0.08937, over 21693.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2947, pruned_loss=0.06715, over 4281736.06 frames. 
], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:29:07,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.07 vs. limit=10.0 2023-06-27 03:29:17,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.702e+02 7.297e+02 1.033e+03 1.557e+03 3.418e+03, threshold=2.066e+03, percent-clipped=10.0 2023-06-27 03:30:27,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1716252.0, ans=0.125 2023-06-27 03:30:32,991 INFO [train.py:996] (1/4) Epoch 10, batch 11600, loss[loss=0.216, simple_loss=0.3103, pruned_loss=0.06092, over 21871.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3092, pruned_loss=0.0694, over 4275801.01 frames. ], batch size: 118, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:31:17,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-27 03:31:25,052 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-27 03:31:28,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.71 vs. limit=22.5 2023-06-27 03:31:33,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=22.5 2023-06-27 03:31:45,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1716492.0, ans=0.125 2023-06-27 03:32:05,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1716552.0, ans=0.125 2023-06-27 03:32:07,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1716552.0, ans=0.125 2023-06-27 03:32:20,568 INFO [train.py:996] (1/4) Epoch 10, batch 11650, loss[loss=0.2462, simple_loss=0.3489, pruned_loss=0.0717, over 21721.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3155, pruned_loss=0.06974, over 4269213.28 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:32:52,968 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.091e+02 7.350e+02 9.956e+02 1.670e+03 3.528e+03, threshold=1.991e+03, percent-clipped=18.0 2023-06-27 03:33:10,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-27 03:33:20,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1716732.0, ans=0.0 2023-06-27 03:33:44,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1716792.0, ans=0.1 2023-06-27 03:34:07,064 INFO [train.py:996] (1/4) Epoch 10, batch 11700, loss[loss=0.2226, simple_loss=0.3035, pruned_loss=0.07089, over 20015.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3071, pruned_loss=0.06919, over 4266734.28 frames. 
], batch size: 702, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:34:56,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-27 03:35:15,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1717092.0, ans=0.125 2023-06-27 03:35:47,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1717152.0, ans=0.125 2023-06-27 03:35:53,401 INFO [train.py:996] (1/4) Epoch 10, batch 11750, loss[loss=0.1968, simple_loss=0.2687, pruned_loss=0.06241, over 21778.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.298, pruned_loss=0.06829, over 4275210.00 frames. ], batch size: 118, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:35:55,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1717212.0, ans=0.2 2023-06-27 03:36:26,192 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.050e+02 5.774e+02 7.571e+02 1.065e+03 1.774e+03, threshold=1.514e+03, percent-clipped=0.0 2023-06-27 03:36:26,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1717272.0, ans=0.1 2023-06-27 03:37:42,105 INFO [train.py:996] (1/4) Epoch 10, batch 11800, loss[loss=0.2314, simple_loss=0.3322, pruned_loss=0.06534, over 21921.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2985, pruned_loss=0.06995, over 4268898.75 frames. ], batch size: 372, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:38:04,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1717572.0, ans=0.1 2023-06-27 03:39:02,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-06-27 03:39:08,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1717692.0, ans=0.125 2023-06-27 03:39:12,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1717752.0, ans=0.0 2023-06-27 03:39:29,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1717812.0, ans=0.0 2023-06-27 03:39:30,350 INFO [train.py:996] (1/4) Epoch 10, batch 11850, loss[loss=0.1977, simple_loss=0.2939, pruned_loss=0.05074, over 21657.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2994, pruned_loss=0.06876, over 4280573.00 frames. 
], batch size: 263, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:40:01,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1717872.0, ans=0.0 2023-06-27 03:40:09,300 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.078e+02 6.779e+02 9.644e+02 1.423e+03 2.292e+03, threshold=1.929e+03, percent-clipped=21.0 2023-06-27 03:40:45,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1717992.0, ans=0.0 2023-06-27 03:41:08,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1718052.0, ans=0.125 2023-06-27 03:41:25,953 INFO [train.py:996] (1/4) Epoch 10, batch 11900, loss[loss=0.2465, simple_loss=0.3532, pruned_loss=0.06989, over 19731.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3016, pruned_loss=0.06705, over 4279221.86 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:42:29,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1718232.0, ans=0.125 2023-06-27 03:43:15,227 INFO [train.py:996] (1/4) Epoch 10, batch 11950, loss[loss=0.1938, simple_loss=0.2935, pruned_loss=0.047, over 21812.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.3024, pruned_loss=0.06511, over 4281463.30 frames. ], batch size: 371, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:43:27,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=12.0 2023-06-27 03:43:32,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1718412.0, ans=0.125 2023-06-27 03:43:53,620 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.803e+02 5.577e+02 8.393e+02 1.338e+03 3.088e+03, threshold=1.679e+03, percent-clipped=11.0 2023-06-27 03:44:32,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1718592.0, ans=0.2 2023-06-27 03:44:41,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1718592.0, ans=0.0 2023-06-27 03:44:50,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2023-06-27 03:45:02,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1718712.0, ans=0.0 2023-06-27 03:45:09,404 INFO [train.py:996] (1/4) Epoch 10, batch 12000, loss[loss=0.2073, simple_loss=0.28, pruned_loss=0.06723, over 21973.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.295, pruned_loss=0.06345, over 4275446.12 frames. ], batch size: 103, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:45:09,405 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 03:45:30,590 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2595, simple_loss=0.3509, pruned_loss=0.08412, over 1796401.00 frames. 2023-06-27 03:45:30,591 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 03:46:36,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1718892.0, ans=0.2 2023-06-27 03:47:18,628 INFO [train.py:996] (1/4) Epoch 10, batch 12050, loss[loss=0.2163, simple_loss=0.2808, pruned_loss=0.07585, over 21306.00 frames. 
], tot_loss[loss=0.2099, simple_loss=0.2906, pruned_loss=0.06457, over 4276341.73 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:47:53,491 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 6.182e+02 8.249e+02 1.335e+03 3.065e+03, threshold=1.650e+03, percent-clipped=10.0 2023-06-27 03:48:29,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1719192.0, ans=0.125 2023-06-27 03:48:47,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1719252.0, ans=0.0 2023-06-27 03:48:48,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1719252.0, ans=10.0 2023-06-27 03:48:50,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1719252.0, ans=0.1 2023-06-27 03:49:08,216 INFO [train.py:996] (1/4) Epoch 10, batch 12100, loss[loss=0.2689, simple_loss=0.3343, pruned_loss=0.1018, over 21484.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2946, pruned_loss=0.0678, over 4279776.23 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:49:30,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1719372.0, ans=0.0 2023-06-27 03:49:39,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-27 03:50:22,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1719492.0, ans=0.125 2023-06-27 03:51:06,064 INFO [train.py:996] (1/4) Epoch 10, batch 12150, loss[loss=0.2532, simple_loss=0.3527, pruned_loss=0.07684, over 21499.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2987, pruned_loss=0.06841, over 4270419.47 frames. ], batch size: 471, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:51:38,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1719672.0, ans=0.125 2023-06-27 03:51:40,995 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.275e+02 6.507e+02 9.290e+02 1.300e+03 3.036e+03, threshold=1.858e+03, percent-clipped=15.0 2023-06-27 03:51:54,738 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-27 03:52:29,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-27 03:52:50,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1719852.0, ans=0.125 2023-06-27 03:52:52,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1719912.0, ans=0.125 2023-06-27 03:52:53,534 INFO [train.py:996] (1/4) Epoch 10, batch 12200, loss[loss=0.2074, simple_loss=0.2691, pruned_loss=0.07291, over 21698.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2956, pruned_loss=0.06723, over 4269521.31 frames. 
], batch size: 124, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:53:15,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1719972.0, ans=0.0 2023-06-27 03:53:32,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1720032.0, ans=0.0 2023-06-27 03:53:58,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-27 03:54:31,075 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:54:40,553 INFO [train.py:996] (1/4) Epoch 10, batch 12250, loss[loss=0.1324, simple_loss=0.1983, pruned_loss=0.03327, over 21766.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2884, pruned_loss=0.0643, over 4262072.50 frames. ], batch size: 107, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:55:08,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1720272.0, ans=0.0 2023-06-27 03:55:14,847 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.738e+02 5.320e+02 7.726e+02 1.159e+03 2.410e+03, threshold=1.545e+03, percent-clipped=8.0 2023-06-27 03:55:32,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1720332.0, ans=0.125 2023-06-27 03:55:45,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=22.5 2023-06-27 03:56:27,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1720512.0, ans=0.2 2023-06-27 03:56:28,148 INFO [train.py:996] (1/4) Epoch 10, batch 12300, loss[loss=0.226, simple_loss=0.3372, pruned_loss=0.05738, over 21210.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.282, pruned_loss=0.06053, over 4262279.75 frames. ], batch size: 548, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:56:28,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1720512.0, ans=0.025 2023-06-27 03:58:13,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1720752.0, ans=0.0 2023-06-27 03:58:16,051 INFO [train.py:996] (1/4) Epoch 10, batch 12350, loss[loss=0.2073, simple_loss=0.284, pruned_loss=0.06528, over 21465.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2847, pruned_loss=0.06096, over 4265918.57 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:58:50,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. 
limit=6.0 2023-06-27 03:58:50,873 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.592e+02 6.371e+02 1.042e+03 1.645e+03 3.511e+03, threshold=2.083e+03, percent-clipped=28.0 2023-06-27 03:59:27,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1720992.0, ans=0.125 2023-06-27 03:59:29,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1720992.0, ans=0.125 2023-06-27 03:59:30,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1720992.0, ans=0.2 2023-06-27 04:00:04,492 INFO [train.py:996] (1/4) Epoch 10, batch 12400, loss[loss=0.2321, simple_loss=0.2992, pruned_loss=0.08243, over 21490.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2888, pruned_loss=0.06425, over 4278703.13 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:00:22,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1721112.0, ans=0.1 2023-06-27 04:01:43,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1721352.0, ans=0.0 2023-06-27 04:01:58,687 INFO [train.py:996] (1/4) Epoch 10, batch 12450, loss[loss=0.2509, simple_loss=0.3237, pruned_loss=0.08907, over 21380.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2911, pruned_loss=0.06606, over 4284488.09 frames. ], batch size: 159, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:02:07,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1721412.0, ans=0.125 2023-06-27 04:02:31,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1721472.0, ans=0.2 2023-06-27 04:02:36,086 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 6.019e+02 7.668e+02 9.401e+02 2.639e+03, threshold=1.534e+03, percent-clipped=2.0 2023-06-27 04:03:29,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1721652.0, ans=0.125 2023-06-27 04:03:48,672 INFO [train.py:996] (1/4) Epoch 10, batch 12500, loss[loss=0.227, simple_loss=0.3218, pruned_loss=0.06608, over 21656.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3013, pruned_loss=0.06923, over 4285090.65 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:04:22,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1721772.0, ans=0.07 2023-06-27 04:04:24,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-27 04:04:52,818 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:05:45,554 INFO [train.py:996] (1/4) Epoch 10, batch 12550, loss[loss=0.1968, simple_loss=0.2663, pruned_loss=0.06363, over 21203.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3078, pruned_loss=0.07112, over 4283569.18 frames. 
], batch size: 608, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:06:27,279 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.271e+02 6.681e+02 8.893e+02 1.594e+03 3.232e+03, threshold=1.779e+03, percent-clipped=27.0 2023-06-27 04:06:44,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-27 04:07:01,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1722192.0, ans=0.09899494936611666 2023-06-27 04:07:39,580 INFO [train.py:996] (1/4) Epoch 10, batch 12600, loss[loss=0.1773, simple_loss=0.2736, pruned_loss=0.04048, over 21634.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3063, pruned_loss=0.07027, over 4286341.57 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:08:05,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1722372.0, ans=0.2 2023-06-27 04:08:31,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1722432.0, ans=0.125 2023-06-27 04:08:46,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1722492.0, ans=0.125 2023-06-27 04:09:20,818 INFO [train.py:996] (1/4) Epoch 10, batch 12650, loss[loss=0.1337, simple_loss=0.1824, pruned_loss=0.04253, over 16437.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2991, pruned_loss=0.06747, over 4274772.71 frames. ], batch size: 60, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:09:31,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1722612.0, ans=0.2 2023-06-27 04:10:02,121 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 6.359e+02 1.024e+03 1.411e+03 2.503e+03, threshold=2.048e+03, percent-clipped=9.0 2023-06-27 04:10:04,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1722732.0, ans=0.125 2023-06-27 04:10:40,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1722792.0, ans=0.1 2023-06-27 04:10:42,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1722792.0, ans=0.125 2023-06-27 04:11:14,810 INFO [train.py:996] (1/4) Epoch 10, batch 12700, loss[loss=0.229, simple_loss=0.3069, pruned_loss=0.07551, over 21806.00 frames. ], tot_loss[loss=0.218, simple_loss=0.298, pruned_loss=0.06904, over 4277886.46 frames. ], batch size: 118, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:11:57,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1723032.0, ans=0.1 2023-06-27 04:12:09,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1723032.0, ans=0.125 2023-06-27 04:13:08,217 INFO [train.py:996] (1/4) Epoch 10, batch 12750, loss[loss=0.1959, simple_loss=0.2829, pruned_loss=0.05446, over 21792.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3001, pruned_loss=0.06988, over 4271713.39 frames. 
], batch size: 298, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:13:38,770 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.128e+02 7.827e+02 1.074e+03 2.616e+03, threshold=1.565e+03, percent-clipped=3.0 2023-06-27 04:13:58,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1723332.0, ans=0.125 2023-06-27 04:14:23,567 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-27 04:14:28,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1723452.0, ans=0.125 2023-06-27 04:14:55,466 INFO [train.py:996] (1/4) Epoch 10, batch 12800, loss[loss=0.2196, simple_loss=0.286, pruned_loss=0.07665, over 21544.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2981, pruned_loss=0.06971, over 4277130.59 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:15:15,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1723572.0, ans=0.125 2023-06-27 04:15:47,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.53 vs. limit=15.0 2023-06-27 04:16:15,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1723692.0, ans=0.125 2023-06-27 04:16:45,026 INFO [train.py:996] (1/4) Epoch 10, batch 12850, loss[loss=0.1866, simple_loss=0.2808, pruned_loss=0.04618, over 21595.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3001, pruned_loss=0.07065, over 4280930.30 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:16:47,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1723812.0, ans=0.0 2023-06-27 04:17:19,701 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-27 04:17:22,021 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.972e+02 5.917e+02 7.824e+02 1.083e+03 2.191e+03, threshold=1.565e+03, percent-clipped=6.0 2023-06-27 04:17:31,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-27 04:18:26,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1724052.0, ans=0.1 2023-06-27 04:18:28,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1724052.0, ans=0.125 2023-06-27 04:18:29,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1724052.0, ans=0.2 2023-06-27 04:18:34,546 INFO [train.py:996] (1/4) Epoch 10, batch 12900, loss[loss=0.2466, simple_loss=0.3336, pruned_loss=0.07975, over 21512.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2968, pruned_loss=0.06737, over 4277607.28 frames. 
], batch size: 471, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:18:40,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1724112.0, ans=0.2 2023-06-27 04:18:56,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1724172.0, ans=0.2 2023-06-27 04:18:57,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1724172.0, ans=0.125 2023-06-27 04:19:05,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1724172.0, ans=0.0 2023-06-27 04:19:27,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5 2023-06-27 04:19:57,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1724292.0, ans=0.125 2023-06-27 04:20:11,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1724352.0, ans=0.0 2023-06-27 04:20:23,521 INFO [train.py:996] (1/4) Epoch 10, batch 12950, loss[loss=0.233, simple_loss=0.3104, pruned_loss=0.07778, over 21726.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2955, pruned_loss=0.06575, over 4272992.64 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:20:33,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1724412.0, ans=0.1 2023-06-27 04:20:42,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1724412.0, ans=0.0 2023-06-27 04:21:19,205 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.055e+02 6.814e+02 9.301e+02 1.537e+03 3.645e+03, threshold=1.860e+03, percent-clipped=23.0 2023-06-27 04:21:23,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1724532.0, ans=0.1 2023-06-27 04:22:08,389 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=22.5 2023-06-27 04:22:17,942 INFO [train.py:996] (1/4) Epoch 10, batch 13000, loss[loss=0.1566, simple_loss=0.2294, pruned_loss=0.04194, over 20993.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2946, pruned_loss=0.06621, over 4265008.59 frames. ], batch size: 143, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:22:27,409 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-27 04:22:45,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1724772.0, ans=0.1 2023-06-27 04:23:43,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1724952.0, ans=0.125 2023-06-27 04:23:47,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. 
limit=15.0 2023-06-27 04:23:57,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1724952.0, ans=0.125 2023-06-27 04:24:05,870 INFO [train.py:996] (1/4) Epoch 10, batch 13050, loss[loss=0.2156, simple_loss=0.2896, pruned_loss=0.07076, over 21888.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2897, pruned_loss=0.06363, over 4268989.40 frames. ], batch size: 371, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:24:09,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1725012.0, ans=0.0 2023-06-27 04:24:39,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1725072.0, ans=0.0 2023-06-27 04:24:49,090 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.473e+02 7.954e+02 1.041e+03 2.275e+03, threshold=1.591e+03, percent-clipped=1.0 2023-06-27 04:25:20,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1725192.0, ans=0.0 2023-06-27 04:25:53,804 INFO [train.py:996] (1/4) Epoch 10, batch 13100, loss[loss=0.1959, simple_loss=0.2862, pruned_loss=0.05284, over 21769.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2928, pruned_loss=0.06435, over 4271307.96 frames. ], batch size: 332, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:26:16,937 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:26:39,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1725372.0, ans=0.0 2023-06-27 04:26:55,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1725432.0, ans=0.125 2023-06-27 04:26:57,250 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:27:29,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-27 04:27:43,053 INFO [train.py:996] (1/4) Epoch 10, batch 13150, loss[loss=0.2483, simple_loss=0.3568, pruned_loss=0.06984, over 20832.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2952, pruned_loss=0.06625, over 4278060.28 frames. 
], batch size: 607, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:28:00,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1725612.0, ans=0.125 2023-06-27 04:28:17,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1725672.0, ans=0.0 2023-06-27 04:28:32,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.070e+02 6.134e+02 8.116e+02 1.164e+03 2.711e+03, threshold=1.623e+03, percent-clipped=9.0 2023-06-27 04:28:39,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1725732.0, ans=0.0 2023-06-27 04:29:02,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1725792.0, ans=0.125 2023-06-27 04:29:07,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1725792.0, ans=0.125 2023-06-27 04:29:32,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1725852.0, ans=0.2 2023-06-27 04:29:36,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1725912.0, ans=0.0 2023-06-27 04:29:37,418 INFO [train.py:996] (1/4) Epoch 10, batch 13200, loss[loss=0.2293, simple_loss=0.3089, pruned_loss=0.07489, over 21419.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2961, pruned_loss=0.067, over 4280364.65 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:29:41,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1725912.0, ans=0.125 2023-06-27 04:29:53,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1725912.0, ans=0.125 2023-06-27 04:30:42,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1726092.0, ans=10.0 2023-06-27 04:31:26,760 INFO [train.py:996] (1/4) Epoch 10, batch 13250, loss[loss=0.2113, simple_loss=0.2969, pruned_loss=0.06282, over 21854.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2948, pruned_loss=0.0685, over 4274352.86 frames. ], batch size: 371, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:31:53,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1726272.0, ans=0.125 2023-06-27 04:32:06,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 7.655e+02 1.062e+03 1.668e+03 3.650e+03, threshold=2.123e+03, percent-clipped=27.0 2023-06-27 04:32:26,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1726332.0, ans=0.5 2023-06-27 04:33:21,183 INFO [train.py:996] (1/4) Epoch 10, batch 13300, loss[loss=0.224, simple_loss=0.3114, pruned_loss=0.06832, over 21755.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.296, pruned_loss=0.06819, over 4275428.87 frames. 
], batch size: 332, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:33:21,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1726512.0, ans=0.125 2023-06-27 04:33:51,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-27 04:33:59,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1726632.0, ans=0.125 2023-06-27 04:34:55,856 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-27 04:35:10,297 INFO [train.py:996] (1/4) Epoch 10, batch 13350, loss[loss=0.2126, simple_loss=0.2958, pruned_loss=0.06469, over 21373.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3006, pruned_loss=0.07093, over 4277222.46 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:35:12,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1726812.0, ans=0.1 2023-06-27 04:35:17,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1726812.0, ans=0.125 2023-06-27 04:35:48,979 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.150e+02 5.865e+02 7.490e+02 1.135e+03 2.182e+03, threshold=1.498e+03, percent-clipped=1.0 2023-06-27 04:36:24,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1726992.0, ans=0.125 2023-06-27 04:36:50,565 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:36:58,399 INFO [train.py:996] (1/4) Epoch 10, batch 13400, loss[loss=0.2256, simple_loss=0.29, pruned_loss=0.08059, over 21601.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3029, pruned_loss=0.07234, over 4281073.20 frames. ], batch size: 548, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:37:19,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1727172.0, ans=0.125 2023-06-27 04:37:30,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1727172.0, ans=0.125 2023-06-27 04:38:04,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1727232.0, ans=0.125 2023-06-27 04:38:41,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.86 vs. limit=15.0 2023-06-27 04:38:43,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1727352.0, ans=0.125 2023-06-27 04:38:47,910 INFO [train.py:996] (1/4) Epoch 10, batch 13450, loss[loss=0.2375, simple_loss=0.308, pruned_loss=0.08349, over 21657.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3045, pruned_loss=0.07403, over 4274624.39 frames. 
], batch size: 441, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:39:07,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1727412.0, ans=0.1 2023-06-27 04:39:39,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-27 04:39:39,484 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 5.946e+02 7.827e+02 1.298e+03 2.826e+03, threshold=1.565e+03, percent-clipped=16.0 2023-06-27 04:40:08,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1727592.0, ans=0.0 2023-06-27 04:40:14,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.74 vs. limit=15.0 2023-06-27 04:40:38,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1727652.0, ans=0.07 2023-06-27 04:40:43,701 INFO [train.py:996] (1/4) Epoch 10, batch 13500, loss[loss=0.1688, simple_loss=0.2272, pruned_loss=0.05521, over 21319.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2971, pruned_loss=0.07134, over 4270727.60 frames. ], batch size: 159, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:41:45,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1727832.0, ans=0.125 2023-06-27 04:41:58,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=22.5 2023-06-27 04:42:05,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1727892.0, ans=0.1 2023-06-27 04:42:15,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1727952.0, ans=0.0 2023-06-27 04:42:35,513 INFO [train.py:996] (1/4) Epoch 10, batch 13550, loss[loss=0.2657, simple_loss=0.3622, pruned_loss=0.08464, over 21766.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3003, pruned_loss=0.07111, over 4273628.64 frames. ], batch size: 332, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:42:36,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1728012.0, ans=0.1 2023-06-27 04:42:52,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-27 04:43:12,499 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:43:25,545 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.534e+02 7.345e+02 1.395e+03 2.191e+03 3.934e+03, threshold=2.790e+03, percent-clipped=45.0 2023-06-27 04:43:28,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. 
limit=15.0 2023-06-27 04:44:02,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1728252.0, ans=0.1 2023-06-27 04:44:17,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1728252.0, ans=0.125 2023-06-27 04:44:21,767 INFO [train.py:996] (1/4) Epoch 10, batch 13600, loss[loss=0.2277, simple_loss=0.3097, pruned_loss=0.07285, over 21581.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3013, pruned_loss=0.07164, over 4270896.88 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 32.0 2023-06-27 04:44:38,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1728312.0, ans=0.0 2023-06-27 04:44:54,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1728372.0, ans=0.0 2023-06-27 04:46:03,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1728552.0, ans=0.125 2023-06-27 04:46:12,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1728612.0, ans=0.0 2023-06-27 04:46:13,912 INFO [train.py:996] (1/4) Epoch 10, batch 13650, loss[loss=0.202, simple_loss=0.2645, pruned_loss=0.06978, over 21513.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2967, pruned_loss=0.06827, over 4268764.09 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:46:59,896 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.764e+02 5.044e+02 6.157e+02 8.736e+02 2.830e+03, threshold=1.231e+03, percent-clipped=2.0 2023-06-27 04:47:14,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1728792.0, ans=0.2 2023-06-27 04:48:02,145 INFO [train.py:996] (1/4) Epoch 10, batch 13700, loss[loss=0.2281, simple_loss=0.3484, pruned_loss=0.05386, over 19793.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2927, pruned_loss=0.06746, over 4270096.39 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:48:54,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-27 04:48:55,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1729032.0, ans=0.1 2023-06-27 04:49:09,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1729092.0, ans=0.09899494936611666 2023-06-27 04:49:18,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-27 04:49:40,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1729152.0, ans=0.125 2023-06-27 04:49:50,679 INFO [train.py:996] (1/4) Epoch 10, batch 13750, loss[loss=0.1486, simple_loss=0.2028, pruned_loss=0.04718, over 21790.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2891, pruned_loss=0.06636, over 4263846.16 frames. 
], batch size: 102, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:50:33,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1729272.0, ans=0.1 2023-06-27 04:50:44,257 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.944e+02 7.619e+02 1.226e+03 1.767e+03 3.252e+03, threshold=2.451e+03, percent-clipped=47.0 2023-06-27 04:51:26,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1729452.0, ans=0.125 2023-06-27 04:51:52,082 INFO [train.py:996] (1/4) Epoch 10, batch 13800, loss[loss=0.153, simple_loss=0.2215, pruned_loss=0.04223, over 21866.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2947, pruned_loss=0.06523, over 4272932.11 frames. ], batch size: 107, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:53:30,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1729752.0, ans=0.125 2023-06-27 04:53:34,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-27 04:53:40,115 INFO [train.py:996] (1/4) Epoch 10, batch 13850, loss[loss=0.2191, simple_loss=0.3053, pruned_loss=0.06646, over 20680.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2997, pruned_loss=0.06626, over 4270354.24 frames. ], batch size: 608, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:54:23,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 7.886e+02 1.223e+03 1.813e+03 4.044e+03, threshold=2.445e+03, percent-clipped=7.0 2023-06-27 04:54:43,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1729992.0, ans=0.2 2023-06-27 04:55:28,103 INFO [train.py:996] (1/4) Epoch 10, batch 13900, loss[loss=0.2199, simple_loss=0.2786, pruned_loss=0.08056, over 20025.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3032, pruned_loss=0.06966, over 4271950.83 frames. ], batch size: 702, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:56:06,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1730232.0, ans=10.0 2023-06-27 04:56:06,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1730232.0, ans=0.0 2023-06-27 04:56:38,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1730292.0, ans=0.125 2023-06-27 04:56:45,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1730292.0, ans=0.2 2023-06-27 04:57:14,336 INFO [train.py:996] (1/4) Epoch 10, batch 13950, loss[loss=0.2154, simple_loss=0.285, pruned_loss=0.07295, over 21325.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.303, pruned_loss=0.07153, over 4273881.42 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:58:02,047 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.602e+02 6.616e+02 8.570e+02 1.217e+03 2.156e+03, threshold=1.714e+03, percent-clipped=0.0 2023-06-27 04:58:59,356 INFO [train.py:996] (1/4) Epoch 10, batch 14000, loss[loss=0.1722, simple_loss=0.2519, pruned_loss=0.04619, over 21367.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2995, pruned_loss=0.06879, over 4269818.17 frames. 
], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:00:06,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1730892.0, ans=0.2 2023-06-27 05:00:51,609 INFO [train.py:996] (1/4) Epoch 10, batch 14050, loss[loss=0.2266, simple_loss=0.2827, pruned_loss=0.08524, over 21374.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.294, pruned_loss=0.06534, over 4273791.48 frames. ], batch size: 507, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:00:58,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1731012.0, ans=0.125 2023-06-27 05:01:09,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1731072.0, ans=0.0 2023-06-27 05:01:11,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=12.0 2023-06-27 05:01:33,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 7.272e+02 1.104e+03 1.609e+03 3.327e+03, threshold=2.207e+03, percent-clipped=18.0 2023-06-27 05:01:35,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731132.0, ans=0.1 2023-06-27 05:01:46,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1731192.0, ans=0.125 2023-06-27 05:01:58,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731192.0, ans=0.1 2023-06-27 05:02:27,182 INFO [train.py:996] (1/4) Epoch 10, batch 14100, loss[loss=0.2578, simple_loss=0.3154, pruned_loss=0.1001, over 21354.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2896, pruned_loss=0.06519, over 4270316.52 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:02:40,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1731312.0, ans=0.125 2023-06-27 05:04:05,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1731552.0, ans=0.125 2023-06-27 05:04:12,969 INFO [train.py:996] (1/4) Epoch 10, batch 14150, loss[loss=0.2151, simple_loss=0.2988, pruned_loss=0.0657, over 21833.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2922, pruned_loss=0.06575, over 4266507.49 frames. ], batch size: 102, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:04:14,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.84 vs. 
limit=22.5 2023-06-27 05:04:59,064 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 7.057e+02 1.107e+03 1.740e+03 3.584e+03, threshold=2.215e+03, percent-clipped=8.0 2023-06-27 05:05:09,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1731732.0, ans=0.0 2023-06-27 05:05:21,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1731792.0, ans=0.125 2023-06-27 05:05:33,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731792.0, ans=0.1 2023-06-27 05:05:55,685 INFO [train.py:996] (1/4) Epoch 10, batch 14200, loss[loss=0.2108, simple_loss=0.2675, pruned_loss=0.07704, over 20226.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2914, pruned_loss=0.0654, over 4276141.30 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:06:03,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-27 05:06:09,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731912.0, ans=0.1 2023-06-27 05:06:11,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1731972.0, ans=15.0 2023-06-27 05:06:43,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1732032.0, ans=0.125 2023-06-27 05:06:50,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1732032.0, ans=0.125 2023-06-27 05:07:41,160 INFO [train.py:996] (1/4) Epoch 10, batch 14250, loss[loss=0.1883, simple_loss=0.2563, pruned_loss=0.06012, over 21595.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2859, pruned_loss=0.065, over 4271354.59 frames. ], batch size: 247, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:08:32,533 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 5.743e+02 8.448e+02 1.114e+03 2.445e+03, threshold=1.690e+03, percent-clipped=1.0 2023-06-27 05:08:48,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-27 05:08:56,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1732392.0, ans=0.125 2023-06-27 05:09:25,872 INFO [train.py:996] (1/4) Epoch 10, batch 14300, loss[loss=0.2714, simple_loss=0.3673, pruned_loss=0.08772, over 21784.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2905, pruned_loss=0.06527, over 4268596.33 frames. 
], batch size: 282, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:09:33,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1732512.0, ans=0.04949747468305833 2023-06-27 05:09:36,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1732512.0, ans=0.0 2023-06-27 05:09:37,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1732512.0, ans=0.125 2023-06-27 05:09:47,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1732572.0, ans=0.125 2023-06-27 05:09:52,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1732572.0, ans=0.07 2023-06-27 05:09:52,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1732572.0, ans=0.125 2023-06-27 05:11:14,214 INFO [train.py:996] (1/4) Epoch 10, batch 14350, loss[loss=0.1472, simple_loss=0.1977, pruned_loss=0.04833, over 16337.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2955, pruned_loss=0.06599, over 4256359.99 frames. ], batch size: 61, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:11:18,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1732812.0, ans=0.0 2023-06-27 05:11:47,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1732872.0, ans=0.125 2023-06-27 05:12:04,604 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 7.754e+02 1.154e+03 1.779e+03 3.670e+03, threshold=2.308e+03, percent-clipped=30.0 2023-06-27 05:12:24,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1732992.0, ans=0.09899494936611666 2023-06-27 05:12:49,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-27 05:13:00,579 INFO [train.py:996] (1/4) Epoch 10, batch 14400, loss[loss=0.2246, simple_loss=0.294, pruned_loss=0.07764, over 21824.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2934, pruned_loss=0.06675, over 4261280.13 frames. ], batch size: 118, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:13:04,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1733112.0, ans=0.2 2023-06-27 05:13:09,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1733112.0, ans=0.0 2023-06-27 05:14:35,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1733352.0, ans=0.125 2023-06-27 05:14:46,468 INFO [train.py:996] (1/4) Epoch 10, batch 14450, loss[loss=0.1818, simple_loss=0.2379, pruned_loss=0.06287, over 21236.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2875, pruned_loss=0.06696, over 4269448.13 frames. 
], batch size: 548, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:15:12,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1733472.0, ans=0.0 2023-06-27 05:15:36,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.984e+02 5.618e+02 7.352e+02 1.088e+03 2.382e+03, threshold=1.470e+03, percent-clipped=1.0 2023-06-27 05:15:57,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-27 05:16:16,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1733652.0, ans=0.0 2023-06-27 05:16:28,034 INFO [train.py:996] (1/4) Epoch 10, batch 14500, loss[loss=0.2068, simple_loss=0.2954, pruned_loss=0.05908, over 21536.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2846, pruned_loss=0.06625, over 4268816.44 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:16:35,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1733712.0, ans=0.1 2023-06-27 05:16:37,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-27 05:16:41,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-27 05:18:02,233 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 05:18:12,021 INFO [train.py:996] (1/4) Epoch 10, batch 14550, loss[loss=0.2339, simple_loss=0.3035, pruned_loss=0.08212, over 21376.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2891, pruned_loss=0.06788, over 4272765.68 frames. ], batch size: 549, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:18:45,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1734072.0, ans=0.125 2023-06-27 05:18:54,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1734072.0, ans=0.125 2023-06-27 05:19:02,698 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.284e+02 5.674e+02 7.709e+02 1.144e+03 2.600e+03, threshold=1.542e+03, percent-clipped=15.0 2023-06-27 05:19:06,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1734132.0, ans=0.1 2023-06-27 05:19:42,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1734252.0, ans=0.125 2023-06-27 05:20:05,614 INFO [train.py:996] (1/4) Epoch 10, batch 14600, loss[loss=0.2313, simple_loss=0.3222, pruned_loss=0.07018, over 21799.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2975, pruned_loss=0.07138, over 4278372.61 frames. 
], batch size: 282, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:20:15,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1734312.0, ans=0.5 2023-06-27 05:20:22,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1734372.0, ans=0.09899494936611666 2023-06-27 05:20:34,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1734372.0, ans=22.5 2023-06-27 05:20:42,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1734372.0, ans=0.125 2023-06-27 05:21:18,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1734492.0, ans=0.0 2023-06-27 05:21:40,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=22.5 2023-06-27 05:21:48,278 INFO [train.py:996] (1/4) Epoch 10, batch 14650, loss[loss=0.1606, simple_loss=0.2434, pruned_loss=0.03892, over 21400.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2995, pruned_loss=0.0703, over 4283596.39 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:22:06,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1734612.0, ans=0.125 2023-06-27 05:22:39,585 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 5.657e+02 7.781e+02 1.109e+03 2.213e+03, threshold=1.556e+03, percent-clipped=10.0 2023-06-27 05:22:51,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1734732.0, ans=0.0 2023-06-27 05:23:37,301 INFO [train.py:996] (1/4) Epoch 10, batch 14700, loss[loss=0.196, simple_loss=0.2907, pruned_loss=0.05063, over 21689.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2945, pruned_loss=0.06613, over 4276195.90 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:23:46,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1734912.0, ans=0.125 2023-06-27 05:24:08,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1734972.0, ans=0.125 2023-06-27 05:24:18,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1734972.0, ans=0.125 2023-06-27 05:24:38,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1735032.0, ans=0.125 2023-06-27 05:25:00,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-27 05:25:09,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1735152.0, ans=22.5 2023-06-27 05:25:38,798 INFO [train.py:996] (1/4) Epoch 10, batch 14750, loss[loss=0.2368, simple_loss=0.3131, pruned_loss=0.08023, over 21557.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2971, pruned_loss=0.06729, over 4270440.52 frames. 
], batch size: 194, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:26:30,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.686e+02 7.000e+02 1.273e+03 1.820e+03 3.687e+03, threshold=2.546e+03, percent-clipped=36.0 2023-06-27 05:27:04,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1735452.0, ans=0.125 2023-06-27 05:27:29,180 INFO [train.py:996] (1/4) Epoch 10, batch 14800, loss[loss=0.1979, simple_loss=0.2724, pruned_loss=0.06166, over 21562.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3082, pruned_loss=0.07251, over 4275249.40 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 32.0 2023-06-27 05:27:35,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1735512.0, ans=0.2 2023-06-27 05:28:11,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1735572.0, ans=0.125 2023-06-27 05:28:22,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735632.0, ans=0.1 2023-06-27 05:29:29,442 INFO [train.py:996] (1/4) Epoch 10, batch 14850, loss[loss=0.1996, simple_loss=0.273, pruned_loss=0.06313, over 21647.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3018, pruned_loss=0.07165, over 4266294.90 frames. ], batch size: 247, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:30:16,838 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 5.316e+02 7.277e+02 1.299e+03 3.940e+03, threshold=1.455e+03, percent-clipped=5.0 2023-06-27 05:31:19,414 INFO [train.py:996] (1/4) Epoch 10, batch 14900, loss[loss=0.1991, simple_loss=0.2709, pruned_loss=0.06361, over 21624.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3031, pruned_loss=0.07305, over 4266006.86 frames. ], batch size: 112, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:31:39,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1736172.0, ans=0.0 2023-06-27 05:31:50,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736172.0, ans=0.1 2023-06-27 05:32:21,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1736232.0, ans=0.125 2023-06-27 05:32:23,726 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=15.0 2023-06-27 05:32:44,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1736292.0, ans=0.125 2023-06-27 05:32:59,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1736352.0, ans=0.1 2023-06-27 05:33:09,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1736412.0, ans=0.0 2023-06-27 05:33:11,108 INFO [train.py:996] (1/4) Epoch 10, batch 14950, loss[loss=0.2561, simple_loss=0.3436, pruned_loss=0.08427, over 21419.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3048, pruned_loss=0.07293, over 4266477.12 frames. 
], batch size: 131, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:33:31,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736412.0, ans=0.1 2023-06-27 05:34:05,491 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.924e+02 5.785e+02 8.505e+02 1.255e+03 2.502e+03, threshold=1.701e+03, percent-clipped=18.0 2023-06-27 05:34:12,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1736592.0, ans=0.125 2023-06-27 05:34:44,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1736652.0, ans=0.125 2023-06-27 05:34:53,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1736652.0, ans=0.125 2023-06-27 05:35:00,006 INFO [train.py:996] (1/4) Epoch 10, batch 15000, loss[loss=0.2547, simple_loss=0.3295, pruned_loss=0.08994, over 20680.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3057, pruned_loss=0.07366, over 4268503.03 frames. ], batch size: 607, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:35:00,007 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 05:35:19,879 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2554, simple_loss=0.3462, pruned_loss=0.08227, over 1796401.00 frames. 2023-06-27 05:35:19,880 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 05:35:26,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1736712.0, ans=0.125 2023-06-27 05:36:53,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1736952.0, ans=0.125 2023-06-27 05:37:03,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1737012.0, ans=0.125 2023-06-27 05:37:04,896 INFO [train.py:996] (1/4) Epoch 10, batch 15050, loss[loss=0.2567, simple_loss=0.3356, pruned_loss=0.08887, over 21648.00 frames. ], tot_loss[loss=0.229, simple_loss=0.308, pruned_loss=0.07499, over 4272047.67 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:37:44,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1737072.0, ans=0.1 2023-06-27 05:38:00,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1737132.0, ans=0.2 2023-06-27 05:38:05,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.468e+02 6.013e+02 1.020e+03 1.764e+03 3.653e+03, threshold=2.041e+03, percent-clipped=28.0 2023-06-27 05:38:16,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1737192.0, ans=0.1 2023-06-27 05:38:59,242 INFO [train.py:996] (1/4) Epoch 10, batch 15100, loss[loss=0.2661, simple_loss=0.3408, pruned_loss=0.0957, over 21832.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3105, pruned_loss=0.07442, over 4273395.34 frames. 
], batch size: 118, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:39:26,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1737372.0, ans=0.125 2023-06-27 05:39:40,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1737372.0, ans=0.0 2023-06-27 05:40:04,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1737492.0, ans=0.125 2023-06-27 05:40:43,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1737552.0, ans=0.125 2023-06-27 05:40:43,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1737552.0, ans=0.0 2023-06-27 05:40:48,204 INFO [train.py:996] (1/4) Epoch 10, batch 15150, loss[loss=0.2225, simple_loss=0.2939, pruned_loss=0.07553, over 21380.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3076, pruned_loss=0.07536, over 4278825.19 frames. ], batch size: 548, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:41:42,589 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 5.996e+02 8.329e+02 1.455e+03 4.229e+03, threshold=1.666e+03, percent-clipped=17.0 2023-06-27 05:41:46,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1737732.0, ans=0.1 2023-06-27 05:42:13,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.75 vs. limit=15.0 2023-06-27 05:42:36,438 INFO [train.py:996] (1/4) Epoch 10, batch 15200, loss[loss=0.1769, simple_loss=0.247, pruned_loss=0.05344, over 21725.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2987, pruned_loss=0.07217, over 4264431.45 frames. ], batch size: 124, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:42:37,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1737912.0, ans=0.0 2023-06-27 05:42:49,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-27 05:43:40,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1738092.0, ans=0.125 2023-06-27 05:43:52,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1738092.0, ans=0.025 2023-06-27 05:43:54,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1738092.0, ans=0.125 2023-06-27 05:44:06,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1738152.0, ans=0.2 2023-06-27 05:44:22,674 INFO [train.py:996] (1/4) Epoch 10, batch 15250, loss[loss=0.2218, simple_loss=0.2995, pruned_loss=0.07202, over 21752.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2923, pruned_loss=0.06986, over 4261678.94 frames. 
], batch size: 124, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:44:38,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1738212.0, ans=0.0 2023-06-27 05:45:16,769 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 6.076e+02 9.164e+02 1.527e+03 3.060e+03, threshold=1.833e+03, percent-clipped=16.0 2023-06-27 05:45:36,964 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 05:46:02,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1738452.0, ans=0.125 2023-06-27 05:46:03,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-06-27 05:46:11,054 INFO [train.py:996] (1/4) Epoch 10, batch 15300, loss[loss=0.2357, simple_loss=0.3102, pruned_loss=0.08065, over 20713.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2951, pruned_loss=0.07218, over 4259190.23 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:46:49,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1738572.0, ans=0.125 2023-06-27 05:46:58,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1738632.0, ans=0.0 2023-06-27 05:47:00,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1738632.0, ans=0.125 2023-06-27 05:47:07,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1738632.0, ans=0.125 2023-06-27 05:47:24,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1738692.0, ans=0.0 2023-06-27 05:47:30,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1738692.0, ans=0.125 2023-06-27 05:47:31,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1738692.0, ans=0.0 2023-06-27 05:47:47,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1738752.0, ans=0.125 2023-06-27 05:47:50,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1738752.0, ans=0.04949747468305833 2023-06-27 05:47:52,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1738752.0, ans=0.125 2023-06-27 05:47:54,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1738752.0, ans=10.0 2023-06-27 05:47:58,640 INFO [train.py:996] (1/4) Epoch 10, batch 15350, loss[loss=0.219, simple_loss=0.2991, pruned_loss=0.06947, over 21604.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2998, pruned_loss=0.0743, over 4271332.08 frames. 
], batch size: 230, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:48:51,336 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.081e+02 6.656e+02 9.808e+02 1.431e+03 3.197e+03, threshold=1.962e+03, percent-clipped=8.0 2023-06-27 05:48:55,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1738932.0, ans=0.2 2023-06-27 05:49:09,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-27 05:49:44,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1739112.0, ans=0.0 2023-06-27 05:49:45,873 INFO [train.py:996] (1/4) Epoch 10, batch 15400, loss[loss=0.2314, simple_loss=0.3026, pruned_loss=0.08013, over 21867.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3013, pruned_loss=0.07266, over 4275586.27 frames. ], batch size: 414, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:49:48,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1739112.0, ans=0.1 2023-06-27 05:49:54,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1739112.0, ans=0.125 2023-06-27 05:50:14,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.49 vs. limit=10.0 2023-06-27 05:50:33,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1739232.0, ans=0.125 2023-06-27 05:50:42,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-27 05:51:15,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1739352.0, ans=0.1 2023-06-27 05:51:17,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1739352.0, ans=0.015 2023-06-27 05:51:33,737 INFO [train.py:996] (1/4) Epoch 10, batch 15450, loss[loss=0.2056, simple_loss=0.2896, pruned_loss=0.06077, over 21432.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3003, pruned_loss=0.07136, over 4269708.51 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:52:10,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1739472.0, ans=0.125 2023-06-27 05:52:28,018 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.164e+02 6.328e+02 9.249e+02 1.410e+03 2.980e+03, threshold=1.850e+03, percent-clipped=8.0 2023-06-27 05:53:29,084 INFO [train.py:996] (1/4) Epoch 10, batch 15500, loss[loss=0.2393, simple_loss=0.3124, pruned_loss=0.08308, over 21605.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3015, pruned_loss=0.0711, over 4258185.77 frames. ], batch size: 263, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:54:05,561 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. 
limit=15.0 2023-06-27 05:54:34,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1739832.0, ans=0.125 2023-06-27 05:54:47,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1739892.0, ans=0.0 2023-06-27 05:55:23,959 INFO [train.py:996] (1/4) Epoch 10, batch 15550, loss[loss=0.2365, simple_loss=0.3212, pruned_loss=0.07586, over 20038.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3002, pruned_loss=0.07011, over 4252021.41 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:55:36,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1740012.0, ans=0.2 2023-06-27 05:56:09,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1740132.0, ans=0.1 2023-06-27 05:56:17,355 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.325e+02 6.960e+02 1.269e+03 1.845e+03 3.300e+03, threshold=2.538e+03, percent-clipped=23.0 2023-06-27 05:57:11,156 INFO [train.py:996] (1/4) Epoch 10, batch 15600, loss[loss=0.1939, simple_loss=0.2857, pruned_loss=0.051, over 21612.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2931, pruned_loss=0.06869, over 4252193.06 frames. ], batch size: 247, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 05:58:02,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1740432.0, ans=0.125 2023-06-27 05:58:09,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1740432.0, ans=0.2 2023-06-27 05:58:13,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1740492.0, ans=0.0 2023-06-27 05:58:16,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1740492.0, ans=0.0 2023-06-27 05:58:59,217 INFO [train.py:996] (1/4) Epoch 10, batch 15650, loss[loss=0.208, simple_loss=0.2732, pruned_loss=0.07146, over 21856.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2906, pruned_loss=0.0678, over 4266695.47 frames. ], batch size: 373, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:59:39,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1740732.0, ans=0.0 2023-06-27 05:59:48,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1740732.0, ans=0.125 2023-06-27 05:59:49,306 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 5.253e+02 8.016e+02 1.068e+03 2.204e+03, threshold=1.603e+03, percent-clipped=0.0 2023-06-27 06:00:36,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1740852.0, ans=0.2 2023-06-27 06:00:41,579 INFO [train.py:996] (1/4) Epoch 10, batch 15700, loss[loss=0.1702, simple_loss=0.2398, pruned_loss=0.0503, over 21213.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2872, pruned_loss=0.06676, over 4255184.67 frames. ], batch size: 176, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:01:03,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.66 vs. 
limit=22.5 2023-06-27 06:02:07,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.72 vs. limit=22.5 2023-06-27 06:02:28,404 INFO [train.py:996] (1/4) Epoch 10, batch 15750, loss[loss=0.1902, simple_loss=0.264, pruned_loss=0.05822, over 21502.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2831, pruned_loss=0.06601, over 4267367.12 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:02:53,275 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:02:54,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1741272.0, ans=0.0 2023-06-27 06:03:03,256 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:03:22,675 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 5.702e+02 8.249e+02 1.125e+03 2.008e+03, threshold=1.650e+03, percent-clipped=7.0 2023-06-27 06:04:14,232 INFO [train.py:996] (1/4) Epoch 10, batch 15800, loss[loss=0.1945, simple_loss=0.2592, pruned_loss=0.06485, over 21889.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2788, pruned_loss=0.06604, over 4267577.88 frames. ], batch size: 373, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:04:16,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1741512.0, ans=0.0 2023-06-27 06:04:32,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=22.5 2023-06-27 06:05:02,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1741632.0, ans=0.125 2023-06-27 06:06:00,814 INFO [train.py:996] (1/4) Epoch 10, batch 15850, loss[loss=0.213, simple_loss=0.2824, pruned_loss=0.07178, over 21769.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2821, pruned_loss=0.06728, over 4266804.52 frames. ], batch size: 124, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:06:03,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1741812.0, ans=0.125 2023-06-27 06:06:15,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-27 06:06:35,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1741872.0, ans=0.125 2023-06-27 06:06:57,729 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.205e+02 6.570e+02 8.492e+02 1.187e+03 2.613e+03, threshold=1.698e+03, percent-clipped=9.0 2023-06-27 06:06:58,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1741932.0, ans=0.125 2023-06-27 06:07:15,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1741992.0, ans=0.1 2023-06-27 06:07:47,410 INFO [train.py:996] (1/4) Epoch 10, batch 15900, loss[loss=0.193, simple_loss=0.2596, pruned_loss=0.06319, over 21803.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2813, pruned_loss=0.06759, over 4273761.81 frames. 
], batch size: 352, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:07:49,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1742112.0, ans=0.125 2023-06-27 06:08:02,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1742172.0, ans=0.125 2023-06-27 06:08:36,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1742232.0, ans=0.0 2023-06-27 06:08:51,017 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-06-27 06:09:09,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1742352.0, ans=0.0 2023-06-27 06:09:33,389 INFO [train.py:996] (1/4) Epoch 10, batch 15950, loss[loss=0.189, simple_loss=0.2716, pruned_loss=0.05319, over 21336.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.283, pruned_loss=0.06516, over 4279962.57 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:09:57,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1742472.0, ans=0.1 2023-06-27 06:10:31,667 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.565e+02 5.245e+02 8.616e+02 1.211e+03 4.191e+03, threshold=1.723e+03, percent-clipped=6.0 2023-06-27 06:10:47,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1742592.0, ans=0.125 2023-06-27 06:10:53,305 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-27 06:11:21,899 INFO [train.py:996] (1/4) Epoch 10, batch 16000, loss[loss=0.2186, simple_loss=0.3164, pruned_loss=0.06035, over 21661.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.285, pruned_loss=0.06409, over 4276252.22 frames. ], batch size: 389, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:11:33,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-27 06:11:34,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1742712.0, ans=0.125 2023-06-27 06:11:36,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1742712.0, ans=0.0 2023-06-27 06:11:50,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1742772.0, ans=0.125 2023-06-27 06:11:59,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1742832.0, ans=0.05 2023-06-27 06:12:19,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1742832.0, ans=0.125 2023-06-27 06:12:36,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1742892.0, ans=0.125 2023-06-27 06:13:10,604 INFO [train.py:996] (1/4) Epoch 10, batch 16050, loss[loss=0.2014, simple_loss=0.2478, pruned_loss=0.07755, over 20382.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2855, pruned_loss=0.06175, over 4271411.91 frames. 
], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:13:16,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1743012.0, ans=0.05 2023-06-27 06:13:30,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1743072.0, ans=0.0 2023-06-27 06:13:30,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1743072.0, ans=0.125 2023-06-27 06:13:50,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743132.0, ans=0.1 2023-06-27 06:13:59,424 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:14:07,170 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.055e+02 6.829e+02 9.641e+02 1.432e+03 3.603e+03, threshold=1.928e+03, percent-clipped=16.0 2023-06-27 06:14:07,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743132.0, ans=0.1 2023-06-27 06:14:07,828 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:14:51,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-27 06:14:51,660 INFO [train.py:996] (1/4) Epoch 10, batch 16100, loss[loss=0.214, simple_loss=0.289, pruned_loss=0.06952, over 21617.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2905, pruned_loss=0.06453, over 4276251.08 frames. ], batch size: 263, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:15:08,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743372.0, ans=0.1 2023-06-27 06:15:15,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1743372.0, ans=0.0 2023-06-27 06:16:02,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1743492.0, ans=0.0 2023-06-27 06:16:27,668 INFO [train.py:996] (1/4) Epoch 10, batch 16150, loss[loss=0.2351, simple_loss=0.3019, pruned_loss=0.08419, over 21775.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.29, pruned_loss=0.06672, over 4291482.13 frames. ], batch size: 441, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:17:35,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1743732.0, ans=0.125 2023-06-27 06:17:36,691 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.965e+02 5.965e+02 7.575e+02 1.164e+03 3.405e+03, threshold=1.515e+03, percent-clipped=4.0 2023-06-27 06:18:27,503 INFO [train.py:996] (1/4) Epoch 10, batch 16200, loss[loss=0.2401, simple_loss=0.3211, pruned_loss=0.07957, over 21786.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2936, pruned_loss=0.06807, over 4287638.16 frames. 
], batch size: 332, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:19:35,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1744092.0, ans=0.125 2023-06-27 06:20:13,805 INFO [train.py:996] (1/4) Epoch 10, batch 16250, loss[loss=0.2073, simple_loss=0.2841, pruned_loss=0.06522, over 21517.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2936, pruned_loss=0.06778, over 4275979.92 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:21:10,954 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 5.225e+02 6.820e+02 1.048e+03 2.777e+03, threshold=1.364e+03, percent-clipped=10.0 2023-06-27 06:21:33,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1744392.0, ans=0.125 2023-06-27 06:21:42,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1744452.0, ans=0.0 2023-06-27 06:22:00,273 INFO [train.py:996] (1/4) Epoch 10, batch 16300, loss[loss=0.2212, simple_loss=0.3327, pruned_loss=0.05484, over 19863.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2881, pruned_loss=0.06405, over 4277933.21 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:22:12,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1744512.0, ans=0.04949747468305833 2023-06-27 06:22:46,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1744632.0, ans=0.95 2023-06-27 06:23:02,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1744692.0, ans=0.1 2023-06-27 06:23:18,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1744692.0, ans=0.125 2023-06-27 06:23:18,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1744692.0, ans=0.0 2023-06-27 06:23:23,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1744752.0, ans=0.125 2023-06-27 06:23:48,219 INFO [train.py:996] (1/4) Epoch 10, batch 16350, loss[loss=0.2253, simple_loss=0.2991, pruned_loss=0.07576, over 19917.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2889, pruned_loss=0.06503, over 4276725.54 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:24:04,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-27 06:24:38,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.27 vs. limit=5.0 2023-06-27 06:24:45,627 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.899e+02 6.082e+02 8.252e+02 1.130e+03 2.497e+03, threshold=1.650e+03, percent-clipped=10.0 2023-06-27 06:25:29,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=22.5 2023-06-27 06:25:35,470 INFO [train.py:996] (1/4) Epoch 10, batch 16400, loss[loss=0.2233, simple_loss=0.2939, pruned_loss=0.07637, over 21374.00 frames. 
], tot_loss[loss=0.2155, simple_loss=0.2959, pruned_loss=0.06754, over 4278458.21 frames. ], batch size: 144, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:25:39,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1745112.0, ans=0.125 2023-06-27 06:25:49,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1745112.0, ans=0.0 2023-06-27 06:26:29,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1745232.0, ans=0.125 2023-06-27 06:26:45,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1745292.0, ans=0.125 2023-06-27 06:27:09,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=12.0 2023-06-27 06:27:14,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1745352.0, ans=0.125 2023-06-27 06:27:14,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-27 06:27:22,341 INFO [train.py:996] (1/4) Epoch 10, batch 16450, loss[loss=0.2457, simple_loss=0.307, pruned_loss=0.09216, over 21621.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2967, pruned_loss=0.06904, over 4283569.86 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:27:33,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1745412.0, ans=0.0 2023-06-27 06:28:12,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1745532.0, ans=0.125 2023-06-27 06:28:19,529 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.272e+02 6.597e+02 9.235e+02 1.601e+03 3.322e+03, threshold=1.847e+03, percent-clipped=22.0 2023-06-27 06:28:27,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-27 06:28:44,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1745592.0, ans=0.0 2023-06-27 06:28:55,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1745652.0, ans=0.1 2023-06-27 06:29:15,254 INFO [train.py:996] (1/4) Epoch 10, batch 16500, loss[loss=0.1349, simple_loss=0.1693, pruned_loss=0.0503, over 16237.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2932, pruned_loss=0.06906, over 4283323.69 frames. 
], batch size: 61, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:29:48,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1745772.0, ans=0.0 2023-06-27 06:30:14,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1745832.0, ans=0.125 2023-06-27 06:30:21,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1745832.0, ans=0.0 2023-06-27 06:30:37,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1745892.0, ans=0.125 2023-06-27 06:30:47,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1745952.0, ans=0.015 2023-06-27 06:30:47,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1745952.0, ans=0.125 2023-06-27 06:31:10,037 INFO [train.py:996] (1/4) Epoch 10, batch 16550, loss[loss=0.2213, simple_loss=0.2976, pruned_loss=0.07247, over 21461.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2909, pruned_loss=0.0669, over 4280942.94 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:31:47,147 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.20 vs. limit=6.0 2023-06-27 06:32:11,780 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.354e+02 1.023e+03 1.715e+03 3.969e+03, threshold=2.045e+03, percent-clipped=20.0 2023-06-27 06:32:18,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1746192.0, ans=0.04949747468305833 2023-06-27 06:32:26,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746192.0, ans=0.1 2023-06-27 06:32:45,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1746252.0, ans=0.125 2023-06-27 06:33:01,729 INFO [train.py:996] (1/4) Epoch 10, batch 16600, loss[loss=0.342, simple_loss=0.4205, pruned_loss=0.1317, over 21407.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2995, pruned_loss=0.06989, over 4280650.11 frames. ], batch size: 507, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:34:01,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1746432.0, ans=0.0 2023-06-27 06:34:27,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1746492.0, ans=0.0 2023-06-27 06:34:42,513 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:34:42,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1746552.0, ans=0.125 2023-06-27 06:34:49,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1746612.0, ans=0.2 2023-06-27 06:34:50,934 INFO [train.py:996] (1/4) Epoch 10, batch 16650, loss[loss=0.2522, simple_loss=0.3348, pruned_loss=0.08478, over 21346.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3069, pruned_loss=0.07195, over 4278938.22 frames. 
], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:35:11,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746612.0, ans=0.1 2023-06-27 06:35:14,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-27 06:35:15,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746672.0, ans=0.1 2023-06-27 06:35:55,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746732.0, ans=0.1 2023-06-27 06:35:58,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.748e+02 7.097e+02 9.518e+02 1.581e+03 3.619e+03, threshold=1.904e+03, percent-clipped=14.0 2023-06-27 06:36:48,636 INFO [train.py:996] (1/4) Epoch 10, batch 16700, loss[loss=0.2311, simple_loss=0.3513, pruned_loss=0.05549, over 19763.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3081, pruned_loss=0.07267, over 4278656.45 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:36:56,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1746912.0, ans=0.015 2023-06-27 06:37:09,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-06-27 06:37:10,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1746972.0, ans=0.1 2023-06-27 06:37:10,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746972.0, ans=0.1 2023-06-27 06:37:11,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1746972.0, ans=0.2 2023-06-27 06:37:31,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1746972.0, ans=0.04949747468305833 2023-06-27 06:37:37,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1747032.0, ans=0.0 2023-06-27 06:37:42,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1747032.0, ans=0.1 2023-06-27 06:38:46,419 INFO [train.py:996] (1/4) Epoch 10, batch 16750, loss[loss=0.2014, simple_loss=0.248, pruned_loss=0.07737, over 20111.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3097, pruned_loss=0.07494, over 4274379.78 frames. 
], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:39:17,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1747272.0, ans=0.1 2023-06-27 06:39:17,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1747272.0, ans=0.125 2023-06-27 06:39:53,241 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.731e+02 7.125e+02 1.124e+03 1.580e+03 3.763e+03, threshold=2.248e+03, percent-clipped=17.0 2023-06-27 06:39:55,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1747392.0, ans=0.125 2023-06-27 06:40:12,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1747392.0, ans=0.0 2023-06-27 06:40:33,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1747452.0, ans=0.1 2023-06-27 06:40:40,770 INFO [train.py:996] (1/4) Epoch 10, batch 16800, loss[loss=0.2025, simple_loss=0.2855, pruned_loss=0.05979, over 21905.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3133, pruned_loss=0.0745, over 4280300.98 frames. ], batch size: 316, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:40:50,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1747512.0, ans=0.125 2023-06-27 06:41:20,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1747632.0, ans=0.0 2023-06-27 06:41:43,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1747692.0, ans=0.0 2023-06-27 06:42:26,694 INFO [train.py:996] (1/4) Epoch 10, batch 16850, loss[loss=0.2222, simple_loss=0.2996, pruned_loss=0.07243, over 21903.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3097, pruned_loss=0.07399, over 4286993.93 frames. ], batch size: 118, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:42:29,220 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:42:37,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1747812.0, ans=0.05 2023-06-27 06:43:05,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1747932.0, ans=0.125 2023-06-27 06:43:14,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1747932.0, ans=0.125 2023-06-27 06:43:25,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-27 06:43:27,405 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.310e+02 6.690e+02 9.145e+02 1.519e+03 3.869e+03, threshold=1.829e+03, percent-clipped=12.0 2023-06-27 06:43:58,958 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. 
limit=15.0 2023-06-27 06:44:06,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1748052.0, ans=0.125 2023-06-27 06:44:08,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1748052.0, ans=0.1 2023-06-27 06:44:12,642 INFO [train.py:996] (1/4) Epoch 10, batch 16900, loss[loss=0.176, simple_loss=0.2484, pruned_loss=0.05183, over 21615.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3046, pruned_loss=0.07204, over 4288827.81 frames. ], batch size: 247, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:44:30,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1748172.0, ans=0.2 2023-06-27 06:44:52,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=22.5 2023-06-27 06:45:59,668 INFO [train.py:996] (1/4) Epoch 10, batch 16950, loss[loss=0.2158, simple_loss=0.2824, pruned_loss=0.07463, over 21420.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2989, pruned_loss=0.07062, over 4288701.67 frames. ], batch size: 177, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:46:20,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1748472.0, ans=0.0 2023-06-27 06:47:00,159 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.671e+02 6.123e+02 1.009e+03 1.392e+03 3.065e+03, threshold=2.019e+03, percent-clipped=11.0 2023-06-27 06:47:00,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1748592.0, ans=0.0 2023-06-27 06:47:14,457 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:47:15,205 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.47 vs. limit=10.0 2023-06-27 06:47:44,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1748652.0, ans=0.0 2023-06-27 06:47:47,015 INFO [train.py:996] (1/4) Epoch 10, batch 17000, loss[loss=0.2399, simple_loss=0.3065, pruned_loss=0.08658, over 21287.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2963, pruned_loss=0.07122, over 4294759.47 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:48:02,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.03 vs. limit=10.0 2023-06-27 06:48:33,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1748832.0, ans=0.04949747468305833 2023-06-27 06:48:37,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.43 vs. 
limit=22.5 2023-06-27 06:49:16,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1748952.0, ans=0.2 2023-06-27 06:49:32,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1748952.0, ans=0.2 2023-06-27 06:49:35,360 INFO [train.py:996] (1/4) Epoch 10, batch 17050, loss[loss=0.2164, simple_loss=0.2988, pruned_loss=0.06698, over 21821.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3034, pruned_loss=0.07243, over 4294776.38 frames. ], batch size: 282, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:50:16,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1749072.0, ans=0.125 2023-06-27 06:50:21,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1749132.0, ans=0.0 2023-06-27 06:50:35,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1749132.0, ans=0.125 2023-06-27 06:50:37,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1749132.0, ans=0.125 2023-06-27 06:50:39,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.381e+02 7.817e+02 1.217e+03 1.816e+03 4.089e+03, threshold=2.434e+03, percent-clipped=19.0 2023-06-27 06:50:45,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1749192.0, ans=0.125 2023-06-27 06:51:19,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1749312.0, ans=0.1 2023-06-27 06:51:20,984 INFO [train.py:996] (1/4) Epoch 10, batch 17100, loss[loss=0.2017, simple_loss=0.2722, pruned_loss=0.06557, over 21836.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3018, pruned_loss=0.07286, over 4298961.98 frames. ], batch size: 282, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:51:24,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1749312.0, ans=0.125 2023-06-27 06:51:28,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1749312.0, ans=0.125 2023-06-27 06:51:31,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1749312.0, ans=0.0 2023-06-27 06:51:53,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=12.0 2023-06-27 06:52:27,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1749492.0, ans=0.0 2023-06-27 06:52:56,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1749552.0, ans=0.0 2023-06-27 06:53:03,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1749552.0, ans=0.0 2023-06-27 06:53:07,948 INFO [train.py:996] (1/4) Epoch 10, batch 17150, loss[loss=0.1789, simple_loss=0.2641, pruned_loss=0.04689, over 21784.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2969, pruned_loss=0.07163, over 4304556.23 frames. 
], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:53:10,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1749612.0, ans=0.1 2023-06-27 06:53:45,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1749672.0, ans=0.125 2023-06-27 06:54:16,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 6.290e+02 9.886e+02 1.236e+03 2.278e+03, threshold=1.977e+03, percent-clipped=0.0 2023-06-27 06:54:34,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1749792.0, ans=0.07 2023-06-27 06:55:01,785 INFO [train.py:996] (1/4) Epoch 10, batch 17200, loss[loss=0.2094, simple_loss=0.2888, pruned_loss=0.065, over 21726.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2957, pruned_loss=0.07174, over 4302034.65 frames. ], batch size: 298, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:56:01,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1750032.0, ans=0.05 2023-06-27 06:56:09,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1750092.0, ans=0.125 2023-06-27 06:56:56,957 INFO [train.py:996] (1/4) Epoch 10, batch 17250, loss[loss=0.2668, simple_loss=0.3378, pruned_loss=0.09787, over 21369.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2973, pruned_loss=0.07331, over 4292044.49 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:57:13,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1750212.0, ans=0.125 2023-06-27 06:58:00,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.260e+02 7.026e+02 1.059e+03 1.492e+03 2.502e+03, threshold=2.118e+03, percent-clipped=5.0 2023-06-27 06:58:17,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1750392.0, ans=0.1 2023-06-27 06:58:19,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1750392.0, ans=0.125 2023-06-27 06:58:21,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1750452.0, ans=0.95 2023-06-27 06:58:50,673 INFO [train.py:996] (1/4) Epoch 10, batch 17300, loss[loss=0.2333, simple_loss=0.3161, pruned_loss=0.0752, over 21702.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3066, pruned_loss=0.07692, over 4287427.24 frames. ], batch size: 113, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 06:59:03,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1750512.0, ans=0.125 2023-06-27 06:59:06,099 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.80 vs. limit=15.0 2023-06-27 06:59:13,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1750572.0, ans=15.0 2023-06-27 07:00:06,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.44 vs. 
limit=15.0 2023-06-27 07:00:39,975 INFO [train.py:996] (1/4) Epoch 10, batch 17350, loss[loss=0.2009, simple_loss=0.2758, pruned_loss=0.063, over 21475.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3078, pruned_loss=0.07615, over 4282090.06 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:00:58,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1750812.0, ans=0.125 2023-06-27 07:01:14,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750872.0, ans=0.1 2023-06-27 07:01:18,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1750872.0, ans=0.125 2023-06-27 07:01:31,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1750932.0, ans=0.09899494936611666 2023-06-27 07:01:32,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=22.5 2023-06-27 07:01:43,499 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.572e+02 6.288e+02 8.971e+02 1.269e+03 2.386e+03, threshold=1.794e+03, percent-clipped=3.0 2023-06-27 07:02:35,912 INFO [train.py:996] (1/4) Epoch 10, batch 17400, loss[loss=0.204, simple_loss=0.2956, pruned_loss=0.05622, over 21759.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.304, pruned_loss=0.07287, over 4277936.34 frames. ], batch size: 332, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:02:36,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1751112.0, ans=0.2 2023-06-27 07:02:54,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1751172.0, ans=0.125 2023-06-27 07:02:54,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1751172.0, ans=0.2 2023-06-27 07:03:47,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-27 07:04:14,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1751352.0, ans=0.1 2023-06-27 07:04:24,539 INFO [train.py:996] (1/4) Epoch 10, batch 17450, loss[loss=0.1925, simple_loss=0.2862, pruned_loss=0.04936, over 21784.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3012, pruned_loss=0.07073, over 4278046.63 frames. ], batch size: 371, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:04:40,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1751472.0, ans=0.125 2023-06-27 07:04:52,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1751472.0, ans=0.125 2023-06-27 07:04:57,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.52 vs. 
limit=12.0 2023-06-27 07:04:58,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1751472.0, ans=0.125 2023-06-27 07:05:21,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1751532.0, ans=0.125 2023-06-27 07:05:31,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.811e+02 5.744e+02 7.670e+02 1.157e+03 3.080e+03, threshold=1.534e+03, percent-clipped=10.0 2023-06-27 07:05:59,237 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-27 07:06:11,707 INFO [train.py:996] (1/4) Epoch 10, batch 17500, loss[loss=0.242, simple_loss=0.3048, pruned_loss=0.08966, over 21758.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2967, pruned_loss=0.0693, over 4281434.43 frames. ], batch size: 441, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:06:39,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1751772.0, ans=0.125 2023-06-27 07:06:45,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1751772.0, ans=0.125 2023-06-27 07:06:55,310 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:07:36,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=15.0 2023-06-27 07:07:50,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1751952.0, ans=0.0 2023-06-27 07:07:59,009 INFO [train.py:996] (1/4) Epoch 10, batch 17550, loss[loss=0.2071, simple_loss=0.2983, pruned_loss=0.05799, over 21870.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2977, pruned_loss=0.06819, over 4281448.20 frames. ], batch size: 118, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:08:04,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1752012.0, ans=0.125 2023-06-27 07:09:08,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 5.513e+02 7.220e+02 1.144e+03 2.854e+03, threshold=1.444e+03, percent-clipped=10.0 2023-06-27 07:09:10,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1752192.0, ans=0.035 2023-06-27 07:09:12,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1752192.0, ans=0.0 2023-06-27 07:09:48,126 INFO [train.py:996] (1/4) Epoch 10, batch 17600, loss[loss=0.2057, simple_loss=0.3103, pruned_loss=0.05055, over 20746.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3002, pruned_loss=0.06867, over 4265525.51 frames. ], batch size: 608, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:09:50,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1752312.0, ans=0.2 2023-06-27 07:11:14,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1752492.0, ans=0.1 2023-06-27 07:11:36,227 INFO [train.py:996] (1/4) Epoch 10, batch 17650, loss[loss=0.1535, simple_loss=0.2216, pruned_loss=0.04271, over 21551.00 frames. 
], tot_loss[loss=0.2167, simple_loss=0.2975, pruned_loss=0.06793, over 4274753.78 frames. ], batch size: 230, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:11:38,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1752612.0, ans=0.0 2023-06-27 07:11:48,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1752612.0, ans=0.125 2023-06-27 07:12:11,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1752672.0, ans=0.125 2023-06-27 07:12:16,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752672.0, ans=0.1 2023-06-27 07:12:43,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1752792.0, ans=0.0 2023-06-27 07:12:51,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.366e+02 6.949e+02 1.125e+03 1.736e+03 3.581e+03, threshold=2.249e+03, percent-clipped=33.0 2023-06-27 07:13:12,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-27 07:13:13,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-27 07:13:21,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1752852.0, ans=0.125 2023-06-27 07:13:30,131 INFO [train.py:996] (1/4) Epoch 10, batch 17700, loss[loss=0.2269, simple_loss=0.313, pruned_loss=0.07037, over 21710.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2933, pruned_loss=0.06638, over 4269785.03 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:13:45,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1752912.0, ans=0.07 2023-06-27 07:13:54,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1752972.0, ans=0.0 2023-06-27 07:14:32,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1753032.0, ans=0.125 2023-06-27 07:14:46,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1753092.0, ans=0.95 2023-06-27 07:15:23,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1753152.0, ans=0.125 2023-06-27 07:15:25,779 INFO [train.py:996] (1/4) Epoch 10, batch 17750, loss[loss=0.2545, simple_loss=0.3359, pruned_loss=0.08652, over 21504.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2986, pruned_loss=0.06828, over 4273337.78 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:16:14,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. 
limit=6.0 2023-06-27 07:16:31,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.520e+02 6.307e+02 8.574e+02 1.258e+03 1.929e+03, threshold=1.715e+03, percent-clipped=0.0 2023-06-27 07:16:58,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1753452.0, ans=0.125 2023-06-27 07:17:15,950 INFO [train.py:996] (1/4) Epoch 10, batch 17800, loss[loss=0.1691, simple_loss=0.2628, pruned_loss=0.03768, over 21830.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2983, pruned_loss=0.06761, over 4273781.95 frames. ], batch size: 372, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:17:25,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1753512.0, ans=0.0 2023-06-27 07:17:46,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1753572.0, ans=0.125 2023-06-27 07:17:51,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1753572.0, ans=0.0 2023-06-27 07:19:09,856 INFO [train.py:996] (1/4) Epoch 10, batch 17850, loss[loss=0.2231, simple_loss=0.3035, pruned_loss=0.07136, over 21826.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2988, pruned_loss=0.06757, over 4274194.96 frames. ], batch size: 282, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:19:15,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1753812.0, ans=0.125 2023-06-27 07:19:33,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1753872.0, ans=0.125 2023-06-27 07:19:40,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1753872.0, ans=0.125 2023-06-27 07:19:48,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1753932.0, ans=0.125 2023-06-27 07:20:00,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1753932.0, ans=0.125 2023-06-27 07:20:19,505 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.075e+02 5.767e+02 7.778e+02 1.051e+03 2.491e+03, threshold=1.556e+03, percent-clipped=2.0 2023-06-27 07:20:41,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1754052.0, ans=0.0 2023-06-27 07:20:59,092 INFO [train.py:996] (1/4) Epoch 10, batch 17900, loss[loss=0.2405, simple_loss=0.3257, pruned_loss=0.0777, over 21331.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3032, pruned_loss=0.06854, over 4271333.20 frames. ], batch size: 548, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:21:14,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.81 vs. limit=15.0 2023-06-27 07:21:56,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1754232.0, ans=0.125 2023-06-27 07:22:54,515 INFO [train.py:996] (1/4) Epoch 10, batch 17950, loss[loss=0.1833, simple_loss=0.2782, pruned_loss=0.04414, over 21607.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.3023, pruned_loss=0.06607, over 4267664.68 frames. 
], batch size: 263, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:23:07,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1754412.0, ans=0.0 2023-06-27 07:23:36,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1754472.0, ans=22.5 2023-06-27 07:23:57,654 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 6.955e+02 1.067e+03 1.323e+03 3.422e+03, threshold=2.134e+03, percent-clipped=13.0 2023-06-27 07:24:41,309 INFO [train.py:996] (1/4) Epoch 10, batch 18000, loss[loss=0.1988, simple_loss=0.2649, pruned_loss=0.06629, over 21528.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2956, pruned_loss=0.0652, over 4272901.55 frames. ], batch size: 391, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:24:41,310 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 07:24:52,764 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.0056, 2.7084, 4.1684, 2.6116], device='cuda:1') 2023-06-27 07:24:59,820 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2583, simple_loss=0.3514, pruned_loss=0.08255, over 1796401.00 frames. 2023-06-27 07:24:59,821 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 07:25:16,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1754712.0, ans=0.125 2023-06-27 07:26:32,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1754952.0, ans=0.1 2023-06-27 07:26:48,117 INFO [train.py:996] (1/4) Epoch 10, batch 18050, loss[loss=0.1698, simple_loss=0.2537, pruned_loss=0.04299, over 21734.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.291, pruned_loss=0.06511, over 4270723.74 frames. ], batch size: 282, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:26:48,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1755012.0, ans=0.2 2023-06-27 07:27:01,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0 2023-06-27 07:27:26,257 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:28:06,134 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.888e+02 5.313e+02 7.169e+02 9.498e+02 2.481e+03, threshold=1.434e+03, percent-clipped=3.0 2023-06-27 07:28:20,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1755252.0, ans=0.125 2023-06-27 07:28:23,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-27 07:28:29,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-27 07:28:37,174 INFO [train.py:996] (1/4) Epoch 10, batch 18100, loss[loss=0.2739, simple_loss=0.355, pruned_loss=0.0964, over 21463.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2955, pruned_loss=0.06708, over 4272542.81 frames. 
], batch size: 471, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:28:57,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755312.0, ans=0.1 2023-06-27 07:29:12,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1755372.0, ans=0.5 2023-06-27 07:29:38,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1755432.0, ans=0.125 2023-06-27 07:29:56,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1755492.0, ans=0.2 2023-06-27 07:30:00,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=12.0 2023-06-27 07:30:18,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1755552.0, ans=0.0 2023-06-27 07:30:24,626 INFO [train.py:996] (1/4) Epoch 10, batch 18150, loss[loss=0.1971, simple_loss=0.2712, pruned_loss=0.06149, over 21713.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2972, pruned_loss=0.06691, over 4272001.55 frames. ], batch size: 333, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:30:46,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1755612.0, ans=0.125 2023-06-27 07:31:42,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.043e+02 8.888e+02 1.339e+03 2.734e+03, threshold=1.778e+03, percent-clipped=20.0 2023-06-27 07:31:52,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1755792.0, ans=0.125 2023-06-27 07:32:00,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1755852.0, ans=0.125 2023-06-27 07:32:11,834 INFO [train.py:996] (1/4) Epoch 10, batch 18200, loss[loss=0.2142, simple_loss=0.284, pruned_loss=0.07215, over 21193.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2904, pruned_loss=0.06649, over 4254239.08 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:32:12,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1755912.0, ans=0.125 2023-06-27 07:32:17,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1755912.0, ans=0.0 2023-06-27 07:32:38,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1755972.0, ans=0.0 2023-06-27 07:32:52,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1755972.0, ans=0.125 2023-06-27 07:33:47,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1756152.0, ans=0.125 2023-06-27 07:33:49,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1756152.0, ans=0.125 2023-06-27 07:33:54,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1756152.0, ans=0.125 2023-06-27 07:33:57,129 INFO [train.py:996] (1/4) Epoch 10, batch 18250, loss[loss=0.1683, simple_loss=0.2466, pruned_loss=0.04497, over 21693.00 frames. 
], tot_loss[loss=0.2063, simple_loss=0.2835, pruned_loss=0.06456, over 4246128.55 frames. ], batch size: 124, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:33:58,151 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-27 07:34:03,311 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-27 07:34:15,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1756212.0, ans=0.125 2023-06-27 07:34:33,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.24 vs. limit=12.0 2023-06-27 07:35:06,626 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.839e+02 5.360e+02 7.214e+02 1.131e+03 2.943e+03, threshold=1.443e+03, percent-clipped=6.0 2023-06-27 07:35:41,586 INFO [train.py:996] (1/4) Epoch 10, batch 18300, loss[loss=0.2548, simple_loss=0.3611, pruned_loss=0.07422, over 21715.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2863, pruned_loss=0.06579, over 4259342.65 frames. ], batch size: 389, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:37:00,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-27 07:37:17,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1756752.0, ans=0.0 2023-06-27 07:37:27,281 INFO [train.py:996] (1/4) Epoch 10, batch 18350, loss[loss=0.2359, simple_loss=0.3411, pruned_loss=0.06534, over 20733.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2916, pruned_loss=0.06518, over 4257765.51 frames. ], batch size: 607, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:37:43,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1756872.0, ans=0.125 2023-06-27 07:38:11,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1756932.0, ans=0.0 2023-06-27 07:38:39,353 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.953e+02 5.887e+02 8.763e+02 1.316e+03 3.037e+03, threshold=1.753e+03, percent-clipped=16.0 2023-06-27 07:38:40,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1756992.0, ans=0.2 2023-06-27 07:39:16,527 INFO [train.py:996] (1/4) Epoch 10, batch 18400, loss[loss=0.1889, simple_loss=0.2567, pruned_loss=0.06058, over 21600.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2862, pruned_loss=0.06375, over 4245481.49 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:39:19,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1757112.0, ans=0.07 2023-06-27 07:40:23,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1757292.0, ans=0.125 2023-06-27 07:41:01,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1757352.0, ans=0.125 2023-06-27 07:41:04,176 INFO [train.py:996] (1/4) Epoch 10, batch 18450, loss[loss=0.1984, simple_loss=0.2574, pruned_loss=0.0697, over 21093.00 frames. 
], tot_loss[loss=0.2012, simple_loss=0.2809, pruned_loss=0.06073, over 4243551.78 frames. ], batch size: 608, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:42:13,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-27 07:42:17,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.327e+02 4.825e+02 6.029e+02 8.495e+02 1.994e+03, threshold=1.206e+03, percent-clipped=1.0 2023-06-27 07:42:45,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1757652.0, ans=0.035 2023-06-27 07:42:47,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1757652.0, ans=0.125 2023-06-27 07:42:50,178 INFO [train.py:996] (1/4) Epoch 10, batch 18500, loss[loss=0.1764, simple_loss=0.2487, pruned_loss=0.05209, over 21608.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2761, pruned_loss=0.05988, over 4253854.46 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:43:52,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1757832.0, ans=0.125 2023-06-27 07:44:32,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1757952.0, ans=0.0 2023-06-27 07:44:37,143 INFO [train.py:996] (1/4) Epoch 10, batch 18550, loss[loss=0.2084, simple_loss=0.2691, pruned_loss=0.07381, over 21306.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2741, pruned_loss=0.05981, over 4240589.96 frames. ], batch size: 473, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:44:51,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1758012.0, ans=0.125 2023-06-27 07:45:37,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1758132.0, ans=0.2 2023-06-27 07:45:57,523 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.953e+02 6.338e+02 9.700e+02 1.484e+03 3.316e+03, threshold=1.940e+03, percent-clipped=34.0 2023-06-27 07:46:24,946 INFO [train.py:996] (1/4) Epoch 10, batch 18600, loss[loss=0.1929, simple_loss=0.2828, pruned_loss=0.05154, over 21781.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2716, pruned_loss=0.05956, over 4241177.93 frames. ], batch size: 282, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:47:54,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1758552.0, ans=0.2 2023-06-27 07:48:09,083 INFO [train.py:996] (1/4) Epoch 10, batch 18650, loss[loss=0.2239, simple_loss=0.3136, pruned_loss=0.06711, over 21700.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2734, pruned_loss=0.06067, over 4242131.15 frames. ], batch size: 415, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:48:40,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. 
limit=15.0 2023-06-27 07:48:44,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1758672.0, ans=0.125 2023-06-27 07:49:21,076 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.715e+02 5.496e+02 8.127e+02 1.461e+03 3.115e+03, threshold=1.625e+03, percent-clipped=10.0 2023-06-27 07:49:21,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1758792.0, ans=0.125 2023-06-27 07:49:45,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1758852.0, ans=0.0 2023-06-27 07:49:53,336 INFO [train.py:996] (1/4) Epoch 10, batch 18700, loss[loss=0.2038, simple_loss=0.2783, pruned_loss=0.06469, over 21946.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2728, pruned_loss=0.06209, over 4243172.65 frames. ], batch size: 113, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:50:26,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1758972.0, ans=0.125 2023-06-27 07:50:29,239 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-27 07:50:47,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1759032.0, ans=0.125 2023-06-27 07:51:39,693 INFO [train.py:996] (1/4) Epoch 10, batch 18750, loss[loss=0.2444, simple_loss=0.3224, pruned_loss=0.08319, over 21860.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2746, pruned_loss=0.06378, over 4243735.91 frames. ], batch size: 118, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:51:55,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1759272.0, ans=0.0 2023-06-27 07:52:14,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1759272.0, ans=0.0 2023-06-27 07:52:52,807 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.039e+02 6.322e+02 1.036e+03 1.574e+03 2.810e+03, threshold=2.072e+03, percent-clipped=23.0 2023-06-27 07:53:25,198 INFO [train.py:996] (1/4) Epoch 10, batch 18800, loss[loss=0.1788, simple_loss=0.2552, pruned_loss=0.05119, over 21159.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2807, pruned_loss=0.06508, over 4248335.48 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:53:26,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1759512.0, ans=0.1 2023-06-27 07:53:31,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1759512.0, ans=0.0 2023-06-27 07:53:59,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1759572.0, ans=0.1 2023-06-27 07:53:59,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1759572.0, ans=0.2 2023-06-27 07:54:02,038 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.76 vs. 
limit=12.0 2023-06-27 07:54:12,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.80 vs. limit=15.0 2023-06-27 07:54:15,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-27 07:55:10,079 INFO [train.py:996] (1/4) Epoch 10, batch 18850, loss[loss=0.1735, simple_loss=0.2363, pruned_loss=0.05535, over 21820.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2771, pruned_loss=0.06198, over 4232472.86 frames. ], batch size: 102, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:55:18,052 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-27 07:55:44,285 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=15.0 2023-06-27 07:56:22,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1759992.0, ans=0.125 2023-06-27 07:56:23,522 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.405e+02 6.936e+02 9.507e+02 2.005e+03, threshold=1.387e+03, percent-clipped=0.0 2023-06-27 07:56:39,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1760052.0, ans=0.125 2023-06-27 07:56:44,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760052.0, ans=0.1 2023-06-27 07:56:56,192 INFO [train.py:996] (1/4) Epoch 10, batch 18900, loss[loss=0.2047, simple_loss=0.2703, pruned_loss=0.06961, over 21814.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2743, pruned_loss=0.0625, over 4246768.40 frames. ], batch size: 371, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:57:13,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1760172.0, ans=0.025 2023-06-27 07:57:52,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760232.0, ans=0.1 2023-06-27 07:58:21,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1760292.0, ans=0.0 2023-06-27 07:58:41,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1760412.0, ans=0.0 2023-06-27 07:58:42,081 INFO [train.py:996] (1/4) Epoch 10, batch 18950, loss[loss=0.2207, simple_loss=0.2961, pruned_loss=0.07264, over 21838.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2759, pruned_loss=0.06399, over 4244341.35 frames. ], batch size: 391, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:58:55,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1760412.0, ans=0.0 2023-06-27 07:59:08,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. 
limit=10.0 2023-06-27 07:59:12,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1760472.0, ans=0.125 2023-06-27 07:59:15,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1760472.0, ans=0.0 2023-06-27 07:59:17,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1760472.0, ans=0.0 2023-06-27 07:59:36,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1760532.0, ans=0.0 2023-06-27 07:59:57,328 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 7.333e+02 1.084e+03 1.694e+03 3.772e+03, threshold=2.167e+03, percent-clipped=36.0 2023-06-27 08:00:01,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1760592.0, ans=0.2 2023-06-27 08:00:24,667 INFO [train.py:996] (1/4) Epoch 10, batch 19000, loss[loss=0.2905, simple_loss=0.3523, pruned_loss=0.1144, over 21353.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2873, pruned_loss=0.06675, over 4247284.22 frames. ], batch size: 507, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:01:42,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1760892.0, ans=0.0 2023-06-27 08:01:58,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1760952.0, ans=0.0 2023-06-27 08:02:06,192 INFO [train.py:996] (1/4) Epoch 10, batch 19050, loss[loss=0.2315, simple_loss=0.3429, pruned_loss=0.06008, over 20770.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2926, pruned_loss=0.07001, over 4253900.65 frames. ], batch size: 607, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:02:14,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-06-27 08:02:20,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1761012.0, ans=0.07 2023-06-27 08:02:44,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1761072.0, ans=0.1 2023-06-27 08:03:21,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1761192.0, ans=0.125 2023-06-27 08:03:24,546 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.013e+02 5.908e+02 6.994e+02 9.504e+02 2.053e+03, threshold=1.399e+03, percent-clipped=0.0 2023-06-27 08:03:51,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1761312.0, ans=0.125 2023-06-27 08:03:52,608 INFO [train.py:996] (1/4) Epoch 10, batch 19100, loss[loss=0.2013, simple_loss=0.2688, pruned_loss=0.06696, over 21892.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2896, pruned_loss=0.06878, over 4245352.51 frames. 
], batch size: 107, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:03:54,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1761312.0, ans=0.125 2023-06-27 08:04:17,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1761372.0, ans=0.125 2023-06-27 08:04:24,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1761372.0, ans=0.5 2023-06-27 08:04:35,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1761372.0, ans=0.2 2023-06-27 08:05:30,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1761552.0, ans=0.0 2023-06-27 08:05:42,415 INFO [train.py:996] (1/4) Epoch 10, batch 19150, loss[loss=0.2269, simple_loss=0.3196, pruned_loss=0.06712, over 21578.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2924, pruned_loss=0.06978, over 4253922.94 frames. ], batch size: 230, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:05:49,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-27 08:05:57,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1761612.0, ans=0.0 2023-06-27 08:06:08,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1761672.0, ans=0.1 2023-06-27 08:06:20,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1761672.0, ans=0.125 2023-06-27 08:06:53,110 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.207e+02 6.138e+02 1.014e+03 1.599e+03 3.928e+03, threshold=2.029e+03, percent-clipped=32.0 2023-06-27 08:07:14,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1761852.0, ans=0.2 2023-06-27 08:07:26,284 INFO [train.py:996] (1/4) Epoch 10, batch 19200, loss[loss=0.2027, simple_loss=0.3087, pruned_loss=0.04834, over 21341.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3013, pruned_loss=0.06987, over 4255610.12 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 08:07:56,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1761972.0, ans=0.1 2023-06-27 08:08:27,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1762092.0, ans=0.0 2023-06-27 08:09:08,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1762152.0, ans=0.125 2023-06-27 08:09:12,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.63 vs. limit=22.5 2023-06-27 08:09:12,904 INFO [train.py:996] (1/4) Epoch 10, batch 19250, loss[loss=0.1618, simple_loss=0.2597, pruned_loss=0.03197, over 21723.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2994, pruned_loss=0.06516, over 4265140.22 frames. 
], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:09:15,344 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:10:01,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1762332.0, ans=0.125 2023-06-27 08:10:23,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.181e+02 6.655e+02 8.936e+02 1.845e+03, threshold=1.331e+03, percent-clipped=0.0 2023-06-27 08:10:59,774 INFO [train.py:996] (1/4) Epoch 10, batch 19300, loss[loss=0.2107, simple_loss=0.2811, pruned_loss=0.07021, over 21823.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2959, pruned_loss=0.06452, over 4277475.32 frames. ], batch size: 124, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:11:27,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1762572.0, ans=0.2 2023-06-27 08:11:34,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1762572.0, ans=0.125 2023-06-27 08:12:39,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1762752.0, ans=0.125 2023-06-27 08:12:52,825 INFO [train.py:996] (1/4) Epoch 10, batch 19350, loss[loss=0.1673, simple_loss=0.2488, pruned_loss=0.04291, over 21239.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2906, pruned_loss=0.06173, over 4283749.98 frames. ], batch size: 159, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:12:55,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1762812.0, ans=0.2 2023-06-27 08:13:20,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1762872.0, ans=0.0 2023-06-27 08:13:34,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1762932.0, ans=0.125 2023-06-27 08:14:03,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.583e+02 5.604e+02 8.480e+02 1.112e+03 2.601e+03, threshold=1.696e+03, percent-clipped=20.0 2023-06-27 08:14:11,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1763052.0, ans=0.125 2023-06-27 08:14:39,177 INFO [train.py:996] (1/4) Epoch 10, batch 19400, loss[loss=0.2312, simple_loss=0.3048, pruned_loss=0.07875, over 21609.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.289, pruned_loss=0.06139, over 4283338.73 frames. ], batch size: 471, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:14:41,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1763112.0, ans=0.025 2023-06-27 08:14:53,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1763112.0, ans=0.025 2023-06-27 08:16:13,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1763352.0, ans=0.125 2023-06-27 08:16:23,172 INFO [train.py:996] (1/4) Epoch 10, batch 19450, loss[loss=0.2169, simple_loss=0.2806, pruned_loss=0.07658, over 21359.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2862, pruned_loss=0.06292, over 4285053.98 frames. 
], batch size: 143, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:16:32,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1763412.0, ans=0.125 2023-06-27 08:17:12,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1763532.0, ans=0.2 2023-06-27 08:17:16,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1763532.0, ans=0.0 2023-06-27 08:17:19,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-27 08:17:34,586 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 5.344e+02 8.011e+02 1.240e+03 3.010e+03, threshold=1.602e+03, percent-clipped=14.0 2023-06-27 08:18:05,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-27 08:18:11,412 INFO [train.py:996] (1/4) Epoch 10, batch 19500, loss[loss=0.2115, simple_loss=0.2912, pruned_loss=0.06587, over 21698.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2829, pruned_loss=0.06326, over 4280401.39 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:18:26,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1763712.0, ans=0.125 2023-06-27 08:18:26,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-06-27 08:18:46,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1763832.0, ans=0.125 2023-06-27 08:19:01,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1763892.0, ans=0.0 2023-06-27 08:19:17,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1763892.0, ans=15.0 2023-06-27 08:19:57,003 INFO [train.py:996] (1/4) Epoch 10, batch 19550, loss[loss=0.1993, simple_loss=0.2918, pruned_loss=0.05343, over 21787.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2786, pruned_loss=0.06232, over 4270635.53 frames. ], batch size: 282, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:20:49,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1764192.0, ans=0.0 2023-06-27 08:21:01,404 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.262e+02 6.617e+02 9.937e+02 1.346e+03 3.535e+03, threshold=1.987e+03, percent-clipped=18.0 2023-06-27 08:21:03,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1764192.0, ans=0.125 2023-06-27 08:21:41,966 INFO [train.py:996] (1/4) Epoch 10, batch 19600, loss[loss=0.2429, simple_loss=0.3063, pruned_loss=0.08978, over 21645.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2805, pruned_loss=0.06393, over 4273233.19 frames. 
], batch size: 471, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:22:05,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1764372.0, ans=0.2 2023-06-27 08:22:05,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764372.0, ans=0.1 2023-06-27 08:22:12,160 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:23:23,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1764552.0, ans=0.2 2023-06-27 08:23:30,477 INFO [train.py:996] (1/4) Epoch 10, batch 19650, loss[loss=0.2195, simple_loss=0.2887, pruned_loss=0.07512, over 21427.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2844, pruned_loss=0.06702, over 4275383.36 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:23:40,269 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-27 08:23:50,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1764672.0, ans=0.2 2023-06-27 08:24:56,583 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.290e+02 8.079e+02 1.063e+03 2.506e+03, threshold=1.616e+03, percent-clipped=1.0 2023-06-27 08:25:01,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1764792.0, ans=0.0 2023-06-27 08:25:19,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1764852.0, ans=0.2 2023-06-27 08:25:22,632 INFO [train.py:996] (1/4) Epoch 10, batch 19700, loss[loss=0.2422, simple_loss=0.3309, pruned_loss=0.07675, over 21308.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2884, pruned_loss=0.06768, over 4273795.38 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:25:27,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1764912.0, ans=0.125 2023-06-27 08:25:42,161 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=12.0 2023-06-27 08:25:58,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764972.0, ans=0.1 2023-06-27 08:26:10,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765032.0, ans=0.1 2023-06-27 08:26:34,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. 
limit=15.0 2023-06-27 08:26:53,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1765152.0, ans=0.125 2023-06-27 08:26:57,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1765152.0, ans=0.125 2023-06-27 08:27:07,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765152.0, ans=0.1 2023-06-27 08:27:12,076 INFO [train.py:996] (1/4) Epoch 10, batch 19750, loss[loss=0.2661, simple_loss=0.383, pruned_loss=0.0746, over 21255.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2982, pruned_loss=0.06877, over 4277162.82 frames. ], batch size: 549, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:27:16,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0 2023-06-27 08:27:38,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-27 08:28:18,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-27 08:28:29,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1765392.0, ans=0.125 2023-06-27 08:28:33,891 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.148e+02 7.081e+02 1.435e+03 2.263e+03 4.438e+03, threshold=2.870e+03, percent-clipped=43.0 2023-06-27 08:28:58,221 INFO [train.py:996] (1/4) Epoch 10, batch 19800, loss[loss=0.193, simple_loss=0.2731, pruned_loss=0.05652, over 21910.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2973, pruned_loss=0.06873, over 4287994.80 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:28:58,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1765512.0, ans=0.035 2023-06-27 08:29:02,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1765512.0, ans=0.0 2023-06-27 08:30:48,429 INFO [train.py:996] (1/4) Epoch 10, batch 19850, loss[loss=0.1869, simple_loss=0.2805, pruned_loss=0.04661, over 21699.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2889, pruned_loss=0.06422, over 4281613.61 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:32:02,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1765992.0, ans=0.0 2023-06-27 08:32:13,172 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 5.631e+02 8.956e+02 1.493e+03 4.041e+03, threshold=1.791e+03, percent-clipped=3.0 2023-06-27 08:32:15,838 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:32:29,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1766052.0, ans=0.125 2023-06-27 08:32:35,723 INFO [train.py:996] (1/4) Epoch 10, batch 19900, loss[loss=0.1824, simple_loss=0.259, pruned_loss=0.05294, over 21517.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2901, pruned_loss=0.06228, over 4286481.71 frames. 
], batch size: 195, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:33:48,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1766292.0, ans=0.125 2023-06-27 08:33:59,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2023-06-27 08:34:29,563 INFO [train.py:996] (1/4) Epoch 10, batch 19950, loss[loss=0.171, simple_loss=0.2506, pruned_loss=0.0457, over 21507.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2855, pruned_loss=0.06174, over 4274746.48 frames. ], batch size: 230, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:35:02,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-27 08:35:49,401 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 4.941e+02 6.532e+02 1.016e+03 1.667e+03, threshold=1.306e+03, percent-clipped=0.0 2023-06-27 08:35:55,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-06-27 08:36:00,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-27 08:36:18,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1766652.0, ans=0.125 2023-06-27 08:36:21,406 INFO [train.py:996] (1/4) Epoch 10, batch 20000, loss[loss=0.1819, simple_loss=0.2421, pruned_loss=0.06081, over 20753.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2881, pruned_loss=0.06268, over 4270410.27 frames. ], batch size: 608, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:36:28,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1766712.0, ans=0.125 2023-06-27 08:36:33,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1766712.0, ans=0.0 2023-06-27 08:36:37,754 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-27 08:36:47,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0 2023-06-27 08:36:51,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1766772.0, ans=0.1 2023-06-27 08:37:22,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1766892.0, ans=0.0 2023-06-27 08:37:49,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1766952.0, ans=0.125 2023-06-27 08:38:03,073 INFO [train.py:996] (1/4) Epoch 10, batch 20050, loss[loss=0.218, simple_loss=0.293, pruned_loss=0.07148, over 21865.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2888, pruned_loss=0.06437, over 4269114.06 frames. 
], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:38:36,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-06-27 08:38:57,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1767132.0, ans=0.125 2023-06-27 08:39:15,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=22.5 2023-06-27 08:39:18,081 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 5.758e+02 8.028e+02 1.108e+03 2.385e+03, threshold=1.606e+03, percent-clipped=14.0 2023-06-27 08:39:47,055 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:39:56,998 INFO [train.py:996] (1/4) Epoch 10, batch 20100, loss[loss=0.2662, simple_loss=0.3651, pruned_loss=0.08366, over 21685.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2913, pruned_loss=0.06608, over 4281211.10 frames. ], batch size: 389, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:40:07,962 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:41:02,883 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:41:44,199 INFO [train.py:996] (1/4) Epoch 10, batch 20150, loss[loss=0.2428, simple_loss=0.3158, pruned_loss=0.08485, over 21710.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2988, pruned_loss=0.06944, over 4280570.64 frames. ], batch size: 332, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:41:52,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1767612.0, ans=0.0 2023-06-27 08:42:16,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1767672.0, ans=0.0 2023-06-27 08:43:08,916 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 8.899e+02 1.364e+03 1.871e+03 4.503e+03, threshold=2.728e+03, percent-clipped=36.0 2023-06-27 08:43:11,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1767792.0, ans=0.125 2023-06-27 08:43:31,390 INFO [train.py:996] (1/4) Epoch 10, batch 20200, loss[loss=0.1758, simple_loss=0.2107, pruned_loss=0.07044, over 16118.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3055, pruned_loss=0.07261, over 4276634.44 frames. ], batch size: 60, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:43:56,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-27 08:44:02,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=15.0 2023-06-27 08:44:51,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1768092.0, ans=0.125 2023-06-27 08:44:59,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. 
limit=15.0 2023-06-27 08:45:13,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-27 08:45:19,238 INFO [train.py:996] (1/4) Epoch 10, batch 20250, loss[loss=0.2236, simple_loss=0.3117, pruned_loss=0.06772, over 21692.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3073, pruned_loss=0.07215, over 4276001.62 frames. ], batch size: 389, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:45:30,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1768212.0, ans=0.125 2023-06-27 08:46:03,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1768332.0, ans=0.125 2023-06-27 08:46:37,989 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.153e+02 5.975e+02 7.847e+02 1.054e+03 2.189e+03, threshold=1.569e+03, percent-clipped=0.0 2023-06-27 08:46:59,740 INFO [train.py:996] (1/4) Epoch 10, batch 20300, loss[loss=0.1835, simple_loss=0.2629, pruned_loss=0.05207, over 21899.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3052, pruned_loss=0.06993, over 4280430.64 frames. ], batch size: 98, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:47:00,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1768512.0, ans=0.0 2023-06-27 08:47:00,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1768512.0, ans=0.0 2023-06-27 08:47:06,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1768512.0, ans=0.0 2023-06-27 08:48:40,508 INFO [train.py:996] (1/4) Epoch 10, batch 20350, loss[loss=0.2092, simple_loss=0.2841, pruned_loss=0.06714, over 21887.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3058, pruned_loss=0.07013, over 4273771.97 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:49:04,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1768872.0, ans=0.0 2023-06-27 08:49:09,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1768872.0, ans=0.0 2023-06-27 08:49:18,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1768872.0, ans=0.2 2023-06-27 08:49:49,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1768992.0, ans=0.0 2023-06-27 08:49:59,738 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-27 08:50:07,285 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.767e+02 5.725e+02 9.452e+02 1.415e+03 2.531e+03, threshold=1.890e+03, percent-clipped=19.0 2023-06-27 08:50:29,308 INFO [train.py:996] (1/4) Epoch 10, batch 20400, loss[loss=0.2974, simple_loss=0.3608, pruned_loss=0.117, over 21424.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3084, pruned_loss=0.07299, over 4279433.00 frames. 
], batch size: 508, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:50:36,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-27 08:50:44,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1769112.0, ans=0.2 2023-06-27 08:51:21,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1769232.0, ans=0.125 2023-06-27 08:51:26,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1769232.0, ans=0.0 2023-06-27 08:51:29,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1769232.0, ans=0.125 2023-06-27 08:52:08,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1769352.0, ans=0.0 2023-06-27 08:52:16,216 INFO [train.py:996] (1/4) Epoch 10, batch 20450, loss[loss=0.2433, simple_loss=0.3176, pruned_loss=0.08449, over 21823.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3088, pruned_loss=0.07463, over 4285406.93 frames. ], batch size: 118, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:52:42,752 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-27 08:53:39,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1769592.0, ans=0.2 2023-06-27 08:53:42,090 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.405e+02 6.127e+02 7.186e+02 1.014e+03 1.873e+03, threshold=1.437e+03, percent-clipped=1.0 2023-06-27 08:53:46,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-06-27 08:54:02,054 INFO [train.py:996] (1/4) Epoch 10, batch 20500, loss[loss=0.2352, simple_loss=0.3014, pruned_loss=0.08448, over 21715.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.304, pruned_loss=0.07369, over 4284055.61 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:54:07,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1769712.0, ans=0.125 2023-06-27 08:54:11,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-27 08:54:13,290 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-27 08:54:30,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1769772.0, ans=0.2 2023-06-27 08:55:48,849 INFO [train.py:996] (1/4) Epoch 10, batch 20550, loss[loss=0.2052, simple_loss=0.2841, pruned_loss=0.06313, over 21572.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2963, pruned_loss=0.07211, over 4269982.97 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:56:05,774 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.50 vs. 
limit=8.0 2023-06-27 08:56:36,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770072.0, ans=0.1 2023-06-27 08:56:53,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=22.5 2023-06-27 08:57:14,707 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 5.326e+02 8.786e+02 1.599e+03 3.543e+03, threshold=1.757e+03, percent-clipped=26.0 2023-06-27 08:57:30,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1770252.0, ans=0.125 2023-06-27 08:57:34,949 INFO [train.py:996] (1/4) Epoch 10, batch 20600, loss[loss=0.2078, simple_loss=0.2845, pruned_loss=0.06552, over 21738.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2971, pruned_loss=0.07041, over 4277545.36 frames. ], batch size: 247, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:57:43,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1770312.0, ans=0.0 2023-06-27 08:58:31,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1770432.0, ans=0.125 2023-06-27 08:58:31,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1770432.0, ans=0.0 2023-06-27 08:58:51,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1770492.0, ans=0.05 2023-06-27 08:59:11,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770552.0, ans=0.1 2023-06-27 08:59:17,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1770552.0, ans=0.125 2023-06-27 08:59:19,935 INFO [train.py:996] (1/4) Epoch 10, batch 20650, loss[loss=0.2525, simple_loss=0.301, pruned_loss=0.102, over 21472.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.293, pruned_loss=0.07032, over 4272269.84 frames. ], batch size: 508, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:00:45,111 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.379e+02 5.610e+02 8.426e+02 1.372e+03 2.943e+03, threshold=1.685e+03, percent-clipped=16.0 2023-06-27 09:00:58,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.92 vs. limit=6.0 2023-06-27 09:01:06,343 INFO [train.py:996] (1/4) Epoch 10, batch 20700, loss[loss=0.1763, simple_loss=0.2543, pruned_loss=0.04915, over 21404.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2857, pruned_loss=0.0667, over 4258633.14 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:02:13,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=22.5 2023-06-27 09:02:25,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1771092.0, ans=0.125 2023-06-27 09:02:51,359 INFO [train.py:996] (1/4) Epoch 10, batch 20750, loss[loss=0.2254, simple_loss=0.3177, pruned_loss=0.06659, over 21403.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2886, pruned_loss=0.06661, over 4253025.76 frames. 
], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:04:13,385 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.912e+02 7.283e+02 1.259e+03 1.890e+03 5.387e+03, threshold=2.519e+03, percent-clipped=32.0 2023-06-27 09:04:39,266 INFO [train.py:996] (1/4) Epoch 10, batch 20800, loss[loss=0.1889, simple_loss=0.2635, pruned_loss=0.05715, over 21596.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2927, pruned_loss=0.06731, over 4253413.53 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 09:05:38,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1771632.0, ans=0.0 2023-06-27 09:05:58,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1771692.0, ans=0.125 2023-06-27 09:06:20,397 INFO [train.py:996] (1/4) Epoch 10, batch 20850, loss[loss=0.2461, simple_loss=0.3152, pruned_loss=0.0885, over 21753.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2879, pruned_loss=0.06617, over 4253266.44 frames. ], batch size: 112, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:07:29,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1771932.0, ans=0.0 2023-06-27 09:07:48,080 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.047e+02 6.784e+02 1.035e+03 1.709e+03 3.199e+03, threshold=2.070e+03, percent-clipped=7.0 2023-06-27 09:08:12,498 INFO [train.py:996] (1/4) Epoch 10, batch 20900, loss[loss=0.2332, simple_loss=0.2991, pruned_loss=0.08364, over 21849.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2889, pruned_loss=0.06736, over 4267925.42 frames. ], batch size: 391, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:09:20,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.51 vs. limit=15.0 2023-06-27 09:09:31,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1772292.0, ans=0.125 2023-06-27 09:09:36,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1772352.0, ans=0.125 2023-06-27 09:09:36,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1772352.0, ans=0.0 2023-06-27 09:09:38,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-27 09:09:42,995 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:09:54,330 INFO [train.py:996] (1/4) Epoch 10, batch 20950, loss[loss=0.1633, simple_loss=0.2409, pruned_loss=0.04286, over 21296.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2836, pruned_loss=0.06403, over 4265278.45 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:10:16,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1772412.0, ans=0.125 2023-06-27 09:10:54,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. 
limit=10.0 2023-06-27 09:10:59,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=8.0 2023-06-27 09:11:19,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.843e+02 5.854e+02 8.072e+02 1.179e+03 2.171e+03, threshold=1.614e+03, percent-clipped=1.0 2023-06-27 09:11:20,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.06 vs. limit=6.0 2023-06-27 09:11:38,296 INFO [train.py:996] (1/4) Epoch 10, batch 21000, loss[loss=0.2161, simple_loss=0.2861, pruned_loss=0.07302, over 21440.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2828, pruned_loss=0.0641, over 4258204.79 frames. ], batch size: 144, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:11:38,296 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 09:12:02,872 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2606, simple_loss=0.3545, pruned_loss=0.08334, over 1796401.00 frames. 2023-06-27 09:12:02,873 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 09:12:48,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1772832.0, ans=0.1 2023-06-27 09:13:19,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1772952.0, ans=0.0 2023-06-27 09:13:41,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1773012.0, ans=0.125 2023-06-27 09:13:42,565 INFO [train.py:996] (1/4) Epoch 10, batch 21050, loss[loss=0.2509, simple_loss=0.2944, pruned_loss=0.1037, over 21479.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2807, pruned_loss=0.06444, over 4263193.61 frames. ], batch size: 508, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:14:08,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-27 09:14:10,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1773072.0, ans=0.125 2023-06-27 09:14:25,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1773132.0, ans=0.0 2023-06-27 09:14:47,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1773192.0, ans=0.0 2023-06-27 09:14:59,146 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 6.476e+02 8.191e+02 1.141e+03 2.345e+03, threshold=1.638e+03, percent-clipped=6.0 2023-06-27 09:15:23,613 INFO [train.py:996] (1/4) Epoch 10, batch 21100, loss[loss=0.1943, simple_loss=0.2709, pruned_loss=0.05883, over 21681.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2771, pruned_loss=0.0643, over 4262992.60 frames. 
], batch size: 333, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:15:31,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773312.0, ans=0.1 2023-06-27 09:16:33,852 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:16:44,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1773492.0, ans=0.0 2023-06-27 09:16:54,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-27 09:17:08,863 INFO [train.py:996] (1/4) Epoch 10, batch 21150, loss[loss=0.2072, simple_loss=0.2749, pruned_loss=0.06973, over 16057.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2723, pruned_loss=0.06456, over 4250342.03 frames. ], batch size: 61, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:17:36,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1773672.0, ans=0.0 2023-06-27 09:18:14,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1773732.0, ans=0.09899494936611666 2023-06-27 09:18:33,088 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 6.384e+02 8.623e+02 1.133e+03 2.526e+03, threshold=1.725e+03, percent-clipped=9.0 2023-06-27 09:18:51,939 INFO [train.py:996] (1/4) Epoch 10, batch 21200, loss[loss=0.1705, simple_loss=0.243, pruned_loss=0.04902, over 21246.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2683, pruned_loss=0.06343, over 4254253.66 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 09:19:02,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1773912.0, ans=0.125 2023-06-27 09:19:18,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1773972.0, ans=0.025 2023-06-27 09:19:43,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1773972.0, ans=0.125 2023-06-27 09:19:46,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1774032.0, ans=0.2 2023-06-27 09:19:48,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1774032.0, ans=0.1 2023-06-27 09:20:44,821 INFO [train.py:996] (1/4) Epoch 10, batch 21250, loss[loss=0.203, simple_loss=0.2743, pruned_loss=0.06589, over 21640.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2677, pruned_loss=0.06338, over 4255074.53 frames. ], batch size: 247, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:22:08,468 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.773e+02 6.524e+02 9.029e+02 1.391e+03 2.253e+03, threshold=1.806e+03, percent-clipped=10.0 2023-06-27 09:22:25,325 INFO [train.py:996] (1/4) Epoch 10, batch 21300, loss[loss=0.229, simple_loss=0.3133, pruned_loss=0.07234, over 21820.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2732, pruned_loss=0.06466, over 4263860.43 frames. 
], batch size: 391, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:23:33,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-27 09:24:13,057 INFO [train.py:996] (1/4) Epoch 10, batch 21350, loss[loss=0.1958, simple_loss=0.2861, pruned_loss=0.05272, over 21768.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2769, pruned_loss=0.06501, over 4272290.39 frames. ], batch size: 298, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:24:19,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.68 vs. limit=15.0 2023-06-27 09:24:41,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1774872.0, ans=0.0 2023-06-27 09:24:43,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1774872.0, ans=0.04949747468305833 2023-06-27 09:25:06,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774932.0, ans=0.1 2023-06-27 09:25:13,731 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.99 vs. limit=10.0 2023-06-27 09:25:14,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1774932.0, ans=0.0 2023-06-27 09:25:39,166 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.868e+02 6.492e+02 8.824e+02 1.457e+03 2.432e+03, threshold=1.765e+03, percent-clipped=7.0 2023-06-27 09:25:41,547 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:25:53,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1775052.0, ans=0.1 2023-06-27 09:26:01,036 INFO [train.py:996] (1/4) Epoch 10, batch 21400, loss[loss=0.1855, simple_loss=0.2866, pruned_loss=0.04219, over 20982.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2805, pruned_loss=0.06511, over 4275781.28 frames. ], batch size: 607, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:26:41,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1775172.0, ans=0.0 2023-06-27 09:27:33,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-27 09:27:38,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1775352.0, ans=0.125 2023-06-27 09:27:47,922 INFO [train.py:996] (1/4) Epoch 10, batch 21450, loss[loss=0.2088, simple_loss=0.2887, pruned_loss=0.06442, over 21870.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2853, pruned_loss=0.06736, over 4283689.70 frames. 
], batch size: 371, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:27:51,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1775412.0, ans=0.1 2023-06-27 09:28:25,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1775472.0, ans=0.0 2023-06-27 09:29:12,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.429e+02 6.380e+02 8.632e+02 1.324e+03 3.087e+03, threshold=1.726e+03, percent-clipped=6.0 2023-06-27 09:29:39,667 INFO [train.py:996] (1/4) Epoch 10, batch 21500, loss[loss=0.2149, simple_loss=0.2832, pruned_loss=0.07325, over 15771.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2847, pruned_loss=0.06778, over 4271447.34 frames. ], batch size: 64, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:29:40,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-27 09:29:52,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1775712.0, ans=0.125 2023-06-27 09:30:02,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1775772.0, ans=0.125 2023-06-27 09:30:03,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1775772.0, ans=0.125 2023-06-27 09:30:23,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1775832.0, ans=0.1 2023-06-27 09:30:36,157 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=12.0 2023-06-27 09:31:25,379 INFO [train.py:996] (1/4) Epoch 10, batch 21550, loss[loss=0.1885, simple_loss=0.2506, pruned_loss=0.06323, over 21373.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2775, pruned_loss=0.06496, over 4277599.08 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:31:32,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1776012.0, ans=0.125 2023-06-27 09:31:44,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776012.0, ans=0.1 2023-06-27 09:31:44,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1776012.0, ans=0.1 2023-06-27 09:31:44,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1776012.0, ans=0.2 2023-06-27 09:32:39,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1776192.0, ans=0.125 2023-06-27 09:32:45,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.045e+02 5.690e+02 8.515e+02 1.276e+03 3.905e+03, threshold=1.703e+03, percent-clipped=13.0 2023-06-27 09:33:15,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1776252.0, ans=0.0 2023-06-27 09:33:20,038 INFO [train.py:996] (1/4) Epoch 10, batch 21600, loss[loss=0.2009, simple_loss=0.2885, pruned_loss=0.05661, over 21613.00 frames. 
], tot_loss[loss=0.202, simple_loss=0.2751, pruned_loss=0.06449, over 4273412.90 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:33:36,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1776372.0, ans=0.0 2023-06-27 09:33:54,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1776432.0, ans=0.035 2023-06-27 09:35:06,683 INFO [train.py:996] (1/4) Epoch 10, batch 21650, loss[loss=0.1899, simple_loss=0.2895, pruned_loss=0.04511, over 19922.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.278, pruned_loss=0.06257, over 4267338.20 frames. ], batch size: 703, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:35:25,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1776672.0, ans=0.125 2023-06-27 09:35:28,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1776672.0, ans=0.1 2023-06-27 09:35:54,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1776732.0, ans=0.125 2023-06-27 09:36:26,723 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.628e+02 5.833e+02 8.995e+02 1.569e+03 2.622e+03, threshold=1.799e+03, percent-clipped=22.0 2023-06-27 09:36:53,201 INFO [train.py:996] (1/4) Epoch 10, batch 21700, loss[loss=0.2032, simple_loss=0.2738, pruned_loss=0.06633, over 21956.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2766, pruned_loss=0.0606, over 4254691.70 frames. ], batch size: 113, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:37:08,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776972.0, ans=0.1 2023-06-27 09:37:45,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1777092.0, ans=0.05 2023-06-27 09:38:04,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1777152.0, ans=0.125 2023-06-27 09:38:29,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1777152.0, ans=0.0 2023-06-27 09:38:38,209 INFO [train.py:996] (1/4) Epoch 10, batch 21750, loss[loss=0.2377, simple_loss=0.2738, pruned_loss=0.1008, over 21428.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2732, pruned_loss=0.06107, over 4257197.38 frames. ], batch size: 511, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:39:14,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1777332.0, ans=0.0 2023-06-27 09:39:43,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777392.0, ans=0.1 2023-06-27 09:39:45,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777392.0, ans=0.1 2023-06-27 09:39:58,324 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.215e+02 5.988e+02 7.907e+02 1.038e+03 1.862e+03, threshold=1.581e+03, percent-clipped=2.0 2023-06-27 09:40:24,244 INFO [train.py:996] (1/4) Epoch 10, batch 21800, loss[loss=0.2135, simple_loss=0.3273, pruned_loss=0.04983, over 20853.00 frames. 
], tot_loss[loss=0.1982, simple_loss=0.2729, pruned_loss=0.06179, over 4258650.74 frames. ], batch size: 607, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:40:27,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.16 vs. limit=12.0 2023-06-27 09:41:07,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1777632.0, ans=0.125 2023-06-27 09:41:14,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-06-27 09:41:15,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1777692.0, ans=0.125 2023-06-27 09:42:10,465 INFO [train.py:996] (1/4) Epoch 10, batch 21850, loss[loss=0.1873, simple_loss=0.2549, pruned_loss=0.05983, over 21360.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2796, pruned_loss=0.06317, over 4263192.23 frames. ], batch size: 177, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:42:13,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1777812.0, ans=0.125 2023-06-27 09:42:33,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1777872.0, ans=0.125 2023-06-27 09:42:36,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1777872.0, ans=0.0 2023-06-27 09:42:43,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1777932.0, ans=0.5 2023-06-27 09:43:30,244 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.011e+02 6.825e+02 1.374e+03 1.718e+03 3.521e+03, threshold=2.747e+03, percent-clipped=39.0 2023-06-27 09:43:55,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.78 vs. limit=22.5 2023-06-27 09:43:55,433 INFO [train.py:996] (1/4) Epoch 10, batch 21900, loss[loss=0.174, simple_loss=0.2208, pruned_loss=0.06358, over 20770.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2791, pruned_loss=0.06447, over 4259732.38 frames. ], batch size: 609, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:45:40,215 INFO [train.py:996] (1/4) Epoch 10, batch 21950, loss[loss=0.1676, simple_loss=0.2365, pruned_loss=0.04938, over 21204.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2738, pruned_loss=0.0623, over 4249865.37 frames. ], batch size: 159, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:45:57,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1778472.0, ans=0.125 2023-06-27 09:46:06,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. 
limit=15.0 2023-06-27 09:46:09,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1778472.0, ans=0.0 2023-06-27 09:47:06,213 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.644e+02 5.395e+02 6.536e+02 9.589e+02 2.193e+03, threshold=1.307e+03, percent-clipped=0.0 2023-06-27 09:47:26,919 INFO [train.py:996] (1/4) Epoch 10, batch 22000, loss[loss=0.1855, simple_loss=0.2555, pruned_loss=0.05778, over 21374.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2698, pruned_loss=0.06049, over 4254596.24 frames. ], batch size: 160, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:47:36,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1778712.0, ans=0.1 2023-06-27 09:47:40,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1778712.0, ans=0.0 2023-06-27 09:49:10,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-27 09:49:18,755 INFO [train.py:996] (1/4) Epoch 10, batch 22050, loss[loss=0.2079, simple_loss=0.2904, pruned_loss=0.06273, over 21694.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2751, pruned_loss=0.06212, over 4253093.50 frames. ], batch size: 247, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:49:36,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1779072.0, ans=0.125 2023-06-27 09:50:51,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.283e+02 7.142e+02 9.080e+02 1.741e+03 3.538e+03, threshold=1.816e+03, percent-clipped=36.0 2023-06-27 09:51:05,034 INFO [train.py:996] (1/4) Epoch 10, batch 22100, loss[loss=0.2755, simple_loss=0.3532, pruned_loss=0.09889, over 21327.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2834, pruned_loss=0.06609, over 4261933.18 frames. ], batch size: 159, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:52:03,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1779432.0, ans=0.125 2023-06-27 09:52:51,246 INFO [train.py:996] (1/4) Epoch 10, batch 22150, loss[loss=0.2436, simple_loss=0.3026, pruned_loss=0.0923, over 21746.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2861, pruned_loss=0.06777, over 4272432.20 frames. ], batch size: 508, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:53:50,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=8.0 2023-06-27 09:54:25,387 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.248e+02 5.661e+02 7.475e+02 1.093e+03 2.487e+03, threshold=1.495e+03, percent-clipped=9.0 2023-06-27 09:54:39,176 INFO [train.py:996] (1/4) Epoch 10, batch 22200, loss[loss=0.2534, simple_loss=0.3423, pruned_loss=0.08222, over 21564.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.288, pruned_loss=0.06864, over 4276468.02 frames. 
], batch size: 471, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:55:13,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1779972.0, ans=0.2 2023-06-27 09:55:54,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1780092.0, ans=10.0 2023-06-27 09:56:22,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1780152.0, ans=0.125 2023-06-27 09:56:24,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1780152.0, ans=0.125 2023-06-27 09:56:27,616 INFO [train.py:996] (1/4) Epoch 10, batch 22250, loss[loss=0.2536, simple_loss=0.327, pruned_loss=0.09008, over 21584.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2946, pruned_loss=0.07036, over 4284524.47 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:57:11,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1780332.0, ans=0.125 2023-06-27 09:57:58,531 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.092e+02 1.077e+03 1.479e+03 2.486e+03, threshold=2.154e+03, percent-clipped=24.0 2023-06-27 09:58:06,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1780452.0, ans=0.125 2023-06-27 09:58:12,269 INFO [train.py:996] (1/4) Epoch 10, batch 22300, loss[loss=0.2014, simple_loss=0.2737, pruned_loss=0.06457, over 21916.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2956, pruned_loss=0.07166, over 4290791.42 frames. ], batch size: 283, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:59:00,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1780632.0, ans=0.2 2023-06-27 09:59:41,505 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:59:47,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0 2023-06-27 09:59:58,419 INFO [train.py:996] (1/4) Epoch 10, batch 22350, loss[loss=0.2223, simple_loss=0.2932, pruned_loss=0.07571, over 21885.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2941, pruned_loss=0.07214, over 4296596.30 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:00:07,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1780812.0, ans=0.125 2023-06-27 10:00:25,131 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-27 10:00:29,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1780872.0, ans=0.125 2023-06-27 10:01:21,011 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=12.0 2023-06-27 10:01:32,055 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.877e+02 5.424e+02 7.099e+02 9.642e+02 1.783e+03, threshold=1.420e+03, percent-clipped=0.0 2023-06-27 10:01:32,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1781052.0, ans=0.125 2023-06-27 10:01:45,475 INFO [train.py:996] (1/4) Epoch 10, batch 22400, loss[loss=0.2001, simple_loss=0.2895, pruned_loss=0.0553, over 21745.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2923, pruned_loss=0.06988, over 4298067.04 frames. ], batch size: 351, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 10:01:51,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1781112.0, ans=0.0 2023-06-27 10:03:11,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1781292.0, ans=0.125 2023-06-27 10:03:16,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1781352.0, ans=0.125 2023-06-27 10:03:20,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1781352.0, ans=0.125 2023-06-27 10:03:31,233 INFO [train.py:996] (1/4) Epoch 10, batch 22450, loss[loss=0.1826, simple_loss=0.254, pruned_loss=0.05555, over 21725.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.286, pruned_loss=0.06876, over 4288090.01 frames. ], batch size: 283, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:03:48,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-27 10:03:52,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-27 10:05:03,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1781652.0, ans=0.0 2023-06-27 10:05:07,722 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.016e+02 6.073e+02 8.479e+02 1.179e+03 3.261e+03, threshold=1.696e+03, percent-clipped=18.0 2023-06-27 10:05:18,387 INFO [train.py:996] (1/4) Epoch 10, batch 22500, loss[loss=0.2089, simple_loss=0.3096, pruned_loss=0.05408, over 20808.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.282, pruned_loss=0.06815, over 4276640.72 frames. ], batch size: 607, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:05:36,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1781712.0, ans=0.0 2023-06-27 10:06:05,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1781772.0, ans=0.125 2023-06-27 10:06:07,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1781832.0, ans=0.125 2023-06-27 10:06:24,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1781892.0, ans=0.0 2023-06-27 10:07:06,836 INFO [train.py:996] (1/4) Epoch 10, batch 22550, loss[loss=0.1997, simple_loss=0.2741, pruned_loss=0.06265, over 21820.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2853, pruned_loss=0.06824, over 4277823.79 frames. 
], batch size: 247, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:07:33,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1782072.0, ans=0.125 2023-06-27 10:07:54,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1782132.0, ans=0.125 2023-06-27 10:08:11,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1782192.0, ans=0.1 2023-06-27 10:08:32,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1782252.0, ans=0.2 2023-06-27 10:08:32,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0 2023-06-27 10:08:41,500 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.419e+02 6.178e+02 1.242e+03 1.950e+03 4.739e+03, threshold=2.485e+03, percent-clipped=29.0 2023-06-27 10:08:48,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.88 vs. limit=6.0 2023-06-27 10:08:51,916 INFO [train.py:996] (1/4) Epoch 10, batch 22600, loss[loss=0.1911, simple_loss=0.2641, pruned_loss=0.05899, over 21634.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2893, pruned_loss=0.06837, over 4275505.52 frames. ], batch size: 230, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:08:55,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1782312.0, ans=0.0 2023-06-27 10:09:46,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1782432.0, ans=0.2 2023-06-27 10:10:00,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1782492.0, ans=0.2 2023-06-27 10:10:37,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1782612.0, ans=0.125 2023-06-27 10:10:38,552 INFO [train.py:996] (1/4) Epoch 10, batch 22650, loss[loss=0.1966, simple_loss=0.2581, pruned_loss=0.06757, over 21542.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2852, pruned_loss=0.06803, over 4272650.92 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:11:02,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1782612.0, ans=0.5 2023-06-27 10:11:23,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1782672.0, ans=0.125 2023-06-27 10:11:35,790 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 10:11:36,342 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-27 10:12:16,225 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.186e+02 6.070e+02 1.000e+03 1.313e+03 3.118e+03, threshold=2.001e+03, percent-clipped=3.0 2023-06-27 10:12:22,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. 
limit=12.0 2023-06-27 10:12:26,397 INFO [train.py:996] (1/4) Epoch 10, batch 22700, loss[loss=0.1895, simple_loss=0.2601, pruned_loss=0.05943, over 21784.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2801, pruned_loss=0.06712, over 4271936.43 frames. ], batch size: 351, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:12:37,682 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-27 10:13:34,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1783092.0, ans=0.09899494936611666 2023-06-27 10:13:44,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1783092.0, ans=0.1 2023-06-27 10:14:12,783 INFO [train.py:996] (1/4) Epoch 10, batch 22750, loss[loss=0.2247, simple_loss=0.2963, pruned_loss=0.07654, over 21773.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2808, pruned_loss=0.06883, over 4278626.16 frames. ], batch size: 332, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:14:13,451 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 10:14:32,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1783212.0, ans=0.1 2023-06-27 10:15:17,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1783332.0, ans=0.125 2023-06-27 10:15:48,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.913e+02 6.203e+02 1.029e+03 1.531e+03 3.011e+03, threshold=2.057e+03, percent-clipped=6.0 2023-06-27 10:15:51,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5 2023-06-27 10:16:04,100 INFO [train.py:996] (1/4) Epoch 10, batch 22800, loss[loss=0.1988, simple_loss=0.2706, pruned_loss=0.0635, over 21852.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2858, pruned_loss=0.0708, over 4280538.51 frames. ], batch size: 247, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:16:31,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1783572.0, ans=0.125 2023-06-27 10:17:23,050 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 10:17:39,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1783752.0, ans=0.0 2023-06-27 10:17:44,763 INFO [train.py:996] (1/4) Epoch 10, batch 22850, loss[loss=0.1897, simple_loss=0.2598, pruned_loss=0.05975, over 21760.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2826, pruned_loss=0.06996, over 4265019.14 frames. ], batch size: 112, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:18:09,988 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. 
limit=15.0 2023-06-27 10:18:24,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1783872.0, ans=0.125 2023-06-27 10:18:31,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1783872.0, ans=0.1 2023-06-27 10:18:31,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-27 10:18:38,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.69 vs. limit=10.0 2023-06-27 10:18:40,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1783932.0, ans=0.125 2023-06-27 10:18:51,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-27 10:19:23,323 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.547e+02 9.815e+02 1.470e+03 2.221e+03 4.175e+03, threshold=2.939e+03, percent-clipped=31.0 2023-06-27 10:19:44,585 INFO [train.py:996] (1/4) Epoch 10, batch 22900, loss[loss=0.2207, simple_loss=0.3276, pruned_loss=0.05686, over 21704.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2821, pruned_loss=0.06888, over 4257986.71 frames. ], batch size: 298, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:21:31,355 INFO [train.py:996] (1/4) Epoch 10, batch 22950, loss[loss=0.2107, simple_loss=0.3125, pruned_loss=0.05443, over 21208.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2945, pruned_loss=0.06757, over 4254261.11 frames. ], batch size: 159, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:21:33,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1784412.0, ans=0.2 2023-06-27 10:21:35,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-27 10:22:12,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-27 10:22:49,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=22.5 2023-06-27 10:22:51,598 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.863e+02 5.875e+02 8.793e+02 1.271e+03 3.173e+03, threshold=1.759e+03, percent-clipped=4.0 2023-06-27 10:23:05,431 INFO [train.py:996] (1/4) Epoch 10, batch 23000, loss[loss=0.2192, simple_loss=0.3633, pruned_loss=0.03752, over 20832.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.295, pruned_loss=0.06597, over 4259299.40 frames. 
], batch size: 608, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:23:06,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1784712.0, ans=0.09899494936611666 2023-06-27 10:23:19,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1784712.0, ans=0.2 2023-06-27 10:24:42,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1784952.0, ans=10.0 2023-06-27 10:24:46,696 INFO [train.py:996] (1/4) Epoch 10, batch 23050, loss[loss=0.1961, simple_loss=0.2666, pruned_loss=0.06286, over 21142.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2969, pruned_loss=0.06787, over 4260887.50 frames. ], batch size: 608, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:24:53,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1785012.0, ans=0.125 2023-06-27 10:26:17,102 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 5.488e+02 7.273e+02 1.121e+03 2.826e+03, threshold=1.455e+03, percent-clipped=6.0 2023-06-27 10:26:25,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1785312.0, ans=0.1 2023-06-27 10:26:26,890 INFO [train.py:996] (1/4) Epoch 10, batch 23100, loss[loss=0.1943, simple_loss=0.2602, pruned_loss=0.0642, over 21610.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2927, pruned_loss=0.06823, over 4265821.45 frames. ], batch size: 415, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:26:34,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-06-27 10:26:59,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1785432.0, ans=0.125 2023-06-27 10:27:12,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1785432.0, ans=0.125 2023-06-27 10:27:43,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1785492.0, ans=0.0 2023-06-27 10:28:02,059 INFO [train.py:996] (1/4) Epoch 10, batch 23150, loss[loss=0.2383, simple_loss=0.3004, pruned_loss=0.08808, over 21713.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2866, pruned_loss=0.0676, over 4267221.98 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:28:16,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1785672.0, ans=0.125 2023-06-27 10:28:31,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.19 vs. limit=6.0 2023-06-27 10:28:31,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.84 vs. 
limit=15.0 2023-06-27 10:29:06,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1785792.0, ans=0.125 2023-06-27 10:29:25,883 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.152e+02 5.952e+02 7.532e+02 1.121e+03 2.900e+03, threshold=1.506e+03, percent-clipped=14.0 2023-06-27 10:29:35,465 INFO [train.py:996] (1/4) Epoch 10, batch 23200, loss[loss=0.2011, simple_loss=0.2809, pruned_loss=0.06064, over 21473.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2864, pruned_loss=0.0685, over 4279315.53 frames. ], batch size: 131, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 10:29:57,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1785972.0, ans=0.125 2023-06-27 10:30:09,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1786032.0, ans=0.125 2023-06-27 10:30:30,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1786092.0, ans=0.0 2023-06-27 10:30:33,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1786092.0, ans=0.2 2023-06-27 10:30:46,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1786092.0, ans=0.125 2023-06-27 10:30:56,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1786152.0, ans=0.125 2023-06-27 10:30:58,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1786152.0, ans=0.1 2023-06-27 10:31:08,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1786152.0, ans=0.1 2023-06-27 10:31:10,867 INFO [train.py:996] (1/4) Epoch 10, batch 23250, loss[loss=0.183, simple_loss=0.2473, pruned_loss=0.05935, over 21218.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2864, pruned_loss=0.0692, over 4283611.48 frames. ], batch size: 608, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:31:39,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1786272.0, ans=0.125 2023-06-27 10:32:22,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1786392.0, ans=0.125 2023-06-27 10:32:44,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.481e+02 7.308e+02 1.025e+03 1.554e+03 3.146e+03, threshold=2.050e+03, percent-clipped=26.0 2023-06-27 10:32:52,914 INFO [train.py:996] (1/4) Epoch 10, batch 23300, loss[loss=0.237, simple_loss=0.3281, pruned_loss=0.07295, over 21326.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2935, pruned_loss=0.06976, over 4287105.56 frames. 
], batch size: 144, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:33:05,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1786512.0, ans=0.125 2023-06-27 10:33:34,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1786632.0, ans=10.0 2023-06-27 10:33:39,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1786632.0, ans=0.125 2023-06-27 10:33:40,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1786632.0, ans=0.125 2023-06-27 10:33:52,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-27 10:34:09,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1786692.0, ans=0.125 2023-06-27 10:34:20,575 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-27 10:34:22,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1786752.0, ans=0.125 2023-06-27 10:34:27,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1786752.0, ans=0.1 2023-06-27 10:34:33,876 INFO [train.py:996] (1/4) Epoch 10, batch 23350, loss[loss=0.1274, simple_loss=0.1936, pruned_loss=0.0306, over 17076.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2961, pruned_loss=0.06899, over 4271144.38 frames. ], batch size: 63, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:34:41,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-27 10:35:45,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-27 10:35:56,946 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-27 10:36:01,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1787052.0, ans=0.125 2023-06-27 10:36:05,518 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.446e+02 7.072e+02 1.049e+03 1.355e+03 2.858e+03, threshold=2.098e+03, percent-clipped=8.0 2023-06-27 10:36:13,601 INFO [train.py:996] (1/4) Epoch 10, batch 23400, loss[loss=0.2145, simple_loss=0.2792, pruned_loss=0.07495, over 21439.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2911, pruned_loss=0.06595, over 4273771.53 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:36:38,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.20 vs. 
limit=15.0 2023-06-27 10:36:48,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1787172.0, ans=0.125 2023-06-27 10:37:17,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1787232.0, ans=0.2 2023-06-27 10:37:54,848 INFO [train.py:996] (1/4) Epoch 10, batch 23450, loss[loss=0.2826, simple_loss=0.3564, pruned_loss=0.1044, over 21804.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2926, pruned_loss=0.06895, over 4271824.22 frames. ], batch size: 124, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:38:19,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1787472.0, ans=0.125 2023-06-27 10:38:30,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5 2023-06-27 10:38:47,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1787532.0, ans=0.0 2023-06-27 10:39:02,424 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=12.0 2023-06-27 10:39:25,536 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.955e+02 6.626e+02 1.004e+03 1.261e+03 2.377e+03, threshold=2.009e+03, percent-clipped=2.0 2023-06-27 10:39:38,048 INFO [train.py:996] (1/4) Epoch 10, batch 23500, loss[loss=0.2069, simple_loss=0.276, pruned_loss=0.06893, over 21390.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2928, pruned_loss=0.07017, over 4275834.68 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:39:56,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1787712.0, ans=0.0 2023-06-27 10:40:15,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1787772.0, ans=0.125 2023-06-27 10:40:45,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1787892.0, ans=0.125 2023-06-27 10:40:52,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=22.5 2023-06-27 10:41:17,179 INFO [train.py:996] (1/4) Epoch 10, batch 23550, loss[loss=0.1948, simple_loss=0.2603, pruned_loss=0.06465, over 21742.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2886, pruned_loss=0.06926, over 4260287.39 frames. ], batch size: 118, lr: 2.89e-03, grad_scale: 8.0 2023-06-27 10:41:49,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-27 10:41:57,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1788072.0, ans=0.025 2023-06-27 10:41:57,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1788072.0, ans=0.125 2023-06-27 10:42:29,131 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.59 vs. 
limit=15.0 2023-06-27 10:42:47,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.468e+02 6.896e+02 9.643e+02 1.434e+03 2.789e+03, threshold=1.929e+03, percent-clipped=7.0 2023-06-27 10:42:58,649 INFO [train.py:996] (1/4) Epoch 10, batch 23600, loss[loss=0.21, simple_loss=0.3, pruned_loss=0.06001, over 16930.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2871, pruned_loss=0.06916, over 4253922.65 frames. ], batch size: 60, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:42:59,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788312.0, ans=0.1 2023-06-27 10:43:23,648 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-27 10:43:44,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1788372.0, ans=0.07 2023-06-27 10:43:54,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-06-27 10:43:54,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.18 vs. limit=15.0 2023-06-27 10:44:19,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=1788492.0, ans=12.0 2023-06-27 10:44:47,507 INFO [train.py:996] (1/4) Epoch 10, batch 23650, loss[loss=0.2377, simple_loss=0.3168, pruned_loss=0.07932, over 21272.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2861, pruned_loss=0.06747, over 4257586.71 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:45:35,407 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-27 10:45:52,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1788792.0, ans=0.125 2023-06-27 10:46:00,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1788792.0, ans=0.125 2023-06-27 10:46:27,253 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.326e+02 5.698e+02 8.154e+02 1.096e+03 2.339e+03, threshold=1.631e+03, percent-clipped=3.0 2023-06-27 10:46:38,534 INFO [train.py:996] (1/4) Epoch 10, batch 23700, loss[loss=0.2001, simple_loss=0.2775, pruned_loss=0.06137, over 21180.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.29, pruned_loss=0.06691, over 4260753.51 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:46:43,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1788912.0, ans=0.125 2023-06-27 10:46:47,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1788912.0, ans=0.0 2023-06-27 10:48:13,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1789152.0, ans=0.125 2023-06-27 10:48:19,723 INFO [train.py:996] (1/4) Epoch 10, batch 23750, loss[loss=0.2422, simple_loss=0.3215, pruned_loss=0.08145, over 21200.00 frames. 
], tot_loss[loss=0.2139, simple_loss=0.2924, pruned_loss=0.06771, over 4266147.08 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:48:20,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1789212.0, ans=0.125 2023-06-27 10:48:33,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789212.0, ans=0.1 2023-06-27 10:48:34,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-27 10:48:52,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-27 10:49:23,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1789392.0, ans=0.125 2023-06-27 10:49:55,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.670e+02 6.081e+02 7.830e+02 1.141e+03 2.559e+03, threshold=1.566e+03, percent-clipped=8.0 2023-06-27 10:50:02,088 INFO [train.py:996] (1/4) Epoch 10, batch 23800, loss[loss=0.1718, simple_loss=0.2532, pruned_loss=0.04514, over 20770.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2901, pruned_loss=0.06566, over 4257596.56 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:50:02,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1789512.0, ans=0.95 2023-06-27 10:51:05,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1789632.0, ans=0.2 2023-06-27 10:51:15,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1789692.0, ans=0.0 2023-06-27 10:51:45,229 INFO [train.py:996] (1/4) Epoch 10, batch 23850, loss[loss=0.232, simple_loss=0.324, pruned_loss=0.07004, over 20717.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3007, pruned_loss=0.06828, over 4260831.24 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:51:49,613 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-27 10:53:12,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1790052.0, ans=0.07 2023-06-27 10:53:17,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-27 10:53:18,318 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 6.860e+02 1.142e+03 1.790e+03 3.579e+03, threshold=2.285e+03, percent-clipped=29.0 2023-06-27 10:53:24,735 INFO [train.py:996] (1/4) Epoch 10, batch 23900, loss[loss=0.2033, simple_loss=0.2809, pruned_loss=0.06286, over 21638.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3059, pruned_loss=0.07018, over 4265722.88 frames. 
], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:54:26,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1790232.0, ans=0.0 2023-06-27 10:54:36,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1790292.0, ans=0.2 2023-06-27 10:54:43,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5 2023-06-27 10:55:05,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1790412.0, ans=0.125 2023-06-27 10:55:05,903 INFO [train.py:996] (1/4) Epoch 10, batch 23950, loss[loss=0.1999, simple_loss=0.2802, pruned_loss=0.05978, over 20732.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3001, pruned_loss=0.06992, over 4271567.30 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:56:14,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1790592.0, ans=0.1 2023-06-27 10:56:40,581 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.159e+02 7.309e+02 9.584e+02 1.406e+03 2.703e+03, threshold=1.917e+03, percent-clipped=3.0 2023-06-27 10:56:47,122 INFO [train.py:996] (1/4) Epoch 10, batch 24000, loss[loss=0.2297, simple_loss=0.3017, pruned_loss=0.07888, over 21400.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3006, pruned_loss=0.07193, over 4265807.30 frames. ], batch size: 549, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 10:56:47,122 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 10:57:07,137 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2621, simple_loss=0.3549, pruned_loss=0.08461, over 1796401.00 frames. 2023-06-27 10:57:07,139 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 10:58:45,459 INFO [train.py:996] (1/4) Epoch 10, batch 24050, loss[loss=0.1897, simple_loss=0.2847, pruned_loss=0.04739, over 21622.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3024, pruned_loss=0.07244, over 4267599.97 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:59:05,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1791072.0, ans=0.015 2023-06-27 10:59:32,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791132.0, ans=0.1 2023-06-27 10:59:38,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1791192.0, ans=0.125 2023-06-27 10:59:39,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.16 vs. limit=22.5 2023-06-27 11:00:21,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.514e+02 5.749e+02 8.023e+02 1.325e+03 2.806e+03, threshold=1.605e+03, percent-clipped=11.0 2023-06-27 11:00:25,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1791312.0, ans=0.125 2023-06-27 11:00:32,046 INFO [train.py:996] (1/4) Epoch 10, batch 24100, loss[loss=0.2111, simple_loss=0.2901, pruned_loss=0.06608, over 21249.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3024, pruned_loss=0.07105, over 4270556.02 frames. 
], batch size: 159, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:00:36,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=22.5 2023-06-27 11:00:50,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1791372.0, ans=0.1 2023-06-27 11:01:04,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=12.0 2023-06-27 11:01:13,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1791432.0, ans=0.0 2023-06-27 11:01:59,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1791552.0, ans=0.04949747468305833 2023-06-27 11:02:13,418 INFO [train.py:996] (1/4) Epoch 10, batch 24150, loss[loss=0.2443, simple_loss=0.3015, pruned_loss=0.09352, over 21811.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3018, pruned_loss=0.07234, over 4274674.81 frames. ], batch size: 441, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:02:40,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-27 11:02:54,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1791732.0, ans=0.125 2023-06-27 11:02:58,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1791732.0, ans=0.125 2023-06-27 11:03:08,387 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-27 11:03:31,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1791852.0, ans=0.1 2023-06-27 11:03:33,203 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:03:45,388 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 6.384e+02 9.147e+02 1.297e+03 2.622e+03, threshold=1.829e+03, percent-clipped=12.0 2023-06-27 11:03:50,486 INFO [train.py:996] (1/4) Epoch 10, batch 24200, loss[loss=0.2581, simple_loss=0.3525, pruned_loss=0.08189, over 21180.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3042, pruned_loss=0.0734, over 4280225.74 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:04:38,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1792032.0, ans=0.2 2023-06-27 11:05:33,111 INFO [train.py:996] (1/4) Epoch 10, batch 24250, loss[loss=0.1635, simple_loss=0.2495, pruned_loss=0.03876, over 21873.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3008, pruned_loss=0.06824, over 4273951.78 frames. 
], batch size: 118, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:05:40,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1792212.0, ans=0.2 2023-06-27 11:05:45,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792212.0, ans=0.1 2023-06-27 11:05:59,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1792272.0, ans=0.2 2023-06-27 11:06:08,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-27 11:06:47,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1792392.0, ans=0.125 2023-06-27 11:06:54,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.50 vs. limit=6.0 2023-06-27 11:07:09,242 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 5.833e+02 9.026e+02 1.321e+03 2.992e+03, threshold=1.805e+03, percent-clipped=10.0 2023-06-27 11:07:13,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1792512.0, ans=0.0 2023-06-27 11:07:14,012 INFO [train.py:996] (1/4) Epoch 10, batch 24300, loss[loss=0.2032, simple_loss=0.2638, pruned_loss=0.07135, over 20236.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2946, pruned_loss=0.06322, over 4269646.16 frames. ], batch size: 702, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:07:34,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1792572.0, ans=0.125 2023-06-27 11:07:36,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792572.0, ans=0.1 2023-06-27 11:08:14,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1792632.0, ans=0.125 2023-06-27 11:08:30,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1792692.0, ans=0.0 2023-06-27 11:08:44,333 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-27 11:08:55,409 INFO [train.py:996] (1/4) Epoch 10, batch 24350, loss[loss=0.2306, simple_loss=0.3084, pruned_loss=0.07638, over 21437.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2885, pruned_loss=0.06185, over 4276944.02 frames. 
], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:09:44,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1792932.0, ans=0.0 2023-06-27 11:09:54,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1792932.0, ans=0.125 2023-06-27 11:10:07,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1792992.0, ans=0.0 2023-06-27 11:10:21,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1793052.0, ans=0.0 2023-06-27 11:10:27,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.762e+02 6.342e+02 9.950e+02 1.336e+03 3.105e+03, threshold=1.990e+03, percent-clipped=13.0 2023-06-27 11:10:32,375 INFO [train.py:996] (1/4) Epoch 10, batch 24400, loss[loss=0.2117, simple_loss=0.295, pruned_loss=0.06417, over 21680.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2921, pruned_loss=0.06494, over 4279017.81 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:10:33,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793112.0, ans=0.1 2023-06-27 11:11:02,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1793172.0, ans=0.125 2023-06-27 11:11:07,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1793172.0, ans=0.2 2023-06-27 11:11:44,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1793292.0, ans=0.0 2023-06-27 11:12:04,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1793352.0, ans=0.0 2023-06-27 11:12:07,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-06-27 11:12:14,395 INFO [train.py:996] (1/4) Epoch 10, batch 24450, loss[loss=0.1902, simple_loss=0.2688, pruned_loss=0.05578, over 21382.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2951, pruned_loss=0.06689, over 4282422.01 frames. ], batch size: 131, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:12:37,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.07 vs. 
limit=12.0 2023-06-27 11:12:49,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1793472.0, ans=0.1 2023-06-27 11:13:01,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1793472.0, ans=0.2 2023-06-27 11:13:12,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1793532.0, ans=0.125 2023-06-27 11:13:27,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1793592.0, ans=0.125 2023-06-27 11:13:28,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1793592.0, ans=0.125 2023-06-27 11:13:29,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-27 11:13:38,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1793652.0, ans=0.2 2023-06-27 11:13:48,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1793652.0, ans=0.125 2023-06-27 11:13:50,877 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.345e+02 6.636e+02 9.241e+02 1.233e+03 3.193e+03, threshold=1.848e+03, percent-clipped=3.0 2023-06-27 11:13:54,212 INFO [train.py:996] (1/4) Epoch 10, batch 24500, loss[loss=0.2033, simple_loss=0.2808, pruned_loss=0.06287, over 21760.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2978, pruned_loss=0.06813, over 4288192.38 frames. ], batch size: 247, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:14:44,619 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:14:49,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1793832.0, ans=0.0 2023-06-27 11:15:04,685 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:15:21,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1793952.0, ans=0.0 2023-06-27 11:15:24,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1793952.0, ans=0.125 2023-06-27 11:15:40,100 INFO [train.py:996] (1/4) Epoch 10, batch 24550, loss[loss=0.2289, simple_loss=0.3108, pruned_loss=0.07354, over 21810.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2999, pruned_loss=0.07012, over 4291313.12 frames. 
], batch size: 282, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:15:54,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1794012.0, ans=0.0 2023-06-27 11:15:57,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1794012.0, ans=0.07 2023-06-27 11:16:25,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1794132.0, ans=0.125 2023-06-27 11:16:58,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1794252.0, ans=0.0 2023-06-27 11:17:16,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 6.446e+02 9.214e+02 1.322e+03 3.260e+03, threshold=1.843e+03, percent-clipped=13.0 2023-06-27 11:17:19,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.40 vs. limit=22.5 2023-06-27 11:17:19,812 INFO [train.py:996] (1/4) Epoch 10, batch 24600, loss[loss=0.1857, simple_loss=0.2488, pruned_loss=0.06127, over 21178.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2977, pruned_loss=0.07009, over 4275974.23 frames. ], batch size: 159, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:17:44,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1794372.0, ans=15.0 2023-06-27 11:18:09,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.01 vs. limit=8.0 2023-06-27 11:18:38,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-27 11:18:47,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1794552.0, ans=0.125 2023-06-27 11:19:06,384 INFO [train.py:996] (1/4) Epoch 10, batch 24650, loss[loss=0.1981, simple_loss=0.2592, pruned_loss=0.06851, over 21874.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2926, pruned_loss=0.06922, over 4263617.35 frames. ], batch size: 373, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:19:38,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1794672.0, ans=0.0 2023-06-27 11:20:02,233 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:20:38,910 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.116e+02 6.411e+02 8.563e+02 1.154e+03 3.780e+03, threshold=1.713e+03, percent-clipped=12.0 2023-06-27 11:20:42,281 INFO [train.py:996] (1/4) Epoch 10, batch 24700, loss[loss=0.1984, simple_loss=0.2681, pruned_loss=0.06441, over 21729.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2893, pruned_loss=0.06783, over 4260416.04 frames. 
], batch size: 124, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:21:16,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1794972.0, ans=0.0 2023-06-27 11:21:53,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1795152.0, ans=0.1 2023-06-27 11:22:16,081 INFO [train.py:996] (1/4) Epoch 10, batch 24750, loss[loss=0.1988, simple_loss=0.2729, pruned_loss=0.06232, over 21809.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2819, pruned_loss=0.06546, over 4260010.20 frames. ], batch size: 98, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:22:22,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1795212.0, ans=0.125 2023-06-27 11:22:42,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1795272.0, ans=0.125 2023-06-27 11:22:46,481 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-27 11:23:05,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1795332.0, ans=0.125 2023-06-27 11:23:14,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1795392.0, ans=0.125 2023-06-27 11:23:33,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.93 vs. limit=22.5 2023-06-27 11:23:36,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=22.5 2023-06-27 11:23:38,632 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.117e+02 6.853e+02 9.586e+02 1.478e+03 3.032e+03, threshold=1.917e+03, percent-clipped=13.0 2023-06-27 11:23:46,724 INFO [train.py:996] (1/4) Epoch 10, batch 24800, loss[loss=0.1923, simple_loss=0.2631, pruned_loss=0.06076, over 21864.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2761, pruned_loss=0.06517, over 4252267.18 frames. ], batch size: 316, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:24:21,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1795572.0, ans=0.1 2023-06-27 11:24:57,607 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-27 11:25:08,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1795752.0, ans=0.125 2023-06-27 11:25:29,261 INFO [train.py:996] (1/4) Epoch 10, batch 24850, loss[loss=0.2719, simple_loss=0.3451, pruned_loss=0.09937, over 21524.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2769, pruned_loss=0.06651, over 4256575.25 frames. 
], batch size: 471, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:25:49,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1795872.0, ans=0.04949747468305833 2023-06-27 11:26:55,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-27 11:26:59,061 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.781e+02 6.964e+02 9.660e+02 1.513e+03 3.423e+03, threshold=1.932e+03, percent-clipped=14.0 2023-06-27 11:27:00,592 INFO [train.py:996] (1/4) Epoch 10, batch 24900, loss[loss=0.2676, simple_loss=0.336, pruned_loss=0.09965, over 21197.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2791, pruned_loss=0.06693, over 4257661.24 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:27:53,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1796292.0, ans=0.04949747468305833 2023-06-27 11:28:37,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1796352.0, ans=0.125 2023-06-27 11:28:40,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1796412.0, ans=0.125 2023-06-27 11:28:41,466 INFO [train.py:996] (1/4) Epoch 10, batch 24950, loss[loss=0.227, simple_loss=0.3362, pruned_loss=0.05894, over 17298.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2868, pruned_loss=0.07034, over 4260604.35 frames. ], batch size: 60, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:29:03,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1796472.0, ans=0.125 2023-06-27 11:29:18,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1796532.0, ans=0.035 2023-06-27 11:30:04,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1796652.0, ans=0.125 2023-06-27 11:30:19,510 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.654e+02 6.926e+02 9.542e+02 1.348e+03 3.788e+03, threshold=1.908e+03, percent-clipped=7.0 2023-06-27 11:30:20,990 INFO [train.py:996] (1/4) Epoch 10, batch 25000, loss[loss=0.204, simple_loss=0.2743, pruned_loss=0.06684, over 21531.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2948, pruned_loss=0.07216, over 4270631.14 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:30:54,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1796772.0, ans=0.1 2023-06-27 11:30:55,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1796772.0, ans=0.2 2023-06-27 11:32:11,880 INFO [train.py:996] (1/4) Epoch 10, batch 25050, loss[loss=0.2016, simple_loss=0.2705, pruned_loss=0.06631, over 21808.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2884, pruned_loss=0.07103, over 4259753.83 frames. 
], batch size: 352, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:32:52,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1797132.0, ans=0.1 2023-06-27 11:32:54,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-27 11:33:30,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.38 vs. limit=15.0 2023-06-27 11:33:50,161 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.906e+02 5.494e+02 7.889e+02 1.087e+03 2.340e+03, threshold=1.578e+03, percent-clipped=4.0 2023-06-27 11:33:50,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1797312.0, ans=0.0 2023-06-27 11:33:51,533 INFO [train.py:996] (1/4) Epoch 10, batch 25100, loss[loss=0.1993, simple_loss=0.2949, pruned_loss=0.05188, over 21737.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2832, pruned_loss=0.06923, over 4247577.99 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:33:58,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1797312.0, ans=0.1 2023-06-27 11:34:13,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1797372.0, ans=0.1 2023-06-27 11:34:39,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1797432.0, ans=0.0 2023-06-27 11:35:26,550 INFO [train.py:996] (1/4) Epoch 10, batch 25150, loss[loss=0.1993, simple_loss=0.2886, pruned_loss=0.05502, over 21866.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2864, pruned_loss=0.06776, over 4245035.50 frames. ], batch size: 371, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:35:47,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1797672.0, ans=0.0 2023-06-27 11:37:04,859 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.938e+02 6.755e+02 1.265e+03 1.654e+03 3.292e+03, threshold=2.530e+03, percent-clipped=31.0 2023-06-27 11:37:06,428 INFO [train.py:996] (1/4) Epoch 10, batch 25200, loss[loss=0.1813, simple_loss=0.2703, pruned_loss=0.04616, over 21421.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2851, pruned_loss=0.06522, over 4250734.98 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:37:23,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. 
limit=15.0 2023-06-27 11:37:47,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1798032.0, ans=0.1 2023-06-27 11:37:57,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1798032.0, ans=0.125 2023-06-27 11:38:11,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1798092.0, ans=0.0 2023-06-27 11:38:45,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1798212.0, ans=0.125 2023-06-27 11:38:46,162 INFO [train.py:996] (1/4) Epoch 10, batch 25250, loss[loss=0.1963, simple_loss=0.2646, pruned_loss=0.064, over 21262.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2833, pruned_loss=0.0641, over 4261124.22 frames. ], batch size: 160, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:39:21,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1798272.0, ans=0.2 2023-06-27 11:39:39,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1798332.0, ans=0.0 2023-06-27 11:39:57,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=12.0 2023-06-27 11:40:32,591 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.222e+02 7.212e+02 1.026e+03 1.530e+03 2.488e+03, threshold=2.053e+03, percent-clipped=0.0 2023-06-27 11:40:32,622 INFO [train.py:996] (1/4) Epoch 10, batch 25300, loss[loss=0.2182, simple_loss=0.2954, pruned_loss=0.07047, over 21737.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2813, pruned_loss=0.06359, over 4255645.26 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:41:50,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.85 vs. limit=22.5 2023-06-27 11:42:13,848 INFO [train.py:996] (1/4) Epoch 10, batch 25350, loss[loss=0.1904, simple_loss=0.2758, pruned_loss=0.05254, over 21780.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2826, pruned_loss=0.06233, over 4257587.46 frames. ], batch size: 371, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:42:24,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1798812.0, ans=0.125 2023-06-27 11:42:58,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-27 11:43:53,122 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.520e+02 8.858e+02 1.308e+03 2.699e+03, threshold=1.772e+03, percent-clipped=4.0 2023-06-27 11:43:53,153 INFO [train.py:996] (1/4) Epoch 10, batch 25400, loss[loss=0.2042, simple_loss=0.2836, pruned_loss=0.06244, over 21789.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2794, pruned_loss=0.06159, over 4256378.20 frames. 
], batch size: 118, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:43:56,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1799112.0, ans=0.0 2023-06-27 11:44:10,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1799172.0, ans=0.0 2023-06-27 11:44:21,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1799172.0, ans=0.125 2023-06-27 11:44:29,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1799232.0, ans=0.125 2023-06-27 11:44:45,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1799232.0, ans=0.125 2023-06-27 11:45:10,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1799292.0, ans=0.125 2023-06-27 11:45:26,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1799352.0, ans=0.0 2023-06-27 11:45:34,082 INFO [train.py:996] (1/4) Epoch 10, batch 25450, loss[loss=0.1896, simple_loss=0.2882, pruned_loss=0.04545, over 21808.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2789, pruned_loss=0.06227, over 4258283.62 frames. ], batch size: 333, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:46:06,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1799472.0, ans=0.125 2023-06-27 11:46:58,680 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:47:16,317 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.039e+02 8.121e+02 1.135e+03 2.521e+03, threshold=1.624e+03, percent-clipped=2.0 2023-06-27 11:47:16,348 INFO [train.py:996] (1/4) Epoch 10, batch 25500, loss[loss=0.3096, simple_loss=0.3736, pruned_loss=0.1229, over 21360.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2804, pruned_loss=0.06077, over 4250592.79 frames. ], batch size: 507, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:48:18,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.81 vs. limit=15.0 2023-06-27 11:48:42,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1799952.0, ans=0.125 2023-06-27 11:48:58,447 INFO [train.py:996] (1/4) Epoch 10, batch 25550, loss[loss=0.209, simple_loss=0.3131, pruned_loss=0.05243, over 21787.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2874, pruned_loss=0.06103, over 4256411.33 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:49:03,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-27 11:49:32,082 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. 
limit=10.0 2023-06-27 11:50:07,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1800192.0, ans=0.035 2023-06-27 11:50:09,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1800192.0, ans=0.0 2023-06-27 11:50:39,237 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.324e+02 5.995e+02 1.017e+03 1.623e+03 5.096e+03, threshold=2.035e+03, percent-clipped=24.0 2023-06-27 11:50:39,268 INFO [train.py:996] (1/4) Epoch 10, batch 25600, loss[loss=0.2433, simple_loss=0.3391, pruned_loss=0.07377, over 21556.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2917, pruned_loss=0.06225, over 4254180.05 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 11:50:41,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1800312.0, ans=0.02 2023-06-27 11:51:00,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1800372.0, ans=0.0 2023-06-27 11:51:55,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1800492.0, ans=0.125 2023-06-27 11:51:58,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1800492.0, ans=0.2 2023-06-27 11:52:03,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1800552.0, ans=0.0 2023-06-27 11:52:19,296 INFO [train.py:996] (1/4) Epoch 10, batch 25650, loss[loss=0.225, simple_loss=0.2879, pruned_loss=0.08103, over 16274.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2931, pruned_loss=0.065, over 4256547.05 frames. ], batch size: 65, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:52:25,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-27 11:53:41,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1800792.0, ans=0.2 2023-06-27 11:54:00,586 INFO [train.py:996] (1/4) Epoch 10, batch 25700, loss[loss=0.2333, simple_loss=0.2933, pruned_loss=0.08668, over 21906.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2916, pruned_loss=0.06594, over 4251531.89 frames. ], batch size: 107, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:54:06,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.076e+02 8.398e+02 1.386e+03 2.056e+03 4.305e+03, threshold=2.773e+03, percent-clipped=25.0 2023-06-27 11:54:09,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1800912.0, ans=0.1 2023-06-27 11:54:33,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1800972.0, ans=0.0 2023-06-27 11:54:39,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1800972.0, ans=0.0 2023-06-27 11:54:44,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1800972.0, ans=0.0 2023-06-27 11:55:46,718 INFO [train.py:996] (1/4) Epoch 10, batch 25750, loss[loss=0.2478, simple_loss=0.3307, pruned_loss=0.08248, over 21498.00 frames. 
], tot_loss[loss=0.217, simple_loss=0.2963, pruned_loss=0.06884, over 4261364.38 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:55:57,391 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-27 11:56:22,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=22.5 2023-06-27 11:56:38,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1801332.0, ans=0.0 2023-06-27 11:57:39,565 INFO [train.py:996] (1/4) Epoch 10, batch 25800, loss[loss=0.2642, simple_loss=0.3469, pruned_loss=0.09074, over 21333.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3071, pruned_loss=0.073, over 4270924.75 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:57:41,432 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.492e+02 7.009e+02 1.091e+03 1.518e+03 3.688e+03, threshold=2.182e+03, percent-clipped=4.0 2023-06-27 11:57:59,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1801572.0, ans=0.125 2023-06-27 11:58:08,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=12.0 2023-06-27 11:58:40,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1801692.0, ans=0.0 2023-06-27 11:59:18,402 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-06-27 11:59:23,891 INFO [train.py:996] (1/4) Epoch 10, batch 25850, loss[loss=0.2221, simple_loss=0.2939, pruned_loss=0.07515, over 21486.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3089, pruned_loss=0.07239, over 4273863.07 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:59:39,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1801812.0, ans=0.125 2023-06-27 11:59:43,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1801872.0, ans=0.125 2023-06-27 11:59:50,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1801872.0, ans=0.125 2023-06-27 12:00:37,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1801992.0, ans=0.125 2023-06-27 12:00:55,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1802052.0, ans=0.0 2023-06-27 12:01:11,683 INFO [train.py:996] (1/4) Epoch 10, batch 25900, loss[loss=0.2417, simple_loss=0.3305, pruned_loss=0.07652, over 21445.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3087, pruned_loss=0.0727, over 4282952.81 frames. 
], batch size: 194, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:01:12,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1802112.0, ans=0.1 2023-06-27 12:01:13,359 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.315e+02 6.367e+02 8.577e+02 1.335e+03 4.211e+03, threshold=1.715e+03, percent-clipped=7.0 2023-06-27 12:02:21,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.46 vs. limit=6.0 2023-06-27 12:02:47,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1802352.0, ans=0.2 2023-06-27 12:02:53,660 INFO [train.py:996] (1/4) Epoch 10, batch 25950, loss[loss=0.3088, simple_loss=0.3667, pruned_loss=0.1255, over 21357.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3157, pruned_loss=0.07701, over 4273770.15 frames. ], batch size: 507, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:02:59,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1802412.0, ans=0.0 2023-06-27 12:03:33,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1802472.0, ans=0.0 2023-06-27 12:03:52,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1802532.0, ans=0.95 2023-06-27 12:03:59,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1802592.0, ans=0.125 2023-06-27 12:04:04,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1802592.0, ans=0.125 2023-06-27 12:04:14,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1802592.0, ans=0.125 2023-06-27 12:04:16,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1802592.0, ans=0.0 2023-06-27 12:04:21,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1802652.0, ans=0.125 2023-06-27 12:04:28,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.23 vs. limit=6.0 2023-06-27 12:04:34,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1802712.0, ans=0.125 2023-06-27 12:04:35,384 INFO [train.py:996] (1/4) Epoch 10, batch 26000, loss[loss=0.2471, simple_loss=0.33, pruned_loss=0.08211, over 21916.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3163, pruned_loss=0.07619, over 4272777.55 frames. ], batch size: 372, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:04:37,164 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 6.185e+02 7.875e+02 1.125e+03 3.104e+03, threshold=1.575e+03, percent-clipped=8.0 2023-06-27 12:04:43,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.85 vs. 
limit=10.0 2023-06-27 12:05:23,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1802832.0, ans=0.0 2023-06-27 12:05:51,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1802892.0, ans=0.1 2023-06-27 12:06:01,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1802952.0, ans=0.125 2023-06-27 12:06:10,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-06-27 12:06:16,103 INFO [train.py:996] (1/4) Epoch 10, batch 26050, loss[loss=0.1994, simple_loss=0.2699, pruned_loss=0.06442, over 21918.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3146, pruned_loss=0.07603, over 4276961.57 frames. ], batch size: 316, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:06:21,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1803012.0, ans=0.0 2023-06-27 12:06:47,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1803072.0, ans=0.125 2023-06-27 12:07:11,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1803132.0, ans=0.0 2023-06-27 12:07:50,608 INFO [train.py:996] (1/4) Epoch 10, batch 26100, loss[loss=0.2133, simple_loss=0.2887, pruned_loss=0.06897, over 21465.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3088, pruned_loss=0.07569, over 4285188.10 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:07:53,723 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.444e+02 6.064e+02 8.418e+02 1.151e+03 2.910e+03, threshold=1.684e+03, percent-clipped=10.0 2023-06-27 12:08:01,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-27 12:08:01,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-27 12:08:47,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1803432.0, ans=0.125 2023-06-27 12:08:50,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1803432.0, ans=0.0 2023-06-27 12:09:09,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1803492.0, ans=0.125 2023-06-27 12:09:30,846 INFO [train.py:996] (1/4) Epoch 10, batch 26150, loss[loss=0.2448, simple_loss=0.3346, pruned_loss=0.07746, over 21857.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3046, pruned_loss=0.07497, over 4290955.50 frames. 
], batch size: 118, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:09:53,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1803612.0, ans=0.125 2023-06-27 12:10:35,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1803792.0, ans=0.125 2023-06-27 12:10:51,526 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:10:54,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1803852.0, ans=0.125 2023-06-27 12:11:16,888 INFO [train.py:996] (1/4) Epoch 10, batch 26200, loss[loss=0.2085, simple_loss=0.2982, pruned_loss=0.05941, over 21130.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3058, pruned_loss=0.0732, over 4285587.10 frames. ], batch size: 143, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:11:20,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.988e+02 7.097e+02 1.092e+03 1.637e+03 2.606e+03, threshold=2.184e+03, percent-clipped=21.0 2023-06-27 12:11:40,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1803972.0, ans=0.125 2023-06-27 12:12:24,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1804092.0, ans=0.125 2023-06-27 12:12:46,372 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:12:56,974 INFO [train.py:996] (1/4) Epoch 10, batch 26250, loss[loss=0.2089, simple_loss=0.284, pruned_loss=0.06693, over 21832.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3091, pruned_loss=0.07291, over 4287059.70 frames. ], batch size: 282, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:13:04,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1804212.0, ans=0.125 2023-06-27 12:13:04,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-27 12:13:14,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1804212.0, ans=0.125 2023-06-27 12:14:36,323 INFO [train.py:996] (1/4) Epoch 10, batch 26300, loss[loss=0.203, simple_loss=0.2826, pruned_loss=0.06176, over 17146.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3059, pruned_loss=0.07338, over 4287893.23 frames. 
], batch size: 60, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:14:36,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1804512.0, ans=0.0 2023-06-27 12:14:39,664 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.217e+02 5.994e+02 7.746e+02 1.132e+03 2.553e+03, threshold=1.549e+03, percent-clipped=2.0 2023-06-27 12:14:48,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1804512.0, ans=0.0 2023-06-27 12:14:53,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1804512.0, ans=0.125 2023-06-27 12:15:16,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.43 vs. limit=10.0 2023-06-27 12:15:17,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1804632.0, ans=0.125 2023-06-27 12:15:34,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1804692.0, ans=0.125 2023-06-27 12:15:53,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1804692.0, ans=0.2 2023-06-27 12:16:12,533 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:16:16,765 INFO [train.py:996] (1/4) Epoch 10, batch 26350, loss[loss=0.2411, simple_loss=0.3189, pruned_loss=0.08165, over 21424.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3041, pruned_loss=0.07369, over 4284987.21 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:16:25,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1804812.0, ans=0.0 2023-06-27 12:16:27,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1804812.0, ans=0.2 2023-06-27 12:16:33,727 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:16:57,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1804932.0, ans=0.125 2023-06-27 12:17:27,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-27 12:17:33,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1805052.0, ans=0.2 2023-06-27 12:17:41,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1805052.0, ans=0.125 2023-06-27 12:17:46,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1805052.0, ans=0.125 2023-06-27 12:17:51,988 INFO [train.py:996] (1/4) Epoch 10, batch 26400, loss[loss=0.236, simple_loss=0.2768, pruned_loss=0.09764, over 21464.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2983, pruned_loss=0.07367, over 4285479.43 frames. 
], batch size: 510, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:17:55,527 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.612e+02 7.254e+02 1.118e+03 1.690e+03 3.507e+03, threshold=2.236e+03, percent-clipped=29.0 2023-06-27 12:18:18,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1805172.0, ans=0.1 2023-06-27 12:18:26,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1805172.0, ans=0.125 2023-06-27 12:19:02,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1805292.0, ans=0.025 2023-06-27 12:19:36,215 INFO [train.py:996] (1/4) Epoch 10, batch 26450, loss[loss=0.2621, simple_loss=0.3644, pruned_loss=0.07984, over 21718.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2991, pruned_loss=0.07399, over 4279144.06 frames. ], batch size: 332, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:19:49,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-06-27 12:20:10,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-27 12:20:16,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1805532.0, ans=0.0 2023-06-27 12:21:19,538 INFO [train.py:996] (1/4) Epoch 10, batch 26500, loss[loss=0.1903, simple_loss=0.2643, pruned_loss=0.05812, over 21623.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.302, pruned_loss=0.07255, over 4278183.89 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:21:28,833 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.787e+02 8.473e+02 1.317e+03 2.228e+03 4.940e+03, threshold=2.635e+03, percent-clipped=24.0 2023-06-27 12:22:00,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.89 vs. limit=6.0 2023-06-27 12:22:18,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1805832.0, ans=0.125 2023-06-27 12:22:56,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1805952.0, ans=0.125 2023-06-27 12:23:00,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-27 12:23:05,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1805952.0, ans=0.125 2023-06-27 12:23:07,846 INFO [train.py:996] (1/4) Epoch 10, batch 26550, loss[loss=0.2492, simple_loss=0.3528, pruned_loss=0.07277, over 19812.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.299, pruned_loss=0.07005, over 4265108.69 frames. 
], batch size: 703, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:23:26,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1806012.0, ans=0.0 2023-06-27 12:24:05,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1806132.0, ans=0.125 2023-06-27 12:24:14,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806192.0, ans=0.1 2023-06-27 12:24:53,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-27 12:24:53,638 INFO [train.py:996] (1/4) Epoch 10, batch 26600, loss[loss=0.2381, simple_loss=0.2899, pruned_loss=0.09314, over 20217.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2984, pruned_loss=0.06701, over 4265898.96 frames. ], batch size: 707, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:25:01,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1806312.0, ans=0.0 2023-06-27 12:25:02,995 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 9.676e+02 1.340e+03 1.727e+03 3.782e+03, threshold=2.679e+03, percent-clipped=7.0 2023-06-27 12:25:50,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-27 12:26:38,850 INFO [train.py:996] (1/4) Epoch 10, batch 26650, loss[loss=0.1465, simple_loss=0.2304, pruned_loss=0.03129, over 21558.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2905, pruned_loss=0.06548, over 4258382.11 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:26:58,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1806672.0, ans=0.1 2023-06-27 12:27:16,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806732.0, ans=0.1 2023-06-27 12:27:32,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1806792.0, ans=0.125 2023-06-27 12:28:18,262 INFO [train.py:996] (1/4) Epoch 10, batch 26700, loss[loss=0.265, simple_loss=0.3131, pruned_loss=0.1085, over 21781.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2835, pruned_loss=0.06269, over 4256251.29 frames. ], batch size: 508, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:28:23,315 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.270e+02 4.740e+02 5.974e+02 7.751e+02 2.095e+03, threshold=1.195e+03, percent-clipped=0.0 2023-06-27 12:28:36,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1806912.0, ans=0.125 2023-06-27 12:28:57,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1807032.0, ans=0.125 2023-06-27 12:29:04,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807032.0, ans=0.1 2023-06-27 12:29:24,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.75 vs. 
limit=15.0 2023-06-27 12:30:01,627 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-06-27 12:30:03,646 INFO [train.py:996] (1/4) Epoch 10, batch 26750, loss[loss=0.1932, simple_loss=0.2864, pruned_loss=0.04997, over 21796.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2829, pruned_loss=0.06156, over 4265703.42 frames. ], batch size: 282, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:30:54,597 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-27 12:31:03,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1807392.0, ans=0.0 2023-06-27 12:31:38,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1807452.0, ans=0.125 2023-06-27 12:31:45,851 INFO [train.py:996] (1/4) Epoch 10, batch 26800, loss[loss=0.2621, simple_loss=0.3369, pruned_loss=0.09363, over 21537.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2904, pruned_loss=0.06596, over 4272818.39 frames. ], batch size: 414, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:31:51,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.485e+02 8.253e+02 1.353e+03 2.004e+03 3.922e+03, threshold=2.706e+03, percent-clipped=54.0 2023-06-27 12:32:27,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1807632.0, ans=0.125 2023-06-27 12:33:04,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1807692.0, ans=0.125 2023-06-27 12:33:19,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1807752.0, ans=0.125 2023-06-27 12:33:27,232 INFO [train.py:996] (1/4) Epoch 10, batch 26850, loss[loss=0.1743, simple_loss=0.2397, pruned_loss=0.05447, over 21605.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2925, pruned_loss=0.0688, over 4264001.35 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:33:34,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1807812.0, ans=0.0 2023-06-27 12:34:54,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1808052.0, ans=0.0 2023-06-27 12:35:05,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808112.0, ans=0.1 2023-06-27 12:35:07,077 INFO [train.py:996] (1/4) Epoch 10, batch 26900, loss[loss=0.2011, simple_loss=0.2649, pruned_loss=0.06867, over 21746.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.285, pruned_loss=0.0687, over 4256032.43 frames. 
], batch size: 112, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:35:13,694 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.943e+02 6.437e+02 8.362e+02 1.264e+03 2.899e+03, threshold=1.672e+03, percent-clipped=1.0 2023-06-27 12:35:29,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1808172.0, ans=0.125 2023-06-27 12:36:35,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1808352.0, ans=0.1 2023-06-27 12:36:35,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1808352.0, ans=0.125 2023-06-27 12:36:45,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1808412.0, ans=0.2 2023-06-27 12:36:46,478 INFO [train.py:996] (1/4) Epoch 10, batch 26950, loss[loss=0.241, simple_loss=0.3382, pruned_loss=0.07188, over 21223.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2834, pruned_loss=0.06842, over 4261034.56 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:38:27,742 INFO [train.py:996] (1/4) Epoch 10, batch 27000, loss[loss=0.1711, simple_loss=0.2549, pruned_loss=0.0436, over 21190.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2839, pruned_loss=0.06626, over 4254804.25 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:38:27,742 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 12:38:47,559 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2474, simple_loss=0.3368, pruned_loss=0.07904, over 1796401.00 frames. 2023-06-27 12:38:47,561 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 12:39:01,426 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 5.840e+02 8.267e+02 1.216e+03 2.372e+03, threshold=1.653e+03, percent-clipped=7.0 2023-06-27 12:39:36,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1808832.0, ans=0.0 2023-06-27 12:40:06,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1808892.0, ans=0.125 2023-06-27 12:40:14,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1808952.0, ans=0.0 2023-06-27 12:40:20,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808952.0, ans=0.1 2023-06-27 12:40:29,862 INFO [train.py:996] (1/4) Epoch 10, batch 27050, loss[loss=0.2186, simple_loss=0.3026, pruned_loss=0.06726, over 21829.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2865, pruned_loss=0.06307, over 4256922.88 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:41:29,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1809132.0, ans=0.125 2023-06-27 12:42:10,019 INFO [train.py:996] (1/4) Epoch 10, batch 27100, loss[loss=0.2339, simple_loss=0.3269, pruned_loss=0.07044, over 21717.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2888, pruned_loss=0.06395, over 4271008.61 frames. 
], batch size: 441, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:42:16,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809312.0, ans=0.1 2023-06-27 12:42:22,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.015e+02 5.614e+02 8.365e+02 1.169e+03 2.643e+03, threshold=1.673e+03, percent-clipped=10.0 2023-06-27 12:43:12,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1809492.0, ans=0.125 2023-06-27 12:43:40,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1809552.0, ans=0.125 2023-06-27 12:43:51,705 INFO [train.py:996] (1/4) Epoch 10, batch 27150, loss[loss=0.3082, simple_loss=0.3938, pruned_loss=0.1113, over 21662.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3017, pruned_loss=0.06775, over 4275683.16 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:44:13,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1809672.0, ans=0.0 2023-06-27 12:44:16,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1809672.0, ans=0.125 2023-06-27 12:44:16,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1809672.0, ans=0.125 2023-06-27 12:44:55,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-27 12:45:37,928 INFO [train.py:996] (1/4) Epoch 10, batch 27200, loss[loss=0.2419, simple_loss=0.3274, pruned_loss=0.07827, over 20693.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3099, pruned_loss=0.0701, over 4274589.42 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:45:50,768 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 6.689e+02 1.006e+03 1.593e+03 2.972e+03, threshold=2.013e+03, percent-clipped=22.0 2023-06-27 12:46:01,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1809972.0, ans=0.125 2023-06-27 12:47:00,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-27 12:47:19,000 INFO [train.py:996] (1/4) Epoch 10, batch 27250, loss[loss=0.2403, simple_loss=0.3123, pruned_loss=0.08415, over 21478.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3122, pruned_loss=0.07409, over 4271928.54 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:47:47,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-27 12:48:21,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1810392.0, ans=0.0 2023-06-27 12:48:24,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.86 vs. 
limit=15.0 2023-06-27 12:48:48,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1810452.0, ans=0.125 2023-06-27 12:48:52,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1810452.0, ans=0.1 2023-06-27 12:48:52,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1810452.0, ans=0.0 2023-06-27 12:48:58,262 INFO [train.py:996] (1/4) Epoch 10, batch 27300, loss[loss=0.2279, simple_loss=0.3265, pruned_loss=0.06468, over 21732.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.313, pruned_loss=0.07456, over 4275360.33 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:49:00,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1810512.0, ans=0.0 2023-06-27 12:49:06,654 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.508e+02 9.291e+02 1.314e+03 3.410e+03, threshold=1.858e+03, percent-clipped=10.0 2023-06-27 12:49:22,729 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-27 12:49:30,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.95 vs. limit=15.0 2023-06-27 12:50:00,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1810692.0, ans=0.125 2023-06-27 12:50:26,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.34 vs. limit=6.0 2023-06-27 12:50:38,162 INFO [train.py:996] (1/4) Epoch 10, batch 27350, loss[loss=0.221, simple_loss=0.306, pruned_loss=0.06798, over 21790.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3149, pruned_loss=0.07461, over 4277122.96 frames. ], batch size: 124, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:50:39,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-27 12:51:12,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1810872.0, ans=0.0 2023-06-27 12:51:40,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1810992.0, ans=0.125 2023-06-27 12:51:41,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-27 12:52:12,673 INFO [train.py:996] (1/4) Epoch 10, batch 27400, loss[loss=0.2015, simple_loss=0.2715, pruned_loss=0.06577, over 21786.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3098, pruned_loss=0.07399, over 4274271.79 frames. 
], batch size: 371, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:52:20,978 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 5.726e+02 8.020e+02 1.365e+03 2.836e+03, threshold=1.604e+03, percent-clipped=8.0 2023-06-27 12:52:23,393 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:52:30,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811172.0, ans=0.1 2023-06-27 12:52:44,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1811172.0, ans=0.0 2023-06-27 12:52:59,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1811232.0, ans=0.125 2023-06-27 12:53:05,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1811232.0, ans=0.125 2023-06-27 12:53:32,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811292.0, ans=0.1 2023-06-27 12:53:54,327 INFO [train.py:996] (1/4) Epoch 10, batch 27450, loss[loss=0.25, simple_loss=0.3286, pruned_loss=0.08571, over 21563.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3038, pruned_loss=0.07176, over 4268468.73 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:53:56,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1811412.0, ans=0.0 2023-06-27 12:54:06,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1811412.0, ans=0.125 2023-06-27 12:54:20,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1811472.0, ans=0.2 2023-06-27 12:54:24,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0 2023-06-27 12:54:46,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=10.0 2023-06-27 12:54:49,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1811532.0, ans=0.2 2023-06-27 12:55:06,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1811592.0, ans=0.0 2023-06-27 12:55:06,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0 2023-06-27 12:55:14,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1811652.0, ans=0.125 2023-06-27 12:55:30,305 INFO [train.py:996] (1/4) Epoch 10, batch 27500, loss[loss=0.1991, simple_loss=0.2739, pruned_loss=0.0621, over 21572.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3025, pruned_loss=0.07251, over 4273375.06 frames. 
], batch size: 212, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:55:38,230 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.956e+02 6.120e+02 9.251e+02 1.541e+03 3.924e+03, threshold=1.850e+03, percent-clipped=23.0 2023-06-27 12:55:47,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.01 vs. limit=15.0 2023-06-27 12:56:02,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1811772.0, ans=0.125 2023-06-27 12:56:15,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1811832.0, ans=0.125 2023-06-27 12:56:21,906 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:56:50,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1811892.0, ans=0.04949747468305833 2023-06-27 12:57:09,553 INFO [train.py:996] (1/4) Epoch 10, batch 27550, loss[loss=0.1757, simple_loss=0.263, pruned_loss=0.04419, over 21773.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2982, pruned_loss=0.06972, over 4273297.84 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 12:57:13,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1812012.0, ans=0.02 2023-06-27 12:57:17,089 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.95 vs. limit=10.0 2023-06-27 12:57:19,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1812012.0, ans=0.015 2023-06-27 12:57:56,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=12.0 2023-06-27 12:57:57,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1812132.0, ans=0.125 2023-06-27 12:58:24,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1812192.0, ans=0.0 2023-06-27 12:58:36,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1812252.0, ans=0.125 2023-06-27 12:58:48,793 INFO [train.py:996] (1/4) Epoch 10, batch 27600, loss[loss=0.2012, simple_loss=0.2728, pruned_loss=0.06477, over 21716.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2908, pruned_loss=0.06902, over 4277300.86 frames. ], batch size: 316, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 12:58:56,895 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.403e+02 6.402e+02 9.119e+02 1.240e+03 2.150e+03, threshold=1.824e+03, percent-clipped=4.0 2023-06-27 12:58:58,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.25 vs. limit=6.0 2023-06-27 13:00:22,183 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:00:27,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. 
limit=6.0 2023-06-27 13:00:29,607 INFO [train.py:996] (1/4) Epoch 10, batch 27650, loss[loss=0.2157, simple_loss=0.3016, pruned_loss=0.06485, over 21890.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2858, pruned_loss=0.06866, over 4267504.69 frames. ], batch size: 316, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:00:31,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1812612.0, ans=10.0 2023-06-27 13:00:33,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1812612.0, ans=0.125 2023-06-27 13:01:02,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1812672.0, ans=0.125 2023-06-27 13:01:09,225 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-27 13:01:33,403 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-27 13:01:49,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1812792.0, ans=0.2 2023-06-27 13:01:59,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1812852.0, ans=0.04949747468305833 2023-06-27 13:02:10,599 INFO [train.py:996] (1/4) Epoch 10, batch 27700, loss[loss=0.3263, simple_loss=0.3905, pruned_loss=0.131, over 21525.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2859, pruned_loss=0.06782, over 4267255.75 frames. ], batch size: 508, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:02:23,384 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.863e+02 9.869e+02 1.519e+03 3.382e+03, threshold=1.974e+03, percent-clipped=13.0 2023-06-27 13:02:52,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1812972.0, ans=0.0 2023-06-27 13:03:50,223 INFO [train.py:996] (1/4) Epoch 10, batch 27750, loss[loss=0.1834, simple_loss=0.267, pruned_loss=0.04993, over 21701.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.291, pruned_loss=0.06783, over 4273710.34 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:03:51,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-27 13:04:55,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1813392.0, ans=0.015 2023-06-27 13:05:28,654 INFO [train.py:996] (1/4) Epoch 10, batch 27800, loss[loss=0.2608, simple_loss=0.3126, pruned_loss=0.1046, over 21792.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2894, pruned_loss=0.06757, over 4277632.09 frames. 
], batch size: 508, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:05:43,188 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.682e+02 6.752e+02 9.329e+02 1.344e+03 2.939e+03, threshold=1.866e+03, percent-clipped=10.0 2023-06-27 13:05:59,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1813572.0, ans=0.125 2023-06-27 13:06:05,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1813572.0, ans=0.2 2023-06-27 13:06:55,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=8.0 2023-06-27 13:07:09,261 INFO [train.py:996] (1/4) Epoch 10, batch 27850, loss[loss=0.2582, simple_loss=0.3425, pruned_loss=0.08694, over 21707.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2885, pruned_loss=0.06827, over 4290435.72 frames. ], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:07:37,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1813872.0, ans=0.0 2023-06-27 13:08:15,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1813932.0, ans=0.125 2023-06-27 13:08:20,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1813992.0, ans=0.0 2023-06-27 13:08:43,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1814052.0, ans=0.125 2023-06-27 13:09:01,133 INFO [train.py:996] (1/4) Epoch 10, batch 27900, loss[loss=0.2217, simple_loss=0.3102, pruned_loss=0.0666, over 21404.00 frames. ], tot_loss[loss=0.217, simple_loss=0.296, pruned_loss=0.06901, over 4292027.96 frames. ], batch size: 194, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:09:15,843 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 6.352e+02 8.865e+02 1.400e+03 2.806e+03, threshold=1.773e+03, percent-clipped=7.0 2023-06-27 13:09:18,670 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-27 13:09:28,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-27 13:10:38,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1814352.0, ans=0.125 2023-06-27 13:10:48,718 INFO [train.py:996] (1/4) Epoch 10, batch 27950, loss[loss=0.166, simple_loss=0.2533, pruned_loss=0.0393, over 21612.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2968, pruned_loss=0.06627, over 4288807.72 frames. ], batch size: 195, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:12:12,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1814652.0, ans=0.0 2023-06-27 13:12:28,085 INFO [train.py:996] (1/4) Epoch 10, batch 28000, loss[loss=0.2114, simple_loss=0.2772, pruned_loss=0.07281, over 21310.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2954, pruned_loss=0.06479, over 4286065.10 frames. 
], batch size: 143, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:12:42,691 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 5.982e+02 8.841e+02 1.274e+03 3.365e+03, threshold=1.768e+03, percent-clipped=7.0 2023-06-27 13:12:58,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1814772.0, ans=0.04949747468305833 2023-06-27 13:13:08,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814832.0, ans=0.1 2023-06-27 13:13:16,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1814832.0, ans=0.2 2023-06-27 13:14:14,350 INFO [train.py:996] (1/4) Epoch 10, batch 28050, loss[loss=0.2021, simple_loss=0.2807, pruned_loss=0.06174, over 21777.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.293, pruned_loss=0.06576, over 4289701.52 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:14:18,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1815012.0, ans=0.0 2023-06-27 13:15:06,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1815192.0, ans=0.125 2023-06-27 13:15:31,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1815192.0, ans=0.125 2023-06-27 13:15:54,384 INFO [train.py:996] (1/4) Epoch 10, batch 28100, loss[loss=0.1948, simple_loss=0.2629, pruned_loss=0.0633, over 21640.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2905, pruned_loss=0.06491, over 4291281.14 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:16:06,173 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 5.968e+02 9.165e+02 1.416e+03 2.614e+03, threshold=1.833e+03, percent-clipped=9.0 2023-06-27 13:17:34,195 INFO [train.py:996] (1/4) Epoch 10, batch 28150, loss[loss=0.1829, simple_loss=0.2484, pruned_loss=0.05871, over 21560.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2831, pruned_loss=0.06508, over 4290469.69 frames. ], batch size: 231, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:17:48,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.29 vs. limit=12.0 2023-06-27 13:18:56,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.94 vs. limit=22.5 2023-06-27 13:19:14,722 INFO [train.py:996] (1/4) Epoch 10, batch 28200, loss[loss=0.2114, simple_loss=0.2755, pruned_loss=0.07367, over 21337.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.281, pruned_loss=0.06589, over 4279448.56 frames. ], batch size: 177, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:19:17,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. 
limit=15.0 2023-06-27 13:19:26,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.905e+02 6.047e+02 9.821e+02 1.464e+03 4.986e+03, threshold=1.964e+03, percent-clipped=9.0 2023-06-27 13:19:26,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1815912.0, ans=0.2 2023-06-27 13:20:35,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1816092.0, ans=0.0 2023-06-27 13:20:54,970 INFO [train.py:996] (1/4) Epoch 10, batch 28250, loss[loss=0.1863, simple_loss=0.2531, pruned_loss=0.05971, over 21379.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2848, pruned_loss=0.0684, over 4278687.31 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:21:08,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1816212.0, ans=0.0 2023-06-27 13:21:08,716 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:21:10,517 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:21:13,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1816272.0, ans=0.125 2023-06-27 13:21:53,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1816332.0, ans=0.125 2023-06-27 13:22:36,307 INFO [train.py:996] (1/4) Epoch 10, batch 28300, loss[loss=0.1999, simple_loss=0.2929, pruned_loss=0.05339, over 21491.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2832, pruned_loss=0.06654, over 4275746.65 frames. ], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:22:45,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1816512.0, ans=0.125 2023-06-27 13:22:47,917 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 5.786e+02 9.744e+02 1.588e+03 3.149e+03, threshold=1.949e+03, percent-clipped=13.0 2023-06-27 13:23:04,124 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:24:15,589 INFO [train.py:996] (1/4) Epoch 10, batch 28350, loss[loss=0.1916, simple_loss=0.2673, pruned_loss=0.05797, over 21718.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2822, pruned_loss=0.06219, over 4265917.25 frames. ], batch size: 351, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:24:18,361 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-27 13:24:20,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.10 vs. 
limit=22.5 2023-06-27 13:25:20,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1816932.0, ans=0.0 2023-06-27 13:25:23,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1816992.0, ans=0.125 2023-06-27 13:25:25,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1816992.0, ans=0.125 2023-06-27 13:25:31,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1816992.0, ans=0.125 2023-06-27 13:25:48,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1817052.0, ans=0.1 2023-06-27 13:25:55,975 INFO [train.py:996] (1/4) Epoch 10, batch 28400, loss[loss=0.2292, simple_loss=0.296, pruned_loss=0.08118, over 21320.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2792, pruned_loss=0.062, over 4258301.67 frames. ], batch size: 549, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:26:18,384 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.150e+02 6.326e+02 1.038e+03 1.651e+03 3.367e+03, threshold=2.075e+03, percent-clipped=16.0 2023-06-27 13:26:38,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1817172.0, ans=0.035 2023-06-27 13:27:17,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1817292.0, ans=0.125 2023-06-27 13:27:26,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1817352.0, ans=0.125 2023-06-27 13:27:27,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1817352.0, ans=0.125 2023-06-27 13:27:37,254 INFO [train.py:996] (1/4) Epoch 10, batch 28450, loss[loss=0.2617, simple_loss=0.3197, pruned_loss=0.1019, over 21803.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2835, pruned_loss=0.06491, over 4263731.21 frames. ], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:27:37,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1817412.0, ans=0.015 2023-06-27 13:27:50,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1817412.0, ans=0.1 2023-06-27 13:28:19,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1817472.0, ans=0.125 2023-06-27 13:28:25,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-27 13:28:29,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1817532.0, ans=0.1 2023-06-27 13:28:33,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.09 vs. limit=10.0 2023-06-27 13:28:40,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-06-27 13:28:41,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1817532.0, ans=0.125 2023-06-27 13:28:55,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1817592.0, ans=0.1 2023-06-27 13:29:27,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1817712.0, ans=0.0 2023-06-27 13:29:27,866 INFO [train.py:996] (1/4) Epoch 10, batch 28500, loss[loss=0.2118, simple_loss=0.2917, pruned_loss=0.06593, over 21873.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2839, pruned_loss=0.06645, over 4271082.54 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:29:38,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1817712.0, ans=0.0 2023-06-27 13:29:50,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 6.822e+02 1.044e+03 1.325e+03 2.451e+03, threshold=2.088e+03, percent-clipped=2.0 2023-06-27 13:30:12,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1817832.0, ans=0.125 2023-06-27 13:30:21,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1817832.0, ans=0.0 2023-06-27 13:31:14,230 INFO [train.py:996] (1/4) Epoch 10, batch 28550, loss[loss=0.2242, simple_loss=0.2986, pruned_loss=0.07495, over 21243.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2928, pruned_loss=0.06938, over 4276329.99 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:31:29,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1818012.0, ans=0.125 2023-06-27 13:31:29,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-06-27 13:31:54,718 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-27 13:32:05,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-27 13:32:29,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1818252.0, ans=0.125 2023-06-27 13:32:59,169 INFO [train.py:996] (1/4) Epoch 10, batch 28600, loss[loss=0.2329, simple_loss=0.3056, pruned_loss=0.08012, over 21338.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2985, pruned_loss=0.07093, over 4275978.06 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:33:10,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. 
limit=6.0 2023-06-27 13:33:12,232 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.169e+02 6.322e+02 9.283e+02 1.275e+03 2.692e+03, threshold=1.857e+03, percent-clipped=3.0 2023-06-27 13:33:26,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1818372.0, ans=0.125 2023-06-27 13:34:20,249 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.03 vs. limit=15.0 2023-06-27 13:34:40,150 INFO [train.py:996] (1/4) Epoch 10, batch 28650, loss[loss=0.1722, simple_loss=0.237, pruned_loss=0.05375, over 21546.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2934, pruned_loss=0.07115, over 4259558.35 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:34:40,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1818612.0, ans=0.0 2023-06-27 13:34:55,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1818672.0, ans=0.125 2023-06-27 13:34:57,901 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-27 13:35:14,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-06-27 13:35:15,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1818732.0, ans=0.0 2023-06-27 13:35:24,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-27 13:35:53,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.12 vs. limit=10.0 2023-06-27 13:36:16,628 INFO [train.py:996] (1/4) Epoch 10, batch 28700, loss[loss=0.2304, simple_loss=0.3157, pruned_loss=0.0726, over 21798.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2928, pruned_loss=0.07213, over 4255054.49 frames. ], batch size: 118, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:36:17,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1818912.0, ans=0.2 2023-06-27 13:36:29,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.086e+02 6.900e+02 1.037e+03 1.524e+03 3.185e+03, threshold=2.075e+03, percent-clipped=14.0 2023-06-27 13:36:48,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-27 13:37:52,305 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-06-27 13:37:53,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.86 vs. 
limit=15.0 2023-06-27 13:37:55,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1819152.0, ans=0.0 2023-06-27 13:37:57,656 INFO [train.py:996] (1/4) Epoch 10, batch 28750, loss[loss=0.2237, simple_loss=0.2989, pruned_loss=0.07422, over 21774.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2914, pruned_loss=0.0719, over 4258582.25 frames. ], batch size: 112, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:39:33,336 INFO [train.py:996] (1/4) Epoch 10, batch 28800, loss[loss=0.267, simple_loss=0.3398, pruned_loss=0.09709, over 21280.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2969, pruned_loss=0.07342, over 4256958.84 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:39:47,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 7.759e+02 9.840e+02 1.249e+03 3.010e+03, threshold=1.968e+03, percent-clipped=7.0 2023-06-27 13:39:51,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1819572.0, ans=0.2 2023-06-27 13:41:07,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-27 13:41:09,921 INFO [train.py:996] (1/4) Epoch 10, batch 28850, loss[loss=0.2552, simple_loss=0.3191, pruned_loss=0.09562, over 21770.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2981, pruned_loss=0.07442, over 4266974.62 frames. ], batch size: 414, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:41:35,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1819872.0, ans=0.2 2023-06-27 13:42:46,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1820052.0, ans=0.2 2023-06-27 13:42:50,435 INFO [train.py:996] (1/4) Epoch 10, batch 28900, loss[loss=0.2757, simple_loss=0.3451, pruned_loss=0.1032, over 21545.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3013, pruned_loss=0.07635, over 4277652.74 frames. ], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:43:05,414 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.892e+02 6.958e+02 1.036e+03 1.416e+03 3.093e+03, threshold=2.073e+03, percent-clipped=9.0 2023-06-27 13:43:57,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.52 vs. limit=15.0 2023-06-27 13:44:33,633 INFO [train.py:996] (1/4) Epoch 10, batch 28950, loss[loss=0.1385, simple_loss=0.1832, pruned_loss=0.04692, over 16733.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3022, pruned_loss=0.0756, over 4270859.45 frames. ], batch size: 61, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:45:39,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1820532.0, ans=0.2 2023-06-27 13:46:10,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1820652.0, ans=0.05 2023-06-27 13:46:15,102 INFO [train.py:996] (1/4) Epoch 10, batch 29000, loss[loss=0.2299, simple_loss=0.3023, pruned_loss=0.07871, over 21445.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3053, pruned_loss=0.07441, over 4271223.92 frames. 
], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:46:43,661 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.952e+02 6.978e+02 9.216e+02 1.338e+03 4.286e+03, threshold=1.843e+03, percent-clipped=9.0 2023-06-27 13:47:32,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1820892.0, ans=0.0 2023-06-27 13:47:52,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1820952.0, ans=0.2 2023-06-27 13:48:04,737 INFO [train.py:996] (1/4) Epoch 10, batch 29050, loss[loss=0.2317, simple_loss=0.3025, pruned_loss=0.08039, over 21862.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3048, pruned_loss=0.07508, over 4270399.17 frames. ], batch size: 414, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:48:16,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1821012.0, ans=0.07 2023-06-27 13:48:17,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-27 13:48:20,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1821012.0, ans=0.125 2023-06-27 13:48:29,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1821072.0, ans=0.125 2023-06-27 13:48:29,970 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-27 13:48:36,179 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:49:24,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1821252.0, ans=0.125 2023-06-27 13:49:40,670 INFO [train.py:996] (1/4) Epoch 10, batch 29100, loss[loss=0.2019, simple_loss=0.2668, pruned_loss=0.06851, over 15376.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2964, pruned_loss=0.07259, over 4267973.34 frames. ], batch size: 61, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:49:55,603 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.400e+02 6.043e+02 9.332e+02 1.585e+03 3.722e+03, threshold=1.866e+03, percent-clipped=13.0 2023-06-27 13:50:07,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1821372.0, ans=0.125 2023-06-27 13:50:31,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-27 13:50:34,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-27 13:50:38,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1821492.0, ans=0.0 2023-06-27 13:51:14,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1821552.0, ans=0.125 2023-06-27 13:51:16,638 INFO [train.py:996] (1/4) Epoch 10, batch 29150, loss[loss=0.2055, simple_loss=0.2827, pruned_loss=0.06416, over 21777.00 frames. 
], tot_loss[loss=0.2173, simple_loss=0.2939, pruned_loss=0.07032, over 4271020.79 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:51:20,838 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:51:38,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1821672.0, ans=0.125 2023-06-27 13:51:47,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1821672.0, ans=0.0 2023-06-27 13:52:08,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1821792.0, ans=0.07 2023-06-27 13:52:08,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1821792.0, ans=0.07 2023-06-27 13:52:57,676 INFO [train.py:996] (1/4) Epoch 10, batch 29200, loss[loss=0.2231, simple_loss=0.2986, pruned_loss=0.07377, over 21601.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2902, pruned_loss=0.06965, over 4274650.04 frames. ], batch size: 442, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:53:00,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1821912.0, ans=0.2 2023-06-27 13:53:11,343 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:53:14,036 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.160e+02 1.002e+03 1.749e+03 3.498e+03, threshold=2.004e+03, percent-clipped=20.0 2023-06-27 13:53:19,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1821972.0, ans=0.1 2023-06-27 13:53:25,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-27 13:53:43,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1822032.0, ans=0.0 2023-06-27 13:54:29,613 INFO [train.py:996] (1/4) Epoch 10, batch 29250, loss[loss=0.1917, simple_loss=0.2795, pruned_loss=0.05201, over 21112.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.288, pruned_loss=0.0677, over 4277761.66 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:54:55,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1822272.0, ans=0.125 2023-06-27 13:55:21,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1822392.0, ans=0.0 2023-06-27 13:55:52,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1822452.0, ans=0.125 2023-06-27 13:56:06,666 INFO [train.py:996] (1/4) Epoch 10, batch 29300, loss[loss=0.1801, simple_loss=0.2452, pruned_loss=0.05748, over 21200.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2887, pruned_loss=0.06635, over 4274091.90 frames. 
], batch size: 144, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:56:22,990 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.212e+02 5.568e+02 7.846e+02 1.257e+03 2.359e+03, threshold=1.569e+03, percent-clipped=3.0 2023-06-27 13:56:33,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1822572.0, ans=0.2 2023-06-27 13:56:43,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-27 13:56:56,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1822692.0, ans=0.125 2023-06-27 13:57:02,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1822692.0, ans=0.125 2023-06-27 13:57:38,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1822752.0, ans=0.0 2023-06-27 13:57:48,082 INFO [train.py:996] (1/4) Epoch 10, batch 29350, loss[loss=0.1715, simple_loss=0.2294, pruned_loss=0.05681, over 20748.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2857, pruned_loss=0.06646, over 4268696.20 frames. ], batch size: 609, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:58:23,910 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:59:20,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2023-06-27 13:59:31,048 INFO [train.py:996] (1/4) Epoch 10, batch 29400, loss[loss=0.2312, simple_loss=0.3114, pruned_loss=0.0755, over 21547.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2859, pruned_loss=0.06513, over 4266596.84 frames. ], batch size: 473, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:59:42,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-06-27 13:59:47,673 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 6.893e+02 1.012e+03 1.543e+03 3.903e+03, threshold=2.024e+03, percent-clipped=23.0 2023-06-27 14:00:44,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1823292.0, ans=0.125 2023-06-27 14:01:03,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1823352.0, ans=0.1 2023-06-27 14:01:12,870 INFO [train.py:996] (1/4) Epoch 10, batch 29450, loss[loss=0.222, simple_loss=0.2949, pruned_loss=0.07458, over 21819.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2863, pruned_loss=0.06521, over 4262241.08 frames. 
], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:02:13,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1823532.0, ans=0.125 2023-06-27 14:02:31,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1823592.0, ans=0.125 2023-06-27 14:02:41,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1823652.0, ans=0.125 2023-06-27 14:02:44,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1823652.0, ans=0.04949747468305833 2023-06-27 14:02:51,894 INFO [train.py:996] (1/4) Epoch 10, batch 29500, loss[loss=0.2326, simple_loss=0.3, pruned_loss=0.08262, over 21209.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2886, pruned_loss=0.06743, over 4271553.18 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:02:52,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1823712.0, ans=0.125 2023-06-27 14:03:07,701 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 6.865e+02 1.061e+03 1.645e+03 3.419e+03, threshold=2.123e+03, percent-clipped=12.0 2023-06-27 14:03:57,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1823892.0, ans=0.125 2023-06-27 14:04:33,492 INFO [train.py:996] (1/4) Epoch 10, batch 29550, loss[loss=0.2248, simple_loss=0.2962, pruned_loss=0.0767, over 21435.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2877, pruned_loss=0.06899, over 4277476.09 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:04:35,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1824012.0, ans=0.0 2023-06-27 14:04:35,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1824012.0, ans=0.05 2023-06-27 14:04:49,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1824072.0, ans=0.04949747468305833 2023-06-27 14:05:31,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1824132.0, ans=0.0 2023-06-27 14:06:04,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1824252.0, ans=0.0 2023-06-27 14:06:11,174 INFO [train.py:996] (1/4) Epoch 10, batch 29600, loss[loss=0.2127, simple_loss=0.2876, pruned_loss=0.06892, over 21266.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2931, pruned_loss=0.07045, over 4280634.69 frames. 
], batch size: 143, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 14:06:16,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1824312.0, ans=0.2 2023-06-27 14:06:23,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1824312.0, ans=0.0 2023-06-27 14:06:29,597 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.301e+02 5.949e+02 7.426e+02 9.960e+02 2.480e+03, threshold=1.485e+03, percent-clipped=1.0 2023-06-27 14:06:53,671 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-27 14:07:23,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1824492.0, ans=0.125 2023-06-27 14:07:30,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1824552.0, ans=0.2 2023-06-27 14:07:43,093 INFO [train.py:996] (1/4) Epoch 10, batch 29650, loss[loss=0.1943, simple_loss=0.2682, pruned_loss=0.06019, over 21678.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2952, pruned_loss=0.06896, over 4275886.66 frames. ], batch size: 263, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:08:10,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1824672.0, ans=0.2 2023-06-27 14:08:33,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-27 14:08:44,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1824732.0, ans=0.125 2023-06-27 14:08:44,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1824732.0, ans=0.125 2023-06-27 14:09:04,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1824852.0, ans=0.125 2023-06-27 14:09:06,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1824852.0, ans=0.125 2023-06-27 14:09:20,373 INFO [train.py:996] (1/4) Epoch 10, batch 29700, loss[loss=0.2179, simple_loss=0.312, pruned_loss=0.06185, over 21513.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2936, pruned_loss=0.06809, over 4268669.23 frames. ], batch size: 194, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:09:34,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-27 14:09:42,884 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.057e+02 7.297e+02 1.065e+03 1.869e+03 3.621e+03, threshold=2.131e+03, percent-clipped=32.0 2023-06-27 14:10:37,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1825092.0, ans=0.2 2023-06-27 14:10:56,031 INFO [train.py:996] (1/4) Epoch 10, batch 29750, loss[loss=0.2126, simple_loss=0.3151, pruned_loss=0.05502, over 21864.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2983, pruned_loss=0.06786, over 4273577.33 frames. 
], batch size: 371, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:12:27,395 INFO [train.py:996] (1/4) Epoch 10, batch 29800, loss[loss=0.2056, simple_loss=0.2799, pruned_loss=0.06566, over 21886.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2993, pruned_loss=0.06787, over 4272325.26 frames. ], batch size: 371, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:12:34,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1825512.0, ans=0.07 2023-06-27 14:12:51,510 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.238e+02 6.246e+02 9.031e+02 1.363e+03 2.753e+03, threshold=1.806e+03, percent-clipped=5.0 2023-06-27 14:13:08,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-27 14:13:14,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1825632.0, ans=0.125 2023-06-27 14:13:52,273 INFO [train.py:996] (1/4) Epoch 10, batch 29850, loss[loss=0.1963, simple_loss=0.2771, pruned_loss=0.05774, over 21787.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2955, pruned_loss=0.0662, over 4274606.18 frames. ], batch size: 298, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:13:52,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1825812.0, ans=0.5 2023-06-27 14:14:18,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1825872.0, ans=0.2 2023-06-27 14:14:35,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1825872.0, ans=0.1 2023-06-27 14:15:27,920 INFO [train.py:996] (1/4) Epoch 10, batch 29900, loss[loss=0.2423, simple_loss=0.3089, pruned_loss=0.08784, over 21402.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2934, pruned_loss=0.06734, over 4277750.45 frames. ], batch size: 548, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:16:07,296 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.080e+02 5.707e+02 7.601e+02 1.173e+03 3.198e+03, threshold=1.520e+03, percent-clipped=6.0 2023-06-27 14:16:18,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1826172.0, ans=0.125 2023-06-27 14:16:20,639 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=12.0 2023-06-27 14:16:30,060 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:16:54,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.61 vs. limit=10.0 2023-06-27 14:16:58,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-27 14:17:06,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826352.0, ans=0.1 2023-06-27 14:17:11,026 INFO [train.py:996] (1/4) Epoch 10, batch 29950, loss[loss=0.2087, simple_loss=0.2844, pruned_loss=0.06648, over 21704.00 frames. 
], tot_loss[loss=0.2185, simple_loss=0.2965, pruned_loss=0.0703, over 4277951.99 frames. ], batch size: 298, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:17:27,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1826412.0, ans=0.125 2023-06-27 14:17:46,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1826472.0, ans=0.125 2023-06-27 14:18:03,361 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=15.0 2023-06-27 14:18:23,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1826592.0, ans=0.0 2023-06-27 14:18:31,700 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:19:04,988 INFO [train.py:996] (1/4) Epoch 10, batch 30000, loss[loss=0.2366, simple_loss=0.3318, pruned_loss=0.07067, over 21495.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2986, pruned_loss=0.07058, over 4277022.44 frames. ], batch size: 471, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:19:04,989 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 14:19:22,070 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2475, simple_loss=0.3412, pruned_loss=0.07692, over 1796401.00 frames. 2023-06-27 14:19:22,071 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 14:19:43,675 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.141e+02 6.862e+02 9.553e+02 1.677e+03 3.481e+03, threshold=1.911e+03, percent-clipped=29.0 2023-06-27 14:19:48,467 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-27 14:19:59,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-27 14:20:04,214 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-27 14:20:07,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826832.0, ans=0.1 2023-06-27 14:20:28,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826892.0, ans=0.1 2023-06-27 14:20:36,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826892.0, ans=0.1 2023-06-27 14:20:48,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826952.0, ans=0.1 2023-06-27 14:21:05,353 INFO [train.py:996] (1/4) Epoch 10, batch 30050, loss[loss=0.2127, simple_loss=0.3257, pruned_loss=0.04989, over 20740.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3011, pruned_loss=0.06746, over 4262562.59 frames. 
], batch size: 607, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:21:09,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1827012.0, ans=0.1 2023-06-27 14:22:01,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1827132.0, ans=0.125 2023-06-27 14:22:04,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1827192.0, ans=0.0 2023-06-27 14:22:06,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1827192.0, ans=0.2 2023-06-27 14:22:39,190 INFO [train.py:996] (1/4) Epoch 10, batch 30100, loss[loss=0.206, simple_loss=0.2649, pruned_loss=0.07358, over 21363.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3015, pruned_loss=0.06777, over 4265090.84 frames. ], batch size: 160, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:22:51,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1827312.0, ans=0.125 2023-06-27 14:22:58,224 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.865e+02 7.541e+02 1.187e+03 1.645e+03 3.691e+03, threshold=2.374e+03, percent-clipped=12.0 2023-06-27 14:23:34,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-27 14:23:35,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1827432.0, ans=0.0 2023-06-27 14:23:45,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1827492.0, ans=0.125 2023-06-27 14:23:59,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1827552.0, ans=0.125 2023-06-27 14:24:17,471 INFO [train.py:996] (1/4) Epoch 10, batch 30150, loss[loss=0.231, simple_loss=0.3058, pruned_loss=0.07812, over 21735.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2974, pruned_loss=0.06886, over 4256593.62 frames. ], batch size: 231, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:24:33,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1827672.0, ans=0.0 2023-06-27 14:25:15,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1827732.0, ans=0.2 2023-06-27 14:25:25,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1827732.0, ans=0.0 2023-06-27 14:25:38,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-27 14:25:46,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1827852.0, ans=0.09899494936611666 2023-06-27 14:26:02,896 INFO [train.py:996] (1/4) Epoch 10, batch 30200, loss[loss=0.2252, simple_loss=0.2864, pruned_loss=0.08205, over 20052.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2997, pruned_loss=0.06838, over 4261393.40 frames. 
], batch size: 703, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:26:18,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1827912.0, ans=0.0 2023-06-27 14:26:42,660 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.354e+02 6.809e+02 8.710e+02 1.204e+03 2.614e+03, threshold=1.742e+03, percent-clipped=2.0 2023-06-27 14:27:23,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1828092.0, ans=10.0 2023-06-27 14:27:32,452 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-27 14:28:02,219 INFO [train.py:996] (1/4) Epoch 10, batch 30250, loss[loss=0.2368, simple_loss=0.3263, pruned_loss=0.07364, over 17144.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3087, pruned_loss=0.07171, over 4263923.38 frames. ], batch size: 60, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:28:14,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1828212.0, ans=0.125 2023-06-27 14:28:20,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1828212.0, ans=0.0 2023-06-27 14:29:22,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1828452.0, ans=0.125 2023-06-27 14:29:38,339 INFO [train.py:996] (1/4) Epoch 10, batch 30300, loss[loss=0.1912, simple_loss=0.2635, pruned_loss=0.0594, over 21933.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3074, pruned_loss=0.07204, over 4264360.65 frames. ], batch size: 103, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:30:03,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 6.596e+02 9.409e+02 1.315e+03 2.834e+03, threshold=1.882e+03, percent-clipped=10.0 2023-06-27 14:30:25,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1828632.0, ans=0.5 2023-06-27 14:30:48,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1828692.0, ans=0.1 2023-06-27 14:31:27,452 INFO [train.py:996] (1/4) Epoch 10, batch 30350, loss[loss=0.2124, simple_loss=0.3055, pruned_loss=0.05966, over 21742.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3057, pruned_loss=0.07259, over 4263692.18 frames. ], batch size: 332, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:31:38,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-27 14:31:55,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1828872.0, ans=0.125 2023-06-27 14:31:59,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1828932.0, ans=0.0 2023-06-27 14:32:41,773 INFO [train.py:996] (1/4) Epoch 10, batch 30400, loss[loss=0.205, simple_loss=0.2552, pruned_loss=0.07739, over 20318.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2988, pruned_loss=0.07085, over 4245602.96 frames. 
], batch size: 703, lr: 2.86e-03, grad_scale: 32.0 2023-06-27 14:32:45,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1829112.0, ans=0.0 2023-06-27 14:32:50,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1829112.0, ans=0.0 2023-06-27 14:33:09,680 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.426e+02 7.954e+02 1.288e+03 1.926e+03 4.132e+03, threshold=2.577e+03, percent-clipped=26.0 2023-06-27 14:33:43,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1829292.0, ans=0.125 2023-06-27 14:34:08,235 INFO [train.py:996] (1/4) Epoch 10, batch 30450, loss[loss=0.2755, simple_loss=0.3968, pruned_loss=0.07705, over 19804.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2994, pruned_loss=0.07034, over 4189198.02 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:35:08,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1829592.0, ans=0.125 2023-06-27 14:35:14,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1829652.0, ans=0.125 2023-06-27 14:37:28,486 INFO [train.py:996] (1/4) Epoch 11, batch 0, loss[loss=0.1754, simple_loss=0.2384, pruned_loss=0.05622, over 20734.00 frames. ], tot_loss[loss=0.1754, simple_loss=0.2384, pruned_loss=0.05622, over 20734.00 frames. ], batch size: 609, lr: 2.72e-03, grad_scale: 32.0 2023-06-27 14:37:28,486 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 14:37:44,723 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2445, simple_loss=0.3464, pruned_loss=0.07127, over 1796401.00 frames. 2023-06-27 14:37:44,724 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 14:37:55,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1829676.0, ans=0.0 2023-06-27 14:38:23,070 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.704e+02 1.606e+03 2.605e+03 4.493e+03 1.142e+04, threshold=5.209e+03, percent-clipped=50.0 2023-06-27 14:38:24,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.19 vs. limit=15.0 2023-06-27 14:38:46,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1829796.0, ans=0.125 2023-06-27 14:39:04,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1829916.0, ans=0.125 2023-06-27 14:39:22,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1829916.0, ans=0.125 2023-06-27 14:39:25,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1829976.0, ans=0.0 2023-06-27 14:39:26,623 INFO [train.py:996] (1/4) Epoch 11, batch 50, loss[loss=0.2609, simple_loss=0.3739, pruned_loss=0.07396, over 21646.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3028, pruned_loss=0.07055, over 966163.25 frames. 
], batch size: 263, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:39:33,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1829976.0, ans=0.0 2023-06-27 14:39:58,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-06-27 14:40:11,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1830096.0, ans=0.0 2023-06-27 14:40:15,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1830096.0, ans=0.125 2023-06-27 14:40:35,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1830156.0, ans=0.02 2023-06-27 14:40:54,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1830216.0, ans=0.2 2023-06-27 14:41:08,863 INFO [train.py:996] (1/4) Epoch 11, batch 100, loss[loss=0.2584, simple_loss=0.34, pruned_loss=0.08835, over 21822.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3164, pruned_loss=0.07114, over 1701872.22 frames. ], batch size: 118, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:41:46,182 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.133e+02 5.871e+02 7.705e+02 1.160e+03 1.899e+03, threshold=1.541e+03, percent-clipped=0.0 2023-06-27 14:41:48,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830396.0, ans=0.1 2023-06-27 14:42:37,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=22.5 2023-06-27 14:42:51,592 INFO [train.py:996] (1/4) Epoch 11, batch 150, loss[loss=0.2408, simple_loss=0.341, pruned_loss=0.07027, over 21610.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3199, pruned_loss=0.07145, over 2268880.08 frames. ], batch size: 389, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:43:01,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1830576.0, ans=0.2 2023-06-27 14:43:56,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1830756.0, ans=0.0 2023-06-27 14:44:33,948 INFO [train.py:996] (1/4) Epoch 11, batch 200, loss[loss=0.2155, simple_loss=0.3095, pruned_loss=0.06071, over 21705.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3152, pruned_loss=0.07085, over 2720152.99 frames. ], batch size: 351, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:45:11,943 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.129e+02 7.270e+02 1.005e+03 1.466e+03 4.683e+03, threshold=2.010e+03, percent-clipped=22.0 2023-06-27 14:45:17,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1830996.0, ans=0.0 2023-06-27 14:45:18,131 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. 
limit=15.0 2023-06-27 14:45:51,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1831056.0, ans=0.0 2023-06-27 14:46:02,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1831116.0, ans=0.0 2023-06-27 14:46:17,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1831176.0, ans=0.125 2023-06-27 14:46:18,430 INFO [train.py:996] (1/4) Epoch 11, batch 250, loss[loss=0.2725, simple_loss=0.3432, pruned_loss=0.1009, over 21410.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.31, pruned_loss=0.07026, over 3069737.28 frames. ], batch size: 471, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:46:22,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831176.0, ans=0.1 2023-06-27 14:47:56,414 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-27 14:48:01,890 INFO [train.py:996] (1/4) Epoch 11, batch 300, loss[loss=0.1942, simple_loss=0.2583, pruned_loss=0.065, over 21211.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3056, pruned_loss=0.07144, over 3332668.94 frames. ], batch size: 608, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:48:33,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1831536.0, ans=0.04949747468305833 2023-06-27 14:48:40,911 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.322e+02 6.333e+02 9.156e+02 1.285e+03 2.394e+03, threshold=1.831e+03, percent-clipped=6.0 2023-06-27 14:48:41,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1831596.0, ans=0.0 2023-06-27 14:49:47,626 INFO [train.py:996] (1/4) Epoch 11, batch 350, loss[loss=0.1972, simple_loss=0.2764, pruned_loss=0.05899, over 21635.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2997, pruned_loss=0.07016, over 3541562.29 frames. ], batch size: 230, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:50:03,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1831836.0, ans=0.125 2023-06-27 14:50:10,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1831836.0, ans=0.2 2023-06-27 14:50:19,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831836.0, ans=0.1 2023-06-27 14:50:36,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1831896.0, ans=0.0 2023-06-27 14:51:14,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1832016.0, ans=0.0 2023-06-27 14:51:30,047 INFO [train.py:996] (1/4) Epoch 11, batch 400, loss[loss=0.2571, simple_loss=0.3349, pruned_loss=0.08959, over 21505.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2933, pruned_loss=0.0673, over 3700545.41 frames. 
], batch size: 508, lr: 2.72e-03, grad_scale: 32.0 2023-06-27 14:51:32,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1832076.0, ans=0.2 2023-06-27 14:51:32,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1832076.0, ans=0.125 2023-06-27 14:51:39,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.50 vs. limit=22.5 2023-06-27 14:52:09,625 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.211e+02 7.477e+02 1.167e+03 1.835e+03 4.227e+03, threshold=2.334e+03, percent-clipped=25.0 2023-06-27 14:52:15,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1832196.0, ans=0.0 2023-06-27 14:52:30,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1832256.0, ans=0.0 2023-06-27 14:52:40,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=22.5 2023-06-27 14:53:12,790 INFO [train.py:996] (1/4) Epoch 11, batch 450, loss[loss=0.2003, simple_loss=0.2703, pruned_loss=0.06515, over 21524.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2906, pruned_loss=0.06666, over 3830737.84 frames. ], batch size: 391, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:54:27,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1832556.0, ans=0.0 2023-06-27 14:54:30,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1832556.0, ans=0.1 2023-06-27 14:54:31,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-27 14:54:53,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1832616.0, ans=0.125 2023-06-27 14:54:57,311 INFO [train.py:996] (1/4) Epoch 11, batch 500, loss[loss=0.1964, simple_loss=0.2723, pruned_loss=0.06023, over 21858.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2872, pruned_loss=0.06628, over 3935625.24 frames. 
], batch size: 373, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:55:02,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1832676.0, ans=0.125 2023-06-27 14:55:12,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1832736.0, ans=0.04949747468305833 2023-06-27 14:55:34,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1832796.0, ans=0.125 2023-06-27 14:55:37,224 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.096e+02 9.470e+02 1.676e+03 2.580e+03 4.364e+03, threshold=3.351e+03, percent-clipped=30.0 2023-06-27 14:55:53,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1832796.0, ans=0.125 2023-06-27 14:56:04,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1832856.0, ans=0.0 2023-06-27 14:56:16,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1832916.0, ans=0.0 2023-06-27 14:56:18,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1832916.0, ans=0.025 2023-06-27 14:56:39,107 INFO [train.py:996] (1/4) Epoch 11, batch 550, loss[loss=0.2368, simple_loss=0.3342, pruned_loss=0.06972, over 21275.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2921, pruned_loss=0.06536, over 4011137.46 frames. ], batch size: 176, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:57:14,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1833036.0, ans=0.0 2023-06-27 14:57:50,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1833156.0, ans=0.0 2023-06-27 14:58:07,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1833216.0, ans=0.125 2023-06-27 14:58:22,129 INFO [train.py:996] (1/4) Epoch 11, batch 600, loss[loss=0.2085, simple_loss=0.3012, pruned_loss=0.05787, over 21320.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2963, pruned_loss=0.06524, over 4065737.78 frames. ], batch size: 176, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 14:58:50,028 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:58:50,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1833336.0, ans=0.1 2023-06-27 14:58:54,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1833336.0, ans=0.125 2023-06-27 14:59:00,929 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.101e+02 6.551e+02 9.996e+02 1.452e+03 3.285e+03, threshold=1.999e+03, percent-clipped=0.0 2023-06-27 14:59:19,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1833456.0, ans=0.2 2023-06-27 14:59:22,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. 
limit=15.0 2023-06-27 15:00:03,724 INFO [train.py:996] (1/4) Epoch 11, batch 650, loss[loss=0.1964, simple_loss=0.2696, pruned_loss=0.06158, over 21774.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2971, pruned_loss=0.06554, over 4114495.40 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:01:39,987 INFO [train.py:996] (1/4) Epoch 11, batch 700, loss[loss=0.1947, simple_loss=0.2684, pruned_loss=0.06048, over 21474.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2976, pruned_loss=0.06616, over 4159130.39 frames. ], batch size: 194, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:02:26,189 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 7.317e+02 1.195e+03 1.924e+03 5.182e+03, threshold=2.390e+03, percent-clipped=22.0 2023-06-27 15:02:43,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1834056.0, ans=0.0 2023-06-27 15:02:59,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1834056.0, ans=0.125 2023-06-27 15:03:07,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1834116.0, ans=0.125 2023-06-27 15:03:09,064 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:03:26,529 INFO [train.py:996] (1/4) Epoch 11, batch 750, loss[loss=0.2044, simple_loss=0.2818, pruned_loss=0.06354, over 21845.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2965, pruned_loss=0.06753, over 4184948.04 frames. ], batch size: 371, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:03:33,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1834176.0, ans=0.2 2023-06-27 15:03:44,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-27 15:04:37,068 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:05:09,840 INFO [train.py:996] (1/4) Epoch 11, batch 800, loss[loss=0.202, simple_loss=0.2748, pruned_loss=0.06461, over 21302.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2952, pruned_loss=0.06806, over 4203732.54 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:05:12,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1834476.0, ans=0.0 2023-06-27 15:05:51,243 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.162e+02 6.690e+02 1.036e+03 1.625e+03 3.290e+03, threshold=2.071e+03, percent-clipped=5.0 2023-06-27 15:05:58,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1834596.0, ans=0.125 2023-06-27 15:06:53,139 INFO [train.py:996] (1/4) Epoch 11, batch 850, loss[loss=0.2317, simple_loss=0.2882, pruned_loss=0.08758, over 21659.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2927, pruned_loss=0.06775, over 4227013.21 frames. 
], batch size: 441, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:07:26,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1834836.0, ans=0.1 2023-06-27 15:07:27,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1834836.0, ans=0.125 2023-06-27 15:07:31,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1834896.0, ans=0.125 2023-06-27 15:08:32,832 INFO [train.py:996] (1/4) Epoch 11, batch 900, loss[loss=0.1846, simple_loss=0.2776, pruned_loss=0.04578, over 21818.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.289, pruned_loss=0.06726, over 4239639.80 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:09:15,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1835196.0, ans=0.125 2023-06-27 15:09:18,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.069e+02 6.963e+02 1.051e+03 1.568e+03 3.283e+03, threshold=2.103e+03, percent-clipped=8.0 2023-06-27 15:10:10,467 INFO [train.py:996] (1/4) Epoch 11, batch 950, loss[loss=0.1942, simple_loss=0.2835, pruned_loss=0.05251, over 21745.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2867, pruned_loss=0.06636, over 4252760.99 frames. ], batch size: 298, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:10:34,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1835436.0, ans=0.125 2023-06-27 15:11:03,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1835496.0, ans=0.1 2023-06-27 15:11:53,100 INFO [train.py:996] (1/4) Epoch 11, batch 1000, loss[loss=0.2139, simple_loss=0.2963, pruned_loss=0.06577, over 21784.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2859, pruned_loss=0.06611, over 4266847.01 frames. ], batch size: 282, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:11:55,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1835676.0, ans=0.0 2023-06-27 15:12:13,959 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:12:44,391 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.396e+02 1.258e+03 1.842e+03 3.420e+03, threshold=2.515e+03, percent-clipped=20.0 2023-06-27 15:12:50,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-27 15:13:36,705 INFO [train.py:996] (1/4) Epoch 11, batch 1050, loss[loss=0.228, simple_loss=0.2993, pruned_loss=0.07838, over 21843.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2847, pruned_loss=0.06545, over 4276762.08 frames. 
], batch size: 118, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:13:54,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1835976.0, ans=0.025 2023-06-27 15:14:31,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1836096.0, ans=0.125 2023-06-27 15:14:34,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1836096.0, ans=0.1 2023-06-27 15:14:48,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1836156.0, ans=0.125 2023-06-27 15:15:01,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1836156.0, ans=0.125 2023-06-27 15:15:13,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1836216.0, ans=0.0 2023-06-27 15:15:26,357 INFO [train.py:996] (1/4) Epoch 11, batch 1100, loss[loss=0.2612, simple_loss=0.3342, pruned_loss=0.09404, over 21730.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.286, pruned_loss=0.06544, over 4285327.70 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:15:34,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1836276.0, ans=0.125 2023-06-27 15:16:10,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1836396.0, ans=0.125 2023-06-27 15:16:12,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=22.5 2023-06-27 15:16:13,028 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.192e+02 8.562e+02 1.240e+03 1.886e+03 2.880e+03, threshold=2.480e+03, percent-clipped=5.0 2023-06-27 15:16:24,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1836396.0, ans=0.125 2023-06-27 15:17:09,915 INFO [train.py:996] (1/4) Epoch 11, batch 1150, loss[loss=0.2056, simple_loss=0.3043, pruned_loss=0.05346, over 21636.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.287, pruned_loss=0.06512, over 4287357.75 frames. ], batch size: 230, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:17:13,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1836576.0, ans=0.0 2023-06-27 15:18:13,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1836756.0, ans=0.0 2023-06-27 15:18:16,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1836756.0, ans=0.1 2023-06-27 15:18:40,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1836816.0, ans=0.125 2023-06-27 15:18:53,539 INFO [train.py:996] (1/4) Epoch 11, batch 1200, loss[loss=0.2177, simple_loss=0.2947, pruned_loss=0.07034, over 21472.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2892, pruned_loss=0.06551, over 4288675.72 frames. 
], batch size: 194, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 15:18:54,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1836876.0, ans=0.0 2023-06-27 15:19:47,119 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.628e+02 7.428e+02 1.142e+03 1.630e+03 3.056e+03, threshold=2.284e+03, percent-clipped=6.0 2023-06-27 15:20:33,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-27 15:20:37,526 INFO [train.py:996] (1/4) Epoch 11, batch 1250, loss[loss=0.263, simple_loss=0.3305, pruned_loss=0.09779, over 21528.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2928, pruned_loss=0.06711, over 4290492.30 frames. ], batch size: 507, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:21:20,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1837296.0, ans=0.125 2023-06-27 15:21:20,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1837296.0, ans=0.125 2023-06-27 15:21:32,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1837296.0, ans=0.0 2023-06-27 15:22:21,898 INFO [train.py:996] (1/4) Epoch 11, batch 1300, loss[loss=0.2035, simple_loss=0.2997, pruned_loss=0.05368, over 21714.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2938, pruned_loss=0.06662, over 4287046.67 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:22:54,693 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.29 vs. limit=22.5 2023-06-27 15:23:16,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.358e+02 6.400e+02 8.214e+02 1.269e+03 2.290e+03, threshold=1.643e+03, percent-clipped=1.0 2023-06-27 15:24:11,936 INFO [train.py:996] (1/4) Epoch 11, batch 1350, loss[loss=0.2565, simple_loss=0.3264, pruned_loss=0.09336, over 21751.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.295, pruned_loss=0.06679, over 4289775.38 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:24:22,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1837776.0, ans=0.0 2023-06-27 15:25:06,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1837896.0, ans=0.2 2023-06-27 15:25:31,924 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-27 15:25:53,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1838016.0, ans=0.0 2023-06-27 15:25:56,186 INFO [train.py:996] (1/4) Epoch 11, batch 1400, loss[loss=0.205, simple_loss=0.2816, pruned_loss=0.06418, over 21852.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.294, pruned_loss=0.06785, over 4294095.82 frames. 
], batch size: 107, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:26:10,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1838076.0, ans=0.125 2023-06-27 15:26:30,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-27 15:26:46,609 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.209e+02 7.064e+02 1.087e+03 1.603e+03 3.118e+03, threshold=2.174e+03, percent-clipped=20.0 2023-06-27 15:26:52,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1838196.0, ans=0.035 2023-06-27 15:26:54,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1838196.0, ans=0.1 2023-06-27 15:27:17,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-27 15:27:35,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-27 15:27:38,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1838376.0, ans=0.125 2023-06-27 15:27:39,794 INFO [train.py:996] (1/4) Epoch 11, batch 1450, loss[loss=0.2141, simple_loss=0.29, pruned_loss=0.06909, over 21632.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2933, pruned_loss=0.06771, over 4291255.41 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:28:27,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1838496.0, ans=0.0 2023-06-27 15:28:47,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1838556.0, ans=0.125 2023-06-27 15:29:19,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1838616.0, ans=0.125 2023-06-27 15:29:27,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1838676.0, ans=0.0 2023-06-27 15:29:28,845 INFO [train.py:996] (1/4) Epoch 11, batch 1500, loss[loss=0.2461, simple_loss=0.3374, pruned_loss=0.07742, over 21700.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2956, pruned_loss=0.0689, over 4296920.96 frames. ], batch size: 389, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:29:32,742 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:29:48,902 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. 
limit=10.0 2023-06-27 15:29:51,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1838736.0, ans=0.125 2023-06-27 15:29:55,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1838736.0, ans=0.125 2023-06-27 15:30:14,626 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 7.080e+02 9.690e+02 1.530e+03 3.266e+03, threshold=1.938e+03, percent-clipped=8.0 2023-06-27 15:30:37,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1838856.0, ans=0.1 2023-06-27 15:30:44,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1838856.0, ans=0.125 2023-06-27 15:30:52,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1838916.0, ans=0.125 2023-06-27 15:31:14,246 INFO [train.py:996] (1/4) Epoch 11, batch 1550, loss[loss=0.1935, simple_loss=0.2643, pruned_loss=0.06132, over 21310.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2934, pruned_loss=0.0693, over 4299282.51 frames. ], batch size: 143, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:31:47,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.23 vs. limit=10.0 2023-06-27 15:32:23,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1839156.0, ans=0.0 2023-06-27 15:33:01,771 INFO [train.py:996] (1/4) Epoch 11, batch 1600, loss[loss=0.2267, simple_loss=0.3158, pruned_loss=0.0688, over 21830.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2919, pruned_loss=0.06838, over 4298178.48 frames. ], batch size: 372, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:33:11,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-27 15:33:42,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1839396.0, ans=0.125 2023-06-27 15:33:52,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1839396.0, ans=0.0 2023-06-27 15:33:53,890 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 6.555e+02 8.833e+02 1.502e+03 3.809e+03, threshold=1.767e+03, percent-clipped=10.0 2023-06-27 15:34:11,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-27 15:34:35,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1839516.0, ans=0.0 2023-06-27 15:34:47,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1839576.0, ans=0.1 2023-06-27 15:34:48,929 INFO [train.py:996] (1/4) Epoch 11, batch 1650, loss[loss=0.1832, simple_loss=0.2745, pruned_loss=0.0459, over 21584.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2907, pruned_loss=0.06795, over 4296442.27 frames. 
], batch size: 230, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:35:21,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1839636.0, ans=0.125 2023-06-27 15:35:34,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1839696.0, ans=0.125 2023-06-27 15:35:59,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1839756.0, ans=0.125 2023-06-27 15:36:26,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1839816.0, ans=0.125 2023-06-27 15:36:36,403 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=12.0 2023-06-27 15:36:37,006 INFO [train.py:996] (1/4) Epoch 11, batch 1700, loss[loss=0.2189, simple_loss=0.2976, pruned_loss=0.07006, over 21467.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2943, pruned_loss=0.0695, over 4285957.03 frames. ], batch size: 548, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:37:30,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1839996.0, ans=0.1 2023-06-27 15:37:35,056 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.435e+02 5.947e+02 9.216e+02 1.351e+03 2.792e+03, threshold=1.843e+03, percent-clipped=11.0 2023-06-27 15:37:56,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1840056.0, ans=0.125 2023-06-27 15:38:24,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1840116.0, ans=0.1 2023-06-27 15:38:30,369 INFO [train.py:996] (1/4) Epoch 11, batch 1750, loss[loss=0.2101, simple_loss=0.2994, pruned_loss=0.06043, over 21456.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2964, pruned_loss=0.06916, over 4283813.18 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:38:42,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-27 15:38:54,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1840236.0, ans=0.125 2023-06-27 15:39:42,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1840356.0, ans=0.125 2023-06-27 15:40:11,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1840416.0, ans=0.0 2023-06-27 15:40:16,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1840476.0, ans=0.125 2023-06-27 15:40:22,768 INFO [train.py:996] (1/4) Epoch 11, batch 1800, loss[loss=0.2008, simple_loss=0.2982, pruned_loss=0.05165, over 21616.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2978, pruned_loss=0.06833, over 4285317.32 frames. 
], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:40:26,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1840476.0, ans=0.0 2023-06-27 15:40:28,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1840476.0, ans=0.125 2023-06-27 15:40:34,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1840476.0, ans=0.0 2023-06-27 15:40:54,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1840536.0, ans=0.0 2023-06-27 15:41:03,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=22.5 2023-06-27 15:41:13,917 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 6.830e+02 1.090e+03 1.802e+03 4.605e+03, threshold=2.180e+03, percent-clipped=19.0 2023-06-27 15:41:14,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1840596.0, ans=0.0 2023-06-27 15:41:18,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1840596.0, ans=0.1 2023-06-27 15:42:09,055 INFO [train.py:996] (1/4) Epoch 11, batch 1850, loss[loss=0.2056, simple_loss=0.2939, pruned_loss=0.05861, over 21804.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2965, pruned_loss=0.06552, over 4287537.36 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:42:36,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1840836.0, ans=0.07 2023-06-27 15:42:53,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1840896.0, ans=0.2 2023-06-27 15:42:57,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1840896.0, ans=0.0 2023-06-27 15:43:34,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1841016.0, ans=0.125 2023-06-27 15:43:35,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1841016.0, ans=0.125 2023-06-27 15:43:42,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1841016.0, ans=0.125 2023-06-27 15:43:47,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1841016.0, ans=0.0 2023-06-27 15:43:53,667 INFO [train.py:996] (1/4) Epoch 11, batch 1900, loss[loss=0.1955, simple_loss=0.2603, pruned_loss=0.06536, over 21296.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2974, pruned_loss=0.06532, over 4287131.72 frames. ], batch size: 608, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:43:54,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1841076.0, ans=0.125 2023-06-27 15:43:54,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. 
limit=12.0 2023-06-27 15:44:43,247 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.131e+02 8.434e+02 1.477e+03 2.094e+03 4.159e+03, threshold=2.954e+03, percent-clipped=22.0 2023-06-27 15:44:47,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1841196.0, ans=0.125 2023-06-27 15:44:59,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1841256.0, ans=0.125 2023-06-27 15:45:41,633 INFO [train.py:996] (1/4) Epoch 11, batch 1950, loss[loss=0.1934, simple_loss=0.2596, pruned_loss=0.06362, over 21682.00 frames. ], tot_loss[loss=0.212, simple_loss=0.294, pruned_loss=0.065, over 4276905.09 frames. ], batch size: 333, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:45:42,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1841376.0, ans=0.0 2023-06-27 15:46:42,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-27 15:46:58,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1841616.0, ans=0.95 2023-06-27 15:47:11,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=22.5 2023-06-27 15:47:26,628 INFO [train.py:996] (1/4) Epoch 11, batch 2000, loss[loss=0.2158, simple_loss=0.3094, pruned_loss=0.06107, over 19925.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2871, pruned_loss=0.06281, over 4265538.52 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 15:48:13,536 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 7.614e+02 1.079e+03 2.039e+03 3.848e+03, threshold=2.158e+03, percent-clipped=8.0 2023-06-27 15:48:17,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1841796.0, ans=0.05 2023-06-27 15:48:27,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-27 15:48:37,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1841856.0, ans=0.0 2023-06-27 15:48:43,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1841916.0, ans=0.125 2023-06-27 15:49:09,595 INFO [train.py:996] (1/4) Epoch 11, batch 2050, loss[loss=0.2096, simple_loss=0.293, pruned_loss=0.06309, over 21353.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2887, pruned_loss=0.06272, over 4267299.63 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:49:45,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1842036.0, ans=0.2 2023-06-27 15:49:59,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1842096.0, ans=0.04949747468305833 2023-06-27 15:50:59,227 INFO [train.py:996] (1/4) Epoch 11, batch 2100, loss[loss=0.2696, simple_loss=0.3252, pruned_loss=0.107, over 21728.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2916, pruned_loss=0.06485, over 4268923.00 frames. 
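Note on the optim.py clipping lines: the five grad-norm quartiles read as (min, 25%, median, 75%, max) over recently observed gradient norms, and the printed threshold equals Clipping_scale times the median (2.0 * 1.477e+03 = 2.954e+03 in the entry above); percent-clipped is then the share of recent steps whose norm exceeded that threshold. A rough sketch, assuming a simple buffer of recent norms stands in for whatever history the real optimizer keeps:

import torch

def clip_with_adaptive_threshold(parameters, recent_norms, clipping_scale=2.0):
    # recent_norms: list of gradient norms from recent steps (placeholder history).
    norms = torch.tensor(recent_norms)
    quartiles = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = float(clipping_scale * quartiles[2])      # scale * median
    total_norm = torch.nn.utils.clip_grad_norm_(parameters, max_norm=threshold)
    was_clipped = bool(total_norm > threshold)             # feeds "percent-clipped"
    return quartiles, threshold, was_clipped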
], batch size: 507, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:51:06,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1842276.0, ans=0.2 2023-06-27 15:51:36,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1842396.0, ans=0.5 2023-06-27 15:51:46,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.397e+02 7.542e+02 1.130e+03 1.676e+03 4.140e+03, threshold=2.259e+03, percent-clipped=14.0 2023-06-27 15:51:52,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1842396.0, ans=0.125 2023-06-27 15:52:21,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-27 15:52:44,202 INFO [train.py:996] (1/4) Epoch 11, batch 2150, loss[loss=0.2017, simple_loss=0.2636, pruned_loss=0.0699, over 21473.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2941, pruned_loss=0.0664, over 4265295.92 frames. ], batch size: 195, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:52:48,295 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:52:57,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-27 15:53:08,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1842636.0, ans=0.5 2023-06-27 15:53:31,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0 2023-06-27 15:53:41,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-27 15:54:13,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1842816.0, ans=0.07 2023-06-27 15:54:22,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-27 15:54:29,177 INFO [train.py:996] (1/4) Epoch 11, batch 2200, loss[loss=0.2353, simple_loss=0.3036, pruned_loss=0.08352, over 21672.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2956, pruned_loss=0.06689, over 4275325.48 frames. 
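Note on the scaling.py ScheduledFloat lines: each named value (balancer probabilities, skip rates, dropout_p, scale_min and so on) is re-evaluated as a function of the global batch_count, and the result is printed as ans. A minimal sketch of such a batch-count-keyed schedule; the class name and breakpoints below are illustrative, not the ones used by this model:

class ScheduledValue:
    def __init__(self, points):
        # points: list of (batch_count, value) pairs defining a piecewise-linear schedule.
        self.points = sorted(points)

    def __call__(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]

# e.g. a skip rate that decays from 0.5 to 0.0 over the first 50k batches and
# then stays at 0.0 for the batch counts seen in this section:
conv_skip_rate = ScheduledValue([(0, 0.5), (50_000, 0.0)])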
], batch size: 441, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:55:00,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1842936.0, ans=0.125 2023-06-27 15:55:16,679 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 6.339e+02 9.896e+02 1.686e+03 3.946e+03, threshold=1.979e+03, percent-clipped=15.0 2023-06-27 15:55:27,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1843056.0, ans=0.125 2023-06-27 15:56:04,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1843116.0, ans=0.125 2023-06-27 15:56:04,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1843116.0, ans=0.0 2023-06-27 15:56:14,356 INFO [train.py:996] (1/4) Epoch 11, batch 2250, loss[loss=0.2118, simple_loss=0.297, pruned_loss=0.06326, over 21681.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.294, pruned_loss=0.06559, over 4277056.92 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:56:18,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1843176.0, ans=0.0 2023-06-27 15:57:52,262 INFO [train.py:996] (1/4) Epoch 11, batch 2300, loss[loss=0.1841, simple_loss=0.2517, pruned_loss=0.05823, over 21507.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2891, pruned_loss=0.0638, over 4280194.47 frames. ], batch size: 230, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:58:06,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1843476.0, ans=0.1 2023-06-27 15:58:39,377 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 6.436e+02 1.038e+03 1.737e+03 5.031e+03, threshold=2.076e+03, percent-clipped=15.0 2023-06-27 15:58:43,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1843596.0, ans=0.2 2023-06-27 15:58:51,684 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:59:05,611 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-27 15:59:08,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1843656.0, ans=0.125 2023-06-27 15:59:36,622 INFO [train.py:996] (1/4) Epoch 11, batch 2350, loss[loss=0.1954, simple_loss=0.2553, pruned_loss=0.06778, over 21559.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2845, pruned_loss=0.06326, over 4276481.02 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:59:48,663 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:01:21,955 INFO [train.py:996] (1/4) Epoch 11, batch 2400, loss[loss=0.2193, simple_loss=0.2917, pruned_loss=0.07349, over 21640.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2832, pruned_loss=0.0647, over 4271471.97 frames. ], batch size: 230, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 16:02:06,043 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.14 vs. 
limit=22.5 2023-06-27 16:02:21,960 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.317e+02 6.915e+02 1.084e+03 1.714e+03 3.712e+03, threshold=2.167e+03, percent-clipped=11.0 2023-06-27 16:03:02,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1844316.0, ans=0.125 2023-06-27 16:03:07,420 INFO [train.py:996] (1/4) Epoch 11, batch 2450, loss[loss=0.2104, simple_loss=0.2805, pruned_loss=0.07017, over 21742.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2853, pruned_loss=0.06663, over 4268621.59 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:03:11,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1844376.0, ans=0.1 2023-06-27 16:03:56,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1844496.0, ans=0.125 2023-06-27 16:04:14,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1844556.0, ans=0.0 2023-06-27 16:04:15,237 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.10 vs. limit=15.0 2023-06-27 16:04:25,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1844556.0, ans=0.125 2023-06-27 16:04:27,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1844556.0, ans=0.125 2023-06-27 16:04:28,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1844556.0, ans=0.125 2023-06-27 16:04:45,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1844616.0, ans=0.125 2023-06-27 16:04:49,999 INFO [train.py:996] (1/4) Epoch 11, batch 2500, loss[loss=0.1885, simple_loss=0.2544, pruned_loss=0.06135, over 21337.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2849, pruned_loss=0.06696, over 4276119.22 frames. ], batch size: 160, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:05:43,703 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.461e+02 7.979e+02 1.093e+03 1.704e+03 3.202e+03, threshold=2.185e+03, percent-clipped=12.0 2023-06-27 16:06:29,949 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-27 16:06:34,027 INFO [train.py:996] (1/4) Epoch 11, batch 2550, loss[loss=0.2352, simple_loss=0.3026, pruned_loss=0.08388, over 15913.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2866, pruned_loss=0.06601, over 4260279.79 frames. 
], batch size: 64, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:06:41,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1844976.0, ans=0.1 2023-06-27 16:06:44,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1844976.0, ans=0.125 2023-06-27 16:07:03,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1845036.0, ans=0.2 2023-06-27 16:07:24,249 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=15.0 2023-06-27 16:07:39,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-27 16:08:18,041 INFO [train.py:996] (1/4) Epoch 11, batch 2600, loss[loss=0.2602, simple_loss=0.3238, pruned_loss=0.09834, over 21332.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2912, pruned_loss=0.06812, over 4258663.08 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:08:54,487 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-27 16:09:12,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.122e+02 7.338e+02 1.284e+03 1.915e+03 4.312e+03, threshold=2.567e+03, percent-clipped=18.0 2023-06-27 16:09:21,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=12.0 2023-06-27 16:09:32,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1845456.0, ans=0.2 2023-06-27 16:09:58,131 INFO [train.py:996] (1/4) Epoch 11, batch 2650, loss[loss=0.2333, simple_loss=0.3068, pruned_loss=0.07993, over 21380.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2931, pruned_loss=0.06978, over 4269707.43 frames. ], batch size: 159, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:10:11,198 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:10:52,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1845696.0, ans=0.125 2023-06-27 16:11:43,781 INFO [train.py:996] (1/4) Epoch 11, batch 2700, loss[loss=0.2024, simple_loss=0.2835, pruned_loss=0.06072, over 21739.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2911, pruned_loss=0.06945, over 4267481.03 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:12:11,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-27 16:12:43,564 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.729e+02 6.625e+02 9.246e+02 1.409e+03 2.648e+03, threshold=1.849e+03, percent-clipped=2.0 2023-06-27 16:13:28,867 INFO [train.py:996] (1/4) Epoch 11, batch 2750, loss[loss=0.2448, simple_loss=0.3274, pruned_loss=0.0811, over 21607.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2921, pruned_loss=0.06997, over 4265577.58 frames. 
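Note on the Whitening lines: each reports a measured whitening metric for one group of activations against a limit (e.g. metric=9.72 vs. limit=22.5 above). One plausible way to compute such a metric, given here as an assumption rather than as the module's actual code, is the ratio between the mean squared eigenvalue and the squared mean eigenvalue of the channel covariance: it equals 1.0 for perfectly whitened features and grows as the spectrum becomes uneven, and it can be computed from traces without an eigendecomposition.

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels) activations for one group of channels.
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                     # channel covariance, (C, C)
    mean_eig = torch.diagonal(cov).mean()              # trace(cov) / C
    mean_sq_eig = (cov * cov).sum() / cov.shape[0]     # trace(cov @ cov) / C
    return mean_sq_eig / (mean_eig ** 2 + 1e-20)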
], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:14:05,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-27 16:14:22,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1846296.0, ans=0.125 2023-06-27 16:14:26,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-27 16:14:40,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1846356.0, ans=0.1 2023-06-27 16:15:15,718 INFO [train.py:996] (1/4) Epoch 11, batch 2800, loss[loss=0.2442, simple_loss=0.3352, pruned_loss=0.07661, over 21751.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2944, pruned_loss=0.06958, over 4273798.24 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 16:15:36,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1846476.0, ans=0.2 2023-06-27 16:15:57,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.27 vs. limit=15.0 2023-06-27 16:16:18,337 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.904e+02 7.981e+02 1.210e+03 1.745e+03 3.756e+03, threshold=2.419e+03, percent-clipped=24.0 2023-06-27 16:16:20,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1846596.0, ans=0.125 2023-06-27 16:16:38,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1846656.0, ans=0.0 2023-06-27 16:16:49,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1846716.0, ans=0.125 2023-06-27 16:16:51,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1846716.0, ans=0.05 2023-06-27 16:16:53,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1846716.0, ans=0.0 2023-06-27 16:17:03,352 INFO [train.py:996] (1/4) Epoch 11, batch 2850, loss[loss=0.2082, simple_loss=0.3052, pruned_loss=0.05557, over 21654.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2977, pruned_loss=0.07145, over 4275247.81 frames. 
], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:17:39,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1846836.0, ans=0.125 2023-06-27 16:17:45,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1846836.0, ans=0.125 2023-06-27 16:18:15,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1846956.0, ans=0.0 2023-06-27 16:18:16,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1846956.0, ans=0.015 2023-06-27 16:18:40,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1847076.0, ans=0.125 2023-06-27 16:18:41,446 INFO [train.py:996] (1/4) Epoch 11, batch 2900, loss[loss=0.1624, simple_loss=0.2239, pruned_loss=0.05039, over 21277.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2936, pruned_loss=0.07039, over 4278309.36 frames. ], batch size: 176, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:19:01,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1847136.0, ans=0.125 2023-06-27 16:19:19,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1847136.0, ans=0.125 2023-06-27 16:19:26,981 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-27 16:19:27,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1847136.0, ans=15.0 2023-06-27 16:19:45,511 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.568e+02 6.840e+02 9.553e+02 1.645e+03 3.808e+03, threshold=1.911e+03, percent-clipped=8.0 2023-06-27 16:20:25,216 INFO [train.py:996] (1/4) Epoch 11, batch 2950, loss[loss=0.2215, simple_loss=0.2981, pruned_loss=0.07248, over 21557.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2945, pruned_loss=0.07058, over 4282279.26 frames. ], batch size: 548, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:21:36,361 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-27 16:22:14,902 INFO [train.py:996] (1/4) Epoch 11, batch 3000, loss[loss=0.2196, simple_loss=0.3007, pruned_loss=0.06926, over 21835.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2971, pruned_loss=0.07031, over 4285901.64 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:22:14,902 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 16:22:35,493 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2528, simple_loss=0.3433, pruned_loss=0.08109, over 1796401.00 frames. 
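Note on the validation entries: at batch 3000 training pauses, the dev data is scored in full (the frame count printed after the validation loss is the amount of dev data evaluated), and the peak GPU memory seen so far is reported on the next line. A schematic of such a pass; loss_fn is a placeholder for whatever computes the per-batch loss in the real script:

import torch

def compute_validation_loss(loss_fn, valid_loader):
    # loss_fn(batch) -> (scalar loss tensor, num_frames); placeholder interface.
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = loss_fn(batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    return tot_loss / tot_frames, tot_frames

# The "Maximum memory allocated" line that follows corresponds to something like
# torch.cuda.max_memory_allocated() // (1024 * 1024), reported in MB.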
2023-06-27 16:22:35,494 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 16:22:56,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1847736.0, ans=0.125 2023-06-27 16:23:27,444 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.016e+02 6.559e+02 9.881e+02 1.581e+03 3.511e+03, threshold=1.976e+03, percent-clipped=15.0 2023-06-27 16:23:34,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1847856.0, ans=0.125 2023-06-27 16:23:55,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1847916.0, ans=0.125 2023-06-27 16:24:16,758 INFO [train.py:996] (1/4) Epoch 11, batch 3050, loss[loss=0.1356, simple_loss=0.2097, pruned_loss=0.0308, over 16921.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2979, pruned_loss=0.06909, over 4284815.78 frames. ], batch size: 60, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:25:00,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1848096.0, ans=0.2 2023-06-27 16:26:03,790 INFO [train.py:996] (1/4) Epoch 11, batch 3100, loss[loss=0.1973, simple_loss=0.2865, pruned_loss=0.05406, over 21465.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2969, pruned_loss=0.06791, over 4287275.17 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:26:43,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1848396.0, ans=0.125 2023-06-27 16:26:54,554 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.218e+02 9.868e+02 1.604e+03 2.316e+03 3.970e+03, threshold=3.207e+03, percent-clipped=39.0 2023-06-27 16:27:20,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1848456.0, ans=0.125 2023-06-27 16:27:54,285 INFO [train.py:996] (1/4) Epoch 11, batch 3150, loss[loss=0.2137, simple_loss=0.2913, pruned_loss=0.06799, over 21682.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2974, pruned_loss=0.06826, over 4283571.54 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:28:21,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.12 vs. limit=10.0 2023-06-27 16:28:51,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1848696.0, ans=0.125 2023-06-27 16:29:15,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1848756.0, ans=0.0 2023-06-27 16:29:40,809 INFO [train.py:996] (1/4) Epoch 11, batch 3200, loss[loss=0.2695, simple_loss=0.3504, pruned_loss=0.0943, over 21452.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3006, pruned_loss=0.06895, over 4278813.67 frames. 
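Note on the grad_scale field: it moves between values such as 16.0 and 32.0 from one interval to the next, which is the signature of mixed-precision training with a dynamic loss scaler (the scale is halved after steps whose gradients overflow and periodically doubled otherwise). A minimal sketch of that loop, with the model, optimizer and batch handling schematic and the initial scale chosen only for illustration:

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=32.0)

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with autocast():
        loss = model(batch)            # assumes the forward pass returns a scalar loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)             # skipped internally if gradients overflowed
    scaler.update()                    # halves the scale on overflow, grows it later
    return loss.detach(), scaler.get_scale()   # the second value is what "grad_scale" reports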
], batch size: 471, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:29:56,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1848936.0, ans=0.07 2023-06-27 16:30:06,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1848936.0, ans=0.2 2023-06-27 16:30:30,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1848996.0, ans=0.125 2023-06-27 16:30:42,998 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.812e+02 8.312e+02 1.188e+03 1.817e+03 3.495e+03, threshold=2.376e+03, percent-clipped=3.0 2023-06-27 16:31:03,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.79 vs. limit=5.0 2023-06-27 16:31:11,372 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-27 16:31:25,352 INFO [train.py:996] (1/4) Epoch 11, batch 3250, loss[loss=0.2234, simple_loss=0.2989, pruned_loss=0.07395, over 21439.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3007, pruned_loss=0.06925, over 4277387.43 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:32:29,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1849296.0, ans=0.1 2023-06-27 16:32:44,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1849356.0, ans=0.2 2023-06-27 16:32:56,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1849416.0, ans=0.125 2023-06-27 16:33:11,131 INFO [train.py:996] (1/4) Epoch 11, batch 3300, loss[loss=0.2197, simple_loss=0.29, pruned_loss=0.07466, over 21790.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2986, pruned_loss=0.06985, over 4274886.58 frames. ], batch size: 372, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:33:29,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1849536.0, ans=0.125 2023-06-27 16:33:50,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1849536.0, ans=0.125 2023-06-27 16:34:15,394 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 6.747e+02 1.095e+03 2.044e+03 4.676e+03, threshold=2.190e+03, percent-clipped=15.0 2023-06-27 16:34:50,739 INFO [train.py:996] (1/4) Epoch 11, batch 3350, loss[loss=0.1882, simple_loss=0.2443, pruned_loss=0.06603, over 20031.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3009, pruned_loss=0.0701, over 4281980.20 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:35:03,797 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. limit=10.0 2023-06-27 16:35:33,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. 
limit=6.0 2023-06-27 16:35:34,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1849836.0, ans=0.125 2023-06-27 16:35:38,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1849836.0, ans=0.125 2023-06-27 16:36:01,452 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:36:35,659 INFO [train.py:996] (1/4) Epoch 11, batch 3400, loss[loss=0.1982, simple_loss=0.2826, pruned_loss=0.05691, over 21688.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3015, pruned_loss=0.07071, over 4280298.31 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:37:12,690 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-27 16:37:19,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-27 16:37:38,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1850196.0, ans=0.1 2023-06-27 16:37:43,317 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 6.438e+02 9.614e+02 1.434e+03 2.571e+03, threshold=1.923e+03, percent-clipped=1.0 2023-06-27 16:38:00,225 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.84 vs. limit=15.0 2023-06-27 16:38:06,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1850316.0, ans=0.0 2023-06-27 16:38:24,831 INFO [train.py:996] (1/4) Epoch 11, batch 3450, loss[loss=0.2105, simple_loss=0.2692, pruned_loss=0.0759, over 21180.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2955, pruned_loss=0.06875, over 4279128.33 frames. ], batch size: 176, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:38:37,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1850376.0, ans=0.125 2023-06-27 16:38:55,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=12.0 2023-06-27 16:39:56,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1850616.0, ans=0.0 2023-06-27 16:40:04,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1850616.0, ans=0.125 2023-06-27 16:40:13,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-06-27 16:40:15,589 INFO [train.py:996] (1/4) Epoch 11, batch 3500, loss[loss=0.2844, simple_loss=0.3614, pruned_loss=0.1037, over 21571.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3053, pruned_loss=0.07328, over 4280580.52 frames. 
], batch size: 414, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:40:26,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1850676.0, ans=0.04949747468305833 2023-06-27 16:41:14,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.683e+02 8.214e+02 1.340e+03 2.218e+03 5.014e+03, threshold=2.681e+03, percent-clipped=29.0 2023-06-27 16:41:14,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1850856.0, ans=0.125 2023-06-27 16:41:17,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1850856.0, ans=0.025 2023-06-27 16:41:27,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=22.5 2023-06-27 16:42:05,012 INFO [train.py:996] (1/4) Epoch 11, batch 3550, loss[loss=0.2095, simple_loss=0.2925, pruned_loss=0.06319, over 21326.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3082, pruned_loss=0.07539, over 4285019.21 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:42:07,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1850976.0, ans=0.0 2023-06-27 16:42:36,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-27 16:42:51,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1851096.0, ans=0.1 2023-06-27 16:42:53,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.44 vs. limit=22.5 2023-06-27 16:42:55,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1851096.0, ans=0.125 2023-06-27 16:43:49,623 INFO [train.py:996] (1/4) Epoch 11, batch 3600, loss[loss=0.2401, simple_loss=0.309, pruned_loss=0.08563, over 21273.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3014, pruned_loss=0.07464, over 4268783.52 frames. ], batch size: 143, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:44:07,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1851276.0, ans=0.125 2023-06-27 16:44:15,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1851336.0, ans=0.125 2023-06-27 16:44:44,961 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.210e+02 6.431e+02 1.048e+03 1.688e+03 3.904e+03, threshold=2.095e+03, percent-clipped=4.0 2023-06-27 16:45:36,132 INFO [train.py:996] (1/4) Epoch 11, batch 3650, loss[loss=0.1882, simple_loss=0.2834, pruned_loss=0.04656, over 21686.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3011, pruned_loss=0.07516, over 4267509.96 frames. 
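Note on the lr field: it drifts very slowly downward within the epoch (2.71e-03 earlier in this section, 2.70e-03 by now), consistent with a schedule that decays smoothly in both the batch index and the epoch. One rule in that family, an Eden-style formula, is sketched below as an assumption; the constants are placeholders, not this run's configuration:

def eden_style_lr(base_lr, batch, epoch, lr_batches=5000.0, lr_epochs=3.5):
    # Each factor is ~1 early on and decays roughly like x ** -0.5 once the
    # batch index (or epoch) grows past its constant.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor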
], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:45:54,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851576.0, ans=0.1 2023-06-27 16:46:16,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1851696.0, ans=0.125 2023-06-27 16:46:55,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1851756.0, ans=0.1 2023-06-27 16:47:19,916 INFO [train.py:996] (1/4) Epoch 11, batch 3700, loss[loss=0.2095, simple_loss=0.292, pruned_loss=0.06352, over 21915.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2989, pruned_loss=0.07311, over 4278648.88 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:47:20,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1851876.0, ans=0.125 2023-06-27 16:47:26,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-27 16:47:59,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1851996.0, ans=0.0 2023-06-27 16:48:13,913 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.881e+02 6.744e+02 1.016e+03 1.702e+03 3.129e+03, threshold=2.032e+03, percent-clipped=14.0 2023-06-27 16:49:02,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1852116.0, ans=0.125 2023-06-27 16:49:04,438 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-27 16:49:04,969 INFO [train.py:996] (1/4) Epoch 11, batch 3750, loss[loss=0.1869, simple_loss=0.2608, pruned_loss=0.05654, over 21484.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2976, pruned_loss=0.07277, over 4281264.32 frames. ], batch size: 212, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:49:36,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1852236.0, ans=0.125 2023-06-27 16:49:36,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-27 16:50:06,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1852356.0, ans=0.0 2023-06-27 16:50:29,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-27 16:50:35,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1852416.0, ans=0.0 2023-06-27 16:50:49,290 INFO [train.py:996] (1/4) Epoch 11, batch 3800, loss[loss=0.2497, simple_loss=0.3375, pruned_loss=0.08092, over 21521.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2939, pruned_loss=0.07162, over 4272024.66 frames. 
], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:51:08,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1852536.0, ans=0.125 2023-06-27 16:51:09,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-27 16:51:11,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1852536.0, ans=0.125 2023-06-27 16:51:28,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1852596.0, ans=0.2 2023-06-27 16:51:44,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1852596.0, ans=0.0 2023-06-27 16:51:47,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.392e+02 7.087e+02 9.540e+02 1.301e+03 2.936e+03, threshold=1.908e+03, percent-clipped=6.0 2023-06-27 16:52:29,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1852716.0, ans=0.0 2023-06-27 16:52:32,369 INFO [train.py:996] (1/4) Epoch 11, batch 3850, loss[loss=0.2235, simple_loss=0.3507, pruned_loss=0.04818, over 20838.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2931, pruned_loss=0.07131, over 4272206.06 frames. ], batch size: 607, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:52:46,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1852776.0, ans=0.1 2023-06-27 16:52:59,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1852836.0, ans=0.1 2023-06-27 16:53:35,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5 2023-06-27 16:54:12,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=12.0 2023-06-27 16:54:14,661 INFO [train.py:996] (1/4) Epoch 11, batch 3900, loss[loss=0.209, simple_loss=0.2859, pruned_loss=0.06607, over 21894.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2886, pruned_loss=0.07052, over 4264199.09 frames. 
], batch size: 118, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:54:30,453 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:54:57,751 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:55:09,164 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.347e+02 6.134e+02 8.883e+02 1.369e+03 3.236e+03, threshold=1.777e+03, percent-clipped=7.0 2023-06-27 16:55:23,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1853256.0, ans=0.04949747468305833 2023-06-27 16:55:28,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1853256.0, ans=0.0 2023-06-27 16:55:48,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1853316.0, ans=0.125 2023-06-27 16:55:54,569 INFO [train.py:996] (1/4) Epoch 11, batch 3950, loss[loss=0.1698, simple_loss=0.2269, pruned_loss=0.05633, over 20620.00 frames. ], tot_loss[loss=0.215, simple_loss=0.29, pruned_loss=0.06994, over 4270982.47 frames. ], batch size: 607, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:56:10,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1853436.0, ans=0.2 2023-06-27 16:57:18,981 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-27 16:57:32,925 INFO [train.py:996] (1/4) Epoch 11, batch 4000, loss[loss=0.2333, simple_loss=0.2956, pruned_loss=0.08547, over 21619.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2862, pruned_loss=0.06707, over 4270700.41 frames. ], batch size: 548, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:57:36,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1853676.0, ans=0.0 2023-06-27 16:58:26,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1853796.0, ans=0.125 2023-06-27 16:58:37,197 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 6.762e+02 1.217e+03 2.027e+03 5.671e+03, threshold=2.434e+03, percent-clipped=30.0 2023-06-27 16:58:42,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1853856.0, ans=0.0 2023-06-27 16:58:59,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1853916.0, ans=0.125 2023-06-27 16:59:17,789 INFO [train.py:996] (1/4) Epoch 11, batch 4050, loss[loss=0.1876, simple_loss=0.251, pruned_loss=0.0621, over 20752.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2861, pruned_loss=0.06557, over 4271322.05 frames. 
], batch size: 608, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:59:18,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1853976.0, ans=0.125 2023-06-27 16:59:23,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1853976.0, ans=0.2 2023-06-27 16:59:23,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1853976.0, ans=0.125 2023-06-27 16:59:36,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1854036.0, ans=0.2 2023-06-27 16:59:37,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-27 16:59:39,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1854036.0, ans=0.0 2023-06-27 17:00:11,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1854096.0, ans=0.125 2023-06-27 17:00:39,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1854156.0, ans=0.125 2023-06-27 17:00:47,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-27 17:00:50,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1854216.0, ans=0.125 2023-06-27 17:00:53,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1854216.0, ans=0.07 2023-06-27 17:01:01,309 INFO [train.py:996] (1/4) Epoch 11, batch 4100, loss[loss=0.2045, simple_loss=0.2834, pruned_loss=0.06282, over 21901.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.287, pruned_loss=0.06582, over 4277346.52 frames. ], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:01:34,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1854336.0, ans=0.125 2023-06-27 17:01:35,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1854336.0, ans=0.1 2023-06-27 17:01:39,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-27 17:02:11,182 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.335e+02 7.301e+02 1.093e+03 1.524e+03 3.311e+03, threshold=2.186e+03, percent-clipped=4.0 2023-06-27 17:02:20,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1854456.0, ans=0.125 2023-06-27 17:02:23,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1854456.0, ans=0.125 2023-06-27 17:02:45,149 INFO [train.py:996] (1/4) Epoch 11, batch 4150, loss[loss=0.2115, simple_loss=0.2953, pruned_loss=0.06387, over 21723.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2866, pruned_loss=0.06339, over 4276162.17 frames. 
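Note on the tot_loss fields: the fractional frame counts (e.g. "over 4276162.17 frames" just above) suggest a decayed, frame-weighted running aggregate rather than a plain sum over an integer number of frames. A small sketch of such an accumulator; the decay constant is a placeholder:

class RunningLoss:
    def __init__(self, decay=0.995):
        self.decay = decay
        self.weighted_loss = 0.0    # decayed sum of (per-frame loss * frames)
        self.frames = 0.0           # decayed frame count (can become fractional)

    def update(self, batch_loss, batch_frames):
        self.weighted_loss = self.weighted_loss * self.decay + batch_loss * batch_frames
        self.frames = self.frames * self.decay + batch_frames
        return self.weighted_loss / self.frames   # the value printed as tot_loss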
], batch size: 333, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:02:55,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.54 vs. limit=6.0 2023-06-27 17:03:00,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.61 vs. limit=15.0 2023-06-27 17:03:33,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1854696.0, ans=0.125 2023-06-27 17:03:35,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1854696.0, ans=0.1 2023-06-27 17:03:57,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1854756.0, ans=0.2 2023-06-27 17:04:01,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1854756.0, ans=0.0 2023-06-27 17:04:27,388 INFO [train.py:996] (1/4) Epoch 11, batch 4200, loss[loss=0.1884, simple_loss=0.2694, pruned_loss=0.0537, over 21668.00 frames. ], tot_loss[loss=0.206, simple_loss=0.286, pruned_loss=0.06296, over 4255304.42 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:04:43,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1854876.0, ans=0.0 2023-06-27 17:05:15,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1854996.0, ans=0.025 2023-06-27 17:05:34,655 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.415e+02 6.010e+02 8.416e+02 1.376e+03 4.083e+03, threshold=1.683e+03, percent-clipped=10.0 2023-06-27 17:05:38,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1855056.0, ans=0.0 2023-06-27 17:06:13,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1855176.0, ans=0.0 2023-06-27 17:06:14,237 INFO [train.py:996] (1/4) Epoch 11, batch 4250, loss[loss=0.3167, simple_loss=0.4059, pruned_loss=0.1138, over 21429.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2942, pruned_loss=0.06592, over 4254798.95 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:06:50,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1855236.0, ans=0.07 2023-06-27 17:06:55,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1855236.0, ans=0.0 2023-06-27 17:07:34,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1855416.0, ans=0.125 2023-06-27 17:07:34,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-27 17:08:00,637 INFO [train.py:996] (1/4) Epoch 11, batch 4300, loss[loss=0.1499, simple_loss=0.2012, pruned_loss=0.04925, over 16892.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2984, pruned_loss=0.06715, over 4258827.53 frames. 
], batch size: 60, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:08:09,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1855476.0, ans=0.1 2023-06-27 17:08:44,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1855596.0, ans=0.125 2023-06-27 17:08:55,703 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.494e+02 7.202e+02 1.029e+03 1.570e+03 4.728e+03, threshold=2.058e+03, percent-clipped=18.0 2023-06-27 17:09:27,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1855716.0, ans=0.2 2023-06-27 17:09:39,124 INFO [train.py:996] (1/4) Epoch 11, batch 4350, loss[loss=0.2093, simple_loss=0.2778, pruned_loss=0.07042, over 21902.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2971, pruned_loss=0.0666, over 4257565.32 frames. ], batch size: 107, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:09:53,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1855776.0, ans=0.0 2023-06-27 17:10:01,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1855836.0, ans=0.125 2023-06-27 17:10:22,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1855896.0, ans=0.04949747468305833 2023-06-27 17:10:22,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1855896.0, ans=0.125 2023-06-27 17:10:25,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1855896.0, ans=0.05 2023-06-27 17:10:45,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1855956.0, ans=0.125 2023-06-27 17:10:56,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1855956.0, ans=0.0 2023-06-27 17:10:56,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1855956.0, ans=0.025 2023-06-27 17:11:13,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1856016.0, ans=0.125 2023-06-27 17:11:29,246 INFO [train.py:996] (1/4) Epoch 11, batch 4400, loss[loss=0.1937, simple_loss=0.2754, pruned_loss=0.05595, over 21143.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2935, pruned_loss=0.06532, over 4253661.97 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 17:11:44,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1856076.0, ans=0.2 2023-06-27 17:11:56,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1856136.0, ans=0.0 2023-06-27 17:11:56,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.66 vs. 
limit=15.0 2023-06-27 17:12:32,722 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.456e+02 7.940e+02 1.162e+03 1.682e+03 5.044e+03, threshold=2.325e+03, percent-clipped=15.0 2023-06-27 17:12:46,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=15.0 2023-06-27 17:13:00,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-27 17:13:14,920 INFO [train.py:996] (1/4) Epoch 11, batch 4450, loss[loss=0.2578, simple_loss=0.3606, pruned_loss=0.07753, over 21877.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3019, pruned_loss=0.0671, over 4259931.20 frames. ], batch size: 317, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:13:20,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1856376.0, ans=0.07 2023-06-27 17:13:38,184 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-27 17:14:46,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1856616.0, ans=0.125 2023-06-27 17:14:59,763 INFO [train.py:996] (1/4) Epoch 11, batch 4500, loss[loss=0.2143, simple_loss=0.3137, pruned_loss=0.0575, over 21857.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.303, pruned_loss=0.06856, over 4272026.52 frames. ], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:15:18,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1856736.0, ans=22.5 2023-06-27 17:15:53,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1856796.0, ans=0.125 2023-06-27 17:16:01,160 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.839e+02 8.296e+02 1.426e+03 1.842e+03 5.527e+03, threshold=2.851e+03, percent-clipped=18.0 2023-06-27 17:16:38,268 INFO [train.py:996] (1/4) Epoch 11, batch 4550, loss[loss=0.2747, simple_loss=0.3524, pruned_loss=0.09853, over 21800.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3042, pruned_loss=0.06834, over 4276601.68 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:17:14,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1857036.0, ans=0.125 2023-06-27 17:18:00,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1857156.0, ans=0.0 2023-06-27 17:18:21,927 INFO [train.py:996] (1/4) Epoch 11, batch 4600, loss[loss=0.1968, simple_loss=0.2841, pruned_loss=0.05477, over 21824.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3059, pruned_loss=0.06923, over 4280561.95 frames. 
], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:19:07,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1857396.0, ans=0.015 2023-06-27 17:19:33,355 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.640e+02 1.105e+03 1.523e+03 3.294e+03, threshold=2.209e+03, percent-clipped=1.0 2023-06-27 17:19:37,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1857456.0, ans=0.125 2023-06-27 17:19:47,330 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 17:19:57,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1857516.0, ans=0.0 2023-06-27 17:20:04,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1857576.0, ans=0.0 2023-06-27 17:20:05,572 INFO [train.py:996] (1/4) Epoch 11, batch 4650, loss[loss=0.1461, simple_loss=0.2206, pruned_loss=0.03585, over 21578.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3004, pruned_loss=0.06854, over 4287084.39 frames. ], batch size: 230, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:20:10,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1857576.0, ans=0.125 2023-06-27 17:20:29,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-27 17:20:57,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1857696.0, ans=0.0 2023-06-27 17:20:57,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1857696.0, ans=0.2 2023-06-27 17:21:45,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1857816.0, ans=0.125 2023-06-27 17:21:49,625 INFO [train.py:996] (1/4) Epoch 11, batch 4700, loss[loss=0.1996, simple_loss=0.2612, pruned_loss=0.06899, over 21400.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2948, pruned_loss=0.06684, over 4274011.54 frames. ], batch size: 473, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:21:55,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1857876.0, ans=0.2 2023-06-27 17:22:11,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1857936.0, ans=0.125 2023-06-27 17:22:39,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. 
limit=15.0 2023-06-27 17:22:46,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1857996.0, ans=0.0 2023-06-27 17:22:59,952 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 6.918e+02 1.097e+03 1.707e+03 4.002e+03, threshold=2.193e+03, percent-clipped=11.0 2023-06-27 17:23:18,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1858116.0, ans=0.125 2023-06-27 17:23:20,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1858116.0, ans=0.0 2023-06-27 17:23:31,327 INFO [train.py:996] (1/4) Epoch 11, batch 4750, loss[loss=0.2193, simple_loss=0.2825, pruned_loss=0.07802, over 21561.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2886, pruned_loss=0.06635, over 4283524.53 frames. ], batch size: 548, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:24:46,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1858356.0, ans=0.07 2023-06-27 17:24:47,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1858356.0, ans=0.125 2023-06-27 17:25:05,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1858416.0, ans=0.0 2023-06-27 17:25:06,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1858416.0, ans=0.2 2023-06-27 17:25:20,797 INFO [train.py:996] (1/4) Epoch 11, batch 4800, loss[loss=0.1953, simple_loss=0.2851, pruned_loss=0.05275, over 21787.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2884, pruned_loss=0.06724, over 4280228.21 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 17:25:33,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1858476.0, ans=0.125 2023-06-27 17:25:41,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1858536.0, ans=0.0 2023-06-27 17:26:28,648 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.577e+02 8.092e+02 1.102e+03 1.736e+03 3.587e+03, threshold=2.204e+03, percent-clipped=14.0 2023-06-27 17:26:52,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1858716.0, ans=0.0 2023-06-27 17:26:52,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1858716.0, ans=0.09899494936611666 2023-06-27 17:27:03,218 INFO [train.py:996] (1/4) Epoch 11, batch 4850, loss[loss=0.2098, simple_loss=0.2841, pruned_loss=0.06777, over 21625.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2879, pruned_loss=0.06625, over 4273463.30 frames. ], batch size: 389, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:27:41,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. 
limit=15.0 2023-06-27 17:27:58,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1858896.0, ans=0.125 2023-06-27 17:27:58,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1858896.0, ans=0.2 2023-06-27 17:28:08,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1858956.0, ans=0.125 2023-06-27 17:28:10,902 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0 2023-06-27 17:28:40,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1859076.0, ans=0.125 2023-06-27 17:28:41,956 INFO [train.py:996] (1/4) Epoch 11, batch 4900, loss[loss=0.2185, simple_loss=0.2946, pruned_loss=0.07119, over 21860.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2887, pruned_loss=0.06677, over 4279248.85 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:28:50,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1859076.0, ans=0.2 2023-06-27 17:29:56,064 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 7.657e+02 1.361e+03 1.915e+03 3.497e+03, threshold=2.723e+03, percent-clipped=17.0 2023-06-27 17:30:11,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1859316.0, ans=0.09899494936611666 2023-06-27 17:30:24,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1859316.0, ans=0.125 2023-06-27 17:30:30,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.39 vs. limit=10.0 2023-06-27 17:30:31,168 INFO [train.py:996] (1/4) Epoch 11, batch 4950, loss[loss=0.183, simple_loss=0.2821, pruned_loss=0.04192, over 21765.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2922, pruned_loss=0.06539, over 4278223.20 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:31:35,155 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=15.0 2023-06-27 17:32:09,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1859616.0, ans=0.125 2023-06-27 17:32:14,086 INFO [train.py:996] (1/4) Epoch 11, batch 5000, loss[loss=0.2375, simple_loss=0.3103, pruned_loss=0.08229, over 21845.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2902, pruned_loss=0.06272, over 4274028.33 frames. 
], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:32:21,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1859676.0, ans=0.125 2023-06-27 17:32:50,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1859736.0, ans=0.125 2023-06-27 17:33:14,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1859796.0, ans=0.0 2023-06-27 17:33:20,244 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.265e+02 5.938e+02 8.345e+02 1.344e+03 2.733e+03, threshold=1.669e+03, percent-clipped=1.0 2023-06-27 17:33:40,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1859916.0, ans=0.0 2023-06-27 17:33:50,175 INFO [train.py:996] (1/4) Epoch 11, batch 5050, loss[loss=0.2583, simple_loss=0.3309, pruned_loss=0.09292, over 21835.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2909, pruned_loss=0.06428, over 4284457.74 frames. ], batch size: 118, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:34:18,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1860036.0, ans=0.0 2023-06-27 17:34:27,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-27 17:34:50,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1860096.0, ans=10.0 2023-06-27 17:34:50,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1860096.0, ans=0.125 2023-06-27 17:35:19,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1860216.0, ans=0.2 2023-06-27 17:35:33,541 INFO [train.py:996] (1/4) Epoch 11, batch 5100, loss[loss=0.1741, simple_loss=0.2482, pruned_loss=0.04997, over 16986.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2893, pruned_loss=0.06474, over 4283756.12 frames. ], batch size: 60, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:35:50,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1860276.0, ans=0.5 2023-06-27 17:36:18,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-27 17:36:47,573 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.357e+02 6.699e+02 8.715e+02 1.182e+03 3.007e+03, threshold=1.743e+03, percent-clipped=11.0 2023-06-27 17:37:14,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-27 17:37:23,084 INFO [train.py:996] (1/4) Epoch 11, batch 5150, loss[loss=0.2204, simple_loss=0.2875, pruned_loss=0.07667, over 21287.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2874, pruned_loss=0.06579, over 4286302.67 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:37:31,150 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.01 vs. 
limit=15.0 2023-06-27 17:37:37,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1860576.0, ans=0.125 2023-06-27 17:38:10,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1860696.0, ans=0.0 2023-06-27 17:39:12,462 INFO [train.py:996] (1/4) Epoch 11, batch 5200, loss[loss=0.2124, simple_loss=0.3064, pruned_loss=0.05915, over 21590.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.289, pruned_loss=0.06654, over 4275362.62 frames. ], batch size: 230, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:39:35,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-27 17:39:39,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1860936.0, ans=0.0 2023-06-27 17:40:17,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.467e+02 7.745e+02 1.179e+03 1.665e+03 4.294e+03, threshold=2.357e+03, percent-clipped=21.0 2023-06-27 17:40:26,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0 2023-06-27 17:40:46,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1861116.0, ans=0.125 2023-06-27 17:41:00,885 INFO [train.py:996] (1/4) Epoch 11, batch 5250, loss[loss=0.1868, simple_loss=0.2739, pruned_loss=0.04983, over 21749.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2925, pruned_loss=0.06531, over 4273566.08 frames. ], batch size: 124, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:41:14,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1861176.0, ans=0.0 2023-06-27 17:41:16,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-27 17:41:46,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1861296.0, ans=0.125 2023-06-27 17:42:07,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.78 vs. limit=22.5 2023-06-27 17:42:32,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1861416.0, ans=0.125 2023-06-27 17:42:41,258 INFO [train.py:996] (1/4) Epoch 11, batch 5300, loss[loss=0.2068, simple_loss=0.2809, pruned_loss=0.06631, over 21869.00 frames. ], tot_loss[loss=0.212, simple_loss=0.292, pruned_loss=0.06606, over 4287635.34 frames. ], batch size: 391, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:42:56,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1861536.0, ans=0.125 2023-06-27 17:42:57,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. 
limit=15.0 2023-06-27 17:43:38,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1861656.0, ans=0.125 2023-06-27 17:43:38,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=22.5 2023-06-27 17:43:39,366 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.786e+02 7.949e+02 1.214e+03 1.979e+03 3.974e+03, threshold=2.428e+03, percent-clipped=14.0 2023-06-27 17:43:45,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1861656.0, ans=0.1 2023-06-27 17:44:21,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-27 17:44:21,751 INFO [train.py:996] (1/4) Epoch 11, batch 5350, loss[loss=0.2294, simple_loss=0.293, pruned_loss=0.08293, over 21786.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2912, pruned_loss=0.06752, over 4299154.61 frames. ], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:45:19,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1861956.0, ans=0.125 2023-06-27 17:45:41,004 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=12.0 2023-06-27 17:46:01,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1862016.0, ans=0.0 2023-06-27 17:46:05,871 INFO [train.py:996] (1/4) Epoch 11, batch 5400, loss[loss=0.2404, simple_loss=0.2947, pruned_loss=0.09311, over 21782.00 frames. ], tot_loss[loss=0.213, simple_loss=0.29, pruned_loss=0.06803, over 4304979.77 frames. ], batch size: 508, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:46:19,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-27 17:46:35,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1862136.0, ans=0.1 2023-06-27 17:47:07,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 6.450e+02 1.066e+03 1.376e+03 3.123e+03, threshold=2.132e+03, percent-clipped=3.0 2023-06-27 17:47:27,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1862316.0, ans=0.0 2023-06-27 17:47:35,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1862316.0, ans=0.125 2023-06-27 17:47:50,482 INFO [train.py:996] (1/4) Epoch 11, batch 5450, loss[loss=0.1967, simple_loss=0.2774, pruned_loss=0.05801, over 21324.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2904, pruned_loss=0.06633, over 4296199.92 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:48:26,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1862436.0, ans=0.125 2023-06-27 17:48:51,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. 
limit=12.0 2023-06-27 17:49:13,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1862556.0, ans=0.2 2023-06-27 17:49:40,241 INFO [train.py:996] (1/4) Epoch 11, batch 5500, loss[loss=0.2428, simple_loss=0.3415, pruned_loss=0.07201, over 21277.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2956, pruned_loss=0.06399, over 4290298.37 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:50:47,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1862856.0, ans=0.2 2023-06-27 17:50:49,747 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.051e+02 7.572e+02 1.528e+03 2.313e+03 5.179e+03, threshold=3.055e+03, percent-clipped=29.0 2023-06-27 17:50:57,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1862856.0, ans=0.0 2023-06-27 17:51:06,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-27 17:51:24,525 INFO [train.py:996] (1/4) Epoch 11, batch 5550, loss[loss=0.1737, simple_loss=0.2788, pruned_loss=0.0343, over 21610.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2958, pruned_loss=0.06142, over 4291035.50 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:51:25,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1862976.0, ans=0.125 2023-06-27 17:51:49,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1863036.0, ans=0.125 2023-06-27 17:51:55,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1863036.0, ans=0.125 2023-06-27 17:52:22,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1863096.0, ans=0.0 2023-06-27 17:52:41,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1863156.0, ans=0.125 2023-06-27 17:53:04,455 INFO [train.py:996] (1/4) Epoch 11, batch 5600, loss[loss=0.2013, simple_loss=0.284, pruned_loss=0.05925, over 21279.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2947, pruned_loss=0.05935, over 4282435.46 frames. 
], batch size: 176, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:53:15,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1863276.0, ans=0.2 2023-06-27 17:53:24,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1863336.0, ans=0.0 2023-06-27 17:53:42,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1863336.0, ans=0.1 2023-06-27 17:53:52,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1863396.0, ans=0.125 2023-06-27 17:54:04,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1863456.0, ans=0.0 2023-06-27 17:54:06,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1863456.0, ans=0.125 2023-06-27 17:54:13,845 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.087e+02 7.294e+02 1.095e+03 1.659e+03 3.151e+03, threshold=2.190e+03, percent-clipped=1.0 2023-06-27 17:54:41,739 INFO [train.py:996] (1/4) Epoch 11, batch 5650, loss[loss=0.224, simple_loss=0.3017, pruned_loss=0.07316, over 21796.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2985, pruned_loss=0.06195, over 4284369.71 frames. ], batch size: 298, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:54:47,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1863576.0, ans=0.0 2023-06-27 17:55:17,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1863636.0, ans=0.125 2023-06-27 17:55:40,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-06-27 17:55:56,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1863756.0, ans=0.2 2023-06-27 17:56:13,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1863816.0, ans=0.0 2023-06-27 17:56:19,746 INFO [train.py:996] (1/4) Epoch 11, batch 5700, loss[loss=0.1968, simple_loss=0.2921, pruned_loss=0.05071, over 21735.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2968, pruned_loss=0.06368, over 4290813.05 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:56:47,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1863936.0, ans=0.0 2023-06-27 17:57:32,517 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 6.609e+02 9.381e+02 1.350e+03 3.463e+03, threshold=1.876e+03, percent-clipped=9.0 2023-06-27 17:57:33,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1864056.0, ans=0.125 2023-06-27 17:57:53,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-27 17:58:13,668 INFO [train.py:996] (1/4) Epoch 11, batch 5750, loss[loss=0.2009, simple_loss=0.3057, pruned_loss=0.04807, over 21206.00 frames. 
], tot_loss[loss=0.2073, simple_loss=0.2928, pruned_loss=0.06088, over 4283716.07 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:58:18,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1864176.0, ans=0.1 2023-06-27 17:58:18,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1864176.0, ans=0.0 2023-06-27 17:58:27,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1864176.0, ans=0.1 2023-06-27 17:59:28,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1864356.0, ans=0.0 2023-06-27 17:59:30,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1864416.0, ans=0.125 2023-06-27 17:59:56,947 INFO [train.py:996] (1/4) Epoch 11, batch 5800, loss[loss=0.2445, simple_loss=0.3479, pruned_loss=0.07059, over 21242.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2942, pruned_loss=0.05986, over 4279169.08 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:59:58,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0 2023-06-27 18:00:05,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1864476.0, ans=0.125 2023-06-27 18:01:03,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1864656.0, ans=0.1 2023-06-27 18:01:04,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.155e+02 1.088e+03 1.847e+03 4.141e+03, threshold=2.176e+03, percent-clipped=25.0 2023-06-27 18:01:16,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1864656.0, ans=0.0 2023-06-27 18:01:26,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1864716.0, ans=0.125 2023-06-27 18:01:41,151 INFO [train.py:996] (1/4) Epoch 11, batch 5850, loss[loss=0.1658, simple_loss=0.264, pruned_loss=0.03379, over 21716.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2922, pruned_loss=0.05645, over 4273437.54 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:02:33,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1864896.0, ans=0.0 2023-06-27 18:02:42,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1864956.0, ans=0.1 2023-06-27 18:03:17,845 INFO [train.py:996] (1/4) Epoch 11, batch 5900, loss[loss=0.2135, simple_loss=0.2967, pruned_loss=0.06518, over 19927.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2857, pruned_loss=0.05248, over 4272353.11 frames. 
], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:04:09,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1865196.0, ans=0.2 2023-06-27 18:04:28,081 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 6.471e+02 9.679e+02 1.352e+03 2.438e+03, threshold=1.936e+03, percent-clipped=4.0 2023-06-27 18:04:37,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-27 18:04:54,771 INFO [train.py:996] (1/4) Epoch 11, batch 5950, loss[loss=0.1962, simple_loss=0.2543, pruned_loss=0.06904, over 21343.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2843, pruned_loss=0.05555, over 4282295.55 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:05:05,134 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:05:49,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1865496.0, ans=0.0 2023-06-27 18:05:53,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1865556.0, ans=0.2 2023-06-27 18:06:11,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1865556.0, ans=0.2 2023-06-27 18:06:37,189 INFO [train.py:996] (1/4) Epoch 11, batch 6000, loss[loss=0.1746, simple_loss=0.2223, pruned_loss=0.06339, over 19953.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.279, pruned_loss=0.05784, over 4265517.13 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:06:37,190 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 18:06:56,340 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2612, simple_loss=0.354, pruned_loss=0.08419, over 1796401.00 frames. 2023-06-27 18:06:56,341 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 18:08:10,067 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.353e+02 5.907e+02 8.109e+02 1.325e+03 2.971e+03, threshold=1.622e+03, percent-clipped=7.0 2023-06-27 18:08:11,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-27 18:08:25,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1865916.0, ans=0.1 2023-06-27 18:08:39,958 INFO [train.py:996] (1/4) Epoch 11, batch 6050, loss[loss=0.1842, simple_loss=0.2609, pruned_loss=0.05376, over 21437.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2741, pruned_loss=0.05933, over 4275209.29 frames. ], batch size: 473, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:09:09,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1866036.0, ans=0.2 2023-06-27 18:10:17,460 INFO [train.py:996] (1/4) Epoch 11, batch 6100, loss[loss=0.2337, simple_loss=0.3001, pruned_loss=0.08362, over 21790.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2739, pruned_loss=0.05792, over 4272914.19 frames. 
], batch size: 112, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:11:29,681 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.047e+02 7.065e+02 1.029e+03 1.365e+03 3.489e+03, threshold=2.059e+03, percent-clipped=16.0 2023-06-27 18:11:31,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-27 18:11:59,721 INFO [train.py:996] (1/4) Epoch 11, batch 6150, loss[loss=0.2015, simple_loss=0.2778, pruned_loss=0.06266, over 21517.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2776, pruned_loss=0.05965, over 4270641.63 frames. ], batch size: 212, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:12:53,548 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:13:38,553 INFO [train.py:996] (1/4) Epoch 11, batch 6200, loss[loss=0.2501, simple_loss=0.3283, pruned_loss=0.08596, over 21546.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2812, pruned_loss=0.06074, over 4266637.22 frames. ], batch size: 471, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:13:54,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1866876.0, ans=0.0 2023-06-27 18:14:01,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-27 18:14:11,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1866936.0, ans=0.2 2023-06-27 18:14:41,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1867056.0, ans=0.0 2023-06-27 18:14:52,448 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 7.354e+02 1.075e+03 1.607e+03 4.153e+03, threshold=2.150e+03, percent-clipped=10.0 2023-06-27 18:15:03,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1867116.0, ans=0.05 2023-06-27 18:15:07,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1867116.0, ans=0.2 2023-06-27 18:15:18,548 INFO [train.py:996] (1/4) Epoch 11, batch 6250, loss[loss=0.1981, simple_loss=0.2993, pruned_loss=0.04846, over 21785.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2871, pruned_loss=0.06079, over 4264878.95 frames. ], batch size: 282, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:16:11,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1867296.0, ans=0.0 2023-06-27 18:16:18,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-27 18:16:50,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1867416.0, ans=0.0 2023-06-27 18:17:10,369 INFO [train.py:996] (1/4) Epoch 11, batch 6300, loss[loss=0.2316, simple_loss=0.3058, pruned_loss=0.07868, over 21944.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2901, pruned_loss=0.05975, over 4260256.89 frames. 
], batch size: 113, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:17:10,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1867476.0, ans=0.125 2023-06-27 18:17:57,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1867596.0, ans=0.0 2023-06-27 18:18:14,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1867656.0, ans=0.0 2023-06-27 18:18:17,773 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.254e+02 6.166e+02 8.295e+02 1.136e+03 2.739e+03, threshold=1.659e+03, percent-clipped=3.0 2023-06-27 18:18:21,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1867656.0, ans=0.125 2023-06-27 18:18:30,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=12.0 2023-06-27 18:18:52,469 INFO [train.py:996] (1/4) Epoch 11, batch 6350, loss[loss=0.2247, simple_loss=0.2937, pruned_loss=0.07783, over 21901.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2926, pruned_loss=0.06316, over 4275155.32 frames. ], batch size: 316, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:19:28,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1867836.0, ans=0.125 2023-06-27 18:19:29,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1867896.0, ans=0.1 2023-06-27 18:20:24,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1868016.0, ans=0.125 2023-06-27 18:20:40,584 INFO [train.py:996] (1/4) Epoch 11, batch 6400, loss[loss=0.255, simple_loss=0.3312, pruned_loss=0.08939, over 21810.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2974, pruned_loss=0.0671, over 4280305.11 frames. ], batch size: 118, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:21:06,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1868136.0, ans=0.125 2023-06-27 18:21:25,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1868196.0, ans=0.0 2023-06-27 18:21:55,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.666e+02 7.590e+02 1.060e+03 1.570e+03 3.138e+03, threshold=2.120e+03, percent-clipped=19.0 2023-06-27 18:22:23,551 INFO [train.py:996] (1/4) Epoch 11, batch 6450, loss[loss=0.2037, simple_loss=0.286, pruned_loss=0.06075, over 21154.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.3007, pruned_loss=0.0672, over 4273128.48 frames. ], batch size: 143, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:22:32,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. 
limit=15.0 2023-06-27 18:22:51,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1868436.0, ans=0.0 2023-06-27 18:24:00,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1868616.0, ans=0.125 2023-06-27 18:24:04,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1868616.0, ans=0.95 2023-06-27 18:24:06,994 INFO [train.py:996] (1/4) Epoch 11, batch 6500, loss[loss=0.2183, simple_loss=0.3142, pruned_loss=0.06121, over 21557.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2946, pruned_loss=0.06562, over 4259400.52 frames. ], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:24:10,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1868676.0, ans=0.1 2023-06-27 18:24:18,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1868676.0, ans=0.0 2023-06-27 18:24:19,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1868676.0, ans=0.1 2023-06-27 18:24:39,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1868736.0, ans=0.0 2023-06-27 18:25:14,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1868856.0, ans=0.2 2023-06-27 18:25:20,917 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.722e+02 7.121e+02 1.016e+03 1.758e+03 3.430e+03, threshold=2.032e+03, percent-clipped=12.0 2023-06-27 18:25:48,834 INFO [train.py:996] (1/4) Epoch 11, batch 6550, loss[loss=0.2133, simple_loss=0.2738, pruned_loss=0.07635, over 20115.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.292, pruned_loss=0.0643, over 4253557.95 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:25:57,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1868976.0, ans=0.95 2023-06-27 18:26:02,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1868976.0, ans=0.0 2023-06-27 18:26:34,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1869096.0, ans=0.0 2023-06-27 18:26:43,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-27 18:26:51,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1869156.0, ans=0.2 2023-06-27 18:27:31,155 INFO [train.py:996] (1/4) Epoch 11, batch 6600, loss[loss=0.1995, simple_loss=0.2608, pruned_loss=0.06907, over 21763.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2863, pruned_loss=0.0642, over 4258825.11 frames. 
], batch size: 371, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:27:36,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1869276.0, ans=0.125 2023-06-27 18:27:46,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1869276.0, ans=0.125 2023-06-27 18:27:46,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1869276.0, ans=0.125 2023-06-27 18:27:48,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-27 18:28:13,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1869396.0, ans=0.0 2023-06-27 18:28:50,865 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.592e+02 1.007e+03 1.403e+03 3.039e+03, threshold=2.014e+03, percent-clipped=10.0 2023-06-27 18:29:12,962 INFO [train.py:996] (1/4) Epoch 11, batch 6650, loss[loss=0.176, simple_loss=0.2536, pruned_loss=0.04917, over 21813.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.279, pruned_loss=0.06235, over 4265985.42 frames. ], batch size: 352, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:29:34,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1869636.0, ans=0.125 2023-06-27 18:30:10,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-27 18:30:59,833 INFO [train.py:996] (1/4) Epoch 11, batch 6700, loss[loss=0.2306, simple_loss=0.3421, pruned_loss=0.05954, over 19829.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2742, pruned_loss=0.06211, over 4259736.43 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:31:16,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1869936.0, ans=0.2 2023-06-27 18:31:18,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1869936.0, ans=0.035 2023-06-27 18:31:25,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1869936.0, ans=0.2 2023-06-27 18:32:16,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.214e+02 6.879e+02 9.707e+02 1.410e+03 2.811e+03, threshold=1.941e+03, percent-clipped=3.0 2023-06-27 18:32:28,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1870116.0, ans=0.2 2023-06-27 18:32:42,377 INFO [train.py:996] (1/4) Epoch 11, batch 6750, loss[loss=0.1933, simple_loss=0.311, pruned_loss=0.03779, over 19818.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2727, pruned_loss=0.0625, over 4257142.63 frames. 
], batch size: 702, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:33:38,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1870296.0, ans=0.2 2023-06-27 18:34:11,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1870416.0, ans=0.125 2023-06-27 18:34:13,722 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=22.5 2023-06-27 18:34:23,481 INFO [train.py:996] (1/4) Epoch 11, batch 6800, loss[loss=0.2097, simple_loss=0.2765, pruned_loss=0.07144, over 22027.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2748, pruned_loss=0.0636, over 4263845.68 frames. ], batch size: 103, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:34:32,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1870476.0, ans=0.0 2023-06-27 18:35:21,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1870656.0, ans=0.125 2023-06-27 18:35:39,116 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.574e+02 7.159e+02 9.186e+02 1.470e+03 3.415e+03, threshold=1.837e+03, percent-clipped=10.0 2023-06-27 18:36:00,269 INFO [train.py:996] (1/4) Epoch 11, batch 6850, loss[loss=0.2511, simple_loss=0.2926, pruned_loss=0.1048, over 21555.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2744, pruned_loss=0.06399, over 4269151.08 frames. ], batch size: 508, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:36:27,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1870836.0, ans=0.125 2023-06-27 18:37:06,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1870956.0, ans=0.125 2023-06-27 18:37:37,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1871016.0, ans=0.1 2023-06-27 18:37:43,673 INFO [train.py:996] (1/4) Epoch 11, batch 6900, loss[loss=0.1992, simple_loss=0.2705, pruned_loss=0.06396, over 21554.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.274, pruned_loss=0.06379, over 4279805.22 frames. ], batch size: 212, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:39:05,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.235e+02 7.048e+02 1.193e+03 1.711e+03 4.903e+03, threshold=2.385e+03, percent-clipped=22.0 2023-06-27 18:39:22,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1871316.0, ans=0.125 2023-06-27 18:39:31,787 INFO [train.py:996] (1/4) Epoch 11, batch 6950, loss[loss=0.2306, simple_loss=0.3066, pruned_loss=0.07728, over 21927.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2789, pruned_loss=0.0619, over 4270947.05 frames. 
], batch size: 316, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:39:32,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1871376.0, ans=0.125 2023-06-27 18:39:37,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1871376.0, ans=0.2 2023-06-27 18:39:38,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.13 vs. limit=22.5 2023-06-27 18:39:57,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1871436.0, ans=0.1 2023-06-27 18:40:01,438 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=15.0 2023-06-27 18:41:14,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5 2023-06-27 18:41:14,903 INFO [train.py:996] (1/4) Epoch 11, batch 7000, loss[loss=0.2062, simple_loss=0.2633, pruned_loss=0.07458, over 21583.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.282, pruned_loss=0.06382, over 4280683.47 frames. ], batch size: 231, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:41:15,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1871676.0, ans=0.125 2023-06-27 18:41:17,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1871676.0, ans=0.0 2023-06-27 18:41:18,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1871676.0, ans=0.125 2023-06-27 18:42:24,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1871856.0, ans=0.05 2023-06-27 18:42:31,980 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 6.963e+02 9.301e+02 1.305e+03 2.856e+03, threshold=1.860e+03, percent-clipped=1.0 2023-06-27 18:42:44,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1871916.0, ans=0.2 2023-06-27 18:42:58,609 INFO [train.py:996] (1/4) Epoch 11, batch 7050, loss[loss=0.173, simple_loss=0.266, pruned_loss=0.04001, over 21608.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2785, pruned_loss=0.06314, over 4279106.24 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:43:49,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1872096.0, ans=10.0 2023-06-27 18:44:09,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1872156.0, ans=0.0 2023-06-27 18:44:47,741 INFO [train.py:996] (1/4) Epoch 11, batch 7100, loss[loss=0.1966, simple_loss=0.2831, pruned_loss=0.05507, over 21741.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2841, pruned_loss=0.0646, over 4281899.09 frames. 
], batch size: 332, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:45:15,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1872336.0, ans=0.1 2023-06-27 18:45:37,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1872396.0, ans=0.1 2023-06-27 18:45:37,999 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-27 18:45:51,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.58 vs. limit=22.5 2023-06-27 18:46:03,442 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.854e+02 6.087e+02 7.876e+02 1.187e+03 3.248e+03, threshold=1.575e+03, percent-clipped=9.0 2023-06-27 18:46:25,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1872516.0, ans=0.2 2023-06-27 18:46:30,037 INFO [train.py:996] (1/4) Epoch 11, batch 7150, loss[loss=0.1255, simple_loss=0.1949, pruned_loss=0.02811, over 21853.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.282, pruned_loss=0.0629, over 4271680.63 frames. ], batch size: 98, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:46:42,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.62 vs. limit=12.0 2023-06-27 18:47:05,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1872636.0, ans=0.125 2023-06-27 18:47:07,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1872636.0, ans=0.0 2023-06-27 18:47:47,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1872756.0, ans=0.0 2023-06-27 18:48:07,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1872816.0, ans=0.0 2023-06-27 18:48:18,349 INFO [train.py:996] (1/4) Epoch 11, batch 7200, loss[loss=0.2042, simple_loss=0.2764, pruned_loss=0.06601, over 21236.00 frames. ], tot_loss[loss=0.207, simple_loss=0.285, pruned_loss=0.06446, over 4269446.97 frames. ], batch size: 159, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:48:59,464 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-27 18:49:35,423 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.391e+02 8.685e+02 1.394e+03 1.830e+03 3.525e+03, threshold=2.788e+03, percent-clipped=36.0 2023-06-27 18:50:03,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1873176.0, ans=0.0 2023-06-27 18:50:04,641 INFO [train.py:996] (1/4) Epoch 11, batch 7250, loss[loss=0.1976, simple_loss=0.271, pruned_loss=0.06208, over 21859.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2819, pruned_loss=0.06458, over 4264147.08 frames. 
], batch size: 107, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:50:28,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1873236.0, ans=0.0 2023-06-27 18:51:47,389 INFO [train.py:996] (1/4) Epoch 11, batch 7300, loss[loss=0.1871, simple_loss=0.253, pruned_loss=0.06058, over 21302.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2759, pruned_loss=0.0639, over 4260020.83 frames. ], batch size: 144, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:52:46,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=22.5 2023-06-27 18:53:00,323 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 7.298e+02 1.227e+03 1.780e+03 3.301e+03, threshold=2.454e+03, percent-clipped=5.0 2023-06-27 18:53:17,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1873716.0, ans=0.125 2023-06-27 18:53:30,252 INFO [train.py:996] (1/4) Epoch 11, batch 7350, loss[loss=0.2383, simple_loss=0.3151, pruned_loss=0.08073, over 21464.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2743, pruned_loss=0.06484, over 4263205.42 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:53:32,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1873776.0, ans=0.1 2023-06-27 18:53:38,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-06-27 18:53:49,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1873836.0, ans=0.125 2023-06-27 18:53:49,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-27 18:54:07,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1873896.0, ans=0.125 2023-06-27 18:54:53,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.98 vs. limit=5.0 2023-06-27 18:55:13,770 INFO [train.py:996] (1/4) Epoch 11, batch 7400, loss[loss=0.1993, simple_loss=0.276, pruned_loss=0.06125, over 20877.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2786, pruned_loss=0.06606, over 4261497.26 frames. ], batch size: 609, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:55:20,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. 
limit=15.0 2023-06-27 18:55:28,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1874076.0, ans=0.0 2023-06-27 18:56:07,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874196.0, ans=0.1 2023-06-27 18:56:15,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1874256.0, ans=0.125 2023-06-27 18:56:31,615 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.427e+02 7.089e+02 1.051e+03 1.718e+03 3.603e+03, threshold=2.102e+03, percent-clipped=3.0 2023-06-27 18:56:57,308 INFO [train.py:996] (1/4) Epoch 11, batch 7450, loss[loss=0.1987, simple_loss=0.266, pruned_loss=0.06564, over 21581.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2762, pruned_loss=0.06548, over 4263056.89 frames. ], batch size: 391, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:57:47,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1874496.0, ans=0.125 2023-06-27 18:57:55,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1874496.0, ans=0.125 2023-06-27 18:58:22,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=12.0 2023-06-27 18:58:33,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1874616.0, ans=0.0 2023-06-27 18:58:33,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1874616.0, ans=0.125 2023-06-27 18:58:41,420 INFO [train.py:996] (1/4) Epoch 11, batch 7500, loss[loss=0.219, simple_loss=0.3157, pruned_loss=0.06115, over 21276.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2816, pruned_loss=0.06733, over 4271238.49 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 18:58:45,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1874676.0, ans=0.0 2023-06-27 18:58:49,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1874676.0, ans=0.0 2023-06-27 18:58:52,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1874676.0, ans=0.07 2023-06-27 18:59:07,506 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-27 18:59:41,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1874796.0, ans=0.0 2023-06-27 19:00:04,713 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.389e+02 7.977e+02 1.325e+03 1.991e+03 3.400e+03, threshold=2.650e+03, percent-clipped=21.0 2023-06-27 19:00:15,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1874916.0, ans=0.0 2023-06-27 19:00:24,549 INFO [train.py:996] (1/4) Epoch 11, batch 7550, loss[loss=0.2079, simple_loss=0.3091, pruned_loss=0.05332, over 21779.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2897, pruned_loss=0.06632, over 4274274.63 frames. 
], batch size: 332, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:01:27,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1875156.0, ans=0.0 2023-06-27 19:01:33,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1875156.0, ans=0.125 2023-06-27 19:02:05,493 INFO [train.py:996] (1/4) Epoch 11, batch 7600, loss[loss=0.2168, simple_loss=0.2911, pruned_loss=0.07121, over 21872.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.289, pruned_loss=0.06538, over 4278133.65 frames. ], batch size: 371, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:02:07,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-27 19:02:16,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1875276.0, ans=0.2 2023-06-27 19:02:56,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1875396.0, ans=0.125 2023-06-27 19:03:28,848 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 7.250e+02 9.858e+02 1.337e+03 3.374e+03, threshold=1.972e+03, percent-clipped=5.0 2023-06-27 19:03:32,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-27 19:03:47,207 INFO [train.py:996] (1/4) Epoch 11, batch 7650, loss[loss=0.2068, simple_loss=0.27, pruned_loss=0.07183, over 21567.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2878, pruned_loss=0.06664, over 4289700.11 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:03:58,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-27 19:04:23,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1875636.0, ans=15.0 2023-06-27 19:04:53,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-27 19:05:30,749 INFO [train.py:996] (1/4) Epoch 11, batch 7700, loss[loss=0.2575, simple_loss=0.3373, pruned_loss=0.08886, over 21814.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2908, pruned_loss=0.06898, over 4289905.31 frames. 
], batch size: 124, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:05:47,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1875876.0, ans=0.0 2023-06-27 19:06:14,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1875996.0, ans=0.125 2023-06-27 19:06:29,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1875996.0, ans=0.1 2023-06-27 19:06:29,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1875996.0, ans=0.0 2023-06-27 19:06:46,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1876056.0, ans=0.0 2023-06-27 19:06:59,795 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 8.160e+02 1.175e+03 1.754e+03 4.757e+03, threshold=2.350e+03, percent-clipped=23.0 2023-06-27 19:07:16,809 INFO [train.py:996] (1/4) Epoch 11, batch 7750, loss[loss=0.2034, simple_loss=0.2947, pruned_loss=0.05602, over 21274.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2952, pruned_loss=0.06838, over 4286857.71 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:07:23,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1876176.0, ans=0.0 2023-06-27 19:07:39,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1876176.0, ans=0.0 2023-06-27 19:07:43,237 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-27 19:08:01,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1876236.0, ans=0.125 2023-06-27 19:08:22,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1876296.0, ans=0.0 2023-06-27 19:08:36,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1876356.0, ans=0.125 2023-06-27 19:08:44,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1876416.0, ans=0.2 2023-06-27 19:09:04,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=15.0 2023-06-27 19:09:10,458 INFO [train.py:996] (1/4) Epoch 11, batch 7800, loss[loss=0.2421, simple_loss=0.327, pruned_loss=0.07855, over 21557.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2977, pruned_loss=0.06905, over 4277862.31 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:09:13,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1876476.0, ans=0.025 2023-06-27 19:09:15,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. 
limit=10.0 2023-06-27 19:09:42,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1876536.0, ans=0.1 2023-06-27 19:09:52,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=22.5 2023-06-27 19:10:05,579 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:10:26,680 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.510e+02 6.767e+02 1.181e+03 1.586e+03 4.451e+03, threshold=2.363e+03, percent-clipped=7.0 2023-06-27 19:10:49,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1876716.0, ans=0.125 2023-06-27 19:10:52,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1876776.0, ans=0.125 2023-06-27 19:10:53,758 INFO [train.py:996] (1/4) Epoch 11, batch 7850, loss[loss=0.1754, simple_loss=0.2416, pruned_loss=0.05462, over 21458.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2901, pruned_loss=0.06813, over 4264739.34 frames. ], batch size: 230, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:10:59,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1876776.0, ans=0.125 2023-06-27 19:11:45,423 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:11:58,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-27 19:12:02,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1876956.0, ans=0.125 2023-06-27 19:12:26,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1877016.0, ans=0.125 2023-06-27 19:12:40,430 INFO [train.py:996] (1/4) Epoch 11, batch 7900, loss[loss=0.1732, simple_loss=0.2476, pruned_loss=0.04944, over 21244.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2858, pruned_loss=0.06786, over 4269074.92 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:12:55,475 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:13:15,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1877136.0, ans=0.125 2023-06-27 19:13:19,888 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-27 19:13:33,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. 
limit=15.0 2023-06-27 19:13:53,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1877256.0, ans=0.125 2023-06-27 19:14:08,124 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 7.562e+02 1.142e+03 1.795e+03 4.843e+03, threshold=2.283e+03, percent-clipped=15.0 2023-06-27 19:14:29,978 INFO [train.py:996] (1/4) Epoch 11, batch 7950, loss[loss=0.1914, simple_loss=0.2771, pruned_loss=0.05288, over 21438.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2876, pruned_loss=0.06672, over 4263073.21 frames. ], batch size: 211, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:14:52,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1877436.0, ans=0.0 2023-06-27 19:14:53,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1877436.0, ans=0.125 2023-06-27 19:15:08,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1877496.0, ans=0.125 2023-06-27 19:15:39,556 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-06-27 19:15:56,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1877556.0, ans=0.125 2023-06-27 19:16:22,061 INFO [train.py:996] (1/4) Epoch 11, batch 8000, loss[loss=0.2414, simple_loss=0.3244, pruned_loss=0.07922, over 21641.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2922, pruned_loss=0.06823, over 4263593.25 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:16:44,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-27 19:17:04,117 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:17:26,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1877796.0, ans=0.1 2023-06-27 19:17:51,520 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.031e+02 6.364e+02 9.395e+02 1.417e+03 3.378e+03, threshold=1.879e+03, percent-clipped=5.0 2023-06-27 19:18:08,686 INFO [train.py:996] (1/4) Epoch 11, batch 8050, loss[loss=0.2045, simple_loss=0.2732, pruned_loss=0.06788, over 21397.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2944, pruned_loss=0.06858, over 4251355.20 frames. ], batch size: 194, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:18:14,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1877976.0, ans=0.125 2023-06-27 19:18:42,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1878036.0, ans=0.125 2023-06-27 19:19:53,018 INFO [train.py:996] (1/4) Epoch 11, batch 8100, loss[loss=0.2556, simple_loss=0.3145, pruned_loss=0.09829, over 21624.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2939, pruned_loss=0.06911, over 4258144.90 frames. 
], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:21:22,430 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.025e+02 8.290e+02 1.329e+03 2.139e+03 5.514e+03, threshold=2.658e+03, percent-clipped=35.0 2023-06-27 19:21:41,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1878516.0, ans=0.2 2023-06-27 19:21:46,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1878516.0, ans=0.125 2023-06-27 19:21:48,877 INFO [train.py:996] (1/4) Epoch 11, batch 8150, loss[loss=0.2331, simple_loss=0.3404, pruned_loss=0.06293, over 21589.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3027, pruned_loss=0.07091, over 4266112.81 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:21:49,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1878576.0, ans=0.125 2023-06-27 19:22:43,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1878696.0, ans=0.125 2023-06-27 19:23:30,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-27 19:23:31,194 INFO [train.py:996] (1/4) Epoch 11, batch 8200, loss[loss=0.2143, simple_loss=0.2725, pruned_loss=0.07807, over 21508.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2963, pruned_loss=0.06841, over 4257343.18 frames. ], batch size: 442, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:24:03,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1878936.0, ans=0.125 2023-06-27 19:24:10,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1878936.0, ans=0.125 2023-06-27 19:24:50,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1879056.0, ans=0.1 2023-06-27 19:24:53,469 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.441e+02 7.151e+02 1.119e+03 1.525e+03 4.860e+03, threshold=2.239e+03, percent-clipped=3.0 2023-06-27 19:25:04,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1879116.0, ans=0.125 2023-06-27 19:25:15,173 INFO [train.py:996] (1/4) Epoch 11, batch 8250, loss[loss=0.2578, simple_loss=0.3531, pruned_loss=0.08129, over 21194.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2948, pruned_loss=0.06786, over 4252400.71 frames. ], batch size: 548, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:25:41,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1879236.0, ans=0.0 2023-06-27 19:26:03,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1879296.0, ans=0.0 2023-06-27 19:26:15,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.88 vs. 
limit=15.0 2023-06-27 19:26:17,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1879356.0, ans=0.1 2023-06-27 19:26:33,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1879356.0, ans=0.125 2023-06-27 19:26:54,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1879416.0, ans=0.0 2023-06-27 19:26:56,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1879416.0, ans=0.0 2023-06-27 19:26:59,249 INFO [train.py:996] (1/4) Epoch 11, batch 8300, loss[loss=0.2389, simple_loss=0.3247, pruned_loss=0.07657, over 21677.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2944, pruned_loss=0.06535, over 4259626.68 frames. ], batch size: 414, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:27:16,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1879476.0, ans=0.125 2023-06-27 19:27:49,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1879596.0, ans=0.0 2023-06-27 19:28:25,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.537e+02 6.833e+02 1.058e+03 1.562e+03 3.226e+03, threshold=2.116e+03, percent-clipped=10.0 2023-06-27 19:28:37,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1879716.0, ans=0.0 2023-06-27 19:28:41,969 INFO [train.py:996] (1/4) Epoch 11, batch 8350, loss[loss=0.1878, simple_loss=0.2628, pruned_loss=0.05638, over 21156.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2934, pruned_loss=0.06383, over 4258644.95 frames. ], batch size: 548, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:28:52,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1879776.0, ans=0.125 2023-06-27 19:29:02,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1879836.0, ans=0.125 2023-06-27 19:29:24,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1879896.0, ans=0.0 2023-06-27 19:29:36,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1879896.0, ans=0.125 2023-06-27 19:29:52,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1879956.0, ans=0.0 2023-06-27 19:30:00,147 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-27 19:30:04,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1879956.0, ans=0.2 2023-06-27 19:30:15,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1880016.0, ans=0.0 2023-06-27 19:30:29,632 INFO [train.py:996] (1/4) Epoch 11, batch 8400, loss[loss=0.2042, simple_loss=0.3, pruned_loss=0.05416, over 21695.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2905, pruned_loss=0.06122, over 4263466.43 frames. 
], batch size: 414, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:30:35,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-27 19:30:37,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0 2023-06-27 19:31:37,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1880256.0, ans=0.125 2023-06-27 19:31:51,127 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.798e+02 6.790e+02 1.029e+03 1.707e+03 4.211e+03, threshold=2.059e+03, percent-clipped=16.0 2023-06-27 19:32:11,260 INFO [train.py:996] (1/4) Epoch 11, batch 8450, loss[loss=0.1953, simple_loss=0.2704, pruned_loss=0.06012, over 21268.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2891, pruned_loss=0.06134, over 4273423.69 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:32:30,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-06-27 19:33:37,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1880616.0, ans=0.125 2023-06-27 19:33:48,564 INFO [train.py:996] (1/4) Epoch 11, batch 8500, loss[loss=0.2211, simple_loss=0.276, pruned_loss=0.08309, over 21281.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2849, pruned_loss=0.06276, over 4272702.99 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:34:33,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1880796.0, ans=0.0 2023-06-27 19:34:59,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1880856.0, ans=0.125 2023-06-27 19:35:17,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.145e+02 8.155e+02 1.098e+03 1.780e+03 3.950e+03, threshold=2.195e+03, percent-clipped=18.0 2023-06-27 19:35:37,553 INFO [train.py:996] (1/4) Epoch 11, batch 8550, loss[loss=0.2155, simple_loss=0.2956, pruned_loss=0.06771, over 21301.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2846, pruned_loss=0.06419, over 4267400.03 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:35:57,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1880976.0, ans=0.125 2023-06-27 19:36:06,520 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.99 vs. limit=15.0 2023-06-27 19:36:58,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1881156.0, ans=0.05 2023-06-27 19:37:09,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1881216.0, ans=0.0 2023-06-27 19:37:27,694 INFO [train.py:996] (1/4) Epoch 11, batch 8600, loss[loss=0.2589, simple_loss=0.3774, pruned_loss=0.07026, over 20763.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2945, pruned_loss=0.06774, over 4266671.78 frames. 
], batch size: 607, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:37:37,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-27 19:38:11,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1881396.0, ans=0.125 2023-06-27 19:38:36,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1881456.0, ans=10.0 2023-06-27 19:38:50,814 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.000e+02 1.009e+03 1.607e+03 3.888e+03, threshold=2.018e+03, percent-clipped=13.0 2023-06-27 19:39:06,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1881516.0, ans=0.125 2023-06-27 19:39:11,183 INFO [train.py:996] (1/4) Epoch 11, batch 8650, loss[loss=0.1878, simple_loss=0.288, pruned_loss=0.04382, over 21627.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2996, pruned_loss=0.06842, over 4272495.55 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:39:16,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1881576.0, ans=0.125 2023-06-27 19:40:00,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1881696.0, ans=0.125 2023-06-27 19:40:06,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1881696.0, ans=10.0 2023-06-27 19:40:30,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1881816.0, ans=0.1 2023-06-27 19:40:31,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1881816.0, ans=0.015 2023-06-27 19:40:52,494 INFO [train.py:996] (1/4) Epoch 11, batch 8700, loss[loss=0.1967, simple_loss=0.2599, pruned_loss=0.0668, over 21485.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2917, pruned_loss=0.06508, over 4275578.39 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:40:59,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1881876.0, ans=0.125 2023-06-27 19:41:05,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.09 vs. limit=22.5 2023-06-27 19:41:36,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=22.5 2023-06-27 19:41:55,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.44 vs. 
limit=15.0 2023-06-27 19:42:01,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1882056.0, ans=0.125 2023-06-27 19:42:15,392 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.345e+02 6.737e+02 1.063e+03 1.710e+03 3.619e+03, threshold=2.126e+03, percent-clipped=15.0 2023-06-27 19:42:27,036 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=12.0 2023-06-27 19:42:29,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1882116.0, ans=0.2 2023-06-27 19:42:35,711 INFO [train.py:996] (1/4) Epoch 11, batch 8750, loss[loss=0.2087, simple_loss=0.2838, pruned_loss=0.06674, over 21885.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2887, pruned_loss=0.06531, over 4286551.15 frames. ], batch size: 333, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:42:44,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1882176.0, ans=0.0 2023-06-27 19:42:46,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1882176.0, ans=0.1 2023-06-27 19:43:06,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1882236.0, ans=0.125 2023-06-27 19:44:11,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1882416.0, ans=0.0 2023-06-27 19:44:13,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1882416.0, ans=0.0 2023-06-27 19:44:13,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1882416.0, ans=0.0 2023-06-27 19:44:19,327 INFO [train.py:996] (1/4) Epoch 11, batch 8800, loss[loss=0.2389, simple_loss=0.3187, pruned_loss=0.07959, over 21301.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2973, pruned_loss=0.06788, over 4286995.22 frames. 
], batch size: 176, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:44:20,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1882476.0, ans=0.0 2023-06-27 19:44:27,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1882476.0, ans=0.125 2023-06-27 19:44:35,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1882476.0, ans=0.125 2023-06-27 19:44:55,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1882536.0, ans=0.125 2023-06-27 19:45:29,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1882656.0, ans=0.125 2023-06-27 19:45:44,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1882716.0, ans=0.025 2023-06-27 19:45:49,238 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 9.134e+02 1.413e+03 2.470e+03 4.738e+03, threshold=2.826e+03, percent-clipped=30.0 2023-06-27 19:46:02,353 INFO [train.py:996] (1/4) Epoch 11, batch 8850, loss[loss=0.2359, simple_loss=0.3308, pruned_loss=0.07051, over 21881.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3032, pruned_loss=0.07002, over 4287256.62 frames. ], batch size: 98, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:47:03,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1882896.0, ans=0.0 2023-06-27 19:47:22,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1882956.0, ans=0.125 2023-06-27 19:47:29,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1883016.0, ans=0.0 2023-06-27 19:47:50,832 INFO [train.py:996] (1/4) Epoch 11, batch 8900, loss[loss=0.2062, simple_loss=0.292, pruned_loss=0.06018, over 21615.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2981, pruned_loss=0.06872, over 4285557.72 frames. ], batch size: 414, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:47:51,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1883076.0, ans=0.125 2023-06-27 19:47:51,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1883076.0, ans=0.125 2023-06-27 19:48:18,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1883136.0, ans=0.2 2023-06-27 19:49:03,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-27 19:49:23,139 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.611e+02 6.392e+02 1.039e+03 1.753e+03 5.076e+03, threshold=2.078e+03, percent-clipped=8.0 2023-06-27 19:49:36,305 INFO [train.py:996] (1/4) Epoch 11, batch 8950, loss[loss=0.1802, simple_loss=0.2455, pruned_loss=0.05746, over 21174.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2978, pruned_loss=0.06793, over 4280653.32 frames. 
], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:50:46,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.23 vs. limit=6.0 2023-06-27 19:51:06,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.32 vs. limit=10.0 2023-06-27 19:51:18,632 INFO [train.py:996] (1/4) Epoch 11, batch 9000, loss[loss=0.1862, simple_loss=0.2484, pruned_loss=0.06197, over 21474.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2905, pruned_loss=0.06696, over 4270732.09 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:51:18,632 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 19:51:37,887 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2621, simple_loss=0.3543, pruned_loss=0.08494, over 1796401.00 frames. 2023-06-27 19:51:37,888 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 19:51:40,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1883676.0, ans=0.125 2023-06-27 19:52:51,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1883856.0, ans=0.05 2023-06-27 19:52:58,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1883856.0, ans=0.05 2023-06-27 19:53:04,549 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.323e+02 6.298e+02 8.263e+02 1.367e+03 3.761e+03, threshold=1.653e+03, percent-clipped=12.0 2023-06-27 19:53:14,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1883916.0, ans=0.125 2023-06-27 19:53:28,398 INFO [train.py:996] (1/4) Epoch 11, batch 9050, loss[loss=0.2618, simple_loss=0.3793, pruned_loss=0.07215, over 19828.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.288, pruned_loss=0.06423, over 4266552.36 frames. ], batch size: 702, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:53:58,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1884036.0, ans=0.1 2023-06-27 19:54:08,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1884036.0, ans=0.0 2023-06-27 19:54:13,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1884096.0, ans=0.0 2023-06-27 19:54:22,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-27 19:54:25,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1884096.0, ans=0.0 2023-06-27 19:54:36,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1884156.0, ans=0.1 2023-06-27 19:55:11,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. 
limit=15.0 2023-06-27 19:55:13,473 INFO [train.py:996] (1/4) Epoch 11, batch 9100, loss[loss=0.2092, simple_loss=0.2997, pruned_loss=0.05936, over 21159.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2935, pruned_loss=0.06631, over 4269725.42 frames. ], batch size: 143, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:55:20,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1884276.0, ans=0.125 2023-06-27 19:55:45,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1884336.0, ans=0.0 2023-06-27 19:56:44,831 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 7.145e+02 1.042e+03 1.570e+03 3.461e+03, threshold=2.085e+03, percent-clipped=19.0 2023-06-27 19:56:50,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1884516.0, ans=0.125 2023-06-27 19:57:00,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1884516.0, ans=0.2 2023-06-27 19:57:03,244 INFO [train.py:996] (1/4) Epoch 11, batch 9150, loss[loss=0.1968, simple_loss=0.2887, pruned_loss=0.0525, over 21717.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2959, pruned_loss=0.06442, over 4270863.52 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:57:52,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1884696.0, ans=0.2 2023-06-27 19:57:53,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1884696.0, ans=0.125 2023-06-27 19:58:45,956 INFO [train.py:996] (1/4) Epoch 11, batch 9200, loss[loss=0.1978, simple_loss=0.289, pruned_loss=0.05328, over 21833.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2967, pruned_loss=0.06311, over 4269117.38 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:58:49,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-06-27 19:59:07,686 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:59:11,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1884936.0, ans=0.1 2023-06-27 19:59:21,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-27 19:59:34,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1884996.0, ans=0.07 2023-06-27 19:59:46,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1885056.0, ans=0.125 2023-06-27 19:59:47,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1885056.0, ans=0.2 2023-06-27 19:59:47,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1885056.0, ans=0.0 2023-06-27 19:59:59,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.89 vs. 
limit=22.5 2023-06-27 20:00:16,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 7.193e+02 1.189e+03 2.039e+03 4.796e+03, threshold=2.378e+03, percent-clipped=22.0 2023-06-27 20:00:28,233 INFO [train.py:996] (1/4) Epoch 11, batch 9250, loss[loss=0.1964, simple_loss=0.265, pruned_loss=0.06393, over 21607.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2995, pruned_loss=0.06568, over 4271560.59 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:01:19,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1885296.0, ans=0.0 2023-06-27 20:02:17,642 INFO [train.py:996] (1/4) Epoch 11, batch 9300, loss[loss=0.1994, simple_loss=0.2599, pruned_loss=0.06944, over 21469.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2938, pruned_loss=0.06533, over 4260159.74 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:03:37,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1885656.0, ans=0.125 2023-06-27 20:03:50,077 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.175e+02 5.672e+02 8.335e+02 1.329e+03 3.533e+03, threshold=1.667e+03, percent-clipped=8.0 2023-06-27 20:04:02,354 INFO [train.py:996] (1/4) Epoch 11, batch 9350, loss[loss=0.227, simple_loss=0.3111, pruned_loss=0.07145, over 21744.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2993, pruned_loss=0.06654, over 4262120.36 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:04:06,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1885776.0, ans=0.2 2023-06-27 20:04:14,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1885776.0, ans=0.0 2023-06-27 20:04:56,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1885896.0, ans=0.125 2023-06-27 20:05:02,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1885896.0, ans=0.0 2023-06-27 20:05:27,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-27 20:05:45,787 INFO [train.py:996] (1/4) Epoch 11, batch 9400, loss[loss=0.2345, simple_loss=0.2876, pruned_loss=0.09074, over 21299.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3009, pruned_loss=0.06724, over 4260329.16 frames. ], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:05:58,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1886076.0, ans=0.0 2023-06-27 20:07:08,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1886256.0, ans=0.0 2023-06-27 20:07:16,432 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.854e+02 7.483e+02 1.060e+03 1.789e+03 3.889e+03, threshold=2.119e+03, percent-clipped=27.0 2023-06-27 20:07:27,689 INFO [train.py:996] (1/4) Epoch 11, batch 9450, loss[loss=0.2248, simple_loss=0.298, pruned_loss=0.07576, over 21851.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2935, pruned_loss=0.06664, over 4257017.76 frames. 
], batch size: 98, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:07:28,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=22.5 2023-06-27 20:07:33,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1886376.0, ans=0.1 2023-06-27 20:08:30,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1886496.0, ans=0.125 2023-06-27 20:09:11,691 INFO [train.py:996] (1/4) Epoch 11, batch 9500, loss[loss=0.174, simple_loss=0.2545, pruned_loss=0.04672, over 21436.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2866, pruned_loss=0.06553, over 4267092.56 frames. ], batch size: 211, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:09:44,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1886736.0, ans=0.0 2023-06-27 20:09:48,521 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.58 vs. limit=15.0 2023-06-27 20:09:50,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-27 20:09:56,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1886796.0, ans=0.0 2023-06-27 20:10:20,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1886856.0, ans=0.2 2023-06-27 20:10:29,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1886856.0, ans=0.1 2023-06-27 20:10:38,455 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.105e+02 7.389e+02 1.115e+03 1.559e+03 4.093e+03, threshold=2.229e+03, percent-clipped=13.0 2023-06-27 20:10:43,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1886916.0, ans=0.125 2023-06-27 20:10:49,782 INFO [train.py:996] (1/4) Epoch 11, batch 9550, loss[loss=0.23, simple_loss=0.331, pruned_loss=0.06451, over 21786.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2922, pruned_loss=0.06777, over 4261046.59 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:11:16,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1887036.0, ans=0.2 2023-06-27 20:11:41,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1887096.0, ans=0.125 2023-06-27 20:12:18,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1887216.0, ans=0.5 2023-06-27 20:12:25,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1887276.0, ans=0.125 2023-06-27 20:12:26,516 INFO [train.py:996] (1/4) Epoch 11, batch 9600, loss[loss=0.1829, simple_loss=0.2584, pruned_loss=0.05366, over 21792.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2956, pruned_loss=0.06952, over 4264347.66 frames. 
], batch size: 247, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 20:13:13,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1887396.0, ans=0.2 2023-06-27 20:13:19,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-27 20:13:36,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.80 vs. limit=15.0 2023-06-27 20:13:54,723 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.225e+02 7.650e+02 1.090e+03 1.713e+03 4.107e+03, threshold=2.181e+03, percent-clipped=11.0 2023-06-27 20:14:04,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1887576.0, ans=0.1 2023-06-27 20:14:05,157 INFO [train.py:996] (1/4) Epoch 11, batch 9650, loss[loss=0.2292, simple_loss=0.3027, pruned_loss=0.07783, over 21560.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2962, pruned_loss=0.06913, over 4271795.87 frames. ], batch size: 230, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:14:55,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1887696.0, ans=0.0 2023-06-27 20:15:53,493 INFO [train.py:996] (1/4) Epoch 11, batch 9700, loss[loss=0.1864, simple_loss=0.2698, pruned_loss=0.05147, over 21394.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2995, pruned_loss=0.06955, over 4275258.33 frames. ], batch size: 211, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:16:12,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1887876.0, ans=0.2 2023-06-27 20:16:29,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1887936.0, ans=0.2 2023-06-27 20:16:43,758 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:16:48,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1887996.0, ans=0.1 2023-06-27 20:17:20,714 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 5.952e+02 8.386e+02 1.192e+03 2.882e+03, threshold=1.677e+03, percent-clipped=3.0 2023-06-27 20:17:35,536 INFO [train.py:996] (1/4) Epoch 11, batch 9750, loss[loss=0.1893, simple_loss=0.2553, pruned_loss=0.06165, over 21896.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2933, pruned_loss=0.0683, over 4270873.92 frames. ], batch size: 373, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:18:25,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1888296.0, ans=0.0 2023-06-27 20:18:26,313 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. 
limit=15.0 2023-06-27 20:18:43,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1888356.0, ans=0.125 2023-06-27 20:18:53,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1888416.0, ans=0.125 2023-06-27 20:19:10,798 INFO [train.py:996] (1/4) Epoch 11, batch 9800, loss[loss=0.2109, simple_loss=0.2906, pruned_loss=0.06562, over 21834.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2942, pruned_loss=0.06845, over 4264248.33 frames. ], batch size: 124, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:20:42,518 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.393e+02 6.342e+02 8.538e+02 1.222e+03 6.218e+03, threshold=1.708e+03, percent-clipped=13.0 2023-06-27 20:20:51,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1888776.0, ans=0.2 2023-06-27 20:20:52,447 INFO [train.py:996] (1/4) Epoch 11, batch 9850, loss[loss=0.1901, simple_loss=0.2532, pruned_loss=0.0635, over 21772.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2901, pruned_loss=0.06763, over 4260994.08 frames. ], batch size: 102, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:21:44,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1888896.0, ans=0.1 2023-06-27 20:22:05,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1888956.0, ans=0.04949747468305833 2023-06-27 20:22:13,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1888956.0, ans=0.125 2023-06-27 20:22:17,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1889016.0, ans=0.125 2023-06-27 20:22:22,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1889016.0, ans=0.125 2023-06-27 20:22:35,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2023-06-27 20:22:35,619 INFO [train.py:996] (1/4) Epoch 11, batch 9900, loss[loss=0.2097, simple_loss=0.2898, pruned_loss=0.06483, over 21713.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2855, pruned_loss=0.06682, over 4258173.24 frames. 
], batch size: 332, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:22:36,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889076.0, ans=0.1 2023-06-27 20:22:47,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1889076.0, ans=0.125 2023-06-27 20:23:58,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1889256.0, ans=0.125 2023-06-27 20:24:07,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1889316.0, ans=0.125 2023-06-27 20:24:07,970 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.874e+02 1.115e+03 1.655e+03 5.340e+03, threshold=2.230e+03, percent-clipped=22.0 2023-06-27 20:24:13,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1889316.0, ans=0.2 2023-06-27 20:24:18,300 INFO [train.py:996] (1/4) Epoch 11, batch 9950, loss[loss=0.2125, simple_loss=0.2857, pruned_loss=0.06961, over 21707.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.287, pruned_loss=0.06886, over 4259567.93 frames. ], batch size: 124, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:24:39,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1889376.0, ans=0.125 2023-06-27 20:25:05,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1889496.0, ans=0.025 2023-06-27 20:25:30,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1889556.0, ans=0.125 2023-06-27 20:25:37,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1889556.0, ans=0.125 2023-06-27 20:25:46,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1889616.0, ans=0.125 2023-06-27 20:26:04,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1889616.0, ans=0.2 2023-06-27 20:26:16,675 INFO [train.py:996] (1/4) Epoch 11, batch 10000, loss[loss=0.1856, simple_loss=0.2623, pruned_loss=0.05449, over 21526.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2834, pruned_loss=0.06797, over 4268788.16 frames. 
], batch size: 195, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 20:26:39,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1889736.0, ans=0.125 2023-06-27 20:27:08,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1889796.0, ans=10.0 2023-06-27 20:27:11,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1889856.0, ans=0.125 2023-06-27 20:27:13,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1889856.0, ans=0.125 2023-06-27 20:27:52,685 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 6.432e+02 1.028e+03 1.503e+03 2.874e+03, threshold=2.056e+03, percent-clipped=5.0 2023-06-27 20:27:59,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-06-27 20:28:01,328 INFO [train.py:996] (1/4) Epoch 11, batch 10050, loss[loss=0.2194, simple_loss=0.2885, pruned_loss=0.07514, over 21503.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2849, pruned_loss=0.06823, over 4270671.45 frames. ], batch size: 211, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:28:19,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1890036.0, ans=0.05 2023-06-27 20:28:43,864 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:29:03,641 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.64 vs. limit=15.0 2023-06-27 20:29:27,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1890216.0, ans=0.1 2023-06-27 20:29:44,814 INFO [train.py:996] (1/4) Epoch 11, batch 10100, loss[loss=0.2152, simple_loss=0.3155, pruned_loss=0.05749, over 21244.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2849, pruned_loss=0.06701, over 4266792.08 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:30:06,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1890336.0, ans=0.1 2023-06-27 20:31:19,829 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.253e+02 6.464e+02 9.043e+02 1.577e+03 3.572e+03, threshold=1.809e+03, percent-clipped=15.0 2023-06-27 20:31:27,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-27 20:31:28,336 INFO [train.py:996] (1/4) Epoch 11, batch 10150, loss[loss=0.2194, simple_loss=0.3038, pruned_loss=0.06752, over 21719.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2909, pruned_loss=0.06906, over 4273074.69 frames. 
], batch size: 351, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:31:52,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1890636.0, ans=0.125 2023-06-27 20:32:07,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1890696.0, ans=0.04949747468305833 2023-06-27 20:32:29,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-27 20:32:37,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1890756.0, ans=0.0 2023-06-27 20:32:57,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-27 20:33:06,433 INFO [train.py:996] (1/4) Epoch 11, batch 10200, loss[loss=0.2293, simple_loss=0.309, pruned_loss=0.0748, over 21251.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2918, pruned_loss=0.0684, over 4271355.66 frames. ], batch size: 143, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:33:52,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1890996.0, ans=0.035 2023-06-27 20:33:57,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1890996.0, ans=0.125 2023-06-27 20:34:24,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1891056.0, ans=0.125 2023-06-27 20:34:29,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1891056.0, ans=0.125 2023-06-27 20:34:41,112 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.938e+02 9.160e+02 1.393e+03 3.097e+03, threshold=1.832e+03, percent-clipped=16.0 2023-06-27 20:34:45,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-27 20:34:49,783 INFO [train.py:996] (1/4) Epoch 11, batch 10250, loss[loss=0.1503, simple_loss=0.2446, pruned_loss=0.02803, over 21717.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2857, pruned_loss=0.06295, over 4277934.02 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:35:08,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1891176.0, ans=0.0 2023-06-27 20:35:22,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1891236.0, ans=0.0 2023-06-27 20:35:25,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1891236.0, ans=0.0 2023-06-27 20:36:06,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891356.0, ans=0.1 2023-06-27 20:36:38,671 INFO [train.py:996] (1/4) Epoch 11, batch 10300, loss[loss=0.2303, simple_loss=0.3333, pruned_loss=0.06364, over 21625.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2892, pruned_loss=0.06427, over 4282678.73 frames. 
], batch size: 389, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:36:50,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1891476.0, ans=0.125 2023-06-27 20:37:01,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1891536.0, ans=0.1 2023-06-27 20:37:06,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1891536.0, ans=0.125 2023-06-27 20:37:46,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1891656.0, ans=0.2 2023-06-27 20:37:46,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. limit=6.0 2023-06-27 20:38:14,285 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.623e+02 8.106e+02 1.179e+03 1.696e+03 3.317e+03, threshold=2.359e+03, percent-clipped=22.0 2023-06-27 20:38:14,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1891716.0, ans=0.0 2023-06-27 20:38:22,832 INFO [train.py:996] (1/4) Epoch 11, batch 10350, loss[loss=0.1566, simple_loss=0.2097, pruned_loss=0.05178, over 21887.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2888, pruned_loss=0.06411, over 4276987.29 frames. ], batch size: 107, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:38:24,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=12.0 2023-06-27 20:38:31,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1891776.0, ans=0.04949747468305833 2023-06-27 20:38:54,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1891836.0, ans=0.125 2023-06-27 20:40:03,118 INFO [train.py:996] (1/4) Epoch 11, batch 10400, loss[loss=0.1818, simple_loss=0.2349, pruned_loss=0.06438, over 21188.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2817, pruned_loss=0.06301, over 4272764.49 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 20:40:14,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1892076.0, ans=0.2 2023-06-27 20:41:36,683 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.431e+02 7.002e+02 1.054e+03 1.542e+03 5.604e+03, threshold=2.109e+03, percent-clipped=11.0 2023-06-27 20:41:43,659 INFO [train.py:996] (1/4) Epoch 11, batch 10450, loss[loss=0.2343, simple_loss=0.3228, pruned_loss=0.07294, over 21840.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2874, pruned_loss=0.06521, over 4269762.22 frames. ], batch size: 371, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:43:01,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=12.0 2023-06-27 20:43:35,359 INFO [train.py:996] (1/4) Epoch 11, batch 10500, loss[loss=0.2079, simple_loss=0.2707, pruned_loss=0.07257, over 21601.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2869, pruned_loss=0.06419, over 4259767.64 frames. 
], batch size: 332, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:44:22,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-27 20:44:24,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1892796.0, ans=0.1 2023-06-27 20:44:52,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1892916.0, ans=0.1 2023-06-27 20:44:54,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1892916.0, ans=0.0 2023-06-27 20:45:06,707 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 6.335e+02 9.197e+02 1.411e+03 2.954e+03, threshold=1.839e+03, percent-clipped=7.0 2023-06-27 20:45:11,720 INFO [train.py:996] (1/4) Epoch 11, batch 10550, loss[loss=0.1732, simple_loss=0.2424, pruned_loss=0.052, over 21536.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2817, pruned_loss=0.06348, over 4241456.82 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:45:20,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1892976.0, ans=0.1 2023-06-27 20:45:22,936 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-27 20:45:42,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=12.0 2023-06-27 20:45:43,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1893036.0, ans=0.0 2023-06-27 20:46:02,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-27 20:46:25,441 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:46:32,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-06-27 20:46:38,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-06-27 20:47:00,279 INFO [train.py:996] (1/4) Epoch 11, batch 10600, loss[loss=0.1725, simple_loss=0.2721, pruned_loss=0.03648, over 21805.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2779, pruned_loss=0.06259, over 4250875.22 frames. ], batch size: 282, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:47:11,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-27 20:47:20,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-27 20:47:38,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.40 vs. 
limit=12.0 2023-06-27 20:48:03,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1893456.0, ans=0.125 2023-06-27 20:48:06,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1893456.0, ans=0.125 2023-06-27 20:48:45,439 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.137e+02 6.706e+02 1.104e+03 1.395e+03 2.716e+03, threshold=2.208e+03, percent-clipped=10.0 2023-06-27 20:48:50,912 INFO [train.py:996] (1/4) Epoch 11, batch 10650, loss[loss=0.2061, simple_loss=0.2847, pruned_loss=0.06376, over 20816.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2807, pruned_loss=0.0614, over 4252335.87 frames. ], batch size: 611, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:48:53,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1893576.0, ans=0.2 2023-06-27 20:48:58,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1893576.0, ans=0.125 2023-06-27 20:49:16,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1893636.0, ans=0.125 2023-06-27 20:49:41,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1893696.0, ans=0.125 2023-06-27 20:50:30,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1893876.0, ans=0.0 2023-06-27 20:50:31,092 INFO [train.py:996] (1/4) Epoch 11, batch 10700, loss[loss=0.214, simple_loss=0.2937, pruned_loss=0.06717, over 21598.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2812, pruned_loss=0.06147, over 4255359.84 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:50:47,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-27 20:51:30,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1894056.0, ans=0.0 2023-06-27 20:51:33,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1894056.0, ans=0.0 2023-06-27 20:51:39,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.38 vs. limit=15.0 2023-06-27 20:52:10,065 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.693e+02 7.296e+02 1.063e+03 2.005e+03 4.294e+03, threshold=2.126e+03, percent-clipped=18.0 2023-06-27 20:52:14,955 INFO [train.py:996] (1/4) Epoch 11, batch 10750, loss[loss=0.2227, simple_loss=0.3186, pruned_loss=0.06335, over 21712.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.292, pruned_loss=0.06646, over 4259319.38 frames. 
], batch size: 247, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:52:41,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1894236.0, ans=0.5 2023-06-27 20:52:46,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1894236.0, ans=0.0 2023-06-27 20:53:36,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1894356.0, ans=0.0 2023-06-27 20:53:36,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1894356.0, ans=0.125 2023-06-27 20:53:52,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1894416.0, ans=0.125 2023-06-27 20:53:54,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1894416.0, ans=0.125 2023-06-27 20:53:57,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1894416.0, ans=0.2 2023-06-27 20:54:00,256 INFO [train.py:996] (1/4) Epoch 11, batch 10800, loss[loss=0.2393, simple_loss=0.3095, pruned_loss=0.08459, over 21372.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2977, pruned_loss=0.06743, over 4261210.68 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:55:30,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1894716.0, ans=0.2 2023-06-27 20:55:38,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.646e+02 1.015e+03 1.682e+03 4.029e+03, threshold=2.031e+03, percent-clipped=15.0 2023-06-27 20:55:38,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1894716.0, ans=0.125 2023-06-27 20:55:42,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1894776.0, ans=0.125 2023-06-27 20:55:43,176 INFO [train.py:996] (1/4) Epoch 11, batch 10850, loss[loss=0.1744, simple_loss=0.2648, pruned_loss=0.04198, over 20790.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2979, pruned_loss=0.06705, over 4256522.58 frames. ], batch size: 607, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:55:57,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0 2023-06-27 20:55:58,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1894776.0, ans=0.125 2023-06-27 20:57:20,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1895016.0, ans=0.2 2023-06-27 20:57:26,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1895076.0, ans=0.125 2023-06-27 20:57:27,847 INFO [train.py:996] (1/4) Epoch 11, batch 10900, loss[loss=0.1872, simple_loss=0.2686, pruned_loss=0.05293, over 21244.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2902, pruned_loss=0.06533, over 4253936.96 frames. 
], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:57:28,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1895076.0, ans=0.0 2023-06-27 20:57:53,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=22.5 2023-06-27 20:58:59,317 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.887e+02 5.631e+02 8.269e+02 1.201e+03 2.087e+03, threshold=1.654e+03, percent-clipped=2.0 2023-06-27 20:59:04,293 INFO [train.py:996] (1/4) Epoch 11, batch 10950, loss[loss=0.2048, simple_loss=0.2704, pruned_loss=0.0696, over 21847.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.285, pruned_loss=0.06399, over 4250316.21 frames. ], batch size: 373, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:00:01,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1895496.0, ans=0.0 2023-06-27 21:00:17,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895556.0, ans=0.1 2023-06-27 21:00:36,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1895616.0, ans=0.0 2023-06-27 21:00:39,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1895616.0, ans=0.125 2023-06-27 21:00:41,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1895616.0, ans=0.1 2023-06-27 21:00:42,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1895616.0, ans=0.125 2023-06-27 21:00:51,815 INFO [train.py:996] (1/4) Epoch 11, batch 11000, loss[loss=0.1878, simple_loss=0.2684, pruned_loss=0.05364, over 21706.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2837, pruned_loss=0.06376, over 4251676.28 frames. ], batch size: 282, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:01:03,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1895676.0, ans=0.0 2023-06-27 21:01:11,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-27 21:01:37,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1895796.0, ans=0.1 2023-06-27 21:01:48,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895796.0, ans=0.1 2023-06-27 21:02:18,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1895916.0, ans=0.125 2023-06-27 21:02:24,707 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.482e+02 6.416e+02 9.382e+02 1.390e+03 3.598e+03, threshold=1.876e+03, percent-clipped=17.0 2023-06-27 21:02:28,490 INFO [train.py:996] (1/4) Epoch 11, batch 11050, loss[loss=0.1817, simple_loss=0.2482, pruned_loss=0.05762, over 21380.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2815, pruned_loss=0.06408, over 4253661.09 frames. 
], batch size: 131, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:03:36,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1896096.0, ans=0.125 2023-06-27 21:03:47,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1896156.0, ans=0.125 2023-06-27 21:04:16,372 INFO [train.py:996] (1/4) Epoch 11, batch 11100, loss[loss=0.2087, simple_loss=0.2975, pruned_loss=0.05988, over 21574.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2798, pruned_loss=0.06453, over 4252761.30 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:05:06,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1896396.0, ans=0.0 2023-06-27 21:05:18,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=22.5 2023-06-27 21:05:55,701 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.559e+02 5.976e+02 8.582e+02 1.481e+03 2.937e+03, threshold=1.716e+03, percent-clipped=16.0 2023-06-27 21:05:58,937 INFO [train.py:996] (1/4) Epoch 11, batch 11150, loss[loss=0.2483, simple_loss=0.3321, pruned_loss=0.08225, over 21390.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2783, pruned_loss=0.0643, over 4249852.22 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:06:54,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1896696.0, ans=0.0 2023-06-27 21:06:58,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-27 21:07:06,862 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-27 21:07:42,371 INFO [train.py:996] (1/4) Epoch 11, batch 11200, loss[loss=0.2146, simple_loss=0.2845, pruned_loss=0.07233, over 21575.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2777, pruned_loss=0.0645, over 4249721.59 frames. ], batch size: 391, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:07:43,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1896876.0, ans=0.09899494936611666 2023-06-27 21:07:43,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1896876.0, ans=0.0 2023-06-27 21:07:44,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1896876.0, ans=0.2 2023-06-27 21:08:43,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. 
limit=15.0 2023-06-27 21:08:47,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1897056.0, ans=0.125 2023-06-27 21:09:00,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1897056.0, ans=0.0 2023-06-27 21:09:20,941 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.297e+02 6.349e+02 8.470e+02 1.226e+03 2.540e+03, threshold=1.694e+03, percent-clipped=7.0 2023-06-27 21:09:24,609 INFO [train.py:996] (1/4) Epoch 11, batch 11250, loss[loss=0.19, simple_loss=0.2749, pruned_loss=0.05259, over 20828.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2775, pruned_loss=0.0647, over 4251540.02 frames. ], batch size: 609, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:09:25,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1897176.0, ans=0.125 2023-06-27 21:10:05,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1897236.0, ans=0.2 2023-06-27 21:10:45,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1897416.0, ans=0.0 2023-06-27 21:11:06,516 INFO [train.py:996] (1/4) Epoch 11, batch 11300, loss[loss=0.2028, simple_loss=0.2901, pruned_loss=0.05773, over 19925.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2782, pruned_loss=0.06497, over 4254864.80 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:11:11,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1897476.0, ans=0.05 2023-06-27 21:11:16,073 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.57 vs. limit=15.0 2023-06-27 21:11:23,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1897476.0, ans=0.125 2023-06-27 21:11:36,776 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:11:52,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1897596.0, ans=0.0 2023-06-27 21:11:54,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1897596.0, ans=0.0 2023-06-27 21:12:01,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1897596.0, ans=0.125 2023-06-27 21:12:27,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1897716.0, ans=0.0 2023-06-27 21:12:42,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=22.5 2023-06-27 21:12:44,920 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.908e+02 6.761e+02 9.436e+02 1.469e+03 2.612e+03, threshold=1.887e+03, percent-clipped=16.0 2023-06-27 21:12:48,346 INFO [train.py:996] (1/4) Epoch 11, batch 11350, loss[loss=0.2444, simple_loss=0.3248, pruned_loss=0.08197, over 21549.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2805, pruned_loss=0.06482, over 4257735.21 frames. 
], batch size: 389, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:12:49,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-27 21:13:07,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1897776.0, ans=0.1 2023-06-27 21:13:48,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1897896.0, ans=0.2 2023-06-27 21:14:30,968 INFO [train.py:996] (1/4) Epoch 11, batch 11400, loss[loss=0.2016, simple_loss=0.2807, pruned_loss=0.06132, over 19875.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2871, pruned_loss=0.06706, over 4262623.24 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:14:45,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1898076.0, ans=0.0 2023-06-27 21:15:12,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1898136.0, ans=0.0 2023-06-27 21:15:12,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0 2023-06-27 21:15:34,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1898256.0, ans=0.0 2023-06-27 21:16:09,913 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.866e+02 7.031e+02 1.006e+03 1.495e+03 2.656e+03, threshold=2.011e+03, percent-clipped=10.0 2023-06-27 21:16:23,426 INFO [train.py:996] (1/4) Epoch 11, batch 11450, loss[loss=0.2803, simple_loss=0.3451, pruned_loss=0.1078, over 21393.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2877, pruned_loss=0.06603, over 4260626.49 frames. ], batch size: 507, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:16:47,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1898436.0, ans=0.2 2023-06-27 21:16:54,098 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:16:58,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1898436.0, ans=0.0 2023-06-27 21:17:38,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1898556.0, ans=0.125 2023-06-27 21:17:39,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1898616.0, ans=0.125 2023-06-27 21:17:56,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1898616.0, ans=0.125 2023-06-27 21:18:06,597 INFO [train.py:996] (1/4) Epoch 11, batch 11500, loss[loss=0.2018, simple_loss=0.2986, pruned_loss=0.05249, over 21770.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2908, pruned_loss=0.06749, over 4267618.71 frames. ], batch size: 332, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:18:24,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. 
limit=15.0 2023-06-27 21:19:09,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1898856.0, ans=0.2 2023-06-27 21:19:45,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1898916.0, ans=0.0 2023-06-27 21:19:48,636 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.762e+02 6.937e+02 1.166e+03 1.634e+03 3.269e+03, threshold=2.333e+03, percent-clipped=13.0 2023-06-27 21:19:51,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-27 21:19:52,363 INFO [train.py:996] (1/4) Epoch 11, batch 11550, loss[loss=0.3684, simple_loss=0.4486, pruned_loss=0.1441, over 21507.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2955, pruned_loss=0.06744, over 4270402.38 frames. ], batch size: 508, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:19:54,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1898976.0, ans=0.125 2023-06-27 21:19:54,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898976.0, ans=0.1 2023-06-27 21:21:26,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1899216.0, ans=0.125 2023-06-27 21:21:38,050 INFO [train.py:996] (1/4) Epoch 11, batch 11600, loss[loss=0.2382, simple_loss=0.3361, pruned_loss=0.07012, over 21447.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3088, pruned_loss=0.06869, over 4271292.88 frames. ], batch size: 194, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 21:21:43,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1899276.0, ans=0.125 2023-06-27 21:22:41,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2023-06-27 21:23:04,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1899516.0, ans=0.1 2023-06-27 21:23:14,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1899516.0, ans=0.0 2023-06-27 21:23:15,073 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.239e+02 7.630e+02 1.389e+03 2.274e+03 4.713e+03, threshold=2.778e+03, percent-clipped=21.0 2023-06-27 21:23:16,791 INFO [train.py:996] (1/4) Epoch 11, batch 11650, loss[loss=0.2355, simple_loss=0.3268, pruned_loss=0.07211, over 21629.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3142, pruned_loss=0.06916, over 4274804.95 frames. 
], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:23:22,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1899576.0, ans=0.0 2023-06-27 21:24:13,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1899696.0, ans=0.0 2023-06-27 21:24:26,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1899756.0, ans=0.2 2023-06-27 21:24:28,828 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-27 21:24:31,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1899756.0, ans=0.1 2023-06-27 21:24:51,514 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.06 vs. limit=10.0 2023-06-27 21:24:53,628 INFO [train.py:996] (1/4) Epoch 11, batch 11700, loss[loss=0.193, simple_loss=0.2715, pruned_loss=0.0573, over 15003.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3057, pruned_loss=0.06796, over 4274097.36 frames. ], batch size: 61, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:25:34,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-27 21:26:09,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1900056.0, ans=0.05 2023-06-27 21:26:12,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1900116.0, ans=0.125 2023-06-27 21:26:16,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-27 21:26:28,325 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.521e+02 7.314e+02 1.090e+03 1.615e+03 2.478e+03, threshold=2.180e+03, percent-clipped=0.0 2023-06-27 21:26:29,959 INFO [train.py:996] (1/4) Epoch 11, batch 11750, loss[loss=0.1786, simple_loss=0.2642, pruned_loss=0.04645, over 19876.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2968, pruned_loss=0.06734, over 4277394.16 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:26:36,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.60 vs. limit=15.0 2023-06-27 21:27:21,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1900296.0, ans=0.5 2023-06-27 21:28:08,581 INFO [train.py:996] (1/4) Epoch 11, batch 11800, loss[loss=0.2298, simple_loss=0.3312, pruned_loss=0.06425, over 21659.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2985, pruned_loss=0.06917, over 4276492.17 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:28:18,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. 
limit=22.5 2023-06-27 21:28:19,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1900476.0, ans=0.125 2023-06-27 21:28:20,123 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-27 21:28:25,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1900536.0, ans=0.125 2023-06-27 21:28:38,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1900536.0, ans=0.0 2023-06-27 21:28:41,391 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-06-27 21:29:37,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1900716.0, ans=0.0 2023-06-27 21:29:44,971 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.871e+02 7.084e+02 9.791e+02 1.465e+03 2.454e+03, threshold=1.958e+03, percent-clipped=4.0 2023-06-27 21:29:46,621 INFO [train.py:996] (1/4) Epoch 11, batch 11850, loss[loss=0.2083, simple_loss=0.2936, pruned_loss=0.06152, over 21629.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.299, pruned_loss=0.06827, over 4277343.15 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:30:17,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1900836.0, ans=0.125 2023-06-27 21:30:30,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1900896.0, ans=0.0 2023-06-27 21:31:06,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1900956.0, ans=0.125 2023-06-27 21:31:25,867 INFO [train.py:996] (1/4) Epoch 11, batch 11900, loss[loss=0.2204, simple_loss=0.337, pruned_loss=0.05196, over 20844.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2998, pruned_loss=0.0661, over 4271640.13 frames. ], batch size: 608, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:31:43,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1901076.0, ans=0.0 2023-06-27 21:32:09,682 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:32:32,536 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=22.5 2023-06-27 21:33:08,449 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.313e+02 5.634e+02 7.548e+02 1.171e+03 3.128e+03, threshold=1.510e+03, percent-clipped=7.0 2023-06-27 21:33:14,927 INFO [train.py:996] (1/4) Epoch 11, batch 11950, loss[loss=0.1693, simple_loss=0.2629, pruned_loss=0.03785, over 21762.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.303, pruned_loss=0.06377, over 4266205.17 frames. ], batch size: 316, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:33:45,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-27 21:34:52,071 INFO [train.py:996] (1/4) Epoch 11, batch 12000, loss[loss=0.1831, simple_loss=0.2537, pruned_loss=0.05628, over 21450.00 frames. 
], tot_loss[loss=0.2102, simple_loss=0.296, pruned_loss=0.06214, over 4261077.28 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 21:34:52,071 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 21:35:12,133 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2616, simple_loss=0.3513, pruned_loss=0.08594, over 1796401.00 frames. 2023-06-27 21:35:12,134 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 21:35:52,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1901736.0, ans=0.125 2023-06-27 21:36:59,874 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.508e+02 6.629e+02 1.038e+03 1.682e+03 4.454e+03, threshold=2.077e+03, percent-clipped=31.0 2023-06-27 21:36:59,919 INFO [train.py:996] (1/4) Epoch 11, batch 12050, loss[loss=0.1956, simple_loss=0.2664, pruned_loss=0.06235, over 21512.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2935, pruned_loss=0.06374, over 4267557.28 frames. ], batch size: 212, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:37:32,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1902036.0, ans=0.0 2023-06-27 21:38:43,461 INFO [train.py:996] (1/4) Epoch 11, batch 12100, loss[loss=0.1876, simple_loss=0.2352, pruned_loss=0.07003, over 20913.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2967, pruned_loss=0.06763, over 4271941.41 frames. ], batch size: 613, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:38:57,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1902276.0, ans=0.2 2023-06-27 21:39:13,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1902336.0, ans=0.125 2023-06-27 21:39:22,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1902396.0, ans=0.125 2023-06-27 21:40:21,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1902516.0, ans=0.0 2023-06-27 21:40:29,113 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.049e+02 8.933e+02 1.358e+03 2.118e+03 4.417e+03, threshold=2.716e+03, percent-clipped=26.0 2023-06-27 21:40:29,144 INFO [train.py:996] (1/4) Epoch 11, batch 12150, loss[loss=0.2236, simple_loss=0.3299, pruned_loss=0.05864, over 21666.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2979, pruned_loss=0.06642, over 4273657.14 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:40:48,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1902636.0, ans=0.0 2023-06-27 21:42:07,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1902816.0, ans=0.09899494936611666 2023-06-27 21:42:09,529 INFO [train.py:996] (1/4) Epoch 11, batch 12200, loss[loss=0.228, simple_loss=0.2836, pruned_loss=0.08624, over 21245.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2944, pruned_loss=0.06628, over 4273269.58 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:42:40,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. 
limit=15.0 2023-06-27 21:43:10,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1903056.0, ans=0.125 2023-06-27 21:43:22,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1903056.0, ans=0.1 2023-06-27 21:43:50,740 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 6.388e+02 1.060e+03 1.811e+03 4.082e+03, threshold=2.119e+03, percent-clipped=7.0 2023-06-27 21:43:50,772 INFO [train.py:996] (1/4) Epoch 11, batch 12250, loss[loss=0.1693, simple_loss=0.2527, pruned_loss=0.04297, over 21691.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2869, pruned_loss=0.0633, over 4271611.06 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:44:17,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1903236.0, ans=0.125 2023-06-27 21:44:30,237 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:44:55,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1903356.0, ans=0.0 2023-06-27 21:45:24,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1903416.0, ans=0.125 2023-06-27 21:45:34,140 INFO [train.py:996] (1/4) Epoch 11, batch 12300, loss[loss=0.2032, simple_loss=0.3186, pruned_loss=0.0439, over 20744.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2806, pruned_loss=0.05888, over 4269810.95 frames. ], batch size: 607, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:46:03,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1903536.0, ans=0.0 2023-06-27 21:46:05,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1903536.0, ans=0.1 2023-06-27 21:47:15,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1903776.0, ans=0.2 2023-06-27 21:47:16,517 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 6.456e+02 1.093e+03 1.764e+03 5.046e+03, threshold=2.186e+03, percent-clipped=16.0 2023-06-27 21:47:16,548 INFO [train.py:996] (1/4) Epoch 11, batch 12350, loss[loss=0.1875, simple_loss=0.2487, pruned_loss=0.06311, over 20994.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.285, pruned_loss=0.05932, over 4264915.99 frames. ], batch size: 608, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:47:41,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1903836.0, ans=0.1 2023-06-27 21:47:53,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1903896.0, ans=0.2 2023-06-27 21:48:28,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1903956.0, ans=0.0 2023-06-27 21:48:57,244 INFO [train.py:996] (1/4) Epoch 11, batch 12400, loss[loss=0.229, simple_loss=0.2892, pruned_loss=0.08441, over 21295.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.287, pruned_loss=0.0626, over 4275965.84 frames. 
], batch size: 176, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 21:49:02,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1904076.0, ans=0.0 2023-06-27 21:49:12,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1904136.0, ans=0.1 2023-06-27 21:50:13,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1904256.0, ans=0.125 2023-06-27 21:50:28,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1904316.0, ans=0.0 2023-06-27 21:50:37,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1904316.0, ans=0.0 2023-06-27 21:50:39,830 INFO [train.py:996] (1/4) Epoch 11, batch 12450, loss[loss=0.2782, simple_loss=0.3449, pruned_loss=0.1058, over 21804.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2911, pruned_loss=0.06568, over 4280051.07 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:50:41,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.549e+02 6.531e+02 8.502e+02 1.313e+03 3.916e+03, threshold=1.700e+03, percent-clipped=4.0 2023-06-27 21:50:56,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1904376.0, ans=0.0 2023-06-27 21:50:56,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-27 21:51:11,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1904436.0, ans=0.0 2023-06-27 21:51:15,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1904436.0, ans=0.125 2023-06-27 21:51:44,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1904496.0, ans=0.0 2023-06-27 21:51:51,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1904556.0, ans=0.125 2023-06-27 21:52:29,345 INFO [train.py:996] (1/4) Epoch 11, batch 12500, loss[loss=0.2042, simple_loss=0.3262, pruned_loss=0.04114, over 20752.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3008, pruned_loss=0.06832, over 4279918.00 frames. 
], batch size: 607, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:52:40,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1904676.0, ans=0.0 2023-06-27 21:53:25,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1904796.0, ans=0.0 2023-06-27 21:53:31,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1904856.0, ans=0.125 2023-06-27 21:53:57,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1904916.0, ans=0.2 2023-06-27 21:54:00,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1904916.0, ans=0.125 2023-06-27 21:54:02,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1904916.0, ans=0.0 2023-06-27 21:54:10,059 INFO [train.py:996] (1/4) Epoch 11, batch 12550, loss[loss=0.2425, simple_loss=0.3178, pruned_loss=0.08364, over 21278.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3047, pruned_loss=0.07002, over 4275893.56 frames. ], batch size: 143, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:54:11,832 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.997e+02 7.151e+02 9.738e+02 1.410e+03 2.995e+03, threshold=1.948e+03, percent-clipped=12.0 2023-06-27 21:54:35,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-06-27 21:54:53,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1905096.0, ans=0.125 2023-06-27 21:55:04,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1905096.0, ans=0.2 2023-06-27 21:55:05,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-06-27 21:55:13,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1905156.0, ans=0.125 2023-06-27 21:55:22,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1905156.0, ans=0.125 2023-06-27 21:55:53,701 INFO [train.py:996] (1/4) Epoch 11, batch 12600, loss[loss=0.1635, simple_loss=0.2367, pruned_loss=0.04508, over 21837.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.3019, pruned_loss=0.06823, over 4266820.02 frames. ], batch size: 98, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:56:09,756 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-27 21:57:03,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1905456.0, ans=0.0 2023-06-27 21:57:16,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1905516.0, ans=0.0 2023-06-27 21:57:30,695 INFO [train.py:996] (1/4) Epoch 11, batch 12650, loss[loss=0.2171, simple_loss=0.2903, pruned_loss=0.07194, over 21401.00 frames. 
], tot_loss[loss=0.2127, simple_loss=0.2946, pruned_loss=0.06537, over 4268039.67 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:57:36,990 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.857e+02 6.084e+02 8.925e+02 1.601e+03 4.127e+03, threshold=1.785e+03, percent-clipped=11.0 2023-06-27 21:59:17,351 INFO [train.py:996] (1/4) Epoch 11, batch 12700, loss[loss=0.2164, simple_loss=0.286, pruned_loss=0.07341, over 21363.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2938, pruned_loss=0.06722, over 4271518.12 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 21:59:19,573 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:59:28,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1905876.0, ans=0.2 2023-06-27 22:00:24,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1906056.0, ans=0.125 2023-06-27 22:00:41,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1906116.0, ans=0.0 2023-06-27 22:00:59,916 INFO [train.py:996] (1/4) Epoch 11, batch 12750, loss[loss=0.2638, simple_loss=0.3415, pruned_loss=0.09306, over 21570.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2959, pruned_loss=0.06788, over 4267585.28 frames. ], batch size: 509, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:01:03,058 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.884e+02 6.384e+02 9.703e+02 1.626e+03 3.460e+03, threshold=1.941e+03, percent-clipped=17.0 2023-06-27 22:01:33,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1906296.0, ans=0.125 2023-06-27 22:02:37,425 INFO [train.py:996] (1/4) Epoch 11, batch 12800, loss[loss=0.2406, simple_loss=0.3205, pruned_loss=0.08038, over 21746.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2965, pruned_loss=0.06845, over 4278346.88 frames. ], batch size: 124, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:04:03,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1906716.0, ans=0.2 2023-06-27 22:04:16,516 INFO [train.py:996] (1/4) Epoch 11, batch 12850, loss[loss=0.2161, simple_loss=0.3006, pruned_loss=0.06575, over 21341.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2987, pruned_loss=0.06922, over 4275141.63 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:04:19,912 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.619e+02 8.381e+02 1.196e+03 2.769e+03, threshold=1.676e+03, percent-clipped=10.0 2023-06-27 22:05:06,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1906896.0, ans=0.125 2023-06-27 22:05:31,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1906956.0, ans=0.125 2023-06-27 22:05:43,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=12.0 2023-06-27 22:05:56,185 INFO [train.py:996] (1/4) Epoch 11, batch 12900, loss[loss=0.2091, simple_loss=0.2979, pruned_loss=0.06013, over 21874.00 frames. 
], tot_loss[loss=0.2134, simple_loss=0.2955, pruned_loss=0.06568, over 4271933.27 frames. ], batch size: 373, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:06:37,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1907196.0, ans=0.125 2023-06-27 22:07:33,017 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-27 22:07:38,240 INFO [train.py:996] (1/4) Epoch 11, batch 12950, loss[loss=0.1509, simple_loss=0.234, pruned_loss=0.0339, over 21217.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2948, pruned_loss=0.06523, over 4276877.36 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:07:46,037 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.286e+02 5.649e+02 7.456e+02 9.840e+02 3.735e+03, threshold=1.491e+03, percent-clipped=7.0 2023-06-27 22:07:55,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1907376.0, ans=0.125 2023-06-27 22:07:55,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-27 22:08:51,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-06-27 22:08:52,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1907556.0, ans=0.125 2023-06-27 22:09:05,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1907616.0, ans=0.5 2023-06-27 22:09:19,447 INFO [train.py:996] (1/4) Epoch 11, batch 13000, loss[loss=0.1912, simple_loss=0.264, pruned_loss=0.05924, over 21126.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2936, pruned_loss=0.06489, over 4272962.93 frames. ], batch size: 608, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:09:33,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1907676.0, ans=0.125 2023-06-27 22:10:21,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1907796.0, ans=0.125 2023-06-27 22:10:37,389 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-27 22:11:06,075 INFO [train.py:996] (1/4) Epoch 11, batch 13050, loss[loss=0.2137, simple_loss=0.2864, pruned_loss=0.07051, over 21779.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2885, pruned_loss=0.06311, over 4274012.65 frames. 
], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:11:09,262 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.155e+02 7.836e+02 1.180e+03 1.629e+03 3.232e+03, threshold=2.361e+03, percent-clipped=34.0 2023-06-27 22:11:21,491 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:11:59,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1908096.0, ans=0.125 2023-06-27 22:12:00,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1908096.0, ans=0.125 2023-06-27 22:12:06,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1908156.0, ans=0.1 2023-06-27 22:12:42,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1908276.0, ans=0.0 2023-06-27 22:12:43,813 INFO [train.py:996] (1/4) Epoch 11, batch 13100, loss[loss=0.2353, simple_loss=0.3163, pruned_loss=0.07716, over 21483.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2914, pruned_loss=0.06343, over 4275278.21 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:13:14,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1908336.0, ans=0.0 2023-06-27 22:13:59,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1908456.0, ans=0.2 2023-06-27 22:14:11,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1908516.0, ans=0.05 2023-06-27 22:14:19,042 INFO [train.py:996] (1/4) Epoch 11, batch 13150, loss[loss=0.2376, simple_loss=0.3034, pruned_loss=0.08589, over 21620.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2952, pruned_loss=0.06561, over 4273423.66 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:14:22,236 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.348e+02 6.692e+02 9.537e+02 1.354e+03 2.505e+03, threshold=1.907e+03, percent-clipped=1.0 2023-06-27 22:14:50,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1908636.0, ans=0.125 2023-06-27 22:15:07,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1908696.0, ans=0.125 2023-06-27 22:15:46,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1908816.0, ans=0.1 2023-06-27 22:16:12,338 INFO [train.py:996] (1/4) Epoch 11, batch 13200, loss[loss=0.2321, simple_loss=0.3049, pruned_loss=0.07966, over 21674.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2946, pruned_loss=0.06587, over 4278052.34 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:16:16,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1908876.0, ans=0.0 2023-06-27 22:16:28,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1908936.0, ans=0.125 2023-06-27 22:16:43,599 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. 
limit=15.0 2023-06-27 22:17:50,826 INFO [train.py:996] (1/4) Epoch 11, batch 13250, loss[loss=0.2067, simple_loss=0.2956, pruned_loss=0.05891, over 21844.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2945, pruned_loss=0.06732, over 4280493.08 frames. ], batch size: 316, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:17:55,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.059e+02 8.358e+02 1.341e+03 1.799e+03 2.954e+03, threshold=2.682e+03, percent-clipped=21.0 2023-06-27 22:18:09,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1909236.0, ans=0.125 2023-06-27 22:18:21,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1909236.0, ans=0.125 2023-06-27 22:18:36,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1909296.0, ans=0.5 2023-06-27 22:19:29,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1909416.0, ans=0.0 2023-06-27 22:19:34,351 INFO [train.py:996] (1/4) Epoch 11, batch 13300, loss[loss=0.2324, simple_loss=0.3144, pruned_loss=0.07519, over 21764.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2966, pruned_loss=0.06738, over 4285229.92 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:19:43,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1909476.0, ans=0.125 2023-06-27 22:19:45,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1909476.0, ans=0.1 2023-06-27 22:19:48,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1909476.0, ans=0.125 2023-06-27 22:20:23,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1909596.0, ans=0.125 2023-06-27 22:20:28,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1909596.0, ans=0.125 2023-06-27 22:20:38,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1909656.0, ans=10.0 2023-06-27 22:20:51,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1909656.0, ans=0.125 2023-06-27 22:20:57,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1909656.0, ans=0.95 2023-06-27 22:21:06,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1909716.0, ans=0.125 2023-06-27 22:21:18,948 INFO [train.py:996] (1/4) Epoch 11, batch 13350, loss[loss=0.2681, simple_loss=0.3459, pruned_loss=0.09518, over 21436.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.301, pruned_loss=0.07025, over 4282668.23 frames. 
], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:21:23,889 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 8.555e+02 1.217e+03 1.843e+03 4.034e+03, threshold=2.434e+03, percent-clipped=8.0 2023-06-27 22:21:34,984 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-27 22:21:51,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2023-06-27 22:21:52,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-27 22:22:44,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1910016.0, ans=0.125 2023-06-27 22:23:00,812 INFO [train.py:996] (1/4) Epoch 11, batch 13400, loss[loss=0.2397, simple_loss=0.3098, pruned_loss=0.08476, over 21831.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3015, pruned_loss=0.07158, over 4270243.72 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:23:18,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-27 22:24:42,422 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:24:43,493 INFO [train.py:996] (1/4) Epoch 11, batch 13450, loss[loss=0.2012, simple_loss=0.2668, pruned_loss=0.06785, over 16899.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.304, pruned_loss=0.07317, over 4269137.17 frames. ], batch size: 60, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:24:52,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 6.376e+02 8.067e+02 1.099e+03 2.577e+03, threshold=1.613e+03, percent-clipped=1.0 2023-06-27 22:24:55,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1910376.0, ans=0.125 2023-06-27 22:25:07,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1910436.0, ans=0.0 2023-06-27 22:25:48,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1910496.0, ans=0.125 2023-06-27 22:26:17,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1910616.0, ans=0.125 2023-06-27 22:26:31,877 INFO [train.py:996] (1/4) Epoch 11, batch 13500, loss[loss=0.1648, simple_loss=0.2235, pruned_loss=0.05301, over 21345.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2965, pruned_loss=0.07025, over 4269074.47 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:27:34,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1910856.0, ans=0.0 2023-06-27 22:28:11,316 INFO [train.py:996] (1/4) Epoch 11, batch 13550, loss[loss=0.2775, simple_loss=0.374, pruned_loss=0.0905, over 21688.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2997, pruned_loss=0.0695, over 4275359.77 frames. 
], batch size: 414, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:28:16,104 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.678e+02 8.007e+02 1.277e+03 1.961e+03 4.546e+03, threshold=2.554e+03, percent-clipped=33.0 2023-06-27 22:28:39,625 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:28:41,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1911036.0, ans=0.125 2023-06-27 22:28:41,220 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:28:54,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1911036.0, ans=0.125 2023-06-27 22:29:08,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. limit=6.0 2023-06-27 22:29:24,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-27 22:29:35,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1911216.0, ans=0.035 2023-06-27 22:29:49,629 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.09 vs. limit=10.0 2023-06-27 22:29:53,214 INFO [train.py:996] (1/4) Epoch 11, batch 13600, loss[loss=0.1855, simple_loss=0.262, pruned_loss=0.0545, over 21768.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2979, pruned_loss=0.06955, over 4273302.03 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:29:57,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1911276.0, ans=0.0 2023-06-27 22:30:31,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911336.0, ans=0.1 2023-06-27 22:30:39,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1911396.0, ans=0.0 2023-06-27 22:30:45,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1911396.0, ans=0.0 2023-06-27 22:31:26,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1911516.0, ans=0.95 2023-06-27 22:31:26,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911516.0, ans=0.1 2023-06-27 22:31:34,439 INFO [train.py:996] (1/4) Epoch 11, batch 13650, loss[loss=0.1823, simple_loss=0.2433, pruned_loss=0.06064, over 21344.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2923, pruned_loss=0.06655, over 4272441.92 frames. 
], batch size: 211, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:31:45,846 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.126e+02 5.497e+02 8.318e+02 1.364e+03 3.376e+03, threshold=1.664e+03, percent-clipped=5.0 2023-06-27 22:32:18,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1911696.0, ans=0.125 2023-06-27 22:32:35,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1911756.0, ans=0.125 2023-06-27 22:32:43,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1911756.0, ans=10.0 2023-06-27 22:32:45,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911756.0, ans=0.1 2023-06-27 22:32:55,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1911816.0, ans=0.125 2023-06-27 22:32:58,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1911816.0, ans=0.0 2023-06-27 22:33:13,812 INFO [train.py:996] (1/4) Epoch 11, batch 13700, loss[loss=0.1845, simple_loss=0.2548, pruned_loss=0.05708, over 21232.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2889, pruned_loss=0.06596, over 4263473.49 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:34:00,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1911996.0, ans=0.125 2023-06-27 22:34:00,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1911996.0, ans=0.125 2023-06-27 22:34:38,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1912116.0, ans=0.1 2023-06-27 22:34:40,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1912116.0, ans=0.0 2023-06-27 22:35:01,858 INFO [train.py:996] (1/4) Epoch 11, batch 13750, loss[loss=0.1798, simple_loss=0.2413, pruned_loss=0.05917, over 21228.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.286, pruned_loss=0.06439, over 4260480.80 frames. 
], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:35:10,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1912176.0, ans=10.0 2023-06-27 22:35:13,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.554e+02 7.211e+02 1.142e+03 1.644e+03 3.975e+03, threshold=2.283e+03, percent-clipped=24.0 2023-06-27 22:35:57,866 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:36:00,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1912296.0, ans=0.0 2023-06-27 22:36:06,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1912356.0, ans=0.0 2023-06-27 22:36:40,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1912416.0, ans=0.0 2023-06-27 22:36:47,741 INFO [train.py:996] (1/4) Epoch 11, batch 13800, loss[loss=0.2543, simple_loss=0.361, pruned_loss=0.0738, over 21691.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2908, pruned_loss=0.06445, over 4266363.86 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:38:31,500 INFO [train.py:996] (1/4) Epoch 11, batch 13850, loss[loss=0.2246, simple_loss=0.3065, pruned_loss=0.07135, over 21771.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2982, pruned_loss=0.06509, over 4259735.88 frames. ], batch size: 124, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:38:36,847 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:38:38,141 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 6.806e+02 9.243e+02 1.369e+03 3.206e+03, threshold=1.849e+03, percent-clipped=7.0 2023-06-27 22:38:38,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1912776.0, ans=0.035 2023-06-27 22:38:49,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1912776.0, ans=0.0 2023-06-27 22:39:04,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1912836.0, ans=0.0 2023-06-27 22:39:19,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=15.0 2023-06-27 22:39:35,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1912956.0, ans=0.05 2023-06-27 22:40:12,062 INFO [train.py:996] (1/4) Epoch 11, batch 13900, loss[loss=0.2096, simple_loss=0.2877, pruned_loss=0.06578, over 21906.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3011, pruned_loss=0.0679, over 4266955.57 frames. ], batch size: 316, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:40:35,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1913136.0, ans=0.125 2023-06-27 22:40:46,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. 
limit=15.0 2023-06-27 22:40:48,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1913196.0, ans=0.0 2023-06-27 22:41:03,879 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2023-06-27 22:41:36,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1913316.0, ans=0.1 2023-06-27 22:41:40,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1913316.0, ans=0.0 2023-06-27 22:41:49,825 INFO [train.py:996] (1/4) Epoch 11, batch 13950, loss[loss=0.219, simple_loss=0.2824, pruned_loss=0.07775, over 19986.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.301, pruned_loss=0.0701, over 4272409.29 frames. ], batch size: 703, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:42:00,997 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.608e+02 7.245e+02 1.112e+03 1.601e+03 2.924e+03, threshold=2.224e+03, percent-clipped=16.0 2023-06-27 22:42:21,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-27 22:43:10,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1913556.0, ans=0.2 2023-06-27 22:43:29,847 INFO [train.py:996] (1/4) Epoch 11, batch 14000, loss[loss=0.1788, simple_loss=0.2665, pruned_loss=0.04553, over 21584.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.297, pruned_loss=0.06834, over 4272743.20 frames. ], batch size: 230, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:45:00,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1913916.0, ans=0.0 2023-06-27 22:45:08,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1913916.0, ans=0.1 2023-06-27 22:45:15,875 INFO [train.py:996] (1/4) Epoch 11, batch 14050, loss[loss=0.1808, simple_loss=0.2561, pruned_loss=0.0528, over 21673.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2936, pruned_loss=0.06545, over 4268607.93 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:45:22,437 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.606e+02 1.013e+03 1.561e+03 3.162e+03, threshold=2.026e+03, percent-clipped=9.0 2023-06-27 22:45:26,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1913976.0, ans=0.0 2023-06-27 22:46:57,094 INFO [train.py:996] (1/4) Epoch 11, batch 14100, loss[loss=0.2136, simple_loss=0.2819, pruned_loss=0.07271, over 19899.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2866, pruned_loss=0.06493, over 4262497.66 frames. 
], batch size: 702, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:47:00,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1914276.0, ans=0.125 2023-06-27 22:47:44,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1914396.0, ans=0.1 2023-06-27 22:47:54,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1914456.0, ans=0.1 2023-06-27 22:48:00,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1914456.0, ans=0.2 2023-06-27 22:48:31,949 INFO [train.py:996] (1/4) Epoch 11, batch 14150, loss[loss=0.2224, simple_loss=0.3384, pruned_loss=0.05324, over 19837.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2904, pruned_loss=0.06605, over 4241814.36 frames. ], batch size: 702, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:48:35,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1914576.0, ans=0.05 2023-06-27 22:48:44,411 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.197e+02 6.392e+02 8.340e+02 1.310e+03 2.692e+03, threshold=1.668e+03, percent-clipped=6.0 2023-06-27 22:48:49,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1914576.0, ans=0.125 2023-06-27 22:49:11,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1914696.0, ans=0.1 2023-06-27 22:49:21,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1914696.0, ans=0.0 2023-06-27 22:50:10,255 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-27 22:50:10,690 INFO [train.py:996] (1/4) Epoch 11, batch 14200, loss[loss=0.1863, simple_loss=0.2696, pruned_loss=0.05145, over 21820.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2902, pruned_loss=0.06495, over 4243704.07 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:50:17,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1914876.0, ans=0.025 2023-06-27 22:50:43,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=15.0 2023-06-27 22:50:51,594 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:50:57,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1914996.0, ans=0.1 2023-06-27 22:51:40,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.31 vs. limit=15.0 2023-06-27 22:51:50,595 INFO [train.py:996] (1/4) Epoch 11, batch 14250, loss[loss=0.1674, simple_loss=0.2515, pruned_loss=0.04162, over 21663.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2847, pruned_loss=0.06448, over 4246944.44 frames. 
], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:51:59,424 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.518e+02 7.474e+02 9.961e+02 1.736e+03 2.961e+03, threshold=1.992e+03, percent-clipped=26.0 2023-06-27 22:52:05,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1915176.0, ans=0.0 2023-06-27 22:52:21,255 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-27 22:52:22,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1915236.0, ans=0.125 2023-06-27 22:52:36,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1915296.0, ans=0.2 2023-06-27 22:52:58,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.18 vs. limit=22.5 2023-06-27 22:53:35,416 INFO [train.py:996] (1/4) Epoch 11, batch 14300, loss[loss=0.307, simple_loss=0.4177, pruned_loss=0.09816, over 21256.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2872, pruned_loss=0.06378, over 4233710.37 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:53:38,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1915476.0, ans=0.1 2023-06-27 22:53:49,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1915476.0, ans=0.2 2023-06-27 22:53:56,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1915536.0, ans=0.2 2023-06-27 22:54:27,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1915596.0, ans=0.0 2023-06-27 22:54:55,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-27 22:55:09,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1915716.0, ans=0.2 2023-06-27 22:55:18,183 INFO [train.py:996] (1/4) Epoch 11, batch 14350, loss[loss=0.2096, simple_loss=0.295, pruned_loss=0.06216, over 21861.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2928, pruned_loss=0.06442, over 4236997.88 frames. ], batch size: 371, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:55:27,947 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.575e+02 7.291e+02 1.107e+03 2.245e+03 6.428e+03, threshold=2.214e+03, percent-clipped=28.0 2023-06-27 22:55:48,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1915836.0, ans=0.125 2023-06-27 22:55:52,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-27 22:56:08,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-27 22:56:16,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. 
limit=12.0 2023-06-27 22:56:23,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-27 22:56:23,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-27 22:56:50,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1916016.0, ans=0.125 2023-06-27 22:56:59,073 INFO [train.py:996] (1/4) Epoch 11, batch 14400, loss[loss=0.2185, simple_loss=0.2826, pruned_loss=0.0772, over 21691.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2906, pruned_loss=0.06486, over 4241831.82 frames. ], batch size: 414, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:57:31,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1916136.0, ans=0.05 2023-06-27 22:58:12,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1916256.0, ans=0.0 2023-06-27 22:58:13,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-27 22:58:37,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1916316.0, ans=0.2 2023-06-27 22:58:38,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=22.5 2023-06-27 22:58:40,445 INFO [train.py:996] (1/4) Epoch 11, batch 14450, loss[loss=0.1939, simple_loss=0.2623, pruned_loss=0.06277, over 21306.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2875, pruned_loss=0.06565, over 4254053.99 frames. ], batch size: 177, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:58:42,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1916376.0, ans=0.125 2023-06-27 22:58:50,302 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.736e+02 6.824e+02 1.004e+03 1.771e+03 3.739e+03, threshold=2.008e+03, percent-clipped=15.0 2023-06-27 22:58:51,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.39 vs. limit=12.0 2023-06-27 22:58:52,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1916376.0, ans=0.0 2023-06-27 22:59:11,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-27 23:00:07,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1916616.0, ans=0.2 2023-06-27 23:00:21,612 INFO [train.py:996] (1/4) Epoch 11, batch 14500, loss[loss=0.1996, simple_loss=0.297, pruned_loss=0.05105, over 19991.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2827, pruned_loss=0.06493, over 4259377.57 frames. 
], batch size: 703, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 23:00:51,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1916736.0, ans=0.1 2023-06-27 23:00:54,974 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:01:15,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1916796.0, ans=0.035 2023-06-27 23:01:25,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1916856.0, ans=15.0 2023-06-27 23:02:04,595 INFO [train.py:996] (1/4) Epoch 11, batch 14550, loss[loss=0.2429, simple_loss=0.3222, pruned_loss=0.08179, over 21691.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.288, pruned_loss=0.06649, over 4260092.86 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 23:02:14,903 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.849e+02 9.219e+02 1.443e+03 4.541e+03, threshold=1.844e+03, percent-clipped=15.0 2023-06-27 23:02:53,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1917096.0, ans=0.0 2023-06-27 23:02:56,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1917096.0, ans=0.0 2023-06-27 23:03:25,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1917156.0, ans=0.125 2023-06-27 23:03:48,450 INFO [train.py:996] (1/4) Epoch 11, batch 14600, loss[loss=0.2372, simple_loss=0.3307, pruned_loss=0.07182, over 21743.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2948, pruned_loss=0.07004, over 4259126.50 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:03:52,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1917276.0, ans=0.125 2023-06-27 23:04:07,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1917276.0, ans=0.2 2023-06-27 23:05:20,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=22.5 2023-06-27 23:05:31,381 INFO [train.py:996] (1/4) Epoch 11, batch 14650, loss[loss=0.2478, simple_loss=0.3205, pruned_loss=0.08752, over 21386.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2968, pruned_loss=0.06936, over 4264345.85 frames. 
], batch size: 549, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:05:38,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1917576.0, ans=0.2 2023-06-27 23:05:38,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=1917576.0, ans=12.0 2023-06-27 23:05:43,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1917576.0, ans=0.125 2023-06-27 23:05:45,911 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.373e+02 8.214e+02 1.374e+03 1.981e+03 3.761e+03, threshold=2.748e+03, percent-clipped=28.0 2023-06-27 23:06:45,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-27 23:06:51,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1917756.0, ans=0.125 2023-06-27 23:06:55,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1917756.0, ans=0.125 2023-06-27 23:07:19,698 INFO [train.py:996] (1/4) Epoch 11, batch 14700, loss[loss=0.1572, simple_loss=0.2499, pruned_loss=0.03226, over 21663.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2915, pruned_loss=0.06449, over 4257812.10 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:07:20,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1917876.0, ans=0.0 2023-06-27 23:07:33,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1917876.0, ans=0.025 2023-06-27 23:07:56,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1917936.0, ans=0.0 2023-06-27 23:08:32,984 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.68 vs. limit=12.0 2023-06-27 23:09:03,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1918176.0, ans=0.1 2023-06-27 23:09:04,353 INFO [train.py:996] (1/4) Epoch 11, batch 14750, loss[loss=0.2162, simple_loss=0.2686, pruned_loss=0.0819, over 20080.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2954, pruned_loss=0.06595, over 4264787.11 frames. ], batch size: 703, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:09:08,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1918176.0, ans=0.1 2023-06-27 23:09:14,865 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.274e+02 6.652e+02 9.504e+02 1.333e+03 3.432e+03, threshold=1.901e+03, percent-clipped=1.0 2023-06-27 23:09:57,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1918296.0, ans=0.125 2023-06-27 23:09:58,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.52 vs. 
limit=15.0 2023-06-27 23:10:06,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1918296.0, ans=0.2 2023-06-27 23:10:06,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1918296.0, ans=0.0 2023-06-27 23:10:39,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1918416.0, ans=0.2 2023-06-27 23:10:48,842 INFO [train.py:996] (1/4) Epoch 11, batch 14800, loss[loss=0.2067, simple_loss=0.2794, pruned_loss=0.06698, over 21471.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3068, pruned_loss=0.07128, over 4262780.63 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:11:44,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-27 23:11:46,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1918596.0, ans=0.125 2023-06-27 23:11:50,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1918596.0, ans=0.1 2023-06-27 23:12:11,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1918656.0, ans=0.125 2023-06-27 23:12:43,481 INFO [train.py:996] (1/4) Epoch 11, batch 14850, loss[loss=0.2305, simple_loss=0.3105, pruned_loss=0.07532, over 21823.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.301, pruned_loss=0.07076, over 4257880.85 frames. ], batch size: 372, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:12:59,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1918776.0, ans=0.125 2023-06-27 23:13:00,843 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 8.785e+02 1.252e+03 1.775e+03 4.444e+03, threshold=2.503e+03, percent-clipped=22.0 2023-06-27 23:13:04,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1918836.0, ans=0.1 2023-06-27 23:13:08,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1918836.0, ans=0.125 2023-06-27 23:13:10,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1918836.0, ans=0.0 2023-06-27 23:13:11,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1918836.0, ans=0.2 2023-06-27 23:13:23,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1918896.0, ans=0.07 2023-06-27 23:13:43,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-27 23:14:11,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1919016.0, ans=0.125 2023-06-27 23:14:32,369 INFO [train.py:996] (1/4) Epoch 11, batch 14900, loss[loss=0.2405, simple_loss=0.3256, pruned_loss=0.07768, over 20682.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3047, pruned_loss=0.07213, over 4257810.48 frames. 
], batch size: 607, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:14:32,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1919076.0, ans=0.0 2023-06-27 23:15:20,150 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=8.0 2023-06-27 23:15:20,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1919196.0, ans=0.035 2023-06-27 23:15:46,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1919256.0, ans=0.125 2023-06-27 23:16:01,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-27 23:16:16,123 INFO [train.py:996] (1/4) Epoch 11, batch 14950, loss[loss=0.2115, simple_loss=0.2948, pruned_loss=0.06413, over 21353.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3066, pruned_loss=0.07182, over 4260482.04 frames. ], batch size: 176, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:16:27,765 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.435e+02 7.906e+02 1.198e+03 1.645e+03 4.202e+03, threshold=2.397e+03, percent-clipped=8.0 2023-06-27 23:16:30,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1919376.0, ans=0.0 2023-06-27 23:16:38,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-27 23:17:12,196 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=22.5 2023-06-27 23:17:27,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1919556.0, ans=0.125 2023-06-27 23:17:37,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1919556.0, ans=0.0 2023-06-27 23:17:56,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-27 23:17:58,257 INFO [train.py:996] (1/4) Epoch 11, batch 15000, loss[loss=0.2506, simple_loss=0.3646, pruned_loss=0.06832, over 20740.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3077, pruned_loss=0.07283, over 4271107.07 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:17:58,258 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-27 23:18:18,450 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2534, simple_loss=0.3437, pruned_loss=0.08155, over 1796401.00 frames. 2023-06-27 23:18:18,451 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-27 23:18:47,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-27 23:19:07,562 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.33 vs. 
limit=15.0 2023-06-27 23:19:25,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1919856.0, ans=0.1 2023-06-27 23:19:50,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1919916.0, ans=0.2 2023-06-27 23:20:03,430 INFO [train.py:996] (1/4) Epoch 11, batch 15050, loss[loss=0.2193, simple_loss=0.3108, pruned_loss=0.06389, over 21708.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3066, pruned_loss=0.07343, over 4269799.29 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:20:12,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1919976.0, ans=0.125 2023-06-27 23:20:17,264 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.821e+02 6.928e+02 9.435e+02 1.433e+03 3.639e+03, threshold=1.887e+03, percent-clipped=3.0 2023-06-27 23:20:43,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1920036.0, ans=0.125 2023-06-27 23:20:43,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1920036.0, ans=0.125 2023-06-27 23:21:00,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1920096.0, ans=0.125 2023-06-27 23:21:06,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1920096.0, ans=0.125 2023-06-27 23:21:49,513 INFO [train.py:996] (1/4) Epoch 11, batch 15100, loss[loss=0.2352, simple_loss=0.3234, pruned_loss=0.07349, over 21858.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3086, pruned_loss=0.07346, over 4269596.09 frames. ], batch size: 371, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:23:30,439 INFO [train.py:996] (1/4) Epoch 11, batch 15150, loss[loss=0.207, simple_loss=0.2698, pruned_loss=0.07207, over 21311.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3053, pruned_loss=0.07381, over 4267590.08 frames. ], batch size: 549, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:23:33,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-27 23:23:46,525 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.704e+02 7.406e+02 1.033e+03 1.604e+03 3.709e+03, threshold=2.066e+03, percent-clipped=14.0 2023-06-27 23:24:16,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1920696.0, ans=0.0 2023-06-27 23:24:56,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1920816.0, ans=0.125 2023-06-27 23:25:12,944 INFO [train.py:996] (1/4) Epoch 11, batch 15200, loss[loss=0.1841, simple_loss=0.2703, pruned_loss=0.049, over 21799.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2974, pruned_loss=0.07035, over 4269600.24 frames. 
], batch size: 317, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:25:39,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1920936.0, ans=0.125 2023-06-27 23:26:20,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1920996.0, ans=0.125 2023-06-27 23:26:23,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1921056.0, ans=0.0 2023-06-27 23:26:36,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-27 23:27:01,186 INFO [train.py:996] (1/4) Epoch 11, batch 15250, loss[loss=0.2036, simple_loss=0.2607, pruned_loss=0.07328, over 21318.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.293, pruned_loss=0.06923, over 4268465.36 frames. ], batch size: 473, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:27:02,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-27 23:27:23,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.342e+02 7.886e+02 1.142e+03 1.659e+03 3.992e+03, threshold=2.285e+03, percent-clipped=18.0 2023-06-27 23:28:42,800 INFO [train.py:996] (1/4) Epoch 11, batch 15300, loss[loss=0.2264, simple_loss=0.2989, pruned_loss=0.07695, over 21600.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2943, pruned_loss=0.07098, over 4270436.35 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:28:50,399 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-27 23:28:54,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1921476.0, ans=0.125 2023-06-27 23:29:10,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1921536.0, ans=0.125 2023-06-27 23:29:14,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1921536.0, ans=0.125 2023-06-27 23:29:27,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1921596.0, ans=0.125 2023-06-27 23:30:29,666 INFO [train.py:996] (1/4) Epoch 11, batch 15350, loss[loss=0.1968, simple_loss=0.2624, pruned_loss=0.06559, over 20780.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.299, pruned_loss=0.07324, over 4273242.36 frames. ], batch size: 609, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:30:47,340 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.758e+02 7.708e+02 1.113e+03 1.589e+03 3.642e+03, threshold=2.225e+03, percent-clipped=6.0 2023-06-27 23:31:35,306 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:31:51,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1922016.0, ans=0.1 2023-06-27 23:32:05,686 INFO [train.py:996] (1/4) Epoch 11, batch 15400, loss[loss=0.2464, simple_loss=0.3595, pruned_loss=0.06667, over 19763.00 frames. 
], tot_loss[loss=0.2208, simple_loss=0.299, pruned_loss=0.07131, over 4276826.33 frames. ], batch size: 703, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:33:28,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1922316.0, ans=0.1 2023-06-27 23:33:47,727 INFO [train.py:996] (1/4) Epoch 11, batch 15450, loss[loss=0.2019, simple_loss=0.2897, pruned_loss=0.05704, over 21763.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2981, pruned_loss=0.07078, over 4283225.70 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:34:08,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1922376.0, ans=0.04949747468305833 2023-06-27 23:34:10,730 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.367e+02 6.968e+02 9.606e+02 1.449e+03 2.613e+03, threshold=1.921e+03, percent-clipped=5.0 2023-06-27 23:34:29,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1922496.0, ans=0.0 2023-06-27 23:34:59,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1922556.0, ans=0.1 2023-06-27 23:35:18,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1922616.0, ans=0.125 2023-06-27 23:35:34,400 INFO [train.py:996] (1/4) Epoch 11, batch 15500, loss[loss=0.195, simple_loss=0.2798, pruned_loss=0.05512, over 16176.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3012, pruned_loss=0.07049, over 4275779.79 frames. ], batch size: 60, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:36:16,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1922796.0, ans=0.125 2023-06-27 23:36:31,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1922796.0, ans=0.0 2023-06-27 23:36:57,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1922916.0, ans=0.95 2023-06-27 23:37:05,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1922916.0, ans=0.125 2023-06-27 23:37:21,917 INFO [train.py:996] (1/4) Epoch 11, batch 15550, loss[loss=0.1896, simple_loss=0.271, pruned_loss=0.05408, over 21695.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2992, pruned_loss=0.06789, over 4267324.48 frames. 
], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:37:24,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1922976.0, ans=0.0 2023-06-27 23:37:32,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1922976.0, ans=0.1 2023-06-27 23:37:34,973 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.660e+02 9.717e+02 1.306e+03 2.635e+03, threshold=1.943e+03, percent-clipped=6.0 2023-06-27 23:37:37,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1923036.0, ans=0.125 2023-06-27 23:37:57,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1923036.0, ans=0.125 2023-06-27 23:38:31,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1923156.0, ans=0.125 2023-06-27 23:38:38,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1923216.0, ans=0.0 2023-06-27 23:38:40,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1923216.0, ans=0.125 2023-06-27 23:38:47,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.93 vs. limit=6.0 2023-06-27 23:39:02,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1923276.0, ans=0.0 2023-06-27 23:39:03,616 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-27 23:39:03,934 INFO [train.py:996] (1/4) Epoch 11, batch 15600, loss[loss=0.2454, simple_loss=0.3087, pruned_loss=0.09108, over 21371.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2937, pruned_loss=0.06648, over 4264338.45 frames. ], batch size: 508, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:39:41,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1923396.0, ans=0.125 2023-06-27 23:39:47,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-27 23:40:45,235 INFO [train.py:996] (1/4) Epoch 11, batch 15650, loss[loss=0.187, simple_loss=0.2594, pruned_loss=0.05726, over 21752.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2916, pruned_loss=0.06575, over 4263519.30 frames. 
], batch size: 124, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:40:50,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1923576.0, ans=0.0 2023-06-27 23:40:56,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1923576.0, ans=0.0 2023-06-27 23:41:03,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.415e+02 8.795e+02 1.290e+03 1.896e+03 3.786e+03, threshold=2.580e+03, percent-clipped=24.0 2023-06-27 23:41:13,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1923636.0, ans=0.0 2023-06-27 23:41:27,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-27 23:41:29,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0 2023-06-27 23:41:33,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1923696.0, ans=0.125 2023-06-27 23:41:55,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1923756.0, ans=0.1 2023-06-27 23:42:16,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1923816.0, ans=22.5 2023-06-27 23:42:18,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-27 23:42:25,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-27 23:42:27,306 INFO [train.py:996] (1/4) Epoch 11, batch 15700, loss[loss=0.2001, simple_loss=0.2634, pruned_loss=0.06836, over 21279.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2872, pruned_loss=0.06517, over 4252210.33 frames. ], batch size: 144, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:43:16,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1923996.0, ans=0.2 2023-06-27 23:43:30,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1924056.0, ans=0.2 2023-06-27 23:43:30,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1924056.0, ans=0.125 2023-06-27 23:43:33,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1924056.0, ans=0.125 2023-06-27 23:44:04,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-27 23:44:08,202 INFO [train.py:996] (1/4) Epoch 11, batch 15750, loss[loss=0.238, simple_loss=0.288, pruned_loss=0.09396, over 21401.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2833, pruned_loss=0.06496, over 4258652.01 frames. 
], batch size: 508, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:44:21,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1924176.0, ans=0.0 2023-06-27 23:44:27,430 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.257e+02 5.974e+02 8.242e+02 1.132e+03 2.648e+03, threshold=1.648e+03, percent-clipped=1.0 2023-06-27 23:44:49,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1924296.0, ans=0.1 2023-06-27 23:44:58,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-06-27 23:45:49,094 INFO [train.py:996] (1/4) Epoch 11, batch 15800, loss[loss=0.1953, simple_loss=0.2668, pruned_loss=0.06189, over 21646.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2787, pruned_loss=0.06415, over 4259547.60 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:45:50,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1924476.0, ans=0.02 2023-06-27 23:46:39,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1924596.0, ans=0.125 2023-06-27 23:47:32,269 INFO [train.py:996] (1/4) Epoch 11, batch 15850, loss[loss=0.1779, simple_loss=0.2258, pruned_loss=0.06499, over 20013.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.28, pruned_loss=0.06591, over 4259335.27 frames. ], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:47:52,236 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.544e+02 6.551e+02 9.403e+02 1.336e+03 2.589e+03, threshold=1.881e+03, percent-clipped=10.0 2023-06-27 23:48:58,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1925016.0, ans=0.125 2023-06-27 23:49:15,203 INFO [train.py:996] (1/4) Epoch 11, batch 15900, loss[loss=0.211, simple_loss=0.2858, pruned_loss=0.06813, over 21715.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2785, pruned_loss=0.06679, over 4260018.48 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:49:49,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-27 23:49:52,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1925196.0, ans=0.1 2023-06-27 23:49:57,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1925196.0, ans=0.0 2023-06-27 23:50:02,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-27 23:50:46,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1925316.0, ans=0.125 2023-06-27 23:50:57,576 INFO [train.py:996] (1/4) Epoch 11, batch 15950, loss[loss=0.1979, simple_loss=0.2819, pruned_loss=0.05692, over 21825.00 frames. ], tot_loss[loss=0.204, simple_loss=0.279, pruned_loss=0.06452, over 4265098.14 frames. 
], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:51:03,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-06-27 23:51:17,432 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.351e+02 7.372e+02 1.063e+03 1.688e+03 3.100e+03, threshold=2.125e+03, percent-clipped=16.0 2023-06-27 23:51:28,365 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-27 23:51:34,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1925496.0, ans=0.07 2023-06-27 23:51:53,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1925556.0, ans=0.125 2023-06-27 23:52:15,004 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:52:32,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1925616.0, ans=0.125 2023-06-27 23:52:40,050 INFO [train.py:996] (1/4) Epoch 11, batch 16000, loss[loss=0.149, simple_loss=0.2386, pruned_loss=0.02965, over 21426.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2802, pruned_loss=0.06261, over 4265485.59 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:52:44,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1925676.0, ans=0.0 2023-06-27 23:52:57,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1925676.0, ans=0.125 2023-06-27 23:53:14,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1925736.0, ans=0.125 2023-06-27 23:53:26,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1925796.0, ans=0.125 2023-06-27 23:54:07,162 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-27 23:54:17,618 INFO [train.py:996] (1/4) Epoch 11, batch 16050, loss[loss=0.1934, simple_loss=0.2905, pruned_loss=0.04821, over 21665.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2843, pruned_loss=0.06147, over 4266570.03 frames. 
], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:54:18,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1925976.0, ans=0.125 2023-06-27 23:54:43,011 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.281e+02 6.057e+02 9.389e+02 1.429e+03 3.235e+03, threshold=1.878e+03, percent-clipped=6.0 2023-06-27 23:54:45,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1926036.0, ans=0.125 2023-06-27 23:54:54,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1926036.0, ans=0.015 2023-06-27 23:54:54,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1926036.0, ans=0.0 2023-06-27 23:55:20,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1926156.0, ans=0.1 2023-06-27 23:55:38,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1926216.0, ans=0.1 2023-06-27 23:55:56,783 INFO [train.py:996] (1/4) Epoch 11, batch 16100, loss[loss=0.2091, simple_loss=0.2873, pruned_loss=0.06549, over 21904.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.288, pruned_loss=0.06249, over 4278291.33 frames. ], batch size: 371, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:56:10,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1926276.0, ans=0.125 2023-06-27 23:56:10,775 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-27 23:56:29,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1926336.0, ans=0.125 2023-06-27 23:56:31,487 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-27 23:57:37,661 INFO [train.py:996] (1/4) Epoch 11, batch 16150, loss[loss=0.2032, simple_loss=0.2878, pruned_loss=0.05931, over 21796.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2888, pruned_loss=0.06358, over 4284170.85 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:58:03,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.374e+02 7.210e+02 1.100e+03 1.545e+03 2.941e+03, threshold=2.200e+03, percent-clipped=14.0 2023-06-27 23:58:09,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1926636.0, ans=0.125 2023-06-27 23:58:38,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1926756.0, ans=0.125 2023-06-27 23:59:19,446 INFO [train.py:996] (1/4) Epoch 11, batch 16200, loss[loss=0.1894, simple_loss=0.2344, pruned_loss=0.07215, over 20303.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2926, pruned_loss=0.06474, over 4283119.40 frames. 
], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:59:48,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1926936.0, ans=0.1 2023-06-27 23:59:53,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1926936.0, ans=0.125 2023-06-28 00:00:48,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1927116.0, ans=0.0 2023-06-28 00:01:06,256 INFO [train.py:996] (1/4) Epoch 11, batch 16250, loss[loss=0.1598, simple_loss=0.2444, pruned_loss=0.03756, over 21797.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2918, pruned_loss=0.06442, over 4282819.24 frames. ], batch size: 282, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:01:27,646 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.408e+02 8.189e+02 1.172e+03 1.830e+03 4.029e+03, threshold=2.343e+03, percent-clipped=14.0 2023-06-28 00:02:02,082 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-28 00:02:53,056 INFO [train.py:996] (1/4) Epoch 11, batch 16300, loss[loss=0.1749, simple_loss=0.2546, pruned_loss=0.04761, over 21394.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2856, pruned_loss=0.0613, over 4281522.81 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:03:18,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1927536.0, ans=0.125 2023-06-28 00:03:23,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1927536.0, ans=0.2 2023-06-28 00:03:40,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1927596.0, ans=0.125 2023-06-28 00:04:14,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1927716.0, ans=0.125 2023-06-28 00:04:37,034 INFO [train.py:996] (1/4) Epoch 11, batch 16350, loss[loss=0.2369, simple_loss=0.3086, pruned_loss=0.08257, over 21705.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2855, pruned_loss=0.06201, over 4266252.40 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:04:37,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1927776.0, ans=0.0 2023-06-28 00:04:53,539 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.626e+02 5.951e+02 8.785e+02 1.347e+03 2.273e+03, threshold=1.757e+03, percent-clipped=0.0 2023-06-28 00:05:01,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1927836.0, ans=0.125 2023-06-28 00:05:20,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. 
limit=15.0 2023-06-28 00:05:29,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1927896.0, ans=0.1 2023-06-28 00:05:55,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1928016.0, ans=0.2 2023-06-28 00:06:15,110 INFO [train.py:996] (1/4) Epoch 11, batch 16400, loss[loss=0.1977, simple_loss=0.2732, pruned_loss=0.06109, over 21492.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2894, pruned_loss=0.06342, over 4263724.08 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 32.0 2023-06-28 00:07:23,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1928256.0, ans=0.1 2023-06-28 00:07:25,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=15.0 2023-06-28 00:07:30,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1928316.0, ans=0.125 2023-06-28 00:07:36,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1928316.0, ans=0.125 2023-06-28 00:07:36,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1928316.0, ans=0.125 2023-06-28 00:07:55,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1928376.0, ans=0.125 2023-06-28 00:07:56,942 INFO [train.py:996] (1/4) Epoch 11, batch 16450, loss[loss=0.2045, simple_loss=0.2912, pruned_loss=0.05891, over 21850.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2889, pruned_loss=0.06435, over 4268842.99 frames. ], batch size: 316, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:08:16,237 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.401e+02 6.665e+02 9.796e+02 1.595e+03 2.942e+03, threshold=1.959e+03, percent-clipped=15.0 2023-06-28 00:08:45,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1928496.0, ans=0.125 2023-06-28 00:09:04,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1928556.0, ans=0.125 2023-06-28 00:09:06,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-28 00:09:41,707 INFO [train.py:996] (1/4) Epoch 11, batch 16500, loss[loss=0.1726, simple_loss=0.2394, pruned_loss=0.05287, over 21588.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2892, pruned_loss=0.06549, over 4274929.60 frames. 
], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:10:31,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1928796.0, ans=0.125 2023-06-28 00:10:41,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1928856.0, ans=0.125 2023-06-28 00:11:09,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1928916.0, ans=0.125 2023-06-28 00:11:24,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-28 00:11:26,475 INFO [train.py:996] (1/4) Epoch 11, batch 16550, loss[loss=0.2696, simple_loss=0.3437, pruned_loss=0.09779, over 21450.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.29, pruned_loss=0.06463, over 4261971.91 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:11:28,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1928976.0, ans=0.0 2023-06-28 00:11:50,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.361e+02 7.345e+02 1.277e+03 1.917e+03 4.181e+03, threshold=2.555e+03, percent-clipped=23.0 2023-06-28 00:12:03,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1929036.0, ans=0.0 2023-06-28 00:13:15,359 INFO [train.py:996] (1/4) Epoch 11, batch 16600, loss[loss=0.1947, simple_loss=0.3045, pruned_loss=0.04246, over 20798.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2961, pruned_loss=0.06688, over 4270568.76 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:13:40,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.77 vs. limit=6.0 2023-06-28 00:13:46,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1929336.0, ans=0.0 2023-06-28 00:14:22,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1929456.0, ans=0.125 2023-06-28 00:14:26,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1929456.0, ans=0.0 2023-06-28 00:15:00,042 INFO [train.py:996] (1/4) Epoch 11, batch 16650, loss[loss=0.301, simple_loss=0.363, pruned_loss=0.1195, over 21294.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3055, pruned_loss=0.06957, over 4273352.45 frames. ], batch size: 507, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:15:01,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.68 vs. 
limit=15.0 2023-06-28 00:15:16,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1929576.0, ans=0.05 2023-06-28 00:15:28,778 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.761e+02 8.064e+02 1.116e+03 1.585e+03 3.216e+03, threshold=2.231e+03, percent-clipped=5.0 2023-06-28 00:15:47,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1929696.0, ans=0.1 2023-06-28 00:15:53,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1929696.0, ans=0.1 2023-06-28 00:16:50,148 INFO [train.py:996] (1/4) Epoch 11, batch 16700, loss[loss=0.2014, simple_loss=0.2804, pruned_loss=0.06121, over 21708.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3062, pruned_loss=0.07073, over 4269823.53 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:17:32,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1929936.0, ans=0.125 2023-06-28 00:17:36,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1929936.0, ans=0.0 2023-06-28 00:18:06,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1930056.0, ans=15.0 2023-06-28 00:18:09,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1930056.0, ans=0.05 2023-06-28 00:18:28,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1930116.0, ans=0.0 2023-06-28 00:18:47,098 INFO [train.py:996] (1/4) Epoch 11, batch 16750, loss[loss=0.3294, simple_loss=0.4027, pruned_loss=0.128, over 21412.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3086, pruned_loss=0.07248, over 4269782.71 frames. ], batch size: 507, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:18:55,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1930176.0, ans=0.0 2023-06-28 00:19:09,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-28 00:19:12,037 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.860e+02 6.994e+02 8.979e+02 1.342e+03 3.526e+03, threshold=1.796e+03, percent-clipped=9.0 2023-06-28 00:20:03,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1930356.0, ans=0.2 2023-06-28 00:20:03,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-28 00:20:37,335 INFO [train.py:996] (1/4) Epoch 11, batch 16800, loss[loss=0.2125, simple_loss=0.2789, pruned_loss=0.07301, over 21436.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3106, pruned_loss=0.07232, over 4268329.20 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 32.0 2023-06-28 00:20:49,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.27 vs. 
limit=15.0 2023-06-28 00:20:54,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-28 00:20:58,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1930536.0, ans=0.0 2023-06-28 00:21:06,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-28 00:21:27,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1930596.0, ans=0.125 2023-06-28 00:21:27,945 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-06-28 00:22:18,737 INFO [train.py:996] (1/4) Epoch 11, batch 16850, loss[loss=0.2414, simple_loss=0.3121, pruned_loss=0.08537, over 21870.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3071, pruned_loss=0.07206, over 4278441.97 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:22:25,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1930776.0, ans=0.0 2023-06-28 00:22:38,611 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.042e+02 8.354e+02 1.397e+03 2.191e+03 5.653e+03, threshold=2.793e+03, percent-clipped=35.0 2023-06-28 00:23:20,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1930956.0, ans=0.125 2023-06-28 00:23:27,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1930956.0, ans=0.025 2023-06-28 00:23:41,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1931016.0, ans=0.125 2023-06-28 00:23:44,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=22.5 2023-06-28 00:24:00,836 INFO [train.py:996] (1/4) Epoch 11, batch 16900, loss[loss=0.187, simple_loss=0.2559, pruned_loss=0.05907, over 21199.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3006, pruned_loss=0.07014, over 4280320.91 frames. ], batch size: 143, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:24:28,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1931136.0, ans=0.0 2023-06-28 00:24:43,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-28 00:25:02,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1931256.0, ans=0.2 2023-06-28 00:25:28,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1931316.0, ans=0.125 2023-06-28 00:25:41,092 INFO [train.py:996] (1/4) Epoch 11, batch 16950, loss[loss=0.1861, simple_loss=0.2631, pruned_loss=0.05451, over 21143.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2944, pruned_loss=0.06853, over 4280328.73 frames. 
], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:25:42,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1931376.0, ans=15.0 2023-06-28 00:26:00,772 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 6.361e+02 9.262e+02 1.143e+03 1.974e+03, threshold=1.852e+03, percent-clipped=0.0 2023-06-28 00:26:39,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1931556.0, ans=0.125 2023-06-28 00:27:22,676 INFO [train.py:996] (1/4) Epoch 11, batch 17000, loss[loss=0.2657, simple_loss=0.3192, pruned_loss=0.1062, over 21614.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2916, pruned_loss=0.06943, over 4286528.10 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:27:38,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1931736.0, ans=0.5 2023-06-28 00:28:14,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1931796.0, ans=0.0 2023-06-28 00:28:26,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1931856.0, ans=0.2 2023-06-28 00:28:46,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.09 vs. limit=22.5 2023-06-28 00:29:06,153 INFO [train.py:996] (1/4) Epoch 11, batch 17050, loss[loss=0.2258, simple_loss=0.3124, pruned_loss=0.06957, over 21838.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2968, pruned_loss=0.07102, over 4286999.94 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:29:14,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1931976.0, ans=0.125 2023-06-28 00:29:26,235 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.012e+02 8.433e+02 1.501e+03 2.176e+03 5.028e+03, threshold=3.003e+03, percent-clipped=35.0 2023-06-28 00:29:49,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1932096.0, ans=0.07 2023-06-28 00:29:52,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932096.0, ans=0.1 2023-06-28 00:30:37,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932216.0, ans=0.1 2023-06-28 00:30:46,881 INFO [train.py:996] (1/4) Epoch 11, batch 17100, loss[loss=0.2201, simple_loss=0.2912, pruned_loss=0.07451, over 21887.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2958, pruned_loss=0.07139, over 4286279.20 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:30:51,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.51 vs. 
limit=6.0 2023-06-28 00:31:12,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1932336.0, ans=0.2 2023-06-28 00:31:15,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1932336.0, ans=0.0 2023-06-28 00:31:56,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932456.0, ans=0.1 2023-06-28 00:32:29,946 INFO [train.py:996] (1/4) Epoch 11, batch 17150, loss[loss=0.1903, simple_loss=0.2651, pruned_loss=0.05776, over 21882.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2938, pruned_loss=0.0711, over 4289887.27 frames. ], batch size: 118, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:32:54,500 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.748e+02 5.716e+02 7.652e+02 9.791e+02 2.028e+03, threshold=1.530e+03, percent-clipped=0.0 2023-06-28 00:33:36,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-28 00:34:00,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932816.0, ans=0.1 2023-06-28 00:34:02,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1932816.0, ans=0.125 2023-06-28 00:34:07,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932816.0, ans=0.1 2023-06-28 00:34:16,988 INFO [train.py:996] (1/4) Epoch 11, batch 17200, loss[loss=0.2401, simple_loss=0.3164, pruned_loss=0.08189, over 21539.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2938, pruned_loss=0.07098, over 4286506.07 frames. ], batch size: 414, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 00:35:49,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1933116.0, ans=0.125 2023-06-28 00:36:00,769 INFO [train.py:996] (1/4) Epoch 11, batch 17250, loss[loss=0.2786, simple_loss=0.3418, pruned_loss=0.1078, over 21412.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2975, pruned_loss=0.07205, over 4278790.72 frames. 
], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:36:17,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1933176.0, ans=0.2 2023-06-28 00:36:32,856 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.884e+02 8.366e+02 1.182e+03 1.787e+03 4.360e+03, threshold=2.365e+03, percent-clipped=31.0 2023-06-28 00:36:38,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1933236.0, ans=0.125 2023-06-28 00:36:40,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1933236.0, ans=0.125 2023-06-28 00:36:40,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1933236.0, ans=0.2 2023-06-28 00:37:13,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1933356.0, ans=0.2 2023-06-28 00:37:30,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1933416.0, ans=0.1 2023-06-28 00:37:31,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1933416.0, ans=0.0 2023-06-28 00:37:37,568 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.88 vs. limit=10.0 2023-06-28 00:37:49,417 INFO [train.py:996] (1/4) Epoch 11, batch 17300, loss[loss=0.2553, simple_loss=0.3288, pruned_loss=0.09089, over 21801.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3052, pruned_loss=0.07508, over 4284492.52 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:37:52,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1933476.0, ans=6.0 2023-06-28 00:38:10,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1933536.0, ans=0.0 2023-06-28 00:39:25,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1933716.0, ans=0.125 2023-06-28 00:39:40,383 INFO [train.py:996] (1/4) Epoch 11, batch 17350, loss[loss=0.1947, simple_loss=0.287, pruned_loss=0.05124, over 21761.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3065, pruned_loss=0.07508, over 4282455.62 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:40:01,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933836.0, ans=0.1 2023-06-28 00:40:06,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-28 00:40:07,461 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.114e+02 8.299e+02 1.147e+03 1.835e+03 3.555e+03, threshold=2.294e+03, percent-clipped=8.0 2023-06-28 00:40:12,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.64 vs. 
limit=15.0 2023-06-28 00:40:37,111 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:40:41,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-28 00:41:10,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1934016.0, ans=0.0 2023-06-28 00:41:24,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1934076.0, ans=0.125 2023-06-28 00:41:25,717 INFO [train.py:996] (1/4) Epoch 11, batch 17400, loss[loss=0.1875, simple_loss=0.2796, pruned_loss=0.04776, over 21557.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3018, pruned_loss=0.07147, over 4265491.24 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:41:28,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-28 00:41:45,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-28 00:42:04,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1934136.0, ans=0.125 2023-06-28 00:42:33,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1934256.0, ans=0.0 2023-06-28 00:43:13,928 INFO [train.py:996] (1/4) Epoch 11, batch 17450, loss[loss=0.1812, simple_loss=0.2458, pruned_loss=0.05832, over 21174.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2978, pruned_loss=0.06926, over 4267449.09 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:43:21,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1934376.0, ans=0.0 2023-06-28 00:43:22,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1934376.0, ans=0.125 2023-06-28 00:43:41,602 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.482e+02 8.576e+02 1.354e+03 2.024e+03 4.305e+03, threshold=2.708e+03, percent-clipped=16.0 2023-06-28 00:43:43,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.84 vs. limit=22.5 2023-06-28 00:44:19,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1934556.0, ans=0.2 2023-06-28 00:44:55,324 INFO [train.py:996] (1/4) Epoch 11, batch 17500, loss[loss=0.2691, simple_loss=0.3187, pruned_loss=0.1097, over 21722.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2945, pruned_loss=0.06756, over 4273551.80 frames. ], batch size: 508, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 00:46:18,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1934916.0, ans=0.0 2023-06-28 00:46:35,452 INFO [train.py:996] (1/4) Epoch 11, batch 17550, loss[loss=0.2097, simple_loss=0.3039, pruned_loss=0.05774, over 21372.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.295, pruned_loss=0.06695, over 4268596.74 frames. 
], batch size: 548, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 00:47:02,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1935036.0, ans=0.1 2023-06-28 00:47:02,918 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.317e+02 6.340e+02 7.775e+02 1.102e+03 1.869e+03, threshold=1.555e+03, percent-clipped=0.0 2023-06-28 00:47:03,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1935036.0, ans=0.2 2023-06-28 00:47:11,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1935096.0, ans=0.125 2023-06-28 00:47:34,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1935156.0, ans=0.0 2023-06-28 00:48:16,979 INFO [train.py:996] (1/4) Epoch 11, batch 17600, loss[loss=0.2289, simple_loss=0.3079, pruned_loss=0.07501, over 21444.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2979, pruned_loss=0.0673, over 4262198.40 frames. ], batch size: 211, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:48:21,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-28 00:49:46,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1935516.0, ans=0.0 2023-06-28 00:50:01,079 INFO [train.py:996] (1/4) Epoch 11, batch 17650, loss[loss=0.2021, simple_loss=0.2834, pruned_loss=0.06037, over 21677.00 frames. ], tot_loss[loss=0.216, simple_loss=0.297, pruned_loss=0.06751, over 4270171.24 frames. ], batch size: 415, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:50:06,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1935576.0, ans=0.125 2023-06-28 00:50:29,621 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.907e+02 7.332e+02 1.084e+03 1.896e+03 3.594e+03, threshold=2.168e+03, percent-clipped=34.0 2023-06-28 00:51:02,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1935756.0, ans=0.125 2023-06-28 00:51:35,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1935816.0, ans=0.1 2023-06-28 00:51:40,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1935816.0, ans=0.125 2023-06-28 00:51:49,577 INFO [train.py:996] (1/4) Epoch 11, batch 17700, loss[loss=0.2429, simple_loss=0.3287, pruned_loss=0.07855, over 21910.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2907, pruned_loss=0.06508, over 4273774.87 frames. 
], batch size: 372, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:51:55,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1935876.0, ans=0.0 2023-06-28 00:52:20,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1935936.0, ans=0.0 2023-06-28 00:52:50,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1936056.0, ans=0.125 2023-06-28 00:53:07,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1936056.0, ans=0.125 2023-06-28 00:53:09,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1936056.0, ans=0.125 2023-06-28 00:53:33,360 INFO [train.py:996] (1/4) Epoch 11, batch 17750, loss[loss=0.2316, simple_loss=0.315, pruned_loss=0.0741, over 21983.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2973, pruned_loss=0.0678, over 4264557.32 frames. ], batch size: 317, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:53:34,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-28 00:53:51,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1936176.0, ans=0.125 2023-06-28 00:54:01,418 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.937e+02 7.178e+02 1.077e+03 1.520e+03 3.336e+03, threshold=2.154e+03, percent-clipped=9.0 2023-06-28 00:54:21,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1936296.0, ans=0.125 2023-06-28 00:54:59,442 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-28 00:55:11,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1936416.0, ans=0.125 2023-06-28 00:55:22,099 INFO [train.py:996] (1/4) Epoch 11, batch 17800, loss[loss=0.1935, simple_loss=0.2783, pruned_loss=0.0543, over 21427.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2961, pruned_loss=0.06706, over 4265265.01 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:56:27,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1936656.0, ans=0.125 2023-06-28 00:56:33,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1936656.0, ans=0.125 2023-06-28 00:57:05,741 INFO [train.py:996] (1/4) Epoch 11, batch 17850, loss[loss=0.2065, simple_loss=0.2734, pruned_loss=0.06984, over 20013.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2973, pruned_loss=0.06792, over 4265958.46 frames. 
], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:57:20,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1936776.0, ans=0.95 2023-06-28 00:57:34,255 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.504e+02 7.242e+02 1.057e+03 1.582e+03 3.438e+03, threshold=2.115e+03, percent-clipped=9.0 2023-06-28 00:58:35,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1937016.0, ans=0.1 2023-06-28 00:58:48,598 INFO [train.py:996] (1/4) Epoch 11, batch 17900, loss[loss=0.2281, simple_loss=0.3149, pruned_loss=0.07061, over 21809.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3019, pruned_loss=0.06948, over 4274097.97 frames. ], batch size: 118, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:59:05,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1937076.0, ans=0.125 2023-06-28 00:59:16,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-06-28 01:00:15,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1937316.0, ans=0.0 2023-06-28 01:00:22,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-28 01:00:37,333 INFO [train.py:996] (1/4) Epoch 11, batch 17950, loss[loss=0.1697, simple_loss=0.2697, pruned_loss=0.03488, over 21777.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3006, pruned_loss=0.0666, over 4267220.61 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:00:44,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-06-28 01:00:46,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1937376.0, ans=0.125 2023-06-28 01:01:09,590 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.263e+02 6.938e+02 9.459e+02 1.364e+03 3.127e+03, threshold=1.892e+03, percent-clipped=7.0 2023-06-28 01:01:11,731 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:01:26,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1937496.0, ans=0.125 2023-06-28 01:01:49,862 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. limit=10.0 2023-06-28 01:02:22,722 INFO [train.py:996] (1/4) Epoch 11, batch 18000, loss[loss=0.1862, simple_loss=0.2577, pruned_loss=0.05733, over 21665.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2931, pruned_loss=0.06496, over 4269589.80 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:02:22,722 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 01:02:39,145 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2572, simple_loss=0.3509, pruned_loss=0.08176, over 1796401.00 frames. 
2023-06-28 01:02:39,146 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 01:02:40,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1937676.0, ans=0.125 2023-06-28 01:02:43,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1937676.0, ans=0.125 2023-06-28 01:02:56,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1937676.0, ans=0.05 2023-06-28 01:03:28,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1937796.0, ans=0.2 2023-06-28 01:04:22,696 INFO [train.py:996] (1/4) Epoch 11, batch 18050, loss[loss=0.1788, simple_loss=0.2432, pruned_loss=0.05716, over 20764.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2873, pruned_loss=0.06398, over 4261604.24 frames. ], batch size: 608, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:04:53,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1938036.0, ans=0.125 2023-06-28 01:04:58,009 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.600e+02 6.639e+02 9.648e+02 1.453e+03 3.276e+03, threshold=1.930e+03, percent-clipped=8.0 2023-06-28 01:06:10,716 INFO [train.py:996] (1/4) Epoch 11, batch 18100, loss[loss=0.2627, simple_loss=0.3477, pruned_loss=0.08889, over 21591.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2922, pruned_loss=0.06589, over 4262598.76 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:06:34,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1938336.0, ans=0.0 2023-06-28 01:07:05,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1938456.0, ans=10.0 2023-06-28 01:07:19,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1938456.0, ans=0.125 2023-06-28 01:07:23,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-28 01:07:48,926 INFO [train.py:996] (1/4) Epoch 11, batch 18150, loss[loss=0.2156, simple_loss=0.3073, pruned_loss=0.06194, over 21919.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2957, pruned_loss=0.06608, over 4266803.10 frames. ], batch size: 373, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:07:57,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-28 01:08:18,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.434e+02 6.385e+02 9.174e+02 1.252e+03 3.670e+03, threshold=1.835e+03, percent-clipped=3.0 2023-06-28 01:08:20,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1938636.0, ans=0.125 2023-06-28 01:08:59,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.28 vs. 
limit=15.0 2023-06-28 01:09:18,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1938816.0, ans=0.125 2023-06-28 01:09:24,172 INFO [train.py:996] (1/4) Epoch 11, batch 18200, loss[loss=0.1741, simple_loss=0.2491, pruned_loss=0.0495, over 21547.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.289, pruned_loss=0.06611, over 4253081.36 frames. ], batch size: 195, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:10:11,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1938996.0, ans=0.1 2023-06-28 01:11:04,721 INFO [train.py:996] (1/4) Epoch 11, batch 18250, loss[loss=0.2084, simple_loss=0.2808, pruned_loss=0.06799, over 21831.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2822, pruned_loss=0.06365, over 4250188.94 frames. ], batch size: 416, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:11:37,983 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.242e+02 6.955e+02 1.102e+03 1.552e+03 2.927e+03, threshold=2.205e+03, percent-clipped=10.0 2023-06-28 01:11:59,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1939296.0, ans=0.0 2023-06-28 01:12:33,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1939416.0, ans=0.0 2023-06-28 01:12:46,289 INFO [train.py:996] (1/4) Epoch 11, batch 18300, loss[loss=0.2093, simple_loss=0.2803, pruned_loss=0.06914, over 21910.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2808, pruned_loss=0.06336, over 4254615.98 frames. ], batch size: 118, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:13:01,804 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:13:43,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1939656.0, ans=0.0 2023-06-28 01:14:04,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1939716.0, ans=0.0 2023-06-28 01:14:22,407 INFO [train.py:996] (1/4) Epoch 11, batch 18350, loss[loss=0.1607, simple_loss=0.2325, pruned_loss=0.04442, over 16309.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2864, pruned_loss=0.06295, over 4243431.18 frames. ], batch size: 61, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:14:33,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.66 vs. 
limit=10.0 2023-06-28 01:14:47,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1939836.0, ans=0.0 2023-06-28 01:14:56,374 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.730e+02 6.827e+02 1.100e+03 1.659e+03 4.791e+03, threshold=2.200e+03, percent-clipped=14.0 2023-06-28 01:15:00,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1939836.0, ans=0.125 2023-06-28 01:15:12,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1939896.0, ans=0.05 2023-06-28 01:15:30,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1939956.0, ans=0.125 2023-06-28 01:15:52,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1940016.0, ans=0.0 2023-06-28 01:15:52,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1940016.0, ans=0.2 2023-06-28 01:16:05,019 INFO [train.py:996] (1/4) Epoch 11, batch 18400, loss[loss=0.1671, simple_loss=0.2507, pruned_loss=0.04179, over 21140.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2811, pruned_loss=0.06105, over 4242773.47 frames. ], batch size: 159, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:16:07,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1940076.0, ans=0.0 2023-06-28 01:16:15,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1940076.0, ans=0.0 2023-06-28 01:16:25,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1940136.0, ans=0.0 2023-06-28 01:17:02,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1940256.0, ans=0.0 2023-06-28 01:17:37,795 INFO [train.py:996] (1/4) Epoch 11, batch 18450, loss[loss=0.1839, simple_loss=0.2531, pruned_loss=0.05735, over 21994.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2784, pruned_loss=0.05803, over 4253511.65 frames. ], batch size: 103, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:18:14,202 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.159e+02 6.017e+02 7.931e+02 1.267e+03 3.301e+03, threshold=1.586e+03, percent-clipped=3.0 2023-06-28 01:18:22,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1940496.0, ans=0.125 2023-06-28 01:18:31,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1940496.0, ans=0.125 2023-06-28 01:19:12,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1940616.0, ans=0.125 2023-06-28 01:19:15,639 INFO [train.py:996] (1/4) Epoch 11, batch 18500, loss[loss=0.1817, simple_loss=0.2679, pruned_loss=0.04774, over 21793.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2753, pruned_loss=0.05764, over 4251073.37 frames. ], batch size: 316, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:19:47,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.04 vs. 
limit=22.5 2023-06-28 01:20:10,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1940796.0, ans=0.0 2023-06-28 01:20:33,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1940856.0, ans=0.2 2023-06-28 01:20:40,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1940916.0, ans=10.0 2023-06-28 01:20:57,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. limit=10.0 2023-06-28 01:20:57,750 INFO [train.py:996] (1/4) Epoch 11, batch 18550, loss[loss=0.1691, simple_loss=0.2296, pruned_loss=0.05435, over 20774.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2737, pruned_loss=0.05704, over 4240114.70 frames. ], batch size: 608, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:21:19,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1940976.0, ans=0.09899494936611666 2023-06-28 01:21:25,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-28 01:21:34,191 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.234e+02 6.100e+02 9.556e+02 1.452e+03 3.261e+03, threshold=1.911e+03, percent-clipped=19.0 2023-06-28 01:21:39,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1941096.0, ans=0.0 2023-06-28 01:21:51,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1941096.0, ans=0.0 2023-06-28 01:22:45,342 INFO [train.py:996] (1/4) Epoch 11, batch 18600, loss[loss=0.2032, simple_loss=0.2812, pruned_loss=0.06258, over 21633.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2731, pruned_loss=0.05848, over 4230554.57 frames. ], batch size: 391, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:23:46,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.11 vs. limit=6.0 2023-06-28 01:24:11,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1941516.0, ans=0.5 2023-06-28 01:24:26,400 INFO [train.py:996] (1/4) Epoch 11, batch 18650, loss[loss=0.2343, simple_loss=0.291, pruned_loss=0.08879, over 21363.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2712, pruned_loss=0.05848, over 4229666.28 frames. ], batch size: 473, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:24:52,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.322e+02 7.479e+02 1.141e+03 1.737e+03 3.586e+03, threshold=2.283e+03, percent-clipped=19.0 2023-06-28 01:25:32,348 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:25:57,788 INFO [train.py:996] (1/4) Epoch 11, batch 18700, loss[loss=0.2322, simple_loss=0.282, pruned_loss=0.09122, over 21753.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2695, pruned_loss=0.05998, over 4247225.36 frames. 
], batch size: 508, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:26:33,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1941936.0, ans=0.1 2023-06-28 01:26:53,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1941996.0, ans=0.125 2023-06-28 01:27:06,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1942056.0, ans=0.125 2023-06-28 01:27:40,761 INFO [train.py:996] (1/4) Epoch 11, batch 18750, loss[loss=0.2539, simple_loss=0.3367, pruned_loss=0.08552, over 21632.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2724, pruned_loss=0.06246, over 4244596.48 frames. ], batch size: 389, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:28:17,082 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.499e+02 6.198e+02 1.010e+03 1.418e+03 2.835e+03, threshold=2.020e+03, percent-clipped=5.0 2023-06-28 01:28:35,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1942296.0, ans=10.0 2023-06-28 01:28:37,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1942296.0, ans=0.125 2023-06-28 01:29:05,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1942416.0, ans=0.1 2023-06-28 01:29:12,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1942416.0, ans=0.125 2023-06-28 01:29:17,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-28 01:29:23,219 INFO [train.py:996] (1/4) Epoch 11, batch 18800, loss[loss=0.1948, simple_loss=0.2838, pruned_loss=0.05293, over 21867.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2794, pruned_loss=0.0639, over 4238797.97 frames. ], batch size: 371, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:29:27,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1942476.0, ans=0.125 2023-06-28 01:29:55,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1942536.0, ans=0.1 2023-06-28 01:30:01,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1942536.0, ans=0.125 2023-06-28 01:30:22,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1942656.0, ans=0.125 2023-06-28 01:30:54,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1942716.0, ans=0.1 2023-06-28 01:31:04,499 INFO [train.py:996] (1/4) Epoch 11, batch 18850, loss[loss=0.1841, simple_loss=0.2524, pruned_loss=0.05793, over 21506.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2799, pruned_loss=0.06102, over 4238825.42 frames. 
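Editor's note: the recurring optim.py entries above report five points of the recent grad-norm distribution (apparently min, the three quartiles, and max), the clipping threshold currently in force, and the fraction of recent batches that were clipped. The log does not show how the threshold is derived; the sketch below is only a minimal illustration of a quantile-based clipping rule of this flavour, assuming a sliding window of recent norms and a median-based threshold. The class name, window size, and return values are illustrative assumptions, not the icefall implementation.

```python
import torch

class QuantileGradClipper:
    """Illustrative sketch: keep a window of recent total grad norms and
    clip against clipping_scale * median of that window (assumed rule)."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms = []        # recent total grad norms (floats)
        self.num_clipped = 0
        self.num_seen = 0

    def __call__(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        total_norm = torch.norm(
            torch.stack([p.grad.detach().norm(2) for p in params]), 2)

        self.norms.append(total_norm.item())
        self.norms = self.norms[-self.window:]
        hist = torch.tensor(self.norms)

        # Threshold tracks the recent norm distribution rather than a constant.
        threshold = self.clipping_scale * hist.median()

        self.num_seen += 1
        if total_norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.detach().mul_(threshold / total_norm)

        # Values analogous to what the log prints: quantiles, threshold,
        # and the percentage of batches clipped so far.
        quantiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        percent_clipped = 100.0 * self.num_clipped / self.num_seen
        return quantiles, threshold, percent_clipped
```

With a rule like this, a percent-clipped figure in the 10-20% range (as in the log) simply means the current batch norms regularly exceed twice the recent median, which is consistent with the fairly spread-out quartile values reported.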
], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:31:22,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1942776.0, ans=0.1 2023-06-28 01:31:22,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1942776.0, ans=0.0 2023-06-28 01:31:41,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.145e+02 6.934e+02 1.004e+03 1.636e+03 4.618e+03, threshold=2.007e+03, percent-clipped=13.0 2023-06-28 01:31:42,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1942836.0, ans=0.125 2023-06-28 01:31:52,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1942896.0, ans=0.0 2023-06-28 01:31:57,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1942896.0, ans=0.125 2023-06-28 01:32:46,437 INFO [train.py:996] (1/4) Epoch 11, batch 18900, loss[loss=0.2022, simple_loss=0.2695, pruned_loss=0.06748, over 21823.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2762, pruned_loss=0.06098, over 4250488.09 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:33:13,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1943136.0, ans=0.125 2023-06-28 01:33:17,516 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:33:36,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1943196.0, ans=0.125 2023-06-28 01:34:17,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1943316.0, ans=0.2 2023-06-28 01:34:19,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-28 01:34:28,565 INFO [train.py:996] (1/4) Epoch 11, batch 18950, loss[loss=0.2079, simple_loss=0.293, pruned_loss=0.06143, over 21811.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2775, pruned_loss=0.06312, over 4264437.49 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:34:37,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1943376.0, ans=0.0 2023-06-28 01:35:07,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.266e+02 7.363e+02 1.116e+03 1.715e+03 3.795e+03, threshold=2.232e+03, percent-clipped=17.0 2023-06-28 01:35:41,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1943556.0, ans=0.125 2023-06-28 01:36:05,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1943616.0, ans=0.125 2023-06-28 01:36:16,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-28 01:36:16,458 INFO [train.py:996] (1/4) Epoch 11, batch 19000, loss[loss=0.259, simple_loss=0.331, pruned_loss=0.09349, over 21777.00 frames. 
], tot_loss[loss=0.2062, simple_loss=0.2839, pruned_loss=0.06423, over 4269556.87 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:36:46,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1943736.0, ans=0.0 2023-06-28 01:37:11,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1943796.0, ans=0.125 2023-06-28 01:37:45,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1943916.0, ans=0.125 2023-06-28 01:37:46,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1943916.0, ans=0.0 2023-06-28 01:37:59,363 INFO [train.py:996] (1/4) Epoch 11, batch 19050, loss[loss=0.2265, simple_loss=0.2995, pruned_loss=0.07672, over 21857.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2884, pruned_loss=0.06733, over 4271498.26 frames. ], batch size: 118, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:38:33,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1944036.0, ans=0.07 2023-06-28 01:38:34,340 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.763e+02 7.359e+02 1.013e+03 1.496e+03 3.084e+03, threshold=2.026e+03, percent-clipped=8.0 2023-06-28 01:39:43,732 INFO [train.py:996] (1/4) Epoch 11, batch 19100, loss[loss=0.2396, simple_loss=0.3122, pruned_loss=0.08346, over 20012.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2867, pruned_loss=0.06841, over 4272061.11 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:40:02,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1944276.0, ans=0.125 2023-06-28 01:40:37,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-28 01:40:58,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1944456.0, ans=0.0 2023-06-28 01:41:33,376 INFO [train.py:996] (1/4) Epoch 11, batch 19150, loss[loss=0.2155, simple_loss=0.3143, pruned_loss=0.05835, over 21699.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2889, pruned_loss=0.06879, over 4263273.48 frames. ], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:41:39,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1944576.0, ans=0.125 2023-06-28 01:42:09,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 7.660e+02 1.202e+03 2.015e+03 4.043e+03, threshold=2.404e+03, percent-clipped=23.0 2023-06-28 01:42:24,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1944696.0, ans=0.125 2023-06-28 01:42:58,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1944756.0, ans=0.125 2023-06-28 01:43:03,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1944816.0, ans=0.125 2023-06-28 01:43:19,389 INFO [train.py:996] (1/4) Epoch 11, batch 19200, loss[loss=0.2061, simple_loss=0.2823, pruned_loss=0.06492, over 21862.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2972, pruned_loss=0.06878, over 4266382.03 frames. 
], batch size: 98, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:43:40,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.52 vs. limit=15.0 2023-06-28 01:44:24,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1945056.0, ans=0.0 2023-06-28 01:44:47,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-28 01:44:54,705 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-28 01:45:01,788 INFO [train.py:996] (1/4) Epoch 11, batch 19250, loss[loss=0.1611, simple_loss=0.2645, pruned_loss=0.02878, over 21774.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2983, pruned_loss=0.06473, over 4262144.47 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:45:07,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.54 vs. limit=15.0 2023-06-28 01:45:26,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1945236.0, ans=10.0 2023-06-28 01:45:36,158 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.112e+02 6.434e+02 9.084e+02 1.292e+03 2.942e+03, threshold=1.817e+03, percent-clipped=2.0 2023-06-28 01:46:43,101 INFO [train.py:996] (1/4) Epoch 11, batch 19300, loss[loss=0.2172, simple_loss=0.2947, pruned_loss=0.06982, over 21574.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2953, pruned_loss=0.06368, over 4262429.44 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:47:13,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1945536.0, ans=0.1 2023-06-28 01:47:13,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1945536.0, ans=0.02 2023-06-28 01:47:33,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1945596.0, ans=0.2 2023-06-28 01:48:25,859 INFO [train.py:996] (1/4) Epoch 11, batch 19350, loss[loss=0.1738, simple_loss=0.2636, pruned_loss=0.04195, over 21793.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.291, pruned_loss=0.06055, over 4268421.47 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:48:43,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-28 01:49:06,755 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.646e+02 6.544e+02 1.045e+03 1.616e+03 2.621e+03, threshold=2.089e+03, percent-clipped=15.0 2023-06-28 01:50:06,770 INFO [train.py:996] (1/4) Epoch 11, batch 19400, loss[loss=0.1738, simple_loss=0.254, pruned_loss=0.04684, over 21260.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2891, pruned_loss=0.06036, over 4276842.33 frames. 
], batch size: 176, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 01:50:10,690 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:50:33,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1946136.0, ans=0.2 2023-06-28 01:51:48,611 INFO [train.py:996] (1/4) Epoch 11, batch 19450, loss[loss=0.2032, simple_loss=0.2774, pruned_loss=0.06452, over 21676.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2863, pruned_loss=0.06234, over 4289708.10 frames. ], batch size: 391, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:52:30,301 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 7.227e+02 1.148e+03 1.482e+03 2.916e+03, threshold=2.296e+03, percent-clipped=8.0 2023-06-28 01:52:40,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1946496.0, ans=0.0 2023-06-28 01:52:54,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1946556.0, ans=0.125 2023-06-28 01:53:32,669 INFO [train.py:996] (1/4) Epoch 11, batch 19500, loss[loss=0.1833, simple_loss=0.2485, pruned_loss=0.05905, over 20800.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2812, pruned_loss=0.06296, over 4263188.13 frames. ], batch size: 607, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:53:52,232 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:54:11,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1946736.0, ans=0.1 2023-06-28 01:55:10,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-28 01:55:15,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1946976.0, ans=0.0 2023-06-28 01:55:16,445 INFO [train.py:996] (1/4) Epoch 11, batch 19550, loss[loss=0.2457, simple_loss=0.3324, pruned_loss=0.07947, over 21555.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2789, pruned_loss=0.06185, over 4258153.09 frames. ], batch size: 508, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:55:35,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1946976.0, ans=0.125 2023-06-28 01:55:57,082 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.189e+02 6.305e+02 9.070e+02 1.284e+03 2.823e+03, threshold=1.814e+03, percent-clipped=4.0 2023-06-28 01:56:06,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1947096.0, ans=0.0 2023-06-28 01:56:41,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1947216.0, ans=0.125 2023-06-28 01:56:47,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1947216.0, ans=0.0 2023-06-28 01:56:50,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1947216.0, ans=0.2 2023-06-28 01:56:57,986 INFO [train.py:996] (1/4) Epoch 11, batch 19600, loss[loss=0.2509, simple_loss=0.3263, pruned_loss=0.0878, over 21838.00 frames. 
], tot_loss[loss=0.203, simple_loss=0.2808, pruned_loss=0.06256, over 4263450.00 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 01:58:43,056 INFO [train.py:996] (1/4) Epoch 11, batch 19650, loss[loss=0.2158, simple_loss=0.2902, pruned_loss=0.07074, over 21968.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2858, pruned_loss=0.06613, over 4270579.13 frames. ], batch size: 316, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 01:58:44,313 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.62 vs. limit=22.5 2023-06-28 01:59:29,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 7.409e+02 1.104e+03 1.587e+03 3.520e+03, threshold=2.207e+03, percent-clipped=14.0 2023-06-28 01:59:37,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1947696.0, ans=0.125 2023-06-28 02:00:27,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-28 02:00:39,363 INFO [train.py:996] (1/4) Epoch 11, batch 19700, loss[loss=0.1786, simple_loss=0.2681, pruned_loss=0.04453, over 21625.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2892, pruned_loss=0.06703, over 4271368.62 frames. ], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:00:56,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1947876.0, ans=0.0 2023-06-28 02:00:59,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.07 vs. limit=15.0 2023-06-28 02:01:47,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-06-28 02:01:49,712 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:02:07,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1948116.0, ans=0.125 2023-06-28 02:02:28,060 INFO [train.py:996] (1/4) Epoch 11, batch 19750, loss[loss=0.2049, simple_loss=0.2838, pruned_loss=0.06298, over 21436.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2981, pruned_loss=0.06856, over 4277383.44 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:02:42,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=12.0 2023-06-28 02:03:04,765 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.869e+02 8.060e+02 1.121e+03 1.722e+03 5.088e+03, threshold=2.243e+03, percent-clipped=14.0 2023-06-28 02:03:23,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-28 02:04:06,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1948416.0, ans=0.0 2023-06-28 02:04:10,954 INFO [train.py:996] (1/4) Epoch 11, batch 19800, loss[loss=0.1814, simple_loss=0.2552, pruned_loss=0.0538, over 21686.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2972, pruned_loss=0.06879, over 4278527.16 frames. 
], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:04:11,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1948476.0, ans=0.125 2023-06-28 02:04:26,768 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:04:58,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1948596.0, ans=0.0 2023-06-28 02:05:18,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1948656.0, ans=0.125 2023-06-28 02:06:00,833 INFO [train.py:996] (1/4) Epoch 11, batch 19850, loss[loss=0.2268, simple_loss=0.331, pruned_loss=0.06128, over 21266.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2911, pruned_loss=0.06487, over 4275256.55 frames. ], batch size: 549, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:06:15,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-28 02:06:18,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1948836.0, ans=0.05 2023-06-28 02:06:32,734 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.738e+02 8.105e+02 1.255e+03 1.783e+03 2.882e+03, threshold=2.510e+03, percent-clipped=10.0 2023-06-28 02:07:42,431 INFO [train.py:996] (1/4) Epoch 11, batch 19900, loss[loss=0.1934, simple_loss=0.2878, pruned_loss=0.04944, over 21803.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2908, pruned_loss=0.06218, over 4276095.01 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:07:52,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1949076.0, ans=0.0 2023-06-28 02:08:16,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1949196.0, ans=0.1 2023-06-28 02:08:58,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1949256.0, ans=0.125 2023-06-28 02:09:25,676 INFO [train.py:996] (1/4) Epoch 11, batch 19950, loss[loss=0.2105, simple_loss=0.29, pruned_loss=0.0655, over 21570.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2848, pruned_loss=0.06192, over 4278585.92 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:09:58,011 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.624e+02 6.449e+02 8.969e+02 1.295e+03 2.845e+03, threshold=1.794e+03, percent-clipped=2.0 2023-06-28 02:11:07,742 INFO [train.py:996] (1/4) Epoch 11, batch 20000, loss[loss=0.2272, simple_loss=0.298, pruned_loss=0.07816, over 21253.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2855, pruned_loss=0.06201, over 4269058.05 frames. 
], batch size: 159, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:11:13,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1949676.0, ans=0.2 2023-06-28 02:11:38,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1949736.0, ans=0.09899494936611666 2023-06-28 02:11:46,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1949796.0, ans=0.125 2023-06-28 02:12:49,280 INFO [train.py:996] (1/4) Epoch 11, batch 20050, loss[loss=0.2103, simple_loss=0.2812, pruned_loss=0.06973, over 21408.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2875, pruned_loss=0.06452, over 4276578.73 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:13:01,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1949976.0, ans=0.1 2023-06-28 02:13:27,884 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 6.719e+02 1.022e+03 1.464e+03 2.848e+03, threshold=2.043e+03, percent-clipped=12.0 2023-06-28 02:13:46,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1950096.0, ans=0.125 2023-06-28 02:14:29,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=22.5 2023-06-28 02:14:33,042 INFO [train.py:996] (1/4) Epoch 11, batch 20100, loss[loss=0.2155, simple_loss=0.308, pruned_loss=0.06148, over 21416.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2899, pruned_loss=0.06648, over 4284017.83 frames. ], batch size: 211, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:14:39,375 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-28 02:14:55,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1950336.0, ans=0.125 2023-06-28 02:15:18,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1950396.0, ans=0.0 2023-06-28 02:16:16,910 INFO [train.py:996] (1/4) Epoch 11, batch 20150, loss[loss=0.2277, simple_loss=0.3055, pruned_loss=0.07494, over 21275.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2978, pruned_loss=0.06863, over 4283729.58 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:16:45,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1950636.0, ans=0.125 2023-06-28 02:17:06,272 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.168e+02 7.352e+02 1.035e+03 1.689e+03 3.687e+03, threshold=2.071e+03, percent-clipped=15.0 2023-06-28 02:18:07,654 INFO [train.py:996] (1/4) Epoch 11, batch 20200, loss[loss=0.1911, simple_loss=0.2671, pruned_loss=0.05758, over 21389.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3033, pruned_loss=0.07135, over 4283436.10 frames. 
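Editor's note: most of the scaling.py entries record a ScheduledFloat, i.e. a hyper-parameter (dropout probability, skip rate, bypass scale, balancer bound, etc.) whose current value (`ans`) is a function of `batch_count`. A piecewise-linear schedule reproduces that behaviour; the class name, breakpoint API, and the example breakpoints below are illustrative assumptions rather than the recipe's actual schedule.

```python
class PiecewiseLinearSchedule:
    """Illustrative sketch: interpolate a float hyper-parameter between
    (batch_count, value) breakpoints, clamping outside the range."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs, e.g. (0, 0.5), (20000, 0.0)
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                frac = (batch_count - x0) / (x1 - x0)
                return y0 + frac * (y1 - y0)
        return pts[-1][1]  # unreachable, kept for completeness


# Example (values chosen for illustration): a skip rate that decays from
# 0.5 to 0.0 over the first 20k batches, then stays at 0.0.
ff3_skip_rate = PiecewiseLinearSchedule((0, 0.5), (20000, 0.0))
print(ff3_skip_rate.value(1940016))   # -> 0.0, matching the flavour of 'ans=0.0'
```

This is why many skip-rate entries late in training read `ans=0.0`: by this point in the schedule the regularising skips have been annealed away, while quantities such as `bypass_mid.scale_min` sit at their final clamped values (e.g. `ans=0.2`).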
], batch size: 159, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:19:23,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1951056.0, ans=0.0 2023-06-28 02:19:41,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1951116.0, ans=0.125 2023-06-28 02:19:51,074 INFO [train.py:996] (1/4) Epoch 11, batch 20250, loss[loss=0.1956, simple_loss=0.2714, pruned_loss=0.05991, over 21177.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3034, pruned_loss=0.07016, over 4286506.47 frames. ], batch size: 143, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:19:53,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1951176.0, ans=0.0 2023-06-28 02:20:18,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1951236.0, ans=0.2 2023-06-28 02:20:23,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1951236.0, ans=0.0 2023-06-28 02:20:38,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1951296.0, ans=0.125 2023-06-28 02:20:39,500 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.229e+02 6.229e+02 9.670e+02 1.265e+03 2.835e+03, threshold=1.934e+03, percent-clipped=7.0 2023-06-28 02:20:48,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1951296.0, ans=0.1 2023-06-28 02:21:37,851 INFO [train.py:996] (1/4) Epoch 11, batch 20300, loss[loss=0.2006, simple_loss=0.2861, pruned_loss=0.05758, over 21358.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3006, pruned_loss=0.06745, over 4276701.92 frames. ], batch size: 211, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:21:40,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1951476.0, ans=0.125 2023-06-28 02:21:41,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1951476.0, ans=0.0 2023-06-28 02:21:51,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1951476.0, ans=0.0 2023-06-28 02:22:17,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1951536.0, ans=0.125 2023-06-28 02:22:26,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.21 vs. limit=10.0 2023-06-28 02:23:12,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1951776.0, ans=0.125 2023-06-28 02:23:13,331 INFO [train.py:996] (1/4) Epoch 11, batch 20350, loss[loss=0.2405, simple_loss=0.3122, pruned_loss=0.08442, over 21290.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.301, pruned_loss=0.06815, over 4260491.94 frames. 
], batch size: 159, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:23:33,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1951836.0, ans=0.125 2023-06-28 02:24:01,028 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.709e+02 6.451e+02 8.868e+02 1.412e+03 2.811e+03, threshold=1.774e+03, percent-clipped=7.0 2023-06-28 02:24:13,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1951896.0, ans=0.0 2023-06-28 02:24:25,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1951956.0, ans=0.04949747468305833 2023-06-28 02:24:56,169 INFO [train.py:996] (1/4) Epoch 11, batch 20400, loss[loss=0.2518, simple_loss=0.3364, pruned_loss=0.08359, over 21667.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3038, pruned_loss=0.07062, over 4251048.07 frames. ], batch size: 414, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:25:06,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1952076.0, ans=0.1 2023-06-28 02:25:12,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1952076.0, ans=0.0 2023-06-28 02:25:15,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1952076.0, ans=0.1 2023-06-28 02:25:35,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-28 02:25:37,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1952136.0, ans=0.0 2023-06-28 02:25:40,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1952136.0, ans=0.0 2023-06-28 02:25:47,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1952196.0, ans=0.125 2023-06-28 02:25:56,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1952196.0, ans=0.0 2023-06-28 02:26:31,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-28 02:26:37,004 INFO [train.py:996] (1/4) Epoch 11, batch 20450, loss[loss=0.1965, simple_loss=0.2621, pruned_loss=0.06539, over 21022.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3058, pruned_loss=0.0733, over 4258643.85 frames. 
], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:27:25,098 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.694e+02 8.138e+02 1.140e+03 1.534e+03 2.680e+03, threshold=2.280e+03, percent-clipped=12.0 2023-06-28 02:27:25,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1952496.0, ans=0.125 2023-06-28 02:28:16,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1952676.0, ans=10.0 2023-06-28 02:28:16,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1952676.0, ans=0.125 2023-06-28 02:28:17,759 INFO [train.py:996] (1/4) Epoch 11, batch 20500, loss[loss=0.1937, simple_loss=0.2554, pruned_loss=0.06604, over 21152.00 frames. ], tot_loss[loss=0.224, simple_loss=0.301, pruned_loss=0.07348, over 4258250.00 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:28:40,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1952676.0, ans=0.04949747468305833 2023-06-28 02:29:40,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1952856.0, ans=0.0 2023-06-28 02:29:45,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1952916.0, ans=0.125 2023-06-28 02:29:50,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1952916.0, ans=0.1 2023-06-28 02:30:04,140 INFO [train.py:996] (1/4) Epoch 11, batch 20550, loss[loss=0.1983, simple_loss=0.283, pruned_loss=0.05679, over 21178.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2945, pruned_loss=0.07163, over 4261535.91 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:30:28,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1953036.0, ans=0.125 2023-06-28 02:30:29,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1953036.0, ans=0.035 2023-06-28 02:30:49,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.218e+02 1.038e+03 1.367e+03 4.804e+03, threshold=2.077e+03, percent-clipped=4.0 2023-06-28 02:30:56,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1953096.0, ans=0.0 2023-06-28 02:30:59,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1953096.0, ans=0.125 2023-06-28 02:31:01,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1953096.0, ans=0.2 2023-06-28 02:31:18,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1953156.0, ans=0.0 2023-06-28 02:31:24,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1953216.0, ans=0.125 2023-06-28 02:31:42,648 INFO [train.py:996] (1/4) Epoch 11, batch 20600, loss[loss=0.2115, simple_loss=0.286, pruned_loss=0.06854, over 21655.00 frames. ], tot_loss[loss=0.218, simple_loss=0.296, pruned_loss=0.06998, over 4258289.67 frames. 
], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:32:43,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1953396.0, ans=0.1 2023-06-28 02:32:57,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1953456.0, ans=0.125 2023-06-28 02:33:27,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1953576.0, ans=0.1 2023-06-28 02:33:28,457 INFO [train.py:996] (1/4) Epoch 11, batch 20650, loss[loss=0.1819, simple_loss=0.2507, pruned_loss=0.0566, over 21165.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.292, pruned_loss=0.0703, over 4269548.14 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:33:35,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1953576.0, ans=0.04949747468305833 2023-06-28 02:34:08,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1953636.0, ans=0.0 2023-06-28 02:34:13,039 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.069e+02 8.420e+02 1.112e+03 2.688e+03, threshold=1.684e+03, percent-clipped=4.0 2023-06-28 02:34:15,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1953696.0, ans=0.1 2023-06-28 02:34:29,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1953756.0, ans=0.5 2023-06-28 02:35:11,586 INFO [train.py:996] (1/4) Epoch 11, batch 20700, loss[loss=0.1938, simple_loss=0.2765, pruned_loss=0.05559, over 21663.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2851, pruned_loss=0.067, over 4263420.20 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:36:01,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1953996.0, ans=0.125 2023-06-28 02:36:19,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1954056.0, ans=0.125 2023-06-28 02:36:21,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1954056.0, ans=0.0 2023-06-28 02:36:44,286 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-28 02:37:05,872 INFO [train.py:996] (1/4) Epoch 11, batch 20750, loss[loss=0.214, simple_loss=0.3045, pruned_loss=0.06181, over 21401.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2875, pruned_loss=0.06668, over 4260933.46 frames. ], batch size: 194, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:37:28,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1954236.0, ans=0.0 2023-06-28 02:37:46,762 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 6.969e+02 1.049e+03 1.420e+03 3.386e+03, threshold=2.099e+03, percent-clipped=18.0 2023-06-28 02:38:44,608 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. 
limit=15.0 2023-06-28 02:38:48,452 INFO [train.py:996] (1/4) Epoch 11, batch 20800, loss[loss=0.1871, simple_loss=0.2536, pruned_loss=0.06027, over 21437.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2903, pruned_loss=0.06787, over 4258984.44 frames. ], batch size: 212, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:39:59,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1954656.0, ans=0.07 2023-06-28 02:40:10,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1954716.0, ans=0.0 2023-06-28 02:40:30,172 INFO [train.py:996] (1/4) Epoch 11, batch 20850, loss[loss=0.1801, simple_loss=0.2503, pruned_loss=0.055, over 21237.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2836, pruned_loss=0.06615, over 4261257.56 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:41:11,765 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 6.813e+02 9.986e+02 1.626e+03 4.926e+03, threshold=1.997e+03, percent-clipped=17.0 2023-06-28 02:41:49,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1955016.0, ans=0.125 2023-06-28 02:41:51,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1955016.0, ans=0.125 2023-06-28 02:42:12,912 INFO [train.py:996] (1/4) Epoch 11, batch 20900, loss[loss=0.2119, simple_loss=0.2941, pruned_loss=0.06488, over 21560.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2855, pruned_loss=0.06657, over 4258848.20 frames. ], batch size: 195, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:42:27,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1955136.0, ans=0.125 2023-06-28 02:43:06,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1955256.0, ans=0.125 2023-06-28 02:43:27,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1955316.0, ans=0.025 2023-06-28 02:43:46,920 INFO [train.py:996] (1/4) Epoch 11, batch 20950, loss[loss=0.192, simple_loss=0.2651, pruned_loss=0.05951, over 21585.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2813, pruned_loss=0.06345, over 4256160.77 frames. ], batch size: 389, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:44:16,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1955436.0, ans=0.125 2023-06-28 02:44:26,754 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 6.600e+02 1.009e+03 1.481e+03 3.746e+03, threshold=2.018e+03, percent-clipped=8.0 2023-06-28 02:44:27,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1955496.0, ans=0.2 2023-06-28 02:45:14,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1955616.0, ans=0.0 2023-06-28 02:45:25,835 INFO [train.py:996] (1/4) Epoch 11, batch 21000, loss[loss=0.1928, simple_loss=0.2667, pruned_loss=0.05949, over 21812.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.279, pruned_loss=0.06335, over 4248701.03 frames. 
], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:45:25,836 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 02:45:45,778 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2661, simple_loss=0.3574, pruned_loss=0.08743, over 1796401.00 frames. 2023-06-28 02:45:45,779 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 02:46:45,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1955856.0, ans=0.0 2023-06-28 02:47:22,978 INFO [train.py:996] (1/4) Epoch 11, batch 21050, loss[loss=0.1739, simple_loss=0.2519, pruned_loss=0.04792, over 20161.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2771, pruned_loss=0.06365, over 4234129.53 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:47:41,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1955976.0, ans=0.0 2023-06-28 02:47:50,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-28 02:48:09,001 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.407e+02 6.035e+02 7.930e+02 1.297e+03 2.545e+03, threshold=1.586e+03, percent-clipped=7.0 2023-06-28 02:48:11,840 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-28 02:48:19,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1956096.0, ans=0.0 2023-06-28 02:48:31,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1956156.0, ans=0.125 2023-06-28 02:49:04,829 INFO [train.py:996] (1/4) Epoch 11, batch 21100, loss[loss=0.1956, simple_loss=0.256, pruned_loss=0.06763, over 20180.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2729, pruned_loss=0.06299, over 4244218.66 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:49:07,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-28 02:49:32,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-28 02:50:25,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=12.0 2023-06-28 02:50:40,615 INFO [train.py:996] (1/4) Epoch 11, batch 21150, loss[loss=0.1719, simple_loss=0.2363, pruned_loss=0.05376, over 21547.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2697, pruned_loss=0.06277, over 4239462.03 frames. ], batch size: 213, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:50:54,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1956576.0, ans=0.2 2023-06-28 02:51:09,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.26 vs. 
limit=15.0 2023-06-28 02:51:26,158 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.779e+02 6.313e+02 9.139e+02 1.246e+03 3.367e+03, threshold=1.828e+03, percent-clipped=14.0 2023-06-28 02:51:36,068 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:51:52,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1956756.0, ans=0.125 2023-06-28 02:52:16,462 INFO [train.py:996] (1/4) Epoch 11, batch 21200, loss[loss=0.1845, simple_loss=0.2627, pruned_loss=0.0531, over 21993.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2663, pruned_loss=0.06206, over 4255498.55 frames. ], batch size: 103, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:52:20,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1956876.0, ans=0.0 2023-06-28 02:52:34,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-28 02:52:50,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-28 02:52:51,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1956936.0, ans=0.0 2023-06-28 02:53:13,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1956996.0, ans=0.1 2023-06-28 02:53:15,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-28 02:53:20,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1957056.0, ans=0.2 2023-06-28 02:53:48,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1957116.0, ans=0.125 2023-06-28 02:53:58,174 INFO [train.py:996] (1/4) Epoch 11, batch 21250, loss[loss=0.2574, simple_loss=0.3347, pruned_loss=0.09003, over 21322.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2656, pruned_loss=0.0623, over 4265199.64 frames. ], batch size: 551, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:54:13,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1957176.0, ans=0.0 2023-06-28 02:54:14,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-28 02:54:24,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. 
limit=15.0 2023-06-28 02:54:28,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1957236.0, ans=0.1 2023-06-28 02:54:32,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1957236.0, ans=0.125 2023-06-28 02:54:47,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 7.284e+02 1.070e+03 1.587e+03 2.954e+03, threshold=2.141e+03, percent-clipped=16.0 2023-06-28 02:55:30,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1957416.0, ans=0.125 2023-06-28 02:55:39,437 INFO [train.py:996] (1/4) Epoch 11, batch 21300, loss[loss=0.2054, simple_loss=0.2847, pruned_loss=0.06307, over 21677.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2725, pruned_loss=0.06438, over 4273323.42 frames. ], batch size: 230, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:55:44,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1957476.0, ans=0.125 2023-06-28 02:55:53,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1957476.0, ans=0.125 2023-06-28 02:56:18,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1957596.0, ans=0.125 2023-06-28 02:56:28,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1957596.0, ans=0.1 2023-06-28 02:56:50,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1957656.0, ans=0.0 2023-06-28 02:56:59,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1957716.0, ans=0.0 2023-06-28 02:57:06,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1957716.0, ans=0.1 2023-06-28 02:57:22,789 INFO [train.py:996] (1/4) Epoch 11, batch 21350, loss[loss=0.1809, simple_loss=0.2761, pruned_loss=0.04283, over 21789.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.277, pruned_loss=0.06492, over 4278943.31 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:57:24,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1957776.0, ans=0.2 2023-06-28 02:58:08,239 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.369e+02 7.362e+02 1.168e+03 1.519e+03 3.106e+03, threshold=2.337e+03, percent-clipped=14.0 2023-06-28 02:58:37,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1957956.0, ans=0.125 2023-06-28 02:58:57,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1958016.0, ans=0.125 2023-06-28 02:59:07,585 INFO [train.py:996] (1/4) Epoch 11, batch 21400, loss[loss=0.2489, simple_loss=0.3285, pruned_loss=0.08464, over 21536.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2823, pruned_loss=0.06564, over 4278867.67 frames. 
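Editor's note: the Whitening entries compare a per-module statistic against a limit (e.g. `metric=10.21 vs. limit=15.0`, `num_groups=1, num_channels=256`) and, by implication, only intervene when the limit is exceeded. The exact formula is not shown in the log; the function below is an assumed stand-in that measures how far the per-group feature covariance is from being white via an eigenvalue-spread ratio, just to make the "metric vs. limit" comparison concrete.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """Assumed stand-in: ratio of the largest to the mean eigenvalue of the
    per-group feature covariance. Close to 1 for white features; larger when
    a few directions dominate."""
    n, c = x.shape                                   # (num_frames, num_channels)
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # (g, n, c/g)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / n     # (g, c/g, c/g)
    eigs = torch.linalg.eigvalsh(cov)                # ascending eigenvalues
    metric = (eigs[:, -1] / eigs.mean(dim=1).clamp(min=1e-20)).mean()
    return metric


feats = torch.randn(1000, 256)
print(whitening_metric(feats, num_groups=1))  # modest value for near-white features
# A hook could then compare this metric against the configured limit and
# apply a corrective penalty only when metric > limit.
```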
], batch size: 389, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:59:21,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1958076.0, ans=0.125 2023-06-28 02:59:31,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1958136.0, ans=0.125 2023-06-28 02:59:51,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=12.0 2023-06-28 03:00:49,133 INFO [train.py:996] (1/4) Epoch 11, batch 21450, loss[loss=0.2083, simple_loss=0.2829, pruned_loss=0.06683, over 21302.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2843, pruned_loss=0.0663, over 4274289.90 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:01:11,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1958436.0, ans=0.125 2023-06-28 03:01:11,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=15.0 2023-06-28 03:01:24,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1958436.0, ans=0.125 2023-06-28 03:01:26,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-28 03:01:33,843 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.846e+02 6.247e+02 7.898e+02 1.203e+03 2.207e+03, threshold=1.580e+03, percent-clipped=0.0 2023-06-28 03:02:14,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1958616.0, ans=0.0 2023-06-28 03:02:30,241 INFO [train.py:996] (1/4) Epoch 11, batch 21500, loss[loss=0.1949, simple_loss=0.2618, pruned_loss=0.06405, over 21746.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2817, pruned_loss=0.06672, over 4270746.26 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:02:37,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1958676.0, ans=0.07 2023-06-28 03:03:01,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1958736.0, ans=0.0 2023-06-28 03:04:08,447 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:04:11,237 INFO [train.py:996] (1/4) Epoch 11, batch 21550, loss[loss=0.1619, simple_loss=0.2328, pruned_loss=0.04547, over 21303.00 frames. ], tot_loss[loss=0.201, simple_loss=0.274, pruned_loss=0.06402, over 4264488.91 frames. 
], batch size: 144, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:04:20,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1958976.0, ans=0.125 2023-06-28 03:04:56,024 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 6.233e+02 9.531e+02 1.253e+03 2.671e+03, threshold=1.906e+03, percent-clipped=10.0 2023-06-28 03:04:56,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1959096.0, ans=0.125 2023-06-28 03:05:15,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-28 03:05:49,818 INFO [train.py:996] (1/4) Epoch 11, batch 21600, loss[loss=0.1873, simple_loss=0.2794, pruned_loss=0.04758, over 21596.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2703, pruned_loss=0.06286, over 4267083.80 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:05:56,784 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:06:17,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1959336.0, ans=0.125 2023-06-28 03:06:19,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.38 vs. limit=15.0 2023-06-28 03:07:21,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1959516.0, ans=0.125 2023-06-28 03:07:37,238 INFO [train.py:996] (1/4) Epoch 11, batch 21650, loss[loss=0.2895, simple_loss=0.3718, pruned_loss=0.1036, over 21505.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2764, pruned_loss=0.06182, over 4266056.63 frames. ], batch size: 507, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:07:43,366 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.10 vs. limit=15.0 2023-06-28 03:08:24,880 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:08:26,010 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.567e+02 7.101e+02 1.132e+03 1.604e+03 3.542e+03, threshold=2.263e+03, percent-clipped=14.0 2023-06-28 03:08:38,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=12.0 2023-06-28 03:09:01,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-28 03:09:18,420 INFO [train.py:996] (1/4) Epoch 11, batch 21700, loss[loss=0.2063, simple_loss=0.2993, pruned_loss=0.05665, over 21654.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2785, pruned_loss=0.06041, over 4269639.84 frames. 
], batch size: 414, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:10:06,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1959996.0, ans=0.125 2023-06-28 03:10:06,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1959996.0, ans=0.125 2023-06-28 03:10:09,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1959996.0, ans=0.125 2023-06-28 03:10:44,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1960116.0, ans=0.035 2023-06-28 03:11:00,008 INFO [train.py:996] (1/4) Epoch 11, batch 21750, loss[loss=0.1915, simple_loss=0.2638, pruned_loss=0.05958, over 21880.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2744, pruned_loss=0.0603, over 4272142.17 frames. ], batch size: 107, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:11:13,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1960176.0, ans=0.0 2023-06-28 03:11:43,944 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 7.621e+02 1.214e+03 1.880e+03 3.851e+03, threshold=2.427e+03, percent-clipped=16.0 2023-06-28 03:12:02,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1960356.0, ans=0.125 2023-06-28 03:12:04,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=6.0 2023-06-28 03:12:18,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1960416.0, ans=0.5 2023-06-28 03:12:26,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-28 03:12:37,219 INFO [train.py:996] (1/4) Epoch 11, batch 21800, loss[loss=0.1738, simple_loss=0.2418, pruned_loss=0.05292, over 21613.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2718, pruned_loss=0.06109, over 4264928.64 frames. ], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:13:58,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1960716.0, ans=0.125 2023-06-28 03:14:15,421 INFO [train.py:996] (1/4) Epoch 11, batch 21850, loss[loss=0.2209, simple_loss=0.3336, pruned_loss=0.05413, over 21260.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.277, pruned_loss=0.062, over 4253194.21 frames. ], batch size: 549, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:14:49,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1960836.0, ans=0.2 2023-06-28 03:15:00,556 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.380e+02 8.991e+02 1.412e+03 2.394e+03, threshold=1.798e+03, percent-clipped=0.0 2023-06-28 03:15:14,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1960956.0, ans=0.0 2023-06-28 03:15:19,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. 
limit=12.0 2023-06-28 03:15:22,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1960956.0, ans=0.125 2023-06-28 03:15:52,997 INFO [train.py:996] (1/4) Epoch 11, batch 21900, loss[loss=0.1778, simple_loss=0.2542, pruned_loss=0.0507, over 14335.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2798, pruned_loss=0.06356, over 4259446.39 frames. ], batch size: 60, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:15:59,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.10 vs. limit=22.5 2023-06-28 03:16:21,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1961136.0, ans=0.0 2023-06-28 03:16:54,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1961256.0, ans=0.0 2023-06-28 03:16:55,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1961256.0, ans=0.2 2023-06-28 03:17:06,504 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.44 vs. limit=10.0 2023-06-28 03:17:28,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1961376.0, ans=0.125 2023-06-28 03:17:29,756 INFO [train.py:996] (1/4) Epoch 11, batch 21950, loss[loss=0.1847, simple_loss=0.2528, pruned_loss=0.05828, over 21301.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2735, pruned_loss=0.06251, over 4270324.34 frames. ], batch size: 144, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:17:35,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.67 vs. limit=22.5 2023-06-28 03:18:09,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1961436.0, ans=0.0 2023-06-28 03:18:23,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.683e+02 5.864e+02 6.968e+02 1.003e+03 1.764e+03, threshold=1.394e+03, percent-clipped=0.0 2023-06-28 03:18:23,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1961496.0, ans=0.1 2023-06-28 03:18:44,255 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:19:10,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1961676.0, ans=0.125 2023-06-28 03:19:11,926 INFO [train.py:996] (1/4) Epoch 11, batch 22000, loss[loss=0.1506, simple_loss=0.2278, pruned_loss=0.0367, over 21510.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2681, pruned_loss=0.05995, over 4268422.52 frames. 
], batch size: 230, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 03:19:12,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1961676.0, ans=0.125 2023-06-28 03:19:16,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1961676.0, ans=0.125 2023-06-28 03:19:29,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1961736.0, ans=0.0 2023-06-28 03:20:10,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1961796.0, ans=0.125 2023-06-28 03:20:55,915 INFO [train.py:996] (1/4) Epoch 11, batch 22050, loss[loss=0.2249, simple_loss=0.3075, pruned_loss=0.07122, over 21392.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2705, pruned_loss=0.06049, over 4254717.89 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:21:08,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1961976.0, ans=0.125 2023-06-28 03:21:26,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1962036.0, ans=0.0 2023-06-28 03:21:53,076 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.146e+02 7.401e+02 1.317e+03 1.911e+03 4.599e+03, threshold=2.634e+03, percent-clipped=46.0 2023-06-28 03:22:19,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1962216.0, ans=0.0 2023-06-28 03:22:32,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-28 03:22:40,222 INFO [train.py:996] (1/4) Epoch 11, batch 22100, loss[loss=0.2755, simple_loss=0.3781, pruned_loss=0.0864, over 19712.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2815, pruned_loss=0.06544, over 4247613.26 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:22:43,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=12.0 2023-06-28 03:22:56,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1962276.0, ans=0.125 2023-06-28 03:23:07,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1962336.0, ans=0.1 2023-06-28 03:24:17,365 INFO [train.py:996] (1/4) Epoch 11, batch 22150, loss[loss=0.1983, simple_loss=0.271, pruned_loss=0.06279, over 21674.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2844, pruned_loss=0.06648, over 4260696.66 frames. 
], batch size: 263, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:24:19,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1962576.0, ans=0.1 2023-06-28 03:24:45,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1962636.0, ans=0.0 2023-06-28 03:25:13,701 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.995e+02 8.783e+02 1.255e+03 1.849e+03 4.260e+03, threshold=2.511e+03, percent-clipped=3.0 2023-06-28 03:26:00,156 INFO [train.py:996] (1/4) Epoch 11, batch 22200, loss[loss=0.2237, simple_loss=0.2884, pruned_loss=0.07949, over 21774.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2875, pruned_loss=0.06684, over 4265954.55 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:26:49,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1962996.0, ans=0.0 2023-06-28 03:27:04,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1963056.0, ans=0.0 2023-06-28 03:27:42,228 INFO [train.py:996] (1/4) Epoch 11, batch 22250, loss[loss=0.3194, simple_loss=0.368, pruned_loss=0.1353, over 21336.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2934, pruned_loss=0.06823, over 4269417.22 frames. ], batch size: 507, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:28:06,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1963236.0, ans=0.125 2023-06-28 03:28:15,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1963236.0, ans=0.125 2023-06-28 03:28:15,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1963236.0, ans=0.125 2023-06-28 03:28:26,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1963236.0, ans=0.125 2023-06-28 03:28:37,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.198e+02 6.711e+02 8.468e+02 1.239e+03 3.194e+03, threshold=1.694e+03, percent-clipped=5.0 2023-06-28 03:29:28,304 INFO [train.py:996] (1/4) Epoch 11, batch 22300, loss[loss=0.2479, simple_loss=0.3178, pruned_loss=0.089, over 21353.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2952, pruned_loss=0.07029, over 4274091.27 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:30:09,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1963596.0, ans=10.0 2023-06-28 03:30:42,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1963656.0, ans=0.1 2023-06-28 03:31:13,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1963776.0, ans=0.125 2023-06-28 03:31:14,593 INFO [train.py:996] (1/4) Epoch 11, batch 22350, loss[loss=0.2169, simple_loss=0.2847, pruned_loss=0.07456, over 21774.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.293, pruned_loss=0.071, over 4280344.03 frames. 
], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:31:16,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1963776.0, ans=0.2 2023-06-28 03:31:29,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1963776.0, ans=0.0 2023-06-28 03:31:53,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1963896.0, ans=0.5 2023-06-28 03:32:01,626 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.484e+02 6.278e+02 9.923e+02 1.351e+03 2.767e+03, threshold=1.985e+03, percent-clipped=14.0 2023-06-28 03:32:08,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1963896.0, ans=0.0 2023-06-28 03:32:23,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1963956.0, ans=0.1 2023-06-28 03:32:27,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1963956.0, ans=0.0 2023-06-28 03:32:39,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-28 03:32:56,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1964076.0, ans=0.125 2023-06-28 03:32:58,060 INFO [train.py:996] (1/4) Epoch 11, batch 22400, loss[loss=0.1882, simple_loss=0.2558, pruned_loss=0.06031, over 21263.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.29, pruned_loss=0.0684, over 4283761.23 frames. ], batch size: 608, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 03:33:31,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-06-28 03:33:48,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1964196.0, ans=0.1 2023-06-28 03:34:10,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1964256.0, ans=0.0 2023-06-28 03:34:40,492 INFO [train.py:996] (1/4) Epoch 11, batch 22450, loss[loss=0.1775, simple_loss=0.2419, pruned_loss=0.05659, over 21095.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2847, pruned_loss=0.0672, over 4259329.87 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:34:56,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1964376.0, ans=0.125 2023-06-28 03:34:57,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1964376.0, ans=0.125 2023-06-28 03:35:29,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. 
limit=10.0 2023-06-28 03:35:34,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1964496.0, ans=0.0 2023-06-28 03:35:35,819 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 5.963e+02 8.267e+02 1.246e+03 2.225e+03, threshold=1.653e+03, percent-clipped=2.0 2023-06-28 03:36:24,004 INFO [train.py:996] (1/4) Epoch 11, batch 22500, loss[loss=0.212, simple_loss=0.3032, pruned_loss=0.06041, over 21540.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2803, pruned_loss=0.06667, over 4268095.60 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:36:26,165 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:38:07,196 INFO [train.py:996] (1/4) Epoch 11, batch 22550, loss[loss=0.1674, simple_loss=0.219, pruned_loss=0.05786, over 20755.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2819, pruned_loss=0.06654, over 4265944.81 frames. ], batch size: 609, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:38:24,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1964976.0, ans=0.1 2023-06-28 03:38:49,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.01 vs. limit=22.5 2023-06-28 03:38:54,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.46 vs. limit=22.5 2023-06-28 03:39:00,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1965096.0, ans=0.125 2023-06-28 03:39:03,638 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.783e+02 6.887e+02 1.011e+03 1.935e+03 4.167e+03, threshold=2.022e+03, percent-clipped=31.0 2023-06-28 03:39:31,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1965156.0, ans=15.0 2023-06-28 03:39:56,247 INFO [train.py:996] (1/4) Epoch 11, batch 22600, loss[loss=0.3115, simple_loss=0.3851, pruned_loss=0.119, over 21453.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2861, pruned_loss=0.06691, over 4269547.57 frames. ], batch size: 507, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:41:25,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1965516.0, ans=0.1 2023-06-28 03:41:33,176 INFO [train.py:996] (1/4) Epoch 11, batch 22650, loss[loss=0.1839, simple_loss=0.2531, pruned_loss=0.05734, over 21400.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2831, pruned_loss=0.06665, over 4264801.48 frames. 
], batch size: 131, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:41:36,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1965576.0, ans=0.125 2023-06-28 03:41:53,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1965636.0, ans=0.125 2023-06-28 03:42:04,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1965636.0, ans=0.2 2023-06-28 03:42:26,685 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.890e+02 8.413e+02 1.340e+03 1.745e+03 3.098e+03, threshold=2.679e+03, percent-clipped=14.0 2023-06-28 03:42:28,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1965696.0, ans=0.125 2023-06-28 03:42:46,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1965756.0, ans=0.035 2023-06-28 03:43:14,254 INFO [train.py:996] (1/4) Epoch 11, batch 22700, loss[loss=0.1911, simple_loss=0.2584, pruned_loss=0.06193, over 21822.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2771, pruned_loss=0.06593, over 4251813.45 frames. ], batch size: 352, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:43:53,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1965996.0, ans=0.125 2023-06-28 03:44:04,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1965996.0, ans=0.2 2023-06-28 03:44:40,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1966116.0, ans=0.125 2023-06-28 03:44:43,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1966116.0, ans=0.125 2023-06-28 03:44:43,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1966116.0, ans=0.125 2023-06-28 03:44:49,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1966116.0, ans=0.125 2023-06-28 03:44:56,782 INFO [train.py:996] (1/4) Epoch 11, batch 22750, loss[loss=0.1752, simple_loss=0.2381, pruned_loss=0.05616, over 20738.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2794, pruned_loss=0.06741, over 4259490.61 frames. ], batch size: 607, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:45:04,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-28 03:45:12,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1966236.0, ans=0.0 2023-06-28 03:45:55,454 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 9.181e+02 1.363e+03 2.029e+03 5.534e+03, threshold=2.727e+03, percent-clipped=14.0 2023-06-28 03:46:38,648 INFO [train.py:996] (1/4) Epoch 11, batch 22800, loss[loss=0.2394, simple_loss=0.3076, pruned_loss=0.08557, over 21842.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2847, pruned_loss=0.06942, over 4266548.43 frames. 
], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:46:54,755 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-28 03:47:19,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1966596.0, ans=0.0 2023-06-28 03:48:20,559 INFO [train.py:996] (1/4) Epoch 11, batch 22850, loss[loss=0.2083, simple_loss=0.2939, pruned_loss=0.06134, over 19935.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2809, pruned_loss=0.06902, over 4274953.84 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:48:32,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1966776.0, ans=0.125 2023-06-28 03:49:02,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1966896.0, ans=0.125 2023-06-28 03:49:19,986 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.398e+02 6.821e+02 8.997e+02 1.443e+03 3.960e+03, threshold=1.799e+03, percent-clipped=4.0 2023-06-28 03:50:04,231 INFO [train.py:996] (1/4) Epoch 11, batch 22900, loss[loss=0.2746, simple_loss=0.3791, pruned_loss=0.08498, over 21457.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2814, pruned_loss=0.06836, over 4259403.36 frames. ], batch size: 507, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:51:48,483 INFO [train.py:996] (1/4) Epoch 11, batch 22950, loss[loss=0.2242, simple_loss=0.3371, pruned_loss=0.05567, over 21676.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.293, pruned_loss=0.06677, over 4261168.27 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:51:52,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1967376.0, ans=0.125 2023-06-28 03:52:00,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1967376.0, ans=0.125 2023-06-28 03:52:24,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1967436.0, ans=0.125 2023-06-28 03:52:35,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1967496.0, ans=0.125 2023-06-28 03:52:39,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1967496.0, ans=0.0 2023-06-28 03:52:42,073 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.491e+02 7.317e+02 1.405e+03 2.219e+03 4.116e+03, threshold=2.810e+03, percent-clipped=42.0 2023-06-28 03:53:07,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1967616.0, ans=0.125 2023-06-28 03:53:08,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-06-28 03:53:25,449 INFO [train.py:996] (1/4) Epoch 11, batch 23000, loss[loss=0.2194, simple_loss=0.3464, pruned_loss=0.04613, over 20756.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2939, pruned_loss=0.06562, over 4266515.12 frames. 
], batch size: 607, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:53:56,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-28 03:54:06,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1967736.0, ans=0.2 2023-06-28 03:54:23,241 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-28 03:54:46,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1967856.0, ans=0.07 2023-06-28 03:54:53,425 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-28 03:55:01,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1967916.0, ans=0.125 2023-06-28 03:55:07,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1967916.0, ans=0.015 2023-06-28 03:55:09,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1967916.0, ans=0.0 2023-06-28 03:55:11,926 INFO [train.py:996] (1/4) Epoch 11, batch 23050, loss[loss=0.2428, simple_loss=0.3216, pruned_loss=0.08199, over 21587.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2945, pruned_loss=0.06686, over 4272898.20 frames. ], batch size: 389, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:55:26,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1967976.0, ans=0.0 2023-06-28 03:55:35,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-28 03:55:38,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1968036.0, ans=0.1 2023-06-28 03:55:38,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1968036.0, ans=0.1 2023-06-28 03:55:47,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-28 03:55:59,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1968096.0, ans=0.0 2023-06-28 03:56:02,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.806e+02 7.903e+02 1.210e+03 1.646e+03 4.576e+03, threshold=2.420e+03, percent-clipped=5.0 2023-06-28 03:56:04,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1968096.0, ans=0.0 2023-06-28 03:56:19,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.78 vs. limit=15.0 2023-06-28 03:56:53,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1968276.0, ans=0.125 2023-06-28 03:56:54,619 INFO [train.py:996] (1/4) Epoch 11, batch 23100, loss[loss=0.2003, simple_loss=0.2613, pruned_loss=0.06963, over 21592.00 frames. 
], tot_loss[loss=0.2135, simple_loss=0.2911, pruned_loss=0.06798, over 4271867.49 frames. ], batch size: 415, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:58:10,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1968516.0, ans=0.0 2023-06-28 03:58:36,206 INFO [train.py:996] (1/4) Epoch 11, batch 23150, loss[loss=0.2311, simple_loss=0.2915, pruned_loss=0.08536, over 21593.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2863, pruned_loss=0.06748, over 4279001.69 frames. ], batch size: 471, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:59:07,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=12.0 2023-06-28 03:59:20,957 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.155e+02 6.572e+02 9.609e+02 1.447e+03 3.666e+03, threshold=1.922e+03, percent-clipped=4.0 2023-06-28 03:59:21,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-28 03:59:24,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1968756.0, ans=0.125 2023-06-28 03:59:32,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-28 04:00:00,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=8.0 2023-06-28 04:00:06,814 INFO [train.py:996] (1/4) Epoch 11, batch 23200, loss[loss=0.2212, simple_loss=0.2951, pruned_loss=0.07364, over 21281.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2865, pruned_loss=0.06842, over 4285122.09 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:00:28,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1968936.0, ans=0.0 2023-06-28 04:00:41,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1968936.0, ans=0.07 2023-06-28 04:01:15,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1969056.0, ans=0.1 2023-06-28 04:01:48,926 INFO [train.py:996] (1/4) Epoch 11, batch 23250, loss[loss=0.2581, simple_loss=0.3056, pruned_loss=0.1054, over 21808.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2862, pruned_loss=0.0687, over 4294079.37 frames. ], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:02:24,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. 
limit=22.5 2023-06-28 04:02:42,554 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.956e+02 7.376e+02 1.130e+03 1.714e+03 3.374e+03, threshold=2.260e+03, percent-clipped=21.0 2023-06-28 04:03:05,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1969356.0, ans=0.125 2023-06-28 04:03:31,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1969416.0, ans=0.1 2023-06-28 04:03:31,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1969416.0, ans=0.125 2023-06-28 04:03:34,408 INFO [train.py:996] (1/4) Epoch 11, batch 23300, loss[loss=0.2086, simple_loss=0.2844, pruned_loss=0.06637, over 20069.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2921, pruned_loss=0.07055, over 4286300.48 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:03:36,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1969476.0, ans=0.1 2023-06-28 04:03:55,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1969536.0, ans=0.0 2023-06-28 04:04:17,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1969596.0, ans=0.0 2023-06-28 04:04:34,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-28 04:04:44,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1969656.0, ans=0.0 2023-06-28 04:04:59,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-28 04:05:18,305 INFO [train.py:996] (1/4) Epoch 11, batch 23350, loss[loss=0.1755, simple_loss=0.2682, pruned_loss=0.04139, over 21805.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.296, pruned_loss=0.07014, over 4273135.38 frames. ], batch size: 372, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:05:33,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1969836.0, ans=0.125 2023-06-28 04:05:46,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1969836.0, ans=0.95 2023-06-28 04:06:14,290 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.812e+02 6.998e+02 1.084e+03 1.696e+03 4.677e+03, threshold=2.169e+03, percent-clipped=9.0 2023-06-28 04:07:00,186 INFO [train.py:996] (1/4) Epoch 11, batch 23400, loss[loss=0.206, simple_loss=0.2823, pruned_loss=0.0649, over 21928.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2902, pruned_loss=0.06674, over 4278940.81 frames. 
], batch size: 333, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:07:47,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1970196.0, ans=0.0 2023-06-28 04:07:58,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1970256.0, ans=0.125 2023-06-28 04:08:26,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1970316.0, ans=0.0 2023-06-28 04:08:28,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1970316.0, ans=0.05 2023-06-28 04:08:38,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1970316.0, ans=0.125 2023-06-28 04:08:42,829 INFO [train.py:996] (1/4) Epoch 11, batch 23450, loss[loss=0.2338, simple_loss=0.3078, pruned_loss=0.07992, over 21744.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2918, pruned_loss=0.06872, over 4281545.97 frames. ], batch size: 298, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:09:04,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-28 04:09:26,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1970496.0, ans=0.125 2023-06-28 04:09:38,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1970496.0, ans=0.0 2023-06-28 04:09:39,101 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 9.066e+02 1.305e+03 2.110e+03 3.921e+03, threshold=2.611e+03, percent-clipped=24.0 2023-06-28 04:10:11,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1970616.0, ans=0.0 2023-06-28 04:10:20,268 INFO [train.py:996] (1/4) Epoch 11, batch 23500, loss[loss=0.211, simple_loss=0.283, pruned_loss=0.0695, over 21865.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2924, pruned_loss=0.0697, over 4280226.97 frames. ], batch size: 351, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:11:14,979 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:11:31,504 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-28 04:11:56,887 INFO [train.py:996] (1/4) Epoch 11, batch 23550, loss[loss=0.1856, simple_loss=0.2588, pruned_loss=0.05625, over 21414.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2871, pruned_loss=0.07, over 4276092.36 frames. 
], batch size: 131, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:12:34,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1971036.0, ans=0.1 2023-06-28 04:12:56,972 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.515e+02 7.071e+02 9.804e+02 1.415e+03 2.782e+03, threshold=1.961e+03, percent-clipped=2.0 2023-06-28 04:13:00,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1971156.0, ans=0.2 2023-06-28 04:13:04,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1971156.0, ans=0.125 2023-06-28 04:13:30,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-28 04:13:33,841 INFO [train.py:996] (1/4) Epoch 11, batch 23600, loss[loss=0.2171, simple_loss=0.2905, pruned_loss=0.0718, over 21786.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2883, pruned_loss=0.07009, over 4251326.29 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:13:56,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1971336.0, ans=0.0 2023-06-28 04:14:07,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1971336.0, ans=0.0 2023-06-28 04:14:40,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1971396.0, ans=0.0 2023-06-28 04:14:50,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1971456.0, ans=0.125 2023-06-28 04:15:06,414 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2023-06-28 04:15:22,097 INFO [train.py:996] (1/4) Epoch 11, batch 23650, loss[loss=0.1815, simple_loss=0.2639, pruned_loss=0.04949, over 21296.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2878, pruned_loss=0.06826, over 4256385.70 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:15:26,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.19 vs. limit=22.5 2023-06-28 04:15:47,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1971636.0, ans=0.125 2023-06-28 04:16:25,964 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 7.663e+02 1.286e+03 2.404e+03 4.690e+03, threshold=2.571e+03, percent-clipped=33.0 2023-06-28 04:16:43,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-28 04:17:10,180 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:17:11,305 INFO [train.py:996] (1/4) Epoch 11, batch 23700, loss[loss=0.157, simple_loss=0.2297, pruned_loss=0.04217, over 19895.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2892, pruned_loss=0.0678, over 4257858.76 frames. 
], batch size: 704, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:17:55,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1971996.0, ans=0.07 2023-06-28 04:18:09,998 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-28 04:18:10,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.87 vs. limit=15.0 2023-06-28 04:18:44,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1972116.0, ans=0.0 2023-06-28 04:18:49,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1972116.0, ans=0.2 2023-06-28 04:18:55,616 INFO [train.py:996] (1/4) Epoch 11, batch 23750, loss[loss=0.1905, simple_loss=0.2883, pruned_loss=0.04636, over 20738.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2926, pruned_loss=0.06809, over 4260578.20 frames. ], batch size: 607, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:19:18,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972176.0, ans=0.1 2023-06-28 04:19:22,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1972236.0, ans=0.125 2023-06-28 04:19:53,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-28 04:19:59,567 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 7.586e+02 1.231e+03 1.988e+03 4.114e+03, threshold=2.463e+03, percent-clipped=17.0 2023-06-28 04:20:27,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1972416.0, ans=0.125 2023-06-28 04:20:49,618 INFO [train.py:996] (1/4) Epoch 11, batch 23800, loss[loss=0.1946, simple_loss=0.291, pruned_loss=0.04908, over 21627.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2911, pruned_loss=0.06656, over 4265649.79 frames. ], batch size: 389, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:22:03,795 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-28 04:22:14,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1972716.0, ans=0.2 2023-06-28 04:22:30,797 INFO [train.py:996] (1/4) Epoch 11, batch 23850, loss[loss=0.2742, simple_loss=0.3548, pruned_loss=0.09681, over 21762.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2975, pruned_loss=0.06864, over 4267518.77 frames. 
], batch size: 124, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:23:01,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1972836.0, ans=0.0 2023-06-28 04:23:10,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=1972896.0, ans=12.0 2023-06-28 04:23:30,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.838e+02 1.015e+03 1.727e+03 2.965e+03 4.931e+03, threshold=3.454e+03, percent-clipped=27.0 2023-06-28 04:23:53,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1972956.0, ans=0.1 2023-06-28 04:24:01,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1973016.0, ans=22.5 2023-06-28 04:24:14,931 INFO [train.py:996] (1/4) Epoch 11, batch 23900, loss[loss=0.2385, simple_loss=0.3253, pruned_loss=0.07587, over 21557.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3027, pruned_loss=0.06986, over 4269417.47 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:24:15,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1973076.0, ans=0.1 2023-06-28 04:24:27,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1973076.0, ans=0.125 2023-06-28 04:24:37,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1973136.0, ans=0.125 2023-06-28 04:25:09,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1973196.0, ans=0.5 2023-06-28 04:25:36,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1973256.0, ans=0.125 2023-06-28 04:25:57,205 INFO [train.py:996] (1/4) Epoch 11, batch 23950, loss[loss=0.2287, simple_loss=0.2861, pruned_loss=0.0856, over 21271.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2982, pruned_loss=0.06955, over 4268435.56 frames. 
], batch size: 471, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:26:03,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1973376.0, ans=0.125 2023-06-28 04:26:43,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1973496.0, ans=0.0 2023-06-28 04:26:46,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1973496.0, ans=0.125 2023-06-28 04:26:48,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1973496.0, ans=0.125 2023-06-28 04:26:50,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1973496.0, ans=0.0 2023-06-28 04:27:01,132 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.639e+02 7.884e+02 1.240e+03 1.758e+03 3.648e+03, threshold=2.481e+03, percent-clipped=1.0 2023-06-28 04:27:15,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1973556.0, ans=0.125 2023-06-28 04:27:31,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1973616.0, ans=0.125 2023-06-28 04:27:31,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1973616.0, ans=0.1 2023-06-28 04:27:40,580 INFO [train.py:996] (1/4) Epoch 11, batch 24000, loss[loss=0.2895, simple_loss=0.3414, pruned_loss=0.1188, over 21471.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2994, pruned_loss=0.07212, over 4264979.34 frames. ], batch size: 510, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:27:40,581 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 04:28:01,235 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2606, simple_loss=0.3539, pruned_loss=0.08365, over 1796401.00 frames. 2023-06-28 04:28:01,236 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 04:28:12,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1973676.0, ans=0.2 2023-06-28 04:28:22,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1973736.0, ans=0.95 2023-06-28 04:28:35,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1973736.0, ans=10.0 2023-06-28 04:29:09,257 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:29:45,840 INFO [train.py:996] (1/4) Epoch 11, batch 24050, loss[loss=0.1729, simple_loss=0.2535, pruned_loss=0.04612, over 21237.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3005, pruned_loss=0.07252, over 4261300.31 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:29:46,860 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. 
limit=15.0 2023-06-28 04:30:45,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1974096.0, ans=0.1 2023-06-28 04:30:50,227 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.159e+02 7.180e+02 1.052e+03 1.636e+03 2.739e+03, threshold=2.104e+03, percent-clipped=1.0 2023-06-28 04:30:51,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1974156.0, ans=0.05 2023-06-28 04:30:52,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-28 04:30:52,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1974156.0, ans=0.09899494936611666 2023-06-28 04:30:54,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1974156.0, ans=0.125 2023-06-28 04:30:56,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1974156.0, ans=0.0 2023-06-28 04:31:33,797 INFO [train.py:996] (1/4) Epoch 11, batch 24100, loss[loss=0.2526, simple_loss=0.3335, pruned_loss=0.08589, over 21736.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3007, pruned_loss=0.07114, over 4268232.93 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:31:50,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1974336.0, ans=0.125 2023-06-28 04:31:51,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-28 04:32:39,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1974456.0, ans=0.125 2023-06-28 04:32:41,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1974456.0, ans=0.125 2023-06-28 04:32:50,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1974516.0, ans=0.0 2023-06-28 04:33:13,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1974576.0, ans=0.125 2023-06-28 04:33:14,900 INFO [train.py:996] (1/4) Epoch 11, batch 24150, loss[loss=0.2308, simple_loss=0.3136, pruned_loss=0.07403, over 21790.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2992, pruned_loss=0.07158, over 4270861.12 frames. ], batch size: 124, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:33:23,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.86 vs. 
limit=15.0 2023-06-28 04:33:25,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974576.0, ans=0.1 2023-06-28 04:34:00,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1974696.0, ans=0.125 2023-06-28 04:34:05,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1974696.0, ans=0.0 2023-06-28 04:34:14,498 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 8.001e+02 1.203e+03 1.842e+03 3.600e+03, threshold=2.405e+03, percent-clipped=13.0 2023-06-28 04:34:16,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1974756.0, ans=0.2 2023-06-28 04:34:20,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-28 04:34:22,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974756.0, ans=0.1 2023-06-28 04:34:38,988 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:34:58,327 INFO [train.py:996] (1/4) Epoch 11, batch 24200, loss[loss=0.221, simple_loss=0.2955, pruned_loss=0.07326, over 21235.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3028, pruned_loss=0.07254, over 4275653.89 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:35:12,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1974876.0, ans=0.125 2023-06-28 04:35:18,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-28 04:36:11,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-28 04:36:28,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-28 04:36:36,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1975116.0, ans=0.125 2023-06-28 04:36:47,569 INFO [train.py:996] (1/4) Epoch 11, batch 24250, loss[loss=0.1976, simple_loss=0.2985, pruned_loss=0.04839, over 21754.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2999, pruned_loss=0.06834, over 4278638.79 frames. 
], batch size: 332, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:37:04,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1975176.0, ans=0.0 2023-06-28 04:37:30,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1975296.0, ans=0.1 2023-06-28 04:37:45,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1975296.0, ans=0.95 2023-06-28 04:37:48,145 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.666e+02 6.261e+02 9.348e+02 1.527e+03 2.867e+03, threshold=1.870e+03, percent-clipped=6.0 2023-06-28 04:38:02,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1975356.0, ans=0.2 2023-06-28 04:38:35,081 INFO [train.py:996] (1/4) Epoch 11, batch 24300, loss[loss=0.162, simple_loss=0.2295, pruned_loss=0.04723, over 21831.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2957, pruned_loss=0.06406, over 4270207.88 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:39:05,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1975536.0, ans=0.1 2023-06-28 04:39:25,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1975596.0, ans=0.0 2023-06-28 04:39:35,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1975656.0, ans=0.0 2023-06-28 04:39:46,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1975656.0, ans=0.0 2023-06-28 04:40:16,733 INFO [train.py:996] (1/4) Epoch 11, batch 24350, loss[loss=0.2166, simple_loss=0.2883, pruned_loss=0.07247, over 21888.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2936, pruned_loss=0.06394, over 4270907.68 frames. ], batch size: 316, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:40:30,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1975776.0, ans=0.0 2023-06-28 04:41:13,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1975896.0, ans=0.2 2023-06-28 04:41:16,494 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.173e+02 7.216e+02 1.198e+03 1.667e+03 3.137e+03, threshold=2.397e+03, percent-clipped=16.0 2023-06-28 04:41:25,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1975956.0, ans=0.125 2023-06-28 04:41:27,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1975956.0, ans=0.2 2023-06-28 04:41:57,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.54 vs. limit=10.0 2023-06-28 04:41:59,536 INFO [train.py:996] (1/4) Epoch 11, batch 24400, loss[loss=0.2154, simple_loss=0.273, pruned_loss=0.07895, over 20204.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2941, pruned_loss=0.06572, over 4269471.43 frames. 
], batch size: 707, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:42:10,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1976076.0, ans=0.0 2023-06-28 04:43:42,610 INFO [train.py:996] (1/4) Epoch 11, batch 24450, loss[loss=0.191, simple_loss=0.2764, pruned_loss=0.05276, over 21460.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2963, pruned_loss=0.06691, over 4267738.13 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:44:11,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1976436.0, ans=0.125 2023-06-28 04:44:30,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1976496.0, ans=0.0 2023-06-28 04:44:48,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.546e+02 6.657e+02 8.727e+02 1.270e+03 2.887e+03, threshold=1.745e+03, percent-clipped=2.0 2023-06-28 04:44:55,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1976556.0, ans=0.0 2023-06-28 04:45:19,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1976616.0, ans=0.0 2023-06-28 04:45:21,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1976616.0, ans=0.2 2023-06-28 04:45:23,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1976676.0, ans=0.07 2023-06-28 04:45:24,297 INFO [train.py:996] (1/4) Epoch 11, batch 24500, loss[loss=0.2001, simple_loss=0.2987, pruned_loss=0.05079, over 21680.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2967, pruned_loss=0.06735, over 4268072.57 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:45:26,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1976676.0, ans=0.5 2023-06-28 04:45:59,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1976736.0, ans=0.2 2023-06-28 04:46:33,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1976856.0, ans=0.1 2023-06-28 04:46:36,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-28 04:46:46,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-28 04:47:07,088 INFO [train.py:996] (1/4) Epoch 11, batch 24550, loss[loss=0.2299, simple_loss=0.309, pruned_loss=0.07539, over 21809.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2996, pruned_loss=0.06899, over 4275616.19 frames. 
], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:48:12,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1977096.0, ans=0.0 2023-06-28 04:48:18,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.753e+02 7.977e+02 1.391e+03 1.923e+03 3.873e+03, threshold=2.782e+03, percent-clipped=31.0 2023-06-28 04:48:20,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1977156.0, ans=0.0 2023-06-28 04:48:28,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1977156.0, ans=0.125 2023-06-28 04:48:30,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=15.0 2023-06-28 04:48:54,463 INFO [train.py:996] (1/4) Epoch 11, batch 24600, loss[loss=0.1764, simple_loss=0.2467, pruned_loss=0.05303, over 21356.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.295, pruned_loss=0.06794, over 4279789.63 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:48:58,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1977276.0, ans=0.125 2023-06-28 04:49:08,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1977276.0, ans=0.1 2023-06-28 04:49:50,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1977396.0, ans=0.125 2023-06-28 04:49:54,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1977396.0, ans=0.0 2023-06-28 04:49:55,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1977396.0, ans=0.125 2023-06-28 04:50:06,196 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=12.0 2023-06-28 04:50:22,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1977516.0, ans=0.125 2023-06-28 04:50:26,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1977516.0, ans=0.2 2023-06-28 04:50:37,064 INFO [train.py:996] (1/4) Epoch 11, batch 24650, loss[loss=0.1949, simple_loss=0.2558, pruned_loss=0.06704, over 21274.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2887, pruned_loss=0.06647, over 4274265.78 frames. ], batch size: 144, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:50:42,738 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:50:51,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1977576.0, ans=0.0 2023-06-28 04:50:59,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. 
limit=15.0 2023-06-28 04:51:42,514 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.777e+02 8.360e+02 1.097e+03 1.550e+03 2.969e+03, threshold=2.194e+03, percent-clipped=2.0 2023-06-28 04:52:00,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=12.0 2023-06-28 04:52:19,286 INFO [train.py:996] (1/4) Epoch 11, batch 24700, loss[loss=0.212, simple_loss=0.2696, pruned_loss=0.07726, over 21364.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2847, pruned_loss=0.0654, over 4266800.16 frames. ], batch size: 473, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:53:15,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1977996.0, ans=0.1 2023-06-28 04:53:17,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1977996.0, ans=0.015 2023-06-28 04:54:01,974 INFO [train.py:996] (1/4) Epoch 11, batch 24750, loss[loss=0.1649, simple_loss=0.2375, pruned_loss=0.04618, over 21348.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2789, pruned_loss=0.06419, over 4263167.73 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:55:07,374 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.038e+02 5.891e+02 8.003e+02 1.099e+03 2.127e+03, threshold=1.601e+03, percent-clipped=0.0 2023-06-28 04:55:11,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1978356.0, ans=0.1 2023-06-28 04:55:11,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1978356.0, ans=0.125 2023-06-28 04:55:16,087 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:55:38,488 INFO [train.py:996] (1/4) Epoch 11, batch 24800, loss[loss=0.1968, simple_loss=0.2832, pruned_loss=0.05524, over 21228.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2736, pruned_loss=0.0635, over 4264017.05 frames. ], batch size: 549, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 04:57:02,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.00 vs. limit=15.0 2023-06-28 04:57:22,267 INFO [train.py:996] (1/4) Epoch 11, batch 24850, loss[loss=0.1899, simple_loss=0.2611, pruned_loss=0.05938, over 21492.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2721, pruned_loss=0.06372, over 4262864.46 frames. 
], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:58:07,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1978836.0, ans=0.125 2023-06-28 04:58:35,356 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.508e+02 8.527e+02 1.164e+03 1.873e+03 3.084e+03, threshold=2.328e+03, percent-clipped=28.0 2023-06-28 04:58:36,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1978956.0, ans=0.1 2023-06-28 04:58:44,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1978956.0, ans=0.0 2023-06-28 04:58:46,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1978956.0, ans=15.0 2023-06-28 04:58:49,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-28 04:59:09,780 INFO [train.py:996] (1/4) Epoch 11, batch 24900, loss[loss=0.2323, simple_loss=0.3103, pruned_loss=0.07709, over 21588.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2752, pruned_loss=0.06454, over 4273177.81 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:59:14,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-28 04:59:22,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1979076.0, ans=0.0 2023-06-28 04:59:31,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1979136.0, ans=0.0 2023-06-28 04:59:34,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-28 04:59:56,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1979196.0, ans=0.125 2023-06-28 05:00:23,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1979256.0, ans=0.2 2023-06-28 05:00:25,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1979256.0, ans=0.125 2023-06-28 05:00:26,372 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-28 05:00:35,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1979316.0, ans=0.0 2023-06-28 05:00:58,714 INFO [train.py:996] (1/4) Epoch 11, batch 24950, loss[loss=0.2256, simple_loss=0.2989, pruned_loss=0.07616, over 21794.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2825, pruned_loss=0.06774, over 4274901.22 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:01:03,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.17 vs. 
limit=15.0 2023-06-28 05:01:34,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979436.0, ans=0.1 2023-06-28 05:01:45,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-28 05:02:04,368 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.368e+02 8.687e+02 1.291e+03 2.049e+03 3.753e+03, threshold=2.582e+03, percent-clipped=19.0 2023-06-28 05:02:42,788 INFO [train.py:996] (1/4) Epoch 11, batch 25000, loss[loss=0.225, simple_loss=0.2888, pruned_loss=0.08062, over 21562.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2896, pruned_loss=0.06944, over 4279212.38 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:03:30,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1979796.0, ans=0.1 2023-06-28 05:03:38,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1979796.0, ans=0.125 2023-06-28 05:03:43,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979856.0, ans=0.1 2023-06-28 05:03:43,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1979856.0, ans=0.125 2023-06-28 05:04:25,863 INFO [train.py:996] (1/4) Epoch 11, batch 25050, loss[loss=0.2156, simple_loss=0.2668, pruned_loss=0.08223, over 21219.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2846, pruned_loss=0.06866, over 4285699.61 frames. ], batch size: 471, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:04:35,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1979976.0, ans=0.125 2023-06-28 05:04:35,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1979976.0, ans=0.1 2023-06-28 05:05:04,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1980036.0, ans=0.125 2023-06-28 05:05:37,079 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.870e+02 6.206e+02 8.703e+02 1.312e+03 2.418e+03, threshold=1.741e+03, percent-clipped=0.0 2023-06-28 05:06:09,896 INFO [train.py:996] (1/4) Epoch 11, batch 25100, loss[loss=0.2001, simple_loss=0.2872, pruned_loss=0.05651, over 21583.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2808, pruned_loss=0.06775, over 4285526.27 frames. 
], batch size: 230, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:06:54,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980396.0, ans=0.1 2023-06-28 05:07:06,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1980396.0, ans=0.125 2023-06-28 05:07:25,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1980456.0, ans=0.0 2023-06-28 05:07:29,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1980516.0, ans=0.125 2023-06-28 05:07:51,376 INFO [train.py:996] (1/4) Epoch 11, batch 25150, loss[loss=0.1958, simple_loss=0.2801, pruned_loss=0.05575, over 21658.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2834, pruned_loss=0.06601, over 4287018.72 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:07:52,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1980576.0, ans=0.0 2023-06-28 05:08:46,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1980696.0, ans=0.125 2023-06-28 05:08:55,622 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.446e+02 6.553e+02 1.065e+03 1.530e+03 2.529e+03, threshold=2.131e+03, percent-clipped=15.0 2023-06-28 05:09:04,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1980756.0, ans=0.1 2023-06-28 05:09:07,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1980816.0, ans=0.125 2023-06-28 05:09:28,757 INFO [train.py:996] (1/4) Epoch 11, batch 25200, loss[loss=0.1902, simple_loss=0.2846, pruned_loss=0.04792, over 21690.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2839, pruned_loss=0.0649, over 4277781.70 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:09:58,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1980936.0, ans=0.0 2023-06-28 05:10:00,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1980936.0, ans=0.0 2023-06-28 05:10:38,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1981056.0, ans=0.0 2023-06-28 05:10:49,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1981056.0, ans=0.2 2023-06-28 05:11:06,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-28 05:11:10,873 INFO [train.py:996] (1/4) Epoch 11, batch 25250, loss[loss=0.1808, simple_loss=0.2566, pruned_loss=0.0525, over 21589.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2823, pruned_loss=0.06359, over 4271434.04 frames. ], batch size: 263, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:11:39,611 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. 
limit=15.0 2023-06-28 05:12:09,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1981296.0, ans=0.125 2023-06-28 05:12:11,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1981296.0, ans=0.125 2023-06-28 05:12:21,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.599e+02 7.557e+02 1.172e+03 1.779e+03 3.738e+03, threshold=2.344e+03, percent-clipped=14.0 2023-06-28 05:12:24,407 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.65 vs. limit=22.5 2023-06-28 05:12:30,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1981356.0, ans=0.0 2023-06-28 05:12:31,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1981356.0, ans=0.0 2023-06-28 05:12:59,823 INFO [train.py:996] (1/4) Epoch 11, batch 25300, loss[loss=0.1822, simple_loss=0.2368, pruned_loss=0.06384, over 20735.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2813, pruned_loss=0.06351, over 4274933.46 frames. ], batch size: 608, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:13:35,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1981536.0, ans=0.125 2023-06-28 05:14:00,066 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-28 05:14:24,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1981716.0, ans=0.0 2023-06-28 05:14:36,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1981716.0, ans=0.1 2023-06-28 05:14:44,498 INFO [train.py:996] (1/4) Epoch 11, batch 25350, loss[loss=0.1874, simple_loss=0.277, pruned_loss=0.04889, over 20691.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.282, pruned_loss=0.06294, over 4261977.83 frames. ], batch size: 607, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:14:51,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1981776.0, ans=0.125 2023-06-28 05:15:17,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1981836.0, ans=0.125 2023-06-28 05:15:53,135 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 7.550e+02 1.200e+03 1.857e+03 4.350e+03, threshold=2.399e+03, percent-clipped=14.0 2023-06-28 05:15:53,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1981956.0, ans=0.0 2023-06-28 05:16:25,270 INFO [train.py:996] (1/4) Epoch 11, batch 25400, loss[loss=0.1918, simple_loss=0.2553, pruned_loss=0.06413, over 21207.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2797, pruned_loss=0.06249, over 4263934.74 frames. 
], batch size: 144, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:16:32,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1982076.0, ans=0.0 2023-06-28 05:16:52,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-28 05:17:11,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1982196.0, ans=0.125 2023-06-28 05:17:42,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1982256.0, ans=0.0 2023-06-28 05:18:07,457 INFO [train.py:996] (1/4) Epoch 11, batch 25450, loss[loss=0.19, simple_loss=0.2851, pruned_loss=0.04744, over 21385.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2807, pruned_loss=0.06329, over 4259940.55 frames. ], batch size: 194, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:18:08,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1982376.0, ans=0.0 2023-06-28 05:18:49,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1982496.0, ans=0.125 2023-06-28 05:19:17,817 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.365e+02 6.795e+02 1.021e+03 1.795e+03 3.141e+03, threshold=2.041e+03, percent-clipped=7.0 2023-06-28 05:19:56,328 INFO [train.py:996] (1/4) Epoch 11, batch 25500, loss[loss=0.2045, simple_loss=0.2783, pruned_loss=0.06536, over 19963.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2805, pruned_loss=0.06067, over 4252316.66 frames. ], batch size: 702, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:20:42,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1982796.0, ans=0.0 2023-06-28 05:21:39,901 INFO [train.py:996] (1/4) Epoch 11, batch 25550, loss[loss=0.2035, simple_loss=0.2889, pruned_loss=0.05901, over 21199.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2872, pruned_loss=0.06122, over 4229981.64 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:21:51,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1982976.0, ans=0.125 2023-06-28 05:22:18,536 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:22:26,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1983096.0, ans=0.1 2023-06-28 05:22:44,325 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.429e+02 7.335e+02 1.015e+03 1.599e+03 3.312e+03, threshold=2.031e+03, percent-clipped=14.0 2023-06-28 05:23:07,726 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-28 05:23:28,333 INFO [train.py:996] (1/4) Epoch 11, batch 25600, loss[loss=0.2534, simple_loss=0.327, pruned_loss=0.08994, over 21805.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2917, pruned_loss=0.06234, over 4243693.57 frames. 
], batch size: 441, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:23:42,775 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:23:43,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-28 05:24:08,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1983396.0, ans=0.0 2023-06-28 05:24:12,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1983396.0, ans=0.1 2023-06-28 05:24:46,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1983456.0, ans=0.0 2023-06-28 05:25:10,612 INFO [train.py:996] (1/4) Epoch 11, batch 25650, loss[loss=0.1955, simple_loss=0.2623, pruned_loss=0.06431, over 21620.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2927, pruned_loss=0.06483, over 4243704.31 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:25:11,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-28 05:25:39,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-28 05:25:45,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1983636.0, ans=0.125 2023-06-28 05:26:06,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1983756.0, ans=0.125 2023-06-28 05:26:21,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.706e+02 6.784e+02 1.002e+03 1.536e+03 3.689e+03, threshold=2.004e+03, percent-clipped=11.0 2023-06-28 05:26:30,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-28 05:26:34,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0 2023-06-28 05:26:52,928 INFO [train.py:996] (1/4) Epoch 11, batch 25700, loss[loss=0.2392, simple_loss=0.3024, pruned_loss=0.08798, over 21854.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2889, pruned_loss=0.06555, over 4250580.57 frames. ], batch size: 107, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:27:09,114 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-28 05:27:27,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1983936.0, ans=0.1 2023-06-28 05:28:10,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1984056.0, ans=0.125 2023-06-28 05:28:32,206 INFO [train.py:996] (1/4) Epoch 11, batch 25750, loss[loss=0.334, simple_loss=0.3945, pruned_loss=0.1368, over 21371.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2937, pruned_loss=0.06846, over 4258480.87 frames. 
], batch size: 508, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:28:36,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1984176.0, ans=0.125 2023-06-28 05:29:29,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1984296.0, ans=0.125 2023-06-28 05:29:31,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1984296.0, ans=0.2 2023-06-28 05:29:50,525 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 8.292e+02 1.215e+03 2.235e+03 4.745e+03, threshold=2.430e+03, percent-clipped=27.0 2023-06-28 05:29:57,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-28 05:30:23,506 INFO [train.py:996] (1/4) Epoch 11, batch 25800, loss[loss=0.2194, simple_loss=0.2792, pruned_loss=0.0798, over 20147.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3053, pruned_loss=0.07304, over 4266465.28 frames. ], batch size: 707, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:30:43,795 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=22.5 2023-06-28 05:30:50,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-28 05:31:23,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1984596.0, ans=0.02 2023-06-28 05:31:28,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1984656.0, ans=0.125 2023-06-28 05:32:01,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1984716.0, ans=0.0 2023-06-28 05:32:06,388 INFO [train.py:996] (1/4) Epoch 11, batch 25850, loss[loss=0.2252, simple_loss=0.2977, pruned_loss=0.0763, over 21826.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3067, pruned_loss=0.07229, over 4276716.31 frames. ], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:32:08,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-28 05:32:13,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1984776.0, ans=0.0 2023-06-28 05:32:16,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1984776.0, ans=0.125 2023-06-28 05:32:32,349 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. 
limit=15.0 2023-06-28 05:32:40,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1984836.0, ans=0.2 2023-06-28 05:32:43,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1984836.0, ans=0.125 2023-06-28 05:33:16,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1984956.0, ans=0.035 2023-06-28 05:33:18,853 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.240e+02 7.750e+02 1.095e+03 1.413e+03 4.702e+03, threshold=2.190e+03, percent-clipped=3.0 2023-06-28 05:33:20,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=22.5 2023-06-28 05:33:44,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1985076.0, ans=0.0 2023-06-28 05:33:45,971 INFO [train.py:996] (1/4) Epoch 11, batch 25900, loss[loss=0.2613, simple_loss=0.3594, pruned_loss=0.08156, over 21828.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3082, pruned_loss=0.07255, over 4286334.94 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:33:48,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1985076.0, ans=0.1 2023-06-28 05:34:53,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1985256.0, ans=0.125 2023-06-28 05:35:07,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1985316.0, ans=0.0 2023-06-28 05:35:29,666 INFO [train.py:996] (1/4) Epoch 11, batch 25950, loss[loss=0.2739, simple_loss=0.3409, pruned_loss=0.1035, over 21418.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3127, pruned_loss=0.07503, over 4284238.88 frames. ], batch size: 471, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:36:05,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=22.5 2023-06-28 05:36:41,749 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 7.393e+02 8.893e+02 1.407e+03 4.224e+03, threshold=1.779e+03, percent-clipped=8.0 2023-06-28 05:36:42,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1985556.0, ans=0.125 2023-06-28 05:36:44,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1985556.0, ans=0.1 2023-06-28 05:36:52,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1985616.0, ans=0.125 2023-06-28 05:37:14,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1985616.0, ans=0.125 2023-06-28 05:37:17,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1985676.0, ans=0.125 2023-06-28 05:37:18,832 INFO [train.py:996] (1/4) Epoch 11, batch 26000, loss[loss=0.1977, simple_loss=0.3014, pruned_loss=0.04703, over 21798.00 frames. 
], tot_loss[loss=0.2286, simple_loss=0.3113, pruned_loss=0.07296, over 4274504.81 frames. ], batch size: 282, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:37:27,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1985676.0, ans=0.125 2023-06-28 05:38:40,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1985916.0, ans=0.5 2023-06-28 05:38:45,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1985916.0, ans=0.125 2023-06-28 05:39:00,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1985976.0, ans=0.125 2023-06-28 05:39:00,977 INFO [train.py:996] (1/4) Epoch 11, batch 26050, loss[loss=0.2188, simple_loss=0.2916, pruned_loss=0.07305, over 21872.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3116, pruned_loss=0.07438, over 4280913.94 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:39:01,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-28 05:39:42,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1986096.0, ans=0.2 2023-06-28 05:40:03,337 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 6.927e+02 9.303e+02 1.315e+03 2.564e+03, threshold=1.861e+03, percent-clipped=11.0 2023-06-28 05:40:31,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1986216.0, ans=0.0 2023-06-28 05:40:37,547 INFO [train.py:996] (1/4) Epoch 11, batch 26100, loss[loss=0.2203, simple_loss=0.2902, pruned_loss=0.07525, over 21913.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3067, pruned_loss=0.07444, over 4289045.99 frames. ], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:40:59,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1986336.0, ans=0.125 2023-06-28 05:41:35,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1986396.0, ans=0.0 2023-06-28 05:41:43,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1986456.0, ans=0.125 2023-06-28 05:41:54,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-28 05:42:04,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1986516.0, ans=0.125 2023-06-28 05:42:25,864 INFO [train.py:996] (1/4) Epoch 11, batch 26150, loss[loss=0.2132, simple_loss=0.2843, pruned_loss=0.07107, over 17513.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3038, pruned_loss=0.07422, over 4285862.16 frames. 
], batch size: 61, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:42:34,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1986576.0, ans=0.125 2023-06-28 05:43:14,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1986696.0, ans=0.125 2023-06-28 05:43:36,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1986756.0, ans=0.125 2023-06-28 05:43:36,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1986756.0, ans=0.125 2023-06-28 05:43:40,788 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 6.839e+02 9.051e+02 1.314e+03 2.834e+03, threshold=1.810e+03, percent-clipped=6.0 2023-06-28 05:44:04,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1986816.0, ans=0.0 2023-06-28 05:44:10,840 INFO [train.py:996] (1/4) Epoch 11, batch 26200, loss[loss=0.2251, simple_loss=0.3473, pruned_loss=0.05147, over 20868.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3045, pruned_loss=0.07214, over 4288137.35 frames. ], batch size: 608, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:44:44,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1986936.0, ans=0.125 2023-06-28 05:45:49,439 INFO [train.py:996] (1/4) Epoch 11, batch 26250, loss[loss=0.1971, simple_loss=0.2773, pruned_loss=0.05846, over 21486.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3061, pruned_loss=0.07097, over 4285463.33 frames. ], batch size: 194, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:46:04,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1987176.0, ans=10.0 2023-06-28 05:46:09,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1987236.0, ans=0.2 2023-06-28 05:46:57,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1987356.0, ans=0.125 2023-06-28 05:47:01,913 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.966e+02 7.302e+02 1.108e+03 1.607e+03 4.168e+03, threshold=2.217e+03, percent-clipped=19.0 2023-06-28 05:47:09,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-28 05:47:31,742 INFO [train.py:996] (1/4) Epoch 11, batch 26300, loss[loss=0.2179, simple_loss=0.2875, pruned_loss=0.07418, over 21968.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3039, pruned_loss=0.07116, over 4283690.12 frames. ], batch size: 333, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:48:13,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.68 vs. 
limit=12.0 2023-06-28 05:48:27,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1987596.0, ans=0.125 2023-06-28 05:49:00,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1987716.0, ans=0.125 2023-06-28 05:49:02,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1987716.0, ans=0.125 2023-06-28 05:49:19,385 INFO [train.py:996] (1/4) Epoch 11, batch 26350, loss[loss=0.2135, simple_loss=0.2742, pruned_loss=0.07638, over 19945.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3008, pruned_loss=0.0717, over 4282835.46 frames. ], batch size: 702, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:50:11,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-28 05:50:12,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1987896.0, ans=0.2 2023-06-28 05:50:32,387 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.942e+02 8.987e+02 1.115e+03 1.521e+03 3.466e+03, threshold=2.231e+03, percent-clipped=6.0 2023-06-28 05:51:02,164 INFO [train.py:996] (1/4) Epoch 11, batch 26400, loss[loss=0.2119, simple_loss=0.2745, pruned_loss=0.07465, over 21820.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2966, pruned_loss=0.07229, over 4274004.70 frames. ], batch size: 98, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:51:21,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988136.0, ans=0.1 2023-06-28 05:52:01,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1988196.0, ans=0.125 2023-06-28 05:52:11,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1988256.0, ans=0.125 2023-06-28 05:52:28,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1988256.0, ans=0.125 2023-06-28 05:52:33,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1988316.0, ans=0.125 2023-06-28 05:52:38,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1988316.0, ans=0.025 2023-06-28 05:52:48,901 INFO [train.py:996] (1/4) Epoch 11, batch 26450, loss[loss=0.2452, simple_loss=0.3469, pruned_loss=0.07172, over 21744.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.297, pruned_loss=0.07225, over 4261205.11 frames. 
], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:53:04,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1988376.0, ans=0.2 2023-06-28 05:53:04,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1988376.0, ans=0.125 2023-06-28 05:53:34,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1988436.0, ans=0.2 2023-06-28 05:53:55,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988556.0, ans=0.1 2023-06-28 05:53:58,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1988556.0, ans=15.0 2023-06-28 05:54:09,262 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.671e+02 1.024e+03 1.650e+03 2.442e+03 4.564e+03, threshold=3.300e+03, percent-clipped=28.0 2023-06-28 05:54:37,910 INFO [train.py:996] (1/4) Epoch 11, batch 26500, loss[loss=0.1829, simple_loss=0.2273, pruned_loss=0.06921, over 20040.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2995, pruned_loss=0.07111, over 4262454.44 frames. ], batch size: 704, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:55:08,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-28 05:55:48,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1988856.0, ans=0.0 2023-06-28 05:56:17,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1988916.0, ans=15.0 2023-06-28 05:56:28,853 INFO [train.py:996] (1/4) Epoch 11, batch 26550, loss[loss=0.1834, simple_loss=0.2629, pruned_loss=0.05192, over 21617.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2979, pruned_loss=0.06919, over 4261586.84 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:57:06,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1989096.0, ans=0.05 2023-06-28 05:57:36,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-28 05:57:38,607 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 7.811e+02 1.294e+03 2.097e+03 4.356e+03, threshold=2.588e+03, percent-clipped=4.0 2023-06-28 05:57:39,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1989156.0, ans=0.125 2023-06-28 05:58:10,601 INFO [train.py:996] (1/4) Epoch 11, batch 26600, loss[loss=0.188, simple_loss=0.2647, pruned_loss=0.05569, over 21473.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2954, pruned_loss=0.06626, over 4261945.67 frames. 
], batch size: 230, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:58:13,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1989276.0, ans=0.0 2023-06-28 05:58:39,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1989336.0, ans=0.125 2023-06-28 05:59:43,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1989516.0, ans=0.125 2023-06-28 05:59:52,621 INFO [train.py:996] (1/4) Epoch 11, batch 26650, loss[loss=0.1512, simple_loss=0.2401, pruned_loss=0.03118, over 21657.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2889, pruned_loss=0.0652, over 4265571.13 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:00:30,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1989696.0, ans=0.125 2023-06-28 06:01:05,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.789e+02 5.280e+02 6.792e+02 8.539e+02 2.170e+03, threshold=1.358e+03, percent-clipped=0.0 2023-06-28 06:01:22,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1989816.0, ans=0.0 2023-06-28 06:01:27,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1989816.0, ans=0.0 2023-06-28 06:01:29,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1989816.0, ans=0.2 2023-06-28 06:01:33,806 INFO [train.py:996] (1/4) Epoch 11, batch 26700, loss[loss=0.2356, simple_loss=0.3119, pruned_loss=0.07968, over 21859.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2819, pruned_loss=0.06198, over 4276867.62 frames. ], batch size: 107, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:01:34,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1989876.0, ans=0.04949747468305833 2023-06-28 06:02:38,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-28 06:02:45,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-28 06:02:51,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1990056.0, ans=0.0 2023-06-28 06:03:15,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1990116.0, ans=0.04949747468305833 2023-06-28 06:03:18,029 INFO [train.py:996] (1/4) Epoch 11, batch 26750, loss[loss=0.1796, simple_loss=0.2543, pruned_loss=0.05244, over 22017.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2819, pruned_loss=0.06157, over 4283999.77 frames. 
], batch size: 103, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:03:44,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1990236.0, ans=0.125 2023-06-28 06:04:15,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1990296.0, ans=0.0 2023-06-28 06:04:29,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1990356.0, ans=0.0 2023-06-28 06:04:33,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.602e+02 7.194e+02 1.094e+03 1.684e+03 4.507e+03, threshold=2.188e+03, percent-clipped=37.0 2023-06-28 06:05:02,066 INFO [train.py:996] (1/4) Epoch 11, batch 26800, loss[loss=0.2763, simple_loss=0.3383, pruned_loss=0.1072, over 21308.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2892, pruned_loss=0.06562, over 4280551.39 frames. ], batch size: 507, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 06:05:48,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=22.5 2023-06-28 06:05:57,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1990596.0, ans=0.125 2023-06-28 06:06:05,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1990656.0, ans=0.125 2023-06-28 06:06:38,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1990716.0, ans=0.125 2023-06-28 06:06:43,231 INFO [train.py:996] (1/4) Epoch 11, batch 26850, loss[loss=0.2284, simple_loss=0.2774, pruned_loss=0.08966, over 21395.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2908, pruned_loss=0.06799, over 4275058.55 frames. ], batch size: 473, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:06:53,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1990776.0, ans=0.035 2023-06-28 06:06:57,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=22.5 2023-06-28 06:07:54,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-28 06:08:02,684 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.985e+02 1.116e+03 1.630e+03 3.577e+03, threshold=2.232e+03, percent-clipped=14.0 2023-06-28 06:08:24,681 INFO [train.py:996] (1/4) Epoch 11, batch 26900, loss[loss=0.1981, simple_loss=0.2583, pruned_loss=0.06889, over 21132.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2823, pruned_loss=0.06648, over 4257052.18 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:09:22,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1991196.0, ans=0.125 2023-06-28 06:09:29,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=25.63 vs. 
limit=22.5 2023-06-28 06:09:53,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1991316.0, ans=0.125 2023-06-28 06:10:05,641 INFO [train.py:996] (1/4) Epoch 11, batch 26950, loss[loss=0.2172, simple_loss=0.3093, pruned_loss=0.0626, over 21712.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2802, pruned_loss=0.06556, over 4258345.77 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:10:34,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1991436.0, ans=0.0 2023-06-28 06:10:55,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-28 06:11:08,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1991556.0, ans=0.2 2023-06-28 06:11:27,131 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 6.578e+02 9.395e+02 1.272e+03 2.979e+03, threshold=1.879e+03, percent-clipped=1.0 2023-06-28 06:11:41,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1991616.0, ans=0.1 2023-06-28 06:11:47,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1991676.0, ans=0.125 2023-06-28 06:11:47,976 INFO [train.py:996] (1/4) Epoch 11, batch 27000, loss[loss=0.192, simple_loss=0.298, pruned_loss=0.04296, over 20757.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2817, pruned_loss=0.06366, over 4266776.65 frames. ], batch size: 608, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:11:47,976 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 06:12:09,372 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.246, simple_loss=0.3377, pruned_loss=0.07718, over 1796401.00 frames. 2023-06-28 06:12:09,373 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 06:12:20,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-28 06:12:29,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-28 06:13:00,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1991796.0, ans=0.02 2023-06-28 06:13:19,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-28 06:13:57,805 INFO [train.py:996] (1/4) Epoch 11, batch 27050, loss[loss=0.2123, simple_loss=0.2914, pruned_loss=0.06658, over 21462.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2835, pruned_loss=0.06131, over 4262612.00 frames. 
], batch size: 548, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:14:00,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1991976.0, ans=0.0 2023-06-28 06:14:15,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1992036.0, ans=0.125 2023-06-28 06:14:45,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-06-28 06:14:46,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1992096.0, ans=0.0 2023-06-28 06:14:49,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-28 06:15:07,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.395e+02 5.810e+02 8.142e+02 1.096e+03 2.681e+03, threshold=1.628e+03, percent-clipped=6.0 2023-06-28 06:15:10,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=22.5 2023-06-28 06:15:37,370 INFO [train.py:996] (1/4) Epoch 11, batch 27100, loss[loss=0.2156, simple_loss=0.3001, pruned_loss=0.0656, over 21193.00 frames. ], tot_loss[loss=0.205, simple_loss=0.285, pruned_loss=0.0625, over 4269194.67 frames. ], batch size: 143, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:15:50,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1992276.0, ans=0.1 2023-06-28 06:16:03,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1992336.0, ans=0.125 2023-06-28 06:16:08,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1992336.0, ans=0.125 2023-06-28 06:16:16,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1992336.0, ans=0.125 2023-06-28 06:16:44,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1992456.0, ans=0.125 2023-06-28 06:17:22,568 INFO [train.py:996] (1/4) Epoch 11, batch 27150, loss[loss=0.2399, simple_loss=0.3358, pruned_loss=0.07196, over 21729.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2976, pruned_loss=0.06573, over 4267457.29 frames. 
], batch size: 351, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:17:23,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1992576.0, ans=0.0 2023-06-28 06:17:24,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1992576.0, ans=0.125 2023-06-28 06:17:54,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1992636.0, ans=0.0 2023-06-28 06:18:22,781 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:18:35,177 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.727e+02 8.827e+02 1.451e+03 2.136e+03 4.044e+03, threshold=2.902e+03, percent-clipped=43.0 2023-06-28 06:18:39,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1992816.0, ans=0.125 2023-06-28 06:18:57,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1992816.0, ans=0.125 2023-06-28 06:18:59,849 INFO [train.py:996] (1/4) Epoch 11, batch 27200, loss[loss=0.2479, simple_loss=0.325, pruned_loss=0.08537, over 21379.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3045, pruned_loss=0.06783, over 4275630.44 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:19:02,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1992876.0, ans=0.2 2023-06-28 06:19:07,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1992876.0, ans=0.125 2023-06-28 06:19:07,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1992876.0, ans=0.1 2023-06-28 06:19:27,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1992936.0, ans=0.0 2023-06-28 06:20:19,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1993056.0, ans=0.2 2023-06-28 06:20:24,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1993056.0, ans=0.125 2023-06-28 06:20:34,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1993116.0, ans=0.1 2023-06-28 06:20:47,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.62 vs. limit=12.0 2023-06-28 06:20:49,575 INFO [train.py:996] (1/4) Epoch 11, batch 27250, loss[loss=0.2473, simple_loss=0.3254, pruned_loss=0.08463, over 21770.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3077, pruned_loss=0.07103, over 4278392.65 frames. ], batch size: 118, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:20:55,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.10 vs. limit=22.5 2023-06-28 06:21:39,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. 
limit=22.5 2023-06-28 06:22:14,913 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.213e+02 7.322e+02 9.525e+02 1.331e+03 3.028e+03, threshold=1.905e+03, percent-clipped=1.0 2023-06-28 06:22:30,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-28 06:22:35,623 INFO [train.py:996] (1/4) Epoch 11, batch 27300, loss[loss=0.22, simple_loss=0.3235, pruned_loss=0.0582, over 21749.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3091, pruned_loss=0.07192, over 4276284.56 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:22:43,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1993476.0, ans=0.0 2023-06-28 06:23:06,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1993536.0, ans=0.0 2023-06-28 06:24:03,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1993716.0, ans=0.035 2023-06-28 06:24:12,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1993716.0, ans=0.125 2023-06-28 06:24:20,016 INFO [train.py:996] (1/4) Epoch 11, batch 27350, loss[loss=0.2665, simple_loss=0.3341, pruned_loss=0.09946, over 21544.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3125, pruned_loss=0.07346, over 4277362.76 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:24:59,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1993836.0, ans=0.125 2023-06-28 06:25:41,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.836e+02 8.634e+02 1.277e+03 1.695e+03 3.535e+03, threshold=2.554e+03, percent-clipped=18.0 2023-06-28 06:26:01,626 INFO [train.py:996] (1/4) Epoch 11, batch 27400, loss[loss=0.2219, simple_loss=0.2878, pruned_loss=0.07796, over 21698.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3082, pruned_loss=0.07255, over 4281021.59 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:27:07,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994196.0, ans=0.1 2023-06-28 06:27:16,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-28 06:27:44,105 INFO [train.py:996] (1/4) Epoch 11, batch 27450, loss[loss=0.2745, simple_loss=0.3424, pruned_loss=0.1033, over 21363.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3015, pruned_loss=0.07084, over 4283811.90 frames. 
], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:28:27,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1994436.0, ans=0.125 2023-06-28 06:28:44,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994496.0, ans=0.1 2023-06-28 06:28:59,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1994556.0, ans=0.125 2023-06-28 06:29:05,150 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.109e+02 6.770e+02 1.005e+03 1.545e+03 3.220e+03, threshold=2.009e+03, percent-clipped=5.0 2023-06-28 06:29:25,920 INFO [train.py:996] (1/4) Epoch 11, batch 27500, loss[loss=0.1908, simple_loss=0.263, pruned_loss=0.05927, over 21836.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3003, pruned_loss=0.0708, over 4287398.80 frames. ], batch size: 282, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:30:05,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1994736.0, ans=0.0 2023-06-28 06:30:54,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-28 06:30:57,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1994916.0, ans=0.125 2023-06-28 06:31:16,353 INFO [train.py:996] (1/4) Epoch 11, batch 27550, loss[loss=0.186, simple_loss=0.2637, pruned_loss=0.05411, over 21704.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2961, pruned_loss=0.0689, over 4284942.97 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:31:20,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1994976.0, ans=0.125 2023-06-28 06:31:34,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1994976.0, ans=0.125 2023-06-28 06:31:43,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1995036.0, ans=0.0 2023-06-28 06:31:51,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1995036.0, ans=0.5 2023-06-28 06:32:02,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1995096.0, ans=0.125 2023-06-28 06:32:09,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1995096.0, ans=0.125 2023-06-28 06:32:28,476 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.464e+02 6.626e+02 9.640e+02 1.426e+03 2.852e+03, threshold=1.928e+03, percent-clipped=10.0 2023-06-28 06:32:28,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1995156.0, ans=0.125 2023-06-28 06:32:32,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1995216.0, ans=0.125 2023-06-28 06:32:53,136 INFO [train.py:996] (1/4) Epoch 11, batch 27600, loss[loss=0.2418, simple_loss=0.3268, pruned_loss=0.07845, over 19892.00 frames. 
], tot_loss[loss=0.2127, simple_loss=0.2904, pruned_loss=0.06755, over 4276714.85 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 06:34:30,773 INFO [train.py:996] (1/4) Epoch 11, batch 27650, loss[loss=0.1795, simple_loss=0.2476, pruned_loss=0.05568, over 21556.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2846, pruned_loss=0.06713, over 4274234.24 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 06:35:33,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1995696.0, ans=0.125 2023-06-28 06:35:40,611 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-28 06:35:54,454 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.549e+02 8.580e+02 1.335e+03 1.838e+03 2.881e+03, threshold=2.670e+03, percent-clipped=20.0 2023-06-28 06:36:17,808 INFO [train.py:996] (1/4) Epoch 11, batch 27700, loss[loss=0.1828, simple_loss=0.2704, pruned_loss=0.04753, over 21727.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.285, pruned_loss=0.06566, over 4278503.31 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:37:01,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1995996.0, ans=0.0 2023-06-28 06:38:04,199 INFO [train.py:996] (1/4) Epoch 11, batch 27750, loss[loss=0.1868, simple_loss=0.2665, pruned_loss=0.05351, over 21268.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2888, pruned_loss=0.06614, over 4277694.96 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:38:23,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1996236.0, ans=0.0 2023-06-28 06:38:38,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1996236.0, ans=0.125 2023-06-28 06:38:54,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1996296.0, ans=0.0 2023-06-28 06:39:07,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1996356.0, ans=0.1 2023-06-28 06:39:18,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.799e+02 7.463e+02 1.002e+03 1.388e+03 2.774e+03, threshold=2.003e+03, percent-clipped=1.0 2023-06-28 06:39:29,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.09 vs. limit=15.0 2023-06-28 06:39:39,536 INFO [train.py:996] (1/4) Epoch 11, batch 27800, loss[loss=0.2231, simple_loss=0.2952, pruned_loss=0.07547, over 21441.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2876, pruned_loss=0.0667, over 4285511.76 frames. ], batch size: 144, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:39:41,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.77 vs. 
limit=15.0 2023-06-28 06:40:08,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1996536.0, ans=0.125 2023-06-28 06:40:10,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1996536.0, ans=0.2 2023-06-28 06:40:52,634 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-28 06:41:06,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1996716.0, ans=0.1 2023-06-28 06:41:26,062 INFO [train.py:996] (1/4) Epoch 11, batch 27850, loss[loss=0.1877, simple_loss=0.2704, pruned_loss=0.05251, over 21852.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.287, pruned_loss=0.06732, over 4294493.56 frames. ], batch size: 282, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:41:26,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1996776.0, ans=0.2 2023-06-28 06:41:46,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=22.5 2023-06-28 06:42:13,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-28 06:42:18,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1996896.0, ans=0.025 2023-06-28 06:42:26,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1996956.0, ans=0.0 2023-06-28 06:42:49,901 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.063e+02 7.124e+02 9.695e+02 1.441e+03 2.660e+03, threshold=1.939e+03, percent-clipped=8.0 2023-06-28 06:43:00,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1997016.0, ans=0.0 2023-06-28 06:43:16,102 INFO [train.py:996] (1/4) Epoch 11, batch 27900, loss[loss=0.2426, simple_loss=0.3319, pruned_loss=0.07665, over 21747.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2965, pruned_loss=0.06837, over 4292378.78 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:43:20,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-28 06:43:45,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1997136.0, ans=0.125 2023-06-28 06:44:15,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-28 06:44:16,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1997256.0, ans=0.1 2023-06-28 06:44:57,019 INFO [train.py:996] (1/4) Epoch 11, batch 27950, loss[loss=0.2042, simple_loss=0.3132, pruned_loss=0.04765, over 21213.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2979, pruned_loss=0.0656, over 4290236.37 frames. 
], batch size: 549, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:45:21,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1997436.0, ans=0.0 2023-06-28 06:45:26,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1997436.0, ans=0.1 2023-06-28 06:46:23,001 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.171e+02 6.116e+02 8.584e+02 1.262e+03 3.314e+03, threshold=1.717e+03, percent-clipped=6.0 2023-06-28 06:46:39,425 INFO [train.py:996] (1/4) Epoch 11, batch 28000, loss[loss=0.1877, simple_loss=0.2837, pruned_loss=0.0458, over 21410.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2953, pruned_loss=0.063, over 4286126.56 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:46:58,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1997676.0, ans=0.2 2023-06-28 06:46:58,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1997676.0, ans=0.125 2023-06-28 06:47:27,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1997796.0, ans=0.0 2023-06-28 06:48:14,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1997916.0, ans=0.2 2023-06-28 06:48:23,192 INFO [train.py:996] (1/4) Epoch 11, batch 28050, loss[loss=0.2095, simple_loss=0.3103, pruned_loss=0.05437, over 20816.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2929, pruned_loss=0.06443, over 4289070.07 frames. ], batch size: 608, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:48:54,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1998036.0, ans=0.0 2023-06-28 06:49:50,470 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.059e+02 1.070e+03 1.534e+03 3.837e+03, threshold=2.141e+03, percent-clipped=19.0 2023-06-28 06:50:05,478 INFO [train.py:996] (1/4) Epoch 11, batch 28100, loss[loss=0.1768, simple_loss=0.2395, pruned_loss=0.0571, over 21578.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2881, pruned_loss=0.06424, over 4281749.34 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:50:20,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1998336.0, ans=0.2 2023-06-28 06:50:31,280 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.85 vs. limit=10.0 2023-06-28 06:50:35,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1998336.0, ans=0.125 2023-06-28 06:50:40,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1998396.0, ans=0.1 2023-06-28 06:51:03,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1998456.0, ans=0.0 2023-06-28 06:51:42,264 INFO [train.py:996] (1/4) Epoch 11, batch 28150, loss[loss=0.1874, simple_loss=0.254, pruned_loss=0.06038, over 21170.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2825, pruned_loss=0.06456, over 4275473.12 frames. 
], batch size: 159, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:52:47,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1998756.0, ans=0.1 2023-06-28 06:53:04,713 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.152e+02 7.382e+02 1.011e+03 1.548e+03 3.347e+03, threshold=2.022e+03, percent-clipped=11.0 2023-06-28 06:53:19,836 INFO [train.py:996] (1/4) Epoch 11, batch 28200, loss[loss=0.2251, simple_loss=0.2954, pruned_loss=0.07738, over 21526.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2813, pruned_loss=0.06551, over 4270939.88 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:54:38,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1999056.0, ans=0.125 2023-06-28 06:54:58,086 INFO [train.py:996] (1/4) Epoch 11, batch 28250, loss[loss=0.2107, simple_loss=0.2774, pruned_loss=0.07202, over 16403.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2837, pruned_loss=0.06751, over 4260586.51 frames. ], batch size: 60, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:55:35,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1999236.0, ans=0.1 2023-06-28 06:55:47,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1999296.0, ans=0.0 2023-06-28 06:56:21,466 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.590e+02 1.013e+03 1.851e+03 3.926e+03, threshold=2.026e+03, percent-clipped=15.0 2023-06-28 06:56:22,122 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:56:37,218 INFO [train.py:996] (1/4) Epoch 11, batch 28300, loss[loss=0.171, simple_loss=0.2638, pruned_loss=0.03912, over 21730.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2811, pruned_loss=0.06522, over 4269279.56 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:56:46,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1999476.0, ans=0.0 2023-06-28 06:56:56,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0 2023-06-28 06:57:18,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1999596.0, ans=0.125 2023-06-28 06:58:15,200 INFO [train.py:996] (1/4) Epoch 11, batch 28350, loss[loss=0.2003, simple_loss=0.2671, pruned_loss=0.0668, over 21479.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2787, pruned_loss=0.06058, over 4273657.92 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:58:24,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.89 vs. 
limit=15.0 2023-06-28 06:58:47,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1999836.0, ans=0.125 2023-06-28 06:59:37,855 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 8.082e+02 1.144e+03 1.595e+03 4.896e+03, threshold=2.288e+03, percent-clipped=16.0 2023-06-28 06:59:57,606 INFO [train.py:996] (1/4) Epoch 11, batch 28400, loss[loss=0.1939, simple_loss=0.2715, pruned_loss=0.05817, over 21656.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.277, pruned_loss=0.05976, over 4266121.10 frames. ], batch size: 298, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:00:24,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2000136.0, ans=0.0 2023-06-28 07:00:26,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2000136.0, ans=0.0 2023-06-28 07:01:12,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2000256.0, ans=0.125 2023-06-28 07:01:28,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2000316.0, ans=0.125 2023-06-28 07:01:40,572 INFO [train.py:996] (1/4) Epoch 11, batch 28450, loss[loss=0.2736, simple_loss=0.3314, pruned_loss=0.1079, over 21542.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2812, pruned_loss=0.06288, over 4259023.18 frames. ], batch size: 507, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:02:29,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2000496.0, ans=0.0 2023-06-28 07:03:03,711 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 8.706e+02 1.350e+03 2.003e+03 3.584e+03, threshold=2.700e+03, percent-clipped=15.0 2023-06-28 07:03:28,165 INFO [train.py:996] (1/4) Epoch 11, batch 28500, loss[loss=0.2251, simple_loss=0.3106, pruned_loss=0.06982, over 21780.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2835, pruned_loss=0.06561, over 4271583.95 frames. ], batch size: 124, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:03:30,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-28 07:03:54,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2000736.0, ans=0.1 2023-06-28 07:03:55,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2000736.0, ans=0.2 2023-06-28 07:04:46,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-28 07:05:11,633 INFO [train.py:996] (1/4) Epoch 11, batch 28550, loss[loss=0.1926, simple_loss=0.2945, pruned_loss=0.04533, over 20016.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2919, pruned_loss=0.06863, over 4270820.66 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:05:14,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. 
limit=15.0 2023-06-28 07:05:25,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2000976.0, ans=0.0 2023-06-28 07:06:36,995 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:06:41,150 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.280e+02 7.153e+02 1.164e+03 1.649e+03 3.101e+03, threshold=2.329e+03, percent-clipped=2.0 2023-06-28 07:06:59,369 INFO [train.py:996] (1/4) Epoch 11, batch 28600, loss[loss=0.2082, simple_loss=0.2717, pruned_loss=0.07238, over 20060.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2984, pruned_loss=0.07103, over 4274214.17 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:07:22,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2001336.0, ans=0.09899494936611666 2023-06-28 07:07:28,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=15.0 2023-06-28 07:07:49,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2001396.0, ans=0.0 2023-06-28 07:08:09,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-28 07:08:17,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2001456.0, ans=0.125 2023-06-28 07:08:33,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2001516.0, ans=0.0 2023-06-28 07:08:41,518 INFO [train.py:996] (1/4) Epoch 11, batch 28650, loss[loss=0.2308, simple_loss=0.279, pruned_loss=0.0913, over 21228.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2928, pruned_loss=0.07073, over 4271657.40 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:09:14,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2001696.0, ans=0.0 2023-06-28 07:09:16,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2001696.0, ans=0.1 2023-06-28 07:09:39,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2001696.0, ans=0.0 2023-06-28 07:10:06,741 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.860e+02 6.883e+02 1.055e+03 1.706e+03 3.634e+03, threshold=2.110e+03, percent-clipped=9.0 2023-06-28 07:10:19,941 INFO [train.py:996] (1/4) Epoch 11, batch 28700, loss[loss=0.2362, simple_loss=0.3046, pruned_loss=0.08389, over 21357.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2903, pruned_loss=0.07093, over 4275182.24 frames. ], batch size: 549, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:10:32,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2001876.0, ans=0.0 2023-06-28 07:10:34,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. 
limit=15.0 2023-06-28 07:10:40,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2001936.0, ans=0.07 2023-06-28 07:11:33,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2002056.0, ans=0.125 2023-06-28 07:12:03,058 INFO [train.py:996] (1/4) Epoch 11, batch 28750, loss[loss=0.2014, simple_loss=0.2908, pruned_loss=0.056, over 21633.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2902, pruned_loss=0.07139, over 4283269.08 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:12:11,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-28 07:13:00,409 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-28 07:13:17,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2002356.0, ans=0.0 2023-06-28 07:13:33,315 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.047e+02 7.903e+02 1.280e+03 1.957e+03 3.313e+03, threshold=2.559e+03, percent-clipped=20.0 2023-06-28 07:13:46,641 INFO [train.py:996] (1/4) Epoch 11, batch 28800, loss[loss=0.2333, simple_loss=0.3088, pruned_loss=0.07895, over 21614.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2922, pruned_loss=0.07059, over 4286188.13 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:14:24,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2002536.0, ans=0.0 2023-06-28 07:14:33,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-28 07:14:34,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2002596.0, ans=15.0 2023-06-28 07:14:37,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2002596.0, ans=0.1 2023-06-28 07:15:04,804 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-28 07:15:28,355 INFO [train.py:996] (1/4) Epoch 11, batch 28850, loss[loss=0.1968, simple_loss=0.2661, pruned_loss=0.06378, over 20962.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2944, pruned_loss=0.07226, over 4284295.78 frames. ], batch size: 607, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:16:22,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2002896.0, ans=0.0 2023-06-28 07:16:58,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. 
limit=10.0 2023-06-28 07:16:58,888 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.940e+02 7.188e+02 1.079e+03 1.563e+03 3.306e+03, threshold=2.159e+03, percent-clipped=4.0 2023-06-28 07:17:09,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2003016.0, ans=0.125 2023-06-28 07:17:12,879 INFO [train.py:996] (1/4) Epoch 11, batch 28900, loss[loss=0.2057, simple_loss=0.2897, pruned_loss=0.06087, over 21763.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.298, pruned_loss=0.07421, over 4283619.22 frames. ], batch size: 298, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:17:36,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2003076.0, ans=0.0 2023-06-28 07:18:07,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2003196.0, ans=0.125 2023-06-28 07:19:06,028 INFO [train.py:996] (1/4) Epoch 11, batch 28950, loss[loss=0.2103, simple_loss=0.2966, pruned_loss=0.06198, over 21851.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2993, pruned_loss=0.07343, over 4277418.88 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:19:13,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2003376.0, ans=0.125 2023-06-28 07:19:18,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2003376.0, ans=0.05 2023-06-28 07:19:58,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.78 vs. limit=6.0 2023-06-28 07:20:14,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2003556.0, ans=0.0 2023-06-28 07:20:23,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2003556.0, ans=0.05 2023-06-28 07:20:36,312 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 7.501e+02 1.038e+03 1.526e+03 3.753e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-28 07:20:50,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2003616.0, ans=0.2 2023-06-28 07:20:54,725 INFO [train.py:996] (1/4) Epoch 11, batch 29000, loss[loss=0.2273, simple_loss=0.3123, pruned_loss=0.07116, over 21788.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3014, pruned_loss=0.07225, over 4270786.59 frames. 
], batch size: 118, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:21:00,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2003676.0, ans=0.0 2023-06-28 07:21:11,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2003676.0, ans=0.0 2023-06-28 07:21:11,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2003676.0, ans=0.0 2023-06-28 07:21:23,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2003736.0, ans=0.125 2023-06-28 07:21:31,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2003796.0, ans=0.0 2023-06-28 07:21:53,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2003856.0, ans=0.0 2023-06-28 07:22:04,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2003856.0, ans=0.125 2023-06-28 07:22:35,931 INFO [train.py:996] (1/4) Epoch 11, batch 29050, loss[loss=0.1806, simple_loss=0.2466, pruned_loss=0.05727, over 21168.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3006, pruned_loss=0.07296, over 4277445.23 frames. ], batch size: 608, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:22:53,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-28 07:23:08,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-28 07:24:04,795 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 7.722e+02 1.077e+03 1.560e+03 2.970e+03, threshold=2.155e+03, percent-clipped=7.0 2023-06-28 07:24:18,297 INFO [train.py:996] (1/4) Epoch 11, batch 29100, loss[loss=0.1912, simple_loss=0.2514, pruned_loss=0.06552, over 21487.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2919, pruned_loss=0.07032, over 4284022.67 frames. ], batch size: 476, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:25:10,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2004396.0, ans=0.035 2023-06-28 07:25:59,495 INFO [train.py:996] (1/4) Epoch 11, batch 29150, loss[loss=0.1913, simple_loss=0.2687, pruned_loss=0.05693, over 21209.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2917, pruned_loss=0.06933, over 4275320.29 frames. 
], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:26:08,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2004576.0, ans=0.125 2023-06-28 07:26:37,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2004696.0, ans=0.125 2023-06-28 07:26:45,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2004696.0, ans=0.0 2023-06-28 07:26:47,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2004696.0, ans=0.125 2023-06-28 07:26:53,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-28 07:27:17,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2004756.0, ans=0.125 2023-06-28 07:27:24,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2004816.0, ans=0.125 2023-06-28 07:27:26,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.585e+02 7.263e+02 1.061e+03 1.748e+03 3.304e+03, threshold=2.122e+03, percent-clipped=12.0 2023-06-28 07:27:38,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2004876.0, ans=0.1 2023-06-28 07:27:39,640 INFO [train.py:996] (1/4) Epoch 11, batch 29200, loss[loss=0.1874, simple_loss=0.2555, pruned_loss=0.05965, over 21386.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2875, pruned_loss=0.06864, over 4277067.93 frames. ], batch size: 194, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 07:29:26,302 INFO [train.py:996] (1/4) Epoch 11, batch 29250, loss[loss=0.18, simple_loss=0.2577, pruned_loss=0.0511, over 21723.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2873, pruned_loss=0.06675, over 4270453.43 frames. ], batch size: 112, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:30:34,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2005356.0, ans=0.0 2023-06-28 07:30:46,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2005416.0, ans=0.0 2023-06-28 07:30:46,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2005416.0, ans=0.125 2023-06-28 07:30:52,283 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.090e+02 7.920e+02 1.206e+03 1.772e+03 3.423e+03, threshold=2.413e+03, percent-clipped=14.0 2023-06-28 07:31:01,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-06-28 07:31:08,129 INFO [train.py:996] (1/4) Epoch 11, batch 29300, loss[loss=0.2483, simple_loss=0.3088, pruned_loss=0.0939, over 21259.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2889, pruned_loss=0.06595, over 4265470.59 frames. 
], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:31:53,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2005596.0, ans=0.125 2023-06-28 07:32:01,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2005656.0, ans=0.0 2023-06-28 07:32:21,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2005656.0, ans=0.125 2023-06-28 07:32:34,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2005716.0, ans=0.125 2023-06-28 07:32:46,276 INFO [train.py:996] (1/4) Epoch 11, batch 29350, loss[loss=0.2043, simple_loss=0.3052, pruned_loss=0.05168, over 21716.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2839, pruned_loss=0.06504, over 4255817.95 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:33:16,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0 2023-06-28 07:33:27,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2005896.0, ans=0.125 2023-06-28 07:33:50,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2005956.0, ans=0.1 2023-06-28 07:33:52,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2005956.0, ans=0.125 2023-06-28 07:34:18,452 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 6.284e+02 9.416e+02 1.465e+03 2.688e+03, threshold=1.883e+03, percent-clipped=1.0 2023-06-28 07:34:30,002 INFO [train.py:996] (1/4) Epoch 11, batch 29400, loss[loss=0.2199, simple_loss=0.3058, pruned_loss=0.06693, over 21430.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.284, pruned_loss=0.06365, over 4261311.16 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:34:52,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-28 07:35:01,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2006136.0, ans=0.0 2023-06-28 07:35:03,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2006136.0, ans=0.0 2023-06-28 07:35:11,580 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:35:11,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2006196.0, ans=0.125 2023-06-28 07:35:58,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2006316.0, ans=0.125 2023-06-28 07:36:12,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2006376.0, ans=0.125 2023-06-28 07:36:13,446 INFO [train.py:996] (1/4) Epoch 11, batch 29450, loss[loss=0.2139, simple_loss=0.2905, pruned_loss=0.06867, over 21688.00 frames. 
], tot_loss[loss=0.2049, simple_loss=0.2825, pruned_loss=0.06369, over 4266382.58 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:37:08,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2006496.0, ans=0.2 2023-06-28 07:37:17,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2006496.0, ans=0.125 2023-06-28 07:37:32,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2006556.0, ans=0.125 2023-06-28 07:37:43,663 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.711e+02 7.454e+02 1.204e+03 1.829e+03 3.653e+03, threshold=2.407e+03, percent-clipped=22.0 2023-06-28 07:37:54,986 INFO [train.py:996] (1/4) Epoch 11, batch 29500, loss[loss=0.2046, simple_loss=0.2805, pruned_loss=0.06434, over 21865.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2864, pruned_loss=0.06629, over 4265614.93 frames. ], batch size: 124, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:37:58,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2006676.0, ans=0.1 2023-06-28 07:38:13,782 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:38:20,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2006736.0, ans=0.125 2023-06-28 07:38:41,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2006796.0, ans=0.125 2023-06-28 07:38:42,971 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:38:50,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2006796.0, ans=0.1 2023-06-28 07:39:01,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2006856.0, ans=0.125 2023-06-28 07:39:05,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=2006856.0, ans=0.025 2023-06-28 07:39:10,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2006856.0, ans=0.07 2023-06-28 07:39:13,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2006856.0, ans=0.125 2023-06-28 07:39:36,796 INFO [train.py:996] (1/4) Epoch 11, batch 29550, loss[loss=0.2061, simple_loss=0.279, pruned_loss=0.06664, over 21332.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2854, pruned_loss=0.06745, over 4274210.51 frames. ], batch size: 176, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:40:23,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2007096.0, ans=0.125 2023-06-28 07:40:27,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. 
limit=6.0 2023-06-28 07:40:32,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2007096.0, ans=0.125 2023-06-28 07:40:54,913 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-28 07:41:08,576 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.953e+02 8.227e+02 1.182e+03 1.842e+03 6.634e+03, threshold=2.364e+03, percent-clipped=14.0 2023-06-28 07:41:19,885 INFO [train.py:996] (1/4) Epoch 11, batch 29600, loss[loss=0.2447, simple_loss=0.3293, pruned_loss=0.08004, over 21444.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2917, pruned_loss=0.06963, over 4278261.53 frames. ], batch size: 194, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:41:20,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2007276.0, ans=0.125 2023-06-28 07:42:28,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.08 vs. limit=22.5 2023-06-28 07:42:42,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2007516.0, ans=0.0 2023-06-28 07:42:47,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2007516.0, ans=0.2 2023-06-28 07:42:56,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-28 07:42:57,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.78 vs. limit=22.5 2023-06-28 07:42:57,475 INFO [train.py:996] (1/4) Epoch 11, batch 29650, loss[loss=0.1737, simple_loss=0.2512, pruned_loss=0.0481, over 21573.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2913, pruned_loss=0.06718, over 4274654.83 frames. ], batch size: 212, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:44:23,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2007816.0, ans=0.0 2023-06-28 07:44:26,234 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.283e+02 7.090e+02 1.074e+03 1.668e+03 4.986e+03, threshold=2.147e+03, percent-clipped=16.0 2023-06-28 07:44:32,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2007816.0, ans=0.125 2023-06-28 07:44:40,973 INFO [train.py:996] (1/4) Epoch 11, batch 29700, loss[loss=0.2224, simple_loss=0.3227, pruned_loss=0.0611, over 21423.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2924, pruned_loss=0.06705, over 4284522.11 frames. ], batch size: 211, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:45:12,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2007936.0, ans=0.0 2023-06-28 07:46:13,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2008116.0, ans=0.1 2023-06-28 07:46:22,820 INFO [train.py:996] (1/4) Epoch 11, batch 29750, loss[loss=0.1786, simple_loss=0.2665, pruned_loss=0.04534, over 16241.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2969, pruned_loss=0.0665, over 4271965.28 frames. 
], batch size: 60, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:46:41,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2008176.0, ans=0.0 2023-06-28 07:47:01,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2008236.0, ans=0.05 2023-06-28 07:47:26,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2008356.0, ans=0.0 2023-06-28 07:47:29,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2008356.0, ans=0.1 2023-06-28 07:47:38,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2008356.0, ans=0.0 2023-06-28 07:47:49,318 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 7.116e+02 1.082e+03 1.518e+03 2.580e+03, threshold=2.164e+03, percent-clipped=5.0 2023-06-28 07:48:07,924 INFO [train.py:996] (1/4) Epoch 11, batch 29800, loss[loss=0.1979, simple_loss=0.2824, pruned_loss=0.05673, over 21890.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2981, pruned_loss=0.06746, over 4280376.75 frames. ], batch size: 332, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:48:20,726 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=22.5 2023-06-28 07:48:28,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2008536.0, ans=0.125 2023-06-28 07:48:29,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2008536.0, ans=0.0 2023-06-28 07:48:37,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-28 07:49:20,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2008656.0, ans=0.125 2023-06-28 07:49:34,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2008716.0, ans=0.05 2023-06-28 07:49:43,439 INFO [train.py:996] (1/4) Epoch 11, batch 29850, loss[loss=0.1699, simple_loss=0.2551, pruned_loss=0.04231, over 20949.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2935, pruned_loss=0.0653, over 4280527.55 frames. ], batch size: 608, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:51:06,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2009016.0, ans=0.125 2023-06-28 07:51:09,754 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.656e+02 6.362e+02 8.665e+02 1.424e+03 2.891e+03, threshold=1.733e+03, percent-clipped=5.0 2023-06-28 07:51:23,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2009076.0, ans=0.0 2023-06-28 07:51:29,358 INFO [train.py:996] (1/4) Epoch 11, batch 29900, loss[loss=0.2424, simple_loss=0.3153, pruned_loss=0.08473, over 21647.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2927, pruned_loss=0.06663, over 4285250.24 frames. 
], batch size: 389, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:51:57,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2009136.0, ans=0.125 2023-06-28 07:52:06,648 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-28 07:52:14,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2009196.0, ans=0.0 2023-06-28 07:52:19,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2009196.0, ans=0.125 2023-06-28 07:52:26,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2009196.0, ans=0.2 2023-06-28 07:52:32,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2009256.0, ans=0.0 2023-06-28 07:52:34,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2009256.0, ans=0.025 2023-06-28 07:52:56,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-28 07:53:11,885 INFO [train.py:996] (1/4) Epoch 11, batch 29950, loss[loss=0.2303, simple_loss=0.3015, pruned_loss=0.07957, over 21536.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2974, pruned_loss=0.07065, over 4290165.19 frames. ], batch size: 211, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:53:30,847 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-28 07:54:06,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2009496.0, ans=0.04949747468305833 2023-06-28 07:54:50,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.989e+02 7.621e+02 1.240e+03 1.706e+03 3.587e+03, threshold=2.479e+03, percent-clipped=22.0 2023-06-28 07:55:04,741 INFO [train.py:996] (1/4) Epoch 11, batch 30000, loss[loss=0.2114, simple_loss=0.3098, pruned_loss=0.05649, over 21686.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2978, pruned_loss=0.06984, over 4289126.14 frames. ], batch size: 414, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:55:04,742 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 07:55:17,909 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6551, 2.3544, 3.7027, 2.3683], device='cuda:1') 2023-06-28 07:55:21,704 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2519, simple_loss=0.3444, pruned_loss=0.07975, over 1796401.00 frames. 2023-06-28 07:55:21,705 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 07:55:58,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2009796.0, ans=0.2 2023-06-28 07:57:10,907 INFO [train.py:996] (1/4) Epoch 11, batch 30050, loss[loss=0.1737, simple_loss=0.2932, pruned_loss=0.02711, over 19807.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3016, pruned_loss=0.068, over 4274734.74 frames. 
], batch size: 702, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:57:41,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2010036.0, ans=0.125 2023-06-28 07:58:08,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.69 vs. limit=10.0 2023-06-28 07:58:24,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2010156.0, ans=0.125 2023-06-28 07:58:44,578 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 6.904e+02 1.436e+03 2.249e+03 4.425e+03, threshold=2.873e+03, percent-clipped=20.0 2023-06-28 07:58:53,200 INFO [train.py:996] (1/4) Epoch 11, batch 30100, loss[loss=0.2033, simple_loss=0.2767, pruned_loss=0.06491, over 21600.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3009, pruned_loss=0.06788, over 4275040.41 frames. ], batch size: 332, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:59:15,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2010336.0, ans=0.125 2023-06-28 07:59:34,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2010336.0, ans=0.125 2023-06-28 07:59:50,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2010396.0, ans=0.125 2023-06-28 08:00:05,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2010456.0, ans=0.125 2023-06-28 08:00:30,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2010516.0, ans=0.125 2023-06-28 08:00:36,690 INFO [train.py:996] (1/4) Epoch 11, batch 30150, loss[loss=0.284, simple_loss=0.3344, pruned_loss=0.1168, over 21285.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2964, pruned_loss=0.06898, over 4272271.69 frames. ], batch size: 507, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:00:55,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2010576.0, ans=0.0 2023-06-28 08:01:30,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2010696.0, ans=0.1 2023-06-28 08:02:18,805 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.690e+02 6.702e+02 9.063e+02 1.523e+03 3.175e+03, threshold=1.813e+03, percent-clipped=2.0 2023-06-28 08:02:36,690 INFO [train.py:996] (1/4) Epoch 11, batch 30200, loss[loss=0.2132, simple_loss=0.2873, pruned_loss=0.06956, over 21398.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.298, pruned_loss=0.06808, over 4274165.70 frames. 
], batch size: 194, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:02:43,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2010876.0, ans=0.125 2023-06-28 08:03:15,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2010996.0, ans=0.125 2023-06-28 08:03:18,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2010996.0, ans=0.0 2023-06-28 08:03:27,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2010996.0, ans=0.1 2023-06-28 08:03:29,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=2010996.0, ans=0.1 2023-06-28 08:04:12,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2011116.0, ans=0.0 2023-06-28 08:04:21,952 INFO [train.py:996] (1/4) Epoch 11, batch 30250, loss[loss=0.3619, simple_loss=0.4437, pruned_loss=0.1401, over 21412.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3056, pruned_loss=0.07008, over 4275406.50 frames. ], batch size: 507, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:05:55,953 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.393e+02 7.352e+02 1.154e+03 1.714e+03 3.720e+03, threshold=2.308e+03, percent-clipped=21.0 2023-06-28 08:06:04,342 INFO [train.py:996] (1/4) Epoch 11, batch 30300, loss[loss=0.2192, simple_loss=0.3079, pruned_loss=0.06525, over 21686.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3017, pruned_loss=0.06974, over 4268914.40 frames. ], batch size: 332, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:06:13,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2011476.0, ans=0.0 2023-06-28 08:06:17,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-28 08:06:43,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2011596.0, ans=0.125 2023-06-28 08:06:58,643 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:07:50,272 INFO [train.py:996] (1/4) Epoch 11, batch 30350, loss[loss=0.2318, simple_loss=0.3275, pruned_loss=0.06808, over 21635.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3014, pruned_loss=0.0706, over 4266729.84 frames. ], batch size: 389, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:08:30,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. 
limit=22.5 2023-06-28 08:08:42,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2011956.0, ans=0.07 2023-06-28 08:08:44,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2011956.0, ans=0.125 2023-06-28 08:08:57,088 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.778e+02 8.957e+02 1.374e+03 2.295e+03 4.777e+03, threshold=2.749e+03, percent-clipped=24.0 2023-06-28 08:09:10,055 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-28 08:09:11,745 INFO [train.py:996] (1/4) Epoch 11, batch 30400, loss[loss=0.2098, simple_loss=0.2584, pruned_loss=0.08057, over 20401.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2948, pruned_loss=0.06881, over 4258111.44 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 08:09:38,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2012136.0, ans=0.0 2023-06-28 08:10:12,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2012256.0, ans=0.125 2023-06-28 08:10:16,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2012316.0, ans=0.125 2023-06-28 08:10:34,681 INFO [train.py:996] (1/4) Epoch 11, batch 30450, loss[loss=0.2453, simple_loss=0.3534, pruned_loss=0.06856, over 19917.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2956, pruned_loss=0.06868, over 4199665.66 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-28 08:10:51,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2012376.0, ans=0.125 2023-06-28 08:11:30,948 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:11:32,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2012556.0, ans=0.125 2023-06-28 08:13:53,273 INFO [train.py:996] (1/4) Epoch 12, batch 0, loss[loss=0.2169, simple_loss=0.2878, pruned_loss=0.07294, over 21934.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2878, pruned_loss=0.07294, over 21934.00 frames. ], batch size: 103, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:13:53,274 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 08:14:06,583 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.4302, 4.0477, 3.6281, 2.5575], device='cuda:1') 2023-06-28 08:14:09,653 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2477, simple_loss=0.3485, pruned_loss=0.0734, over 1796401.00 frames. 
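The records just above show the per-batch training loss and the epoch-12 start-of-epoch validation loss in the same format used throughout this log. As an illustrative aid (not part of the original log output), the following is a minimal Python sketch for pulling those numbers back out of a log like this one so the training and validation curves can be plotted. The regular expressions are inferred from the record format visible here ("Epoch E, batch B, ... tot_loss[loss=...]" and "Epoch E, validation: loss=..."); they are an assumption based on these lines, not something defined by icefall, and the file name log-train.txt is a placeholder.

import re

# Record formats inferred from the log above (assumed, not taken from icefall):
#   training:   "Epoch E, batch B, loss[...], tot_loss[loss=X, ...]"
#   validation: "Epoch E, validation: loss=X, ..."
TRAIN_RE = re.compile(r"Epoch (\d+), batch (\d+),.*?tot_loss\[loss=([\d.]+)")
VALID_RE = re.compile(r"Epoch (\d+), validation: loss=([\d.]+)")

def parse_log(path):
    """Return (train, valid): train is a list of (epoch, batch, tot_loss),
    valid a list of (epoch, loss), in the order the records appear."""
    text = open(path, errors="replace").read()
    train = [(int(e), int(b), float(x)) for e, b, x in TRAIN_RE.findall(text)]
    valid = [(int(e), float(x)) for e, x in VALID_RE.findall(text)]
    return train, valid

if __name__ == "__main__":
    # "log-train.txt" is a placeholder path for a saved copy of this log.
    train, valid = parse_log("log-train.txt")
    print(f"{len(train)} training records, {len(valid)} validation records")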
2023-06-28 08:14:09,654 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 08:14:12,903 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.161e+02 1.803e+03 3.374e+03 5.381e+03 1.358e+04, threshold=6.748e+03, percent-clipped=56.0 2023-06-28 08:14:47,075 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:15:20,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2012826.0, ans=0.125 2023-06-28 08:15:22,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2012826.0, ans=0.125 2023-06-28 08:15:54,163 INFO [train.py:996] (1/4) Epoch 12, batch 50, loss[loss=0.2573, simple_loss=0.3549, pruned_loss=0.07981, over 21682.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3046, pruned_loss=0.0678, over 967573.05 frames. ], batch size: 389, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:15:56,888 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-28 08:16:55,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2013066.0, ans=0.0 2023-06-28 08:17:14,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2013126.0, ans=0.0 2023-06-28 08:17:18,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2013126.0, ans=0.125 2023-06-28 08:17:37,281 INFO [train.py:996] (1/4) Epoch 12, batch 100, loss[loss=0.274, simple_loss=0.3563, pruned_loss=0.09583, over 21489.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3183, pruned_loss=0.0704, over 1686440.64 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:17:40,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.118e+02 6.672e+02 9.899e+02 1.706e+03 3.699e+03, threshold=1.980e+03, percent-clipped=0.0 2023-06-28 08:17:52,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2013246.0, ans=0.1 2023-06-28 08:18:52,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2013426.0, ans=0.035 2023-06-28 08:18:59,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2013426.0, ans=0.5 2023-06-28 08:19:18,622 INFO [train.py:996] (1/4) Epoch 12, batch 150, loss[loss=0.1919, simple_loss=0.2707, pruned_loss=0.05652, over 21202.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3158, pruned_loss=0.06985, over 2261579.86 frames. ], batch size: 159, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:19:24,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2013546.0, ans=0.1 2023-06-28 08:19:32,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2013546.0, ans=0.0 2023-06-28 08:19:46,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.96 vs. 
limit=15.0 2023-06-28 08:20:40,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2013786.0, ans=0.125 2023-06-28 08:20:40,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2013786.0, ans=0.125 2023-06-28 08:20:57,662 INFO [train.py:996] (1/4) Epoch 12, batch 200, loss[loss=0.222, simple_loss=0.3221, pruned_loss=0.06094, over 20710.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3134, pruned_loss=0.06859, over 2704262.53 frames. ], batch size: 607, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:21:00,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 7.904e+02 1.199e+03 1.656e+03 3.803e+03, threshold=2.398e+03, percent-clipped=21.0 2023-06-28 08:21:52,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-28 08:22:42,027 INFO [train.py:996] (1/4) Epoch 12, batch 250, loss[loss=0.2172, simple_loss=0.2949, pruned_loss=0.06979, over 21820.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3086, pruned_loss=0.06842, over 3058345.61 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:23:52,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2014266.0, ans=0.125 2023-06-28 08:24:02,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2014326.0, ans=0.125 2023-06-28 08:24:32,074 INFO [train.py:996] (1/4) Epoch 12, batch 300, loss[loss=0.2313, simple_loss=0.3056, pruned_loss=0.07848, over 21795.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3025, pruned_loss=0.06759, over 3326307.71 frames. ], batch size: 391, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:24:35,391 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.745e+02 6.973e+02 9.077e+02 1.413e+03 3.093e+03, threshold=1.815e+03, percent-clipped=6.0 2023-06-28 08:25:16,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2014506.0, ans=0.2 2023-06-28 08:25:52,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2014626.0, ans=0.1 2023-06-28 08:25:57,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2014626.0, ans=0.1 2023-06-28 08:26:20,885 INFO [train.py:996] (1/4) Epoch 12, batch 350, loss[loss=0.1926, simple_loss=0.271, pruned_loss=0.05708, over 21624.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2973, pruned_loss=0.06725, over 3532837.91 frames. ], batch size: 415, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:26:35,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.10 vs. limit=15.0 2023-06-28 08:27:02,635 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. 
limit=12.0 2023-06-28 08:27:37,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2014926.0, ans=0.1 2023-06-28 08:27:38,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2014926.0, ans=0.0 2023-06-28 08:27:40,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2014926.0, ans=0.2 2023-06-28 08:28:04,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2014986.0, ans=0.2 2023-06-28 08:28:06,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2015046.0, ans=0.125 2023-06-28 08:28:07,203 INFO [train.py:996] (1/4) Epoch 12, batch 400, loss[loss=0.1859, simple_loss=0.2578, pruned_loss=0.05695, over 21706.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2891, pruned_loss=0.06609, over 3697835.02 frames. ], batch size: 333, lr: 2.47e-03, grad_scale: 32.0 2023-06-28 08:28:10,665 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.672e+02 1.106e+03 1.472e+03 3.614e+03, threshold=2.212e+03, percent-clipped=11.0 2023-06-28 08:29:47,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2015286.0, ans=0.95 2023-06-28 08:29:53,468 INFO [train.py:996] (1/4) Epoch 12, batch 450, loss[loss=0.2647, simple_loss=0.3746, pruned_loss=0.07744, over 21798.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2871, pruned_loss=0.06573, over 3827842.21 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:30:01,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2015346.0, ans=0.125 2023-06-28 08:30:04,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2015346.0, ans=0.125 2023-06-28 08:31:37,497 INFO [train.py:996] (1/4) Epoch 12, batch 500, loss[loss=0.2304, simple_loss=0.3429, pruned_loss=0.05894, over 21772.00 frames. ], tot_loss[loss=0.21, simple_loss=0.29, pruned_loss=0.06503, over 3931454.61 frames. ], batch size: 282, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:31:42,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.576e+02 9.650e+02 1.378e+03 2.425e+03 6.087e+03, threshold=2.755e+03, percent-clipped=29.0 2023-06-28 08:32:15,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=22.5 2023-06-28 08:33:22,147 INFO [train.py:996] (1/4) Epoch 12, batch 550, loss[loss=0.2182, simple_loss=0.2943, pruned_loss=0.07103, over 21879.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2945, pruned_loss=0.06448, over 4015018.84 frames. ], batch size: 107, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:33:54,693 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. 
limit=15.0 2023-06-28 08:34:42,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2016126.0, ans=0.0 2023-06-28 08:34:43,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2016186.0, ans=0.025 2023-06-28 08:34:58,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2016186.0, ans=0.0 2023-06-28 08:35:00,968 INFO [train.py:996] (1/4) Epoch 12, batch 600, loss[loss=0.2229, simple_loss=0.3434, pruned_loss=0.05116, over 21252.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2984, pruned_loss=0.0652, over 4071713.02 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:35:05,854 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.203e+02 8.466e+02 1.434e+03 2.194e+03 5.258e+03, threshold=2.867e+03, percent-clipped=12.0 2023-06-28 08:36:09,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2016366.0, ans=0.2 2023-06-28 08:36:44,608 INFO [train.py:996] (1/4) Epoch 12, batch 650, loss[loss=0.2348, simple_loss=0.325, pruned_loss=0.07232, over 21719.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.298, pruned_loss=0.06618, over 4125668.45 frames. ], batch size: 441, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:37:32,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2016666.0, ans=0.05 2023-06-28 08:38:01,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2016726.0, ans=0.125 2023-06-28 08:38:03,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2016726.0, ans=0.0 2023-06-28 08:38:23,205 INFO [train.py:996] (1/4) Epoch 12, batch 700, loss[loss=0.3349, simple_loss=0.4116, pruned_loss=0.1291, over 21513.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2966, pruned_loss=0.06659, over 4167117.23 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 08:38:34,694 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.083e+02 8.629e+02 1.370e+03 1.985e+03 4.368e+03, threshold=2.739e+03, percent-clipped=8.0 2023-06-28 08:39:57,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=22.5 2023-06-28 08:40:02,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2017086.0, ans=0.0 2023-06-28 08:40:06,459 INFO [train.py:996] (1/4) Epoch 12, batch 750, loss[loss=0.191, simple_loss=0.3169, pruned_loss=0.03258, over 19848.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2958, pruned_loss=0.06701, over 4199042.09 frames. ], batch size: 703, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 08:40:43,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2017206.0, ans=0.1 2023-06-28 08:41:50,315 INFO [train.py:996] (1/4) Epoch 12, batch 800, loss[loss=0.1633, simple_loss=0.2353, pruned_loss=0.04561, over 16232.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2924, pruned_loss=0.06729, over 4206075.49 frames. 
], batch size: 60, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:42:01,925 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 9.081e+02 1.260e+03 2.091e+03 4.459e+03, threshold=2.521e+03, percent-clipped=14.0 2023-06-28 08:42:20,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2017506.0, ans=0.125 2023-06-28 08:42:32,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2017566.0, ans=0.04949747468305833 2023-06-28 08:43:05,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-28 08:43:08,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-28 08:43:20,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2017686.0, ans=0.2 2023-06-28 08:43:32,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2017746.0, ans=0.0 2023-06-28 08:43:33,340 INFO [train.py:996] (1/4) Epoch 12, batch 850, loss[loss=0.2019, simple_loss=0.2695, pruned_loss=0.06716, over 21247.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2896, pruned_loss=0.06747, over 4224720.22 frames. ], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:44:26,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-28 08:45:24,331 INFO [train.py:996] (1/4) Epoch 12, batch 900, loss[loss=0.202, simple_loss=0.2996, pruned_loss=0.05223, over 21796.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2892, pruned_loss=0.06744, over 4239225.33 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:45:35,781 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.139e+02 7.978e+02 1.292e+03 1.942e+03 4.093e+03, threshold=2.584e+03, percent-clipped=13.0 2023-06-28 08:45:43,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2018046.0, ans=0.125 2023-06-28 08:46:32,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2018226.0, ans=0.1 2023-06-28 08:46:44,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2018226.0, ans=0.125 2023-06-28 08:46:45,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2018286.0, ans=0.125 2023-06-28 08:46:48,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-28 08:47:14,380 INFO [train.py:996] (1/4) Epoch 12, batch 950, loss[loss=0.2019, simple_loss=0.2771, pruned_loss=0.0633, over 21696.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2872, pruned_loss=0.06632, over 4250980.97 frames. 
], batch size: 230, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:47:19,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2018346.0, ans=0.0 2023-06-28 08:47:56,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2018466.0, ans=0.125 2023-06-28 08:48:01,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2018466.0, ans=0.125 2023-06-28 08:48:15,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2018526.0, ans=0.04949747468305833 2023-06-28 08:48:39,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2018586.0, ans=0.125 2023-06-28 08:48:56,774 INFO [train.py:996] (1/4) Epoch 12, batch 1000, loss[loss=0.2223, simple_loss=0.3056, pruned_loss=0.06952, over 21899.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2886, pruned_loss=0.06647, over 4263164.99 frames. ], batch size: 316, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:49:03,709 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.940e+02 7.062e+02 8.970e+02 1.402e+03 3.868e+03, threshold=1.794e+03, percent-clipped=7.0 2023-06-28 08:49:21,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2018706.0, ans=0.0 2023-06-28 08:49:26,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2018706.0, ans=0.0 2023-06-28 08:49:39,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2018766.0, ans=0.0 2023-06-28 08:50:37,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2018886.0, ans=0.0 2023-06-28 08:50:42,129 INFO [train.py:996] (1/4) Epoch 12, batch 1050, loss[loss=0.3061, simple_loss=0.3624, pruned_loss=0.1249, over 21420.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2884, pruned_loss=0.06602, over 4267862.90 frames. ], batch size: 507, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:50:54,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=2018946.0, ans=12.0 2023-06-28 08:50:59,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2018946.0, ans=0.1 2023-06-28 08:52:24,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2019186.0, ans=0.07 2023-06-28 08:52:31,734 INFO [train.py:996] (1/4) Epoch 12, batch 1100, loss[loss=0.1831, simple_loss=0.2624, pruned_loss=0.05193, over 21255.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2876, pruned_loss=0.06501, over 4272242.51 frames. 
], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:52:34,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019246.0, ans=0.1 2023-06-28 08:52:34,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2019246.0, ans=0.0 2023-06-28 08:52:39,019 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.826e+02 7.483e+02 1.102e+03 1.696e+03 3.574e+03, threshold=2.203e+03, percent-clipped=22.0 2023-06-28 08:53:06,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2019306.0, ans=0.2 2023-06-28 08:53:09,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2019366.0, ans=0.0 2023-06-28 08:53:24,129 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-28 08:54:16,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-28 08:54:17,191 INFO [train.py:996] (1/4) Epoch 12, batch 1150, loss[loss=0.2321, simple_loss=0.3011, pruned_loss=0.08154, over 21258.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2873, pruned_loss=0.06513, over 4272599.73 frames. ], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:54:21,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.60 vs. limit=15.0 2023-06-28 08:54:38,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-28 08:54:39,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2019606.0, ans=0.1 2023-06-28 08:54:47,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=2019606.0, ans=12.0 2023-06-28 08:54:48,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2019606.0, ans=0.0 2023-06-28 08:55:41,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2019786.0, ans=0.0 2023-06-28 08:55:53,585 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:55:57,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2019786.0, ans=0.0 2023-06-28 08:56:08,749 INFO [train.py:996] (1/4) Epoch 12, batch 1200, loss[loss=0.2309, simple_loss=0.3125, pruned_loss=0.07469, over 21950.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2885, pruned_loss=0.06561, over 4280241.77 frames. 
], batch size: 372, lr: 2.47e-03, grad_scale: 32.0 2023-06-28 08:56:15,505 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.003e+02 8.397e+02 1.494e+03 2.117e+03 4.524e+03, threshold=2.987e+03, percent-clipped=23.0 2023-06-28 08:56:32,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2019906.0, ans=0.125 2023-06-28 08:56:50,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-28 08:56:51,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2019966.0, ans=0.5 2023-06-28 08:57:31,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2020086.0, ans=0.05 2023-06-28 08:57:49,457 INFO [train.py:996] (1/4) Epoch 12, batch 1250, loss[loss=0.2088, simple_loss=0.2924, pruned_loss=0.06259, over 21784.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2928, pruned_loss=0.06721, over 4288451.31 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:58:12,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2020206.0, ans=0.2 2023-06-28 08:58:17,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2020206.0, ans=0.125 2023-06-28 08:59:15,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2020386.0, ans=0.125 2023-06-28 08:59:32,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2020386.0, ans=0.125 2023-06-28 08:59:40,400 INFO [train.py:996] (1/4) Epoch 12, batch 1300, loss[loss=0.1976, simple_loss=0.2738, pruned_loss=0.06064, over 21692.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2949, pruned_loss=0.06836, over 4296261.88 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:59:47,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2020446.0, ans=0.0 2023-06-28 08:59:48,734 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.742e+02 7.744e+02 1.078e+03 1.630e+03 3.241e+03, threshold=2.156e+03, percent-clipped=1.0 2023-06-28 08:59:58,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-06-28 09:00:45,564 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:01:05,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2020686.0, ans=0.1 2023-06-28 09:01:09,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2020686.0, ans=0.125 2023-06-28 09:01:25,435 INFO [train.py:996] (1/4) Epoch 12, batch 1350, loss[loss=0.2526, simple_loss=0.3368, pruned_loss=0.08422, over 21504.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2958, pruned_loss=0.06846, over 4293694.45 frames. 
], batch size: 471, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:02:11,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2020866.0, ans=0.125 2023-06-28 09:02:45,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2020986.0, ans=0.2 2023-06-28 09:03:05,016 INFO [train.py:996] (1/4) Epoch 12, batch 1400, loss[loss=0.2078, simple_loss=0.2808, pruned_loss=0.06745, over 21483.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2928, pruned_loss=0.06785, over 4290914.95 frames. ], batch size: 131, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:03:13,316 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.716e+02 8.874e+02 1.255e+03 1.971e+03 3.857e+03, threshold=2.510e+03, percent-clipped=18.0 2023-06-28 09:04:08,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2021226.0, ans=0.125 2023-06-28 09:04:50,276 INFO [train.py:996] (1/4) Epoch 12, batch 1450, loss[loss=0.2244, simple_loss=0.3038, pruned_loss=0.07245, over 21824.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2916, pruned_loss=0.06751, over 4286333.40 frames. ], batch size: 282, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:05:01,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2021346.0, ans=0.0 2023-06-28 09:05:09,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2021406.0, ans=0.0 2023-06-28 09:05:30,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.42 vs. limit=15.0 2023-06-28 09:05:51,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2021526.0, ans=0.125 2023-06-28 09:06:07,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2021526.0, ans=0.125 2023-06-28 09:06:24,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2021586.0, ans=0.2 2023-06-28 09:06:37,311 INFO [train.py:996] (1/4) Epoch 12, batch 1500, loss[loss=0.2035, simple_loss=0.2815, pruned_loss=0.06275, over 17330.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.293, pruned_loss=0.06876, over 4281156.47 frames. ], batch size: 60, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:06:46,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2021646.0, ans=0.2 2023-06-28 09:06:47,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.815e+02 8.356e+02 1.274e+03 1.855e+03 4.343e+03, threshold=2.548e+03, percent-clipped=12.0 2023-06-28 09:07:07,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. 
limit=15.0 2023-06-28 09:07:15,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2021766.0, ans=0.0 2023-06-28 09:07:57,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2021826.0, ans=0.2 2023-06-28 09:08:11,597 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-28 09:08:19,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2021886.0, ans=0.0 2023-06-28 09:08:24,459 INFO [train.py:996] (1/4) Epoch 12, batch 1550, loss[loss=0.1903, simple_loss=0.2753, pruned_loss=0.05267, over 20992.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2927, pruned_loss=0.06863, over 4283647.70 frames. ], batch size: 607, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:08:27,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-28 09:08:37,693 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-28 09:08:56,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2022006.0, ans=0.2 2023-06-28 09:08:58,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-28 09:09:13,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2022066.0, ans=0.125 2023-06-28 09:09:16,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2022066.0, ans=0.2 2023-06-28 09:10:00,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2022186.0, ans=0.0 2023-06-28 09:10:03,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2022186.0, ans=0.125 2023-06-28 09:10:05,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2022186.0, ans=0.125 2023-06-28 09:10:09,936 INFO [train.py:996] (1/4) Epoch 12, batch 1600, loss[loss=0.2474, simple_loss=0.3204, pruned_loss=0.08723, over 21624.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2914, pruned_loss=0.06714, over 4280012.81 frames. ], batch size: 389, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:10:20,072 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.882e+02 7.910e+02 1.210e+03 1.920e+03 3.790e+03, threshold=2.419e+03, percent-clipped=9.0 2023-06-28 09:11:58,005 INFO [train.py:996] (1/4) Epoch 12, batch 1650, loss[loss=0.1965, simple_loss=0.2766, pruned_loss=0.05813, over 21212.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2884, pruned_loss=0.06558, over 4276971.96 frames. 
], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:13:06,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2022666.0, ans=0.125 2023-06-28 09:13:16,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2022726.0, ans=0.0 2023-06-28 09:13:39,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2022786.0, ans=0.1 2023-06-28 09:13:45,609 INFO [train.py:996] (1/4) Epoch 12, batch 1700, loss[loss=0.2683, simple_loss=0.3547, pruned_loss=0.09099, over 21855.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2922, pruned_loss=0.06721, over 4284122.94 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:13:50,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-28 09:13:55,785 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.138e+02 6.859e+02 1.024e+03 1.407e+03 3.205e+03, threshold=2.048e+03, percent-clipped=5.0 2023-06-28 09:14:18,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2022906.0, ans=0.125 2023-06-28 09:14:43,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-28 09:14:53,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-28 09:15:32,585 INFO [train.py:996] (1/4) Epoch 12, batch 1750, loss[loss=0.2368, simple_loss=0.3226, pruned_loss=0.07547, over 21454.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2933, pruned_loss=0.06697, over 4283788.70 frames. ], batch size: 507, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:15:45,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2023146.0, ans=0.0 2023-06-28 09:16:27,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-28 09:16:53,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2023326.0, ans=0.07 2023-06-28 09:17:25,794 INFO [train.py:996] (1/4) Epoch 12, batch 1800, loss[loss=0.2075, simple_loss=0.2978, pruned_loss=0.05864, over 21359.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2913, pruned_loss=0.06442, over 4272960.14 frames. ], batch size: 194, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:17:46,600 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.284e+02 7.829e+02 1.190e+03 1.910e+03 4.483e+03, threshold=2.381e+03, percent-clipped=19.0 2023-06-28 09:18:00,924 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:18:18,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.01 vs. 
limit=15.0 2023-06-28 09:18:40,089 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:19:11,623 INFO [train.py:996] (1/4) Epoch 12, batch 1850, loss[loss=0.2145, simple_loss=0.3012, pruned_loss=0.06387, over 21600.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.292, pruned_loss=0.06285, over 4271732.10 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:19:46,425 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:19:58,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2023866.0, ans=0.0 2023-06-28 09:20:03,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2023866.0, ans=0.0 2023-06-28 09:20:05,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2023866.0, ans=0.125 2023-06-28 09:20:12,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2023926.0, ans=0.025 2023-06-28 09:20:59,929 INFO [train.py:996] (1/4) Epoch 12, batch 1900, loss[loss=0.2402, simple_loss=0.3483, pruned_loss=0.06611, over 20842.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2943, pruned_loss=0.06367, over 4276090.92 frames. ], batch size: 607, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:21:22,274 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.886e+02 8.389e+02 1.357e+03 2.180e+03 3.591e+03, threshold=2.714e+03, percent-clipped=20.0 2023-06-28 09:21:43,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2024166.0, ans=0.1 2023-06-28 09:22:05,365 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-28 09:22:38,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2024286.0, ans=0.0 2023-06-28 09:22:54,025 INFO [train.py:996] (1/4) Epoch 12, batch 1950, loss[loss=0.1852, simple_loss=0.2532, pruned_loss=0.05856, over 21655.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2907, pruned_loss=0.06378, over 4280282.41 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:23:15,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2024406.0, ans=0.125 2023-06-28 09:23:19,648 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.61 vs. limit=15.0 2023-06-28 09:24:40,575 INFO [train.py:996] (1/4) Epoch 12, batch 2000, loss[loss=0.1462, simple_loss=0.2194, pruned_loss=0.03647, over 21329.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2871, pruned_loss=0.0624, over 4280518.47 frames. 
], batch size: 131, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:24:52,588 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.543e+02 8.090e+02 1.262e+03 2.210e+03 4.405e+03, threshold=2.524e+03, percent-clipped=15.0 2023-06-28 09:25:26,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2024766.0, ans=0.0 2023-06-28 09:25:40,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=2024826.0, ans=0.1 2023-06-28 09:26:25,014 INFO [train.py:996] (1/4) Epoch 12, batch 2050, loss[loss=0.1723, simple_loss=0.2406, pruned_loss=0.052, over 21274.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2891, pruned_loss=0.06337, over 4278064.75 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:26:33,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2024946.0, ans=0.1 2023-06-28 09:27:05,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2025066.0, ans=0.125 2023-06-28 09:27:06,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2025066.0, ans=0.0 2023-06-28 09:27:06,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2025066.0, ans=0.125 2023-06-28 09:27:09,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2025066.0, ans=0.125 2023-06-28 09:27:45,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025186.0, ans=0.1 2023-06-28 09:28:07,567 INFO [train.py:996] (1/4) Epoch 12, batch 2100, loss[loss=0.2131, simple_loss=0.2922, pruned_loss=0.06701, over 21836.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2919, pruned_loss=0.06451, over 4285559.86 frames. ], batch size: 102, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:28:08,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2025246.0, ans=0.125 2023-06-28 09:28:21,410 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.339e+02 9.933e+02 1.500e+03 2.145e+03 4.437e+03, threshold=3.000e+03, percent-clipped=17.0 2023-06-28 09:28:34,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=2025306.0, ans=0.2 2023-06-28 09:29:34,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025486.0, ans=0.1 2023-06-28 09:29:52,489 INFO [train.py:996] (1/4) Epoch 12, batch 2150, loss[loss=0.1878, simple_loss=0.2631, pruned_loss=0.0562, over 21582.00 frames. ], tot_loss[loss=0.211, simple_loss=0.291, pruned_loss=0.06549, over 4283004.16 frames. 
], batch size: 332, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:29:56,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2025546.0, ans=0.1 2023-06-28 09:30:33,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025666.0, ans=0.1 2023-06-28 09:31:31,656 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:31:37,748 INFO [train.py:996] (1/4) Epoch 12, batch 2200, loss[loss=0.1743, simple_loss=0.2552, pruned_loss=0.04672, over 21395.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2919, pruned_loss=0.06576, over 4282540.95 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:31:51,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.594e+02 7.144e+02 1.049e+03 1.524e+03 3.402e+03, threshold=2.098e+03, percent-clipped=4.0 2023-06-28 09:32:05,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2025906.0, ans=0.125 2023-06-28 09:32:17,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-28 09:32:35,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2026026.0, ans=0.0 2023-06-28 09:32:49,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2026026.0, ans=0.2 2023-06-28 09:33:21,822 INFO [train.py:996] (1/4) Epoch 12, batch 2250, loss[loss=0.2033, simple_loss=0.2769, pruned_loss=0.06491, over 21664.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2896, pruned_loss=0.06474, over 4280927.31 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:33:27,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2026146.0, ans=0.2 2023-06-28 09:34:09,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2026266.0, ans=0.0 2023-06-28 09:35:01,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2026386.0, ans=0.125 2023-06-28 09:35:06,614 INFO [train.py:996] (1/4) Epoch 12, batch 2300, loss[loss=0.1834, simple_loss=0.2425, pruned_loss=0.06217, over 21228.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2843, pruned_loss=0.06409, over 4284349.04 frames. 
], batch size: 548, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:35:20,311 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.194e+02 1.165e+03 1.936e+03 3.464e+03, threshold=2.331e+03, percent-clipped=21.0 2023-06-28 09:35:21,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2026446.0, ans=0.04949747468305833 2023-06-28 09:35:21,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2026446.0, ans=0.1 2023-06-28 09:35:50,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2026566.0, ans=0.1 2023-06-28 09:35:50,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2026566.0, ans=0.125 2023-06-28 09:35:50,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2026566.0, ans=0.125 2023-06-28 09:36:39,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2026686.0, ans=0.2 2023-06-28 09:36:46,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2026686.0, ans=0.125 2023-06-28 09:36:50,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-28 09:36:53,095 INFO [train.py:996] (1/4) Epoch 12, batch 2350, loss[loss=0.2193, simple_loss=0.2905, pruned_loss=0.07399, over 21887.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2809, pruned_loss=0.06477, over 4282231.85 frames. ], batch size: 317, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:36:58,249 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.81 vs. limit=5.0 2023-06-28 09:37:27,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2026806.0, ans=0.125 2023-06-28 09:38:05,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2026926.0, ans=0.0 2023-06-28 09:38:38,342 INFO [train.py:996] (1/4) Epoch 12, batch 2400, loss[loss=0.2431, simple_loss=0.3204, pruned_loss=0.08292, over 21307.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2854, pruned_loss=0.067, over 4268687.36 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:38:50,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. 
limit=15.0 2023-06-28 09:38:57,279 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.615e+02 1.092e+03 1.757e+03 3.744e+03, threshold=2.185e+03, percent-clipped=12.0 2023-06-28 09:39:10,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2027106.0, ans=0.0 2023-06-28 09:40:02,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2027226.0, ans=0.0 2023-06-28 09:40:09,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2027286.0, ans=0.1 2023-06-28 09:40:24,041 INFO [train.py:996] (1/4) Epoch 12, batch 2450, loss[loss=0.1815, simple_loss=0.2532, pruned_loss=0.05489, over 21639.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2907, pruned_loss=0.0684, over 4265387.86 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:40:34,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2027346.0, ans=0.125 2023-06-28 09:41:12,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2027466.0, ans=0.2 2023-06-28 09:41:16,424 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.69 vs. limit=5.0 2023-06-28 09:42:08,964 INFO [train.py:996] (1/4) Epoch 12, batch 2500, loss[loss=0.2178, simple_loss=0.2852, pruned_loss=0.07516, over 21867.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2884, pruned_loss=0.06748, over 4271287.88 frames. ], batch size: 98, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:42:19,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2027646.0, ans=0.125 2023-06-28 09:42:21,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2027646.0, ans=10.0 2023-06-28 09:42:27,033 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 7.873e+02 1.330e+03 1.943e+03 4.895e+03, threshold=2.659e+03, percent-clipped=18.0 2023-06-28 09:43:32,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2027826.0, ans=22.5 2023-06-28 09:43:53,466 INFO [train.py:996] (1/4) Epoch 12, batch 2550, loss[loss=0.1834, simple_loss=0.2616, pruned_loss=0.05256, over 21371.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2858, pruned_loss=0.06614, over 4264598.12 frames. ], batch size: 211, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:44:07,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2027946.0, ans=0.035 2023-06-28 09:44:22,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2028006.0, ans=0.0 2023-06-28 09:45:14,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-28 09:45:37,008 INFO [train.py:996] (1/4) Epoch 12, batch 2600, loss[loss=0.1791, simple_loss=0.2453, pruned_loss=0.05638, over 21623.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2871, pruned_loss=0.06768, over 4263803.15 frames. 
], batch size: 264, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:45:55,673 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 1.004e+03 1.411e+03 2.308e+03 3.873e+03, threshold=2.822e+03, percent-clipped=11.0 2023-06-28 09:46:04,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2028306.0, ans=0.125 2023-06-28 09:46:34,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2028366.0, ans=0.0 2023-06-28 09:47:06,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-28 09:47:12,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2028486.0, ans=0.125 2023-06-28 09:47:21,870 INFO [train.py:996] (1/4) Epoch 12, batch 2650, loss[loss=0.2117, simple_loss=0.3019, pruned_loss=0.0608, over 21835.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2883, pruned_loss=0.06886, over 4270078.69 frames. ], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:48:03,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2028666.0, ans=0.125 2023-06-28 09:48:12,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2028666.0, ans=0.125 2023-06-28 09:48:24,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2028666.0, ans=0.1 2023-06-28 09:48:40,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2028726.0, ans=0.125 2023-06-28 09:48:45,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2028726.0, ans=0.125 2023-06-28 09:49:07,743 INFO [train.py:996] (1/4) Epoch 12, batch 2700, loss[loss=0.1927, simple_loss=0.263, pruned_loss=0.06124, over 21763.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2856, pruned_loss=0.06735, over 4279869.08 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:49:25,941 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.017e+02 6.917e+02 8.915e+02 1.240e+03 3.062e+03, threshold=1.783e+03, percent-clipped=1.0 2023-06-28 09:50:25,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-28 09:50:45,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.64 vs. limit=5.0 2023-06-28 09:50:51,159 INFO [train.py:996] (1/4) Epoch 12, batch 2750, loss[loss=0.2274, simple_loss=0.3076, pruned_loss=0.07359, over 21770.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2854, pruned_loss=0.06704, over 4283636.61 frames. ], batch size: 112, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:51:41,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-28 09:52:43,517 INFO [train.py:996] (1/4) Epoch 12, batch 2800, loss[loss=0.2061, simple_loss=0.2811, pruned_loss=0.06561, over 21583.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2913, pruned_loss=0.06803, over 4288427.31 frames. 
], batch size: 263, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 09:52:58,759 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.305e+02 8.151e+02 1.437e+03 2.226e+03 4.806e+03, threshold=2.874e+03, percent-clipped=38.0 2023-06-28 09:53:28,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2029566.0, ans=0.0 2023-06-28 09:53:57,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-28 09:54:28,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-28 09:54:28,777 INFO [train.py:996] (1/4) Epoch 12, batch 2850, loss[loss=0.2014, simple_loss=0.2851, pruned_loss=0.05882, over 21602.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2943, pruned_loss=0.06946, over 4280240.55 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:54:54,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2029806.0, ans=0.125 2023-06-28 09:55:46,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2029926.0, ans=0.0 2023-06-28 09:55:58,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2029986.0, ans=0.0 2023-06-28 09:56:12,496 INFO [train.py:996] (1/4) Epoch 12, batch 2900, loss[loss=0.2431, simple_loss=0.3419, pruned_loss=0.07219, over 21364.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2904, pruned_loss=0.06842, over 4280969.89 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:56:27,897 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.497e+02 8.368e+02 1.188e+03 2.037e+03 3.726e+03, threshold=2.377e+03, percent-clipped=4.0 2023-06-28 09:57:50,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2030286.0, ans=0.0 2023-06-28 09:57:56,775 INFO [train.py:996] (1/4) Epoch 12, batch 2950, loss[loss=0.2319, simple_loss=0.3287, pruned_loss=0.06758, over 21883.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2927, pruned_loss=0.0688, over 4291019.22 frames. ], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:58:10,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2030346.0, ans=0.0 2023-06-28 09:58:24,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2030406.0, ans=0.125 2023-06-28 09:58:24,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2030406.0, ans=0.1 2023-06-28 09:58:52,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2030466.0, ans=0.07 2023-06-28 09:59:41,599 INFO [train.py:996] (1/4) Epoch 12, batch 3000, loss[loss=0.2322, simple_loss=0.3212, pruned_loss=0.07162, over 21602.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2972, pruned_loss=0.06925, over 4293273.09 frames. 
], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:59:41,600 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 10:00:03,543 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2539, simple_loss=0.3416, pruned_loss=0.08306, over 1796401.00 frames. 2023-06-28 10:00:03,544 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 10:00:10,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=12.0 2023-06-28 10:00:24,305 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 8.195e+02 1.192e+03 1.732e+03 4.635e+03, threshold=2.384e+03, percent-clipped=12.0 2023-06-28 10:00:41,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2030706.0, ans=0.0 2023-06-28 10:01:06,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2030766.0, ans=0.125 2023-06-28 10:01:07,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2030826.0, ans=0.2 2023-06-28 10:01:42,912 INFO [train.py:996] (1/4) Epoch 12, batch 3050, loss[loss=0.2018, simple_loss=0.2738, pruned_loss=0.0649, over 20823.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.297, pruned_loss=0.06775, over 4289078.02 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:01:57,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2030946.0, ans=0.125 2023-06-28 10:02:16,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2031006.0, ans=0.125 2023-06-28 10:02:54,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2031126.0, ans=0.07 2023-06-28 10:02:59,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2031126.0, ans=0.125 2023-06-28 10:03:09,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2031186.0, ans=0.125 2023-06-28 10:03:16,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2031186.0, ans=0.125 2023-06-28 10:03:37,788 INFO [train.py:996] (1/4) Epoch 12, batch 3100, loss[loss=0.1743, simple_loss=0.2599, pruned_loss=0.04438, over 21371.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.297, pruned_loss=0.06705, over 4282605.54 frames. 
], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:03:54,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=2031246.0, ans=6.0 2023-06-28 10:03:57,009 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 7.796e+02 1.121e+03 1.860e+03 4.097e+03, threshold=2.242e+03, percent-clipped=9.0 2023-06-28 10:04:19,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2031366.0, ans=0.2 2023-06-28 10:05:26,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2031546.0, ans=0.0 2023-06-28 10:05:27,749 INFO [train.py:996] (1/4) Epoch 12, batch 3150, loss[loss=0.2757, simple_loss=0.3539, pruned_loss=0.09875, over 21866.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2981, pruned_loss=0.06725, over 4281174.81 frames. ], batch size: 118, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:05:57,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2031606.0, ans=0.1 2023-06-28 10:05:58,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2031606.0, ans=0.0 2023-06-28 10:06:57,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2031786.0, ans=0.125 2023-06-28 10:06:59,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2031786.0, ans=0.125 2023-06-28 10:07:12,537 INFO [train.py:996] (1/4) Epoch 12, batch 3200, loss[loss=0.2063, simple_loss=0.2998, pruned_loss=0.05638, over 21911.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2992, pruned_loss=0.06749, over 4282029.99 frames. ], batch size: 372, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 10:07:32,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.058e+02 7.728e+02 1.156e+03 1.759e+03 4.154e+03, threshold=2.311e+03, percent-clipped=17.0 2023-06-28 10:07:34,858 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:07:41,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2031906.0, ans=0.125 2023-06-28 10:07:53,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2031966.0, ans=0.2 2023-06-28 10:08:23,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2032026.0, ans=0.1 2023-06-28 10:09:00,240 INFO [train.py:996] (1/4) Epoch 12, batch 3250, loss[loss=0.2495, simple_loss=0.2963, pruned_loss=0.1014, over 21450.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3006, pruned_loss=0.06917, over 4275085.20 frames. ], batch size: 510, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:10:06,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2032326.0, ans=0.09899494936611666 2023-06-28 10:10:39,243 INFO [train.py:996] (1/4) Epoch 12, batch 3300, loss[loss=0.1875, simple_loss=0.2539, pruned_loss=0.06052, over 15571.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2946, pruned_loss=0.06838, over 4274341.11 frames. 
], batch size: 60, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:10:56,002 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.931e+02 8.073e+02 1.537e+03 2.186e+03 4.176e+03, threshold=3.073e+03, percent-clipped=21.0 2023-06-28 10:10:56,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2032506.0, ans=0.1 2023-06-28 10:12:23,344 INFO [train.py:996] (1/4) Epoch 12, batch 3350, loss[loss=0.1998, simple_loss=0.2845, pruned_loss=0.0576, over 21658.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2968, pruned_loss=0.06826, over 4272880.31 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:12:31,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-28 10:12:46,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2032806.0, ans=0.125 2023-06-28 10:12:49,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2032806.0, ans=0.125 2023-06-28 10:12:59,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2032806.0, ans=0.2 2023-06-28 10:13:32,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2032926.0, ans=0.125 2023-06-28 10:13:42,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2032926.0, ans=0.125 2023-06-28 10:14:06,585 INFO [train.py:996] (1/4) Epoch 12, batch 3400, loss[loss=0.229, simple_loss=0.3183, pruned_loss=0.06987, over 21832.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.297, pruned_loss=0.06879, over 4281769.30 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:14:28,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.882e+02 7.652e+02 1.057e+03 1.709e+03 3.627e+03, threshold=2.113e+03, percent-clipped=2.0 2023-06-28 10:15:05,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-28 10:15:10,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2033226.0, ans=0.1 2023-06-28 10:15:16,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-28 10:15:17,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2033226.0, ans=0.1 2023-06-28 10:15:20,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2033226.0, ans=0.2 2023-06-28 10:15:33,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2033286.0, ans=0.5 2023-06-28 10:15:50,721 INFO [train.py:996] (1/4) Epoch 12, batch 3450, loss[loss=0.2198, simple_loss=0.3024, pruned_loss=0.06855, over 21499.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2918, pruned_loss=0.06788, over 4280509.33 frames. 
], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:15:55,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2033346.0, ans=0.125 2023-06-28 10:16:01,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2033346.0, ans=0.2 2023-06-28 10:16:10,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2033346.0, ans=0.125 2023-06-28 10:16:14,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2033406.0, ans=0.125 2023-06-28 10:17:18,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2033586.0, ans=0.125 2023-06-28 10:17:35,128 INFO [train.py:996] (1/4) Epoch 12, batch 3500, loss[loss=0.2463, simple_loss=0.3282, pruned_loss=0.08219, over 21379.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3004, pruned_loss=0.07133, over 4279480.99 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:17:52,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2033646.0, ans=0.125 2023-06-28 10:18:03,141 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.129e+02 8.730e+02 1.318e+03 1.854e+03 3.895e+03, threshold=2.636e+03, percent-clipped=20.0 2023-06-28 10:19:16,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2033886.0, ans=0.125 2023-06-28 10:19:23,747 INFO [train.py:996] (1/4) Epoch 12, batch 3550, loss[loss=0.186, simple_loss=0.2552, pruned_loss=0.05834, over 21583.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3031, pruned_loss=0.07277, over 4277575.80 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:19:51,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2034006.0, ans=0.125 2023-06-28 10:20:10,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2034066.0, ans=0.125 2023-06-28 10:20:18,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2034066.0, ans=0.0 2023-06-28 10:20:26,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2034126.0, ans=0.125 2023-06-28 10:20:38,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2034126.0, ans=0.125 2023-06-28 10:21:12,828 INFO [train.py:996] (1/4) Epoch 12, batch 3600, loss[loss=0.2228, simple_loss=0.2936, pruned_loss=0.07604, over 21705.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2984, pruned_loss=0.07255, over 4271719.44 frames. 
], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:21:31,739 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.278e+02 7.988e+02 1.219e+03 1.896e+03 5.241e+03, threshold=2.438e+03, percent-clipped=11.0 2023-06-28 10:21:35,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2034306.0, ans=0.0 2023-06-28 10:21:35,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2034306.0, ans=0.125 2023-06-28 10:21:50,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2034366.0, ans=0.0 2023-06-28 10:22:32,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2034486.0, ans=0.1 2023-06-28 10:22:51,689 INFO [train.py:996] (1/4) Epoch 12, batch 3650, loss[loss=0.2488, simple_loss=0.3263, pruned_loss=0.08571, over 21650.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2981, pruned_loss=0.07322, over 4275888.51 frames. ], batch size: 508, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:24:06,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2034786.0, ans=0.125 2023-06-28 10:24:33,949 INFO [train.py:996] (1/4) Epoch 12, batch 3700, loss[loss=0.2078, simple_loss=0.2862, pruned_loss=0.06465, over 21463.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2968, pruned_loss=0.07244, over 4281599.96 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:24:42,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2034846.0, ans=0.125 2023-06-28 10:24:56,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2034906.0, ans=0.1 2023-06-28 10:24:57,048 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.964e+02 7.431e+02 1.073e+03 1.535e+03 4.329e+03, threshold=2.147e+03, percent-clipped=8.0 2023-06-28 10:25:11,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2034906.0, ans=0.125 2023-06-28 10:25:59,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2035086.0, ans=0.0 2023-06-28 10:26:16,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2035146.0, ans=0.125 2023-06-28 10:26:17,528 INFO [train.py:996] (1/4) Epoch 12, batch 3750, loss[loss=0.2073, simple_loss=0.3037, pruned_loss=0.05543, over 21316.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2955, pruned_loss=0.07097, over 4285190.62 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:26:59,598 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.17 vs. 
limit=6.0 2023-06-28 10:27:05,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2035266.0, ans=0.125 2023-06-28 10:27:15,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2035326.0, ans=0.1 2023-06-28 10:27:43,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2035386.0, ans=0.2 2023-06-28 10:27:57,866 INFO [train.py:996] (1/4) Epoch 12, batch 3800, loss[loss=0.2094, simple_loss=0.2941, pruned_loss=0.06234, over 21912.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2915, pruned_loss=0.06907, over 4275741.82 frames. ], batch size: 372, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:28:21,911 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.866e+02 7.287e+02 1.012e+03 1.468e+03 2.920e+03, threshold=2.024e+03, percent-clipped=9.0 2023-06-28 10:28:23,154 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-28 10:28:36,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2035566.0, ans=0.1 2023-06-28 10:28:43,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2035566.0, ans=0.0 2023-06-28 10:29:00,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2035626.0, ans=0.1 2023-06-28 10:29:09,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2035626.0, ans=0.04949747468305833 2023-06-28 10:29:40,096 INFO [train.py:996] (1/4) Epoch 12, batch 3850, loss[loss=0.1885, simple_loss=0.2514, pruned_loss=0.06281, over 21329.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2902, pruned_loss=0.06933, over 4272375.56 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:30:24,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2035866.0, ans=0.125 2023-06-28 10:31:23,412 INFO [train.py:996] (1/4) Epoch 12, batch 3900, loss[loss=0.206, simple_loss=0.2793, pruned_loss=0.06631, over 21859.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2865, pruned_loss=0.06857, over 4266303.41 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:31:47,274 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.137e+02 7.075e+02 9.098e+02 1.343e+03 3.131e+03, threshold=1.820e+03, percent-clipped=11.0 2023-06-28 10:32:08,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2036166.0, ans=0.2 2023-06-28 10:32:15,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2036166.0, ans=0.0 2023-06-28 10:32:19,690 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0 2023-06-28 10:32:48,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. 
limit=15.0 2023-06-28 10:33:08,698 INFO [train.py:996] (1/4) Epoch 12, batch 3950, loss[loss=0.1955, simple_loss=0.2558, pruned_loss=0.06764, over 20836.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2875, pruned_loss=0.06773, over 4268725.63 frames. ], batch size: 611, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:34:05,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2036526.0, ans=0.1 2023-06-28 10:34:24,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2036526.0, ans=0.125 2023-06-28 10:34:52,697 INFO [train.py:996] (1/4) Epoch 12, batch 4000, loss[loss=0.1942, simple_loss=0.2578, pruned_loss=0.06535, over 21439.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2805, pruned_loss=0.06473, over 4273821.55 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 10:35:16,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.215e+02 7.767e+02 1.100e+03 1.663e+03 3.671e+03, threshold=2.200e+03, percent-clipped=20.0 2023-06-28 10:36:04,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2036826.0, ans=0.125 2023-06-28 10:36:13,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2036826.0, ans=0.0 2023-06-28 10:36:35,200 INFO [train.py:996] (1/4) Epoch 12, batch 4050, loss[loss=0.197, simple_loss=0.2922, pruned_loss=0.05092, over 21687.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2792, pruned_loss=0.06338, over 4267914.60 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:36:45,114 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-28 10:36:54,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2036946.0, ans=0.125 2023-06-28 10:37:20,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.52 vs. limit=22.5 2023-06-28 10:38:18,408 INFO [train.py:996] (1/4) Epoch 12, batch 4100, loss[loss=0.1918, simple_loss=0.2723, pruned_loss=0.05564, over 21410.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2817, pruned_loss=0.06315, over 4270666.93 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:38:45,099 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-28 10:38:45,581 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.840e+02 7.700e+02 1.227e+03 1.924e+03 4.359e+03, threshold=2.455e+03, percent-clipped=14.0 2023-06-28 10:39:39,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2037426.0, ans=0.0 2023-06-28 10:40:06,904 INFO [train.py:996] (1/4) Epoch 12, batch 4150, loss[loss=0.1637, simple_loss=0.259, pruned_loss=0.03416, over 21485.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2825, pruned_loss=0.06058, over 4276274.67 frames. 
], batch size: 195, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:40:14,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2037546.0, ans=0.125 2023-06-28 10:40:43,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2037666.0, ans=0.125 2023-06-28 10:40:56,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=8.0 2023-06-28 10:41:03,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2037666.0, ans=0.0 2023-06-28 10:41:43,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-28 10:41:52,367 INFO [train.py:996] (1/4) Epoch 12, batch 4200, loss[loss=0.2029, simple_loss=0.3076, pruned_loss=0.04909, over 21226.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2842, pruned_loss=0.06117, over 4274756.52 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:42:14,627 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.777e+02 8.293e+02 1.484e+03 2.185e+03 3.637e+03, threshold=2.967e+03, percent-clipped=18.0 2023-06-28 10:42:18,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2037906.0, ans=0.1 2023-06-28 10:42:28,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2037906.0, ans=0.0 2023-06-28 10:43:24,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2038086.0, ans=0.1 2023-06-28 10:43:37,191 INFO [train.py:996] (1/4) Epoch 12, batch 4250, loss[loss=0.239, simple_loss=0.3259, pruned_loss=0.07608, over 21363.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2887, pruned_loss=0.06294, over 4271782.49 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:43:57,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2038206.0, ans=0.0 2023-06-28 10:44:19,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2038266.0, ans=0.125 2023-06-28 10:44:19,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2038266.0, ans=0.125 2023-06-28 10:44:30,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2038266.0, ans=0.125 2023-06-28 10:45:24,214 INFO [train.py:996] (1/4) Epoch 12, batch 4300, loss[loss=0.2002, simple_loss=0.2832, pruned_loss=0.05855, over 21303.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2949, pruned_loss=0.06483, over 4274292.66 frames. 
], batch size: 176, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:45:43,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2038446.0, ans=0.05 2023-06-28 10:46:00,800 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.611e+02 9.355e+02 1.305e+03 1.983e+03 5.098e+03, threshold=2.609e+03, percent-clipped=8.0 2023-06-28 10:46:26,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2038566.0, ans=0.125 2023-06-28 10:47:06,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2038686.0, ans=0.0 2023-06-28 10:47:12,494 INFO [train.py:996] (1/4) Epoch 12, batch 4350, loss[loss=0.1796, simple_loss=0.249, pruned_loss=0.05514, over 21368.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2953, pruned_loss=0.06433, over 4277025.54 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:47:30,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-28 10:48:27,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2038926.0, ans=0.0 2023-06-28 10:48:43,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2038986.0, ans=0.2 2023-06-28 10:49:03,166 INFO [train.py:996] (1/4) Epoch 12, batch 4400, loss[loss=0.1916, simple_loss=0.2649, pruned_loss=0.05915, over 21496.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2917, pruned_loss=0.06389, over 4261995.92 frames. ], batch size: 195, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:49:03,839 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:49:09,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2039046.0, ans=0.1 2023-06-28 10:49:15,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2039046.0, ans=0.1 2023-06-28 10:49:15,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2039046.0, ans=0.0 2023-06-28 10:49:35,015 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.000e+02 1.052e+03 1.456e+03 1.843e+03 4.869e+03, threshold=2.912e+03, percent-clipped=14.0 2023-06-28 10:50:15,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2039226.0, ans=0.2 2023-06-28 10:50:29,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2039286.0, ans=0.07 2023-06-28 10:50:53,926 INFO [train.py:996] (1/4) Epoch 12, batch 4450, loss[loss=0.2347, simple_loss=0.3313, pruned_loss=0.06908, over 21603.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2965, pruned_loss=0.06444, over 4261889.74 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:51:01,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.69 vs. 
limit=22.5 2023-06-28 10:51:42,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.48 vs. limit=15.0 2023-06-28 10:52:16,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2039586.0, ans=0.1 2023-06-28 10:52:20,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2039586.0, ans=0.04949747468305833 2023-06-28 10:52:38,111 INFO [train.py:996] (1/4) Epoch 12, batch 4500, loss[loss=0.205, simple_loss=0.2915, pruned_loss=0.05923, over 21687.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2968, pruned_loss=0.06633, over 4272262.91 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:53:04,869 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.459e+02 9.304e+02 1.246e+03 2.301e+03 3.917e+03, threshold=2.492e+03, percent-clipped=11.0 2023-06-28 10:53:05,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.80 vs. limit=10.0 2023-06-28 10:53:21,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2039766.0, ans=0.025 2023-06-28 10:54:28,127 INFO [train.py:996] (1/4) Epoch 12, batch 4550, loss[loss=0.1863, simple_loss=0.2875, pruned_loss=0.04259, over 20778.00 frames. ], tot_loss[loss=0.217, simple_loss=0.3001, pruned_loss=0.06695, over 4270695.12 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:54:32,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2039946.0, ans=0.1 2023-06-28 10:55:20,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2040066.0, ans=0.0 2023-06-28 10:55:31,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2040126.0, ans=0.0 2023-06-28 10:55:59,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2040186.0, ans=0.1 2023-06-28 10:56:14,155 INFO [train.py:996] (1/4) Epoch 12, batch 4600, loss[loss=0.1927, simple_loss=0.2712, pruned_loss=0.05717, over 21402.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3022, pruned_loss=0.06795, over 4272471.65 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:56:36,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.16 vs. 
limit=15.0 2023-06-28 10:56:36,681 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.360e+02 7.467e+02 1.139e+03 1.677e+03 2.825e+03, threshold=2.277e+03, percent-clipped=5.0 2023-06-28 10:56:54,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2040366.0, ans=0.125 2023-06-28 10:56:56,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2040366.0, ans=0.125 2023-06-28 10:57:05,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2040366.0, ans=0.125 2023-06-28 10:57:05,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2040366.0, ans=0.0 2023-06-28 10:57:12,789 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-28 10:57:25,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2040426.0, ans=0.2 2023-06-28 10:57:58,186 INFO [train.py:996] (1/4) Epoch 12, batch 4650, loss[loss=0.2153, simple_loss=0.2915, pruned_loss=0.06955, over 21396.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.297, pruned_loss=0.06666, over 4273117.51 frames. ], batch size: 144, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:58:09,399 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-28 10:59:19,559 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:59:19,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2040786.0, ans=0.04949747468305833 2023-06-28 10:59:40,610 INFO [train.py:996] (1/4) Epoch 12, batch 4700, loss[loss=0.1758, simple_loss=0.2387, pruned_loss=0.05645, over 21245.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.291, pruned_loss=0.06526, over 4271467.48 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:00:07,729 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 7.682e+02 1.181e+03 1.934e+03 4.585e+03, threshold=2.362e+03, percent-clipped=15.0 2023-06-28 11:00:10,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2040906.0, ans=0.0 2023-06-28 11:00:57,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2041026.0, ans=0.1 2023-06-28 11:01:02,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2041086.0, ans=0.1 2023-06-28 11:01:23,167 INFO [train.py:996] (1/4) Epoch 12, batch 4750, loss[loss=0.2021, simple_loss=0.2748, pruned_loss=0.0647, over 21627.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2855, pruned_loss=0.06547, over 4272218.65 frames. ], batch size: 298, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:01:30,941 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. 
limit=15.0 2023-06-28 11:02:30,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-28 11:02:56,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2041386.0, ans=0.1 2023-06-28 11:03:05,688 INFO [train.py:996] (1/4) Epoch 12, batch 4800, loss[loss=0.2129, simple_loss=0.2841, pruned_loss=0.07087, over 21904.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2859, pruned_loss=0.06633, over 4283879.59 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 11:03:15,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-28 11:03:32,411 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.216e+02 8.084e+02 1.278e+03 1.855e+03 4.015e+03, threshold=2.556e+03, percent-clipped=12.0 2023-06-28 11:03:36,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2041506.0, ans=0.125 2023-06-28 11:03:36,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.31 vs. limit=6.0 2023-06-28 11:03:41,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-28 11:04:26,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2041686.0, ans=0.125 2023-06-28 11:04:47,271 INFO [train.py:996] (1/4) Epoch 12, batch 4850, loss[loss=0.2116, simple_loss=0.3002, pruned_loss=0.06146, over 21538.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2849, pruned_loss=0.06584, over 4278573.19 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 11:04:47,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2041746.0, ans=0.1 2023-06-28 11:06:22,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2041986.0, ans=0.125 2023-06-28 11:06:30,301 INFO [train.py:996] (1/4) Epoch 12, batch 4900, loss[loss=0.2096, simple_loss=0.2857, pruned_loss=0.0668, over 21881.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.286, pruned_loss=0.06623, over 4286430.29 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:06:37,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2042046.0, ans=0.125 2023-06-28 11:06:58,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.185e+02 7.394e+02 1.193e+03 1.925e+03 4.019e+03, threshold=2.386e+03, percent-clipped=10.0 2023-06-28 11:08:14,026 INFO [train.py:996] (1/4) Epoch 12, batch 4950, loss[loss=0.1702, simple_loss=0.272, pruned_loss=0.0342, over 21680.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2915, pruned_loss=0.06583, over 4287087.34 frames. 
], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:08:46,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2042406.0, ans=0.125 2023-06-28 11:09:23,857 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-28 11:09:33,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2042526.0, ans=0.2 2023-06-28 11:09:54,802 INFO [train.py:996] (1/4) Epoch 12, batch 5000, loss[loss=0.2249, simple_loss=0.2998, pruned_loss=0.07499, over 21852.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.291, pruned_loss=0.06242, over 4293979.23 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:09:57,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2042646.0, ans=0.2 2023-06-28 11:10:22,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.667e+02 7.098e+02 1.009e+03 1.573e+03 3.184e+03, threshold=2.017e+03, percent-clipped=11.0 2023-06-28 11:11:02,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2042826.0, ans=0.0 2023-06-28 11:11:35,586 INFO [train.py:996] (1/4) Epoch 12, batch 5050, loss[loss=0.2039, simple_loss=0.283, pruned_loss=0.06243, over 21610.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2911, pruned_loss=0.06359, over 4298915.40 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:11:42,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2042946.0, ans=0.125 2023-06-28 11:12:02,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2043006.0, ans=0.125 2023-06-28 11:12:25,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2043066.0, ans=0.125 2023-06-28 11:13:17,755 INFO [train.py:996] (1/4) Epoch 12, batch 5100, loss[loss=0.2352, simple_loss=0.3023, pruned_loss=0.08403, over 21592.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2899, pruned_loss=0.06466, over 4292789.63 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:13:44,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2043306.0, ans=0.2 2023-06-28 11:13:45,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.047e+02 7.915e+02 1.019e+03 1.431e+03 3.420e+03, threshold=2.039e+03, percent-clipped=6.0 2023-06-28 11:14:28,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2043426.0, ans=0.0 2023-06-28 11:14:30,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2043426.0, ans=0.0 2023-06-28 11:14:32,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2043426.0, ans=0.125 2023-06-28 11:15:00,440 INFO [train.py:996] (1/4) Epoch 12, batch 5150, loss[loss=0.1968, simple_loss=0.2666, pruned_loss=0.06351, over 21481.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2873, pruned_loss=0.06519, over 4293534.06 frames. 
], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:15:11,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=2043546.0, ans=0.1 2023-06-28 11:15:42,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2043666.0, ans=0.125 2023-06-28 11:16:44,542 INFO [train.py:996] (1/4) Epoch 12, batch 5200, loss[loss=0.2028, simple_loss=0.2882, pruned_loss=0.05872, over 21167.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2887, pruned_loss=0.0659, over 4291960.66 frames. ], batch size: 143, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:17:03,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2043846.0, ans=0.125 2023-06-28 11:17:18,856 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.436e+02 7.444e+02 1.331e+03 2.729e+03 6.291e+03, threshold=2.663e+03, percent-clipped=30.0 2023-06-28 11:17:56,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2044026.0, ans=0.125 2023-06-28 11:18:26,592 INFO [train.py:996] (1/4) Epoch 12, batch 5250, loss[loss=0.2521, simple_loss=0.3443, pruned_loss=0.07991, over 21494.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2892, pruned_loss=0.06476, over 4284125.51 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:20:07,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-28 11:20:08,081 INFO [train.py:996] (1/4) Epoch 12, batch 5300, loss[loss=0.2109, simple_loss=0.2806, pruned_loss=0.07062, over 21784.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2876, pruned_loss=0.06478, over 4273028.19 frames. ], batch size: 231, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:20:33,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-06-28 11:20:42,497 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.313e+02 7.509e+02 1.039e+03 1.571e+03 3.451e+03, threshold=2.078e+03, percent-clipped=7.0 2023-06-28 11:20:44,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2044506.0, ans=0.125 2023-06-28 11:21:47,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2044746.0, ans=0.125 2023-06-28 11:21:48,587 INFO [train.py:996] (1/4) Epoch 12, batch 5350, loss[loss=0.2043, simple_loss=0.2745, pruned_loss=0.06704, over 21375.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2865, pruned_loss=0.06564, over 4271974.45 frames. 
], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:22:19,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2044806.0, ans=0.1 2023-06-28 11:22:23,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2044806.0, ans=0.125 2023-06-28 11:22:23,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2044806.0, ans=0.125 2023-06-28 11:22:26,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2044806.0, ans=0.2 2023-06-28 11:22:36,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2044866.0, ans=0.125 2023-06-28 11:22:42,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.08 vs. limit=10.0 2023-06-28 11:23:04,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2044926.0, ans=0.2 2023-06-28 11:23:06,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2044926.0, ans=0.1 2023-06-28 11:23:16,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2044986.0, ans=0.125 2023-06-28 11:23:35,438 INFO [train.py:996] (1/4) Epoch 12, batch 5400, loss[loss=0.2041, simple_loss=0.275, pruned_loss=0.06658, over 21495.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2861, pruned_loss=0.06669, over 4281693.25 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:23:42,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-28 11:23:48,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2045046.0, ans=0.0 2023-06-28 11:23:56,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2045106.0, ans=0.0 2023-06-28 11:24:05,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-06-28 11:24:05,988 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.403e+02 8.319e+02 1.196e+03 1.782e+03 3.222e+03, threshold=2.392e+03, percent-clipped=18.0 2023-06-28 11:24:33,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2045166.0, ans=0.2 2023-06-28 11:24:53,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2045226.0, ans=0.125 2023-06-28 11:25:08,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2045286.0, ans=0.2 2023-06-28 11:25:16,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2045286.0, ans=0.125 2023-06-28 11:25:19,496 INFO [train.py:996] (1/4) Epoch 12, batch 5450, loss[loss=0.2014, simple_loss=0.3022, pruned_loss=0.05027, over 21394.00 frames. 
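Most of the scaling.py chatter here consists of ScheduledFloat lines, each reporting the current value (ans=...) of some module hyperparameter (a skip rate, balancer probability, dropout, or bypass scale) at the current batch_count. A small sketch of the idea, assuming the value is linearly interpolated between a few (batch_count, value) breakpoints and held constant past the last one; the breakpoints and the class name below are made up for illustration.

class ScheduledFloatSketch:
    # A scalar hyperparameter whose value follows a piecewise-linear schedule
    # over the global batch_count.
    def __init__(self, *points):
        # points: (batch_count, value) pairs, e.g. (0, 0.2), (50000, 0.0)
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        x0, y0 = self.points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in self.points[1:]:
            if batch_count <= x1:
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
            x0, y0 = x1, y1
        return y0  # flat after the last breakpoint

conv_skip_rate = ScheduledFloatSketch((0.0, 0.2), (50000.0, 0.0))
print(conv_skip_rate.value(2044806.0))  # far past the last breakpoint -> 0.0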
], tot_loss[loss=0.2082, simple_loss=0.2862, pruned_loss=0.0651, over 4278298.49 frames. ], batch size: 211, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:25:31,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2045346.0, ans=0.125 2023-06-28 11:25:39,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-28 11:25:57,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2045406.0, ans=0.07 2023-06-28 11:26:20,864 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:26:33,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2045526.0, ans=0.1 2023-06-28 11:27:08,780 INFO [train.py:996] (1/4) Epoch 12, batch 5500, loss[loss=0.2183, simple_loss=0.3293, pruned_loss=0.05362, over 21192.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2921, pruned_loss=0.06236, over 4274050.07 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:27:44,012 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.944e+02 8.580e+02 1.207e+03 1.863e+03 4.637e+03, threshold=2.413e+03, percent-clipped=15.0 2023-06-28 11:28:00,537 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-28 11:28:42,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-28 11:28:56,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2045946.0, ans=0.125 2023-06-28 11:28:57,716 INFO [train.py:996] (1/4) Epoch 12, batch 5550, loss[loss=0.1551, simple_loss=0.241, pruned_loss=0.03459, over 21031.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2944, pruned_loss=0.0603, over 4269333.26 frames. ], batch size: 143, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:29:10,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2045946.0, ans=0.125 2023-06-28 11:29:13,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2045946.0, ans=0.0 2023-06-28 11:30:46,206 INFO [train.py:996] (1/4) Epoch 12, batch 5600, loss[loss=0.2169, simple_loss=0.3184, pruned_loss=0.05769, over 21770.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2952, pruned_loss=0.05864, over 4268738.86 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:30:49,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=22.5 2023-06-28 11:31:01,996 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:31:12,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. 
limit=15.0 2023-06-28 11:31:13,171 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 9.110e+02 1.414e+03 2.313e+03 5.859e+03, threshold=2.829e+03, percent-clipped=23.0 2023-06-28 11:31:34,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2046366.0, ans=0.0 2023-06-28 11:31:45,237 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:32:02,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-28 11:32:27,082 INFO [train.py:996] (1/4) Epoch 12, batch 5650, loss[loss=0.2235, simple_loss=0.2942, pruned_loss=0.07642, over 20003.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.296, pruned_loss=0.06064, over 4271738.84 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:32:42,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2046606.0, ans=0.0 2023-06-28 11:33:32,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2046726.0, ans=0.07 2023-06-28 11:34:09,873 INFO [train.py:996] (1/4) Epoch 12, batch 5700, loss[loss=0.2225, simple_loss=0.3428, pruned_loss=0.05109, over 19776.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2952, pruned_loss=0.06217, over 4276438.67 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:34:42,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.668e+02 8.634e+02 1.270e+03 1.811e+03 3.578e+03, threshold=2.540e+03, percent-clipped=6.0 2023-06-28 11:35:48,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2047086.0, ans=0.0 2023-06-28 11:35:54,480 INFO [train.py:996] (1/4) Epoch 12, batch 5750, loss[loss=0.1599, simple_loss=0.2479, pruned_loss=0.03598, over 21826.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2927, pruned_loss=0.05942, over 4262250.49 frames. ], batch size: 316, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:36:21,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2047206.0, ans=0.125 2023-06-28 11:36:31,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2047206.0, ans=0.125 2023-06-28 11:36:31,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2047206.0, ans=0.0 2023-06-28 11:37:13,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2047326.0, ans=0.2 2023-06-28 11:37:23,843 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:37:33,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-28 11:37:43,052 INFO [train.py:996] (1/4) Epoch 12, batch 5800, loss[loss=0.1845, simple_loss=0.2678, pruned_loss=0.05058, over 21292.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2919, pruned_loss=0.05859, over 4257622.66 frames. 
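The "Whitening: ... metric=X vs. limit=Y" lines are diagnostics comparing how far a module's activations are from being white (decorrelated, equal-variance) against a limit. The exact metric computed in scaling.py is not shown in this log; the sketch below uses one plausible choice, the ratio of the largest eigenvalue of the feature covariance to the mean eigenvalue (1.0 for perfectly white features), purely to illustrate the kind of quantity being tracked.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations collected from one module
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return (eigs.max() / eigs.mean().clamp(min=1e-20)).item()

feats = torch.randn(1000, 256)  # roughly white features give a metric near 1-2
print(f"metric={whitening_metric(feats):.2f} vs. limit=15.0")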
], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:38:10,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2047506.0, ans=0.0 2023-06-28 11:38:13,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2047506.0, ans=0.1 2023-06-28 11:38:14,538 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.621e+02 6.881e+02 1.222e+03 1.758e+03 3.677e+03, threshold=2.444e+03, percent-clipped=11.0 2023-06-28 11:38:45,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2047566.0, ans=0.0 2023-06-28 11:38:52,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2047626.0, ans=0.0 2023-06-28 11:39:22,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2047686.0, ans=0.0 2023-06-28 11:39:31,900 INFO [train.py:996] (1/4) Epoch 12, batch 5850, loss[loss=0.1848, simple_loss=0.2981, pruned_loss=0.03576, over 21785.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2913, pruned_loss=0.05588, over 4266492.59 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:39:44,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2047746.0, ans=0.125 2023-06-28 11:39:49,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2047806.0, ans=0.2 2023-06-28 11:40:39,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2047926.0, ans=0.125 2023-06-28 11:40:51,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=2047986.0, ans=0.2 2023-06-28 11:40:58,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=22.5 2023-06-28 11:41:02,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2047986.0, ans=0.1 2023-06-28 11:41:08,981 INFO [train.py:996] (1/4) Epoch 12, batch 5900, loss[loss=0.2164, simple_loss=0.3017, pruned_loss=0.06556, over 21453.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2851, pruned_loss=0.05178, over 4272656.86 frames. ], batch size: 507, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:41:29,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.16 vs. limit=10.0 2023-06-28 11:41:35,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-28 11:41:44,137 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 9.930e+02 1.759e+03 2.367e+03 3.954e+03, threshold=3.519e+03, percent-clipped=21.0 2023-06-28 11:42:11,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. 
limit=22.5 2023-06-28 11:42:27,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2048286.0, ans=0.0 2023-06-28 11:42:54,205 INFO [train.py:996] (1/4) Epoch 12, batch 5950, loss[loss=0.1687, simple_loss=0.2382, pruned_loss=0.04954, over 21675.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2844, pruned_loss=0.05426, over 4276849.97 frames. ], batch size: 231, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:43:57,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2048526.0, ans=0.0 2023-06-28 11:44:34,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2048586.0, ans=0.0 2023-06-28 11:44:36,730 INFO [train.py:996] (1/4) Epoch 12, batch 6000, loss[loss=0.1831, simple_loss=0.2489, pruned_loss=0.05865, over 21739.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2803, pruned_loss=0.05727, over 4274200.53 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:44:36,731 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 11:44:57,243 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2597, simple_loss=0.3509, pruned_loss=0.08424, over 1796401.00 frames. 2023-06-28 11:44:57,244 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 11:45:06,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2048646.0, ans=0.0 2023-06-28 11:45:14,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2048646.0, ans=0.125 2023-06-28 11:45:28,552 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.837e+02 9.369e+02 1.291e+03 2.028e+03 3.757e+03, threshold=2.582e+03, percent-clipped=1.0 2023-06-28 11:45:37,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-28 11:45:38,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-28 11:46:20,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2048886.0, ans=0.1 2023-06-28 11:46:40,051 INFO [train.py:996] (1/4) Epoch 12, batch 6050, loss[loss=0.1969, simple_loss=0.2617, pruned_loss=0.06599, over 21652.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2757, pruned_loss=0.05795, over 4266709.28 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:47:09,282 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:47:41,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2049126.0, ans=0.1 2023-06-28 11:47:52,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2049126.0, ans=0.0 2023-06-28 11:48:17,817 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0 2023-06-28 11:48:28,632 INFO [train.py:996] (1/4) Epoch 12, batch 6100, loss[loss=0.1984, simple_loss=0.2829, pruned_loss=0.057, over 21816.00 frames. 
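At batch 6000 above the script pauses to compute a validation loss (loss=0.2597 over 1796401 frames) and then reports the peak GPU memory (23743MB). A minimal sketch of such a validation pass, assuming a frame-weighted average computed without gradients; model, valid_dl and compute_loss are placeholders rather than the script's real objects.

import torch

def compute_validation_loss(model, valid_dl, compute_loss, device="cuda"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}; "
          f"Maximum memory allocated so far is {max_mb}MB")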
], tot_loss[loss=0.1939, simple_loss=0.2746, pruned_loss=0.05666, over 4270380.12 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:48:57,066 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 8.433e+02 1.328e+03 2.179e+03 5.742e+03, threshold=2.657e+03, percent-clipped=17.0 2023-06-28 11:49:16,605 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:49:21,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2049366.0, ans=0.125 2023-06-28 11:49:36,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2049426.0, ans=0.1 2023-06-28 11:50:13,327 INFO [train.py:996] (1/4) Epoch 12, batch 6150, loss[loss=0.1862, simple_loss=0.2638, pruned_loss=0.05433, over 21800.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2766, pruned_loss=0.05905, over 4270572.38 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:50:19,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=2049546.0, ans=15.0 2023-06-28 11:50:20,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2049546.0, ans=0.125 2023-06-28 11:51:05,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2049666.0, ans=0.125 2023-06-28 11:51:30,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.47 vs. limit=6.0 2023-06-28 11:51:56,282 INFO [train.py:996] (1/4) Epoch 12, batch 6200, loss[loss=0.2201, simple_loss=0.3026, pruned_loss=0.06884, over 21624.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2784, pruned_loss=0.05959, over 4273595.97 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:52:27,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2049906.0, ans=0.125 2023-06-28 11:52:32,345 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.502e+02 7.829e+02 1.153e+03 1.728e+03 4.252e+03, threshold=2.307e+03, percent-clipped=8.0 2023-06-28 11:52:54,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2049966.0, ans=0.125 2023-06-28 11:53:41,415 INFO [train.py:996] (1/4) Epoch 12, batch 6250, loss[loss=0.1855, simple_loss=0.3004, pruned_loss=0.03529, over 20848.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2838, pruned_loss=0.05953, over 4267653.89 frames. ], batch size: 608, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:53:50,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2050146.0, ans=0.125 2023-06-28 11:54:15,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.71 vs. 
limit=6.0 2023-06-28 11:55:07,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2050386.0, ans=0.0 2023-06-28 11:55:16,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-28 11:55:23,841 INFO [train.py:996] (1/4) Epoch 12, batch 6300, loss[loss=0.1984, simple_loss=0.2894, pruned_loss=0.05367, over 21850.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.288, pruned_loss=0.05862, over 4278213.16 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:55:26,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2050446.0, ans=0.0 2023-06-28 11:55:51,876 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-28 11:56:03,345 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.650e+02 7.162e+02 1.070e+03 1.625e+03 2.845e+03, threshold=2.140e+03, percent-clipped=5.0 2023-06-28 11:57:05,256 INFO [train.py:996] (1/4) Epoch 12, batch 6350, loss[loss=0.2178, simple_loss=0.2936, pruned_loss=0.071, over 21568.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2905, pruned_loss=0.06196, over 4285085.70 frames. ], batch size: 230, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:57:19,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-28 11:58:00,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2050866.0, ans=0.125 2023-06-28 11:58:20,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2050926.0, ans=0.0 2023-06-28 11:58:37,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-28 11:58:54,045 INFO [train.py:996] (1/4) Epoch 12, batch 6400, loss[loss=0.186, simple_loss=0.3039, pruned_loss=0.03403, over 19767.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2956, pruned_loss=0.06525, over 4273622.02 frames. ], batch size: 703, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:59:29,778 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.784e+02 8.222e+02 1.150e+03 1.542e+03 3.199e+03, threshold=2.299e+03, percent-clipped=10.0 2023-06-28 11:59:51,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2051166.0, ans=0.125 2023-06-28 12:00:15,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2051286.0, ans=15.0 2023-06-28 12:00:30,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2051286.0, ans=0.5 2023-06-28 12:00:36,723 INFO [train.py:996] (1/4) Epoch 12, batch 6450, loss[loss=0.2319, simple_loss=0.317, pruned_loss=0.07345, over 21601.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2977, pruned_loss=0.06569, over 4277161.53 frames. 
], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:01:04,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-28 12:01:27,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2051466.0, ans=0.0 2023-06-28 12:01:29,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-28 12:02:20,324 INFO [train.py:996] (1/4) Epoch 12, batch 6500, loss[loss=0.1949, simple_loss=0.2616, pruned_loss=0.06412, over 21271.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.29, pruned_loss=0.06457, over 4262478.72 frames. ], batch size: 131, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:02:59,803 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.006e+02 7.341e+02 1.379e+03 1.907e+03 4.704e+03, threshold=2.757e+03, percent-clipped=17.0 2023-06-28 12:03:46,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0 2023-06-28 12:04:01,071 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:04:03,582 INFO [train.py:996] (1/4) Epoch 12, batch 6550, loss[loss=0.2112, simple_loss=0.2877, pruned_loss=0.0674, over 21825.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2891, pruned_loss=0.06348, over 4267372.36 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:04:41,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2052006.0, ans=0.125 2023-06-28 12:04:54,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2052066.0, ans=0.125 2023-06-28 12:05:10,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2052126.0, ans=0.0 2023-06-28 12:05:21,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2052126.0, ans=0.0 2023-06-28 12:05:44,398 INFO [train.py:996] (1/4) Epoch 12, batch 6600, loss[loss=0.1714, simple_loss=0.2351, pruned_loss=0.05378, over 21242.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2833, pruned_loss=0.0631, over 4275577.55 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:06:02,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2052246.0, ans=0.2 2023-06-28 12:06:28,656 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.752e+02 7.717e+02 1.174e+03 1.589e+03 2.955e+03, threshold=2.349e+03, percent-clipped=1.0 2023-06-28 12:06:46,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2052366.0, ans=0.125 2023-06-28 12:07:14,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2052486.0, ans=0.1 2023-06-28 12:07:32,119 INFO [train.py:996] (1/4) Epoch 12, batch 6650, loss[loss=0.2296, simple_loss=0.2856, pruned_loss=0.0868, over 21376.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2766, pruned_loss=0.06047, over 4274037.46 frames. 
], batch size: 508, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:08:27,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2052666.0, ans=0.125 2023-06-28 12:08:52,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2052786.0, ans=0.1 2023-06-28 12:09:13,051 INFO [train.py:996] (1/4) Epoch 12, batch 6700, loss[loss=0.1759, simple_loss=0.2527, pruned_loss=0.04953, over 21735.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2727, pruned_loss=0.06059, over 4273617.88 frames. ], batch size: 118, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:09:37,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2052906.0, ans=0.125 2023-06-28 12:09:48,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-28 12:09:52,368 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.639e+02 7.163e+02 1.028e+03 1.473e+03 3.561e+03, threshold=2.056e+03, percent-clipped=9.0 2023-06-28 12:10:00,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-28 12:10:25,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2053026.0, ans=0.1 2023-06-28 12:10:26,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2053026.0, ans=0.0 2023-06-28 12:10:53,892 INFO [train.py:996] (1/4) Epoch 12, batch 6750, loss[loss=0.2058, simple_loss=0.2857, pruned_loss=0.06297, over 21862.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2728, pruned_loss=0.06043, over 4269723.91 frames. ], batch size: 118, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:10:54,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2053146.0, ans=0.125 2023-06-28 12:11:12,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-28 12:12:33,672 INFO [train.py:996] (1/4) Epoch 12, batch 6800, loss[loss=0.1856, simple_loss=0.2673, pruned_loss=0.05194, over 21122.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2746, pruned_loss=0.06284, over 4275151.83 frames. ], batch size: 607, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:13:13,851 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.991e+02 6.929e+02 1.207e+03 2.029e+03 5.012e+03, threshold=2.414e+03, percent-clipped=24.0 2023-06-28 12:14:14,445 INFO [train.py:996] (1/4) Epoch 12, batch 6850, loss[loss=0.1942, simple_loss=0.2609, pruned_loss=0.06379, over 21422.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2723, pruned_loss=0.06399, over 4276775.05 frames. 
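The grad_scale field attached to each batch summary toggles between 32.0 and 16.0 (32.0 at batch 6800 above, 16.0 again at batch 6850), which is the signature of dynamic loss scaling for mixed-precision training: the scale is halved when an overflow is detected and grown back after a run of clean steps. A minimal sketch using PyTorch's stock GradScaler; model, optimizer, batch and compute_loss are placeholders, and this is not a transcript of the actual training loop.

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

def train_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()  # backprop on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # adjusts the scale, e.g. 32.0 -> 16.0 after an overflow
    return loss.detach(), scaler.get_scale()  # the value logged as grad_scale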
], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:14:23,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2053746.0, ans=0.0 2023-06-28 12:14:42,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2053806.0, ans=0.125 2023-06-28 12:15:22,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2053926.0, ans=0.0 2023-06-28 12:15:44,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2053986.0, ans=0.125 2023-06-28 12:15:58,204 INFO [train.py:996] (1/4) Epoch 12, batch 6900, loss[loss=0.1828, simple_loss=0.2671, pruned_loss=0.04922, over 21374.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.273, pruned_loss=0.06423, over 4278018.72 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:16:05,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2054046.0, ans=0.125 2023-06-28 12:16:19,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-06-28 12:16:39,822 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 6.265e+02 8.270e+02 1.384e+03 3.220e+03, threshold=1.654e+03, percent-clipped=7.0 2023-06-28 12:16:47,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2054166.0, ans=0.07 2023-06-28 12:17:45,888 INFO [train.py:996] (1/4) Epoch 12, batch 6950, loss[loss=0.2186, simple_loss=0.2979, pruned_loss=0.06964, over 21851.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2759, pruned_loss=0.06221, over 4280569.53 frames. ], batch size: 371, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:18:14,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2054406.0, ans=0.2 2023-06-28 12:18:26,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2054466.0, ans=0.05 2023-06-28 12:18:37,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2054466.0, ans=0.1 2023-06-28 12:19:11,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2054586.0, ans=0.1 2023-06-28 12:19:16,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2054586.0, ans=15.0 2023-06-28 12:19:28,441 INFO [train.py:996] (1/4) Epoch 12, batch 7000, loss[loss=0.1919, simple_loss=0.2591, pruned_loss=0.06236, over 21730.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.279, pruned_loss=0.06405, over 4275042.39 frames. 
], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:19:34,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2054646.0, ans=0.125 2023-06-28 12:19:45,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2054646.0, ans=0.0 2023-06-28 12:20:05,474 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.352e+02 8.399e+02 1.085e+03 1.441e+03 2.628e+03, threshold=2.170e+03, percent-clipped=15.0 2023-06-28 12:21:16,085 INFO [train.py:996] (1/4) Epoch 12, batch 7050, loss[loss=0.2025, simple_loss=0.2979, pruned_loss=0.05356, over 21264.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2762, pruned_loss=0.06235, over 4267305.87 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:22:15,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=2055126.0, ans=12.0 2023-06-28 12:22:18,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2055126.0, ans=0.0 2023-06-28 12:22:57,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2055186.0, ans=0.125 2023-06-28 12:23:00,202 INFO [train.py:996] (1/4) Epoch 12, batch 7100, loss[loss=0.1578, simple_loss=0.2317, pruned_loss=0.042, over 21283.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2814, pruned_loss=0.064, over 4268188.86 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:23:36,452 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 7.505e+02 1.150e+03 1.796e+03 3.717e+03, threshold=2.300e+03, percent-clipped=14.0 2023-06-28 12:24:34,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2055486.0, ans=0.125 2023-06-28 12:24:42,352 INFO [train.py:996] (1/4) Epoch 12, batch 7150, loss[loss=0.2867, simple_loss=0.3399, pruned_loss=0.1168, over 21433.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2787, pruned_loss=0.06195, over 4270678.50 frames. ], batch size: 510, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:24:50,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-28 12:25:11,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2055606.0, ans=0.125 2023-06-28 12:25:11,452 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:26:15,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2055786.0, ans=0.125 2023-06-28 12:26:25,286 INFO [train.py:996] (1/4) Epoch 12, batch 7200, loss[loss=0.201, simple_loss=0.2664, pruned_loss=0.06784, over 21793.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2814, pruned_loss=0.06444, over 4270721.11 frames. 
], batch size: 317, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:26:45,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2055906.0, ans=0.1 2023-06-28 12:26:52,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2055906.0, ans=0.015 2023-06-28 12:26:54,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2055906.0, ans=0.0 2023-06-28 12:27:08,261 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.252e+02 8.659e+02 1.185e+03 1.756e+03 3.819e+03, threshold=2.369e+03, percent-clipped=13.0 2023-06-28 12:27:46,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2056026.0, ans=0.0 2023-06-28 12:28:03,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2056086.0, ans=0.125 2023-06-28 12:28:06,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-28 12:28:12,247 INFO [train.py:996] (1/4) Epoch 12, batch 7250, loss[loss=0.153, simple_loss=0.2237, pruned_loss=0.04116, over 21406.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2767, pruned_loss=0.06395, over 4265214.53 frames. ], batch size: 212, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:28:26,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2056146.0, ans=0.125 2023-06-28 12:28:29,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2056206.0, ans=0.125 2023-06-28 12:29:33,583 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:29:53,562 INFO [train.py:996] (1/4) Epoch 12, batch 7300, loss[loss=0.1932, simple_loss=0.2542, pruned_loss=0.06615, over 21228.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2718, pruned_loss=0.06316, over 4270821.38 frames. ], batch size: 143, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:30:04,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2056446.0, ans=0.0 2023-06-28 12:30:07,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-28 12:30:20,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2056506.0, ans=0.125 2023-06-28 12:30:31,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 7.948e+02 1.183e+03 1.586e+03 3.750e+03, threshold=2.367e+03, percent-clipped=12.0 2023-06-28 12:31:31,984 INFO [train.py:996] (1/4) Epoch 12, batch 7350, loss[loss=0.2212, simple_loss=0.2917, pruned_loss=0.07532, over 21366.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2695, pruned_loss=0.06359, over 4277050.53 frames. 
], batch size: 549, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:31:41,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2056746.0, ans=0.1 2023-06-28 12:31:59,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2056806.0, ans=10.0 2023-06-28 12:32:08,775 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.06 vs. limit=6.0 2023-06-28 12:32:42,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2056926.0, ans=0.125 2023-06-28 12:32:42,419 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:32:44,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2056926.0, ans=0.125 2023-06-28 12:33:09,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2056986.0, ans=0.125 2023-06-28 12:33:17,336 INFO [train.py:996] (1/4) Epoch 12, batch 7400, loss[loss=0.2076, simple_loss=0.3048, pruned_loss=0.05518, over 21694.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2772, pruned_loss=0.06535, over 4275221.28 frames. ], batch size: 415, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:33:19,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2057046.0, ans=0.125 2023-06-28 12:33:23,616 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-06-28 12:34:05,464 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-28 12:34:05,872 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.109e+02 7.290e+02 9.953e+02 1.415e+03 2.956e+03, threshold=1.991e+03, percent-clipped=1.0 2023-06-28 12:34:47,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2057286.0, ans=0.0 2023-06-28 12:34:49,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2057286.0, ans=0.1 2023-06-28 12:34:54,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2057286.0, ans=0.0 2023-06-28 12:35:00,568 INFO [train.py:996] (1/4) Epoch 12, batch 7450, loss[loss=0.2075, simple_loss=0.2745, pruned_loss=0.07026, over 21875.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.278, pruned_loss=0.0642, over 4266122.32 frames. 
], batch size: 373, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:35:01,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2057346.0, ans=0.1 2023-06-28 12:35:03,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2057346.0, ans=0.0 2023-06-28 12:35:29,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2057406.0, ans=0.125 2023-06-28 12:36:14,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=2057526.0, ans=0.02 2023-06-28 12:36:49,961 INFO [train.py:996] (1/4) Epoch 12, batch 7500, loss[loss=0.2269, simple_loss=0.3279, pruned_loss=0.063, over 21444.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2829, pruned_loss=0.06495, over 4269883.30 frames. ], batch size: 211, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:37:09,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2057646.0, ans=0.0 2023-06-28 12:37:33,885 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 7.365e+02 1.053e+03 1.699e+03 4.084e+03, threshold=2.105e+03, percent-clipped=21.0 2023-06-28 12:38:34,125 INFO [train.py:996] (1/4) Epoch 12, batch 7550, loss[loss=0.1825, simple_loss=0.2793, pruned_loss=0.04291, over 21653.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2911, pruned_loss=0.06521, over 4265751.83 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:39:10,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-28 12:39:39,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058126.0, ans=0.1 2023-06-28 12:40:02,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2058186.0, ans=0.04949747468305833 2023-06-28 12:40:15,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2058246.0, ans=0.125 2023-06-28 12:40:16,305 INFO [train.py:996] (1/4) Epoch 12, batch 7600, loss[loss=0.1925, simple_loss=0.2685, pruned_loss=0.0583, over 21856.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2909, pruned_loss=0.0644, over 4274881.22 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:40:58,888 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.988e+02 7.986e+02 1.163e+03 1.762e+03 3.955e+03, threshold=2.326e+03, percent-clipped=12.0 2023-06-28 12:41:28,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2058426.0, ans=0.0 2023-06-28 12:41:37,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2058486.0, ans=0.125 2023-06-28 12:41:56,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058546.0, ans=0.1 2023-06-28 12:41:57,888 INFO [train.py:996] (1/4) Epoch 12, batch 7650, loss[loss=0.1956, simple_loss=0.2665, pruned_loss=0.06229, over 21833.00 frames. 
], tot_loss[loss=0.2096, simple_loss=0.2887, pruned_loss=0.06529, over 4286303.87 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:42:00,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2058546.0, ans=0.125 2023-06-28 12:43:01,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.71 vs. limit=10.0 2023-06-28 12:43:12,988 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-28 12:43:17,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2058726.0, ans=0.1 2023-06-28 12:43:46,555 INFO [train.py:996] (1/4) Epoch 12, batch 7700, loss[loss=0.2351, simple_loss=0.3113, pruned_loss=0.07946, over 21576.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2915, pruned_loss=0.06835, over 4287392.26 frames. ], batch size: 389, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:44:31,977 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 7.414e+02 1.157e+03 1.590e+03 5.387e+03, threshold=2.314e+03, percent-clipped=8.0 2023-06-28 12:44:55,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2059026.0, ans=0.125 2023-06-28 12:45:05,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2059026.0, ans=0.125 2023-06-28 12:45:05,674 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.49 vs. limit=6.0 2023-06-28 12:45:36,621 INFO [train.py:996] (1/4) Epoch 12, batch 7750, loss[loss=0.2232, simple_loss=0.3214, pruned_loss=0.06252, over 21426.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2944, pruned_loss=0.06805, over 4273969.55 frames. ], batch size: 211, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:45:51,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2059146.0, ans=0.1 2023-06-28 12:45:54,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2059206.0, ans=0.125 2023-06-28 12:46:11,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2059206.0, ans=0.125 2023-06-28 12:46:25,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2059266.0, ans=0.0 2023-06-28 12:46:49,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-28 12:47:21,155 INFO [train.py:996] (1/4) Epoch 12, batch 7800, loss[loss=0.2455, simple_loss=0.3763, pruned_loss=0.0574, over 19845.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2952, pruned_loss=0.06857, over 4269859.27 frames. 
], batch size: 702, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:47:44,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2059506.0, ans=0.2 2023-06-28 12:48:00,042 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.542e+02 9.199e+02 1.440e+03 2.477e+03 5.669e+03, threshold=2.881e+03, percent-clipped=30.0 2023-06-28 12:48:07,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2059566.0, ans=0.2 2023-06-28 12:49:03,654 INFO [train.py:996] (1/4) Epoch 12, batch 7850, loss[loss=0.192, simple_loss=0.2568, pruned_loss=0.06361, over 21737.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2901, pruned_loss=0.06778, over 4263953.10 frames. ], batch size: 317, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:49:14,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2059746.0, ans=0.125 2023-06-28 12:49:31,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2059806.0, ans=0.2 2023-06-28 12:49:36,254 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:50:35,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2059986.0, ans=0.125 2023-06-28 12:50:49,180 INFO [train.py:996] (1/4) Epoch 12, batch 7900, loss[loss=0.193, simple_loss=0.2659, pruned_loss=0.06006, over 21700.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2852, pruned_loss=0.0666, over 4265252.81 frames. ], batch size: 333, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:50:49,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2060046.0, ans=0.125 2023-06-28 12:51:29,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2060166.0, ans=0.125 2023-06-28 12:51:30,562 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.553e+02 9.216e+02 1.431e+03 2.035e+03 3.808e+03, threshold=2.862e+03, percent-clipped=8.0 2023-06-28 12:51:31,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2060166.0, ans=0.2 2023-06-28 12:51:47,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. limit=10.0 2023-06-28 12:51:48,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2060166.0, ans=0.125 2023-06-28 12:51:51,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-28 12:52:37,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2060346.0, ans=0.05 2023-06-28 12:52:38,402 INFO [train.py:996] (1/4) Epoch 12, batch 7950, loss[loss=0.1995, simple_loss=0.2805, pruned_loss=0.05923, over 21074.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2894, pruned_loss=0.0665, over 4253753.07 frames. 
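Note how the batch sizes in these summaries swing widely (702, 317, 333 in the entries above) while the sampled batches all carry a similar number of frames (roughly 20k-22k). That is the usual signature of packing batches to a duration budget rather than to a fixed utterance count: many short cuts or a few long ones. A toy sketch of such packing, with a made-up frame budget and helper names.

def pack_by_frames(utterance_frames, max_frames_per_batch=20000):
    # Greedy packing: keep adding utterances until the frame budget would overflow.
    batches, current, current_frames = [], [], 0
    for frames in utterance_frames:
        if current and current_frames + frames > max_frames_per_batch:
            batches.append(current)
            current, current_frames = [], 0
        current.append(frames)
        current_frames += frames
    if current:
        batches.append(current)
    return batches

# Short utterances yield large batches, long utterances small ones.
print([len(b) for b in pack_by_frames([50] * 500 + [2000] * 20)])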
], batch size: 143, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:52:51,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-28 12:52:56,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2060406.0, ans=0.125 2023-06-28 12:53:06,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2060406.0, ans=0.2 2023-06-28 12:53:53,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2060526.0, ans=10.0 2023-06-28 12:54:20,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-28 12:54:21,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2060586.0, ans=0.0 2023-06-28 12:54:24,569 INFO [train.py:996] (1/4) Epoch 12, batch 8000, loss[loss=0.2349, simple_loss=0.3113, pruned_loss=0.07924, over 21183.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2955, pruned_loss=0.06834, over 4255597.52 frames. ], batch size: 143, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 12:55:18,769 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.842e+02 9.882e+02 1.672e+03 2.798e+03 5.114e+03, threshold=3.344e+03, percent-clipped=23.0 2023-06-28 12:55:26,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2060766.0, ans=0.125 2023-06-28 12:55:28,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2060766.0, ans=0.125 2023-06-28 12:55:45,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2060826.0, ans=0.2 2023-06-28 12:56:13,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2060886.0, ans=0.1 2023-06-28 12:56:16,333 INFO [train.py:996] (1/4) Epoch 12, batch 8050, loss[loss=0.2483, simple_loss=0.3371, pruned_loss=0.07973, over 21756.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2989, pruned_loss=0.06859, over 4259553.87 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:56:29,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-06-28 12:57:05,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2061066.0, ans=10.0 2023-06-28 12:57:20,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2061126.0, ans=0.5 2023-06-28 12:57:41,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-28 12:57:52,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=22.5 2023-06-28 12:58:04,707 INFO [train.py:996] (1/4) Epoch 12, batch 8100, loss[loss=0.2033, simple_loss=0.2783, pruned_loss=0.06416, over 21514.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.296, pruned_loss=0.06888, over 4270308.90 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:58:35,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2061306.0, ans=22.5 2023-06-28 12:58:53,296 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 7.832e+02 1.202e+03 2.450e+03 5.574e+03, threshold=2.405e+03, percent-clipped=10.0 2023-06-28 12:58:55,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2061366.0, ans=0.125 2023-06-28 12:59:32,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-28 12:59:56,686 INFO [train.py:996] (1/4) Epoch 12, batch 8150, loss[loss=0.2161, simple_loss=0.3202, pruned_loss=0.05595, over 21677.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3047, pruned_loss=0.07004, over 4269762.16 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:00:30,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2061606.0, ans=0.0 2023-06-28 13:01:39,551 INFO [train.py:996] (1/4) Epoch 12, batch 8200, loss[loss=0.1831, simple_loss=0.2469, pruned_loss=0.0597, over 21303.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.298, pruned_loss=0.06846, over 4272648.71 frames. ], batch size: 551, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:02:20,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2061966.0, ans=0.125 2023-06-28 13:02:21,471 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.733e+02 7.541e+02 1.166e+03 1.975e+03 4.840e+03, threshold=2.333e+03, percent-clipped=18.0 2023-06-28 13:02:25,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2061966.0, ans=0.125 2023-06-28 13:03:23,712 INFO [train.py:996] (1/4) Epoch 12, batch 8250, loss[loss=0.2079, simple_loss=0.3101, pruned_loss=0.05282, over 21693.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2947, pruned_loss=0.06829, over 4265283.92 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:03:24,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2062146.0, ans=0.0 2023-06-28 13:03:53,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2062206.0, ans=0.0 2023-06-28 13:04:19,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2062266.0, ans=0.2 2023-06-28 13:05:07,864 INFO [train.py:996] (1/4) Epoch 12, batch 8300, loss[loss=0.2056, simple_loss=0.2901, pruned_loss=0.06053, over 21650.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2932, pruned_loss=0.06563, over 4264995.67 frames. 
], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:05:49,525 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.792e+02 1.211e+03 1.944e+03 6.178e+03, threshold=2.421e+03, percent-clipped=18.0 2023-06-28 13:06:10,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=12.0 2023-06-28 13:06:35,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2062686.0, ans=0.0 2023-06-28 13:06:36,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2062686.0, ans=0.05 2023-06-28 13:06:39,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2062686.0, ans=0.125 2023-06-28 13:06:55,852 INFO [train.py:996] (1/4) Epoch 12, batch 8350, loss[loss=0.1992, simple_loss=0.2878, pruned_loss=0.05526, over 21560.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2923, pruned_loss=0.06364, over 4261996.59 frames. ], batch size: 230, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:08:35,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2062986.0, ans=0.125 2023-06-28 13:08:39,736 INFO [train.py:996] (1/4) Epoch 12, batch 8400, loss[loss=0.1907, simple_loss=0.2796, pruned_loss=0.05088, over 21784.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2916, pruned_loss=0.06178, over 4269238.14 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:08:56,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2063106.0, ans=0.125 2023-06-28 13:09:06,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2063106.0, ans=0.0 2023-06-28 13:09:20,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2063166.0, ans=0.125 2023-06-28 13:09:21,788 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 6.739e+02 1.036e+03 1.500e+03 3.619e+03, threshold=2.071e+03, percent-clipped=10.0 2023-06-28 13:09:41,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-28 13:09:50,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2063226.0, ans=0.0 2023-06-28 13:10:03,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2063286.0, ans=0.125 2023-06-28 13:10:21,248 INFO [train.py:996] (1/4) Epoch 12, batch 8450, loss[loss=0.2528, simple_loss=0.3053, pruned_loss=0.1001, over 21788.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2899, pruned_loss=0.06171, over 4273374.73 frames. 
], batch size: 508, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:10:30,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2063346.0, ans=0.0 2023-06-28 13:10:33,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2063346.0, ans=0.125 2023-06-28 13:11:35,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0 2023-06-28 13:12:04,195 INFO [train.py:996] (1/4) Epoch 12, batch 8500, loss[loss=0.1675, simple_loss=0.2345, pruned_loss=0.05027, over 21473.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2856, pruned_loss=0.06238, over 4270674.84 frames. ], batch size: 212, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:12:41,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2063766.0, ans=0.125 2023-06-28 13:12:49,780 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.731e+02 8.144e+02 1.139e+03 1.907e+03 5.140e+03, threshold=2.279e+03, percent-clipped=18.0 2023-06-28 13:13:05,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2063826.0, ans=0.05 2023-06-28 13:13:13,935 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:13:38,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2063886.0, ans=0.015 2023-06-28 13:13:38,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2063886.0, ans=0.125 2023-06-28 13:13:42,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2063886.0, ans=0.125 2023-06-28 13:13:48,481 INFO [train.py:996] (1/4) Epoch 12, batch 8550, loss[loss=0.2426, simple_loss=0.325, pruned_loss=0.08007, over 21828.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2892, pruned_loss=0.06421, over 4275153.60 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:14:21,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2064006.0, ans=0.125 2023-06-28 13:14:40,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2064066.0, ans=0.2 2023-06-28 13:15:18,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2064186.0, ans=0.125 2023-06-28 13:15:18,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2064186.0, ans=0.125 2023-06-28 13:15:18,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2064186.0, ans=0.125 2023-06-28 13:15:27,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2064186.0, ans=0.2 2023-06-28 13:15:34,952 INFO [train.py:996] (1/4) Epoch 12, batch 8600, loss[loss=0.2366, simple_loss=0.3179, pruned_loss=0.07765, over 21469.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2948, pruned_loss=0.06599, over 4272213.93 frames. 
], batch size: 194, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:15:37,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2064246.0, ans=0.125 2023-06-28 13:15:42,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=2064246.0, ans=10.0 2023-06-28 13:15:54,804 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-28 13:16:29,849 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.562e+02 1.076e+03 1.611e+03 2.403e+03 4.318e+03, threshold=3.223e+03, percent-clipped=30.0 2023-06-28 13:16:34,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2064366.0, ans=0.09899494936611666 2023-06-28 13:17:02,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2064486.0, ans=0.1 2023-06-28 13:17:03,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2064486.0, ans=0.07 2023-06-28 13:17:18,551 INFO [train.py:996] (1/4) Epoch 12, batch 8650, loss[loss=0.1857, simple_loss=0.2318, pruned_loss=0.06982, over 20037.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3016, pruned_loss=0.06812, over 4275645.94 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:17:27,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2064546.0, ans=0.2 2023-06-28 13:17:52,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2064606.0, ans=0.125 2023-06-28 13:18:02,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2064666.0, ans=0.125 2023-06-28 13:18:02,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2064666.0, ans=0.125 2023-06-28 13:18:34,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2064726.0, ans=0.125 2023-06-28 13:18:49,752 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0 2023-06-28 13:18:59,816 INFO [train.py:996] (1/4) Epoch 12, batch 8700, loss[loss=0.1691, simple_loss=0.2073, pruned_loss=0.06544, over 20119.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2916, pruned_loss=0.06496, over 4272463.47 frames. ], batch size: 704, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:19:44,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2064966.0, ans=0.125 2023-06-28 13:19:53,459 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.697e+02 7.863e+02 1.211e+03 1.985e+03 4.359e+03, threshold=2.422e+03, percent-clipped=4.0 2023-06-28 13:20:15,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2065026.0, ans=0.125 2023-06-28 13:20:27,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. 
limit=10.0 2023-06-28 13:20:41,880 INFO [train.py:996] (1/4) Epoch 12, batch 8750, loss[loss=0.1958, simple_loss=0.2695, pruned_loss=0.06098, over 21806.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2873, pruned_loss=0.06578, over 4278225.15 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:20:44,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2065146.0, ans=0.125 2023-06-28 13:21:20,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=15.0 2023-06-28 13:22:13,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2065386.0, ans=0.125 2023-06-28 13:22:18,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2065386.0, ans=0.125 2023-06-28 13:22:30,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-28 13:22:31,074 INFO [train.py:996] (1/4) Epoch 12, batch 8800, loss[loss=0.1755, simple_loss=0.2878, pruned_loss=0.03162, over 20770.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2971, pruned_loss=0.06849, over 4277655.59 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:22:57,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2065506.0, ans=0.0 2023-06-28 13:23:26,838 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.165e+02 8.763e+02 1.222e+03 1.735e+03 3.559e+03, threshold=2.444e+03, percent-clipped=10.0 2023-06-28 13:23:35,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2065626.0, ans=0.125 2023-06-28 13:23:35,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2065626.0, ans=0.125 2023-06-28 13:23:53,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-28 13:23:54,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2065686.0, ans=0.1 2023-06-28 13:24:16,138 INFO [train.py:996] (1/4) Epoch 12, batch 8850, loss[loss=0.1891, simple_loss=0.2739, pruned_loss=0.05211, over 21603.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3031, pruned_loss=0.07021, over 4281264.43 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:24:29,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2065746.0, ans=0.0 2023-06-28 13:24:41,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2065806.0, ans=0.0 2023-06-28 13:24:43,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2065806.0, ans=0.0 2023-06-28 13:24:47,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. 
limit=15.0 2023-06-28 13:25:20,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=12.0 2023-06-28 13:25:30,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2065926.0, ans=0.2 2023-06-28 13:25:55,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2065986.0, ans=0.1 2023-06-28 13:26:05,235 INFO [train.py:996] (1/4) Epoch 12, batch 8900, loss[loss=0.1667, simple_loss=0.2375, pruned_loss=0.04791, over 21323.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2983, pruned_loss=0.06897, over 4263933.74 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:26:06,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2066046.0, ans=0.95 2023-06-28 13:26:36,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2066106.0, ans=0.125 2023-06-28 13:26:57,492 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.266e+02 7.347e+02 1.235e+03 1.790e+03 4.739e+03, threshold=2.470e+03, percent-clipped=10.0 2023-06-28 13:26:58,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2066166.0, ans=0.125 2023-06-28 13:27:56,297 INFO [train.py:996] (1/4) Epoch 12, batch 8950, loss[loss=0.2272, simple_loss=0.3157, pruned_loss=0.06937, over 21732.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2993, pruned_loss=0.06815, over 4264075.09 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:28:03,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2066346.0, ans=0.2 2023-06-28 13:28:25,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2066406.0, ans=0.2 2023-06-28 13:28:40,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2066466.0, ans=0.125 2023-06-28 13:29:38,960 INFO [train.py:996] (1/4) Epoch 12, batch 9000, loss[loss=0.178, simple_loss=0.257, pruned_loss=0.04945, over 21334.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2934, pruned_loss=0.06808, over 4268688.64 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:29:38,960 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 13:29:59,521 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2628, simple_loss=0.3535, pruned_loss=0.086, over 1796401.00 frames. 2023-06-28 13:29:59,522 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 13:30:05,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. 
limit=15.0 2023-06-28 13:30:35,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2066706.0, ans=10.0 2023-06-28 13:30:37,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2066766.0, ans=0.125 2023-06-28 13:30:44,982 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.661e+02 7.055e+02 9.403e+02 1.588e+03 4.919e+03, threshold=1.881e+03, percent-clipped=11.0 2023-06-28 13:31:00,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2066826.0, ans=0.0 2023-06-28 13:31:44,377 INFO [train.py:996] (1/4) Epoch 12, batch 9050, loss[loss=0.1869, simple_loss=0.2739, pruned_loss=0.04998, over 21812.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.288, pruned_loss=0.06536, over 4256482.82 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:31:45,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0 2023-06-28 13:32:19,672 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:32:40,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2067066.0, ans=0.2 2023-06-28 13:32:47,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2067066.0, ans=0.125 2023-06-28 13:32:53,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2067126.0, ans=0.5 2023-06-28 13:33:15,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2067186.0, ans=0.0 2023-06-28 13:33:30,608 INFO [train.py:996] (1/4) Epoch 12, batch 9100, loss[loss=0.1947, simple_loss=0.2984, pruned_loss=0.04548, over 21736.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2932, pruned_loss=0.06724, over 4259750.85 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:33:34,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2067246.0, ans=0.2 2023-06-28 13:34:22,066 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.166e+02 1.280e+03 2.185e+03 3.198e+03 4.785e+03, threshold=4.371e+03, percent-clipped=55.0 2023-06-28 13:34:30,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2067366.0, ans=0.125 2023-06-28 13:34:47,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2067426.0, ans=0.2 2023-06-28 13:34:50,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-28 13:35:03,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2067486.0, ans=0.025 2023-06-28 13:35:16,206 INFO [train.py:996] (1/4) Epoch 12, batch 9150, loss[loss=0.2167, simple_loss=0.3187, pruned_loss=0.05736, over 21736.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2967, pruned_loss=0.06503, over 4258630.07 frames. 
], batch size: 332, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:35:43,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2067606.0, ans=0.125 2023-06-28 13:36:59,440 INFO [train.py:996] (1/4) Epoch 12, batch 9200, loss[loss=0.2054, simple_loss=0.2896, pruned_loss=0.06061, over 21261.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2986, pruned_loss=0.06421, over 4251060.53 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:37:17,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2067846.0, ans=0.2 2023-06-28 13:37:35,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2067906.0, ans=0.0 2023-06-28 13:38:01,134 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.761e+02 9.017e+02 1.569e+03 2.101e+03 3.767e+03, threshold=3.138e+03, percent-clipped=0.0 2023-06-28 13:38:33,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.07 vs. limit=22.5 2023-06-28 13:38:48,658 INFO [train.py:996] (1/4) Epoch 12, batch 9250, loss[loss=0.1961, simple_loss=0.2628, pruned_loss=0.06466, over 21617.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.3003, pruned_loss=0.06603, over 4255112.39 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:40:03,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-28 13:40:39,791 INFO [train.py:996] (1/4) Epoch 12, batch 9300, loss[loss=0.2034, simple_loss=0.2618, pruned_loss=0.07252, over 21439.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2934, pruned_loss=0.06536, over 4261736.57 frames. ], batch size: 475, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:41:32,624 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.166e+02 1.033e+03 1.685e+03 2.661e+03 5.053e+03, threshold=3.371e+03, percent-clipped=15.0 2023-06-28 13:42:25,451 INFO [train.py:996] (1/4) Epoch 12, batch 9350, loss[loss=0.2323, simple_loss=0.3166, pruned_loss=0.07395, over 21901.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2986, pruned_loss=0.06611, over 4262499.77 frames. ], batch size: 316, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:43:23,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2068866.0, ans=0.125 2023-06-28 13:43:46,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.50 vs. limit=12.0 2023-06-28 13:44:15,569 INFO [train.py:996] (1/4) Epoch 12, batch 9400, loss[loss=0.1876, simple_loss=0.2607, pruned_loss=0.05723, over 21640.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3005, pruned_loss=0.06664, over 4265887.23 frames. 
], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:45:01,445 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.886e+02 7.931e+02 1.125e+03 1.716e+03 3.605e+03, threshold=2.249e+03, percent-clipped=1.0 2023-06-28 13:45:27,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2069226.0, ans=0.95 2023-06-28 13:45:27,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2069226.0, ans=0.1 2023-06-28 13:45:58,262 INFO [train.py:996] (1/4) Epoch 12, batch 9450, loss[loss=0.1908, simple_loss=0.2584, pruned_loss=0.06164, over 21668.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2935, pruned_loss=0.06532, over 4269304.20 frames. ], batch size: 333, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:46:22,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2069406.0, ans=0.0 2023-06-28 13:46:30,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2069406.0, ans=0.015 2023-06-28 13:46:37,664 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-28 13:46:54,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2069466.0, ans=0.2 2023-06-28 13:47:41,510 INFO [train.py:996] (1/4) Epoch 12, batch 9500, loss[loss=0.1658, simple_loss=0.2353, pruned_loss=0.04815, over 21488.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.288, pruned_loss=0.0638, over 4264706.85 frames. ], batch size: 230, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:48:02,999 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.51 vs. limit=8.0 2023-06-28 13:48:38,315 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.464e+02 8.117e+02 1.177e+03 1.570e+03 4.123e+03, threshold=2.354e+03, percent-clipped=16.0 2023-06-28 13:48:41,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2069766.0, ans=0.125 2023-06-28 13:48:41,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2069766.0, ans=0.09899494936611666 2023-06-28 13:49:25,115 INFO [train.py:996] (1/4) Epoch 12, batch 9550, loss[loss=0.2552, simple_loss=0.3255, pruned_loss=0.09244, over 21797.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2899, pruned_loss=0.06534, over 4266073.51 frames. 
], batch size: 441, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:49:34,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2069946.0, ans=0.025 2023-06-28 13:49:42,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2069946.0, ans=0.125 2023-06-28 13:49:47,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2070006.0, ans=0.1 2023-06-28 13:49:59,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2070006.0, ans=0.125 2023-06-28 13:50:07,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2070066.0, ans=0.0 2023-06-28 13:51:04,175 INFO [train.py:996] (1/4) Epoch 12, batch 9600, loss[loss=0.1863, simple_loss=0.2645, pruned_loss=0.05403, over 21415.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2925, pruned_loss=0.06721, over 4273490.49 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:51:38,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-28 13:51:41,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2070366.0, ans=0.1 2023-06-28 13:52:01,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.070e+02 8.059e+02 1.139e+03 1.979e+03 4.989e+03, threshold=2.277e+03, percent-clipped=18.0 2023-06-28 13:52:13,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2070426.0, ans=0.125 2023-06-28 13:52:31,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2070486.0, ans=0.0 2023-06-28 13:52:31,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2070486.0, ans=0.1 2023-06-28 13:52:52,012 INFO [train.py:996] (1/4) Epoch 12, batch 9650, loss[loss=0.2091, simple_loss=0.288, pruned_loss=0.0651, over 21760.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2917, pruned_loss=0.06709, over 4282294.71 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:53:14,496 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:53:49,113 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:53:59,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2070726.0, ans=0.125 2023-06-28 13:54:01,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2070726.0, ans=0.125 2023-06-28 13:54:36,711 INFO [train.py:996] (1/4) Epoch 12, batch 9700, loss[loss=0.2065, simple_loss=0.277, pruned_loss=0.06795, over 21242.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2947, pruned_loss=0.06728, over 4282654.50 frames. 
], batch size: 159, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:54:49,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2070846.0, ans=0.125 2023-06-28 13:55:27,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2070966.0, ans=0.0 2023-06-28 13:55:27,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2070966.0, ans=0.0 2023-06-28 13:55:29,991 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.817e+02 8.034e+02 1.157e+03 1.856e+03 3.207e+03, threshold=2.314e+03, percent-clipped=13.0 2023-06-28 13:55:41,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2071026.0, ans=0.0 2023-06-28 13:55:47,948 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-28 13:55:48,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2071026.0, ans=0.0 2023-06-28 13:56:19,104 INFO [train.py:996] (1/4) Epoch 12, batch 9750, loss[loss=0.1858, simple_loss=0.253, pruned_loss=0.05933, over 21715.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2909, pruned_loss=0.06694, over 4268090.10 frames. ], batch size: 299, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:56:59,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2071206.0, ans=0.0 2023-06-28 13:57:17,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2071326.0, ans=0.05 2023-06-28 13:57:31,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=2071326.0, ans=15.0 2023-06-28 13:58:01,398 INFO [train.py:996] (1/4) Epoch 12, batch 9800, loss[loss=0.1934, simple_loss=0.2736, pruned_loss=0.05657, over 21903.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2899, pruned_loss=0.06717, over 4261880.45 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:58:45,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2071566.0, ans=0.2 2023-06-28 13:58:54,278 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.350e+02 9.272e+02 1.641e+03 2.423e+03 5.120e+03, threshold=3.282e+03, percent-clipped=25.0 2023-06-28 13:59:01,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2071626.0, ans=0.125 2023-06-28 13:59:22,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2071686.0, ans=0.04949747468305833 2023-06-28 13:59:43,788 INFO [train.py:996] (1/4) Epoch 12, batch 9850, loss[loss=0.2158, simple_loss=0.2974, pruned_loss=0.06712, over 15732.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2864, pruned_loss=0.0668, over 4256462.29 frames. 
], batch size: 60, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:00:17,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2071806.0, ans=0.0 2023-06-28 14:00:58,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-28 14:01:19,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2071986.0, ans=0.125 2023-06-28 14:01:22,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2071986.0, ans=0.2 2023-06-28 14:01:25,450 INFO [train.py:996] (1/4) Epoch 12, batch 9900, loss[loss=0.2266, simple_loss=0.3185, pruned_loss=0.0674, over 16367.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2836, pruned_loss=0.06621, over 4256117.87 frames. ], batch size: 60, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:02:19,780 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.295e+02 1.063e+03 1.503e+03 2.102e+03 4.753e+03, threshold=3.006e+03, percent-clipped=10.0 2023-06-28 14:02:35,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2072226.0, ans=0.1 2023-06-28 14:03:09,590 INFO [train.py:996] (1/4) Epoch 12, batch 9950, loss[loss=0.1996, simple_loss=0.2646, pruned_loss=0.06736, over 21608.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2842, pruned_loss=0.06787, over 4255641.19 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:03:16,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2072346.0, ans=0.125 2023-06-28 14:03:21,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2072346.0, ans=0.125 2023-06-28 14:03:30,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=22.5 2023-06-28 14:04:08,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2072526.0, ans=0.1 2023-06-28 14:04:43,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2072586.0, ans=0.1 2023-06-28 14:04:52,801 INFO [train.py:996] (1/4) Epoch 12, batch 10000, loss[loss=0.1918, simple_loss=0.2654, pruned_loss=0.05915, over 21759.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2804, pruned_loss=0.06678, over 4252426.30 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 14:04:54,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.80 vs. 
limit=22.5 2023-06-28 14:05:01,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2072646.0, ans=0.125 2023-06-28 14:05:14,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2072706.0, ans=0.0 2023-06-28 14:05:34,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2072766.0, ans=0.125 2023-06-28 14:05:42,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2072766.0, ans=0.125 2023-06-28 14:05:50,332 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.893e+02 6.803e+02 1.015e+03 1.604e+03 3.420e+03, threshold=2.029e+03, percent-clipped=1.0 2023-06-28 14:05:57,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2072826.0, ans=0.0 2023-06-28 14:06:02,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2072826.0, ans=0.0 2023-06-28 14:06:23,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2072886.0, ans=0.2 2023-06-28 14:06:28,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-28 14:06:36,069 INFO [train.py:996] (1/4) Epoch 12, batch 10050, loss[loss=0.2292, simple_loss=0.3052, pruned_loss=0.0766, over 21371.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.282, pruned_loss=0.06714, over 4262075.36 frames. ], batch size: 549, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:07:27,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2073066.0, ans=0.2 2023-06-28 14:07:37,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2073066.0, ans=0.0 2023-06-28 14:07:51,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2073126.0, ans=0.125 2023-06-28 14:08:21,364 INFO [train.py:996] (1/4) Epoch 12, batch 10100, loss[loss=0.2273, simple_loss=0.3024, pruned_loss=0.07609, over 21384.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2816, pruned_loss=0.06568, over 4263999.69 frames. 
], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:08:28,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2073246.0, ans=0.125 2023-06-28 14:09:17,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2073366.0, ans=0.1 2023-06-28 14:09:21,187 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.696e+02 9.806e+02 1.615e+03 2.401e+03 4.786e+03, threshold=3.230e+03, percent-clipped=36.0 2023-06-28 14:09:47,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2073486.0, ans=0.1 2023-06-28 14:09:56,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2073486.0, ans=0.125 2023-06-28 14:10:10,008 INFO [train.py:996] (1/4) Epoch 12, batch 10150, loss[loss=0.2748, simple_loss=0.3402, pruned_loss=0.1047, over 21459.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2877, pruned_loss=0.06821, over 4270526.48 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:11:17,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2073726.0, ans=0.0 2023-06-28 14:11:40,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2073786.0, ans=0.125 2023-06-28 14:11:52,837 INFO [train.py:996] (1/4) Epoch 12, batch 10200, loss[loss=0.1818, simple_loss=0.2553, pruned_loss=0.05412, over 21726.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2874, pruned_loss=0.06663, over 4273738.86 frames. ], batch size: 112, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:12:03,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2073846.0, ans=0.0 2023-06-28 14:12:19,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2073906.0, ans=0.0 2023-06-28 14:12:47,765 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.411e+02 8.616e+02 1.269e+03 2.043e+03 3.610e+03, threshold=2.539e+03, percent-clipped=1.0 2023-06-28 14:13:01,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-28 14:13:02,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2074026.0, ans=0.125 2023-06-28 14:13:08,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2074026.0, ans=0.05 2023-06-28 14:13:24,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2074086.0, ans=0.125 2023-06-28 14:13:40,917 INFO [train.py:996] (1/4) Epoch 12, batch 10250, loss[loss=0.158, simple_loss=0.2533, pruned_loss=0.03139, over 21624.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2827, pruned_loss=0.06147, over 4277314.43 frames. 
], batch size: 414, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:14:50,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2074326.0, ans=0.0 2023-06-28 14:15:12,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2074386.0, ans=0.1 2023-06-28 14:15:25,088 INFO [train.py:996] (1/4) Epoch 12, batch 10300, loss[loss=0.2254, simple_loss=0.3079, pruned_loss=0.07142, over 21374.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2851, pruned_loss=0.06235, over 4277445.63 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:16:22,424 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.719e+02 6.981e+02 1.162e+03 1.847e+03 5.403e+03, threshold=2.324e+03, percent-clipped=10.0 2023-06-28 14:17:11,850 INFO [train.py:996] (1/4) Epoch 12, batch 10350, loss[loss=0.1775, simple_loss=0.257, pruned_loss=0.04897, over 20805.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2878, pruned_loss=0.0622, over 4279142.98 frames. ], batch size: 609, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:17:36,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2074806.0, ans=0.125 2023-06-28 14:17:49,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-28 14:17:56,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2074866.0, ans=0.1 2023-06-28 14:17:58,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2074866.0, ans=0.0 2023-06-28 14:18:08,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2074866.0, ans=0.125 2023-06-28 14:18:19,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2074926.0, ans=0.125 2023-06-28 14:18:50,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2074986.0, ans=0.0 2023-06-28 14:19:00,733 INFO [train.py:996] (1/4) Epoch 12, batch 10400, loss[loss=0.1635, simple_loss=0.2264, pruned_loss=0.05031, over 21485.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2825, pruned_loss=0.06182, over 4266156.52 frames. 
], batch size: 212, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 14:19:03,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2075046.0, ans=0.0 2023-06-28 14:19:04,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2075046.0, ans=0.125 2023-06-28 14:19:08,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2075046.0, ans=0.0 2023-06-28 14:19:31,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2075106.0, ans=0.125 2023-06-28 14:19:43,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2075166.0, ans=0.0 2023-06-28 14:19:57,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2075166.0, ans=0.125 2023-06-28 14:19:57,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2075166.0, ans=0.1 2023-06-28 14:19:58,464 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.939e+02 1.030e+03 1.665e+03 2.817e+03 5.984e+03, threshold=3.330e+03, percent-clipped=36.0 2023-06-28 14:20:02,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2075226.0, ans=0.0 2023-06-28 14:20:46,348 INFO [train.py:996] (1/4) Epoch 12, batch 10450, loss[loss=0.286, simple_loss=0.3669, pruned_loss=0.1026, over 21521.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.285, pruned_loss=0.06398, over 4261826.60 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:20:58,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2075346.0, ans=0.125 2023-06-28 14:21:06,027 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-28 14:21:07,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-28 14:21:12,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2075406.0, ans=0.0 2023-06-28 14:21:18,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2075406.0, ans=0.2 2023-06-28 14:21:22,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2075406.0, ans=0.125 2023-06-28 14:22:22,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2075586.0, ans=0.0 2023-06-28 14:22:34,290 INFO [train.py:996] (1/4) Epoch 12, batch 10500, loss[loss=0.1943, simple_loss=0.2532, pruned_loss=0.06773, over 21418.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2859, pruned_loss=0.06269, over 4259108.93 frames. 
], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:22:53,093 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:23:01,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2075706.0, ans=0.1 2023-06-28 14:23:30,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.346e+02 7.811e+02 1.278e+03 1.903e+03 4.033e+03, threshold=2.556e+03, percent-clipped=2.0 2023-06-28 14:23:38,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2075826.0, ans=0.125 2023-06-28 14:24:02,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2075886.0, ans=0.125 2023-06-28 14:24:16,699 INFO [train.py:996] (1/4) Epoch 12, batch 10550, loss[loss=0.1907, simple_loss=0.2477, pruned_loss=0.06687, over 21148.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2807, pruned_loss=0.06235, over 4257133.27 frames. ], batch size: 143, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:24:54,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-06-28 14:25:22,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-28 14:26:00,758 INFO [train.py:996] (1/4) Epoch 12, batch 10600, loss[loss=0.1696, simple_loss=0.2447, pruned_loss=0.04721, over 21423.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2759, pruned_loss=0.06123, over 4255441.71 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:26:12,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2076246.0, ans=0.0 2023-06-28 14:26:16,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.36 vs. limit=12.0 2023-06-28 14:26:34,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2076306.0, ans=0.2 2023-06-28 14:26:47,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-28 14:26:47,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2076366.0, ans=0.125 2023-06-28 14:26:49,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2076366.0, ans=0.0 2023-06-28 14:26:59,364 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.873e+02 6.316e+02 8.507e+02 1.506e+03 2.988e+03, threshold=1.701e+03, percent-clipped=6.0 2023-06-28 14:27:02,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.86 vs. 
limit=10.0 2023-06-28 14:27:13,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2076426.0, ans=0.125 2023-06-28 14:27:14,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2076426.0, ans=0.125 2023-06-28 14:27:24,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2076426.0, ans=0.125 2023-06-28 14:27:31,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-28 14:27:46,074 INFO [train.py:996] (1/4) Epoch 12, batch 10650, loss[loss=0.1923, simple_loss=0.2835, pruned_loss=0.05053, over 21620.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2809, pruned_loss=0.06069, over 4258538.81 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:28:03,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2076546.0, ans=0.125 2023-06-28 14:28:08,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2076606.0, ans=0.2 2023-06-28 14:29:04,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2076726.0, ans=0.0 2023-06-28 14:29:29,932 INFO [train.py:996] (1/4) Epoch 12, batch 10700, loss[loss=0.1828, simple_loss=0.2585, pruned_loss=0.05359, over 21631.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2794, pruned_loss=0.06123, over 4260894.06 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:30:10,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2076906.0, ans=0.125 2023-06-28 14:30:32,561 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.676e+02 7.956e+02 1.277e+03 1.864e+03 4.109e+03, threshold=2.555e+03, percent-clipped=30.0 2023-06-28 14:31:21,314 INFO [train.py:996] (1/4) Epoch 12, batch 10750, loss[loss=0.2185, simple_loss=0.3028, pruned_loss=0.0671, over 21379.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2905, pruned_loss=0.06534, over 4264574.30 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:32:15,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2077266.0, ans=0.0 2023-06-28 14:32:17,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2077266.0, ans=0.125 2023-06-28 14:32:27,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2077326.0, ans=0.125 2023-06-28 14:32:32,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2077326.0, ans=0.2 2023-06-28 14:33:10,862 INFO [train.py:996] (1/4) Epoch 12, batch 10800, loss[loss=0.2051, simple_loss=0.2886, pruned_loss=0.06085, over 21908.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2955, pruned_loss=0.06558, over 4272290.92 frames. 
], batch size: 316, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 14:33:35,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-28 14:33:38,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2077506.0, ans=0.125 2023-06-28 14:33:58,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=2077566.0, ans=0.1 2023-06-28 14:34:08,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.630e+02 8.272e+02 1.352e+03 2.286e+03 6.133e+03, threshold=2.703e+03, percent-clipped=22.0 2023-06-28 14:34:47,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-28 14:34:54,576 INFO [train.py:996] (1/4) Epoch 12, batch 10850, loss[loss=0.1962, simple_loss=0.2671, pruned_loss=0.06264, over 21295.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2967, pruned_loss=0.06564, over 4273179.08 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:35:36,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=22.5 2023-06-28 14:35:40,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-28 14:35:57,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2077926.0, ans=0.05 2023-06-28 14:36:07,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2077926.0, ans=0.0 2023-06-28 14:36:29,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=2077986.0, ans=0.025 2023-06-28 14:36:38,973 INFO [train.py:996] (1/4) Epoch 12, batch 10900, loss[loss=0.194, simple_loss=0.2827, pruned_loss=0.05269, over 21659.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2895, pruned_loss=0.06403, over 4265817.11 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:37:36,118 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.311e+02 7.551e+02 9.581e+02 1.368e+03 2.722e+03, threshold=1.916e+03, percent-clipped=1.0 2023-06-28 14:38:09,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2078286.0, ans=0.125 2023-06-28 14:38:16,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2078286.0, ans=0.1 2023-06-28 14:38:20,663 INFO [train.py:996] (1/4) Epoch 12, batch 10950, loss[loss=0.1967, simple_loss=0.2666, pruned_loss=0.06339, over 21470.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2859, pruned_loss=0.0626, over 4266655.53 frames. 
], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:38:58,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2078406.0, ans=0.0 2023-06-28 14:39:01,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2078406.0, ans=0.0 2023-06-28 14:39:29,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-28 14:40:04,310 INFO [train.py:996] (1/4) Epoch 12, batch 11000, loss[loss=0.2018, simple_loss=0.2641, pruned_loss=0.06978, over 21581.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2854, pruned_loss=0.06233, over 4259753.74 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:41:02,131 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 8.398e+02 1.287e+03 1.832e+03 5.305e+03, threshold=2.574e+03, percent-clipped=21.0 2023-06-28 14:41:42,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2078886.0, ans=0.125 2023-06-28 14:41:45,732 INFO [train.py:996] (1/4) Epoch 12, batch 11050, loss[loss=0.1972, simple_loss=0.2636, pruned_loss=0.06542, over 21841.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2827, pruned_loss=0.06356, over 4265197.52 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:41:46,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2078946.0, ans=0.125 2023-06-28 14:43:17,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2079186.0, ans=0.125 2023-06-28 14:43:22,903 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:43:23,999 INFO [train.py:996] (1/4) Epoch 12, batch 11100, loss[loss=0.209, simple_loss=0.2817, pruned_loss=0.06819, over 21888.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2811, pruned_loss=0.06415, over 4270351.07 frames. ], batch size: 107, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:44:22,368 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.287e+02 7.124e+02 1.046e+03 1.474e+03 3.228e+03, threshold=2.092e+03, percent-clipped=3.0 2023-06-28 14:44:43,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2079426.0, ans=0.125 2023-06-28 14:45:05,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2079546.0, ans=0.125 2023-06-28 14:45:06,628 INFO [train.py:996] (1/4) Epoch 12, batch 11150, loss[loss=0.1961, simple_loss=0.2718, pruned_loss=0.06024, over 21617.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2774, pruned_loss=0.06298, over 4259816.29 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:45:07,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2079546.0, ans=0.2 2023-06-28 14:45:45,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. 
limit=22.5 2023-06-28 14:45:56,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2079666.0, ans=0.125 2023-06-28 14:46:44,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.24 vs. limit=22.5 2023-06-28 14:46:49,369 INFO [train.py:996] (1/4) Epoch 12, batch 11200, loss[loss=0.2136, simple_loss=0.2715, pruned_loss=0.07788, over 21299.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2761, pruned_loss=0.06274, over 4264800.09 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 14:47:25,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-28 14:47:48,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.550e+02 9.831e+02 1.329e+03 1.720e+03 5.358e+03, threshold=2.658e+03, percent-clipped=16.0 2023-06-28 14:48:03,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2080026.0, ans=0.0 2023-06-28 14:48:09,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2080026.0, ans=0.1 2023-06-28 14:48:30,190 INFO [train.py:996] (1/4) Epoch 12, batch 11250, loss[loss=0.1964, simple_loss=0.2902, pruned_loss=0.05135, over 21604.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2753, pruned_loss=0.06263, over 4254698.02 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:48:43,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.50 vs. limit=15.0 2023-06-28 14:49:33,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2080326.0, ans=0.125 2023-06-28 14:50:00,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-28 14:50:12,575 INFO [train.py:996] (1/4) Epoch 12, batch 11300, loss[loss=0.1957, simple_loss=0.279, pruned_loss=0.05622, over 21825.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2767, pruned_loss=0.06251, over 4266471.35 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:50:55,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2080566.0, ans=0.125 2023-06-28 14:50:56,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2080566.0, ans=0.0 2023-06-28 14:51:14,610 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.978e+02 7.526e+02 1.048e+03 1.657e+03 3.488e+03, threshold=2.097e+03, percent-clipped=3.0 2023-06-28 14:51:15,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-28 14:51:17,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2080626.0, ans=0.1 2023-06-28 14:51:45,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. 
limit=10.0 2023-06-28 14:51:54,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2080746.0, ans=0.0 2023-06-28 14:51:55,965 INFO [train.py:996] (1/4) Epoch 12, batch 11350, loss[loss=0.206, simple_loss=0.3006, pruned_loss=0.05572, over 21813.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2778, pruned_loss=0.06226, over 4266114.85 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:52:13,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2080746.0, ans=0.125 2023-06-28 14:52:40,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2080866.0, ans=0.125 2023-06-28 14:53:51,207 INFO [train.py:996] (1/4) Epoch 12, batch 11400, loss[loss=0.225, simple_loss=0.3061, pruned_loss=0.07196, over 20113.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2838, pruned_loss=0.06463, over 4261666.06 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:54:11,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2081106.0, ans=0.125 2023-06-28 14:54:28,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-28 14:54:39,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081166.0, ans=0.1 2023-06-28 14:54:51,838 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.874e+02 7.904e+02 1.165e+03 1.837e+03 4.416e+03, threshold=2.330e+03, percent-clipped=18.0 2023-06-28 14:55:32,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2081346.0, ans=0.0 2023-06-28 14:55:34,011 INFO [train.py:996] (1/4) Epoch 12, batch 11450, loss[loss=0.1853, simple_loss=0.2649, pruned_loss=0.05291, over 21254.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2849, pruned_loss=0.06368, over 4267988.09 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:55:46,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2081346.0, ans=0.125 2023-06-28 14:55:53,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=22.5 2023-06-28 14:55:54,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2081406.0, ans=0.2 2023-06-28 14:56:05,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081406.0, ans=0.1 2023-06-28 14:57:18,111 INFO [train.py:996] (1/4) Epoch 12, batch 11500, loss[loss=0.2608, simple_loss=0.3266, pruned_loss=0.09749, over 21414.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2879, pruned_loss=0.0653, over 4271215.86 frames. 
], batch size: 471, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:57:20,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2081646.0, ans=0.0 2023-06-28 14:57:20,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2081646.0, ans=0.0 2023-06-28 14:58:12,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2081766.0, ans=0.125 2023-06-28 14:58:20,247 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.484e+02 9.519e+02 1.302e+03 1.957e+03 4.452e+03, threshold=2.605e+03, percent-clipped=16.0 2023-06-28 14:59:03,075 INFO [train.py:996] (1/4) Epoch 12, batch 11550, loss[loss=0.2523, simple_loss=0.3647, pruned_loss=0.06989, over 21763.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2933, pruned_loss=0.06519, over 4272664.97 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:59:07,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-28 14:59:14,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081946.0, ans=0.1 2023-06-28 14:59:37,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2082006.0, ans=0.125 2023-06-28 14:59:37,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2082006.0, ans=0.1 2023-06-28 15:00:40,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2082186.0, ans=0.0 2023-06-28 15:00:43,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2082186.0, ans=0.125 2023-06-28 15:00:46,577 INFO [train.py:996] (1/4) Epoch 12, batch 11600, loss[loss=0.2156, simple_loss=0.2987, pruned_loss=0.06621, over 21842.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3093, pruned_loss=0.06769, over 4269317.87 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:01:15,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2082306.0, ans=0.125 2023-06-28 15:01:22,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2082306.0, ans=0.125 2023-06-28 15:01:57,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2082426.0, ans=0.035 2023-06-28 15:01:58,182 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.568e+02 8.664e+02 1.450e+03 2.268e+03 5.007e+03, threshold=2.901e+03, percent-clipped=18.0 2023-06-28 15:02:21,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.84 vs. limit=5.0 2023-06-28 15:02:35,265 INFO [train.py:996] (1/4) Epoch 12, batch 11650, loss[loss=0.2394, simple_loss=0.3247, pruned_loss=0.07709, over 21850.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3155, pruned_loss=0.06848, over 4261112.45 frames. 
], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:03:40,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-28 15:03:41,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2082726.0, ans=0.1 2023-06-28 15:03:54,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2082786.0, ans=0.0 2023-06-28 15:04:16,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2082846.0, ans=0.125 2023-06-28 15:04:16,902 INFO [train.py:996] (1/4) Epoch 12, batch 11700, loss[loss=0.1961, simple_loss=0.2625, pruned_loss=0.06482, over 21828.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3073, pruned_loss=0.06793, over 4258232.59 frames. ], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:04:56,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2082906.0, ans=0.0 2023-06-28 15:05:08,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2082966.0, ans=0.125 2023-06-28 15:05:22,738 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.474e+02 9.550e+02 1.552e+03 2.202e+03 4.902e+03, threshold=3.105e+03, percent-clipped=9.0 2023-06-28 15:05:57,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-28 15:06:04,483 INFO [train.py:996] (1/4) Epoch 12, batch 11750, loss[loss=0.2267, simple_loss=0.2862, pruned_loss=0.08357, over 21315.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.298, pruned_loss=0.06721, over 4259257.88 frames. ], batch size: 549, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:06:57,452 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-28 15:07:17,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2083326.0, ans=0.0 2023-06-28 15:07:21,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-28 15:07:47,880 INFO [train.py:996] (1/4) Epoch 12, batch 11800, loss[loss=0.2161, simple_loss=0.3175, pruned_loss=0.0574, over 21746.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2993, pruned_loss=0.06881, over 4269090.05 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:08:15,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. 
limit=15.0 2023-06-28 15:08:39,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2083566.0, ans=0.125 2023-06-28 15:08:39,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2083566.0, ans=0.2 2023-06-28 15:08:48,880 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.444e+02 9.330e+02 1.436e+03 2.225e+03 5.022e+03, threshold=2.872e+03, percent-clipped=11.0 2023-06-28 15:08:54,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2083626.0, ans=0.125 2023-06-28 15:08:58,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2083626.0, ans=0.125 2023-06-28 15:09:01,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2083626.0, ans=0.0 2023-06-28 15:09:26,671 INFO [train.py:996] (1/4) Epoch 12, batch 11850, loss[loss=0.1927, simple_loss=0.2772, pruned_loss=0.05414, over 21919.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2987, pruned_loss=0.06804, over 4274635.83 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:09:37,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2083746.0, ans=0.2 2023-06-28 15:10:06,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.13 vs. limit=8.0 2023-06-28 15:11:10,426 INFO [train.py:996] (1/4) Epoch 12, batch 11900, loss[loss=0.211, simple_loss=0.3055, pruned_loss=0.05832, over 21602.00 frames. ], tot_loss[loss=0.216, simple_loss=0.299, pruned_loss=0.06647, over 4275048.07 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:11:15,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-28 15:11:59,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2084166.0, ans=0.1 2023-06-28 15:12:13,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.185e+02 7.198e+02 9.065e+02 1.390e+03 3.282e+03, threshold=1.813e+03, percent-clipped=3.0 2023-06-28 15:12:54,391 INFO [train.py:996] (1/4) Epoch 12, batch 11950, loss[loss=0.1524, simple_loss=0.2371, pruned_loss=0.03392, over 21592.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2978, pruned_loss=0.06345, over 4275200.69 frames. 
], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:13:02,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2084346.0, ans=0.125 2023-06-28 15:14:03,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2084526.0, ans=0.125 2023-06-28 15:14:24,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2084586.0, ans=0.1 2023-06-28 15:14:33,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2084586.0, ans=0.2 2023-06-28 15:14:35,777 INFO [train.py:996] (1/4) Epoch 12, batch 12000, loss[loss=0.2101, simple_loss=0.2816, pruned_loss=0.0693, over 21975.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2921, pruned_loss=0.06154, over 4269646.41 frames. ], batch size: 103, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:14:35,777 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 15:14:56,357 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2655, simple_loss=0.3539, pruned_loss=0.08861, over 1796401.00 frames. 2023-06-28 15:14:56,358 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 15:15:30,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2084766.0, ans=0.0 2023-06-28 15:15:57,097 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.607e+02 7.542e+02 1.127e+03 1.617e+03 2.900e+03, threshold=2.254e+03, percent-clipped=14.0 2023-06-28 15:16:15,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2084886.0, ans=0.2 2023-06-28 15:16:16,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2084886.0, ans=0.0 2023-06-28 15:16:38,993 INFO [train.py:996] (1/4) Epoch 12, batch 12050, loss[loss=0.2229, simple_loss=0.3505, pruned_loss=0.04767, over 20737.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2883, pruned_loss=0.06248, over 4268705.45 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:17:23,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2085066.0, ans=0.125 2023-06-28 15:17:38,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2085126.0, ans=0.5 2023-06-28 15:17:45,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2085126.0, ans=0.1 2023-06-28 15:17:50,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2085126.0, ans=0.125 2023-06-28 15:18:17,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=2085186.0, ans=15.0 2023-06-28 15:18:21,656 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:18:22,711 INFO [train.py:996] (1/4) Epoch 12, batch 12100, loss[loss=0.2104, simple_loss=0.2996, pruned_loss=0.06061, over 21893.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2928, pruned_loss=0.06618, over 4274436.00 frames. 
], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:18:40,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2085306.0, ans=0.0 2023-06-28 15:18:57,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2085306.0, ans=0.125 2023-06-28 15:19:18,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-28 15:19:28,427 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 8.334e+02 1.059e+03 1.614e+03 4.516e+03, threshold=2.118e+03, percent-clipped=9.0 2023-06-28 15:19:55,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2085486.0, ans=0.0 2023-06-28 15:20:08,869 INFO [train.py:996] (1/4) Epoch 12, batch 12150, loss[loss=0.2001, simple_loss=0.2924, pruned_loss=0.05392, over 21582.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2966, pruned_loss=0.06616, over 4276368.35 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:20:28,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2085606.0, ans=0.125 2023-06-28 15:20:49,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-28 15:20:50,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2085666.0, ans=0.0 2023-06-28 15:20:54,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.09 vs. limit=6.0 2023-06-28 15:21:00,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2085666.0, ans=0.0 2023-06-28 15:21:21,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2085726.0, ans=0.0 2023-06-28 15:21:28,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2085726.0, ans=0.0 2023-06-28 15:21:36,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-28 15:21:50,480 INFO [train.py:996] (1/4) Epoch 12, batch 12200, loss[loss=0.2043, simple_loss=0.2671, pruned_loss=0.07077, over 21635.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2932, pruned_loss=0.0646, over 4264013.31 frames. 
], batch size: 231, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:22:23,028 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:22:31,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2085906.0, ans=0.0 2023-06-28 15:23:03,731 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.152e+02 7.335e+02 1.257e+03 1.849e+03 4.350e+03, threshold=2.514e+03, percent-clipped=17.0 2023-06-28 15:23:20,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2086086.0, ans=0.125 2023-06-28 15:23:27,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2086086.0, ans=0.2 2023-06-28 15:23:32,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-28 15:23:33,608 INFO [train.py:996] (1/4) Epoch 12, batch 12250, loss[loss=0.1474, simple_loss=0.2296, pruned_loss=0.03258, over 21302.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2859, pruned_loss=0.06294, over 4267146.75 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:23:39,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2086146.0, ans=0.0 2023-06-28 15:25:05,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2086386.0, ans=0.2 2023-06-28 15:25:05,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2086386.0, ans=0.04949747468305833 2023-06-28 15:25:10,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2086386.0, ans=0.125 2023-06-28 15:25:16,907 INFO [train.py:996] (1/4) Epoch 12, batch 12300, loss[loss=0.1983, simple_loss=0.2932, pruned_loss=0.05175, over 21739.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2796, pruned_loss=0.05834, over 4264585.08 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:25:19,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2086446.0, ans=0.125 2023-06-28 15:25:59,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2086566.0, ans=0.035 2023-06-28 15:26:29,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.595e+02 6.953e+02 1.064e+03 1.818e+03 4.648e+03, threshold=2.128e+03, percent-clipped=12.0 2023-06-28 15:26:54,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2086686.0, ans=0.125 2023-06-28 15:26:59,117 INFO [train.py:996] (1/4) Epoch 12, batch 12350, loss[loss=0.2147, simple_loss=0.2995, pruned_loss=0.06493, over 21911.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2817, pruned_loss=0.05783, over 4268857.41 frames. 
], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:27:06,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2086746.0, ans=0.0 2023-06-28 15:27:06,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2086746.0, ans=0.2 2023-06-28 15:27:31,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.13 vs. limit=12.0 2023-06-28 15:27:45,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2086866.0, ans=0.0 2023-06-28 15:28:27,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2086986.0, ans=0.125 2023-06-28 15:28:40,054 INFO [train.py:996] (1/4) Epoch 12, batch 12400, loss[loss=0.2446, simple_loss=0.3652, pruned_loss=0.06201, over 19826.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2857, pruned_loss=0.06179, over 4280533.25 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:28:41,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-28 15:29:54,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.210e+02 8.096e+02 1.137e+03 1.573e+03 3.341e+03, threshold=2.274e+03, percent-clipped=11.0 2023-06-28 15:29:57,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.23 vs. limit=15.0 2023-06-28 15:30:32,676 INFO [train.py:996] (1/4) Epoch 12, batch 12450, loss[loss=0.2199, simple_loss=0.2999, pruned_loss=0.06993, over 21768.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2897, pruned_loss=0.06468, over 4285556.03 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:30:45,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-28 15:31:14,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2087466.0, ans=0.125 2023-06-28 15:32:16,031 INFO [train.py:996] (1/4) Epoch 12, batch 12500, loss[loss=0.2512, simple_loss=0.3452, pruned_loss=0.07864, over 21301.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.3004, pruned_loss=0.06784, over 4290473.66 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:32:21,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2087646.0, ans=0.125 2023-06-28 15:33:19,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2087826.0, ans=0.125 2023-06-28 15:33:22,113 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.609e+02 8.449e+02 1.202e+03 1.905e+03 3.240e+03, threshold=2.404e+03, percent-clipped=12.0 2023-06-28 15:34:00,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2087946.0, ans=0.1 2023-06-28 15:34:05,902 INFO [train.py:996] (1/4) Epoch 12, batch 12550, loss[loss=0.2288, simple_loss=0.3072, pruned_loss=0.07519, over 21970.00 frames. 
], tot_loss[loss=0.2233, simple_loss=0.3057, pruned_loss=0.0705, over 4285671.08 frames. ], batch size: 317, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:35:23,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2088126.0, ans=0.2 2023-06-28 15:35:46,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2088186.0, ans=0.125 2023-06-28 15:35:50,641 INFO [train.py:996] (1/4) Epoch 12, batch 12600, loss[loss=0.2411, simple_loss=0.3315, pruned_loss=0.07533, over 21587.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3055, pruned_loss=0.06823, over 4289991.09 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:36:59,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 8.006e+02 1.115e+03 1.640e+03 2.498e+03, threshold=2.229e+03, percent-clipped=4.0 2023-06-28 15:37:14,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2088486.0, ans=0.0 2023-06-28 15:37:31,412 INFO [train.py:996] (1/4) Epoch 12, batch 12650, loss[loss=0.1742, simple_loss=0.2479, pruned_loss=0.05022, over 17102.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2982, pruned_loss=0.06469, over 4282372.59 frames. ], batch size: 60, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:37:31,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2088546.0, ans=0.1 2023-06-28 15:38:01,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2088606.0, ans=0.1 2023-06-28 15:38:37,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2088726.0, ans=0.0 2023-06-28 15:38:39,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2088726.0, ans=0.025 2023-06-28 15:39:09,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2088786.0, ans=0.125 2023-06-28 15:39:18,834 INFO [train.py:996] (1/4) Epoch 12, batch 12700, loss[loss=0.2251, simple_loss=0.3054, pruned_loss=0.07243, over 21864.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2967, pruned_loss=0.06634, over 4285833.17 frames. 
], batch size: 371, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:39:27,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2088846.0, ans=0.125 2023-06-28 15:39:29,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2088846.0, ans=0.1 2023-06-28 15:40:22,986 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.191e+02 7.639e+02 1.038e+03 1.743e+03 3.264e+03, threshold=2.075e+03, percent-clipped=12.0 2023-06-28 15:40:28,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2089026.0, ans=0.0 2023-06-28 15:40:46,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2089086.0, ans=0.125 2023-06-28 15:41:01,161 INFO [train.py:996] (1/4) Epoch 12, batch 12750, loss[loss=0.1994, simple_loss=0.2817, pruned_loss=0.05857, over 21397.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2982, pruned_loss=0.06677, over 4283949.42 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:41:31,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2089206.0, ans=0.1 2023-06-28 15:42:07,674 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:42:30,577 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:42:39,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0 2023-06-28 15:42:43,111 INFO [train.py:996] (1/4) Epoch 12, batch 12800, loss[loss=0.225, simple_loss=0.297, pruned_loss=0.07644, over 21841.00 frames. ], tot_loss[loss=0.216, simple_loss=0.297, pruned_loss=0.06751, over 4291459.87 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:43:55,076 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.524e+02 8.190e+02 1.176e+03 1.690e+03 3.535e+03, threshold=2.352e+03, percent-clipped=8.0 2023-06-28 15:44:25,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2089746.0, ans=0.125 2023-06-28 15:44:27,202 INFO [train.py:996] (1/4) Epoch 12, batch 12850, loss[loss=0.186, simple_loss=0.2836, pruned_loss=0.04421, over 21684.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2981, pruned_loss=0.06838, over 4295004.80 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:45:10,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2089866.0, ans=0.125 2023-06-28 15:45:28,149 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-28 15:45:35,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2089926.0, ans=0.0 2023-06-28 15:46:15,554 INFO [train.py:996] (1/4) Epoch 12, batch 12900, loss[loss=0.1958, simple_loss=0.2854, pruned_loss=0.05309, over 21682.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2946, pruned_loss=0.06522, over 4286967.44 frames. 
], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:46:58,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2090166.0, ans=0.125 2023-06-28 15:47:00,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2090166.0, ans=0.125 2023-06-28 15:47:25,954 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.842e+02 7.681e+02 1.209e+03 1.743e+03 3.932e+03, threshold=2.418e+03, percent-clipped=11.0 2023-06-28 15:47:28,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=22.5 2023-06-28 15:47:48,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-28 15:47:55,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2090286.0, ans=0.1 2023-06-28 15:48:02,340 INFO [train.py:996] (1/4) Epoch 12, batch 12950, loss[loss=0.2419, simple_loss=0.3182, pruned_loss=0.08283, over 21730.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.292, pruned_loss=0.06357, over 4281649.40 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:48:13,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-28 15:48:16,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-28 15:48:18,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.61 vs. limit=5.0 2023-06-28 15:48:30,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2090406.0, ans=0.1 2023-06-28 15:49:25,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2090586.0, ans=0.0 2023-06-28 15:49:44,565 INFO [train.py:996] (1/4) Epoch 12, batch 13000, loss[loss=0.1705, simple_loss=0.2573, pruned_loss=0.04186, over 21627.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2919, pruned_loss=0.06374, over 4276732.34 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:50:25,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-06-28 15:50:50,319 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.116e+02 7.680e+02 1.047e+03 1.374e+03 2.853e+03, threshold=2.094e+03, percent-clipped=2.0 2023-06-28 15:50:54,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2090826.0, ans=0.125 2023-06-28 15:51:18,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2090886.0, ans=0.125 2023-06-28 15:51:25,518 INFO [train.py:996] (1/4) Epoch 12, batch 13050, loss[loss=0.2057, simple_loss=0.2808, pruned_loss=0.06534, over 21306.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2879, pruned_loss=0.06214, over 4282051.26 frames. 
], batch size: 176, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:52:10,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-28 15:53:07,012 INFO [train.py:996] (1/4) Epoch 12, batch 13100, loss[loss=0.2313, simple_loss=0.3086, pruned_loss=0.07703, over 21372.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2895, pruned_loss=0.06236, over 4284149.46 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:53:32,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2091306.0, ans=0.125 2023-06-28 15:54:14,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 6.908e+02 8.185e+02 1.188e+03 2.631e+03, threshold=1.637e+03, percent-clipped=2.0 2023-06-28 15:54:49,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2091546.0, ans=0.125 2023-06-28 15:54:50,928 INFO [train.py:996] (1/4) Epoch 12, batch 13150, loss[loss=0.1957, simple_loss=0.2762, pruned_loss=0.05766, over 21846.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2907, pruned_loss=0.06394, over 4276647.44 frames. ], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:56:37,585 INFO [train.py:996] (1/4) Epoch 12, batch 13200, loss[loss=0.2155, simple_loss=0.294, pruned_loss=0.06843, over 21733.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2904, pruned_loss=0.06399, over 4273337.07 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:57:03,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2091906.0, ans=0.0 2023-06-28 15:57:46,671 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.053e+02 7.241e+02 1.163e+03 1.743e+03 3.163e+03, threshold=2.326e+03, percent-clipped=27.0 2023-06-28 15:58:21,569 INFO [train.py:996] (1/4) Epoch 12, batch 13250, loss[loss=0.2557, simple_loss=0.4025, pruned_loss=0.05446, over 19633.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2934, pruned_loss=0.06571, over 4261984.35 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:58:29,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2092146.0, ans=0.0 2023-06-28 15:58:32,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2092146.0, ans=0.95 2023-06-28 15:59:07,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.59 vs. limit=22.5 2023-06-28 15:59:15,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2092266.0, ans=0.0 2023-06-28 15:59:21,235 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=22.5 2023-06-28 15:59:57,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2092386.0, ans=0.2 2023-06-28 16:00:05,144 INFO [train.py:996] (1/4) Epoch 12, batch 13300, loss[loss=0.2387, simple_loss=0.327, pruned_loss=0.07523, over 21859.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2961, pruned_loss=0.06611, over 4263693.68 frames. 
], batch size: 371, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:00:42,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2092506.0, ans=0.1 2023-06-28 16:01:02,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2092566.0, ans=0.04949747468305833 2023-06-28 16:01:15,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2092626.0, ans=0.2 2023-06-28 16:01:23,226 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.410e+02 8.803e+02 1.227e+03 2.112e+03 5.928e+03, threshold=2.454e+03, percent-clipped=20.0 2023-06-28 16:01:23,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2092626.0, ans=0.2 2023-06-28 16:01:30,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2092686.0, ans=0.0 2023-06-28 16:01:48,703 INFO [train.py:996] (1/4) Epoch 12, batch 13350, loss[loss=0.183, simple_loss=0.2568, pruned_loss=0.05466, over 16289.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3024, pruned_loss=0.06954, over 4261943.58 frames. ], batch size: 61, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:02:04,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2092746.0, ans=0.1 2023-06-28 16:02:33,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2092866.0, ans=0.0 2023-06-28 16:03:35,155 INFO [train.py:996] (1/4) Epoch 12, batch 13400, loss[loss=0.2131, simple_loss=0.2994, pruned_loss=0.06338, over 21857.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3027, pruned_loss=0.07105, over 4271958.35 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:03:52,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2093046.0, ans=0.0 2023-06-28 16:03:54,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.33 vs. limit=12.0 2023-06-28 16:04:22,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2093166.0, ans=0.1 2023-06-28 16:04:44,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.284e+02 9.193e+02 1.348e+03 2.044e+03 4.158e+03, threshold=2.695e+03, percent-clipped=16.0 2023-06-28 16:04:45,400 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-28 16:05:01,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2093286.0, ans=0.125 2023-06-28 16:05:14,145 INFO [train.py:996] (1/4) Epoch 12, batch 13450, loss[loss=0.1985, simple_loss=0.259, pruned_loss=0.06898, over 21332.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3034, pruned_loss=0.07315, over 4277229.50 frames. 
], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:05:43,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2093406.0, ans=15.0 2023-06-28 16:06:58,131 INFO [train.py:996] (1/4) Epoch 12, batch 13500, loss[loss=0.2331, simple_loss=0.3132, pruned_loss=0.07653, over 21731.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2957, pruned_loss=0.07029, over 4264593.38 frames. ], batch size: 391, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:07:02,860 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.53 vs. limit=10.0 2023-06-28 16:07:29,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2093706.0, ans=0.125 2023-06-28 16:07:31,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2093706.0, ans=0.1 2023-06-28 16:07:33,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-28 16:07:41,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2093766.0, ans=0.125 2023-06-28 16:08:13,367 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.877e+02 6.964e+02 1.038e+03 1.541e+03 3.052e+03, threshold=2.076e+03, percent-clipped=2.0 2023-06-28 16:08:15,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2093826.0, ans=0.0 2023-06-28 16:08:29,648 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-28 16:08:43,468 INFO [train.py:996] (1/4) Epoch 12, batch 13550, loss[loss=0.1676, simple_loss=0.2377, pruned_loss=0.04871, over 21712.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2996, pruned_loss=0.06949, over 4258302.00 frames. ], batch size: 112, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:08:49,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2093946.0, ans=0.125 2023-06-28 16:10:26,348 INFO [train.py:996] (1/4) Epoch 12, batch 13600, loss[loss=0.1989, simple_loss=0.286, pruned_loss=0.05594, over 21864.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3002, pruned_loss=0.06951, over 4264158.97 frames. ], batch size: 124, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:10:37,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.82 vs. 
limit=15.0 2023-06-28 16:11:14,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2094366.0, ans=0.07 2023-06-28 16:11:29,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2094426.0, ans=0.125 2023-06-28 16:11:33,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2094426.0, ans=0.125 2023-06-28 16:11:39,040 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.839e+02 7.727e+02 1.210e+03 1.733e+03 4.112e+03, threshold=2.419e+03, percent-clipped=15.0 2023-06-28 16:12:13,570 INFO [train.py:996] (1/4) Epoch 12, batch 13650, loss[loss=0.1779, simple_loss=0.2538, pruned_loss=0.05103, over 21310.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2939, pruned_loss=0.06588, over 4268079.10 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:12:52,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2094666.0, ans=0.0 2023-06-28 16:13:04,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2094666.0, ans=0.0 2023-06-28 16:13:15,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2094726.0, ans=0.125 2023-06-28 16:13:20,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2094726.0, ans=0.125 2023-06-28 16:13:39,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=22.5 2023-06-28 16:13:43,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2094786.0, ans=0.125 2023-06-28 16:13:47,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2094786.0, ans=0.125 2023-06-28 16:13:57,089 INFO [train.py:996] (1/4) Epoch 12, batch 13700, loss[loss=0.205, simple_loss=0.2841, pruned_loss=0.0629, over 21816.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2877, pruned_loss=0.06537, over 4274332.45 frames. ], batch size: 316, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:14:09,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=12.0 2023-06-28 16:14:10,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2094846.0, ans=0.125 2023-06-28 16:14:53,200 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-06-28 16:15:15,691 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.251e+02 7.877e+02 1.121e+03 1.931e+03 5.975e+03, threshold=2.242e+03, percent-clipped=12.0 2023-06-28 16:15:31,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2095086.0, ans=0.0 2023-06-28 16:15:47,585 INFO [train.py:996] (1/4) Epoch 12, batch 13750, loss[loss=0.1821, simple_loss=0.2627, pruned_loss=0.05074, over 21622.00 frames. 
], tot_loss[loss=0.2075, simple_loss=0.2856, pruned_loss=0.06475, over 4271079.33 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:16:01,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.27 vs. limit=22.5 2023-06-28 16:17:29,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2095386.0, ans=0.125 2023-06-28 16:17:34,229 INFO [train.py:996] (1/4) Epoch 12, batch 13800, loss[loss=0.2315, simple_loss=0.3331, pruned_loss=0.06498, over 21601.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2908, pruned_loss=0.06391, over 4267250.72 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:17:47,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2095446.0, ans=0.05 2023-06-28 16:17:52,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2095506.0, ans=0.0 2023-06-28 16:17:53,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2095506.0, ans=0.0 2023-06-28 16:18:03,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2095506.0, ans=0.0 2023-06-28 16:18:14,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2095506.0, ans=0.2 2023-06-28 16:18:56,444 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.457e+02 7.505e+02 1.008e+03 1.759e+03 5.617e+03, threshold=2.016e+03, percent-clipped=13.0 2023-06-28 16:19:00,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2095686.0, ans=0.125 2023-06-28 16:19:02,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2095686.0, ans=10.0 2023-06-28 16:19:18,272 INFO [train.py:996] (1/4) Epoch 12, batch 13850, loss[loss=0.241, simple_loss=0.3351, pruned_loss=0.07345, over 21855.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.3003, pruned_loss=0.06556, over 4264145.14 frames. ], batch size: 371, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:19:28,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2095746.0, ans=0.125 2023-06-28 16:19:53,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2095806.0, ans=0.0 2023-06-28 16:20:06,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2095866.0, ans=0.0 2023-06-28 16:20:12,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2095866.0, ans=0.025 2023-06-28 16:20:16,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2095866.0, ans=0.035 2023-06-28 16:20:36,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2095926.0, ans=0.125 2023-06-28 16:21:05,032 INFO [train.py:996] (1/4) Epoch 12, batch 13900, loss[loss=0.2475, simple_loss=0.3118, pruned_loss=0.09163, over 21790.00 frames. 
], tot_loss[loss=0.2195, simple_loss=0.303, pruned_loss=0.06802, over 4269212.36 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:21:53,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0 2023-06-28 16:22:08,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2096226.0, ans=0.125 2023-06-28 16:22:10,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=22.5 2023-06-28 16:22:11,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-28 16:22:20,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.853e+02 9.390e+02 1.248e+03 1.935e+03 5.140e+03, threshold=2.497e+03, percent-clipped=23.0 2023-06-28 16:22:47,371 INFO [train.py:996] (1/4) Epoch 12, batch 13950, loss[loss=0.235, simple_loss=0.3183, pruned_loss=0.07588, over 21706.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3027, pruned_loss=0.06935, over 4278750.29 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:23:16,873 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-28 16:23:43,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2096466.0, ans=0.125 2023-06-28 16:24:03,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2096526.0, ans=0.2 2023-06-28 16:24:25,023 INFO [train.py:996] (1/4) Epoch 12, batch 14000, loss[loss=0.1764, simple_loss=0.25, pruned_loss=0.0514, over 21266.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2997, pruned_loss=0.06839, over 4268964.41 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:24:43,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2096646.0, ans=0.0 2023-06-28 16:25:45,224 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.785e+02 7.339e+02 1.044e+03 1.507e+03 3.234e+03, threshold=2.088e+03, percent-clipped=5.0 2023-06-28 16:26:11,281 INFO [train.py:996] (1/4) Epoch 12, batch 14050, loss[loss=0.1773, simple_loss=0.2493, pruned_loss=0.05267, over 21551.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.294, pruned_loss=0.06517, over 4268854.63 frames. ], batch size: 213, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:26:22,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2096946.0, ans=0.0 2023-06-28 16:26:27,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2096946.0, ans=0.0 2023-06-28 16:26:42,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2097006.0, ans=0.2 2023-06-28 16:26:49,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.96 vs. 
limit=15.0 2023-06-28 16:27:52,886 INFO [train.py:996] (1/4) Epoch 12, batch 14100, loss[loss=0.2195, simple_loss=0.2897, pruned_loss=0.07468, over 21382.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2869, pruned_loss=0.06466, over 4268397.79 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:28:01,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2097246.0, ans=0.0 2023-06-28 16:28:13,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2097306.0, ans=0.125 2023-06-28 16:28:55,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2097426.0, ans=0.0 2023-06-28 16:29:08,545 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 8.924e+02 1.261e+03 1.864e+03 4.328e+03, threshold=2.523e+03, percent-clipped=18.0 2023-06-28 16:29:29,312 INFO [train.py:996] (1/4) Epoch 12, batch 14150, loss[loss=0.2503, simple_loss=0.3589, pruned_loss=0.07088, over 19815.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2901, pruned_loss=0.06547, over 4256272.75 frames. ], batch size: 702, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:30:06,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2097606.0, ans=0.125 2023-06-28 16:31:07,775 INFO [train.py:996] (1/4) Epoch 12, batch 14200, loss[loss=0.2316, simple_loss=0.2875, pruned_loss=0.08781, over 21494.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2901, pruned_loss=0.06468, over 4242207.47 frames. ], batch size: 473, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:31:10,481 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-28 16:32:21,330 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.921e+02 6.799e+02 8.924e+02 1.241e+03 3.377e+03, threshold=1.785e+03, percent-clipped=4.0 2023-06-28 16:32:23,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2098026.0, ans=0.05 2023-06-28 16:32:46,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2098146.0, ans=0.09899494936611666 2023-06-28 16:32:47,806 INFO [train.py:996] (1/4) Epoch 12, batch 14250, loss[loss=0.2121, simple_loss=0.2719, pruned_loss=0.07616, over 21353.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2842, pruned_loss=0.06434, over 4244899.17 frames. ], batch size: 473, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:32:58,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2098146.0, ans=0.2 2023-06-28 16:33:31,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2098266.0, ans=0.125 2023-06-28 16:33:57,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-28 16:34:05,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.98 vs. 
limit=22.5 2023-06-28 16:34:20,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-28 16:34:24,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2098386.0, ans=0.05 2023-06-28 16:34:34,312 INFO [train.py:996] (1/4) Epoch 12, batch 14300, loss[loss=0.2695, simple_loss=0.3719, pruned_loss=0.08356, over 21791.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2865, pruned_loss=0.06366, over 4242982.89 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:34:38,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2098446.0, ans=0.125 2023-06-28 16:35:55,446 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.012e+02 8.237e+02 1.255e+03 2.124e+03 4.385e+03, threshold=2.511e+03, percent-clipped=34.0 2023-06-28 16:36:17,110 INFO [train.py:996] (1/4) Epoch 12, batch 14350, loss[loss=0.2761, simple_loss=0.3761, pruned_loss=0.08807, over 21510.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2934, pruned_loss=0.06453, over 4235143.39 frames. ], batch size: 507, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:36:17,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2098746.0, ans=0.0 2023-06-28 16:36:53,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2098806.0, ans=0.125 2023-06-28 16:37:53,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-28 16:37:59,238 INFO [train.py:996] (1/4) Epoch 12, batch 14400, loss[loss=0.1924, simple_loss=0.267, pruned_loss=0.05887, over 21841.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2903, pruned_loss=0.06491, over 4248734.38 frames. ], batch size: 333, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:39:02,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2099226.0, ans=0.125 2023-06-28 16:39:18,229 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.847e+02 6.927e+02 1.038e+03 1.645e+03 3.908e+03, threshold=2.076e+03, percent-clipped=8.0 2023-06-28 16:39:39,697 INFO [train.py:996] (1/4) Epoch 12, batch 14450, loss[loss=0.2244, simple_loss=0.2945, pruned_loss=0.07715, over 21742.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2842, pruned_loss=0.06458, over 4247681.81 frames. ], batch size: 112, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:40:04,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. 
limit=22.5 2023-06-28 16:40:06,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2099406.0, ans=0.125 2023-06-28 16:40:48,884 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:41:13,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2099586.0, ans=0.1 2023-06-28 16:41:23,321 INFO [train.py:996] (1/4) Epoch 12, batch 14500, loss[loss=0.1982, simple_loss=0.2847, pruned_loss=0.05587, over 21218.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2796, pruned_loss=0.06409, over 4247964.77 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:41:32,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.21 vs. limit=5.0 2023-06-28 16:42:23,879 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-28 16:42:46,532 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.162e+02 7.722e+02 1.013e+03 1.611e+03 2.945e+03, threshold=2.026e+03, percent-clipped=11.0 2023-06-28 16:42:50,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2099886.0, ans=0.1 2023-06-28 16:42:53,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2099886.0, ans=0.2 2023-06-28 16:43:11,635 INFO [train.py:996] (1/4) Epoch 12, batch 14550, loss[loss=0.2391, simple_loss=0.3172, pruned_loss=0.08053, over 21309.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2849, pruned_loss=0.06633, over 4255026.14 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:43:46,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2100006.0, ans=0.125 2023-06-28 16:44:19,211 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:44:29,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2100126.0, ans=0.125 2023-06-28 16:44:34,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2100186.0, ans=0.125 2023-06-28 16:44:59,759 INFO [train.py:996] (1/4) Epoch 12, batch 14600, loss[loss=0.2326, simple_loss=0.3272, pruned_loss=0.06905, over 21868.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2913, pruned_loss=0.06838, over 4263984.50 frames. 
], batch size: 371, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:45:54,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2100366.0, ans=0.125 2023-06-28 16:45:59,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2100426.0, ans=0.0 2023-06-28 16:46:05,977 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:46:12,061 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.268e+02 8.624e+02 1.300e+03 2.155e+03 4.412e+03, threshold=2.599e+03, percent-clipped=26.0 2023-06-28 16:46:20,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2100486.0, ans=0.0 2023-06-28 16:46:41,590 INFO [train.py:996] (1/4) Epoch 12, batch 14650, loss[loss=0.2039, simple_loss=0.2821, pruned_loss=0.06285, over 21786.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2945, pruned_loss=0.06721, over 4275727.04 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:47:38,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2100666.0, ans=0.125 2023-06-28 16:48:22,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-28 16:48:23,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2100846.0, ans=0.0 2023-06-28 16:48:24,621 INFO [train.py:996] (1/4) Epoch 12, batch 14700, loss[loss=0.1617, simple_loss=0.2348, pruned_loss=0.0443, over 21797.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2895, pruned_loss=0.06241, over 4266419.62 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:48:30,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2100846.0, ans=0.0 2023-06-28 16:48:40,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2100846.0, ans=0.0 2023-06-28 16:48:57,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-28 16:49:27,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2101026.0, ans=0.0 2023-06-28 16:49:34,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-28 16:49:40,143 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.485e+02 7.512e+02 1.036e+03 1.553e+03 3.154e+03, threshold=2.072e+03, percent-clipped=4.0 2023-06-28 16:49:41,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. 
limit=15.0 2023-06-28 16:49:50,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2101086.0, ans=0.125 2023-06-28 16:50:02,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2101086.0, ans=0.2 2023-06-28 16:50:04,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-28 16:50:14,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2101146.0, ans=0.0 2023-06-28 16:50:15,501 INFO [train.py:996] (1/4) Epoch 12, batch 14750, loss[loss=0.2693, simple_loss=0.3372, pruned_loss=0.1007, over 21610.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2937, pruned_loss=0.06389, over 4260855.36 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:51:20,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2101326.0, ans=0.0 2023-06-28 16:51:31,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2101326.0, ans=0.2 2023-06-28 16:51:58,894 INFO [train.py:996] (1/4) Epoch 12, batch 14800, loss[loss=0.2231, simple_loss=0.2954, pruned_loss=0.07544, over 21734.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.304, pruned_loss=0.06897, over 4258276.74 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:52:16,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2101446.0, ans=0.1 2023-06-28 16:52:44,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2101566.0, ans=0.0 2023-06-28 16:53:10,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2101626.0, ans=0.0 2023-06-28 16:53:24,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.184e+02 7.719e+02 1.255e+03 2.135e+03 5.182e+03, threshold=2.510e+03, percent-clipped=29.0 2023-06-28 16:53:50,387 INFO [train.py:996] (1/4) Epoch 12, batch 14850, loss[loss=0.1853, simple_loss=0.2604, pruned_loss=0.05513, over 21786.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2971, pruned_loss=0.06831, over 4257712.39 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:53:56,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2101746.0, ans=0.0 2023-06-28 16:54:50,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2101866.0, ans=0.2 2023-06-28 16:54:54,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2101926.0, ans=0.025 2023-06-28 16:55:02,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2101926.0, ans=0.125 2023-06-28 16:55:26,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2101986.0, ans=0.125 2023-06-28 16:55:34,852 INFO [train.py:996] (1/4) Epoch 12, batch 14900, loss[loss=0.2438, simple_loss=0.321, pruned_loss=0.08334, over 21336.00 frames. 
], tot_loss[loss=0.2212, simple_loss=0.3007, pruned_loss=0.07086, over 4255353.91 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:55:50,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2102106.0, ans=0.125 2023-06-28 16:56:02,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2102106.0, ans=0.125 2023-06-28 16:56:06,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2102106.0, ans=0.125 2023-06-28 16:56:34,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2102226.0, ans=0.2 2023-06-28 16:56:55,256 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 9.079e+02 1.317e+03 1.882e+03 4.138e+03, threshold=2.634e+03, percent-clipped=10.0 2023-06-28 16:57:14,113 INFO [train.py:996] (1/4) Epoch 12, batch 14950, loss[loss=0.207, simple_loss=0.29, pruned_loss=0.06199, over 21742.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3013, pruned_loss=0.07063, over 4263301.14 frames. ], batch size: 247, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:57:42,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-06-28 16:57:48,613 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-06-28 16:58:03,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-28 16:58:41,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2102586.0, ans=0.0 2023-06-28 16:58:43,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2102586.0, ans=0.125 2023-06-28 16:58:52,428 INFO [train.py:996] (1/4) Epoch 12, batch 15000, loss[loss=0.2066, simple_loss=0.2773, pruned_loss=0.0679, over 21657.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3037, pruned_loss=0.07212, over 4258546.40 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:58:52,428 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 16:59:11,969 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2573, simple_loss=0.3458, pruned_loss=0.08437, over 1796401.00 frames. 
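The validation entries above report an aggregate loss accumulated over the dev set (here over roughly 1.8M frames), and the nearby "Maximum memory allocated" entries report peak GPU memory. A minimal sketch of how such figures can be produced with PyTorch follows; it is illustrative only and not the actual train.py code, and validate, compute_loss, and peak_memory_mb are placeholder names introduced here for the example.

import torch

def validate(model, dev_loader, compute_loss, device):
    # Illustrative sketch only: accumulate an aggregate loss over a dev set.
    # compute_loss is a placeholder callable returning (loss_tensor, num_frames).
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch, device)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    return tot_loss / max(tot_frames, 1.0)

def peak_memory_mb(device=0):
    # Peak-memory entries like "Maximum memory allocated so far is ...MB"
    # can be obtained from a query of this kind.
    return torch.cuda.max_memory_allocated(device) // (1024 * 1024)

Weighting each batch by its frame count keeps long and short cuts comparable when the per-batch losses are averaged into the single validation figure logged above.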
2023-06-28 16:59:11,970 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 16:59:16,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2102646.0, ans=0.125 2023-06-28 16:59:16,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2102646.0, ans=0.0 2023-06-28 16:59:43,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2102706.0, ans=0.2 2023-06-28 17:00:07,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2102766.0, ans=0.125 2023-06-28 17:00:10,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2102766.0, ans=0.0 2023-06-28 17:00:25,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2102826.0, ans=0.0 2023-06-28 17:00:28,471 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.412e+02 7.471e+02 1.040e+03 1.542e+03 3.461e+03, threshold=2.079e+03, percent-clipped=2.0 2023-06-28 17:00:55,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-28 17:00:57,520 INFO [train.py:996] (1/4) Epoch 12, batch 15050, loss[loss=0.2406, simple_loss=0.3348, pruned_loss=0.07321, over 21674.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3032, pruned_loss=0.07256, over 4252865.88 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:01:03,493 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:01:41,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2103066.0, ans=0.1 2023-06-28 17:02:02,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2103126.0, ans=0.04949747468305833 2023-06-28 17:02:09,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2103126.0, ans=0.2 2023-06-28 17:02:44,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2103246.0, ans=0.2 2023-06-28 17:02:45,490 INFO [train.py:996] (1/4) Epoch 12, batch 15100, loss[loss=0.2392, simple_loss=0.316, pruned_loss=0.08125, over 21700.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.307, pruned_loss=0.07268, over 4257594.85 frames. 
], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:03:29,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2103366.0, ans=0.0 2023-06-28 17:03:32,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2103366.0, ans=0.2 2023-06-28 17:03:39,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2103426.0, ans=0.125 2023-06-28 17:03:42,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2103426.0, ans=0.125 2023-06-28 17:03:55,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2103426.0, ans=0.2 2023-06-28 17:04:04,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.134e+02 7.864e+02 1.140e+03 1.681e+03 3.504e+03, threshold=2.280e+03, percent-clipped=13.0 2023-06-28 17:04:27,710 INFO [train.py:996] (1/4) Epoch 12, batch 15150, loss[loss=0.2081, simple_loss=0.2738, pruned_loss=0.07119, over 21555.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.303, pruned_loss=0.0726, over 4254078.67 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:05:35,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2103726.0, ans=0.125 2023-06-28 17:05:42,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-28 17:05:58,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.27 vs. limit=15.0 2023-06-28 17:06:00,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-28 17:06:10,667 INFO [train.py:996] (1/4) Epoch 12, batch 15200, loss[loss=0.1832, simple_loss=0.2612, pruned_loss=0.05259, over 21278.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2931, pruned_loss=0.06881, over 4261422.60 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 17:07:34,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.768e+02 7.003e+02 9.713e+02 1.349e+03 2.577e+03, threshold=1.943e+03, percent-clipped=4.0 2023-06-28 17:07:52,327 INFO [train.py:996] (1/4) Epoch 12, batch 15250, loss[loss=0.2233, simple_loss=0.3121, pruned_loss=0.06728, over 19762.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2879, pruned_loss=0.06717, over 4264322.56 frames. ], batch size: 704, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:07:52,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2104146.0, ans=0.125 2023-06-28 17:08:19,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2104206.0, ans=0.125 2023-06-28 17:09:13,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2104326.0, ans=0.2 2023-06-28 17:09:23,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. 
limit=22.5 2023-06-28 17:09:34,115 INFO [train.py:996] (1/4) Epoch 12, batch 15300, loss[loss=0.247, simple_loss=0.3211, pruned_loss=0.08652, over 21227.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2897, pruned_loss=0.06889, over 4266642.08 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:09:56,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2104506.0, ans=0.0 2023-06-28 17:10:37,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2104566.0, ans=0.0 2023-06-28 17:11:01,319 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.828e+02 9.652e+02 1.202e+03 1.838e+03 3.602e+03, threshold=2.404e+03, percent-clipped=24.0 2023-06-28 17:11:17,481 INFO [train.py:996] (1/4) Epoch 12, batch 15350, loss[loss=0.2623, simple_loss=0.3374, pruned_loss=0.09364, over 21277.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2965, pruned_loss=0.07084, over 4263342.78 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:11:30,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2104746.0, ans=0.0 2023-06-28 17:11:44,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2104806.0, ans=0.1 2023-06-28 17:12:33,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2104926.0, ans=0.125 2023-06-28 17:12:37,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2104926.0, ans=0.0 2023-06-28 17:12:37,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-28 17:12:56,971 INFO [train.py:996] (1/4) Epoch 12, batch 15400, loss[loss=0.201, simple_loss=0.2854, pruned_loss=0.05835, over 21504.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2959, pruned_loss=0.06954, over 4265982.90 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:14:00,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2105226.0, ans=0.09899494936611666 2023-06-28 17:14:06,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2105226.0, ans=0.0 2023-06-28 17:14:16,124 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.384e+02 7.580e+02 1.010e+03 1.519e+03 4.001e+03, threshold=2.021e+03, percent-clipped=6.0 2023-06-28 17:14:37,992 INFO [train.py:996] (1/4) Epoch 12, batch 15450, loss[loss=0.2003, simple_loss=0.2714, pruned_loss=0.06463, over 21554.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2931, pruned_loss=0.06881, over 4269859.33 frames. ], batch size: 548, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:15:13,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2105406.0, ans=0.5 2023-06-28 17:15:33,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2105466.0, ans=0.0 2023-06-28 17:16:20,921 INFO [train.py:996] (1/4) Epoch 12, batch 15500, loss[loss=0.2218, simple_loss=0.303, pruned_loss=0.07029, over 20666.00 frames. 
], tot_loss[loss=0.2169, simple_loss=0.2965, pruned_loss=0.06861, over 4270130.79 frames. ], batch size: 607, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:16:21,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2105646.0, ans=0.0 2023-06-28 17:16:21,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2105646.0, ans=0.0 2023-06-28 17:16:21,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2105646.0, ans=0.125 2023-06-28 17:16:23,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2105646.0, ans=0.1 2023-06-28 17:17:24,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-28 17:17:42,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-28 17:17:46,454 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.645e+02 8.205e+02 1.251e+03 1.746e+03 3.424e+03, threshold=2.502e+03, percent-clipped=13.0 2023-06-28 17:18:07,394 INFO [train.py:996] (1/4) Epoch 12, batch 15550, loss[loss=0.1956, simple_loss=0.2856, pruned_loss=0.05276, over 21689.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2953, pruned_loss=0.06643, over 4270933.57 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:18:19,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2105946.0, ans=10.0 2023-06-28 17:19:50,314 INFO [train.py:996] (1/4) Epoch 12, batch 15600, loss[loss=0.2322, simple_loss=0.2992, pruned_loss=0.08263, over 21398.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2891, pruned_loss=0.06536, over 4266254.16 frames. ], batch size: 508, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 17:19:58,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-28 17:20:13,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2106306.0, ans=0.0 2023-06-28 17:20:35,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2106366.0, ans=0.95 2023-06-28 17:21:08,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.963e+02 9.239e+02 1.318e+03 1.838e+03 4.350e+03, threshold=2.636e+03, percent-clipped=8.0 2023-06-28 17:21:29,749 INFO [train.py:996] (1/4) Epoch 12, batch 15650, loss[loss=0.1938, simple_loss=0.266, pruned_loss=0.06076, over 21487.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2877, pruned_loss=0.06437, over 4272192.98 frames. 
], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:21:55,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2106606.0, ans=15.0 2023-06-28 17:22:14,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2106666.0, ans=0.125 2023-06-28 17:22:43,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2106726.0, ans=0.025 2023-06-28 17:23:12,646 INFO [train.py:996] (1/4) Epoch 12, batch 15700, loss[loss=0.2613, simple_loss=0.3798, pruned_loss=0.0714, over 19802.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2861, pruned_loss=0.06392, over 4264673.87 frames. ], batch size: 702, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:24:24,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. limit=10.0 2023-06-28 17:24:39,915 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 8.496e+02 1.514e+03 2.181e+03 4.345e+03, threshold=3.028e+03, percent-clipped=16.0 2023-06-28 17:24:54,677 INFO [train.py:996] (1/4) Epoch 12, batch 15750, loss[loss=0.1989, simple_loss=0.272, pruned_loss=0.06292, over 22002.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2822, pruned_loss=0.06372, over 4255242.09 frames. ], batch size: 103, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:24:58,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2107146.0, ans=0.0 2023-06-28 17:25:22,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.92 vs. limit=22.5 2023-06-28 17:25:23,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2107206.0, ans=0.0 2023-06-28 17:26:35,174 INFO [train.py:996] (1/4) Epoch 12, batch 15800, loss[loss=0.1883, simple_loss=0.2598, pruned_loss=0.05834, over 21797.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2773, pruned_loss=0.06354, over 4251397.27 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:27:02,769 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=22.5 2023-06-28 17:27:16,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2107566.0, ans=0.1 2023-06-28 17:27:18,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2107566.0, ans=0.125 2023-06-28 17:27:29,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2107566.0, ans=0.125 2023-06-28 17:27:46,656 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-28 17:28:01,452 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.867e+02 7.167e+02 8.955e+02 1.687e+03 3.256e+03, threshold=1.791e+03, percent-clipped=1.0 2023-06-28 17:28:16,331 INFO [train.py:996] (1/4) Epoch 12, batch 15850, loss[loss=0.2125, simple_loss=0.2849, pruned_loss=0.07005, over 21704.00 frames. 
], tot_loss[loss=0.2061, simple_loss=0.28, pruned_loss=0.06608, over 4260165.81 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:28:24,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2107746.0, ans=0.2 2023-06-28 17:28:51,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2107806.0, ans=0.2 2023-06-28 17:28:51,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-28 17:28:58,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-06-28 17:29:01,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2107866.0, ans=0.125 2023-06-28 17:29:01,789 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-28 17:29:33,337 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-28 17:29:34,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2107986.0, ans=0.125 2023-06-28 17:29:41,608 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-28 17:29:53,387 INFO [train.py:996] (1/4) Epoch 12, batch 15900, loss[loss=0.1999, simple_loss=0.2613, pruned_loss=0.06918, over 21317.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2771, pruned_loss=0.06593, over 4249667.02 frames. ], batch size: 549, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:31:15,603 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.381e+02 9.815e+02 1.486e+03 2.540e+03, threshold=1.963e+03, percent-clipped=11.0 2023-06-28 17:31:34,457 INFO [train.py:996] (1/4) Epoch 12, batch 15950, loss[loss=0.2331, simple_loss=0.313, pruned_loss=0.07659, over 21595.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2803, pruned_loss=0.06369, over 4255879.33 frames. ], batch size: 508, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:32:33,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2108466.0, ans=0.1 2023-06-28 17:32:48,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2108526.0, ans=0.125 2023-06-28 17:33:11,198 INFO [train.py:996] (1/4) Epoch 12, batch 16000, loss[loss=0.1856, simple_loss=0.2811, pruned_loss=0.0451, over 21758.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2828, pruned_loss=0.06247, over 4269540.04 frames. 
], batch size: 247, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:33:19,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2108646.0, ans=0.1 2023-06-28 17:33:44,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2108706.0, ans=0.2 2023-06-28 17:34:02,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2108766.0, ans=0.125 2023-06-28 17:34:39,581 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.641e+02 6.614e+02 9.934e+02 1.443e+03 3.349e+03, threshold=1.987e+03, percent-clipped=8.0 2023-06-28 17:34:52,878 INFO [train.py:996] (1/4) Epoch 12, batch 16050, loss[loss=0.2127, simple_loss=0.3175, pruned_loss=0.05394, over 21782.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2859, pruned_loss=0.06101, over 4269089.99 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:34:54,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.32 vs. limit=10.0 2023-06-28 17:35:06,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2108946.0, ans=0.125 2023-06-28 17:35:13,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.55 vs. limit=15.0 2023-06-28 17:35:54,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2109126.0, ans=0.125 2023-06-28 17:36:02,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2109126.0, ans=0.0 2023-06-28 17:36:28,135 INFO [train.py:996] (1/4) Epoch 12, batch 16100, loss[loss=0.2188, simple_loss=0.2875, pruned_loss=0.0751, over 21269.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2903, pruned_loss=0.06298, over 4270143.82 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:37:21,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2109366.0, ans=0.125 2023-06-28 17:37:33,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2109426.0, ans=0.0 2023-06-28 17:37:52,806 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.516e+02 1.028e+03 1.550e+03 2.496e+03 6.023e+03, threshold=3.100e+03, percent-clipped=39.0 2023-06-28 17:38:06,337 INFO [train.py:996] (1/4) Epoch 12, batch 16150, loss[loss=0.243, simple_loss=0.3064, pruned_loss=0.08982, over 21742.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2882, pruned_loss=0.0649, over 4280436.53 frames. ], batch size: 473, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:38:48,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2109666.0, ans=0.125 2023-06-28 17:39:04,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2109666.0, ans=0.09899494936611666 2023-06-28 17:39:49,881 INFO [train.py:996] (1/4) Epoch 12, batch 16200, loss[loss=0.2323, simple_loss=0.2985, pruned_loss=0.08305, over 22036.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.293, pruned_loss=0.06632, over 4287020.29 frames. 
], batch size: 416, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:41:21,145 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.167e+02 9.228e+02 1.472e+03 2.186e+03 5.217e+03, threshold=2.943e+03, percent-clipped=8.0 2023-06-28 17:41:39,810 INFO [train.py:996] (1/4) Epoch 12, batch 16250, loss[loss=0.2157, simple_loss=0.2824, pruned_loss=0.07448, over 21428.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2935, pruned_loss=0.06621, over 4286304.51 frames. ], batch size: 508, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:42:06,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2110206.0, ans=0.125 2023-06-28 17:42:40,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2110326.0, ans=0.0 2023-06-28 17:42:45,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2110326.0, ans=0.0 2023-06-28 17:42:51,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2110326.0, ans=0.125 2023-06-28 17:43:21,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2110446.0, ans=0.0 2023-06-28 17:43:22,708 INFO [train.py:996] (1/4) Epoch 12, batch 16300, loss[loss=0.1805, simple_loss=0.2797, pruned_loss=0.04071, over 21601.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2866, pruned_loss=0.0623, over 4282327.46 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:43:43,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2110506.0, ans=0.125 2023-06-28 17:43:44,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2110506.0, ans=0.125 2023-06-28 17:43:48,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2110506.0, ans=0.125 2023-06-28 17:44:09,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2110566.0, ans=0.0 2023-06-28 17:44:35,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.42 vs. limit=10.0 2023-06-28 17:44:42,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2110686.0, ans=0.0 2023-06-28 17:44:47,671 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.729e+02 7.899e+02 1.103e+03 1.681e+03 3.393e+03, threshold=2.206e+03, percent-clipped=5.0 2023-06-28 17:45:06,099 INFO [train.py:996] (1/4) Epoch 12, batch 16350, loss[loss=0.2995, simple_loss=0.3546, pruned_loss=0.1222, over 21289.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2862, pruned_loss=0.06315, over 4269773.58 frames. ], batch size: 507, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:46:26,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2110986.0, ans=0.125 2023-06-28 17:46:53,876 INFO [train.py:996] (1/4) Epoch 12, batch 16400, loss[loss=0.2139, simple_loss=0.2939, pruned_loss=0.06697, over 21830.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2904, pruned_loss=0.06546, over 4272091.09 frames. 
], batch size: 391, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 17:47:08,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2111046.0, ans=0.125 2023-06-28 17:47:30,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2111106.0, ans=0.04949747468305833 2023-06-28 17:47:40,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2111166.0, ans=0.125 2023-06-28 17:48:11,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2111286.0, ans=0.1 2023-06-28 17:48:15,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2111286.0, ans=0.1 2023-06-28 17:48:16,143 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.551e+02 7.002e+02 9.291e+02 1.321e+03 2.557e+03, threshold=1.858e+03, percent-clipped=4.0 2023-06-28 17:48:37,402 INFO [train.py:996] (1/4) Epoch 12, batch 16450, loss[loss=0.2049, simple_loss=0.2785, pruned_loss=0.06566, over 21152.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2898, pruned_loss=0.06606, over 4280314.25 frames. ], batch size: 608, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:49:51,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2111526.0, ans=0.2 2023-06-28 17:49:54,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2111586.0, ans=0.125 2023-06-28 17:50:12,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2111586.0, ans=0.0 2023-06-28 17:50:20,646 INFO [train.py:996] (1/4) Epoch 12, batch 16500, loss[loss=0.1737, simple_loss=0.2361, pruned_loss=0.05564, over 21398.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2873, pruned_loss=0.06635, over 4283757.49 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:50:47,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-28 17:50:51,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2111706.0, ans=0.1 2023-06-28 17:51:08,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2111766.0, ans=0.125 2023-06-28 17:51:08,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2111766.0, ans=0.125 2023-06-28 17:51:52,098 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.579e+02 1.164e+03 1.772e+03 4.926e+03, threshold=2.328e+03, percent-clipped=21.0 2023-06-28 17:52:03,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2111946.0, ans=0.125 2023-06-28 17:52:09,283 INFO [train.py:996] (1/4) Epoch 12, batch 16550, loss[loss=0.2451, simple_loss=0.3299, pruned_loss=0.0801, over 21602.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2846, pruned_loss=0.0639, over 4274940.16 frames. 
], batch size: 414, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:52:13,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2111946.0, ans=0.0 2023-06-28 17:52:20,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2111946.0, ans=0.125 2023-06-28 17:53:21,898 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:53:45,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2112186.0, ans=0.0 2023-06-28 17:53:54,962 INFO [train.py:996] (1/4) Epoch 12, batch 16600, loss[loss=0.2848, simple_loss=0.3887, pruned_loss=0.09039, over 21643.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2911, pruned_loss=0.06579, over 4276628.31 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:53:59,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2112246.0, ans=0.1 2023-06-28 17:54:58,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2112366.0, ans=0.1 2023-06-28 17:55:04,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2112426.0, ans=0.0 2023-06-28 17:55:18,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2112426.0, ans=0.0 2023-06-28 17:55:27,796 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.049e+02 7.685e+02 9.523e+02 1.400e+03 3.440e+03, threshold=1.905e+03, percent-clipped=5.0 2023-06-28 17:55:40,081 INFO [train.py:996] (1/4) Epoch 12, batch 16650, loss[loss=0.2309, simple_loss=0.3199, pruned_loss=0.07092, over 21697.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3034, pruned_loss=0.06873, over 4274948.43 frames. ], batch size: 351, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:55:42,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2112546.0, ans=0.0 2023-06-28 17:55:44,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=2112546.0, ans=0.2 2023-06-28 17:56:14,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2112606.0, ans=0.2 2023-06-28 17:56:37,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2112666.0, ans=0.0 2023-06-28 17:57:17,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.90 vs. limit=12.0 2023-06-28 17:57:35,421 INFO [train.py:996] (1/4) Epoch 12, batch 16700, loss[loss=0.2268, simple_loss=0.3324, pruned_loss=0.06059, over 20773.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3055, pruned_loss=0.0698, over 4278839.52 frames. 
], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:57:37,821 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:58:00,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2112906.0, ans=0.1 2023-06-28 17:58:28,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2112966.0, ans=0.125 2023-06-28 17:59:08,949 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.823e+02 8.945e+02 1.338e+03 1.942e+03 4.278e+03, threshold=2.675e+03, percent-clipped=28.0 2023-06-28 17:59:26,147 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.14 vs. limit=10.0 2023-06-28 17:59:26,623 INFO [train.py:996] (1/4) Epoch 12, batch 16750, loss[loss=0.2755, simple_loss=0.3492, pruned_loss=0.1009, over 21468.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3076, pruned_loss=0.0718, over 4268835.38 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:59:52,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2113206.0, ans=0.125 2023-06-28 18:00:00,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2113206.0, ans=0.0 2023-06-28 18:00:11,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2113266.0, ans=0.0 2023-06-28 18:00:13,133 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:00:57,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2113386.0, ans=10.0 2023-06-28 18:01:11,587 INFO [train.py:996] (1/4) Epoch 12, batch 16800, loss[loss=0.247, simple_loss=0.3312, pruned_loss=0.08138, over 21802.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3121, pruned_loss=0.07217, over 4271796.84 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 18:01:54,237 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=12.0 2023-06-28 18:02:08,476 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:02:36,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2113686.0, ans=0.0 2023-06-28 18:02:44,388 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.630e+02 9.200e+02 1.390e+03 2.563e+03 4.897e+03, threshold=2.780e+03, percent-clipped=19.0 2023-06-28 18:02:58,992 INFO [train.py:996] (1/4) Epoch 12, batch 16850, loss[loss=0.2007, simple_loss=0.2767, pruned_loss=0.06241, over 21384.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3079, pruned_loss=0.07164, over 4272066.61 frames. 
], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:02:59,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2113746.0, ans=0.2 2023-06-28 18:02:59,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2113746.0, ans=0.125 2023-06-28 18:03:03,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2113746.0, ans=0.07 2023-06-28 18:03:19,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2113806.0, ans=0.0 2023-06-28 18:03:30,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=12.0 2023-06-28 18:03:55,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2113866.0, ans=0.125 2023-06-28 18:04:17,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2113926.0, ans=0.0 2023-06-28 18:04:39,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2114046.0, ans=0.125 2023-06-28 18:04:40,768 INFO [train.py:996] (1/4) Epoch 12, batch 16900, loss[loss=0.1732, simple_loss=0.2579, pruned_loss=0.04423, over 21611.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3024, pruned_loss=0.07041, over 4267381.46 frames. ], batch size: 263, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:04:51,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2114046.0, ans=0.125 2023-06-28 18:05:49,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2114226.0, ans=0.125 2023-06-28 18:06:04,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-28 18:06:08,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.316e+02 1.157e+03 1.734e+03 4.199e+03, threshold=2.313e+03, percent-clipped=8.0 2023-06-28 18:06:21,753 INFO [train.py:996] (1/4) Epoch 12, batch 16950, loss[loss=0.2069, simple_loss=0.2897, pruned_loss=0.06205, over 15791.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2958, pruned_loss=0.06898, over 4267506.40 frames. ], batch size: 60, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:06:52,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2114406.0, ans=0.07 2023-06-28 18:07:49,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-28 18:07:59,334 INFO [train.py:996] (1/4) Epoch 12, batch 17000, loss[loss=0.1967, simple_loss=0.266, pruned_loss=0.06365, over 21677.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2927, pruned_loss=0.06936, over 4276192.24 frames. 
], batch size: 230, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:07:59,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2114646.0, ans=0.0 2023-06-28 18:07:59,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2114646.0, ans=0.05 2023-06-28 18:08:06,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2114646.0, ans=0.07 2023-06-28 18:08:17,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2114706.0, ans=0.125 2023-06-28 18:08:33,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2114706.0, ans=0.0 2023-06-28 18:09:10,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2114826.0, ans=0.125 2023-06-28 18:09:29,806 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 1.097e+03 1.381e+03 1.822e+03 3.953e+03, threshold=2.762e+03, percent-clipped=12.0 2023-06-28 18:09:30,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2114886.0, ans=0.125 2023-06-28 18:09:40,775 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0 2023-06-28 18:09:41,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2114946.0, ans=0.125 2023-06-28 18:09:42,674 INFO [train.py:996] (1/4) Epoch 12, batch 17050, loss[loss=0.2467, simple_loss=0.3299, pruned_loss=0.08177, over 21846.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3006, pruned_loss=0.07243, over 4284415.24 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:09:53,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2114946.0, ans=0.5 2023-06-28 18:10:05,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-28 18:10:19,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2115066.0, ans=0.125 2023-06-28 18:11:16,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-28 18:11:18,405 INFO [train.py:996] (1/4) Epoch 12, batch 17100, loss[loss=0.2193, simple_loss=0.2903, pruned_loss=0.07416, over 21845.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2999, pruned_loss=0.07281, over 4287288.08 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:11:25,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2115246.0, ans=0.125 2023-06-28 18:12:22,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=15.0 2023-06-28 18:12:35,678 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.04 vs. limit=10.0 2023-06-28 18:12:43,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2115486.0, ans=0.1 2023-06-28 18:12:52,863 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.623e+02 7.702e+02 1.047e+03 1.626e+03 3.499e+03, threshold=2.095e+03, percent-clipped=2.0 2023-06-28 18:13:01,314 INFO [train.py:996] (1/4) Epoch 12, batch 17150, loss[loss=0.1651, simple_loss=0.2453, pruned_loss=0.04243, over 21399.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2957, pruned_loss=0.07227, over 4291265.62 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:14:32,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-06-28 18:14:40,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2115786.0, ans=0.0 2023-06-28 18:14:44,924 INFO [train.py:996] (1/4) Epoch 12, batch 17200, loss[loss=0.2135, simple_loss=0.2901, pruned_loss=0.06848, over 20708.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2952, pruned_loss=0.07135, over 4291109.93 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:14:57,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2115846.0, ans=0.125 2023-06-28 18:16:02,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=15.0 2023-06-28 18:16:07,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2116026.0, ans=0.0 2023-06-28 18:16:20,212 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.974e+02 7.324e+02 9.389e+02 1.283e+03 2.769e+03, threshold=1.878e+03, percent-clipped=7.0 2023-06-28 18:16:24,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-06-28 18:16:33,061 INFO [train.py:996] (1/4) Epoch 12, batch 17250, loss[loss=0.2323, simple_loss=0.3162, pruned_loss=0.07415, over 21584.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.298, pruned_loss=0.0728, over 4289190.38 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:17:13,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2116266.0, ans=0.125 2023-06-28 18:17:23,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2116266.0, ans=0.125 2023-06-28 18:18:07,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2116386.0, ans=0.125 2023-06-28 18:18:15,684 INFO [train.py:996] (1/4) Epoch 12, batch 17300, loss[loss=0.25, simple_loss=0.3171, pruned_loss=0.09145, over 21376.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3052, pruned_loss=0.07507, over 4282981.70 frames. 
], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:18:31,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2116446.0, ans=0.0 2023-06-28 18:19:15,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.54 vs. limit=22.5 2023-06-28 18:19:35,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-28 18:19:48,135 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 8.589e+02 1.215e+03 1.645e+03 3.725e+03, threshold=2.430e+03, percent-clipped=16.0 2023-06-28 18:19:59,784 INFO [train.py:996] (1/4) Epoch 12, batch 17350, loss[loss=0.2195, simple_loss=0.2992, pruned_loss=0.0699, over 20680.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3064, pruned_loss=0.07479, over 4278918.55 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:21:10,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-28 18:21:11,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2116926.0, ans=0.125 2023-06-28 18:21:42,602 INFO [train.py:996] (1/4) Epoch 12, batch 17400, loss[loss=0.2425, simple_loss=0.3387, pruned_loss=0.07318, over 21211.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3041, pruned_loss=0.07148, over 4278794.98 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:21:55,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2117046.0, ans=0.015 2023-06-28 18:22:08,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2117106.0, ans=0.2 2023-06-28 18:22:10,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2117106.0, ans=15.0 2023-06-28 18:22:15,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-28 18:22:46,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2117226.0, ans=0.125 2023-06-28 18:22:54,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2117226.0, ans=0.125 2023-06-28 18:23:02,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2117286.0, ans=0.125 2023-06-28 18:23:13,920 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.979e+02 8.447e+02 1.378e+03 1.932e+03 4.918e+03, threshold=2.756e+03, percent-clipped=14.0 2023-06-28 18:23:17,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2117286.0, ans=0.125 2023-06-28 18:23:20,614 INFO [train.py:996] (1/4) Epoch 12, batch 17450, loss[loss=0.1833, simple_loss=0.2778, pruned_loss=0.04438, over 21730.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3013, pruned_loss=0.06972, over 4268177.09 frames. 
], batch size: 298, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:23:22,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2117346.0, ans=0.1 2023-06-28 18:23:54,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2117406.0, ans=0.0 2023-06-28 18:23:54,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2117406.0, ans=0.125 2023-06-28 18:23:57,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2117466.0, ans=0.125 2023-06-28 18:24:05,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2117466.0, ans=0.0 2023-06-28 18:24:37,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2117526.0, ans=0.125 2023-06-28 18:24:57,149 INFO [train.py:996] (1/4) Epoch 12, batch 17500, loss[loss=0.2016, simple_loss=0.2796, pruned_loss=0.06181, over 21631.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2962, pruned_loss=0.06722, over 4273409.31 frames. ], batch size: 263, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:25:18,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2117706.0, ans=0.125 2023-06-28 18:26:30,506 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.515e+02 7.082e+02 9.304e+02 1.343e+03 2.877e+03, threshold=1.861e+03, percent-clipped=1.0 2023-06-28 18:26:36,951 INFO [train.py:996] (1/4) Epoch 12, batch 17550, loss[loss=0.2185, simple_loss=0.3054, pruned_loss=0.06581, over 21251.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2955, pruned_loss=0.0654, over 4278966.54 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:26:37,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2117946.0, ans=0.125 2023-06-28 18:27:35,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2118066.0, ans=0.1 2023-06-28 18:28:00,422 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:28:18,262 INFO [train.py:996] (1/4) Epoch 12, batch 17600, loss[loss=0.2761, simple_loss=0.3416, pruned_loss=0.1053, over 21412.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2994, pruned_loss=0.06608, over 4270637.30 frames. 
], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:28:25,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2118246.0, ans=0.04949747468305833 2023-06-28 18:28:48,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2118306.0, ans=0.125 2023-06-28 18:29:27,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2118426.0, ans=0.125 2023-06-28 18:29:29,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2118426.0, ans=0.125 2023-06-28 18:29:51,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.265e+02 8.846e+02 1.006e+03 1.368e+03 3.785e+03, threshold=2.012e+03, percent-clipped=6.0 2023-06-28 18:29:56,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.52 vs. limit=6.0 2023-06-28 18:30:03,166 INFO [train.py:996] (1/4) Epoch 12, batch 17650, loss[loss=0.1799, simple_loss=0.2659, pruned_loss=0.04691, over 21590.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2966, pruned_loss=0.06608, over 4277751.88 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:30:03,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2118546.0, ans=0.0 2023-06-28 18:31:04,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2118666.0, ans=0.125 2023-06-28 18:31:10,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-06-28 18:31:42,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2118786.0, ans=0.2 2023-06-28 18:31:46,605 INFO [train.py:996] (1/4) Epoch 12, batch 17700, loss[loss=0.2215, simple_loss=0.3154, pruned_loss=0.06379, over 21957.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2904, pruned_loss=0.06419, over 4270461.85 frames. ], batch size: 317, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:32:00,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2118846.0, ans=0.125 2023-06-28 18:32:20,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2118906.0, ans=0.05 2023-06-28 18:33:19,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.445e+02 8.687e+02 1.297e+03 2.273e+03 4.187e+03, threshold=2.595e+03, percent-clipped=29.0 2023-06-28 18:33:26,139 INFO [train.py:996] (1/4) Epoch 12, batch 17750, loss[loss=0.2486, simple_loss=0.3264, pruned_loss=0.08543, over 21802.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2979, pruned_loss=0.06714, over 4276302.69 frames. 
], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:33:26,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2119146.0, ans=0.1 2023-06-28 18:33:26,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2119146.0, ans=0.1 2023-06-28 18:33:54,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2119146.0, ans=0.0 2023-06-28 18:34:00,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-28 18:34:01,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2119206.0, ans=0.05 2023-06-28 18:34:25,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2119266.0, ans=0.125 2023-06-28 18:34:25,943 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-28 18:35:20,466 INFO [train.py:996] (1/4) Epoch 12, batch 17800, loss[loss=0.2114, simple_loss=0.2811, pruned_loss=0.07085, over 21129.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2973, pruned_loss=0.06664, over 4274524.33 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:35:43,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2119506.0, ans=0.125 2023-06-28 18:36:19,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2119626.0, ans=0.125 2023-06-28 18:36:52,586 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.029e+02 8.129e+02 1.136e+03 1.993e+03 4.758e+03, threshold=2.272e+03, percent-clipped=17.0 2023-06-28 18:36:59,634 INFO [train.py:996] (1/4) Epoch 12, batch 17850, loss[loss=0.2434, simple_loss=0.312, pruned_loss=0.08735, over 21267.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2979, pruned_loss=0.06688, over 4272940.66 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:37:03,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2119746.0, ans=0.125 2023-06-28 18:37:09,892 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2023-06-28 18:37:11,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2119746.0, ans=0.1 2023-06-28 18:37:23,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2119806.0, ans=0.0 2023-06-28 18:37:34,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2119806.0, ans=0.125 2023-06-28 18:37:55,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.93 vs. 
limit=22.5 2023-06-28 18:38:05,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2119926.0, ans=0.0 2023-06-28 18:38:23,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2119986.0, ans=0.0 2023-06-28 18:38:40,366 INFO [train.py:996] (1/4) Epoch 12, batch 17900, loss[loss=0.2606, simple_loss=0.3524, pruned_loss=0.08444, over 21744.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3015, pruned_loss=0.0682, over 4264979.78 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:38:52,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2120046.0, ans=0.125 2023-06-28 18:39:13,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2120106.0, ans=0.1 2023-06-28 18:39:21,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2120166.0, ans=0.125 2023-06-28 18:40:11,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2120286.0, ans=0.125 2023-06-28 18:40:12,405 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.160e+02 9.224e+02 1.391e+03 2.083e+03 4.254e+03, threshold=2.783e+03, percent-clipped=21.0 2023-06-28 18:40:19,126 INFO [train.py:996] (1/4) Epoch 12, batch 17950, loss[loss=0.1934, simple_loss=0.2903, pruned_loss=0.04828, over 21656.00 frames. ], tot_loss[loss=0.216, simple_loss=0.3013, pruned_loss=0.0653, over 4267251.15 frames. ], batch size: 263, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:41:08,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2120466.0, ans=0.125 2023-06-28 18:41:24,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2120526.0, ans=0.125 2023-06-28 18:41:55,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2120646.0, ans=0.125 2023-06-28 18:41:55,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2120646.0, ans=0.0 2023-06-28 18:41:56,542 INFO [train.py:996] (1/4) Epoch 12, batch 18000, loss[loss=0.1667, simple_loss=0.2285, pruned_loss=0.05248, over 20719.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2938, pruned_loss=0.06315, over 4270364.26 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 18:41:56,543 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 18:42:16,407 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2604, simple_loss=0.3527, pruned_loss=0.08401, over 1796401.00 frames. 2023-06-28 18:42:16,408 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 18:43:17,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2120766.0, ans=0.125 2023-06-28 18:43:55,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 7.241e+02 9.176e+02 1.211e+03 3.223e+03, threshold=1.835e+03, percent-clipped=1.0 2023-06-28 18:44:00,007 INFO [train.py:996] (1/4) Epoch 12, batch 18050, loss[loss=0.2294, simple_loss=0.2981, pruned_loss=0.08033, over 21558.00 frames. 
], tot_loss[loss=0.2063, simple_loss=0.2877, pruned_loss=0.06242, over 4270288.61 frames. ], batch size: 415, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:44:09,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-28 18:44:19,142 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:44:24,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2121006.0, ans=0.125 2023-06-28 18:45:00,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2121066.0, ans=0.125 2023-06-28 18:45:16,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2121126.0, ans=0.0 2023-06-28 18:45:43,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2121246.0, ans=0.125 2023-06-28 18:45:44,364 INFO [train.py:996] (1/4) Epoch 12, batch 18100, loss[loss=0.2258, simple_loss=0.323, pruned_loss=0.06427, over 21268.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2914, pruned_loss=0.06453, over 4274079.65 frames. ], batch size: 549, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:46:31,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-28 18:46:34,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2121366.0, ans=0.125 2023-06-28 18:47:04,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2121426.0, ans=0.0 2023-06-28 18:47:23,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.434e+02 8.761e+02 1.193e+03 1.712e+03 3.705e+03, threshold=2.386e+03, percent-clipped=21.0 2023-06-28 18:47:26,565 INFO [train.py:996] (1/4) Epoch 12, batch 18150, loss[loss=0.1945, simple_loss=0.2735, pruned_loss=0.05778, over 21678.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2941, pruned_loss=0.06496, over 4274804.51 frames. ], batch size: 333, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:48:33,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2121726.0, ans=0.1 2023-06-28 18:48:56,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2121786.0, ans=0.125 2023-06-28 18:49:08,822 INFO [train.py:996] (1/4) Epoch 12, batch 18200, loss[loss=0.1803, simple_loss=0.2571, pruned_loss=0.0517, over 21670.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2882, pruned_loss=0.06507, over 4276464.64 frames. 
], batch size: 282, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:49:10,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2121846.0, ans=0.0 2023-06-28 18:50:14,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2122026.0, ans=0.0 2023-06-28 18:50:44,438 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.871e+02 6.470e+02 8.191e+02 1.481e+03 3.644e+03, threshold=1.638e+03, percent-clipped=8.0 2023-06-28 18:50:48,140 INFO [train.py:996] (1/4) Epoch 12, batch 18250, loss[loss=0.1567, simple_loss=0.2336, pruned_loss=0.0399, over 21588.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2821, pruned_loss=0.06364, over 4272847.23 frames. ], batch size: 132, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:50:50,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2122146.0, ans=0.125 2023-06-28 18:50:52,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2122146.0, ans=0.05 2023-06-28 18:50:58,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2122146.0, ans=0.07 2023-06-28 18:50:59,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=11.42 vs. limit=15.0 2023-06-28 18:51:01,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2122146.0, ans=0.2 2023-06-28 18:51:17,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2122206.0, ans=0.125 2023-06-28 18:51:21,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2122206.0, ans=0.2 2023-06-28 18:51:29,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=15.0 2023-06-28 18:51:37,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-28 18:51:49,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2122326.0, ans=0.2 2023-06-28 18:51:50,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2122326.0, ans=0.125 2023-06-28 18:52:25,408 INFO [train.py:996] (1/4) Epoch 12, batch 18300, loss[loss=0.2096, simple_loss=0.2791, pruned_loss=0.07006, over 21293.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2831, pruned_loss=0.06301, over 4271601.17 frames. 
], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:52:45,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=2122506.0, ans=0.2 2023-06-28 18:53:21,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2122566.0, ans=0.0 2023-06-28 18:53:41,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2122626.0, ans=0.125 2023-06-28 18:54:01,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2122686.0, ans=10.0 2023-06-28 18:54:03,759 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.548e+02 1.033e+03 1.487e+03 2.193e+03 4.357e+03, threshold=2.975e+03, percent-clipped=43.0 2023-06-28 18:54:06,778 INFO [train.py:996] (1/4) Epoch 12, batch 18350, loss[loss=0.1908, simple_loss=0.2664, pruned_loss=0.05763, over 21721.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2879, pruned_loss=0.06286, over 4279641.35 frames. ], batch size: 316, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:55:23,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2122926.0, ans=0.125 2023-06-28 18:55:32,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2122986.0, ans=0.125 2023-06-28 18:55:49,917 INFO [train.py:996] (1/4) Epoch 12, batch 18400, loss[loss=0.1744, simple_loss=0.2616, pruned_loss=0.04362, over 21589.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2836, pruned_loss=0.06146, over 4270428.93 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:56:22,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2123106.0, ans=0.0 2023-06-28 18:57:02,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2123226.0, ans=0.1 2023-06-28 18:57:12,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-28 18:57:21,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2123286.0, ans=0.125 2023-06-28 18:57:22,743 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.644e+02 6.567e+02 9.671e+02 1.816e+03 3.680e+03, threshold=1.934e+03, percent-clipped=2.0 2023-06-28 18:57:26,094 INFO [train.py:996] (1/4) Epoch 12, batch 18450, loss[loss=0.1979, simple_loss=0.2779, pruned_loss=0.05894, over 21635.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2803, pruned_loss=0.05842, over 4267590.92 frames. 
], batch size: 415, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:58:00,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2123406.0, ans=0.0 2023-06-28 18:58:02,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2123406.0, ans=0.125 2023-06-28 18:58:21,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2123466.0, ans=0.125 2023-06-28 18:58:40,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2123526.0, ans=0.125 2023-06-28 18:58:43,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2123526.0, ans=0.2 2023-06-28 18:59:07,201 INFO [train.py:996] (1/4) Epoch 12, batch 18500, loss[loss=0.173, simple_loss=0.2401, pruned_loss=0.05288, over 21438.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2771, pruned_loss=0.05806, over 4270232.68 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:59:07,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2123646.0, ans=0.1 2023-06-28 18:59:56,856 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-28 19:00:11,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2123826.0, ans=0.125 2023-06-28 19:00:23,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-28 19:00:45,281 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 8.087e+02 1.310e+03 2.007e+03 4.820e+03, threshold=2.620e+03, percent-clipped=25.0 2023-06-28 19:00:48,717 INFO [train.py:996] (1/4) Epoch 12, batch 18550, loss[loss=0.2098, simple_loss=0.2704, pruned_loss=0.07462, over 21849.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2754, pruned_loss=0.05755, over 4262499.97 frames. ], batch size: 107, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:01:12,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2124006.0, ans=0.125 2023-06-28 19:01:37,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2124066.0, ans=0.2 2023-06-28 19:01:42,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2124066.0, ans=0.1 2023-06-28 19:02:32,388 INFO [train.py:996] (1/4) Epoch 12, batch 18600, loss[loss=0.2311, simple_loss=0.3186, pruned_loss=0.0718, over 21609.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2727, pruned_loss=0.05787, over 4266104.27 frames. 
], batch size: 442, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:02:41,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2124246.0, ans=0.125 2023-06-28 19:03:22,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2124366.0, ans=0.2 2023-06-28 19:03:51,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2124426.0, ans=0.125 2023-06-28 19:04:06,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2124486.0, ans=0.125 2023-06-28 19:04:12,043 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.057e+02 7.816e+02 1.103e+03 1.650e+03 3.069e+03, threshold=2.205e+03, percent-clipped=1.0 2023-06-28 19:04:13,754 INFO [train.py:996] (1/4) Epoch 12, batch 18650, loss[loss=0.1871, simple_loss=0.2593, pruned_loss=0.05743, over 21439.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2714, pruned_loss=0.05774, over 4240580.47 frames. ], batch size: 212, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:04:45,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2124606.0, ans=0.0 2023-06-28 19:04:58,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2124606.0, ans=0.125 2023-06-28 19:05:01,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2124666.0, ans=0.0 2023-06-28 19:05:04,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2124666.0, ans=0.04949747468305833 2023-06-28 19:05:44,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2124786.0, ans=0.2 2023-06-28 19:05:55,214 INFO [train.py:996] (1/4) Epoch 12, batch 18700, loss[loss=0.1791, simple_loss=0.2436, pruned_loss=0.05725, over 21465.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2689, pruned_loss=0.05896, over 4250531.61 frames. ], batch size: 195, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:06:43,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.39 vs. limit=15.0 2023-06-28 19:07:26,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-28 19:07:35,699 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.958e+02 6.838e+02 8.648e+02 1.289e+03 2.694e+03, threshold=1.730e+03, percent-clipped=5.0 2023-06-28 19:07:37,313 INFO [train.py:996] (1/4) Epoch 12, batch 18750, loss[loss=0.1923, simple_loss=0.266, pruned_loss=0.05932, over 21832.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2704, pruned_loss=0.06087, over 4261395.81 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:08:42,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2125326.0, ans=0.07 2023-06-28 19:09:19,262 INFO [train.py:996] (1/4) Epoch 12, batch 18800, loss[loss=0.2056, simple_loss=0.2997, pruned_loss=0.05577, over 21750.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2792, pruned_loss=0.0629, over 4264599.64 frames. 
], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:09:25,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2125446.0, ans=0.125 2023-06-28 19:09:44,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2125506.0, ans=0.125 2023-06-28 19:09:49,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2125506.0, ans=0.2 2023-06-28 19:10:58,897 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.378e+02 7.621e+02 1.255e+03 1.956e+03 3.877e+03, threshold=2.510e+03, percent-clipped=29.0 2023-06-28 19:11:00,574 INFO [train.py:996] (1/4) Epoch 12, batch 18850, loss[loss=0.1726, simple_loss=0.2449, pruned_loss=0.05011, over 20797.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2779, pruned_loss=0.05971, over 4253875.31 frames. ], batch size: 609, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:11:20,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2125806.0, ans=0.125 2023-06-28 19:11:20,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-28 19:11:30,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=15.0 2023-06-28 19:12:40,423 INFO [train.py:996] (1/4) Epoch 12, batch 18900, loss[loss=0.1712, simple_loss=0.2398, pruned_loss=0.05125, over 21466.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2732, pruned_loss=0.05928, over 4246850.05 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:12:55,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2126106.0, ans=0.0 2023-06-28 19:13:00,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2126106.0, ans=0.0 2023-06-28 19:13:10,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-28 19:14:07,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.09 vs. limit=10.0 2023-06-28 19:14:14,963 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 7.566e+02 1.259e+03 1.840e+03 2.966e+03, threshold=2.518e+03, percent-clipped=3.0 2023-06-28 19:14:16,560 INFO [train.py:996] (1/4) Epoch 12, batch 18950, loss[loss=0.181, simple_loss=0.2322, pruned_loss=0.06492, over 20239.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.272, pruned_loss=0.06056, over 4255258.43 frames. ], batch size: 703, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:14:50,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=2126406.0, ans=0.2 2023-06-28 19:15:12,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2126466.0, ans=0.0 2023-06-28 19:15:55,094 INFO [train.py:996] (1/4) Epoch 12, batch 19000, loss[loss=0.2212, simple_loss=0.3058, pruned_loss=0.06831, over 21305.00 frames. 
], tot_loss[loss=0.2035, simple_loss=0.281, pruned_loss=0.06299, over 4256917.43 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:16:34,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2126706.0, ans=0.0 2023-06-28 19:16:37,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2126766.0, ans=0.125 2023-06-28 19:17:24,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2126886.0, ans=0.2 2023-06-28 19:17:27,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2126886.0, ans=0.125 2023-06-28 19:17:32,188 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.360e+02 7.287e+02 9.721e+02 1.319e+03 3.703e+03, threshold=1.944e+03, percent-clipped=9.0 2023-06-28 19:17:33,787 INFO [train.py:996] (1/4) Epoch 12, batch 19050, loss[loss=0.2219, simple_loss=0.2852, pruned_loss=0.07928, over 21305.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2859, pruned_loss=0.06596, over 4263276.16 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:17:55,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2127006.0, ans=0.125 2023-06-28 19:17:57,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2127006.0, ans=0.1 2023-06-28 19:18:14,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2127006.0, ans=0.125 2023-06-28 19:18:53,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2127126.0, ans=0.125 2023-06-28 19:19:00,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2127186.0, ans=0.125 2023-06-28 19:19:08,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2127186.0, ans=0.05 2023-06-28 19:19:11,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2127186.0, ans=0.125 2023-06-28 19:19:16,207 INFO [train.py:996] (1/4) Epoch 12, batch 19100, loss[loss=0.1878, simple_loss=0.254, pruned_loss=0.06084, over 21610.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2847, pruned_loss=0.06711, over 4264326.44 frames. ], batch size: 231, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:20:12,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2127366.0, ans=0.125 2023-06-28 19:20:29,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-28 19:20:49,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2127486.0, ans=0.0 2023-06-28 19:20:55,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. 
limit=15.0 2023-06-28 19:20:56,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2127486.0, ans=0.05 2023-06-28 19:21:01,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.479e+02 7.973e+02 1.169e+03 1.755e+03 3.524e+03, threshold=2.338e+03, percent-clipped=19.0 2023-06-28 19:21:03,212 INFO [train.py:996] (1/4) Epoch 12, batch 19150, loss[loss=0.1935, simple_loss=0.2824, pruned_loss=0.05229, over 21229.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2878, pruned_loss=0.06815, over 4263848.03 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:22:21,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2127726.0, ans=0.125 2023-06-28 19:22:34,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2127786.0, ans=0.0 2023-06-28 19:22:53,943 INFO [train.py:996] (1/4) Epoch 12, batch 19200, loss[loss=0.2793, simple_loss=0.3742, pruned_loss=0.09217, over 21642.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2974, pruned_loss=0.06874, over 4262354.57 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 19:23:28,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2127906.0, ans=0.0 2023-06-28 19:23:41,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2127966.0, ans=0.0 2023-06-28 19:23:46,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2127966.0, ans=0.1 2023-06-28 19:23:59,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2128026.0, ans=0.125 2023-06-28 19:24:36,177 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 8.519e+02 1.165e+03 1.659e+03 4.865e+03, threshold=2.330e+03, percent-clipped=13.0 2023-06-28 19:24:36,207 INFO [train.py:996] (1/4) Epoch 12, batch 19250, loss[loss=0.1572, simple_loss=0.2574, pruned_loss=0.02853, over 21386.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2997, pruned_loss=0.0653, over 4267391.12 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:25:21,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2128266.0, ans=0.125 2023-06-28 19:25:23,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2128266.0, ans=0.125 2023-06-28 19:25:36,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2128326.0, ans=0.0 2023-06-28 19:25:39,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128326.0, ans=0.1 2023-06-28 19:25:46,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. 
limit=15.0 2023-06-28 19:25:58,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128386.0, ans=0.1 2023-06-28 19:25:59,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128386.0, ans=0.1 2023-06-28 19:26:18,595 INFO [train.py:996] (1/4) Epoch 12, batch 19300, loss[loss=0.1653, simple_loss=0.2477, pruned_loss=0.04143, over 21491.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2966, pruned_loss=0.06421, over 4265304.89 frames. ], batch size: 195, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:26:47,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2128506.0, ans=0.125 2023-06-28 19:27:56,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-28 19:27:57,287 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.967e+02 7.718e+02 1.195e+03 1.796e+03 4.248e+03, threshold=2.390e+03, percent-clipped=9.0 2023-06-28 19:27:57,333 INFO [train.py:996] (1/4) Epoch 12, batch 19350, loss[loss=0.1748, simple_loss=0.2484, pruned_loss=0.05056, over 21163.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2928, pruned_loss=0.06124, over 4268742.33 frames. ], batch size: 143, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:28:17,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2128806.0, ans=0.2 2023-06-28 19:29:37,800 INFO [train.py:996] (1/4) Epoch 12, batch 19400, loss[loss=0.2148, simple_loss=0.2862, pruned_loss=0.07172, over 21927.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.289, pruned_loss=0.06006, over 4277658.20 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:30:46,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2129226.0, ans=0.125 2023-06-28 19:30:47,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2129226.0, ans=0.0 2023-06-28 19:31:19,986 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.788e+02 6.972e+02 8.917e+02 1.265e+03 3.232e+03, threshold=1.783e+03, percent-clipped=5.0 2023-06-28 19:31:20,030 INFO [train.py:996] (1/4) Epoch 12, batch 19450, loss[loss=0.1866, simple_loss=0.2485, pruned_loss=0.06241, over 21576.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2865, pruned_loss=0.06137, over 4285725.33 frames. ], batch size: 247, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:31:35,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-28 19:33:02,591 INFO [train.py:996] (1/4) Epoch 12, batch 19500, loss[loss=0.145, simple_loss=0.1926, pruned_loss=0.04866, over 16489.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2824, pruned_loss=0.06221, over 4263624.19 frames. 
], batch size: 62, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 19:33:16,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2129646.0, ans=0.125 2023-06-28 19:34:06,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2129826.0, ans=0.125 2023-06-28 19:34:22,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2129886.0, ans=0.09899494936611666 2023-06-28 19:34:43,709 INFO [train.py:996] (1/4) Epoch 12, batch 19550, loss[loss=0.2028, simple_loss=0.2974, pruned_loss=0.0541, over 20861.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2779, pruned_loss=0.06091, over 4264047.72 frames. ], batch size: 609, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 19:34:45,239 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.099e+02 1.131e+03 1.724e+03 3.417e+03, threshold=2.262e+03, percent-clipped=22.0 2023-06-28 19:35:28,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-28 19:36:24,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2130246.0, ans=0.125 2023-06-28 19:36:25,911 INFO [train.py:996] (1/4) Epoch 12, batch 19600, loss[loss=0.2384, simple_loss=0.3115, pruned_loss=0.08268, over 21902.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2786, pruned_loss=0.06104, over 4273053.90 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:37:15,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2130366.0, ans=0.1 2023-06-28 19:37:22,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2130366.0, ans=0.0 2023-06-28 19:37:50,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2130486.0, ans=0.1 2023-06-28 19:38:14,412 INFO [train.py:996] (1/4) Epoch 12, batch 19650, loss[loss=0.2177, simple_loss=0.2802, pruned_loss=0.07756, over 21636.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2833, pruned_loss=0.06479, over 4281915.82 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:38:16,158 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.322e+02 7.698e+02 1.187e+03 1.875e+03 3.672e+03, threshold=2.374e+03, percent-clipped=11.0 2023-06-28 19:38:30,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2130606.0, ans=0.125 2023-06-28 19:38:57,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2130666.0, ans=0.0 2023-06-28 19:39:04,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2130666.0, ans=0.05 2023-06-28 19:39:13,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-28 19:40:00,294 INFO [train.py:996] (1/4) Epoch 12, batch 19700, loss[loss=0.2076, simple_loss=0.3033, pruned_loss=0.05593, over 21719.00 frames. 
], tot_loss[loss=0.208, simple_loss=0.2862, pruned_loss=0.0649, over 4273554.10 frames. ], batch size: 415, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:41:50,327 INFO [train.py:996] (1/4) Epoch 12, batch 19750, loss[loss=0.2392, simple_loss=0.327, pruned_loss=0.07565, over 21489.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2964, pruned_loss=0.06677, over 4269125.06 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:41:51,906 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 8.894e+02 1.243e+03 1.861e+03 5.840e+03, threshold=2.486e+03, percent-clipped=14.0 2023-06-28 19:42:56,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2131326.0, ans=0.025 2023-06-28 19:43:12,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2131386.0, ans=0.1 2023-06-28 19:43:22,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2131386.0, ans=0.1 2023-06-28 19:43:31,909 INFO [train.py:996] (1/4) Epoch 12, batch 19800, loss[loss=0.2065, simple_loss=0.2865, pruned_loss=0.06323, over 21688.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2967, pruned_loss=0.06698, over 4269798.97 frames. ], batch size: 389, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:43:46,568 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0 2023-06-28 19:44:00,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-06-28 19:44:01,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2131506.0, ans=0.0 2023-06-28 19:44:57,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-28 19:45:05,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2131686.0, ans=15.0 2023-06-28 19:45:16,346 INFO [train.py:996] (1/4) Epoch 12, batch 19850, loss[loss=0.1802, simple_loss=0.2693, pruned_loss=0.04553, over 21748.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2882, pruned_loss=0.06268, over 4274524.54 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:45:18,107 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.069e+02 7.581e+02 9.843e+02 1.508e+03 3.551e+03, threshold=1.969e+03, percent-clipped=6.0 2023-06-28 19:45:29,822 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. 
limit=15.0 2023-06-28 19:45:32,490 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:45:37,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2131806.0, ans=0.125 2023-06-28 19:45:39,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2131806.0, ans=0.1 2023-06-28 19:45:40,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2131806.0, ans=0.125 2023-06-28 19:46:45,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2131986.0, ans=0.125 2023-06-28 19:46:59,304 INFO [train.py:996] (1/4) Epoch 12, batch 19900, loss[loss=0.2352, simple_loss=0.3512, pruned_loss=0.05965, over 19811.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2904, pruned_loss=0.06114, over 4269614.56 frames. ], batch size: 702, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:47:34,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2132106.0, ans=0.0 2023-06-28 19:47:36,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2132106.0, ans=0.0 2023-06-28 19:48:42,868 INFO [train.py:996] (1/4) Epoch 12, batch 19950, loss[loss=0.187, simple_loss=0.259, pruned_loss=0.05753, over 21774.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2844, pruned_loss=0.06099, over 4265064.42 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:48:44,572 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.913e+02 9.095e+02 1.320e+03 1.827e+03 2.856e+03, threshold=2.640e+03, percent-clipped=20.0 2023-06-28 19:49:13,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2132406.0, ans=0.0 2023-06-28 19:49:28,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132406.0, ans=0.1 2023-06-28 19:49:39,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2132466.0, ans=0.125 2023-06-28 19:50:25,836 INFO [train.py:996] (1/4) Epoch 12, batch 20000, loss[loss=0.2138, simple_loss=0.3119, pruned_loss=0.05784, over 21741.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2846, pruned_loss=0.06137, over 4263738.82 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 19:51:20,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-28 19:51:28,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-28 19:51:36,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2132826.0, ans=0.025 2023-06-28 19:51:38,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. 
limit=6.0 2023-06-28 19:52:06,801 INFO [train.py:996] (1/4) Epoch 12, batch 20050, loss[loss=0.2005, simple_loss=0.2817, pruned_loss=0.05965, over 21892.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2872, pruned_loss=0.06369, over 4274155.15 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 19:52:08,364 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.463e+02 7.625e+02 1.079e+03 1.735e+03 4.168e+03, threshold=2.158e+03, percent-clipped=5.0 2023-06-28 19:52:42,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2133006.0, ans=0.125 2023-06-28 19:52:54,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2133066.0, ans=0.125 2023-06-28 19:53:44,639 INFO [train.py:996] (1/4) Epoch 12, batch 20100, loss[loss=0.2105, simple_loss=0.2866, pruned_loss=0.06718, over 21206.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2887, pruned_loss=0.06528, over 4287043.61 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:54:20,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2133306.0, ans=0.2 2023-06-28 19:54:48,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2133366.0, ans=0.125 2023-06-28 19:55:11,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-28 19:55:38,339 INFO [train.py:996] (1/4) Epoch 12, batch 20150, loss[loss=0.2569, simple_loss=0.3443, pruned_loss=0.08475, over 21823.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2955, pruned_loss=0.06883, over 4280784.41 frames. ], batch size: 124, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:55:41,574 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.747e+02 8.369e+02 1.261e+03 1.979e+03 4.381e+03, threshold=2.521e+03, percent-clipped=21.0 2023-06-28 19:56:42,483 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-28 19:57:11,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2133786.0, ans=0.035 2023-06-28 19:57:24,612 INFO [train.py:996] (1/4) Epoch 12, batch 20200, loss[loss=0.2842, simple_loss=0.371, pruned_loss=0.09871, over 21538.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3013, pruned_loss=0.07136, over 4277120.09 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:58:38,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=12.0 2023-06-28 19:58:46,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2134026.0, ans=0.125 2023-06-28 19:58:51,611 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-28 19:58:57,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2134086.0, ans=0.1 2023-06-28 19:59:11,833 INFO [train.py:996] (1/4) Epoch 12, batch 20250, loss[loss=0.2128, simple_loss=0.287, pruned_loss=0.06928, over 21454.00 frames. 
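The optim.py:471 lines summarise gradient-norm clipping: five quantiles of recently observed gradient norms (apparently min, 25%, median, 75% and max), the clipping threshold, and the share of recent batches that were clipped. In these entries the threshold tracks Clipping_scale times the median quantile (2.0 * 1.079e+03 gives the 2.158e+03 logged above). Below is a rough sketch of that bookkeeping, assuming threshold = clipping_scale * median over a sliding window of gradient norms; the actual accounting lives in icefall's optim.py and may differ in detail.

from collections import deque
import torch

class GradNormClipper:
    """Sketch of quantile-based clipping: threshold = clipping_scale * median."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)   # recent total gradient norms
        self.num_clipped = 0
        self.num_seen = 0

    def step(self, params) -> float:
        grads = [p.grad.detach().flatten() for p in params if p.grad is not None]
        norm = torch.cat(grads).norm().item()
        self.norms.append(norm)
        self.num_seen += 1
        threshold = self.clipping_scale * torch.tensor(list(self.norms)).median().item()
        if norm > threshold:
            self.num_clipped += 1
            for p in params:                      # rescale gradients so their
                if p.grad is not None:            # total norm equals the threshold
                    p.grad.mul_(threshold / norm)
        return threshold

    def summary(self) -> str:
        qs = torch.quantile(torch.tensor(list(self.norms)),
                            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        pct = 100.0 * self.num_clipped / max(self.num_seen, 1)
        return ("grad-norm quartiles " + " ".join(f"{q:.3e}" for q in qs.tolist())
                + f", percent-clipped={pct:.1f}")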
], tot_loss[loss=0.2214, simple_loss=0.3022, pruned_loss=0.07026, over 4272602.41 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:59:12,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2134146.0, ans=0.125 2023-06-28 19:59:12,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2134146.0, ans=0.1 2023-06-28 19:59:19,724 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.497e+02 8.759e+02 1.398e+03 2.270e+03 4.094e+03, threshold=2.796e+03, percent-clipped=18.0 2023-06-28 19:59:44,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2134206.0, ans=0.125 2023-06-28 20:00:26,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-28 20:00:33,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2134386.0, ans=0.125 2023-06-28 20:00:53,961 INFO [train.py:996] (1/4) Epoch 12, batch 20300, loss[loss=0.2289, simple_loss=0.3219, pruned_loss=0.06794, over 21590.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3, pruned_loss=0.06768, over 4262517.38 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:01:31,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=2134566.0, ans=10.0 2023-06-28 20:02:34,223 INFO [train.py:996] (1/4) Epoch 12, batch 20350, loss[loss=0.2265, simple_loss=0.2956, pruned_loss=0.07864, over 21764.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.3004, pruned_loss=0.06774, over 4267228.20 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:02:37,268 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.278e+02 8.027e+02 1.220e+03 1.701e+03 2.990e+03, threshold=2.441e+03, percent-clipped=1.0 2023-06-28 20:03:39,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2134926.0, ans=0.05 2023-06-28 20:04:21,625 INFO [train.py:996] (1/4) Epoch 12, batch 20400, loss[loss=0.2258, simple_loss=0.3007, pruned_loss=0.07547, over 21491.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3035, pruned_loss=0.06998, over 4264682.54 frames. ], batch size: 194, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:04:38,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135106.0, ans=0.1 2023-06-28 20:04:58,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-28 20:05:33,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=22.5 2023-06-28 20:05:58,024 INFO [train.py:996] (1/4) Epoch 12, batch 20450, loss[loss=0.2159, simple_loss=0.2915, pruned_loss=0.07014, over 21876.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3033, pruned_loss=0.07158, over 4258954.95 frames. 
], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:06:03,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 7.818e+02 1.125e+03 1.970e+03 4.809e+03, threshold=2.251e+03, percent-clipped=13.0 2023-06-28 20:06:10,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2135346.0, ans=0.125 2023-06-28 20:06:43,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2135466.0, ans=0.1 2023-06-28 20:07:15,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2135526.0, ans=0.125 2023-06-28 20:07:20,401 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:07:39,486 INFO [train.py:996] (1/4) Epoch 12, batch 20500, loss[loss=0.1936, simple_loss=0.267, pruned_loss=0.06011, over 21811.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2996, pruned_loss=0.07141, over 4257213.86 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:07:55,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=2135646.0, ans=0.2 2023-06-28 20:08:07,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2135706.0, ans=0.125 2023-06-28 20:08:15,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135706.0, ans=0.1 2023-06-28 20:08:20,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2023-06-28 20:08:37,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2135766.0, ans=0.2 2023-06-28 20:08:41,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2135766.0, ans=0.2 2023-06-28 20:08:46,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2135826.0, ans=0.0 2023-06-28 20:08:55,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2135826.0, ans=15.0 2023-06-28 20:09:24,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2135886.0, ans=0.0 2023-06-28 20:09:27,004 INFO [train.py:996] (1/4) Epoch 12, batch 20550, loss[loss=0.1906, simple_loss=0.2746, pruned_loss=0.05326, over 21413.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2929, pruned_loss=0.06968, over 4246247.28 frames. 
], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:09:32,110 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.925e+02 7.744e+02 1.015e+03 1.488e+03 3.056e+03, threshold=2.029e+03, percent-clipped=4.0 2023-06-28 20:10:25,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2136066.0, ans=0.2 2023-06-28 20:10:40,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2136126.0, ans=0.125 2023-06-28 20:11:10,575 INFO [train.py:996] (1/4) Epoch 12, batch 20600, loss[loss=0.2655, simple_loss=0.3243, pruned_loss=0.1033, over 21587.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2946, pruned_loss=0.06825, over 4248487.02 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:11:12,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2136246.0, ans=0.5 2023-06-28 20:12:45,955 INFO [train.py:996] (1/4) Epoch 12, batch 20650, loss[loss=0.1854, simple_loss=0.2542, pruned_loss=0.05832, over 21668.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2902, pruned_loss=0.06785, over 4237580.41 frames. ], batch size: 332, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:12:51,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.648e+02 9.695e+02 1.455e+03 2.228e+03 5.123e+03, threshold=2.910e+03, percent-clipped=30.0 2023-06-28 20:13:28,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-28 20:13:33,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2136666.0, ans=0.0 2023-06-28 20:14:17,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2136786.0, ans=0.125 2023-06-28 20:14:23,669 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:14:28,008 INFO [train.py:996] (1/4) Epoch 12, batch 20700, loss[loss=0.164, simple_loss=0.2379, pruned_loss=0.04501, over 21474.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2833, pruned_loss=0.06538, over 4236022.31 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:14:31,012 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.09 vs. limit=15.0 2023-06-28 20:14:53,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2136906.0, ans=0.125 2023-06-28 20:16:09,298 INFO [train.py:996] (1/4) Epoch 12, batch 20750, loss[loss=0.2126, simple_loss=0.2998, pruned_loss=0.06272, over 21738.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2842, pruned_loss=0.06412, over 4240800.89 frames. 
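Each scaling.py:182 ScheduledFloat line records the current value (ans=) of a hyper-parameter that is scheduled against batch_count: dropout probabilities, skip rates, balancer probabilities, bypass scale minima and so on. Below is a toy version of such a schedule, assuming plain linear interpolation between (batch_count, value) breakpoints with clamping outside the given range; the real ScheduledFloat in icefall's scaling.py carries additional machinery, and the breakpoints in the example are hypothetical rather than taken from this run's configuration.

from bisect import bisect_right
from typing import List, Tuple

class PiecewiseLinearSchedule:
    """Toy batch_count-keyed schedule in the spirit of the ScheduledFloat logs."""

    def __init__(self, points: List[Tuple[float, float]]):
        # points: (batch_count, value) breakpoints, interpolated linearly
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        xs = [x for x, _ in self.points]
        if batch_count <= xs[0]:
            return self.points[0][1]
        if batch_count >= xs[-1]:
            return self.points[-1][1]
        i = bisect_right(xs, batch_count)
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        frac = (batch_count - x0) / (x1 - x0)
        return y0 + frac * (y1 - y0)

# e.g. a skip-rate that decays from 0.5 at the start of training to 0.0 by
# batch_count 20000 (hypothetical breakpoints):
skip_rate = PiecewiseLinearSchedule([(0, 0.5), (20000, 0.0)])
print(skip_rate(2136066))   # -> 0.0, like the *_skip_rate entries in this log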
], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:16:14,419 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 7.769e+02 1.310e+03 2.249e+03 6.727e+03, threshold=2.619e+03, percent-clipped=13.0 2023-06-28 20:17:28,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2137326.0, ans=0.07 2023-06-28 20:17:32,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2137326.0, ans=0.125 2023-06-28 20:17:51,089 INFO [train.py:996] (1/4) Epoch 12, batch 20800, loss[loss=0.1786, simple_loss=0.2516, pruned_loss=0.05285, over 21652.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2873, pruned_loss=0.06498, over 4245484.85 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:17:51,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2137446.0, ans=0.07 2023-06-28 20:17:55,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2137446.0, ans=0.0 2023-06-28 20:19:33,044 INFO [train.py:996] (1/4) Epoch 12, batch 20850, loss[loss=0.2195, simple_loss=0.2764, pruned_loss=0.08127, over 21619.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2792, pruned_loss=0.06273, over 4241996.72 frames. ], batch size: 509, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:19:39,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 7.517e+02 1.058e+03 1.433e+03 3.063e+03, threshold=2.117e+03, percent-clipped=2.0 2023-06-28 20:20:56,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2137986.0, ans=0.125 2023-06-28 20:21:10,319 INFO [train.py:996] (1/4) Epoch 12, batch 20900, loss[loss=0.2112, simple_loss=0.2937, pruned_loss=0.06438, over 21272.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.281, pruned_loss=0.06385, over 4255276.82 frames. ], batch size: 159, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:21:18,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2138046.0, ans=0.0 2023-06-28 20:21:25,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2138106.0, ans=0.125 2023-06-28 20:22:16,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2138226.0, ans=0.0 2023-06-28 20:22:48,735 INFO [train.py:996] (1/4) Epoch 12, batch 20950, loss[loss=0.2057, simple_loss=0.2734, pruned_loss=0.06897, over 21898.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2781, pruned_loss=0.06151, over 4253598.11 frames. 
], batch size: 107, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:22:55,241 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.730e+02 8.164e+02 1.366e+03 2.074e+03 5.785e+03, threshold=2.733e+03, percent-clipped=24.0 2023-06-28 20:23:13,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2138406.0, ans=0.95 2023-06-28 20:23:36,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2138466.0, ans=0.1 2023-06-28 20:23:53,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2138526.0, ans=0.125 2023-06-28 20:24:24,230 INFO [train.py:996] (1/4) Epoch 12, batch 21000, loss[loss=0.1915, simple_loss=0.2669, pruned_loss=0.05808, over 21900.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2772, pruned_loss=0.06124, over 4257788.10 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:24:24,231 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 20:24:40,714 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2646, simple_loss=0.357, pruned_loss=0.08608, over 1796401.00 frames. 2023-06-28 20:24:40,715 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 20:25:35,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2138766.0, ans=0.5 2023-06-28 20:26:21,496 INFO [train.py:996] (1/4) Epoch 12, batch 21050, loss[loss=0.1842, simple_loss=0.2517, pruned_loss=0.05833, over 21155.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2756, pruned_loss=0.06205, over 4265063.48 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:26:27,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2138946.0, ans=0.0 2023-06-28 20:26:28,202 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.708e+02 6.795e+02 9.340e+02 1.308e+03 3.165e+03, threshold=1.868e+03, percent-clipped=2.0 2023-06-28 20:26:28,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2138946.0, ans=0.0 2023-06-28 20:26:46,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-28 20:26:55,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2139006.0, ans=0.1 2023-06-28 20:27:01,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-28 20:27:34,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2139126.0, ans=0.0 2023-06-28 20:27:37,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2139126.0, ans=0.0 2023-06-28 20:28:01,129 INFO [train.py:996] (1/4) Epoch 12, batch 21100, loss[loss=0.1983, simple_loss=0.266, pruned_loss=0.06529, over 21558.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2719, pruned_loss=0.06137, over 4255389.36 frames. 
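The validation line above (loss=0.2646 over 1,796,401 dev frames) and the running tot_loss[... over N frames] figures are frame-weighted averages: each batch contributes its loss and its frame count, and the reported value is accumulated loss divided by accumulated frames. Below is a small sketch of that accumulation, as an assumed simplification of the metrics tracking in train.py.

# Frame-weighted running average; an assumed simplification of how the
# "tot_loss[... over N frames]" and validation figures are accumulated.

class LossTracker:
    def __init__(self):
        self.loss_sum = 0.0   # sum over batches of (per-frame loss * frames)
        self.frames = 0.0

    def update(self, loss_per_frame: float, num_frames: float) -> None:
        # assumes the logged per-batch loss is already a per-frame average
        self.loss_sum += loss_per_frame * num_frames
        self.frames += num_frames

    @property
    def average(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = LossTracker()
tracker.update(0.1653, 21491.0)   # per-batch loss and frame count, batch 19300
tracker.update(0.1748, 21163.0)   # batch 19350
print(f"running average: {tracker.average:.4f} over {tracker.frames:.2f} frames")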
], batch size: 414, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:28:17,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2139306.0, ans=0.125 2023-06-28 20:28:51,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.09 vs. limit=15.0 2023-06-28 20:28:57,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2139366.0, ans=0.125 2023-06-28 20:29:34,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2139486.0, ans=0.125 2023-06-28 20:29:37,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2139486.0, ans=0.0 2023-06-28 20:29:42,284 INFO [train.py:996] (1/4) Epoch 12, batch 21150, loss[loss=0.2105, simple_loss=0.2804, pruned_loss=0.07036, over 21724.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2682, pruned_loss=0.06181, over 4263430.17 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:29:50,631 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.872e+02 8.259e+02 1.205e+03 1.749e+03 3.220e+03, threshold=2.410e+03, percent-clipped=20.0 2023-06-28 20:29:52,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2139546.0, ans=0.125 2023-06-28 20:30:22,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2139606.0, ans=0.0 2023-06-28 20:30:29,073 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-28 20:30:33,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2139666.0, ans=0.0 2023-06-28 20:30:51,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2139726.0, ans=0.125 2023-06-28 20:30:55,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-28 20:31:03,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2139786.0, ans=0.1 2023-06-28 20:31:14,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2139786.0, ans=0.0 2023-06-28 20:31:23,278 INFO [train.py:996] (1/4) Epoch 12, batch 21200, loss[loss=0.2037, simple_loss=0.2712, pruned_loss=0.06807, over 21907.00 frames. ], tot_loss[loss=0.194, simple_loss=0.266, pruned_loss=0.06097, over 4252008.85 frames. ], batch size: 107, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:31:32,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2139846.0, ans=0.1 2023-06-28 20:33:04,755 INFO [train.py:996] (1/4) Epoch 12, batch 21250, loss[loss=0.2116, simple_loss=0.2895, pruned_loss=0.0668, over 21738.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2635, pruned_loss=0.06032, over 4255834.61 frames. 
], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:33:13,142 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.761e+02 7.355e+02 9.747e+02 1.370e+03 2.666e+03, threshold=1.949e+03, percent-clipped=4.0 2023-06-28 20:34:01,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2140266.0, ans=10.0 2023-06-28 20:34:46,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2140446.0, ans=0.1 2023-06-28 20:34:47,105 INFO [train.py:996] (1/4) Epoch 12, batch 21300, loss[loss=0.206, simple_loss=0.2854, pruned_loss=0.06332, over 21575.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2701, pruned_loss=0.06256, over 4265078.92 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:35:01,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2140446.0, ans=0.125 2023-06-28 20:35:20,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-28 20:35:25,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-28 20:35:28,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2140566.0, ans=0.2 2023-06-28 20:35:28,598 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:35:35,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2140566.0, ans=0.04949747468305833 2023-06-28 20:35:41,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-28 20:35:52,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2140626.0, ans=0.2 2023-06-28 20:36:15,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2140686.0, ans=0.125 2023-06-28 20:36:17,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2140686.0, ans=0.125 2023-06-28 20:36:25,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2140686.0, ans=0.125 2023-06-28 20:36:29,986 INFO [train.py:996] (1/4) Epoch 12, batch 21350, loss[loss=0.2423, simple_loss=0.3286, pruned_loss=0.07804, over 21509.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2757, pruned_loss=0.06335, over 4276295.89 frames. 
], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:36:35,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2140746.0, ans=0.125 2023-06-28 20:36:43,162 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.041e+02 8.389e+02 1.153e+03 1.810e+03 4.461e+03, threshold=2.306e+03, percent-clipped=20.0 2023-06-28 20:37:13,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2140866.0, ans=0.05 2023-06-28 20:37:43,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2140926.0, ans=0.125 2023-06-28 20:37:43,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2140926.0, ans=0.125 2023-06-28 20:37:56,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2140986.0, ans=0.125 2023-06-28 20:38:16,931 INFO [train.py:996] (1/4) Epoch 12, batch 21400, loss[loss=0.2161, simple_loss=0.2985, pruned_loss=0.0668, over 21315.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2781, pruned_loss=0.06225, over 4273102.04 frames. ], batch size: 159, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:38:28,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-28 20:38:59,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=2141166.0, ans=0.05 2023-06-28 20:39:14,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2141226.0, ans=0.2 2023-06-28 20:39:50,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-06-28 20:39:57,099 INFO [train.py:996] (1/4) Epoch 12, batch 21450, loss[loss=0.2437, simple_loss=0.308, pruned_loss=0.08976, over 21609.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2835, pruned_loss=0.06498, over 4280015.92 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:40:04,992 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.076e+02 7.437e+02 1.005e+03 1.722e+03 2.921e+03, threshold=2.009e+03, percent-clipped=6.0 2023-06-28 20:40:18,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2141406.0, ans=0.125 2023-06-28 20:41:38,453 INFO [train.py:996] (1/4) Epoch 12, batch 21500, loss[loss=0.1969, simple_loss=0.2623, pruned_loss=0.06577, over 21791.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2806, pruned_loss=0.06563, over 4278276.04 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:41:41,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.17 vs. 
limit=15.0 2023-06-28 20:41:59,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2141706.0, ans=0.1 2023-06-28 20:42:19,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2141766.0, ans=0.125 2023-06-28 20:42:34,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2141766.0, ans=0.0 2023-06-28 20:42:43,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-28 20:42:48,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=12.0 2023-06-28 20:43:19,796 INFO [train.py:996] (1/4) Epoch 12, batch 21550, loss[loss=0.2432, simple_loss=0.3352, pruned_loss=0.07558, over 19857.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2741, pruned_loss=0.06351, over 4268517.31 frames. ], batch size: 702, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:43:32,809 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 7.462e+02 9.978e+02 1.500e+03 2.892e+03, threshold=1.996e+03, percent-clipped=12.0 2023-06-28 20:43:37,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2141946.0, ans=0.1 2023-06-28 20:43:53,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2142006.0, ans=0.0 2023-06-28 20:43:56,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2142006.0, ans=0.0 2023-06-28 20:44:09,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2142066.0, ans=0.0 2023-06-28 20:44:59,828 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-28 20:45:03,643 INFO [train.py:996] (1/4) Epoch 12, batch 21600, loss[loss=0.2129, simple_loss=0.3224, pruned_loss=0.05168, over 19670.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2703, pruned_loss=0.06194, over 4263130.38 frames. ], batch size: 703, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:45:40,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2142306.0, ans=0.125 2023-06-28 20:46:03,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2142366.0, ans=0.125 2023-06-28 20:46:24,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2142486.0, ans=0.0 2023-06-28 20:46:51,882 INFO [train.py:996] (1/4) Epoch 12, batch 21650, loss[loss=0.206, simple_loss=0.311, pruned_loss=0.05051, over 21736.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2748, pruned_loss=0.05979, over 4265676.24 frames. 
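The scaling.py:962 Whitening lines compare a statistic of a layer's activations against that module's whitening_limit (metric=10.89 vs. limit=15.0 above); roughly speaking, when the metric exceeds the limit the Whiten module penalises the activations so that their covariance moves back towards a multiple of the identity. The exact statistic is defined in icefall's scaling.py; the snippet below only computes an illustrative proxy, the ratio of the largest covariance eigenvalue to the mean eigenvalue, to convey what "metric vs. limit" is measuring.

# Illustrative proxy for a whitening metric: how anisotropic is the feature
# covariance? NOT necessarily the exact statistic computed in scaling.py.
import torch

def covariance_anisotropy(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations for one whitened group."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]               # (C, C) sample covariance
    eigs = torch.linalg.eigvalsh(cov)          # eigenvalues, ascending
    return (eigs[-1] / eigs.mean().clamp(min=1e-20)).item()

feats = torch.randn(1000, 256)                 # roughly white features
print(covariance_anisotropy(feats))            # moderate value for white noise

skewed = feats * torch.linspace(0.1, 3.0, 256) # channels with unequal variance
print(covariance_anisotropy(skewed))           # larger metric when anisotropic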
], batch size: 298, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:46:58,599 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:47:03,117 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.132e+02 8.434e+02 1.336e+03 2.286e+03 3.969e+03, threshold=2.673e+03, percent-clipped=30.0 2023-06-28 20:47:03,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2142546.0, ans=0.0 2023-06-28 20:47:41,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2142666.0, ans=0.07 2023-06-28 20:47:50,438 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:47:57,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2142726.0, ans=0.125 2023-06-28 20:48:26,756 INFO [train.py:996] (1/4) Epoch 12, batch 21700, loss[loss=0.2136, simple_loss=0.2785, pruned_loss=0.07429, over 21542.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2777, pruned_loss=0.05899, over 4267268.31 frames. ], batch size: 442, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:48:37,188 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:49:38,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2143026.0, ans=0.2 2023-06-28 20:49:40,127 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-28 20:50:07,593 INFO [train.py:996] (1/4) Epoch 12, batch 21750, loss[loss=0.2031, simple_loss=0.2651, pruned_loss=0.07053, over 21542.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2736, pruned_loss=0.05986, over 4275400.90 frames. ], batch size: 391, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:50:24,253 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.270e+02 7.010e+02 1.001e+03 1.482e+03 3.293e+03, threshold=2.002e+03, percent-clipped=2.0 2023-06-28 20:51:06,661 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:51:45,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2143386.0, ans=0.125 2023-06-28 20:51:54,920 INFO [train.py:996] (1/4) Epoch 12, batch 21800, loss[loss=0.1997, simple_loss=0.2781, pruned_loss=0.06065, over 21675.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2701, pruned_loss=0.06033, over 4277889.99 frames. 
], batch size: 248, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:51:57,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2143446.0, ans=0.125 2023-06-28 20:52:09,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2143446.0, ans=0.0 2023-06-28 20:52:17,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2143506.0, ans=0.125 2023-06-28 20:53:23,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2143686.0, ans=0.125 2023-06-28 20:53:37,021 INFO [train.py:996] (1/4) Epoch 12, batch 21850, loss[loss=0.2041, simple_loss=0.285, pruned_loss=0.06164, over 21455.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2769, pruned_loss=0.06102, over 4268699.24 frames. ], batch size: 194, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:53:48,650 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.108e+02 8.276e+02 1.227e+03 1.863e+03 4.037e+03, threshold=2.455e+03, percent-clipped=20.0 2023-06-28 20:55:09,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5 2023-06-28 20:55:18,295 INFO [train.py:996] (1/4) Epoch 12, batch 21900, loss[loss=0.2058, simple_loss=0.2745, pruned_loss=0.06854, over 21818.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2775, pruned_loss=0.06208, over 4279641.21 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:55:40,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-28 20:56:58,122 INFO [train.py:996] (1/4) Epoch 12, batch 21950, loss[loss=0.1749, simple_loss=0.2495, pruned_loss=0.05015, over 21782.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2718, pruned_loss=0.06081, over 4278096.87 frames. ], batch size: 124, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:57:09,570 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.558e+02 7.761e+02 1.147e+03 1.869e+03 4.092e+03, threshold=2.294e+03, percent-clipped=9.0 2023-06-28 20:57:21,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2144406.0, ans=0.04949747468305833 2023-06-28 20:58:00,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2144526.0, ans=0.125 2023-06-28 20:58:01,425 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-28 20:58:18,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2144586.0, ans=0.025 2023-06-28 20:58:22,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2144586.0, ans=0.035 2023-06-28 20:58:24,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.27 vs. 
limit=15.0 2023-06-28 20:58:39,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2144646.0, ans=0.0 2023-06-28 20:58:40,385 INFO [train.py:996] (1/4) Epoch 12, batch 22000, loss[loss=0.1785, simple_loss=0.2529, pruned_loss=0.05204, over 21800.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2653, pruned_loss=0.0578, over 4275014.92 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:58:46,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-28 20:59:08,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2144706.0, ans=0.125 2023-06-28 20:59:32,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-28 21:00:18,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.96 vs. limit=22.5 2023-06-28 21:00:23,763 INFO [train.py:996] (1/4) Epoch 12, batch 22050, loss[loss=0.1952, simple_loss=0.2726, pruned_loss=0.05885, over 21667.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2706, pruned_loss=0.05964, over 4269555.04 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:00:40,614 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 7.125e+02 1.182e+03 1.630e+03 4.961e+03, threshold=2.364e+03, percent-clipped=13.0 2023-06-28 21:00:56,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-28 21:01:14,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2145066.0, ans=0.1 2023-06-28 21:01:27,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2145126.0, ans=0.0 2023-06-28 21:02:06,215 INFO [train.py:996] (1/4) Epoch 12, batch 22100, loss[loss=0.2248, simple_loss=0.3013, pruned_loss=0.07414, over 21216.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2803, pruned_loss=0.06421, over 4250205.73 frames. ], batch size: 143, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:02:10,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2145246.0, ans=0.125 2023-06-28 21:02:26,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2145306.0, ans=0.125 2023-06-28 21:02:28,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2145306.0, ans=0.125 2023-06-28 21:02:50,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2145366.0, ans=0.025 2023-06-28 21:03:19,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2145426.0, ans=0.1 2023-06-28 21:03:47,948 INFO [train.py:996] (1/4) Epoch 12, batch 22150, loss[loss=0.1995, simple_loss=0.2742, pruned_loss=0.06243, over 21692.00 frames. 
], tot_loss[loss=0.2069, simple_loss=0.283, pruned_loss=0.06536, over 4256084.46 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:04:01,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-28 21:04:04,053 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 8.832e+02 1.298e+03 1.809e+03 3.590e+03, threshold=2.596e+03, percent-clipped=11.0 2023-06-28 21:04:55,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-28 21:05:29,498 INFO [train.py:996] (1/4) Epoch 12, batch 22200, loss[loss=0.2295, simple_loss=0.3115, pruned_loss=0.07371, over 20042.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2842, pruned_loss=0.06632, over 4271012.56 frames. ], batch size: 702, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:05:40,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2145846.0, ans=0.0 2023-06-28 21:05:43,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2145846.0, ans=0.2 2023-06-28 21:06:25,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-28 21:07:17,289 INFO [train.py:996] (1/4) Epoch 12, batch 22250, loss[loss=0.2522, simple_loss=0.365, pruned_loss=0.06969, over 19791.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2928, pruned_loss=0.06796, over 4269450.87 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:07:26,255 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:07:29,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.277e+02 8.206e+02 1.186e+03 1.604e+03 3.301e+03, threshold=2.372e+03, percent-clipped=3.0 2023-06-28 21:08:02,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-28 21:08:09,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2146266.0, ans=0.125 2023-06-28 21:08:12,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2146266.0, ans=0.1 2023-06-28 21:08:42,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2146386.0, ans=0.125 2023-06-28 21:08:42,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2146386.0, ans=0.125 2023-06-28 21:08:47,236 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:08:57,733 INFO [train.py:996] (1/4) Epoch 12, batch 22300, loss[loss=0.2089, simple_loss=0.2899, pruned_loss=0.06398, over 21897.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2958, pruned_loss=0.07036, over 4272796.78 frames. 
], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:09:24,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2146506.0, ans=0.025 2023-06-28 21:09:30,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2146506.0, ans=0.125 2023-06-28 21:09:46,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2146566.0, ans=0.0 2023-06-28 21:10:15,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2146686.0, ans=0.0 2023-06-28 21:10:38,566 INFO [train.py:996] (1/4) Epoch 12, batch 22350, loss[loss=0.1781, simple_loss=0.261, pruned_loss=0.04764, over 21864.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2936, pruned_loss=0.07082, over 4281398.19 frames. ], batch size: 333, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:10:50,163 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.715e+02 7.662e+02 1.007e+03 1.656e+03 3.932e+03, threshold=2.013e+03, percent-clipped=14.0 2023-06-28 21:11:28,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2146866.0, ans=0.125 2023-06-28 21:12:09,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2146986.0, ans=0.0 2023-06-28 21:12:10,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2146986.0, ans=0.035 2023-06-28 21:12:20,272 INFO [train.py:996] (1/4) Epoch 12, batch 22400, loss[loss=0.2061, simple_loss=0.2793, pruned_loss=0.06644, over 21765.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2909, pruned_loss=0.06723, over 4283417.81 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:12:49,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2147106.0, ans=0.125 2023-06-28 21:13:20,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2147166.0, ans=0.1 2023-06-28 21:13:23,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2147226.0, ans=0.0 2023-06-28 21:14:05,221 INFO [train.py:996] (1/4) Epoch 12, batch 22450, loss[loss=0.1801, simple_loss=0.2537, pruned_loss=0.05326, over 21775.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2857, pruned_loss=0.06661, over 4278774.22 frames. 
], batch size: 351, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:14:12,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2147346.0, ans=0.125 2023-06-28 21:14:18,830 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.767e+02 6.974e+02 9.708e+02 1.486e+03 4.519e+03, threshold=1.942e+03, percent-clipped=14.0 2023-06-28 21:15:07,241 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:15:10,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2147526.0, ans=0.125 2023-06-28 21:15:14,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2147526.0, ans=0.0 2023-06-28 21:15:34,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2147586.0, ans=0.09899494936611666 2023-06-28 21:15:41,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2147586.0, ans=0.125 2023-06-28 21:15:47,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-28 21:15:48,357 INFO [train.py:996] (1/4) Epoch 12, batch 22500, loss[loss=0.2063, simple_loss=0.302, pruned_loss=0.0553, over 21660.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2808, pruned_loss=0.06581, over 4272681.81 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:15:50,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2147646.0, ans=0.0 2023-06-28 21:15:59,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=22.5 2023-06-28 21:17:31,319 INFO [train.py:996] (1/4) Epoch 12, batch 22550, loss[loss=0.2067, simple_loss=0.2875, pruned_loss=0.06294, over 21870.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2869, pruned_loss=0.06758, over 4269493.81 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:17:49,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.593e+02 9.385e+02 1.394e+03 1.973e+03 3.224e+03, threshold=2.788e+03, percent-clipped=25.0 2023-06-28 21:17:50,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-28 21:18:18,607 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-28 21:19:00,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2148186.0, ans=0.09899494936611666 2023-06-28 21:19:20,503 INFO [train.py:996] (1/4) Epoch 12, batch 22600, loss[loss=0.1815, simple_loss=0.2563, pruned_loss=0.05329, over 21623.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2886, pruned_loss=0.06754, over 4270442.14 frames. 
], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:19:55,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2148306.0, ans=10.0 2023-06-28 21:20:28,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-28 21:21:01,901 INFO [train.py:996] (1/4) Epoch 12, batch 22650, loss[loss=0.2359, simple_loss=0.2803, pruned_loss=0.09574, over 21355.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2846, pruned_loss=0.06689, over 4275452.02 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:21:14,879 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 9.650e+02 1.395e+03 1.973e+03 4.081e+03, threshold=2.791e+03, percent-clipped=9.0 2023-06-28 21:21:31,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2148606.0, ans=0.125 2023-06-28 21:21:39,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2148666.0, ans=0.09899494936611666 2023-06-28 21:21:41,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2148666.0, ans=0.05 2023-06-28 21:21:57,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2148666.0, ans=0.0 2023-06-28 21:22:21,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2148786.0, ans=0.0 2023-06-28 21:22:24,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2148786.0, ans=0.125 2023-06-28 21:22:35,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2148786.0, ans=0.0 2023-06-28 21:22:41,736 INFO [train.py:996] (1/4) Epoch 12, batch 22700, loss[loss=0.1826, simple_loss=0.2505, pruned_loss=0.05733, over 21565.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2782, pruned_loss=0.0659, over 4281923.73 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:23:28,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2148966.0, ans=0.125 2023-06-28 21:23:32,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2148966.0, ans=0.0 2023-06-28 21:24:00,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2149026.0, ans=0.125 2023-06-28 21:24:16,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2149086.0, ans=0.2 2023-06-28 21:24:24,412 INFO [train.py:996] (1/4) Epoch 12, batch 22750, loss[loss=0.3184, simple_loss=0.3612, pruned_loss=0.1378, over 21306.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2799, pruned_loss=0.06832, over 4267504.01 frames. 
], batch size: 507, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:24:37,891 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.797e+02 7.718e+02 1.201e+03 1.681e+03 3.626e+03, threshold=2.402e+03, percent-clipped=4.0 2023-06-28 21:25:57,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2149386.0, ans=15.0 2023-06-28 21:26:01,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2149386.0, ans=0.1 2023-06-28 21:26:05,754 INFO [train.py:996] (1/4) Epoch 12, batch 22800, loss[loss=0.2338, simple_loss=0.3171, pruned_loss=0.07522, over 21879.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2832, pruned_loss=0.06947, over 4276585.76 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:26:29,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2149506.0, ans=0.0 2023-06-28 21:27:19,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2149626.0, ans=0.125 2023-06-28 21:27:35,732 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-06-28 21:27:45,950 INFO [train.py:996] (1/4) Epoch 12, batch 22850, loss[loss=0.2123, simple_loss=0.2786, pruned_loss=0.07301, over 21652.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2802, pruned_loss=0.06894, over 4278095.23 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:27:53,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2149746.0, ans=0.125 2023-06-28 21:28:01,317 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.804e+02 7.642e+02 1.050e+03 1.882e+03 3.484e+03, threshold=2.099e+03, percent-clipped=13.0 2023-06-28 21:28:04,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2149806.0, ans=0.125 2023-06-28 21:28:25,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2149866.0, ans=0.1 2023-06-28 21:28:50,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2149926.0, ans=0.025 2023-06-28 21:29:22,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2149986.0, ans=0.125 2023-06-28 21:29:30,153 INFO [train.py:996] (1/4) Epoch 12, batch 22900, loss[loss=0.1999, simple_loss=0.3009, pruned_loss=0.04945, over 21722.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.281, pruned_loss=0.06768, over 4270626.98 frames. 
], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:30:09,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2150106.0, ans=0.125 2023-06-28 21:30:22,916 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:31:15,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2150286.0, ans=0.125 2023-06-28 21:31:17,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-28 21:31:19,856 INFO [train.py:996] (1/4) Epoch 12, batch 22950, loss[loss=0.2361, simple_loss=0.3478, pruned_loss=0.06216, over 21650.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2931, pruned_loss=0.06615, over 4267652.03 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:31:39,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.589e+02 9.756e+02 1.509e+03 2.315e+03 4.900e+03, threshold=3.017e+03, percent-clipped=30.0 2023-06-28 21:32:18,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2150466.0, ans=0.125 2023-06-28 21:32:19,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2150466.0, ans=0.125 2023-06-28 21:32:21,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-28 21:32:29,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2150526.0, ans=0.5 2023-06-28 21:32:29,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2150526.0, ans=0.125 2023-06-28 21:33:02,910 INFO [train.py:996] (1/4) Epoch 12, batch 23000, loss[loss=0.1798, simple_loss=0.2576, pruned_loss=0.05105, over 21094.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2952, pruned_loss=0.06472, over 4276194.53 frames. ], batch size: 608, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:33:05,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2150646.0, ans=0.125 2023-06-28 21:33:23,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2150706.0, ans=0.05 2023-06-28 21:33:35,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. 
limit=15.0 2023-06-28 21:33:38,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2150706.0, ans=0.1 2023-06-28 21:33:47,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2150766.0, ans=0.0 2023-06-28 21:34:16,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2150826.0, ans=0.1 2023-06-28 21:34:34,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-28 21:34:51,526 INFO [train.py:996] (1/4) Epoch 12, batch 23050, loss[loss=0.2501, simple_loss=0.3241, pruned_loss=0.08805, over 21810.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2969, pruned_loss=0.06688, over 4282082.42 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:35:10,956 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.719e+02 9.558e+02 1.419e+03 1.890e+03 3.669e+03, threshold=2.838e+03, percent-clipped=6.0 2023-06-28 21:35:27,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2151006.0, ans=0.2 2023-06-28 21:35:43,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2151066.0, ans=0.0 2023-06-28 21:36:03,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2151126.0, ans=0.1 2023-06-28 21:36:28,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2151186.0, ans=0.0 2023-06-28 21:36:28,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-28 21:36:31,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2151186.0, ans=0.125 2023-06-28 21:36:34,601 INFO [train.py:996] (1/4) Epoch 12, batch 23100, loss[loss=0.1846, simple_loss=0.2565, pruned_loss=0.05636, over 21662.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2927, pruned_loss=0.06766, over 4275034.99 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:37:32,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2151426.0, ans=0.125 2023-06-28 21:37:49,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2151486.0, ans=0.125 2023-06-28 21:38:13,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2151486.0, ans=0.1 2023-06-28 21:38:16,276 INFO [train.py:996] (1/4) Epoch 12, batch 23150, loss[loss=0.202, simple_loss=0.277, pruned_loss=0.0635, over 21846.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.286, pruned_loss=0.06626, over 4277109.58 frames. 
], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:38:30,888 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.983e+02 7.198e+02 1.006e+03 1.345e+03 2.860e+03, threshold=2.012e+03, percent-clipped=2.0 2023-06-28 21:39:00,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-28 21:39:16,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2151726.0, ans=0.1 2023-06-28 21:39:57,522 INFO [train.py:996] (1/4) Epoch 12, batch 23200, loss[loss=0.198, simple_loss=0.2747, pruned_loss=0.06068, over 21906.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2859, pruned_loss=0.06793, over 4289997.56 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:41:25,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2152086.0, ans=15.0 2023-06-28 21:41:38,931 INFO [train.py:996] (1/4) Epoch 12, batch 23250, loss[loss=0.2023, simple_loss=0.2698, pruned_loss=0.06742, over 21474.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2856, pruned_loss=0.06873, over 4297388.45 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:41:58,598 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.316e+02 9.370e+02 1.450e+03 2.114e+03 3.490e+03, threshold=2.900e+03, percent-clipped=30.0 2023-06-28 21:42:36,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2152266.0, ans=0.0 2023-06-28 21:42:47,019 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-28 21:42:51,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2152326.0, ans=0.2 2023-06-28 21:43:10,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2152386.0, ans=0.1 2023-06-28 21:43:22,234 INFO [train.py:996] (1/4) Epoch 12, batch 23300, loss[loss=0.3349, simple_loss=0.4245, pruned_loss=0.1227, over 21428.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2928, pruned_loss=0.07023, over 4299377.41 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:43:34,772 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-06-28 21:43:41,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2152446.0, ans=0.0 2023-06-28 21:44:33,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2152626.0, ans=0.125 2023-06-28 21:44:45,790 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:45:09,872 INFO [train.py:996] (1/4) Epoch 12, batch 23350, loss[loss=0.2607, simple_loss=0.348, pruned_loss=0.08665, over 21482.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2965, pruned_loss=0.06895, over 4297741.28 frames. 
], batch size: 471, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:45:17,623 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-28 21:45:33,281 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 1.010e+03 1.481e+03 2.093e+03 4.806e+03, threshold=2.962e+03, percent-clipped=5.0 2023-06-28 21:46:28,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=22.5 2023-06-28 21:46:30,706 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:46:51,352 INFO [train.py:996] (1/4) Epoch 12, batch 23400, loss[loss=0.1777, simple_loss=0.281, pruned_loss=0.03723, over 20749.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2915, pruned_loss=0.06629, over 4294289.89 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:47:10,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2153046.0, ans=0.1 2023-06-28 21:47:11,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2153106.0, ans=0.04949747468305833 2023-06-28 21:48:05,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2153226.0, ans=0.125 2023-06-28 21:48:38,218 INFO [train.py:996] (1/4) Epoch 12, batch 23450, loss[loss=0.2409, simple_loss=0.3173, pruned_loss=0.08228, over 21485.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2916, pruned_loss=0.06776, over 4298435.77 frames. ], batch size: 131, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:48:56,418 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.004e+02 7.180e+02 1.083e+03 1.740e+03 4.594e+03, threshold=2.165e+03, percent-clipped=4.0 2023-06-28 21:49:11,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=12.0 2023-06-28 21:50:19,189 INFO [train.py:996] (1/4) Epoch 12, batch 23500, loss[loss=0.2453, simple_loss=0.304, pruned_loss=0.09334, over 21606.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2914, pruned_loss=0.0695, over 4297206.35 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:50:29,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-28 21:50:34,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.62 vs. limit=10.0 2023-06-28 21:51:10,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2153766.0, ans=0.125 2023-06-28 21:51:19,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-28 21:51:27,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=12.0 2023-06-28 21:51:48,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. 
limit=15.0 2023-06-28 21:51:50,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2153886.0, ans=0.0 2023-06-28 21:51:56,099 INFO [train.py:996] (1/4) Epoch 12, batch 23550, loss[loss=0.2211, simple_loss=0.3455, pruned_loss=0.04837, over 19784.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2863, pruned_loss=0.06913, over 4288784.59 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:52:18,944 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.122e+02 7.386e+02 1.223e+03 1.985e+03 5.110e+03, threshold=2.446e+03, percent-clipped=21.0 2023-06-28 21:52:29,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2154006.0, ans=0.2 2023-06-28 21:53:39,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2154186.0, ans=0.0 2023-06-28 21:53:43,334 INFO [train.py:996] (1/4) Epoch 12, batch 23600, loss[loss=0.2237, simple_loss=0.294, pruned_loss=0.07668, over 21415.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2872, pruned_loss=0.06879, over 4285538.26 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:54:04,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2154306.0, ans=0.2 2023-06-28 21:54:22,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2154366.0, ans=0.125 2023-06-28 21:55:20,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2154486.0, ans=0.125 2023-06-28 21:55:26,537 INFO [train.py:996] (1/4) Epoch 12, batch 23650, loss[loss=0.1641, simple_loss=0.2389, pruned_loss=0.04468, over 16729.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2877, pruned_loss=0.06728, over 4281417.78 frames. ], batch size: 61, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:55:50,248 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.631e+02 9.498e+02 1.627e+03 2.545e+03 5.743e+03, threshold=3.254e+03, percent-clipped=28.0 2023-06-28 21:56:32,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2154726.0, ans=0.1 2023-06-28 21:57:04,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2154786.0, ans=0.0 2023-06-28 21:57:10,336 INFO [train.py:996] (1/4) Epoch 12, batch 23700, loss[loss=0.2477, simple_loss=0.3209, pruned_loss=0.08724, over 21394.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2898, pruned_loss=0.06701, over 4281542.57 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:57:11,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2154846.0, ans=0.1 2023-06-28 21:57:25,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2154846.0, ans=0.125 2023-06-28 21:58:32,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-28 21:58:58,952 INFO [train.py:996] (1/4) Epoch 12, batch 23750, loss[loss=0.2031, simple_loss=0.2999, pruned_loss=0.05311, over 21705.00 frames. 
], tot_loss[loss=0.2131, simple_loss=0.2924, pruned_loss=0.06689, over 4271427.38 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:59:16,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2155146.0, ans=0.0 2023-06-28 21:59:21,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.921e+02 7.434e+02 9.463e+02 1.338e+03 4.159e+03, threshold=1.893e+03, percent-clipped=3.0 2023-06-28 21:59:53,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2155266.0, ans=0.125 2023-06-28 21:59:56,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2155266.0, ans=0.0 2023-06-28 22:00:08,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2155326.0, ans=0.0 2023-06-28 22:00:47,742 INFO [train.py:996] (1/4) Epoch 12, batch 23800, loss[loss=0.2353, simple_loss=0.3292, pruned_loss=0.07068, over 21712.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2902, pruned_loss=0.0652, over 4273415.12 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:01:09,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2155506.0, ans=0.2 2023-06-28 22:01:50,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2155626.0, ans=0.125 2023-06-28 22:02:36,536 INFO [train.py:996] (1/4) Epoch 12, batch 23850, loss[loss=0.2304, simple_loss=0.3046, pruned_loss=0.07811, over 21448.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2994, pruned_loss=0.06761, over 4271819.43 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:03:01,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 9.558e+02 1.642e+03 2.659e+03 5.260e+03, threshold=3.284e+03, percent-clipped=38.0 2023-06-28 22:03:26,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2155866.0, ans=0.1 2023-06-28 22:03:37,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2155926.0, ans=0.0 2023-06-28 22:03:50,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2155926.0, ans=0.1 2023-06-28 22:03:55,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2155926.0, ans=0.125 2023-06-28 22:04:19,077 INFO [train.py:996] (1/4) Epoch 12, batch 23900, loss[loss=0.227, simple_loss=0.3071, pruned_loss=0.07347, over 21585.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3065, pruned_loss=0.06974, over 4264836.48 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:04:22,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2156046.0, ans=0.2 2023-06-28 22:04:40,344 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:05:28,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. 
limit=15.0 2023-06-28 22:06:00,802 INFO [train.py:996] (1/4) Epoch 12, batch 23950, loss[loss=0.2067, simple_loss=0.2849, pruned_loss=0.06423, over 21721.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2994, pruned_loss=0.06915, over 4259990.21 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:06:19,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2156346.0, ans=0.125 2023-06-28 22:06:25,836 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 6.890e+02 9.042e+02 1.238e+03 2.308e+03, threshold=1.808e+03, percent-clipped=0.0 2023-06-28 22:07:04,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2156526.0, ans=0.1 2023-06-28 22:07:48,526 INFO [train.py:996] (1/4) Epoch 12, batch 24000, loss[loss=0.1806, simple_loss=0.2394, pruned_loss=0.06093, over 20195.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3005, pruned_loss=0.07207, over 4260698.71 frames. ], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:07:48,526 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 22:08:05,128 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.264, simple_loss=0.3553, pruned_loss=0.08634, over 1796401.00 frames. 2023-06-28 22:08:05,129 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 22:08:24,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=12.0 2023-06-28 22:09:12,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2156826.0, ans=0.125 2023-06-28 22:09:37,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-06-28 22:09:49,067 INFO [train.py:996] (1/4) Epoch 12, batch 24050, loss[loss=0.192, simple_loss=0.2851, pruned_loss=0.0495, over 21846.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3027, pruned_loss=0.07227, over 4264565.82 frames. 
], batch size: 371, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:10:06,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2157006.0, ans=0.125 2023-06-28 22:10:14,177 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.391e+02 8.286e+02 1.353e+03 2.052e+03 4.335e+03, threshold=2.707e+03, percent-clipped=33.0 2023-06-28 22:10:37,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2157066.0, ans=0.125 2023-06-28 22:11:04,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2157126.0, ans=0.125 2023-06-28 22:11:10,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2157126.0, ans=0.0 2023-06-28 22:11:18,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2157186.0, ans=0.1 2023-06-28 22:11:26,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2157186.0, ans=0.125 2023-06-28 22:11:31,511 INFO [train.py:996] (1/4) Epoch 12, batch 24100, loss[loss=0.236, simple_loss=0.3281, pruned_loss=0.07193, over 21766.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3017, pruned_loss=0.07051, over 4265574.88 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:11:40,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2157246.0, ans=0.125 2023-06-28 22:12:02,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=2157306.0, ans=0.5 2023-06-28 22:12:11,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2157306.0, ans=0.1 2023-06-28 22:12:26,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2157366.0, ans=0.125 2023-06-28 22:12:52,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2157426.0, ans=0.1 2023-06-28 22:13:01,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2157486.0, ans=0.125 2023-06-28 22:13:06,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2157486.0, ans=0.95 2023-06-28 22:13:12,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-28 22:13:13,308 INFO [train.py:996] (1/4) Epoch 12, batch 24150, loss[loss=0.2075, simple_loss=0.2758, pruned_loss=0.06964, over 21866.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.301, pruned_loss=0.07159, over 4270766.25 frames. 
], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:13:43,011 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.450e+02 8.470e+02 1.133e+03 1.588e+03 3.416e+03, threshold=2.267e+03, percent-clipped=5.0 2023-06-28 22:13:53,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2157606.0, ans=0.125 2023-06-28 22:14:04,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-06-28 22:14:16,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2157666.0, ans=0.125 2023-06-28 22:14:21,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2157726.0, ans=0.125 2023-06-28 22:14:41,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2157786.0, ans=0.1 2023-06-28 22:14:45,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2157786.0, ans=0.0 2023-06-28 22:14:49,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2157786.0, ans=0.125 2023-06-28 22:14:49,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-28 22:14:56,849 INFO [train.py:996] (1/4) Epoch 12, batch 24200, loss[loss=0.2297, simple_loss=0.3258, pruned_loss=0.06685, over 21603.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3024, pruned_loss=0.07229, over 4280010.75 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:15:14,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2157846.0, ans=0.2 2023-06-28 22:15:28,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2157906.0, ans=0.09899494936611666 2023-06-28 22:16:08,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2158026.0, ans=0.125 2023-06-28 22:16:26,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2158086.0, ans=0.1 2023-06-28 22:16:47,972 INFO [train.py:996] (1/4) Epoch 12, batch 24250, loss[loss=0.2115, simple_loss=0.3131, pruned_loss=0.05493, over 21490.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2992, pruned_loss=0.06726, over 4272439.42 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:17:07,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2158146.0, ans=0.125 2023-06-28 22:17:17,819 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.733e+02 8.184e+02 1.120e+03 1.541e+03 3.593e+03, threshold=2.240e+03, percent-clipped=10.0 2023-06-28 22:17:33,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2158266.0, ans=0.0 2023-06-28 22:18:20,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. 
limit=15.0 2023-06-28 22:18:31,293 INFO [train.py:996] (1/4) Epoch 12, batch 24300, loss[loss=0.1494, simple_loss=0.2356, pruned_loss=0.03159, over 21758.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2943, pruned_loss=0.06269, over 4274807.11 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:19:21,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2158566.0, ans=0.2 2023-06-28 22:19:33,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2158626.0, ans=0.125 2023-06-28 22:20:13,787 INFO [train.py:996] (1/4) Epoch 12, batch 24350, loss[loss=0.2152, simple_loss=0.2885, pruned_loss=0.07099, over 21793.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2909, pruned_loss=0.06236, over 4277486.08 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:20:14,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-28 22:20:34,879 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-28 22:20:38,542 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-28 22:20:38,902 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 7.403e+02 1.076e+03 1.597e+03 3.002e+03, threshold=2.153e+03, percent-clipped=3.0 2023-06-28 22:21:12,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=12.0 2023-06-28 22:21:46,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2158986.0, ans=0.125 2023-06-28 22:21:52,106 INFO [train.py:996] (1/4) Epoch 12, batch 24400, loss[loss=0.1642, simple_loss=0.2376, pruned_loss=0.04541, over 21790.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.296, pruned_loss=0.06562, over 4276069.03 frames. ], batch size: 102, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:21:56,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-28 22:22:31,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=2159166.0, ans=0.2 2023-06-28 22:22:50,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2159226.0, ans=0.125 2023-06-28 22:22:55,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2159226.0, ans=0.125 2023-06-28 22:22:55,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2159226.0, ans=0.04949747468305833 2023-06-28 22:23:39,914 INFO [train.py:996] (1/4) Epoch 12, batch 24450, loss[loss=0.2033, simple_loss=0.2893, pruned_loss=0.05867, over 21688.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2962, pruned_loss=0.06659, over 4280108.76 frames. 
], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:24:01,303 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.221e+02 9.735e+02 1.433e+03 2.433e+03 5.313e+03, threshold=2.865e+03, percent-clipped=29.0 2023-06-28 22:25:22,583 INFO [train.py:996] (1/4) Epoch 12, batch 24500, loss[loss=0.2173, simple_loss=0.2938, pruned_loss=0.07042, over 21860.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2957, pruned_loss=0.06597, over 4286257.50 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:25:33,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2159646.0, ans=0.2 2023-06-28 22:26:32,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2159826.0, ans=0.125 2023-06-28 22:27:04,796 INFO [train.py:996] (1/4) Epoch 12, batch 24550, loss[loss=0.2425, simple_loss=0.3137, pruned_loss=0.08568, over 21330.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2992, pruned_loss=0.06901, over 4290452.30 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:27:28,296 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 8.632e+02 1.069e+03 1.677e+03 3.577e+03, threshold=2.139e+03, percent-clipped=6.0 2023-06-28 22:28:03,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-28 22:28:28,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=15.0 2023-06-28 22:28:48,545 INFO [train.py:996] (1/4) Epoch 12, batch 24600, loss[loss=0.1763, simple_loss=0.2451, pruned_loss=0.05381, over 21452.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.295, pruned_loss=0.06813, over 4277171.03 frames. ], batch size: 212, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:28:51,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2160246.0, ans=0.0 2023-06-28 22:28:54,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2160246.0, ans=0.2 2023-06-28 22:29:04,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2160306.0, ans=0.2 2023-06-28 22:29:11,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-28 22:29:32,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2160366.0, ans=0.2 2023-06-28 22:29:47,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.20 vs. limit=15.0 2023-06-28 22:30:20,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2160486.0, ans=0.125 2023-06-28 22:30:32,029 INFO [train.py:996] (1/4) Epoch 12, batch 24650, loss[loss=0.1703, simple_loss=0.2313, pruned_loss=0.05462, over 21281.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2871, pruned_loss=0.06741, over 4277106.57 frames. 
], batch size: 551, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:30:53,461 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 9.210e+02 1.420e+03 2.040e+03 4.110e+03, threshold=2.841e+03, percent-clipped=23.0 2023-06-28 22:31:08,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.74 vs. limit=10.0 2023-06-28 22:32:03,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2160786.0, ans=0.0 2023-06-28 22:32:13,588 INFO [train.py:996] (1/4) Epoch 12, batch 24700, loss[loss=0.1994, simple_loss=0.2737, pruned_loss=0.0625, over 21627.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2849, pruned_loss=0.06641, over 4277620.40 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:32:14,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0 2023-06-28 22:32:17,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2160846.0, ans=0.125 2023-06-28 22:32:23,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2160846.0, ans=0.125 2023-06-28 22:33:23,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-28 22:33:28,738 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.19 vs. limit=22.5 2023-06-28 22:33:33,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-28 22:33:41,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2161086.0, ans=0.125 2023-06-28 22:33:54,738 INFO [train.py:996] (1/4) Epoch 12, batch 24750, loss[loss=0.1744, simple_loss=0.2505, pruned_loss=0.04914, over 21772.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2784, pruned_loss=0.06439, over 4274308.58 frames. ], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:34:15,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.049e+02 6.504e+02 9.325e+02 1.249e+03 2.794e+03, threshold=1.865e+03, percent-clipped=0.0 2023-06-28 22:34:22,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-28 22:35:09,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2161326.0, ans=0.1 2023-06-28 22:35:10,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2161326.0, ans=0.0 2023-06-28 22:35:35,293 INFO [train.py:996] (1/4) Epoch 12, batch 24800, loss[loss=0.2028, simple_loss=0.2662, pruned_loss=0.06972, over 21514.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2729, pruned_loss=0.06335, over 4281476.07 frames. 
], batch size: 548, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:36:41,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2161626.0, ans=0.125 2023-06-28 22:37:19,232 INFO [train.py:996] (1/4) Epoch 12, batch 24850, loss[loss=0.1861, simple_loss=0.2591, pruned_loss=0.05651, over 21476.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.274, pruned_loss=0.0645, over 4283901.33 frames. ], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:37:25,690 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0 2023-06-28 22:37:38,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2161806.0, ans=0.125 2023-06-28 22:37:42,809 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.956e+02 8.268e+02 1.225e+03 1.737e+03 3.601e+03, threshold=2.449e+03, percent-clipped=20.0 2023-06-28 22:37:56,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2161866.0, ans=0.1 2023-06-28 22:38:00,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2161866.0, ans=0.125 2023-06-28 22:38:54,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2161986.0, ans=0.2 2023-06-28 22:39:01,936 INFO [train.py:996] (1/4) Epoch 12, batch 24900, loss[loss=0.2158, simple_loss=0.2962, pruned_loss=0.06771, over 21761.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2761, pruned_loss=0.06508, over 4282725.10 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:39:02,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2162046.0, ans=0.125 2023-06-28 22:39:41,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2162106.0, ans=0.125 2023-06-28 22:39:54,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.14 vs. limit=6.0 2023-06-28 22:40:17,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=8.0 2023-06-28 22:40:42,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2162286.0, ans=0.0 2023-06-28 22:40:43,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2162286.0, ans=0.125 2023-06-28 22:40:46,423 INFO [train.py:996] (1/4) Epoch 12, batch 24950, loss[loss=0.2526, simple_loss=0.3203, pruned_loss=0.0924, over 21411.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2835, pruned_loss=0.06884, over 4281472.50 frames. ], batch size: 549, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:41:06,170 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.01 vs. 
limit=10.0 2023-06-28 22:41:17,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2162406.0, ans=0.0 2023-06-28 22:41:20,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.276e+02 8.663e+02 1.354e+03 1.983e+03 3.739e+03, threshold=2.709e+03, percent-clipped=10.0 2023-06-28 22:41:28,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-28 22:41:40,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2162466.0, ans=0.125 2023-06-28 22:42:03,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2162526.0, ans=0.125 2023-06-28 22:42:10,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2162586.0, ans=0.1 2023-06-28 22:42:10,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2162586.0, ans=0.09899494936611666 2023-06-28 22:42:31,484 INFO [train.py:996] (1/4) Epoch 12, batch 25000, loss[loss=0.2142, simple_loss=0.2914, pruned_loss=0.06851, over 21537.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2899, pruned_loss=0.07019, over 4268675.52 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:42:32,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2162646.0, ans=0.125 2023-06-28 22:42:37,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2162646.0, ans=0.025 2023-06-28 22:42:46,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2162646.0, ans=0.2 2023-06-28 22:43:12,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2162706.0, ans=0.125 2023-06-28 22:43:22,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2162766.0, ans=0.04949747468305833 2023-06-28 22:43:32,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2162766.0, ans=0.1 2023-06-28 22:43:44,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-28 22:44:12,540 INFO [train.py:996] (1/4) Epoch 12, batch 25050, loss[loss=0.1798, simple_loss=0.2551, pruned_loss=0.0523, over 21836.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2838, pruned_loss=0.06854, over 4272205.96 frames. 
], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:44:16,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2162946.0, ans=0.1 2023-06-28 22:44:49,733 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.916e+02 6.443e+02 9.220e+02 1.309e+03 4.556e+03, threshold=1.844e+03, percent-clipped=4.0 2023-06-28 22:45:11,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2163066.0, ans=0.125 2023-06-28 22:45:45,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-28 22:45:54,124 INFO [train.py:996] (1/4) Epoch 12, batch 25100, loss[loss=0.1958, simple_loss=0.2582, pruned_loss=0.06664, over 21737.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2791, pruned_loss=0.06737, over 4282353.36 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:46:03,572 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-28 22:46:24,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2163306.0, ans=0.125 2023-06-28 22:47:12,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2163486.0, ans=0.125 2023-06-28 22:47:30,133 INFO [train.py:996] (1/4) Epoch 12, batch 25150, loss[loss=0.2247, simple_loss=0.3082, pruned_loss=0.0706, over 21802.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.282, pruned_loss=0.06511, over 4270684.05 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:48:07,817 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.073e+02 7.241e+02 9.101e+02 1.469e+03 3.331e+03, threshold=1.820e+03, percent-clipped=11.0 2023-06-28 22:48:48,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2163726.0, ans=0.125 2023-06-28 22:48:56,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2163786.0, ans=0.1 2023-06-28 22:49:06,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2163786.0, ans=0.125 2023-06-28 22:49:08,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2163786.0, ans=0.1 2023-06-28 22:49:12,471 INFO [train.py:996] (1/4) Epoch 12, batch 25200, loss[loss=0.1898, simple_loss=0.2901, pruned_loss=0.0448, over 21666.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2815, pruned_loss=0.06337, over 4273311.91 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:49:24,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2163846.0, ans=0.125 2023-06-28 22:50:18,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2163966.0, ans=0.0 2023-06-28 22:50:47,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. 
limit=15.0 2023-06-28 22:50:54,691 INFO [train.py:996] (1/4) Epoch 12, batch 25250, loss[loss=0.1906, simple_loss=0.2595, pruned_loss=0.06088, over 21369.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2806, pruned_loss=0.06168, over 4274379.97 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 22:50:55,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2164146.0, ans=0.125 2023-06-28 22:51:25,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2164206.0, ans=0.015 2023-06-28 22:51:27,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2164206.0, ans=0.125 2023-06-28 22:51:33,364 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.757e+02 8.180e+02 1.142e+03 1.720e+03 2.915e+03, threshold=2.285e+03, percent-clipped=21.0 2023-06-28 22:51:41,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.41 vs. limit=15.0 2023-06-28 22:51:50,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2164266.0, ans=0.125 2023-06-28 22:52:06,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2164326.0, ans=0.2 2023-06-28 22:52:36,301 INFO [train.py:996] (1/4) Epoch 12, batch 25300, loss[loss=0.2251, simple_loss=0.3077, pruned_loss=0.07129, over 21452.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2779, pruned_loss=0.06162, over 4256206.60 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:53:36,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2164566.0, ans=0.125 2023-06-28 22:53:59,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2164686.0, ans=0.0 2023-06-28 22:54:12,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2164686.0, ans=0.125 2023-06-28 22:54:22,171 INFO [train.py:996] (1/4) Epoch 12, batch 25350, loss[loss=0.1841, simple_loss=0.2677, pruned_loss=0.05021, over 21706.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2822, pruned_loss=0.06192, over 4247654.21 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:54:42,027 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:54:55,963 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 8.129e+02 1.301e+03 1.964e+03 4.138e+03, threshold=2.601e+03, percent-clipped=21.0 2023-06-28 22:55:09,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. 
limit=15.0 2023-06-28 22:55:30,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2164926.0, ans=0.0 2023-06-28 22:55:38,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2164926.0, ans=0.125 2023-06-28 22:55:48,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2164986.0, ans=0.0 2023-06-28 22:55:57,308 INFO [train.py:996] (1/4) Epoch 12, batch 25400, loss[loss=0.1828, simple_loss=0.2429, pruned_loss=0.06135, over 21204.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2769, pruned_loss=0.06048, over 4243246.25 frames. ], batch size: 549, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:56:19,871 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:56:21,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2165106.0, ans=0.1 2023-06-28 22:56:30,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0 2023-06-28 22:56:41,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-28 22:57:29,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2165286.0, ans=0.1 2023-06-28 22:57:37,576 INFO [train.py:996] (1/4) Epoch 12, batch 25450, loss[loss=0.2111, simple_loss=0.3064, pruned_loss=0.05796, over 21597.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2767, pruned_loss=0.06158, over 4233861.07 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:58:11,894 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 9.465e+02 1.393e+03 2.029e+03 3.933e+03, threshold=2.786e+03, percent-clipped=12.0 2023-06-28 22:58:32,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-28 22:58:48,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=12.0 2023-06-28 22:58:48,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2165526.0, ans=0.125 2023-06-28 22:59:04,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2165586.0, ans=0.0 2023-06-28 22:59:09,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2165586.0, ans=0.125 2023-06-28 22:59:24,500 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:59:25,649 INFO [train.py:996] (1/4) Epoch 12, batch 25500, loss[loss=0.1685, simple_loss=0.2771, pruned_loss=0.03, over 20873.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2774, pruned_loss=0.05873, over 4240455.22 frames. 
], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:00:28,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2165826.0, ans=0.125 2023-06-28 23:01:00,382 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-28 23:01:12,020 INFO [train.py:996] (1/4) Epoch 12, batch 25550, loss[loss=0.2046, simple_loss=0.3043, pruned_loss=0.05248, over 21661.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2856, pruned_loss=0.06009, over 4245290.39 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:01:15,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2165946.0, ans=0.0 2023-06-28 23:01:46,810 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.170e+02 8.424e+02 1.256e+03 1.965e+03 3.448e+03, threshold=2.512e+03, percent-clipped=4.0 2023-06-28 23:02:10,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2166126.0, ans=0.04949747468305833 2023-06-28 23:02:25,840 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-28 23:02:58,675 INFO [train.py:996] (1/4) Epoch 12, batch 25600, loss[loss=0.2387, simple_loss=0.3283, pruned_loss=0.0746, over 21415.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2892, pruned_loss=0.06093, over 4258393.74 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:04:24,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2166486.0, ans=0.125 2023-06-28 23:04:30,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2166486.0, ans=0.125 2023-06-28 23:04:39,577 INFO [train.py:996] (1/4) Epoch 12, batch 25650, loss[loss=0.2, simple_loss=0.2646, pruned_loss=0.06768, over 21871.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.289, pruned_loss=0.06269, over 4258876.95 frames. ], batch size: 373, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:04:59,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2166606.0, ans=0.04949747468305833 2023-06-28 23:05:10,124 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.401e+02 8.404e+02 1.162e+03 1.787e+03 4.210e+03, threshold=2.325e+03, percent-clipped=7.0 2023-06-28 23:06:08,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2166786.0, ans=0.125 2023-06-28 23:06:19,764 INFO [train.py:996] (1/4) Epoch 12, batch 25700, loss[loss=0.2097, simple_loss=0.2972, pruned_loss=0.06103, over 21732.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2868, pruned_loss=0.06313, over 4251078.33 frames. 
], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:07:02,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2166966.0, ans=10.0 2023-06-28 23:07:09,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2166966.0, ans=0.0 2023-06-28 23:07:21,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2167026.0, ans=0.125 2023-06-28 23:07:54,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2167086.0, ans=0.125 2023-06-28 23:08:07,858 INFO [train.py:996] (1/4) Epoch 12, batch 25750, loss[loss=0.1955, simple_loss=0.2596, pruned_loss=0.06567, over 21168.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2896, pruned_loss=0.06566, over 4258869.99 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:08:10,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.17 vs. limit=12.0 2023-06-28 23:08:13,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2167146.0, ans=0.0 2023-06-28 23:08:39,821 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.094e+02 7.467e+02 1.127e+03 1.693e+03 5.779e+03, threshold=2.254e+03, percent-clipped=13.0 2023-06-28 23:09:02,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2167266.0, ans=0.125 2023-06-28 23:09:13,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2167326.0, ans=0.2 2023-06-28 23:09:21,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2167326.0, ans=0.125 2023-06-28 23:09:30,542 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-28 23:09:33,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2167386.0, ans=0.125 2023-06-28 23:09:40,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2167386.0, ans=0.1 2023-06-28 23:09:43,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2167386.0, ans=0.0 2023-06-28 23:09:51,796 INFO [train.py:996] (1/4) Epoch 12, batch 25800, loss[loss=0.217, simple_loss=0.2989, pruned_loss=0.06752, over 21603.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3009, pruned_loss=0.07008, over 4258354.57 frames. 
], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:10:44,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2167566.0, ans=0.2 2023-06-28 23:11:05,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2167626.0, ans=0.2 2023-06-28 23:11:21,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2167686.0, ans=0.1 2023-06-28 23:11:33,286 INFO [train.py:996] (1/4) Epoch 12, batch 25850, loss[loss=0.2043, simple_loss=0.2793, pruned_loss=0.06466, over 21716.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3031, pruned_loss=0.07049, over 4264038.94 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:12:08,744 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.172e+02 7.676e+02 1.093e+03 1.751e+03 3.507e+03, threshold=2.187e+03, percent-clipped=11.0 2023-06-28 23:13:00,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2167986.0, ans=0.2 2023-06-28 23:13:23,852 INFO [train.py:996] (1/4) Epoch 12, batch 25900, loss[loss=0.2566, simple_loss=0.353, pruned_loss=0.08006, over 21856.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3065, pruned_loss=0.07203, over 4270915.35 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:13:32,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2168046.0, ans=0.125 2023-06-28 23:14:00,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2168106.0, ans=0.125 2023-06-28 23:14:17,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2168166.0, ans=10.0 2023-06-28 23:14:25,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2168226.0, ans=0.125 2023-06-28 23:15:07,746 INFO [train.py:996] (1/4) Epoch 12, batch 25950, loss[loss=0.2133, simple_loss=0.3066, pruned_loss=0.06003, over 21785.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3114, pruned_loss=0.07396, over 4264804.99 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:15:35,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2168406.0, ans=0.125 2023-06-28 23:15:43,674 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.181e+02 7.724e+02 1.093e+03 1.792e+03 4.212e+03, threshold=2.186e+03, percent-clipped=19.0 2023-06-28 23:16:33,294 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:16:54,131 INFO [train.py:996] (1/4) Epoch 12, batch 26000, loss[loss=0.1855, simple_loss=0.276, pruned_loss=0.04754, over 21212.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3124, pruned_loss=0.07278, over 4265551.23 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:16:56,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2168646.0, ans=0.125 2023-06-28 23:17:28,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. 
limit=15.0 2023-06-28 23:17:46,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2168766.0, ans=0.125 2023-06-28 23:17:59,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-28 23:18:36,003 INFO [train.py:996] (1/4) Epoch 12, batch 26050, loss[loss=0.2369, simple_loss=0.3155, pruned_loss=0.07919, over 21816.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.312, pruned_loss=0.07293, over 4260680.00 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:18:51,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2168946.0, ans=0.125 2023-06-28 23:19:00,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2169006.0, ans=0.0 2023-06-28 23:19:08,203 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.201e+02 7.488e+02 9.436e+02 1.230e+03 3.511e+03, threshold=1.887e+03, percent-clipped=1.0 2023-06-28 23:20:16,747 INFO [train.py:996] (1/4) Epoch 12, batch 26100, loss[loss=0.2351, simple_loss=0.306, pruned_loss=0.08207, over 21847.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3061, pruned_loss=0.07286, over 4268624.98 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:20:50,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2169306.0, ans=0.2 2023-06-28 23:22:03,471 INFO [train.py:996] (1/4) Epoch 12, batch 26150, loss[loss=0.2011, simple_loss=0.2643, pruned_loss=0.06895, over 20095.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3026, pruned_loss=0.07299, over 4280635.29 frames. ], batch size: 703, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:22:15,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2169546.0, ans=0.125 2023-06-28 23:22:31,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.410e+02 8.718e+02 1.214e+03 1.632e+03 3.208e+03, threshold=2.428e+03, percent-clipped=15.0 2023-06-28 23:23:32,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-28 23:23:44,679 INFO [train.py:996] (1/4) Epoch 12, batch 26200, loss[loss=0.2125, simple_loss=0.2831, pruned_loss=0.07093, over 20028.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3028, pruned_loss=0.07135, over 4284281.58 frames. 
], batch size: 703, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:24:00,115 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:24:43,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170026.0, ans=0.1 2023-06-28 23:25:16,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2170086.0, ans=0.125 2023-06-28 23:25:18,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170086.0, ans=0.1 2023-06-28 23:25:23,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2170086.0, ans=0.0 2023-06-28 23:25:25,902 INFO [train.py:996] (1/4) Epoch 12, batch 26250, loss[loss=0.2113, simple_loss=0.2938, pruned_loss=0.06437, over 21845.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3073, pruned_loss=0.07036, over 4285820.80 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:25:29,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170146.0, ans=0.1 2023-06-28 23:25:41,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=22.5 2023-06-28 23:25:52,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.372e+02 9.468e+02 1.366e+03 2.102e+03 4.403e+03, threshold=2.732e+03, percent-clipped=13.0 2023-06-28 23:26:09,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2170266.0, ans=0.125 2023-06-28 23:26:11,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2170266.0, ans=0.125 2023-06-28 23:27:01,332 INFO [train.py:996] (1/4) Epoch 12, batch 26300, loss[loss=0.1977, simple_loss=0.2743, pruned_loss=0.06057, over 21847.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3038, pruned_loss=0.07056, over 4292777.95 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:27:35,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2170506.0, ans=0.125 2023-06-28 23:28:31,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2170686.0, ans=0.0 2023-06-28 23:28:42,457 INFO [train.py:996] (1/4) Epoch 12, batch 26350, loss[loss=0.2614, simple_loss=0.3276, pruned_loss=0.09765, over 21776.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3006, pruned_loss=0.0702, over 4289849.53 frames. 
], batch size: 441, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:28:46,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2170746.0, ans=0.125 2023-06-28 23:29:19,232 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.451e+02 7.926e+02 1.139e+03 2.111e+03 4.700e+03, threshold=2.277e+03, percent-clipped=11.0 2023-06-28 23:29:22,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2170866.0, ans=0.0 2023-06-28 23:29:28,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2170866.0, ans=0.125 2023-06-28 23:29:44,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2170926.0, ans=0.125 2023-06-28 23:29:52,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2170926.0, ans=0.125 2023-06-28 23:30:23,056 INFO [train.py:996] (1/4) Epoch 12, batch 26400, loss[loss=0.2183, simple_loss=0.266, pruned_loss=0.08531, over 21265.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2958, pruned_loss=0.07037, over 4288394.52 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:31:02,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2171106.0, ans=0.125 2023-06-28 23:31:04,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-28 23:31:24,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2171166.0, ans=0.0 2023-06-28 23:31:44,796 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:31:46,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2171226.0, ans=0.2 2023-06-28 23:32:16,582 INFO [train.py:996] (1/4) Epoch 12, batch 26450, loss[loss=0.2462, simple_loss=0.3436, pruned_loss=0.07437, over 21850.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2949, pruned_loss=0.06995, over 4279641.57 frames. ], batch size: 317, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:32:21,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.26 vs. limit=15.0 2023-06-28 23:32:47,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2171406.0, ans=0.125 2023-06-28 23:32:51,626 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.901e+02 9.721e+02 1.441e+03 2.127e+03 5.226e+03, threshold=2.882e+03, percent-clipped=23.0 2023-06-28 23:33:17,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2171526.0, ans=0.1 2023-06-28 23:33:59,929 INFO [train.py:996] (1/4) Epoch 12, batch 26500, loss[loss=0.2257, simple_loss=0.3114, pruned_loss=0.06999, over 21775.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2951, pruned_loss=0.06872, over 4274530.47 frames. 
], batch size: 332, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:34:05,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2171646.0, ans=0.0 2023-06-28 23:34:43,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-28 23:34:48,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.20 vs. limit=10.0 2023-06-28 23:35:15,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2171826.0, ans=0.1 2023-06-28 23:35:38,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2171886.0, ans=0.1 2023-06-28 23:35:47,899 INFO [train.py:996] (1/4) Epoch 12, batch 26550, loss[loss=0.1825, simple_loss=0.2793, pruned_loss=0.04283, over 21723.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.294, pruned_loss=0.06682, over 4274493.51 frames. ], batch size: 332, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:36:23,125 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.969e+02 7.971e+02 1.184e+03 2.245e+03 4.419e+03, threshold=2.369e+03, percent-clipped=15.0 2023-06-28 23:37:28,574 INFO [train.py:996] (1/4) Epoch 12, batch 26600, loss[loss=0.2155, simple_loss=0.2929, pruned_loss=0.06911, over 21485.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2917, pruned_loss=0.06383, over 4273920.74 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:38:17,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2172366.0, ans=0.1 2023-06-28 23:38:35,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2172426.0, ans=0.0 2023-06-28 23:38:41,805 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:39:04,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2172486.0, ans=0.0 2023-06-28 23:39:08,229 INFO [train.py:996] (1/4) Epoch 12, batch 26650, loss[loss=0.1541, simple_loss=0.2451, pruned_loss=0.0315, over 21803.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2854, pruned_loss=0.06237, over 4276524.84 frames. 
], batch size: 352, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:39:13,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2172546.0, ans=0.0 2023-06-28 23:39:45,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2172606.0, ans=0.2 2023-06-28 23:39:46,394 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.698e+02 6.768e+02 8.885e+02 1.234e+03 3.430e+03, threshold=1.777e+03, percent-clipped=1.0 2023-06-28 23:40:07,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2172666.0, ans=0.0 2023-06-28 23:40:22,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2172726.0, ans=0.07 2023-06-28 23:40:30,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-28 23:40:33,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2172786.0, ans=0.125 2023-06-28 23:40:42,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-28 23:40:52,109 INFO [train.py:996] (1/4) Epoch 12, batch 26700, loss[loss=0.2016, simple_loss=0.2816, pruned_loss=0.06075, over 21816.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.279, pruned_loss=0.06006, over 4271427.41 frames. ], batch size: 118, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:41:04,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2172846.0, ans=0.125 2023-06-28 23:41:04,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2172846.0, ans=10.0 2023-06-28 23:42:03,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2173026.0, ans=0.0 2023-06-28 23:42:33,655 INFO [train.py:996] (1/4) Epoch 12, batch 26750, loss[loss=0.2328, simple_loss=0.3134, pruned_loss=0.07613, over 21936.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2801, pruned_loss=0.05971, over 4277990.80 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:42:52,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2173146.0, ans=0.125 2023-06-28 23:43:05,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2173206.0, ans=0.125 2023-06-28 23:43:11,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2173206.0, ans=0.125 2023-06-28 23:43:12,966 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.416e+02 8.010e+02 1.094e+03 1.630e+03 3.819e+03, threshold=2.187e+03, percent-clipped=19.0 2023-06-28 23:43:16,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-28 23:43:33,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2173266.0, ans=0.0 2023-06-28 23:44:07,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-28 23:44:20,466 INFO [train.py:996] (1/4) Epoch 12, batch 26800, loss[loss=0.2427, simple_loss=0.3298, pruned_loss=0.07778, over 21461.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2873, pruned_loss=0.06344, over 4280397.93 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:44:32,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2173446.0, ans=0.0 2023-06-28 23:44:37,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2173446.0, ans=0.0 2023-06-28 23:44:38,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2173446.0, ans=0.0 2023-06-28 23:44:40,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2173506.0, ans=0.125 2023-06-28 23:44:52,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2173506.0, ans=0.0 2023-06-28 23:45:48,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2173686.0, ans=0.1 2023-06-28 23:45:50,467 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-28 23:46:05,362 INFO [train.py:996] (1/4) Epoch 12, batch 26850, loss[loss=0.2364, simple_loss=0.2791, pruned_loss=0.09687, over 21441.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2891, pruned_loss=0.06655, over 4283791.65 frames. ], batch size: 510, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:46:10,735 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:46:12,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2173746.0, ans=0.125 2023-06-28 23:46:17,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-28 23:46:22,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-28 23:46:40,822 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.776e+02 8.023e+02 1.160e+03 1.579e+03 4.505e+03, threshold=2.321e+03, percent-clipped=8.0 2023-06-28 23:46:57,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. 
limit=22.5 2023-06-28 23:47:11,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2173926.0, ans=0.125 2023-06-28 23:47:16,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2173926.0, ans=0.125 2023-06-28 23:47:19,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2173986.0, ans=0.125 2023-06-28 23:47:40,060 INFO [train.py:996] (1/4) Epoch 12, batch 26900, loss[loss=0.1973, simple_loss=0.2682, pruned_loss=0.06323, over 21891.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2819, pruned_loss=0.06578, over 4267678.25 frames. ], batch size: 125, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:48:05,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2174106.0, ans=0.125 2023-06-28 23:48:12,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2174106.0, ans=0.125 2023-06-28 23:48:29,749 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:48:52,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2174226.0, ans=0.125 2023-06-28 23:48:55,649 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:49:19,001 INFO [train.py:996] (1/4) Epoch 12, batch 26950, loss[loss=0.2483, simple_loss=0.3391, pruned_loss=0.0787, over 21596.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2814, pruned_loss=0.0658, over 4272244.11 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:49:54,867 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 6.986e+02 1.003e+03 1.529e+03 4.492e+03, threshold=2.006e+03, percent-clipped=11.0 2023-06-28 23:50:25,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2174526.0, ans=0.125 2023-06-28 23:50:48,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2174586.0, ans=0.125 2023-06-28 23:51:06,157 INFO [train.py:996] (1/4) Epoch 12, batch 27000, loss[loss=0.1813, simple_loss=0.2747, pruned_loss=0.04398, over 21702.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2821, pruned_loss=0.06421, over 4268240.83 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:51:06,158 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-28 23:51:22,019 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2512, simple_loss=0.3387, pruned_loss=0.08188, over 1796401.00 frames. 2023-06-28 23:51:22,020 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-28 23:51:28,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2174646.0, ans=0.125 2023-06-28 23:51:47,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. 
limit=6.0 2023-06-28 23:51:59,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2174706.0, ans=0.2 2023-06-28 23:52:04,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2174766.0, ans=0.2 2023-06-28 23:52:55,641 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.27 vs. limit=22.5 2023-06-28 23:53:03,874 INFO [train.py:996] (1/4) Epoch 12, batch 27050, loss[loss=0.2095, simple_loss=0.2983, pruned_loss=0.06036, over 21592.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2842, pruned_loss=0.06124, over 4268280.52 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:53:22,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2174946.0, ans=0.0 2023-06-28 23:53:44,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.952e+02 1.010e+03 1.463e+03 2.409e+03 4.686e+03, threshold=2.925e+03, percent-clipped=39.0 2023-06-28 23:53:45,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2175066.0, ans=0.0 2023-06-28 23:54:03,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2175066.0, ans=0.2 2023-06-28 23:54:36,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2175186.0, ans=0.2 2023-06-28 23:54:37,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2175186.0, ans=0.125 2023-06-28 23:54:43,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. limit=10.0 2023-06-28 23:54:45,815 INFO [train.py:996] (1/4) Epoch 12, batch 27100, loss[loss=0.1795, simple_loss=0.2971, pruned_loss=0.03097, over 19752.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2853, pruned_loss=0.06235, over 4267199.03 frames. ], batch size: 702, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:54:46,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2175246.0, ans=0.125 2023-06-28 23:55:17,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2175306.0, ans=0.125 2023-06-28 23:55:37,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2175366.0, ans=0.07 2023-06-28 23:55:39,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2175366.0, ans=0.0 2023-06-28 23:56:21,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2175486.0, ans=0.1 2023-06-28 23:56:34,066 INFO [train.py:996] (1/4) Epoch 12, batch 27150, loss[loss=0.2438, simple_loss=0.3422, pruned_loss=0.07269, over 21758.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2965, pruned_loss=0.06498, over 4269297.53 frames. 
], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:56:34,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2175546.0, ans=0.125 2023-06-28 23:57:10,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2175606.0, ans=0.0 2023-06-28 23:57:14,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.191e+02 8.496e+02 1.171e+03 1.771e+03 3.313e+03, threshold=2.341e+03, percent-clipped=5.0 2023-06-28 23:57:34,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2175726.0, ans=0.125 2023-06-28 23:57:47,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2175726.0, ans=0.125 2023-06-28 23:58:14,341 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:58:15,372 INFO [train.py:996] (1/4) Epoch 12, batch 27200, loss[loss=0.2478, simple_loss=0.3335, pruned_loss=0.08109, over 21739.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3051, pruned_loss=0.0674, over 4269696.19 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:58:45,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2175906.0, ans=0.125 2023-06-28 23:58:51,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2175906.0, ans=0.1 2023-06-28 23:58:55,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.23 vs. limit=22.5 2023-06-28 23:58:56,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2175966.0, ans=0.125 2023-06-28 23:59:43,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-29 00:00:01,697 INFO [train.py:996] (1/4) Epoch 12, batch 27250, loss[loss=0.2414, simple_loss=0.3198, pruned_loss=0.08147, over 21595.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3067, pruned_loss=0.07053, over 4272648.95 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:00:02,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2176146.0, ans=0.125 2023-06-29 00:00:14,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. 
limit=15.0 2023-06-29 00:00:17,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2176146.0, ans=0.125 2023-06-29 00:00:45,293 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.531e+02 9.436e+02 1.424e+03 2.260e+03 4.305e+03, threshold=2.849e+03, percent-clipped=22.0 2023-06-29 00:00:49,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2176266.0, ans=0.1 2023-06-29 00:01:40,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2176386.0, ans=0.125 2023-06-29 00:01:49,812 INFO [train.py:996] (1/4) Epoch 12, batch 27300, loss[loss=0.2209, simple_loss=0.3179, pruned_loss=0.06198, over 21927.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.309, pruned_loss=0.07215, over 4271710.99 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:02:20,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2176506.0, ans=0.125 2023-06-29 00:03:06,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-29 00:03:31,714 INFO [train.py:996] (1/4) Epoch 12, batch 27350, loss[loss=0.2188, simple_loss=0.3041, pruned_loss=0.06677, over 21691.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3113, pruned_loss=0.07274, over 4264014.12 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:03:41,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2176746.0, ans=0.0 2023-06-29 00:04:02,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2176806.0, ans=0.125 2023-06-29 00:04:05,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. limit=10.0 2023-06-29 00:04:13,498 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.353e+02 7.469e+02 1.032e+03 1.512e+03 4.171e+03, threshold=2.065e+03, percent-clipped=4.0 2023-06-29 00:04:14,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-29 00:04:40,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2176926.0, ans=0.035 2023-06-29 00:04:45,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2176926.0, ans=0.0 2023-06-29 00:04:46,065 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=22.5 2023-06-29 00:04:52,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-29 00:05:07,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.14 vs. 
limit=10.0 2023-06-29 00:05:15,179 INFO [train.py:996] (1/4) Epoch 12, batch 27400, loss[loss=0.2018, simple_loss=0.271, pruned_loss=0.06633, over 21671.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3069, pruned_loss=0.07228, over 4273194.89 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:05:24,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2177046.0, ans=0.0 2023-06-29 00:05:36,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-06-29 00:05:47,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.61 vs. limit=15.0 2023-06-29 00:05:56,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2177166.0, ans=0.0 2023-06-29 00:06:32,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2177226.0, ans=0.1 2023-06-29 00:06:55,574 INFO [train.py:996] (1/4) Epoch 12, batch 27450, loss[loss=0.2254, simple_loss=0.3114, pruned_loss=0.0697, over 21538.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3009, pruned_loss=0.07067, over 4270012.50 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:07:32,478 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.358e+02 7.847e+02 1.147e+03 1.584e+03 3.380e+03, threshold=2.294e+03, percent-clipped=11.0 2023-06-29 00:07:38,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2177466.0, ans=0.0 2023-06-29 00:07:51,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2177526.0, ans=0.0 2023-06-29 00:08:34,594 INFO [train.py:996] (1/4) Epoch 12, batch 27500, loss[loss=0.2075, simple_loss=0.2877, pruned_loss=0.06368, over 21557.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2986, pruned_loss=0.07043, over 4274887.73 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:10:09,231 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:10:15,300 INFO [train.py:996] (1/4) Epoch 12, batch 27550, loss[loss=0.1756, simple_loss=0.2588, pruned_loss=0.04618, over 21678.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2929, pruned_loss=0.06767, over 4283331.27 frames. 
], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:10:37,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2178006.0, ans=0.125 2023-06-29 00:10:47,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2178006.0, ans=0.0 2023-06-29 00:10:57,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.117e+02 1.004e+03 1.516e+03 2.430e+03 4.785e+03, threshold=3.032e+03, percent-clipped=27.0 2023-06-29 00:11:00,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2178066.0, ans=0.125 2023-06-29 00:11:13,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2178066.0, ans=0.0 2023-06-29 00:11:24,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2178126.0, ans=0.0 2023-06-29 00:11:54,697 INFO [train.py:996] (1/4) Epoch 12, batch 27600, loss[loss=0.1733, simple_loss=0.2396, pruned_loss=0.05354, over 21619.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2864, pruned_loss=0.06658, over 4268132.13 frames. ], batch size: 231, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:12:53,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2178366.0, ans=0.125 2023-06-29 00:13:17,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2178486.0, ans=0.2 2023-06-29 00:13:31,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2178486.0, ans=0.2 2023-06-29 00:13:34,132 INFO [train.py:996] (1/4) Epoch 12, batch 27650, loss[loss=0.1801, simple_loss=0.269, pruned_loss=0.04556, over 21862.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2823, pruned_loss=0.0661, over 4270104.55 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:14:03,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2178606.0, ans=0.125 2023-06-29 00:14:06,196 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. 
limit=22.5 2023-06-29 00:14:15,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2178666.0, ans=0.0 2023-06-29 00:14:17,724 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.725e+02 1.101e+03 1.627e+03 3.974e+03, threshold=2.201e+03, percent-clipped=3.0 2023-06-29 00:14:18,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2178666.0, ans=0.125 2023-06-29 00:14:44,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2178726.0, ans=0.125 2023-06-29 00:14:58,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2178786.0, ans=0.0 2023-06-29 00:15:04,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2178786.0, ans=0.125 2023-06-29 00:15:06,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-29 00:15:15,562 INFO [train.py:996] (1/4) Epoch 12, batch 27700, loss[loss=0.3206, simple_loss=0.386, pruned_loss=0.1276, over 21525.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2832, pruned_loss=0.06552, over 4278494.81 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:15:55,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2178906.0, ans=0.0 2023-06-29 00:16:11,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2178966.0, ans=0.125 2023-06-29 00:16:26,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2179026.0, ans=0.125 2023-06-29 00:16:53,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2179086.0, ans=0.1 2023-06-29 00:16:56,208 INFO [train.py:996] (1/4) Epoch 12, batch 27750, loss[loss=0.1922, simple_loss=0.2795, pruned_loss=0.05252, over 21650.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2859, pruned_loss=0.0645, over 4280857.83 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:17:03,568 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.59 vs. 
limit=22.5 2023-06-29 00:17:39,849 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.037e+02 8.775e+02 1.414e+03 2.124e+03 3.615e+03, threshold=2.828e+03, percent-clipped=21.0 2023-06-29 00:17:57,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2179326.0, ans=0.125 2023-06-29 00:18:12,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2179326.0, ans=0.0 2023-06-29 00:18:12,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2179326.0, ans=0.0 2023-06-29 00:18:27,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2179386.0, ans=0.1 2023-06-29 00:18:35,463 INFO [train.py:996] (1/4) Epoch 12, batch 27800, loss[loss=0.1823, simple_loss=0.2315, pruned_loss=0.06655, over 20425.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2843, pruned_loss=0.0645, over 4285148.11 frames. ], batch size: 703, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:18:37,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2179446.0, ans=0.2 2023-06-29 00:19:00,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.77 vs. limit=22.5 2023-06-29 00:20:16,294 INFO [train.py:996] (1/4) Epoch 12, batch 27850, loss[loss=0.2037, simple_loss=0.2709, pruned_loss=0.06821, over 21586.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2837, pruned_loss=0.06575, over 4285449.45 frames. ], batch size: 212, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:20:36,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2179806.0, ans=0.125 2023-06-29 00:21:00,408 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.968e+02 8.980e+02 1.586e+03 2.122e+03 3.865e+03, threshold=3.171e+03, percent-clipped=6.0 2023-06-29 00:21:14,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-29 00:22:03,471 INFO [train.py:996] (1/4) Epoch 12, batch 27900, loss[loss=0.1896, simple_loss=0.2777, pruned_loss=0.05077, over 21730.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2924, pruned_loss=0.06668, over 4282320.88 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:23:18,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2180226.0, ans=0.125 2023-06-29 00:23:43,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-29 00:23:51,646 INFO [train.py:996] (1/4) Epoch 12, batch 27950, loss[loss=0.197, simple_loss=0.2909, pruned_loss=0.05151, over 21703.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.293, pruned_loss=0.06415, over 4281156.20 frames. 
], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:24:35,524 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.187e+02 9.154e+02 1.408e+03 1.897e+03 4.005e+03, threshold=2.816e+03, percent-clipped=4.0 2023-06-29 00:24:58,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2180526.0, ans=0.1 2023-06-29 00:25:04,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2180526.0, ans=0.125 2023-06-29 00:25:06,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0 2023-06-29 00:25:31,890 INFO [train.py:996] (1/4) Epoch 12, batch 28000, loss[loss=0.2239, simple_loss=0.2913, pruned_loss=0.0783, over 21358.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2904, pruned_loss=0.0617, over 4287089.67 frames. ], batch size: 143, lr: 2.38e-03, grad_scale: 32.0 2023-06-29 00:26:00,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2180706.0, ans=0.1 2023-06-29 00:26:05,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2180706.0, ans=0.5 2023-06-29 00:26:25,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2180766.0, ans=0.0 2023-06-29 00:26:29,913 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-29 00:27:15,058 INFO [train.py:996] (1/4) Epoch 12, batch 28050, loss[loss=0.1845, simple_loss=0.2498, pruned_loss=0.05961, over 21822.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2874, pruned_loss=0.06319, over 4294221.41 frames. ], batch size: 118, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:27:36,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2181006.0, ans=0.125 2023-06-29 00:27:57,845 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:27:59,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2181066.0, ans=0.125 2023-06-29 00:28:00,399 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.693e+02 7.704e+02 1.092e+03 1.721e+03 4.655e+03, threshold=2.185e+03, percent-clipped=4.0 2023-06-29 00:28:31,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2181126.0, ans=0.05 2023-06-29 00:28:59,169 INFO [train.py:996] (1/4) Epoch 12, batch 28100, loss[loss=0.1875, simple_loss=0.2387, pruned_loss=0.06818, over 19982.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2869, pruned_loss=0.06336, over 4288534.64 frames. 
], batch size: 702, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:29:41,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2181366.0, ans=0.125 2023-06-29 00:30:00,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2181426.0, ans=0.0 2023-06-29 00:30:30,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2181486.0, ans=0.07 2023-06-29 00:30:39,424 INFO [train.py:996] (1/4) Epoch 12, batch 28150, loss[loss=0.2191, simple_loss=0.2838, pruned_loss=0.07721, over 21834.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2824, pruned_loss=0.06337, over 4287154.50 frames. ], batch size: 107, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:30:40,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2181546.0, ans=0.04949747468305833 2023-06-29 00:31:14,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2181606.0, ans=0.125 2023-06-29 00:31:20,817 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 8.258e+02 1.413e+03 2.441e+03 4.810e+03, threshold=2.825e+03, percent-clipped=31.0 2023-06-29 00:32:20,021 INFO [train.py:996] (1/4) Epoch 12, batch 28200, loss[loss=0.2172, simple_loss=0.2933, pruned_loss=0.07051, over 22012.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2803, pruned_loss=0.06454, over 4292545.93 frames. ], batch size: 103, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:32:20,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2181846.0, ans=0.125 2023-06-29 00:33:23,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2182026.0, ans=0.0 2023-06-29 00:33:27,150 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-29 00:34:06,094 INFO [train.py:996] (1/4) Epoch 12, batch 28250, loss[loss=0.201, simple_loss=0.2667, pruned_loss=0.06764, over 21227.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2841, pruned_loss=0.06735, over 4290700.14 frames. ], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:34:37,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2182206.0, ans=0.125 2023-06-29 00:34:48,085 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.469e+02 1.185e+03 1.669e+03 2.503e+03 4.651e+03, threshold=3.338e+03, percent-clipped=13.0 2023-06-29 00:34:53,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2182266.0, ans=0.0 2023-06-29 00:34:57,362 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.87 vs. 
limit=5.0 2023-06-29 00:35:05,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2182326.0, ans=0.0 2023-06-29 00:35:07,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2182326.0, ans=0.2 2023-06-29 00:35:48,350 INFO [train.py:996] (1/4) Epoch 12, batch 28300, loss[loss=0.1631, simple_loss=0.2497, pruned_loss=0.03821, over 21452.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.282, pruned_loss=0.06531, over 4285382.81 frames. ], batch size: 194, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:35:49,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2182446.0, ans=0.125 2023-06-29 00:35:57,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2182446.0, ans=0.125 2023-06-29 00:36:30,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2182566.0, ans=0.1 2023-06-29 00:36:38,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2182566.0, ans=0.125 2023-06-29 00:36:56,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2182626.0, ans=0.125 2023-06-29 00:37:16,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2182686.0, ans=0.125 2023-06-29 00:37:28,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2182746.0, ans=0.125 2023-06-29 00:37:29,515 INFO [train.py:996] (1/4) Epoch 12, batch 28350, loss[loss=0.2047, simple_loss=0.2745, pruned_loss=0.06747, over 21563.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2785, pruned_loss=0.06016, over 4282160.51 frames. ], batch size: 414, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:37:56,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2182806.0, ans=0.07 2023-06-29 00:38:02,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2182806.0, ans=0.0 2023-06-29 00:38:13,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-29 00:38:15,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.438e+02 7.077e+02 1.027e+03 1.914e+03 4.296e+03, threshold=2.054e+03, percent-clipped=2.0 2023-06-29 00:38:54,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2182986.0, ans=0.125 2023-06-29 00:38:56,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2182986.0, ans=0.2 2023-06-29 00:39:10,250 INFO [train.py:996] (1/4) Epoch 12, batch 28400, loss[loss=0.2111, simple_loss=0.2792, pruned_loss=0.07148, over 21267.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.275, pruned_loss=0.05968, over 4274813.82 frames. 
], batch size: 549, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 00:39:42,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2183106.0, ans=0.5 2023-06-29 00:40:43,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.22 vs. limit=10.0 2023-06-29 00:40:52,142 INFO [train.py:996] (1/4) Epoch 12, batch 28450, loss[loss=0.2422, simple_loss=0.3247, pruned_loss=0.07991, over 21782.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2793, pruned_loss=0.06288, over 4267382.06 frames. ], batch size: 112, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:41:08,801 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:41:43,663 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.501e+02 7.826e+02 1.097e+03 1.608e+03 4.884e+03, threshold=2.195e+03, percent-clipped=11.0 2023-06-29 00:42:12,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2183526.0, ans=0.0 2023-06-29 00:42:38,367 INFO [train.py:996] (1/4) Epoch 12, batch 28500, loss[loss=0.2028, simple_loss=0.2812, pruned_loss=0.06215, over 21597.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2831, pruned_loss=0.06582, over 4280876.34 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:42:43,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2183646.0, ans=0.125 2023-06-29 00:43:05,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-29 00:44:00,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2183886.0, ans=0.0 2023-06-29 00:44:21,547 INFO [train.py:996] (1/4) Epoch 12, batch 28550, loss[loss=0.2313, simple_loss=0.334, pruned_loss=0.0643, over 21837.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2914, pruned_loss=0.06823, over 4285411.78 frames. 
], batch size: 316, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:44:37,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2183946.0, ans=0.125 2023-06-29 00:44:48,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2184006.0, ans=0.0 2023-06-29 00:44:52,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2184006.0, ans=0.0 2023-06-29 00:44:57,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2184006.0, ans=0.0 2023-06-29 00:45:06,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2184066.0, ans=0.0 2023-06-29 00:45:12,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.914e+02 9.584e+02 1.430e+03 2.076e+03 4.050e+03, threshold=2.859e+03, percent-clipped=23.0 2023-06-29 00:45:19,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2184066.0, ans=0.125 2023-06-29 00:45:26,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2184126.0, ans=0.1 2023-06-29 00:45:28,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=2184126.0, ans=22.5 2023-06-29 00:45:31,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2184126.0, ans=0.2 2023-06-29 00:46:00,144 INFO [train.py:996] (1/4) Epoch 12, batch 28600, loss[loss=0.2058, simple_loss=0.2945, pruned_loss=0.05855, over 21405.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2977, pruned_loss=0.07022, over 4286739.93 frames. ], batch size: 131, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:46:13,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-29 00:46:35,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-29 00:46:52,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2184366.0, ans=0.2 2023-06-29 00:47:45,765 INFO [train.py:996] (1/4) Epoch 12, batch 28650, loss[loss=0.1734, simple_loss=0.2459, pruned_loss=0.05048, over 21655.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2922, pruned_loss=0.06937, over 4278198.48 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:48:01,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2184606.0, ans=0.04949747468305833 2023-06-29 00:48:30,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.107e+02 8.338e+02 1.219e+03 1.644e+03 3.488e+03, threshold=2.437e+03, percent-clipped=4.0 2023-06-29 00:49:26,552 INFO [train.py:996] (1/4) Epoch 12, batch 28700, loss[loss=0.1804, simple_loss=0.2177, pruned_loss=0.07156, over 20074.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2897, pruned_loss=0.0698, over 4278441.82 frames. 
], batch size: 704, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:49:45,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2184906.0, ans=0.125 2023-06-29 00:49:50,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2184906.0, ans=10.0 2023-06-29 00:49:57,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2184906.0, ans=0.1 2023-06-29 00:50:50,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2185086.0, ans=0.0 2023-06-29 00:51:06,116 INFO [train.py:996] (1/4) Epoch 12, batch 28750, loss[loss=0.2477, simple_loss=0.3272, pruned_loss=0.08403, over 22045.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2898, pruned_loss=0.07019, over 4280291.95 frames. ], batch size: 119, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:51:27,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2185206.0, ans=0.0 2023-06-29 00:51:40,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=2185206.0, ans=0.5 2023-06-29 00:51:50,492 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.316e+02 7.873e+02 1.096e+03 1.643e+03 3.604e+03, threshold=2.192e+03, percent-clipped=9.0 2023-06-29 00:52:23,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2185326.0, ans=0.125 2023-06-29 00:52:47,778 INFO [train.py:996] (1/4) Epoch 12, batch 28800, loss[loss=0.2452, simple_loss=0.3336, pruned_loss=0.07834, over 21234.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2936, pruned_loss=0.07021, over 4282563.22 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:52:53,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2185446.0, ans=0.125 2023-06-29 00:53:21,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2185566.0, ans=0.0 2023-06-29 00:54:04,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-29 00:54:24,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-29 00:54:28,452 INFO [train.py:996] (1/4) Epoch 12, batch 28850, loss[loss=0.2223, simple_loss=0.2899, pruned_loss=0.07732, over 21379.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2941, pruned_loss=0.07117, over 4283008.69 frames. ], batch size: 159, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:54:29,822 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. 
limit=6.0 2023-06-29 00:54:46,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2185806.0, ans=0.0 2023-06-29 00:54:47,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2185806.0, ans=0.1 2023-06-29 00:55:18,114 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.464e+02 8.066e+02 1.240e+03 1.993e+03 4.428e+03, threshold=2.479e+03, percent-clipped=20.0 2023-06-29 00:55:24,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-29 00:55:36,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2185926.0, ans=0.125 2023-06-29 00:56:11,325 INFO [train.py:996] (1/4) Epoch 12, batch 28900, loss[loss=0.2221, simple_loss=0.2909, pruned_loss=0.07668, over 21489.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2988, pruned_loss=0.07327, over 4280256.95 frames. ], batch size: 211, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:56:20,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2186046.0, ans=0.125 2023-06-29 00:56:43,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2186106.0, ans=0.2 2023-06-29 00:56:45,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-29 00:57:10,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2186166.0, ans=0.125 2023-06-29 00:57:53,926 INFO [train.py:996] (1/4) Epoch 12, batch 28950, loss[loss=0.2681, simple_loss=0.3622, pruned_loss=0.087, over 21487.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.298, pruned_loss=0.07276, over 4284666.86 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:57:57,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2186346.0, ans=0.0 2023-06-29 00:58:40,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2186466.0, ans=0.1 2023-06-29 00:58:45,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=22.5 2023-06-29 00:58:47,798 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.559e+02 9.001e+02 1.307e+03 1.896e+03 3.907e+03, threshold=2.614e+03, percent-clipped=14.0 2023-06-29 00:59:15,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2186526.0, ans=0.125 2023-06-29 00:59:17,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2186526.0, ans=0.2 2023-06-29 00:59:17,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2186526.0, ans=0.04949747468305833 2023-06-29 00:59:40,847 INFO [train.py:996] (1/4) Epoch 12, batch 29000, loss[loss=0.29, simple_loss=0.3571, pruned_loss=0.1115, over 21349.00 frames. 
], tot_loss[loss=0.2224, simple_loss=0.302, pruned_loss=0.07138, over 4284571.82 frames. ], batch size: 507, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:00:25,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2186766.0, ans=0.0 2023-06-29 01:00:27,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2186766.0, ans=0.125 2023-06-29 01:01:15,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2186886.0, ans=0.1 2023-06-29 01:01:21,625 INFO [train.py:996] (1/4) Epoch 12, batch 29050, loss[loss=0.2346, simple_loss=0.3017, pruned_loss=0.08374, over 21604.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3006, pruned_loss=0.07224, over 4284483.11 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:02:14,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+02 7.703e+02 1.025e+03 1.554e+03 4.084e+03, threshold=2.051e+03, percent-clipped=7.0 2023-06-29 01:02:40,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-29 01:02:54,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2187186.0, ans=0.1 2023-06-29 01:02:56,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2187186.0, ans=0.0 2023-06-29 01:03:02,208 INFO [train.py:996] (1/4) Epoch 12, batch 29100, loss[loss=0.1928, simple_loss=0.2585, pruned_loss=0.06353, over 21532.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2917, pruned_loss=0.06954, over 4286723.47 frames. ], batch size: 391, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:03:30,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-29 01:03:56,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2187366.0, ans=0.0 2023-06-29 01:04:09,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2187426.0, ans=0.125 2023-06-29 01:04:17,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2187486.0, ans=0.125 2023-06-29 01:04:23,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-29 01:04:38,548 INFO [train.py:996] (1/4) Epoch 12, batch 29150, loss[loss=0.2366, simple_loss=0.3124, pruned_loss=0.08036, over 21345.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2923, pruned_loss=0.06914, over 4277688.43 frames. 
], batch size: 471, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:05:30,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2187666.0, ans=0.1 2023-06-29 01:05:30,924 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.044e+02 8.799e+02 1.298e+03 1.831e+03 4.569e+03, threshold=2.596e+03, percent-clipped=20.0 2023-06-29 01:06:18,430 INFO [train.py:996] (1/4) Epoch 12, batch 29200, loss[loss=0.1953, simple_loss=0.2585, pruned_loss=0.06608, over 21571.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2875, pruned_loss=0.06824, over 4270484.78 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:06:19,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=22.5 2023-06-29 01:06:46,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2187906.0, ans=0.125 2023-06-29 01:06:51,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.12 vs. limit=22.5 2023-06-29 01:07:27,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2188026.0, ans=10.0 2023-06-29 01:08:03,705 INFO [train.py:996] (1/4) Epoch 12, batch 29250, loss[loss=0.1953, simple_loss=0.2789, pruned_loss=0.05589, over 21346.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2866, pruned_loss=0.06615, over 4273447.97 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:08:25,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2188206.0, ans=0.0 2023-06-29 01:08:30,195 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 01:08:38,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2188206.0, ans=0.0 2023-06-29 01:08:44,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2188266.0, ans=0.0 2023-06-29 01:08:48,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-29 01:08:50,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-29 01:08:53,907 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.020e+02 6.977e+02 9.878e+02 1.357e+03 4.006e+03, threshold=1.976e+03, percent-clipped=3.0 2023-06-29 01:08:57,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2188266.0, ans=0.2 2023-06-29 01:09:10,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2188326.0, ans=0.0 2023-06-29 01:09:43,867 INFO [train.py:996] (1/4) Epoch 12, batch 29300, loss[loss=0.1811, simple_loss=0.2282, pruned_loss=0.06699, over 20053.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2872, pruned_loss=0.06555, over 4267332.81 frames. 
], batch size: 703, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:10:18,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2188506.0, ans=0.125 2023-06-29 01:10:21,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2188506.0, ans=0.0 2023-06-29 01:10:38,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2188566.0, ans=0.0 2023-06-29 01:11:17,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2188686.0, ans=0.125 2023-06-29 01:11:30,173 INFO [train.py:996] (1/4) Epoch 12, batch 29350, loss[loss=0.214, simple_loss=0.2896, pruned_loss=0.06916, over 21441.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2833, pruned_loss=0.06516, over 4265333.68 frames. ], batch size: 195, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:12:16,937 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 7.385e+02 1.115e+03 1.625e+03 3.431e+03, threshold=2.230e+03, percent-clipped=15.0 2023-06-29 01:12:43,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2188926.0, ans=0.0 2023-06-29 01:12:55,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2188986.0, ans=0.07 2023-06-29 01:13:11,575 INFO [train.py:996] (1/4) Epoch 12, batch 29400, loss[loss=0.1908, simple_loss=0.2863, pruned_loss=0.04766, over 21183.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2838, pruned_loss=0.06252, over 4272260.94 frames. ], batch size: 548, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:13:43,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. limit=10.0 2023-06-29 01:13:46,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2189106.0, ans=0.2 2023-06-29 01:13:48,232 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-29 01:14:06,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2189166.0, ans=0.2 2023-06-29 01:14:36,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2189286.0, ans=0.0 2023-06-29 01:14:52,541 INFO [train.py:996] (1/4) Epoch 12, batch 29450, loss[loss=0.2303, simple_loss=0.3094, pruned_loss=0.07559, over 21724.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2818, pruned_loss=0.06218, over 4272473.11 frames. 
], batch size: 441, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:14:59,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2189346.0, ans=0.125 2023-06-29 01:15:44,174 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.913e+02 9.244e+02 1.482e+03 2.285e+03 4.603e+03, threshold=2.964e+03, percent-clipped=27.0 2023-06-29 01:15:59,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2189526.0, ans=0.1 2023-06-29 01:16:25,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=12.0 2023-06-29 01:16:38,759 INFO [train.py:996] (1/4) Epoch 12, batch 29500, loss[loss=0.2102, simple_loss=0.285, pruned_loss=0.06767, over 21865.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2839, pruned_loss=0.06495, over 4273311.13 frames. ], batch size: 351, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:17:15,252 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-29 01:17:35,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=2189826.0, ans=15.0 2023-06-29 01:17:59,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2189886.0, ans=0.125 2023-06-29 01:18:18,330 INFO [train.py:996] (1/4) Epoch 12, batch 29550, loss[loss=0.2379, simple_loss=0.2995, pruned_loss=0.08811, over 21743.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2843, pruned_loss=0.06677, over 4284574.58 frames. ], batch size: 473, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:19:01,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2190066.0, ans=0.125 2023-06-29 01:19:02,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2190066.0, ans=0.125 2023-06-29 01:19:05,175 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.716e+02 8.394e+02 1.189e+03 1.876e+03 3.636e+03, threshold=2.379e+03, percent-clipped=6.0 2023-06-29 01:19:17,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2190126.0, ans=0.125 2023-06-29 01:20:00,974 INFO [train.py:996] (1/4) Epoch 12, batch 29600, loss[loss=0.2346, simple_loss=0.3224, pruned_loss=0.07335, over 21708.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2896, pruned_loss=0.06832, over 4284759.06 frames. ], batch size: 247, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:21:03,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2190426.0, ans=0.2 2023-06-29 01:21:03,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2190426.0, ans=0.125 2023-06-29 01:21:09,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2190426.0, ans=0.125 2023-06-29 01:21:11,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. 
limit=15.0 2023-06-29 01:21:28,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2190486.0, ans=0.125 2023-06-29 01:21:41,181 INFO [train.py:996] (1/4) Epoch 12, batch 29650, loss[loss=0.1779, simple_loss=0.2536, pruned_loss=0.05105, over 21535.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2861, pruned_loss=0.06504, over 4282122.45 frames. ], batch size: 211, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:21:58,946 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-29 01:22:25,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2190666.0, ans=0.1 2023-06-29 01:22:33,828 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 9.770e+02 1.872e+03 2.859e+03 6.209e+03, threshold=3.743e+03, percent-clipped=35.0 2023-06-29 01:22:53,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2190726.0, ans=0.125 2023-06-29 01:23:19,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2190786.0, ans=0.125 2023-06-29 01:23:22,770 INFO [train.py:996] (1/4) Epoch 12, batch 29700, loss[loss=0.2632, simple_loss=0.3655, pruned_loss=0.08043, over 21654.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2902, pruned_loss=0.0655, over 4280404.66 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:23:36,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2190846.0, ans=0.125 2023-06-29 01:24:14,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2190966.0, ans=0.0 2023-06-29 01:24:25,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2191026.0, ans=0.125 2023-06-29 01:24:37,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2191026.0, ans=0.0 2023-06-29 01:25:02,375 INFO [train.py:996] (1/4) Epoch 12, batch 29750, loss[loss=0.2273, simple_loss=0.3235, pruned_loss=0.06557, over 21792.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2942, pruned_loss=0.06508, over 4281464.30 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:25:24,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2191206.0, ans=0.125 2023-06-29 01:25:34,338 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. 
limit=22.5 2023-06-29 01:25:36,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2191206.0, ans=0.125 2023-06-29 01:25:58,258 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.524e+02 7.715e+02 1.077e+03 1.535e+03 3.860e+03, threshold=2.154e+03, percent-clipped=1.0 2023-06-29 01:26:08,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2191326.0, ans=0.125 2023-06-29 01:26:42,163 INFO [train.py:996] (1/4) Epoch 12, batch 29800, loss[loss=0.1985, simple_loss=0.278, pruned_loss=0.05954, over 21683.00 frames. ], tot_loss[loss=0.213, simple_loss=0.295, pruned_loss=0.06548, over 4280000.57 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:27:03,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2191506.0, ans=10.0 2023-06-29 01:27:40,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2191566.0, ans=0.0 2023-06-29 01:27:51,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2191626.0, ans=0.0 2023-06-29 01:28:15,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2191686.0, ans=0.05 2023-06-29 01:28:20,979 INFO [train.py:996] (1/4) Epoch 12, batch 29850, loss[loss=0.168, simple_loss=0.248, pruned_loss=0.04401, over 21179.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2913, pruned_loss=0.06388, over 4279289.84 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:28:42,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2191806.0, ans=0.0 2023-06-29 01:28:43,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2191806.0, ans=0.125 2023-06-29 01:29:06,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2191866.0, ans=0.125 2023-06-29 01:29:16,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.021e+02 7.804e+02 1.039e+03 1.669e+03 3.761e+03, threshold=2.078e+03, percent-clipped=15.0 2023-06-29 01:29:53,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-29 01:29:57,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2191986.0, ans=0.2 2023-06-29 01:30:00,612 INFO [train.py:996] (1/4) Epoch 12, batch 29900, loss[loss=0.2584, simple_loss=0.3235, pruned_loss=0.09663, over 21506.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2901, pruned_loss=0.06537, over 4281545.53 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:30:21,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.67 vs. limit=15.0 2023-06-29 01:31:46,204 INFO [train.py:996] (1/4) Epoch 12, batch 29950, loss[loss=0.2333, simple_loss=0.3112, pruned_loss=0.07767, over 21468.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2938, pruned_loss=0.06861, over 4286135.26 frames. 
], batch size: 194, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:32:02,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2192346.0, ans=0.0 2023-06-29 01:32:25,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2192406.0, ans=0.125 2023-06-29 01:32:38,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.758e+02 9.924e+02 1.385e+03 1.832e+03 3.568e+03, threshold=2.770e+03, percent-clipped=22.0 2023-06-29 01:33:33,000 INFO [train.py:996] (1/4) Epoch 12, batch 30000, loss[loss=0.1839, simple_loss=0.3031, pruned_loss=0.0323, over 20756.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2946, pruned_loss=0.06849, over 4280493.24 frames. ], batch size: 608, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:33:33,000 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-29 01:33:51,774 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.255, simple_loss=0.3458, pruned_loss=0.08216, over 1796401.00 frames. 2023-06-29 01:33:51,775 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 23743MB 2023-06-29 01:34:28,323 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.44 vs. limit=15.0 2023-06-29 01:35:25,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2192886.0, ans=0.0 2023-06-29 01:35:38,186 INFO [train.py:996] (1/4) Epoch 12, batch 30050, loss[loss=0.1897, simple_loss=0.3246, pruned_loss=0.02737, over 19716.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2985, pruned_loss=0.06595, over 4275886.99 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:36:03,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-29 01:36:09,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2193006.0, ans=0.1 2023-06-29 01:36:33,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2193066.0, ans=0.1 2023-06-29 01:36:33,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2193066.0, ans=0.2 2023-06-29 01:36:36,202 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.629e+02 9.060e+02 1.265e+03 2.367e+03 5.681e+03, threshold=2.530e+03, percent-clipped=16.0 2023-06-29 01:36:38,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=2193066.0, ans=0.5 2023-06-29 01:37:17,749 INFO [train.py:996] (1/4) Epoch 12, batch 30100, loss[loss=0.1996, simple_loss=0.2675, pruned_loss=0.06586, over 21537.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2972, pruned_loss=0.06586, over 4269024.87 frames. ], batch size: 414, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:38:18,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.73 vs. limit=10.0 2023-06-29 01:39:04,267 INFO [train.py:996] (1/4) Epoch 12, batch 30150, loss[loss=0.2125, simple_loss=0.2875, pruned_loss=0.06878, over 21595.00 frames. 
], tot_loss[loss=0.2138, simple_loss=0.2933, pruned_loss=0.06715, over 4267551.71 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:39:35,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2193606.0, ans=0.125 2023-06-29 01:39:47,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2193666.0, ans=0.0 2023-06-29 01:40:05,026 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.363e+02 8.482e+02 1.272e+03 2.081e+03 3.656e+03, threshold=2.544e+03, percent-clipped=13.0 2023-06-29 01:40:47,523 INFO [train.py:996] (1/4) Epoch 12, batch 30200, loss[loss=0.2226, simple_loss=0.3235, pruned_loss=0.06086, over 21695.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.296, pruned_loss=0.06637, over 4269918.59 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:41:51,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5 2023-06-29 01:42:28,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2194086.0, ans=0.125 2023-06-29 01:42:38,874 INFO [train.py:996] (1/4) Epoch 12, batch 30250, loss[loss=0.2686, simple_loss=0.3823, pruned_loss=0.07741, over 21272.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3049, pruned_loss=0.06917, over 4272248.21 frames. ], batch size: 549, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:42:46,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2194146.0, ans=0.5 2023-06-29 01:43:10,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2194206.0, ans=0.2 2023-06-29 01:43:33,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.232e+02 7.981e+02 1.163e+03 1.576e+03 2.909e+03, threshold=2.325e+03, percent-clipped=5.0 2023-06-29 01:44:08,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2194386.0, ans=0.2 2023-06-29 01:44:12,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2194386.0, ans=0.125 2023-06-29 01:44:21,065 INFO [train.py:996] (1/4) Epoch 12, batch 30300, loss[loss=0.1955, simple_loss=0.2586, pruned_loss=0.06619, over 21245.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.3014, pruned_loss=0.06854, over 4269547.67 frames. ], batch size: 144, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:44:40,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2194446.0, ans=0.0 2023-06-29 01:46:09,183 INFO [train.py:996] (1/4) Epoch 12, batch 30350, loss[loss=0.2311, simple_loss=0.3248, pruned_loss=0.06874, over 21864.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3002, pruned_loss=0.06893, over 4264274.28 frames. 
], batch size: 317, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:46:26,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2194806.0, ans=0.2 2023-06-29 01:46:28,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2194806.0, ans=0.2 2023-06-29 01:46:32,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2194806.0, ans=0.0 2023-06-29 01:46:49,134 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.687e+02 9.388e+02 1.588e+03 2.178e+03 4.101e+03, threshold=3.176e+03, percent-clipped=21.0 2023-06-29 01:46:52,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2194926.0, ans=0.1 2023-06-29 01:47:18,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2194986.0, ans=0.95 2023-06-29 01:47:26,697 INFO [train.py:996] (1/4) Epoch 12, batch 30400, loss[loss=0.1909, simple_loss=0.244, pruned_loss=0.06886, over 20265.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.294, pruned_loss=0.06751, over 4251247.45 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:48:14,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2195166.0, ans=0.1 2023-06-29 01:48:45,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2195286.0, ans=0.125 2023-06-29 01:48:50,489 INFO [train.py:996] (1/4) Epoch 12, batch 30450, loss[loss=0.257, simple_loss=0.379, pruned_loss=0.06751, over 19895.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2947, pruned_loss=0.06695, over 4193384.32 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 01:48:54,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2195346.0, ans=0.0 2023-06-29 01:49:38,085 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.733e+02 1.475e+03 2.498e+03 5.657e+03 1.532e+04, threshold=4.997e+03, percent-clipped=41.0 2023-06-29 01:49:42,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2195526.0, ans=0.125 2023-06-29 01:49:57,355 INFO [train.py:1249] (1/4) Done!