[2025-03-02 14:55:20,472] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[... the same line from the other six ranks (14:55:20,472-483) ...]
INFO 03-02 14:55:25 __init__.py:190] Automatically detected platform cuda.
[... the same line from the other six ranks ...]
[2025-03-02 14:55:31,563] [INFO] [comm.py:652:init_distributed] cdb=None
[... the same line from the other six ranks (14:55:31,563-565) ...]
[2025-03-02 14:55:31,565] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-03-02 14:55:35,116] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
[... the same line from the other six ranks (14:55:35,169 - 14:55:37,912) ...]
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
[... the three warnings above are repeated by each of the seven ranks ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:688502 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-110:688502:688502 [0] NCCL INFO Bootstrap : Using bond0:10.9.200.110<0>
p-phy-ctyun-gz-a800-node-prod-200-110:688502:688502 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
[... matching cudaDriverVersion / NCCL_SOCKET_IFNAME / Bootstrap lines from the other six ranks (pids 688503-688508) ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO P2P plugin IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.110<0>
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO Using non-device net plugin version 0
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO Using network IBext_v8
[... matching Plugin Path / P2P plugin / NET/IB lines from the other six ranks ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO ncclCommInitRank comm 0x56293ebfec60 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 commId 0xa155d924c2747755 - Init START
[... Init START lines for ranks 1-6 (busIds 2d000, 54000, 59000, 8d000, 92000, bf000) ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO NVLS multicast support is not available on dev 0
[... matching NCCL_CUMEM_ENABLE / affinity / NVLS lines for ranks 1-6 (GPUs 4-6 get the mask ffffffff,00000000,ffffffff,00000000) ...]
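The three transformers warnings repeated above prescribe the fix themselves: pass an explicit half-precision `torch_dtype` at load time so the Qwen2-VL vision tower is never materialized in float32. A minimal sketch, assuming the run loads a Qwen2-VL checkpoint (the model id below is a stand-in; the log never names the base checkpoint):

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Loading directly in bf16 keeps the vision tower out of float32, which is what
# triggers the "Flash Attention 2.0 only supports torch.float16 and
# torch.bfloat16 dtypes" warning above.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",              # assumed stand-in for the real base checkpoint
    torch_dtype=torch.bfloat16,               # FA2 accepts only fp16/bf16
    attn_implementation="flash_attention_2",
)
```

With DeepSpeed bf16 training (see the engine config printed later in this log) the engine handles the precision of the working copies, so these warnings are typically benign, just noisy.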
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO comm 0x56293ebfec60 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0
[... matching comm lines for ranks 1-6 ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6
[... Channels 02/16 through 15/16, all with the same ring order 0 1 2 3 4 5 6 ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
[... Trees lines for ranks 1-6: a single chain 0->1->2->3->4->5->6, identical on all 16 channels ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO P2P Chunksize set to 524288
[... P2P Chunksize lines for ranks 1-6 ...]
[... per-channel connection lines (Channel 00/0 through 15/0, "via P2P/IPC/read") for every ring hop 0->1, 1->2, 2->3, 3->4, 4->5, 5->6 and 6->0 ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO Connected all rings
[... Connected all rings on ranks 1-6 ...]
[... per-channel connection lines for the reverse tree hops 6->5, 5->4, 4->3, 3->2, 2->1 and 1->0, again via P2P/IPC/read ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO Connected all trees
[... Connected all trees on ranks 1-6 ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
[... matching threadThresholds / channel-count lines for ranks 1-6 ...]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
p-phy-ctyun-gz-a800-node-prod-200-110:688502:689819 [0] NCCL INFO ncclCommInitRank comm 0x56293ebfec60 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 commId 0xa155d924c2747755 - Init COMPLETE
[... matching TUNER/Plugin and Init COMPLETE lines for ranks 1-6 ...]
[2025-03-02 14:55:39,653] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 730, num_elems = 8.29B
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
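The `partition_parameters.py` `__exit__` message above is what DeepSpeed prints when the model is constructed inside a ZeRO stage-3 `zero.Init()` scope: the 8.29B elements are sharded across the 7 ranks as they are created rather than materialized whole on each GPU. A minimal sketch of opting into this outside the HF Trainer, assuming the documented `HfDeepSpeedConfig` route and a stand-in model id:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 2,
}

# Must be created (and kept referenced) *before* from_pretrained so that
# transformers detects ZeRO-3 and builds the model under deepspeed.zero.Init(),
# partitioning parameters across ranks at construction time.
dschf = HfDeepSpeedConfig(ds_config)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",  # assumed stand-in; the log does not name the base model
    torch_dtype=torch.bfloat16,
)
```

Inside the HF Trainer the same thing happens automatically once `TrainingArguments(deepspeed=...)` carries a stage-3 config.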
1e-06 [2025-03-02 14:56:10,281] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-03-02 14:56:10,281] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-03-02 14:56:10,281] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-03-02 14:56:10,281] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-03-02 14:56:10,281] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-03-02 14:56:10,281] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-03-02 14:56:10,281] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-03-02 14:56:10,281] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-03-02 14:56:10,281] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] graph_harvesting ............. False [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] optimizer_name ............... None [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] optimizer_params ............. None [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-03-02 14:56:10,282] [INFO] [config.py:1003:print] pld_enabled .................. 
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] pld_params ................... False
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] prescale_gradients ........... False
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] scheduler_name ............... None
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] scheduler_params ............. None
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] sparse_attention ............. None
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] steps_per_print .............. inf
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] train_batch_size ............. 14
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] use_node_local_storage ....... False
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] wall_clock_breakdown ......... False
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] weight_quantization_config ... None
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] world_size ................... 7
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] zero_enabled ................. True
[2025-03-02 14:56:10,283] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True
[2025-03-02 14:56:10,284] [INFO] [config.py:1003:print] zero_optimization_stage ...... 3
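The runtime values above pin down the data-parallel geometry of this run: train_micro_batch_size_per_gpu (1) × gradient_accumulation_steps (2) × world_size (7) gives the printed train_batch_size of 14. A minimal sketch of how a config like the user JSON dumped below is handed to DeepSpeed; the stand-in model and the trimmed dict are illustrative, not taken from the actual training script, and the "auto" bucket sizes are resolved by the HF integration at launch (the resolved values appear at the end of the JSON dump):

```python
import torch
import deepspeed

# Trimmed version of the ZeRO-3 user config dumped below.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "none", "pin_memory": True},
        "offload_param": {"device": "none", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 14,  # 1 micro-batch x 2 accum steps x 7 ranks
}

model = torch.nn.Linear(8, 8)  # stand-in; the real run wraps Qwen2-VL-7B
# Under a distributed launcher this shards parameters, gradients and
# optimizer state across the 7 training ranks (ZeRO stage 3).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```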
[2025-03-02 14:56:10,284] [INFO] [config.py:989:print_user_config] json = {
    "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1.000000e+09,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1.000000e+09,
        "stage3_max_reuse_distance": 1.000000e+09,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "train_batch_size": 14,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": false,
    "zero_optimization.reduce_bucket_size": 1.284506e+07,
    "zero_optimization.stage3_param_persistence_threshold": 3.584000e+04,
    "zero_optimization.stage3_prefetch_bucket_size": 1.156055e+07
}
INFO 03-02 14:56:24 config.py:542] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
WARNING 03-02 14:56:24 arg_utils.py:1079] --enable-prefix-caching is currently not supported for multimodal models in v0 and has been disabled.
INFO 03-02 14:56:24 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/home/vlm/workspace/r1_checkpoints/qwen2vl_7b_R1_finetune_by_trance_60k_cot_sft_every_100/checkpoint-400', speculative_config=None, tokenizer='/home/vlm/workspace/r1_checkpoints/qwen2vl_7b_R1_finetune_by_trance_60k_cot_sft_every_100/checkpoint-400', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:7, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/vlm/workspace/r1_checkpoints/qwen2vl_7b_R1_finetune_by_trance_60k_cot_sft_every_100/checkpoint-400, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-02 14:56:25 cuda.py:230] Using Flash Attention backend.
INFO 03-02 14:56:26 model_runner.py:1110] Starting to load model /home/vlm/workspace/r1_checkpoints/qwen2vl_7b_R1_finetune_by_trance_60k_cot_sft_every_100/checkpoint-400...
INFO 03-02 14:56:26 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (… > 8192). Running this sequence through the model will result in indexing errors
WARNING 03-02 14:56:39 profiling.py:187] The context length (32768) of the model is too short to hold the multi-modal embeddings in the worst case (49152 tokens in total, out of which {'image': 32768, 'video': 16384} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
INFO 03-02 14:56:42 worker.py:267] Memory profiling takes 9.73 seconds
INFO 03-02 14:56:42 worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.70) = 55.53GiB
INFO 03-02 14:56:42 worker.py:267] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 55.53GiB.
INFO 03-02 14:56:43 executor_base.py:110] # CUDA blocks: 64982, # CPU blocks: 4681
INFO 03-02 14:56:43 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 31.73x
INFO 03-02 14:56:45 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
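Note the split visible in these logs: DeepSpeed trains on ranks 0-6 while the vLLM engine is pinned to the eighth GPU (device_config=cuda:7), evidently for rollout generation. An in-process engine with the settings shown above would be constructed roughly as in the sketch below; this is a hedged reconstruction, not the actual call site, and `limit_mm_per_prompt` is included only as one plausible way to shrink the worst-case multi-modal reservation flagged by the profiling warning:

```python
from vllm import LLM, SamplingParams

CKPT = "/home/vlm/workspace/r1_checkpoints/qwen2vl_7b_R1_finetune_by_trance_60k_cot_sft_every_100/checkpoint-400"

llm = LLM(
    model=CKPT,
    dtype="bfloat16",             # matches dtype=torch.bfloat16 in the log
    max_model_len=32768,          # matches max_seq_len=32768
    gpu_memory_utilization=0.70,  # 79.32GiB x 0.70 = 55.53GiB, as profiled
    tensor_parallel_size=1,
    # Cap how many images/videos a single prompt may carry so the
    # worst-case multi-modal embeddings fit inside max_model_len.
    limit_mm_per_prompt={"image": 1, "video": 0},
)

outputs = llm.generate(
    ["Describe the scene."], SamplingParams(temperature=1.0, max_tokens=512)
)
```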
Capturing CUDA graph shapes:   0%|          | 0/35 [00:00<?, ?it/s]
p-phy-ctyun-gz-a800-node-prod-200-110:688502:694390 [0] NCCL INFO Connected all rings
p-phy-ctyun-gz-a800-node-prod-200-110:688502:694390 [0] NCCL INFO Connected all trees
p-phy-ctyun-gz-a800-node-prod-200-110:688502:694390 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
p-phy-ctyun-gz-a800-node-prod-200-110:688502:694390 [0] NCCL INFO ncclCommSplit comm 0x7ef7b406e9d0 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 parent 0x56293ebfec60 color -1326228412 key 0 commId 0x218decc57953bb - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-110:688507:694387 [5] NCCL INFO ncclCommSplit comm 0x7fb13806ef40 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 parent 0x5573879bc140 color -1326228412 key 5 commId 0x218decc57953bb - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-110:688505:694388 [3] NCCL INFO ncclCommSplit comm 0x7f5ac006f030 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 parent 0x556d5854aa00 color -1326228412 key 3 commId 0x218decc57953bb - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-110:688503:694392 [1] NCCL INFO ncclCommSplit comm 0x7f190006fea0 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 parent 0x56258c81ee30 color -1326228412 key 1 commId 0x218decc57953bb - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-110:688506:694393 [4] NCCL INFO ncclCommSplit comm 0x7ee91c0710e0 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 parent 0x55a23fa4b490 color -1326228412 key 4 commId 0x218decc57953bb - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-110:688508:694391 [6] NCCL INFO ncclCommSplit comm 0x7effd006ffe0 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 parent 0x564e03063a50 color -1326228412 key 6 commId 0x218decc57953bb - Init COMPLETE
p-phy-ctyun-gz-a800-node-prod-200-110:688504:694389 [2] NCCL INFO ncclCommSplit comm 0x7f014006fc10 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 54000 parent 0x55be9b3c3af0 color -1326228412 key 2 commId 0x218decc57953bb - Init COMPLETE
[2025-03-02 14:57:51,848] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
  0%|          | 1/4286 [00:29<35:01:31, 29.43s/it]
{'loss': 0.0, 'grad_norm': 1.125779529196918, 'learning_rate': 9.997666822211853e-07, 'completion_length': 214.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.4032738357782364, 'rewards/format_reward': 1.0, 'reward': 1.4032739400863647, 'reward_std': 0.24061636626720428, 'kl': 0.0, 'epoch': 0.0}
[2025-03-02 14:58:14,961] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
  0%|          | 2/4286 [00:52<30:35:51, 25.71s/it]
{'loss': 0.0, 'grad_norm': 0.9270203490945449, 'learning_rate': 9.995333644423704e-07, 'completion_length': 245.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.4895833730697632, 'rewards/format_reward': 1.0, 'reward': 1.4895834922790527, 'reward_std': 0.1952773630619049, 'kl': 4.431605339050293e-05, 'epoch': 0.0}
  0%|          | 3/4286 [01:15<29:00:08, 24.38s/it]
{'loss': 0.0, 'grad_norm': 1.871224214601804, 'learning_rate': 9.993000466635557e-07, 'completion_length': 205.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.3336309790611267, 'rewards/format_reward': 1.0, 'reward': 1.3336310386657715, 'reward_std': 0.20096609741449356, 'kl': 8.536875247955322e-05, 'epoch': 0.0}
[2025-03-02 14:59:03,119] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
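These stage3.py warnings recur throughout the run. The mitigation the message itself suggests would sit in the training loop roughly as below; `engine` and `dataloader` are placeholders for the actual training objects, and only the `get_accelerator().empty_cache()` call comes from the warning text:

```python
from deepspeed.accelerator import get_accelerator

for step, batch in enumerate(dataloader):  # placeholder loop
    loss = engine(**batch)                 # placeholder forward
    engine.backward(loss)
    engine.step()
    # Flush the CUDA caching allocator at the same point on every rank,
    # instead of letting flushes happen mid-step under memory pressure.
    get_accelerator().empty_cache()
```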
  0%|          | 4/4286 [01:40<29:27:41, 24.77s/it]
{'loss': 0.0, 'grad_norm': 2.481396731303544, 'learning_rate': 9.99066728884741e-07, 'completion_length': 205.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.34073323011398315, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3228761553764343, 'reward_std': 0.20557495951652527, 'kl': 1.4185905456542969e-05, 'epoch': 0.0}
  0%|          | 5/4286 [02:04<29:06:33, 24.48s/it]
{'loss': 0.0, 'grad_norm': 1.22595512784807, 'learning_rate': 9.988334111059262e-07, 'completion_length': 198.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.4047619253396988, 'rewards/format_reward': 1.0, 'reward': 1.4047619700431824, 'reward_std': 0.1554151475429535, 'kl': 6.854534149169922e-05, 'epoch': 0.0}
  0%|          | 6/4286 [02:27<28:29:18, 23.96s/it]
{'loss': 0.0, 'grad_norm': 0.810748948750428, 'learning_rate': 9.986000933271115e-07, 'completion_length': 236.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5104166865348816, 'rewards/format_reward': 1.0, 'reward': 1.5104168057441711, 'reward_std': 0.1846674457192421, 'kl': 2.212822437286377e-06, 'epoch': 0.0}
  0%|          | 7/4286 [02:51<28:15:26, 23.77s/it]
{'loss': 0.0, 'grad_norm': 2.5076846031166187, 'learning_rate': 9.983667755482968e-07, 'completion_length': 229.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.4836309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.46577388048172, 'reward_std': 0.2457391545176506, 'kl': 2.6047229766845703e-05, 'epoch': 0.0}
  0%|          | 8/4286 [03:13<27:38:16, 23.26s/it]
{'loss': 0.0, 'grad_norm': 1.0941936849552598, 'learning_rate': 9.98133457769482e-07, 'completion_length': 200.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.385416716337204, 'rewards/format_reward': 1.0, 'reward': 1.3854168057441711, 'reward_std': 0.1886540949344635, 'kl': 8.654594421386719e-05, 'epoch': 0.0}
  0%|          | 9/4286 [03:37<27:51:37, 23.45s/it]
{'loss': 0.0, 'grad_norm': 1.4973155367748008, 'learning_rate': 9.979001399906673e-07, 'completion_length': 198.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.4735119342803955, 'rewards/format_reward': 1.0, 'reward': 1.473512053489685, 'reward_std': 0.215504951775074, 'kl': 0.00028395652770996094, 'epoch': 0.0}
  0%|          | 10/4286 [03:59<27:25:09, 23.08s/it]
{'loss': 0.0, 'grad_norm': 2.9115751254535396, 'learning_rate': 9.976668222118526e-07, 'completion_length': 194.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.2395833432674408, 'rewards/format_reward': 1.0, 'reward': 1.2395834922790527, 'reward_std': 0.10996554046869278, 'kl': 0.00072479248046875, 'epoch': 0.0}
  0%|          | 11/4286 [04:20<26:49:46, 22.59s/it]
{'loss': 0.0, 'grad_norm': 4.115641300734455, 'learning_rate': 9.974335044330377e-07, 'completion_length': 205.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.4006696790456772, 'rewards/format_reward': 1.0, 'reward': 1.4006697535514832, 'reward_std': 0.18548224121332169, 'kl': 0.00020742416381835938, 'epoch': 0.0}
  0%|          | 12/4286 [04:44<27:17:04, 22.98s/it]
{'loss': 0.0, 'grad_norm': 1.2911960601600605, 'learning_rate': 9.97200186654223e-07, 'completion_length': 214.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.38839291036129, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3705358505249023, 'reward_std': 0.1341910921037197, 'kl': 0.00037479400634765625, 'epoch': 0.0}
  0%|          | 13/4286 [05:08<27:32:10, 23.20s/it]
{'loss': 0.0, 'grad_norm': 0.8746255598453496, 'learning_rate': 9.969668688754082e-07, 'completion_length': 240.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.316815510392189, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.298958420753479, 'reward_std': 0.15921126678586006, 'kl': 0.00047779083251953125, 'epoch': 0.0}
  0%|          | 14/4286 [05:32<27:55:33, 23.53s/it]
{'loss': 0.0, 'grad_norm': 1.2217842068412192, 'learning_rate': 9.967335510965935e-07, 'completion_length': 244.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.42767859995365143, 'rewards/format_reward': 1.0, 'reward': 1.427678644657135, 'reward_std': 0.1430930458009243, 'kl': 0.00069427490234375, 'epoch': 0.0}
  0%|          | 15/4286 [05:55<27:29:58, 23.18s/it]
{'loss': 0.0, 'grad_norm': 4.7684325115108885, 'learning_rate': 9.965002333177788e-07, 'completion_length': 223.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.4508928805589676, 'rewards/format_reward': 1.0, 'reward': 1.450892984867096, 'reward_std': 0.18434765189886093, 'kl': 0.0007152557373046875, 'epoch': 0.0}
  0%|          | 16/4286 [06:15<26:31:30, 22.36s/it]
{'loss': 0.0, 'grad_norm': 1.0412658608092658, 'learning_rate': 9.96266915538964e-07, 'completion_length': 221.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.5294643342494965, 'rewards/format_reward': 1.0, 'reward': 1.5294643640518188, 'reward_std': 0.19285529479384422, 'kl': 0.00046634674072265625, 'epoch': 0.0}
  0%|          | 17/4286 [06:39<27:08:37, 22.89s/it]
{'loss': 0.0, 'grad_norm': 1.5646562952618803, 'learning_rate': 9.960335977601493e-07, 'completion_length': 211.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.39392009377479553, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3760629892349243, 'reward_std': 0.16270218044519424, 'kl': 0.0010223388671875, 'epoch': 0.0}
  0%|          | 18/4286 [07:02<27:18:49, 23.04s/it]
{'loss': 0.0001, 'grad_norm': 1.1377569157666894, 'learning_rate': 9.958002799813346e-07, 'completion_length': 223.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.4288690984249115, 'rewards/format_reward': 1.0, 'reward': 1.4288691878318787, 'reward_std': 0.2144467458128929, 'kl': 0.001300811767578125, 'epoch': 0.0}
  0%|          | 19/4286 [07:26<27:22:59, 23.10s/it]
{'loss': 0.0001, 'grad_norm': 2.2170735411472022, 'learning_rate': 9.955669622025197e-07, 'completion_length': 240.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.5208333879709244, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5029763579368591, 'reward_std': 0.16683555766940117, 'kl': 0.001506805419921875, 'epoch': 0.0}
  0%|          | 20/4286 [07:48<27:10:17, 22.93s/it]
{'loss': 0.0001, 'grad_norm': 1.4579314662004086, 'learning_rate': 9.95333644423705e-07, 'completion_length': 236.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.3991071581840515, 'rewards/format_reward': 1.0, 'reward': 1.3991072177886963, 'reward_std': 0.15400740504264832, 'kl': 0.00150299072265625, 'epoch': 0.0}
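The two reward columns point at the usual GRPO recipe of a rule-based format check plus a task-accuracy check, and the rewards/format_reward values are multiples of 1/56 (0.9821428… = 55/56, 0.9642857… = 54/56), consistent with 56 sampled completions per optimizer step. The actual reward code is not in this log; the sketch below only shows the general shape such callables take under trl's GRPOTrainer convention (an assumption), with a hypothetical <think>/<answer> template and plain-text completions:

```python
import re

TEMPLATE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completions, **kwargs):
    # Hypothetical: 1.0 when a completion matches the template, else 0.0;
    # averaged over a group this yields values like 55/56 = 0.9821...
    return [1.0 if TEMPLATE.fullmatch(c.strip()) else 0.0 for c in completions]

def accuracy_reward(completions, solution, **kwargs):
    # Hypothetical exact-match check of the <answer> span against the label.
    scores = []
    for c, sol in zip(completions, solution):
        m = re.search(r"<answer>(.*?)</answer>", c, re.DOTALL)
        scores.append(1.0 if m and m.group(1).strip() == sol.strip() else 0.0)
    return scores
```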
  0%|          | 21/4286 [08:10<26:38:40, 22.49s/it]
{'loss': 0.0, 'grad_norm': 0.5533783292515638, 'learning_rate': 9.951003266448904e-07, 'completion_length': 245.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.5398809909820557, 'rewards/format_reward': 1.0, 'reward': 1.5398809909820557, 'reward_std': 0.1540084332227707, 'kl': 0.000797271728515625, 'epoch': 0.0}
  1%|          | 22/4286 [08:32<26:31:52, 22.40s/it]
{'loss': 0.0001, 'grad_norm': 1.3179820085026706, 'learning_rate': 9.948670088660755e-07, 'completion_length': 217.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4880952537059784, 'rewards/format_reward': 1.0, 'reward': 1.4880954027175903, 'reward_std': 0.18849977850914001, 'kl': 0.001556396484375, 'epoch': 0.01}
  1%|          | 23/4286 [08:55<26:39:56, 22.52s/it]
{'loss': 0.0001, 'grad_norm': 0.9078129784673831, 'learning_rate': 9.946336910872608e-07, 'completion_length': 232.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.4145408272743225, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.37882661819458, 'reward_std': 0.22913537919521332, 'kl': 0.00180816650390625, 'epoch': 0.01}
  1%|          | 24/4286 [09:18<26:59:12, 22.80s/it]
{'loss': 0.0001, 'grad_norm': 1.2224452366716043, 'learning_rate': 9.944003733084461e-07, 'completion_length': 245.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.38363097608089447, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.365773856639862, 'reward_std': 0.21965544670820236, 'kl': 0.00232696533203125, 'epoch': 0.01}
  1%|          | 25/4286 [09:41<27:08:06, 22.93s/it]
{'loss': 0.0001, 'grad_norm': 2.154132373655847, 'learning_rate': 9.941670555296313e-07, 'completion_length': 247.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.4866071790456772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4687501192092896, 'reward_std': 0.19657714664936066, 'kl': 0.00208282470703125, 'epoch': 0.01}
  1%|          | 26/4286 [10:06<27:38:30, 23.36s/it]
{'loss': 0.0001, 'grad_norm': 0.6556225183733548, 'learning_rate': 9.939337377508166e-07, 'completion_length': 219.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5034722238779068, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4856151938438416, 'reward_std': 0.1950373873114586, 'kl': 0.00156402587890625, 'epoch': 0.01}
  1%|          | 27/4286 [10:28<27:13:26, 23.01s/it]
{'loss': 0.0001, 'grad_norm': 1.2518407952374393, 'learning_rate': 9.93700419972002e-07, 'completion_length': 243.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.336309552192688, 'rewards/format_reward': 1.0, 'reward': 1.3363096117973328, 'reward_std': 0.21204090118408203, 'kl': 0.0027313232421875, 'epoch': 0.01}
  1%|          | 28/4286 [10:51<27:11:35, 22.99s/it]
{'loss': 0.0001, 'grad_norm': 0.8161436265846934, 'learning_rate': 9.93467102193187e-07, 'completion_length': 265.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.4880952686071396, 'rewards/format_reward': 1.0, 'reward': 1.4880953431129456, 'reward_std': 0.1649840548634529, 'kl': 0.00287628173828125, 'epoch': 0.01}
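For each step, 'reward' is evidently the sum of the two components (step 1: 0.4032738… + 1.0 ≈ 1.4032739…, with the last digits drifting from low-precision accumulation), so format adherence is essentially saturated from the start while the accuracy term hovers around 0.3-0.6. A small parser is enough to pull these curves out of a raw log like this one (a sketch; the log path is hypothetical, and it assumes one metrics dict per line as reformatted above):

```python
import ast
import re

LOG_PATH = "train.log"  # hypothetical saved copy of this console output

# Each trainer metrics dict starts with 'loss' and ends with 'epoch'.
DICT_RE = re.compile(r"\{'loss'.*?'epoch': [0-9.]+\}")

with open(LOG_PATH) as f:
    records = [ast.literal_eval(m.group()) for m in DICT_RE.finditer(f.read())]

acc = [r["rewards/only_full_func_accuracy_reward"] for r in records]
fmt = [r["rewards/format_reward"] for r in records]
print(f"{len(records)} steps | mean accuracy reward {sum(acc) / len(acc):.3f} "
      f"| mean format reward {sum(fmt) / len(fmt):.3f}")
```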
  1%|          | 29/4286 [11:18<28:35:57, 24.19s/it]
{'loss': 0.0001, 'grad_norm': 0.8318512621723625, 'learning_rate': 9.932337844143724e-07, 'completion_length': 265.1428756713867, 'rewards/only_full_func_accuracy_reward': 0.424107164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3883929252624512, 'reward_std': 0.23073500394821167, 'kl': 0.00222015380859375, 'epoch': 0.01}
  1%|          | 30/4286 [11:41<28:17:52, 23.94s/it]
{'loss': 0.0001, 'grad_norm': 1.047456576707496, 'learning_rate': 9.930004666355577e-07, 'completion_length': 268.3928756713867, 'rewards/only_full_func_accuracy_reward': 0.5156746208667755, 'rewards/format_reward': 1.0, 'reward': 1.5156747102737427, 'reward_std': 0.19716593623161316, 'kl': 0.00243377685546875, 'epoch': 0.01}
  1%|          | 31/4286 [12:03<27:33:29, 23.32s/it]
{'loss': 0.0001, 'grad_norm': 1.2151347281163924, 'learning_rate': 9.927671488567428e-07, 'completion_length': 262.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.5059524178504944, 'rewards/format_reward': 1.0, 'reward': 1.505952537059784, 'reward_std': 0.13264676928520203, 'kl': 0.0027008056640625, 'epoch': 0.01}
  1%|          | 32/4286 [12:28<28:07:43, 23.80s/it]
{'loss': 0.0002, 'grad_norm': 0.7322507704138832, 'learning_rate': 9.925338310779281e-07, 'completion_length': 293.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.4543309658765793, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.418616771697998, 'reward_std': 0.2485160231590271, 'kl': 0.004058837890625, 'epoch': 0.01}
  1%|          | 33/4286 [12:53<28:23:24, 24.03s/it]
{'loss': 0.0001, 'grad_norm': 0.7790134062409135, 'learning_rate': 9.923005132991135e-07, 'completion_length': 265.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5896046459674835, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571747601032257, 'reward_std': 0.14771082252264023, 'kl': 0.002910614013671875, 'epoch': 0.01}
[2025-03-02 15:10:39,821] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
  1%|          | 34/4286 [13:17<28:28:51, 24.11s/it]
{'loss': 0.0001, 'grad_norm': 0.8015215040236777, 'learning_rate': 9.920671955202986e-07, 'completion_length': 256.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.4508928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4330358505249023, 'reward_std': 0.21350206434726715, 'kl': 0.00312042236328125, 'epoch': 0.01}
  1%|          | 35/4286 [13:41<28:19:25, 23.99s/it]
{'loss': 0.0001, 'grad_norm': 0.5627291174279683, 'learning_rate': 9.91833877741484e-07, 'completion_length': 285.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.4627976417541504, 'rewards/format_reward': 1.0, 'reward': 1.4627977013587952, 'reward_std': 0.11306972429156303, 'kl': 0.00295257568359375, 'epoch': 0.01}
[2025-03-02 15:11:27,081] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 1%| | 36/4286 [14:04<28:10:11, 23.86s/it] {'loss': 0.0001, 'grad_norm': 0.3903981426183082, 'learning_rate': 9.91600559962669e-07, 'completion_length': 284.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.48630955815315247, 'rewards/format_reward': 1.0, 'reward': 1.4863096475601196, 'reward_std': 0.10988440737128258, 'kl': 0.002315521240234375, 'epoch': 0.01} 1%| | 36/4286 [14:04<28:10:11, 23.86s/it] 1%| | 37/4286 [14:31<29:10:07, 24.71s/it] {'loss': 0.0002, 'grad_norm': 1.2492667758047917, 'learning_rate': 9.913672421838543e-07, 'completion_length': 280.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.5038265436887741, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4859694838523865, 'reward_std': 0.2649233117699623, 'kl': 0.00402069091796875, 'epoch': 0.01} 1%| | 37/4286 [14:31<29:10:07, 24.71s/it] 1%| | 38/4286 [14:55<28:48:16, 24.41s/it] {'loss': 0.0001, 'grad_norm': 0.6729910054064279, 'learning_rate': 9.911339244050397e-07, 'completion_length': 277.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.3759959042072296, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3402816653251648, 'reward_std': 0.24656936526298523, 'kl': 0.0034637451171875, 'epoch': 0.01} 1%| | 38/4286 [14:55<28:48:16, 24.41s/it] 1%| | 39/4286 [15:20<29:12:18, 24.76s/it] {'loss': 0.0002, 'grad_norm': 1.3616102226637798, 'learning_rate': 9.909006066262248e-07, 'completion_length': 268.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.4113095551729202, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3755953311920166, 'reward_std': 0.23141776770353317, 'kl': 0.00388336181640625, 'epoch': 0.01} 1%| | 39/4286 [15:20<29:12:18, 24.76s/it][2025-03-02 15:13:08,640] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 1%| | 40/4286 [15:46<29:29:39, 25.01s/it] {'loss': 0.0001, 'grad_norm': 0.4955762695797538, 'learning_rate': 9.906672888474101e-07, 'completion_length': 295.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.3011479675769806, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.265433669090271, 'reward_std': 0.19561711698770523, 'kl': 0.00328826904296875, 'epoch': 0.01} 1%| | 40/4286 [15:46<29:29:39, 25.01s/it][2025-03-02 15:13:34,320] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 1%| | 41/4286 [16:11<29:43:33, 25.21s/it] {'loss': 0.0001, 'grad_norm': 0.7729658328753292, 'learning_rate': 9.904339710685954e-07, 'completion_length': 274.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.49659867584705353, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4608843922615051, 'reward_std': 0.22668445110321045, 'kl': 0.00308990478515625, 'epoch': 0.01} 1%| | 41/4286 [16:11<29:43:33, 25.21s/it] 1%| | 42/4286 [16:35<29:07:09, 24.70s/it] {'loss': 0.0001, 'grad_norm': 1.2060694620655679, 'learning_rate': 9.902006532897806e-07, 'completion_length': 273.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.5684524178504944, 'rewards/format_reward': 1.0, 'reward': 1.568452537059784, 'reward_std': 0.15380359441041946, 'kl': 0.0035247802734375, 'epoch': 0.01} 1%| | 42/4286 [16:35<29:07:09, 24.70s/it][2025-03-02 15:14:22,752] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 1%| | 43/4286 [17:00<29:11:19, 24.77s/it] {'loss': 0.0002, 'grad_norm': 1.2021484291190632, 'learning_rate': 9.899673355109659e-07, 'completion_length': 284.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.15328271687030792, 'kl': 0.00421142578125, 'epoch': 0.01} 1%| | 43/4286 [17:00<29:11:19, 24.77s/it] 1%| | 44/4286 [17:22<28:23:50, 24.10s/it] {'loss': 0.0002, 'grad_norm': 5.223712377051991, 'learning_rate': 9.897340177321512e-07, 'completion_length': 249.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.415178582072258, 'rewards/format_reward': 1.0, 'reward': 1.4151785969734192, 'reward_std': 0.10246941074728966, 'kl': 0.0048370361328125, 'epoch': 0.01} 1%| | 44/4286 [17:22<28:23:50, 24.10s/it][2025-03-02 15:15:08,795] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 1%| | 45/4286 [17:46<28:10:40, 23.92s/it] {'loss': 0.0002, 'grad_norm': 0.6227470004075416, 'learning_rate': 9.895006999533363e-07, 'completion_length': 285.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.4211309850215912, 'rewards/format_reward': 1.0, 'reward': 1.4211310744285583, 'reward_std': 0.13977742195129395, 'kl': 0.0042877197265625, 'epoch': 0.01} 1%| | 45/4286 [17:46<28:10:40, 23.92s/it][2025-03-02 15:15:32,731] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 1%| | 46/4286 [18:10<28:10:37, 23.92s/it] {'loss': 0.0002, 'grad_norm': 0.5038586441515338, 'learning_rate': 9.892673821745217e-07, 'completion_length': 294.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.48928573727607727, 'rewards/format_reward': 1.0, 'reward': 1.4892858266830444, 'reward_std': 0.13559852167963982, 'kl': 0.0045318603515625, 'epoch': 0.01} 1%| | 46/4286 [18:10<28:10:37, 23.92s/it] 1%| | 47/4286 [18:33<27:56:25, 23.73s/it] {'loss': 0.0002, 'grad_norm': 0.5643158790395144, 'learning_rate': 9.89034064395707e-07, 'completion_length': 282.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.38809525966644287, 'rewards/format_reward': 1.0, 'reward': 1.3880953192710876, 'reward_std': 0.10609548538923264, 'kl': 0.0044097900390625, 'epoch': 0.01} 1%| | 47/4286 [18:33<27:56:25, 23.73s/it] 1%| | 48/4286 [18:56<27:30:11, 23.36s/it] {'loss': 0.0002, 'grad_norm': 1.5838148932953922, 'learning_rate': 9.88800746616892e-07, 'completion_length': 270.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.5982143580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.580357313156128, 'reward_std': 0.24347585439682007, 'kl': 0.00390625, 'epoch': 0.01} 1%| | 48/4286 [18:56<27:30:11, 23.36s/it] 1%| | 49/4286 [19:18<27:13:57, 23.14s/it] {'loss': 0.0002, 'grad_norm': 0.6862352043808435, 'learning_rate': 9.885674288380774e-07, 'completion_length': 266.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.11319093778729439, 'kl': 0.00377655029296875, 'epoch': 0.01} 1%| | 49/4286 [19:18<27:13:57, 23.14s/it] 1%| | 50/4286 [19:41<27:02:56, 22.99s/it] {'loss': 0.0002, 'grad_norm': 0.7033520131852378, 'learning_rate': 9.883341110592628e-07, 'completion_length': 262.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.4233631193637848, 'rewards/format_reward': 1.0, 'reward': 1.423363208770752, 'reward_std': 0.09919556230306625, 'kl': 0.0041046142578125, 'epoch': 0.01} 1%| | 50/4286 [19:41<27:02:56, 22.99s/it] 1%| | 51/4286 [20:04<27:13:27, 23.14s/it] {'loss': 0.0002, 'grad_norm': 1.5974409989026082, 'learning_rate': 9.881007932804479e-07, 'completion_length': 272.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.4687500149011612, 'rewards/format_reward': 1.0, 'reward': 1.4687501192092896, 'reward_std': 0.15584345161914825, 'kl': 0.00443267822265625, 'epoch': 0.01} 1%| | 51/4286 [20:04<27:13:27, 23.14s/it] 1%| | 52/4286 [20:29<27:43:35, 23.57s/it] {'loss': 0.0002, 'grad_norm': 1.323658608170036, 'learning_rate': 9.878674755016332e-07, 'completion_length': 262.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.5104166865348816, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4747024774551392, 'reward_std': 0.21550318598747253, 'kl': 0.00409698486328125, 'epoch': 0.01} 1%| | 52/4286 [20:29<27:43:35, 23.57s/it][2025-03-02 15:18:17,317] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 1%| | 53/4286 [20:54<28:23:15, 24.14s/it] {'loss': 0.0003, 'grad_norm': 1.3968540058610375, 'learning_rate': 9.876341577228185e-07, 'completion_length': 276.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5758928954601288, 'rewards/format_reward': 1.0, 'reward': 1.5758929252624512, 'reward_std': 0.14393959566950798, 'kl': 0.00653076171875, 'epoch': 0.01} 1%| | 53/4286 [20:54<28:23:15, 24.14s/it] 1%|▏ | 54/4286 [21:16<27:35:48, 23.48s/it] {'loss': 0.0001, 'grad_norm': 0.5088460639683173, 'learning_rate': 9.874008399440036e-07, 'completion_length': 253.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.5788690894842148, 'rewards/format_reward': 1.0, 'reward': 1.5788691639900208, 'reward_std': 0.10581597685813904, 'kl': 0.0037078857421875, 'epoch': 0.01} 1%|▏ | 54/4286 [21:16<27:35:48, 23.48s/it] 1%|▏ | 55/4286 [21:40<27:34:19, 23.46s/it] {'loss': 0.0002, 'grad_norm': 1.9219392740325478, 'learning_rate': 9.87167522165189e-07, 'completion_length': 296.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.4970238208770752, 'rewards/format_reward': 1.0, 'reward': 1.4970239400863647, 'reward_std': 0.16544584557414055, 'kl': 0.00518798828125, 'epoch': 0.01} 1%|▏ | 55/4286 [21:40<27:34:19, 23.46s/it] 1%|▏ | 56/4286 [22:02<27:16:34, 23.21s/it] {'loss': 0.0003, 'grad_norm': 2.008515501062089, 'learning_rate': 9.869342043863743e-07, 'completion_length': 268.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.11828072741627693, 'kl': 0.0068359375, 'epoch': 0.01} 1%|▏ | 56/4286 [22:02<27:16:34, 23.21s/it] 1%|▏ | 57/4286 [22:23<26:31:47, 22.58s/it] {'loss': 0.0002, 'grad_norm': 1.9799450278501702, 'learning_rate': 9.867008866075594e-07, 'completion_length': 251.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.15981485694646835, 'kl': 0.004791259765625, 'epoch': 0.01} 1%|▏ | 57/4286 [22:23<26:31:47, 22.58s/it] 1%|▏ | 58/4286 [22:47<26:59:32, 22.98s/it] {'loss': 0.0002, 'grad_norm': 0.9910411754175749, 'learning_rate': 9.864675688287447e-07, 'completion_length': 269.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5193452835083008, 'rewards/format_reward': 1.0, 'reward': 1.5193453431129456, 'reward_std': 0.12911851704120636, 'kl': 0.006134033203125, 'epoch': 0.01} 1%|▏ | 58/4286 [22:47<26:59:32, 22.98s/it][2025-03-02 15:20:33,087] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 1%|▏ | 59/4286 [23:10<26:54:24, 22.92s/it] {'loss': 0.0003, 'grad_norm': 0.8596743239889869, 'learning_rate': 9.862342510499299e-07, 'completion_length': 232.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.5238095819950104, 'rewards/format_reward': 1.0, 'reward': 1.5238096117973328, 'reward_std': 0.12439596280455589, 'kl': 0.0081329345703125, 'epoch': 0.01} 1%|▏ | 59/4286 [23:10<26:54:24, 22.92s/it] 1%|▏ | 60/4286 [23:33<26:41:54, 22.74s/it] {'loss': 0.0002, 'grad_norm': 0.7395029353407906, 'learning_rate': 9.860009332711152e-07, 'completion_length': 251.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.4779762476682663, 'rewards/format_reward': 1.0, 'reward': 1.4779763221740723, 'reward_std': 0.12219792604446411, 'kl': 0.0052490234375, 'epoch': 0.01} 1%|▏ | 60/4286 [23:33<26:41:54, 22.74s/it] 1%|▏ | 61/4286 [23:54<26:11:13, 22.31s/it] {'loss': 0.0002, 'grad_norm': 1.3515548239294872, 'learning_rate': 9.857676154923005e-07, 'completion_length': 278.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.5520833730697632, 'rewards/format_reward': 1.0, 'reward': 1.552083432674408, 'reward_std': 0.12492924556136131, 'kl': 0.0047149658203125, 'epoch': 0.01} 1%|▏ | 61/4286 [23:54<26:11:13, 22.31s/it] 1%|▏ | 62/4286 [24:17<26:23:52, 22.50s/it] {'loss': 0.0002, 'grad_norm': 0.8651305409287975, 'learning_rate': 9.855342977134856e-07, 'completion_length': 276.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.18991459906101227, 'kl': 0.0050811767578125, 'epoch': 0.01} 1%|▏ | 62/4286 [24:17<26:23:52, 22.50s/it] 1%|▏ | 63/4286 [24:38<26:05:46, 22.25s/it] {'loss': 0.0003, 'grad_norm': 0.9837621078980275, 'learning_rate': 9.85300979934671e-07, 'completion_length': 246.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6339287161827087, 'reward_std': 0.19521182775497437, 'kl': 0.006591796875, 'epoch': 0.01} 1%|▏ | 63/4286 [24:38<26:05:46, 22.25s/it][2025-03-02 15:22:22,926] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 1%|▏ | 64/4286 [25:00<25:51:42, 22.05s/it] {'loss': 0.0003, 'grad_norm': 0.6935934967723442, 'learning_rate': 9.850676621558563e-07, 'completion_length': 229.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.5342262387275696, 'rewards/format_reward': 1.0, 'reward': 1.5342262983322144, 'reward_std': 0.09761026594787836, 'kl': 0.0069580078125, 'epoch': 0.01} 1%|▏ | 64/4286 [25:00<25:51:42, 22.05s/it] 2%|▏ | 65/4286 [25:22<25:43:28, 21.94s/it] {'loss': 0.0003, 'grad_norm': 1.184931694565328, 'learning_rate': 9.848343443770414e-07, 'completion_length': 246.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5982144474983215, 'reward_std': 0.1840764358639717, 'kl': 0.0069122314453125, 'epoch': 0.02} 2%|▏ | 65/4286 [25:22<25:43:28, 21.94s/it] 2%|▏ | 66/4286 [25:44<25:55:26, 22.12s/it] {'loss': 0.0002, 'grad_norm': 1.370550870848877, 'learning_rate': 9.846010265982267e-07, 'completion_length': 236.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.1534072458744049, 'kl': 0.005950927734375, 'epoch': 0.02} 2%|▏ | 66/4286 [25:44<25:55:26, 22.12s/it] 2%|▏ | 67/4286 [26:08<26:30:15, 22.62s/it] {'loss': 0.0003, 'grad_norm': 0.9239429870057287, 'learning_rate': 9.84367708819412e-07, 'completion_length': 263.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6684524118900299, 'rewards/format_reward': 1.0, 'reward': 1.6684525609016418, 'reward_std': 0.14317354559898376, 'kl': 0.0077667236328125, 'epoch': 0.02} 2%|▏ | 67/4286 [26:08<26:30:15, 22.62s/it] 2%|▏ | 68/4286 [26:29<25:55:57, 22.13s/it] {'loss': 0.0003, 'grad_norm': 1.1154605499672474, 'learning_rate': 9.841343910405972e-07, 'completion_length': 243.26787567138672, 'rewards/only_full_func_accuracy_reward': 0.443452388048172, 'rewards/format_reward': 1.0, 'reward': 1.4434524774551392, 'reward_std': 0.16678961366415024, 'kl': 0.008636474609375, 'epoch': 0.02} 2%|▏ | 68/4286 [26:29<25:55:57, 22.13s/it] 2%|▏ | 69/4286 [26:52<26:19:56, 22.48s/it] {'loss': 0.0003, 'grad_norm': 1.3060578601675898, 'learning_rate': 9.839010732617825e-07, 'completion_length': 257.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.5610119700431824, 'rewards/format_reward': 1.0, 'reward': 1.5610119700431824, 'reward_std': 0.15109313279390335, 'kl': 0.0065155029296875, 'epoch': 0.02} 2%|▏ | 69/4286 [26:52<26:19:56, 22.48s/it][2025-03-02 15:24:40,421] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 2%|▏ | 70/4286 [27:17<27:17:13, 23.30s/it] {'loss': 0.0003, 'grad_norm': 0.5230753371301726, 'learning_rate': 9.836677554829678e-07, 'completion_length': 269.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.4672619253396988, 'rewards/format_reward': 1.0, 'reward': 1.4672620296478271, 'reward_std': 0.12829656526446342, 'kl': 0.0079345703125, 'epoch': 0.02} 2%|▏ | 70/4286 [27:18<27:17:13, 23.30s/it] 2%|▏ | 71/4286 [27:41<27:14:36, 23.27s/it] {'loss': 0.0003, 'grad_norm': 0.7142572328348586, 'learning_rate': 9.83434437704153e-07, 'completion_length': 267.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 1.0, 'reward': 1.6235119700431824, 'reward_std': 0.10380810871720314, 'kl': 0.0075225830078125, 'epoch': 0.02} 2%|▏ | 71/4286 [27:41<27:14:36, 23.27s/it] 2%|▏ | 72/4286 [28:04<27:16:48, 23.31s/it] {'loss': 0.0003, 'grad_norm': 0.7493836336522479, 'learning_rate': 9.832011199253383e-07, 'completion_length': 245.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 1.0, 'reward': 1.59077388048172, 'reward_std': 0.13512895815074444, 'kl': 0.0077362060546875, 'epoch': 0.02} 2%|▏ | 72/4286 [28:04<27:16:48, 23.31s/it] 2%|▏ | 73/4286 [28:28<27:19:23, 23.35s/it] {'loss': 0.0003, 'grad_norm': 1.7617251604139446, 'learning_rate': 9.829678021465236e-07, 'completion_length': 284.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.509523868560791, 'rewards/format_reward': 1.0, 'reward': 1.5095239281654358, 'reward_std': 0.22027657181024551, 'kl': 0.00775146484375, 'epoch': 0.02} 2%|▏ | 73/4286 [28:28<27:19:23, 23.35s/it] 2%|▏ | 74/4286 [28:50<26:58:07, 23.05s/it] {'loss': 0.0003, 'grad_norm': 0.6197067483597251, 'learning_rate': 9.827344843677087e-07, 'completion_length': 284.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6163691282272339, 'rewards/format_reward': 1.0, 'reward': 1.6163691282272339, 'reward_std': 0.11302809789776802, 'kl': 0.0070648193359375, 'epoch': 0.02} 2%|▏ | 74/4286 [28:50<26:58:07, 23.05s/it] 2%|▏ | 75/4286 [29:14<27:22:09, 23.40s/it] {'loss': 0.0003, 'grad_norm': 0.759355582938135, 'learning_rate': 9.82501166588894e-07, 'completion_length': 285.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.5119047909975052, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.3869048953056335, 'reward_std': 0.3165854513645172, 'kl': 0.0084228515625, 'epoch': 0.02} 2%|▏ | 75/4286 [29:14<27:22:09, 23.40s/it] 2%|▏ | 76/4286 [29:38<27:35:18, 23.59s/it] {'loss': 0.0005, 'grad_norm': 10.546735264379507, 'learning_rate': 9.822678488100794e-07, 'completion_length': 277.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4020833522081375, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3663691878318787, 'reward_std': 0.17205670475959778, 'kl': 0.011688232421875, 'epoch': 0.02} 2%|▏ | 76/4286 [29:38<27:35:18, 23.59s/it][2025-03-02 15:27:27,674] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 2%|▏ | 77/4286 [30:05<28:38:30, 24.50s/it] {'loss': 0.0005, 'grad_norm': 0.7585013223197496, 'learning_rate': 9.820345310312645e-07, 'completion_length': 285.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5818452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.563988208770752, 'reward_std': 0.23371950536966324, 'kl': 0.012939453125, 'epoch': 0.02} 2%|▏ | 77/4286 [30:05<28:38:30, 24.50s/it][2025-03-02 15:27:52,154] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 2%|▏ | 78/4286 [30:29<28:37:43, 24.49s/it] {'loss': 0.0003, 'grad_norm': 0.6405546869197962, 'learning_rate': 9.818012132524498e-07, 'completion_length': 262.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.5877976417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5699406266212463, 'reward_std': 0.2025759071111679, 'kl': 0.00750732421875, 'epoch': 0.02} 2%|▏ | 78/4286 [30:29<28:37:43, 24.49s/it] 2%|▏ | 79/4286 [30:51<27:47:42, 23.78s/it] {'loss': 0.0004, 'grad_norm': 3.534037212671902, 'learning_rate': 9.815678954736352e-07, 'completion_length': 273.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4642858505249023, 'reward_std': 0.21510930359363556, 'kl': 0.009796142578125, 'epoch': 0.02} 2%|▏ | 79/4286 [30:51<27:47:42, 23.78s/it] 2%|▏ | 80/4286 [31:13<27:05:35, 23.19s/it] {'loss': 0.0003, 'grad_norm': 2.21316501567496, 'learning_rate': 9.813345776948203e-07, 'completion_length': 271.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.47291670739650726, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4550595879554749, 'reward_std': 0.12791457027196884, 'kl': 0.0066986083984375, 'epoch': 0.02} 2%|▏ | 80/4286 [31:13<27:05:35, 23.19s/it] 2%|▏ | 81/4286 [31:36<26:48:02, 22.94s/it] {'loss': 0.0003, 'grad_norm': 0.7581399432981502, 'learning_rate': 9.811012599160056e-07, 'completion_length': 283.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.6264881193637848, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.14964647591114044, 'kl': 0.0066986083984375, 'epoch': 0.02} 2%|▏ | 81/4286 [31:36<26:48:02, 22.94s/it][2025-03-02 15:29:22,209] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 2%|▏ | 82/4286 [31:59<27:04:31, 23.19s/it] {'loss': 0.0003, 'grad_norm': 0.8785531101687957, 'learning_rate': 9.808679421371907e-07, 'completion_length': 250.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5886904895305634, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.570833444595337, 'reward_std': 0.1878114864230156, 'kl': 0.0078582763671875, 'epoch': 0.02} 2%|▏ | 82/4286 [31:59<27:04:31, 23.19s/it] 2%|▏ | 83/4286 [32:23<27:22:23, 23.45s/it] {'loss': 0.0004, 'grad_norm': 0.9869865536060365, 'learning_rate': 9.80634624358376e-07, 'completion_length': 271.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.46190477907657623, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.444047749042511, 'reward_std': 0.17826277017593384, 'kl': 0.00927734375, 'epoch': 0.02} 2%|▏ | 83/4286 [32:23<27:22:23, 23.45s/it] 2%|▏ | 84/4286 [32:46<27:10:42, 23.28s/it] {'loss': 0.0003, 'grad_norm': 0.6466834816962715, 'learning_rate': 9.804013065795614e-07, 'completion_length': 276.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4985119551420212, 'rewards/format_reward': 1.0, 'reward': 1.4985119700431824, 'reward_std': 0.18932335823774338, 'kl': 0.0068359375, 'epoch': 0.02} 2%|▏ | 84/4286 [32:46<27:10:42, 23.28s/it] 2%|▏ | 85/4286 [33:09<26:59:59, 23.14s/it] {'loss': 0.0002, 'grad_norm': 1.5799074969167426, 'learning_rate': 9.801679888007465e-07, 'completion_length': 261.9643020629883, 'rewards/only_full_func_accuracy_reward': 0.4925595670938492, 'rewards/format_reward': 1.0, 'reward': 1.4925596714019775, 'reward_std': 0.07311070151627064, 'kl': 0.0060577392578125, 'epoch': 0.02} 2%|▏ | 85/4286 [33:09<26:59:59, 23.14s/it] 2%|▏ | 86/4286 [33:32<27:04:56, 23.21s/it] {'loss': 0.0003, 'grad_norm': 1.2946421717147214, 'learning_rate': 9.799346710219318e-07, 'completion_length': 256.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.5595238357782364, 'rewards/format_reward': 1.0, 'reward': 1.55952388048172, 'reward_std': 0.1303369589149952, 'kl': 0.007843017578125, 'epoch': 0.02} 2%|▏ | 86/4286 [33:32<27:04:56, 23.21s/it] 2%|▏ | 87/4286 [33:55<27:01:03, 23.16s/it] {'loss': 0.0002, 'grad_norm': 0.6787470135062725, 'learning_rate': 9.797013532431171e-07, 'completion_length': 303.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.4077381044626236, 'rewards/format_reward': 1.0, 'reward': 1.4077381491661072, 'reward_std': 0.13066881150007248, 'kl': 0.005645751953125, 'epoch': 0.02} 2%|▏ | 87/4286 [33:55<27:01:03, 23.16s/it] 2%|▏ | 88/4286 [34:19<27:11:27, 23.32s/it] {'loss': 0.0003, 'grad_norm': 0.5962649408881759, 'learning_rate': 9.794680354643023e-07, 'completion_length': 292.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6202380955219269, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6023810505867004, 'reward_std': 0.21196506172418594, 'kl': 0.006256103515625, 'epoch': 0.02} 2%|▏ | 88/4286 [34:19<27:11:27, 23.32s/it] 2%|▏ | 89/4286 [34:43<27:22:34, 23.48s/it] {'loss': 0.0003, 'grad_norm': 0.5111786266664878, 'learning_rate': 9.792347176854876e-07, 'completion_length': 288.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.4895833730697632, 'rewards/format_reward': 1.0, 'reward': 1.4895834922790527, 'reward_std': 0.16105982288718224, 'kl': 0.0064849853515625, 'epoch': 0.02} 2%|▏ | 89/4286 [34:43<27:22:34, 23.48s/it] 2%|▏ | 90/4286 
[35:05<27:00:28, 23.17s/it] {'loss': 0.0003, 'grad_norm': 0.8174279195699401, 'learning_rate': 9.79001399906673e-07, 'completion_length': 240.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5252976566553116, 'rewards/format_reward': 1.0, 'reward': 1.52529776096344, 'reward_std': 0.17716515064239502, 'kl': 0.008331298828125, 'epoch': 0.02} 2%|▏ | 90/4286 [35:05<27:00:28, 23.17s/it] 2%|▏ | 91/4286 [35:27<26:34:50, 22.81s/it] {'loss': 0.0003, 'grad_norm': 0.5489091500701785, 'learning_rate': 9.78768082127858e-07, 'completion_length': 282.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.08769076690077782, 'kl': 0.0065460205078125, 'epoch': 0.02} 2%|▏ | 91/4286 [35:27<26:34:50, 22.81s/it] 2%|▏ | 92/4286 [35:52<27:02:50, 23.22s/it] {'loss': 0.0003, 'grad_norm': 0.7623524519038397, 'learning_rate': 9.785347643490434e-07, 'completion_length': 303.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.5654762387275696, 'rewards/format_reward': 1.0, 'reward': 1.5654762983322144, 'reward_std': 0.15013636648654938, 'kl': 0.006683349609375, 'epoch': 0.02} 2%|▏ | 92/4286 [35:52<27:02:50, 23.22s/it] 2%|▏ | 93/4286 [36:15<27:07:25, 23.29s/it] {'loss': 0.0003, 'grad_norm': 2.240147246366759, 'learning_rate': 9.783014465702287e-07, 'completion_length': 251.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.5943453013896942, 'rewards/format_reward': 1.0, 'reward': 1.5943453311920166, 'reward_std': 0.14811362326145172, 'kl': 0.006561279296875, 'epoch': 0.02} 2%|▏ | 93/4286 [36:15<27:07:25, 23.29s/it] 2%|▏ | 94/4286 [36:37<26:47:13, 23.00s/it] {'loss': 0.0003, 'grad_norm': 0.491387061048388, 'learning_rate': 9.780681287914138e-07, 'completion_length': 282.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562501192092896, 'reward_std': 0.10706287249922752, 'kl': 0.0071868896484375, 'epoch': 0.02} 2%|▏ | 94/4286 [36:37<26:47:13, 23.00s/it] 2%|▏ | 95/4286 [37:01<27:00:29, 23.20s/it] {'loss': 0.0003, 'grad_norm': 0.4766710320792105, 'learning_rate': 9.778348110125991e-07, 'completion_length': 312.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.5684524178504944, 'rewards/format_reward': 1.0, 'reward': 1.568452537059784, 'reward_std': 0.08348604664206505, 'kl': 0.006561279296875, 'epoch': 0.02} 2%|▏ | 95/4286 [37:01<27:00:29, 23.20s/it] 2%|▏ | 96/4286 [37:24<26:59:52, 23.20s/it] {'loss': 0.0004, 'grad_norm': 5.269446215901448, 'learning_rate': 9.776014932337845e-07, 'completion_length': 283.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.5095238536596298, 'rewards/format_reward': 1.0, 'reward': 1.5095239281654358, 'reward_std': 0.1504560336470604, 'kl': 0.0107421875, 'epoch': 0.02} 2%|▏ | 96/4286 [37:24<26:59:52, 23.20s/it] 2%|▏ | 97/4286 [37:47<26:55:10, 23.13s/it] {'loss': 0.0003, 'grad_norm': 1.7489188763539607, 'learning_rate': 9.773681754549696e-07, 'completion_length': 270.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5877976566553116, 'rewards/format_reward': 1.0, 'reward': 1.5877977013587952, 'reward_std': 0.23070310056209564, 'kl': 0.0073089599609375, 'epoch': 0.02} 2%|▏ | 97/4286 [37:47<26:55:10, 23.13s/it] 2%|▏ | 98/4286 [38:09<26:35:00, 22.85s/it] {'loss': 0.0003, 'grad_norm': 0.8064857744793836, 'learning_rate': 9.77134857676155e-07, 'completion_length': 266.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.4925595670938492, 'rewards/format_reward': 1.0, 'reward': 
1.4925596714019775, 'reward_std': 0.15210693329572678, 'kl': 0.0074005126953125, 'epoch': 0.02} 2%|▏ | 98/4286 [38:09<26:35:00, 22.85s/it] 2%|▏ | 99/4286 [38:33<26:42:08, 22.96s/it] {'loss': 0.0003, 'grad_norm': 0.9182846903096532, 'learning_rate': 9.769015398973402e-07, 'completion_length': 286.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5610119253396988, 'rewards/format_reward': 1.0, 'reward': 1.5610119700431824, 'reward_std': 0.16366712003946304, 'kl': 0.0072021484375, 'epoch': 0.02} 2%|▏ | 99/4286 [38:33<26:42:08, 22.96s/it] 2%|▏ | 100/4286 [38:56<26:54:54, 23.15s/it] {'loss': 0.0003, 'grad_norm': 0.6364759452507801, 'learning_rate': 9.766682221185254e-07, 'completion_length': 285.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 1.0, 'reward': 1.65029776096344, 'reward_std': 0.09619198366999626, 'kl': 0.00677490234375, 'epoch': 0.02} 2%|▏ | 100/4286 [38:56<26:54:54, 23.15s/it] 2%|▏ | 101/4286 [43:46<119:59:46, 103.22s/it] {'loss': 0.0003, 'grad_norm': 0.9653083479981857, 'learning_rate': 9.764349043397107e-07, 'completion_length': 292.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.4627976715564728, 'rewards/format_reward': 1.0, 'reward': 1.4627977013587952, 'reward_std': 0.07557233236730099, 'kl': 0.0072479248046875, 'epoch': 0.02} 2%|▏ | 101/4286 [43:46<119:59:46, 103.22s/it] 2%|▏ | 102/4286 [44:10<92:18:02, 79.42s/it] {'loss': 0.0003, 'grad_norm': 0.8258530564211637, 'learning_rate': 9.76201586560896e-07, 'completion_length': 277.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6220238506793976, 'rewards/format_reward': 1.0, 'reward': 1.6220239400863647, 'reward_std': 0.15993988513946533, 'kl': 0.0080413818359375, 'epoch': 0.02} 2%|▏ | 102/4286 [44:10<92:18:02, 79.42s/it] 2%|▏ | 103/4286 [44:37<74:02:20, 63.72s/it] {'loss': 0.0003, 'grad_norm': 0.5013709082894343, 'learning_rate': 9.759682687820811e-07, 'completion_length': 302.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.5211309790611267, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.485416829586029, 'reward_std': 0.1566486358642578, 'kl': 0.006256103515625, 'epoch': 0.02} 2%|▏ | 103/4286 [44:37<74:02:20, 63.72s/it] 2%|▏ | 104/4286 [45:02<60:17:08, 51.90s/it] {'loss': 0.0003, 'grad_norm': 0.5337168499958969, 'learning_rate': 9.757349510032665e-07, 'completion_length': 271.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.4761905074119568, 'rewards/format_reward': 1.0, 'reward': 1.4761905670166016, 'reward_std': 0.051889022812247276, 'kl': 0.0067291259765625, 'epoch': 0.02} 2%|▏ | 104/4286 [45:02<60:17:08, 51.90s/it] 2%|▏ | 105/4286 [45:25<50:26:26, 43.43s/it] {'loss': 0.0003, 'grad_norm': 0.8255468117831706, 'learning_rate': 9.755016332244516e-07, 'completion_length': 289.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.5773810148239136, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5595239400863647, 'reward_std': 0.15940962731838226, 'kl': 0.0068511962890625, 'epoch': 0.02} 2%|▏ | 105/4286 [45:25<50:26:26, 43.43s/it] 2%|▏ | 106/4286 [45:50<43:45:08, 37.68s/it] {'loss': 0.0003, 'grad_norm': 0.4447961949605477, 'learning_rate': 9.75268315445637e-07, 'completion_length': 302.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6041666865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5863096117973328, 'reward_std': 0.18680847436189651, 'kl': 0.00750732421875, 'epoch': 0.02} 2%|▏ | 106/4286 [45:50<43:45:08, 37.68s/it] 2%|▏ | 107/4286 [46:14<39:06:13, 33.69s/it] 
{'loss': 0.0003, 'grad_norm': 0.8367137508564089, 'learning_rate': 9.750349976668222e-07, 'completion_length': 283.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.5833334028720856, 'rewards/format_reward': 1.0, 'reward': 1.583333432674408, 'reward_std': 0.1544884890317917, 'kl': 0.0077667236328125, 'epoch': 0.02} 2%|▏ | 107/4286 [46:14<39:06:13, 33.69s/it] 3%|▎ | 108/4286 [46:39<36:03:11, 31.07s/it] {'loss': 0.0003, 'grad_norm': 0.8642662560226869, 'learning_rate': 9.748016798880073e-07, 'completion_length': 302.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 1.0, 'reward': 1.6443453431129456, 'reward_std': 0.12871142476797104, 'kl': 0.00653076171875, 'epoch': 0.03} 3%|▎ | 108/4286 [46:39<36:03:11, 31.07s/it] 3%|▎ | 109/4286 [47:03<33:30:49, 28.88s/it] {'loss': 0.0004, 'grad_norm': 0.904944897196603, 'learning_rate': 9.745683621091927e-07, 'completion_length': 258.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.6119047701358795, 'rewards/format_reward': 1.0, 'reward': 1.6119049191474915, 'reward_std': 0.21823062002658844, 'kl': 0.008758544921875, 'epoch': 0.03} 3%|▎ | 109/4286 [47:03<33:30:49, 28.88s/it] 3%|▎ | 110/4286 [47:28<32:10:37, 27.74s/it] {'loss': 0.0004, 'grad_norm': 0.7481803231343656, 'learning_rate': 9.74335044330378e-07, 'completion_length': 282.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5052721351385117, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.487415075302124, 'reward_std': 0.20508097857236862, 'kl': 0.009063720703125, 'epoch': 0.03} 3%|▎ | 110/4286 [47:28<32:10:37, 27.74s/it] 3%|▎ | 111/4286 [47:53<31:21:28, 27.04s/it] {'loss': 0.0003, 'grad_norm': 0.47598343310427993, 'learning_rate': 9.741017265515631e-07, 'completion_length': 300.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.617559552192688, 'rewards/format_reward': 1.0, 'reward': 1.6175596714019775, 'reward_std': 0.1391613557934761, 'kl': 0.0076904296875, 'epoch': 0.03} 3%|▎ | 111/4286 [47:53<31:21:28, 27.04s/it] 3%|▎ | 112/4286 [48:19<30:53:01, 26.64s/it] {'loss': 0.0003, 'grad_norm': 0.7766801100976781, 'learning_rate': 9.738684087727484e-07, 'completion_length': 296.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.5684524178504944, 'rewards/format_reward': 1.0, 'reward': 1.5684524774551392, 'reward_std': 0.13283320143818855, 'kl': 0.007598876953125, 'epoch': 0.03} 3%|▎ | 112/4286 [48:19<30:53:01, 26.64s/it] 3%|▎ | 113/4286 [48:42<29:50:06, 25.74s/it] {'loss': 0.0004, 'grad_norm': 0.37044512506234895, 'learning_rate': 9.736350909939338e-07, 'completion_length': 281.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.6666667461395264, 'reward_std': 0.10053746402263641, 'kl': 0.008941650390625, 'epoch': 0.03} 3%|▎ | 113/4286 [48:42<29:50:06, 25.74s/it] 3%|▎ | 114/4286 [49:08<29:52:22, 25.78s/it] {'loss': 0.0004, 'grad_norm': 0.8633141555321671, 'learning_rate': 9.734017732151189e-07, 'completion_length': 311.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.45803575217723846, 'rewards/format_reward': 1.0, 'reward': 1.4580358266830444, 'reward_std': 0.13096698001027107, 'kl': 0.008880615234375, 'epoch': 0.03} 3%|▎ | 114/4286 [49:08<29:52:22, 25.78s/it] 3%|▎ | 115/4286 [49:33<29:21:56, 25.35s/it] {'loss': 0.0003, 'grad_norm': 0.6057674565288076, 'learning_rate': 9.731684554363042e-07, 'completion_length': 297.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6157738566398621, 'rewards/format_reward': 1.0, 
'reward': 1.6157739162445068, 'reward_std': 0.16079571098089218, 'kl': 0.0087127685546875, 'epoch': 0.03} 3%|▎ | 115/4286 [49:33<29:21:56, 25.35s/it] 3%|▎ | 116/4286 [49:57<29:09:13, 25.17s/it] {'loss': 0.0003, 'grad_norm': 0.645421705988292, 'learning_rate': 9.729351376574895e-07, 'completion_length': 278.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.4895833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4717263579368591, 'reward_std': 0.17396444082260132, 'kl': 0.008392333984375, 'epoch': 0.03} 3%|▎ | 116/4286 [49:57<29:09:13, 25.17s/it] 3%|▎ | 117/4286 [50:22<29:05:40, 25.12s/it] {'loss': 0.0003, 'grad_norm': 19.686313236674135, 'learning_rate': 9.727018198786747e-07, 'completion_length': 280.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6732143759727478, 'rewards/format_reward': 1.0, 'reward': 1.6732143759727478, 'reward_std': 0.09642508998513222, 'kl': 0.008544921875, 'epoch': 0.03} 3%|▎ | 117/4286 [50:22<29:05:40, 25.12s/it] 3%|▎ | 118/4286 [50:47<28:43:57, 24.82s/it] {'loss': 0.0003, 'grad_norm': 0.6812501794408203, 'learning_rate': 9.7246850209986e-07, 'completion_length': 283.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068454027175903, 'reward_std': 0.15444907546043396, 'kl': 0.0082855224609375, 'epoch': 0.03} 3%|▎ | 118/4286 [50:47<28:43:57, 24.82s/it] 3%|▎ | 119/4286 [51:11<28:33:13, 24.67s/it] {'loss': 0.0004, 'grad_norm': 0.6746152114393515, 'learning_rate': 9.722351843210453e-07, 'completion_length': 306.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.5461309850215912, 'rewards/format_reward': 1.0, 'reward': 1.5461310744285583, 'reward_std': 0.09318274259567261, 'kl': 0.00946044921875, 'epoch': 0.03} 3%|▎ | 119/4286 [51:11<28:33:13, 24.67s/it][2025-03-02 15:49:01,151] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 3%|▎ | 120/4286 [51:38<29:29:44, 25.49s/it] {'loss': 0.0004, 'grad_norm': 0.6372290699878612, 'learning_rate': 9.720018665422304e-07, 'completion_length': 287.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6501701176166534, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6323130130767822, 'reward_std': 0.17344967275857925, 'kl': 0.009552001953125, 'epoch': 0.03} 3%|▎ | 120/4286 [51:38<29:29:44, 25.49s/it] 3%|▎ | 121/4286 [52:02<28:59:58, 25.07s/it] {'loss': 0.0003, 'grad_norm': 0.4436864015971846, 'learning_rate': 9.717685487634158e-07, 'completion_length': 292.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5327381044626236, 'rewards/format_reward': 1.0, 'reward': 1.532738208770752, 'reward_std': 0.15510808676481247, 'kl': 0.0077362060546875, 'epoch': 0.03} 3%|▎ | 121/4286 [52:02<28:59:58, 25.07s/it] 3%|▎ | 122/4286 [52:28<29:03:24, 25.12s/it] {'loss': 0.0004, 'grad_norm': 1.3130227775356702, 'learning_rate': 9.71535230984601e-07, 'completion_length': 303.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.46150796115398407, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4436508417129517, 'reward_std': 0.160709410905838, 'kl': 0.009613037109375, 'epoch': 0.03} 3%|▎ | 122/4286 [52:28<29:03:24, 25.12s/it][2025-03-02 15:50:16,190] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 3%|▎ | 123/4286 [52:53<29:15:11, 25.30s/it] {'loss': 0.0004, 'grad_norm': 1.8121065103066816, 'learning_rate': 9.713019132057862e-07, 'completion_length': 288.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949406266212463, 'reward_std': 0.12641322053968906, 'kl': 0.009124755859375, 'epoch': 0.03} 3%|▎ | 123/4286 [52:53<29:15:11, 25.30s/it][2025-03-02 15:50:41,791] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 3%|▎ | 124/4286 [53:19<29:21:07, 25.39s/it] {'loss': 0.0004, 'grad_norm': 0.9592624659036246, 'learning_rate': 9.710685954269715e-07, 'completion_length': 306.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.12812386453151703, 'kl': 0.009552001953125, 'epoch': 0.03} 3%|▎ | 124/4286 [53:19<29:21:07, 25.39s/it] 3%|▎ | 125/4286 [53:43<29:02:56, 25.13s/it] {'loss': 0.0004, 'grad_norm': 0.6149570225122105, 'learning_rate': 9.708352776481569e-07, 'completion_length': 277.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.4791667014360428, 'rewards/format_reward': 1.0, 'reward': 1.4791668057441711, 'reward_std': 0.07382954470813274, 'kl': 0.00897216796875, 'epoch': 0.03} 3%|▎ | 125/4286 [53:43<29:02:56, 25.13s/it] 3%|▎ | 126/4286 [54:08<28:54:53, 25.02s/it] {'loss': 0.0004, 'grad_norm': 0.5830411286887371, 'learning_rate': 9.70601959869342e-07, 'completion_length': 300.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.633928656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5982143878936768, 'reward_std': 0.2104962319135666, 'kl': 0.009979248046875, 'epoch': 0.03} 3%|▎ | 126/4286 [54:08<28:54:53, 25.02s/it][2025-03-02 15:51:54,321] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 3%|▎ | 127/4286 [54:31<28:17:10, 24.48s/it] {'loss': 0.0004, 'grad_norm': 2.1932983311377803, 'learning_rate': 9.703686420905273e-07, 'completion_length': 270.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.06845237873494625, 'kl': 0.011016845703125, 'epoch': 0.03} 3%|▎ | 127/4286 [54:31<28:17:10, 24.48s/it] 3%|▎ | 128/4286 [54:56<28:17:49, 24.50s/it] {'loss': 0.0004, 'grad_norm': 0.5597733465064139, 'learning_rate': 9.701353243117124e-07, 'completion_length': 308.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.130731962621212, 'kl': 0.011199951171875, 'epoch': 0.03} 3%|▎ | 128/4286 [54:56<28:17:49, 24.50s/it] 3%|▎ | 129/4286 [55:20<28:01:30, 24.27s/it] {'loss': 0.0003, 'grad_norm': 0.654685418236714, 'learning_rate': 9.699020065328977e-07, 'completion_length': 298.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.49910716712474823, 'rewards/format_reward': 1.0, 'reward': 1.499107301235199, 'reward_std': 0.10835172981023788, 'kl': 0.00823974609375, 'epoch': 0.03} 3%|▎ | 129/4286 [55:20<28:01:30, 24.27s/it] 3%|▎ | 130/4286 [55:43<27:36:38, 23.92s/it] {'loss': 0.0004, 'grad_norm': 0.6220129404110091, 'learning_rate': 9.69668688754083e-07, 'completion_length': 295.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5000000447034836, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.10562239959836006, 'kl': 0.010467529296875, 'epoch': 0.03} 3%|▎ | 130/4286 
[55:43<27:36:38, 23.92s/it] 3%|▎ | 131/4286 [56:10<28:37:54, 24.81s/it] {'loss': 0.0005, 'grad_norm': 0.8652713743157741, 'learning_rate': 9.694353709752682e-07, 'completion_length': 310.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.5491071939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5312501788139343, 'reward_std': 0.12895772233605385, 'kl': 0.01171875, 'epoch': 0.03} 3%|▎ | 131/4286 [56:10<28:37:54, 24.81s/it] 3%|▎ | 132/4286 [56:34<28:24:15, 24.62s/it] {'loss': 0.0004, 'grad_norm': 0.48824972067970346, 'learning_rate': 9.692020531964535e-07, 'completion_length': 302.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.5636904835700989, 'rewards/format_reward': 1.0, 'reward': 1.5636905431747437, 'reward_std': 0.08157341368496418, 'kl': 0.01055908203125, 'epoch': 0.03} 3%|▎ | 132/4286 [56:34<28:24:15, 24.62s/it] 3%|▎ | 133/4286 [56:59<28:28:38, 24.69s/it] {'loss': 0.0003, 'grad_norm': 0.4610594623088054, 'learning_rate': 9.689687354176389e-07, 'completion_length': 291.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.4913690835237503, 'rewards/format_reward': 1.0, 'reward': 1.4913691282272339, 'reward_std': 0.11666245944797993, 'kl': 0.0082855224609375, 'epoch': 0.03} 3%|▎ | 133/4286 [56:59<28:28:38, 24.69s/it] 3%|▎ | 134/4286 [57:22<28:05:49, 24.36s/it] {'loss': 0.0004, 'grad_norm': 0.7204332530587318, 'learning_rate': 9.68735417638824e-07, 'completion_length': 281.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.522321492433548, 'rewards/format_reward': 1.0, 'reward': 1.5223215818405151, 'reward_std': 0.1257556788623333, 'kl': 0.010833740234375, 'epoch': 0.03} 3%|▎ | 134/4286 [57:22<28:05:49, 24.36s/it] 3%|▎ | 135/4286 [57:46<27:51:32, 24.16s/it] {'loss': 0.0004, 'grad_norm': 0.37018351873573774, 'learning_rate': 9.685020998600093e-07, 'completion_length': 291.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 1.0, 'reward': 1.6294644474983215, 'reward_std': 0.07397740334272385, 'kl': 0.01025390625, 'epoch': 0.03} 3%|▎ | 135/4286 [57:46<27:51:32, 24.16s/it] 3%|▎ | 136/4286 [58:10<27:54:51, 24.21s/it] {'loss': 0.0006, 'grad_norm': 0.869001146512396, 'learning_rate': 9.682687820811946e-07, 'completion_length': 270.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.5654762387275696, 'rewards/format_reward': 1.0, 'reward': 1.5654762983322144, 'reward_std': 0.13759475946426392, 'kl': 0.0155029296875, 'epoch': 0.03} 3%|▎ | 136/4286 [58:10<27:54:51, 24.21s/it] 3%|▎ | 137/4286 [58:36<28:30:27, 24.74s/it] {'loss': 0.0005, 'grad_norm': 0.4936149206540597, 'learning_rate': 9.680354643023797e-07, 'completion_length': 304.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6060799956321716, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.588222861289978, 'reward_std': 0.1593688651919365, 'kl': 0.011474609375, 'epoch': 0.03} 3%|▎ | 137/4286 [58:36<28:30:27, 24.74s/it] 3%|▎ | 138/4286 [59:02<28:49:02, 25.01s/it] {'loss': 0.0004, 'grad_norm': 0.5417090274692586, 'learning_rate': 9.67802146523565e-07, 'completion_length': 280.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.14644645899534225, 'kl': 0.01116943359375, 'epoch': 0.03} 3%|▎ | 138/4286 [59:02<28:49:02, 25.01s/it] 3%|▎ | 139/4286 [59:26<28:37:02, 24.84s/it] {'loss': 0.0005, 'grad_norm': 0.5528481859600894, 'learning_rate': 9.675688287447504e-07, 'completion_length': 268.50001525878906, 
'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 1.0, 'reward': 1.6473215818405151, 'reward_std': 0.10876645892858505, 'kl': 0.011749267578125, 'epoch': 0.03} 3%|▎ | 139/4286 [59:26<28:37:02, 24.84s/it]
3%|▎ | 140/4286 [59:51<28:27:05, 24.70s/it] {'loss': 0.0005, 'grad_norm': 0.7067219672955338, 'learning_rate': 9.673355109659355e-07, 'completion_length': 307.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.6404762268066406, 'rewards/format_reward': 1.0, 'reward': 1.6404762864112854, 'reward_std': 0.11395768634974957, 'kl': 0.0123291015625, 'epoch': 0.03}
3%|▎ | 141/4286 [1:00:15<28:16:58, 24.56s/it] {'loss': 0.0005, 'grad_norm': 0.7168228933247629, 'learning_rate': 9.671021931871208e-07, 'completion_length': 301.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.504464328289032, 'rewards/format_reward': 1.0, 'reward': 1.5044643878936768, 'reward_std': 0.06076502241194248, 'kl': 0.01220703125, 'epoch': 0.03}
3%|▎ | 142/4286 [1:00:41<28:49:55, 25.05s/it] {'loss': 0.0005, 'grad_norm': 0.7657632142833128, 'learning_rate': 9.668688754083062e-07, 'completion_length': 279.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.5297619253396988, 'rewards/format_reward': 1.0, 'reward': 1.5297620296478271, 'reward_std': 0.13276179507374763, 'kl': 0.012115478515625, 'epoch': 0.03}
3%|▎ | 143/4286 [1:01:07<29:03:01, 25.24s/it] {'loss': 0.0004, 'grad_norm': 0.4661107441077113, 'learning_rate': 9.666355576294913e-07, 'completion_length': 310.875, 'rewards/only_full_func_accuracy_reward': 0.5282738357782364, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5104168057441711, 'reward_std': 0.15208405628800392, 'kl': 0.011199951171875, 'epoch': 0.03}
3%|▎ | 144/4286 [1:01:31<28:35:53, 24.86s/it] {'loss': 0.0006, 'grad_norm': 1.1771044436081022, 'learning_rate': 9.664022398506766e-07, 'completion_length': 282.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6741071343421936, 'rewards/format_reward': 1.0, 'reward': 1.674107313156128, 'reward_std': 0.07933587580919266, 'kl': 0.014892578125, 'epoch': 0.03}
3%|▎ | 145/4286 [1:01:54<27:58:10, 24.32s/it] {'loss': 0.0006, 'grad_norm': 0.605873182821666, 'learning_rate': 9.66168922071862e-07, 'completion_length': 277.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5982143878936768, 'reward_std': 0.0883118100464344, 'kl': 0.014862060546875, 'epoch': 0.03}
3%|▎ | 146/4286 [1:02:19<28:06:16, 24.44s/it] {'loss': 0.0008, 'grad_norm': 2.0158229019700555, 'learning_rate': 9.65935604293047e-07, 'completion_length': 297.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5952381193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5773810148239136, 'reward_std': 0.2026984989643097, 'kl': 0.019805908203125, 'epoch': 0.03}
3%|▎ | 147/4286 [1:02:42<27:42:46, 24.10s/it] {'loss': 0.0005, 'grad_norm': 0.43547233932546525, 'learning_rate': 9.657022865142324e-07, 'completion_length': 287.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.5818452537059784, 'rewards/format_reward': 1.0, 'reward': 1.5818453431129456, 'reward_std': 0.07062526233494282, 'kl': 0.01373291015625, 'epoch': 0.03}
3%|▎ | 148/4286 [1:03:07<27:53:29, 24.27s/it] {'loss': 0.0005, 'grad_norm': 0.5970692983354255, 'learning_rate': 9.654689687354177e-07, 'completion_length': 277.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.465773805975914, 'rewards/format_reward': 1.0, 'reward': 1.4657739400863647, 'reward_std': 0.04053215216845274, 'kl': 0.013336181640625, 'epoch': 0.03}
3%|▎ | 149/4286 [1:03:30<27:43:37, 24.13s/it] {'loss': 0.0004, 'grad_norm': 0.6641702966617326, 'learning_rate': 9.652356509566028e-07, 'completion_length': 288.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5342262089252472, 'rewards/format_reward': 1.0, 'reward': 1.5342262983322144, 'reward_std': 0.1395193189382553, 'kl': 0.010406494140625, 'epoch': 0.03}
3%|▎ | 150/4286 [1:03:56<28:17:06, 24.62s/it] {'loss': 0.0004, 'grad_norm': 0.3759399871113997, 'learning_rate': 9.650023331777882e-07, 'completion_length': 313.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7247024178504944, 'rewards/format_reward': 1.0, 'reward': 1.724702537059784, 'reward_std': 0.11701322346925735, 'kl': 0.01092529296875, 'epoch': 0.03}
[2025-03-02 16:01:47,057] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
4%|▎ | 151/4286 [1:04:24<29:26:52, 25.64s/it] {'loss': 0.0004, 'grad_norm': 0.5503842246131861, 'learning_rate': 9.647690153989733e-07, 'completion_length': 291.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.5922619700431824, 'rewards/format_reward': 1.0, 'reward': 1.5922619700431824, 'reward_std': 0.06198650784790516, 'kl': 0.01092529296875, 'epoch': 0.04}
4%|▎ | 152/4286 [1:04:50<29:21:02, 25.56s/it] {'loss': 0.0005, 'grad_norm': 0.7484675788585458, 'learning_rate': 9.645356976201586e-07, 'completion_length': 287.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.508928582072258, 'rewards/format_reward': 1.0, 'reward': 1.5089287161827087, 'reward_std': 0.10947800800204277, 'kl': 0.01214599609375, 'epoch': 0.04}
[2025-03-02 16:02:37,011] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
4%|▎ | 153/4286 [1:05:14<29:00:18, 25.26s/it] {'loss': 0.0005, 'grad_norm': 0.44295185024977074, 'learning_rate': 9.64302379841344e-07, 'completion_length': 302.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5416667312383652, 'rewards/format_reward': 1.0, 'reward': 1.5416667461395264, 'reward_std': 0.13767902925610542, 'kl': 0.012176513671875, 'epoch': 0.04}
4%|▎ | 154/4286 [1:05:40<29:12:50, 25.45s/it] {'loss': 0.0005, 'grad_norm': 0.837339480913917, 'learning_rate': 9.64069062062529e-07, 'completion_length': 295.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.5863095223903656, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5684524774551392, 'reward_std': 0.11658431962132454, 'kl': 0.012359619140625, 'epoch': 0.04}
4%|▎ | 155/4286 [1:06:05<29:08:05, 25.39s/it] {'loss': 0.0005, 'grad_norm': 0.5455254026430507, 'learning_rate': 9.638357442837144e-07, 'completion_length': 300.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6043367981910706, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5864797234535217, 'reward_std': 0.09204084798693657, 'kl': 0.012786865234375, 'epoch': 0.04}
4%|▎ | 156/4286 [1:06:30<28:58:50, 25.26s/it] {'loss': 0.0006, 'grad_norm': 0.819240698866026, 'learning_rate': 9.636024265048997e-07, 'completion_length': 295.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.539583370089531, 'rewards/format_reward': 1.0, 'reward': 1.5395833849906921, 'reward_std': 0.10623375698924065, 'kl': 0.014404296875, 'epoch': 0.04}
4%|▎ | 157/4286 [1:06:53<28:18:06, 24.68s/it] {'loss': 0.0005, 'grad_norm': 0.5132524609141897, 'learning_rate': 9.633691087260848e-07, 'completion_length': 293.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7127977013587952, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.14166954904794693, 'kl': 0.011627197265625, 'epoch': 0.04}
4%|▎ | 158/4286 [1:07:18<28:18:52, 24.69s/it] {'loss': 0.0005, 'grad_norm': 0.5651414021861991, 'learning_rate': 9.631357909472701e-07, 'completion_length': 295.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 1.0, 'reward': 1.6116072535514832, 'reward_std': 0.11812522262334824, 'kl': 0.01171875, 'epoch': 0.04}
4%|▎ | 159/4286 [1:07:41<27:32:54, 24.03s/it] {'loss': 0.0006, 'grad_norm': 0.7166643727054162, 'learning_rate': 9.629024731684555e-07, 'completion_length': 274.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.5255953073501587, 'rewards/format_reward': 1.0, 'reward': 1.5255953669548035, 'reward_std': 0.08536100387573242, 'kl': 0.01416015625, 'epoch': 0.04}
4%|▎ | 160/4286 [1:08:04<27:17:26, 23.81s/it] {'loss': 0.0006, 'grad_norm': 0.4904068349506838, 'learning_rate': 9.626691553896406e-07, 'completion_length': 279.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 0.10466014221310616, 'kl': 0.01446533203125, 'epoch': 0.04}
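The stage3.py warnings interspersed above recommend flushing the allocator cache on all ranks at the same time. A minimal sketch of what that can look like in a DeepSpeed training loop is below; `model_engine`, `loader`, and the flush interval are placeholders for illustration, not taken from this run.

    # Hedged sketch: acting on the stage3.py warning above in a typical
    # DeepSpeed loop. get_accelerator() is DeepSpeed's device abstraction;
    # empty_cache() releases the PyTorch allocator's cached blocks.
    from deepspeed.accelerator import get_accelerator

    for step, batch in enumerate(loader):          # `loader` is a placeholder
        loss = model_engine(batch)                 # `model_engine` from deepspeed.initialize
        model_engine.backward(loss)
        model_engine.step()
        if step % 50 == 0:                         # arbitrary interval; tune to how often the warning fires
            get_accelerator().empty_cache()        # every rank flushes at the same point in the loop

Calling it on a fixed step schedule (rather than reactively on one rank) matches the warning's point that all ranks should flush together.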
4%|▍ | 161/4286 [1:08:29<27:33:40, 24.05s/it] {'loss': 0.0005, 'grad_norm': 0.4032392396550778, 'learning_rate': 9.62435837610826e-07, 'completion_length': 289.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.5550596117973328, 'rewards/format_reward': 1.0, 'reward': 1.5550596117973328, 'reward_std': 0.10804453119635582, 'kl': 0.012664794921875, 'epoch': 0.04}
4%|▍ | 162/4286 [1:08:53<27:49:29, 24.29s/it] {'loss': 0.0004, 'grad_norm': 0.317293270842499, 'learning_rate': 9.622025198320112e-07, 'completion_length': 286.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5854166746139526, 'rewards/format_reward': 1.0, 'reward': 1.5854167342185974, 'reward_std': 0.09654377773404121, 'kl': 0.01116943359375, 'epoch': 0.04}
4%|▍ | 163/4286 [1:09:17<27:27:27, 23.97s/it] {'loss': 0.0005, 'grad_norm': 2.455540604361925, 'learning_rate': 9.619692020531964e-07, 'completion_length': 268.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7098215222358704, 'rewards/format_reward': 1.0, 'reward': 1.7098215818405151, 'reward_std': 0.10306521132588387, 'kl': 0.01177978515625, 'epoch': 0.04}
4%|▍ | 164/4286 [1:09:41<27:38:13, 24.14s/it] {'loss': 0.0005, 'grad_norm': 0.6140242944078237, 'learning_rate': 9.617358842743817e-07, 'completion_length': 269.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.5877976715564728, 'rewards/format_reward': 1.0, 'reward': 1.58779776096344, 'reward_std': 0.10859859362244606, 'kl': 0.013702392578125, 'epoch': 0.04}
[2025-03-02 16:07:28,533] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
4%|▍ | 165/4286 [1:10:06<27:42:53, 24.21s/it] {'loss': 0.0005, 'grad_norm': 0.5337377826512261, 'learning_rate': 9.61502566495567e-07, 'completion_length': 278.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 1.0, 'reward': 1.602678656578064, 'reward_std': 0.08005733601748943, 'kl': 0.012847900390625, 'epoch': 0.04}
4%|▍ | 166/4286 [1:10:28<27:11:15, 23.76s/it] {'loss': 0.0006, 'grad_norm': 0.3733755356099567, 'learning_rate': 9.612692487167521e-07, 'completion_length': 260.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6431547999382019, 'rewards/format_reward': 1.0, 'reward': 1.643154799938202, 'reward_std': 0.05740878079086542, 'kl': 0.014739990234375, 'epoch': 0.04}
[2025-03-02 16:08:15,882] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
4%|▍ | 167/4286 [1:10:53<27:29:22, 24.03s/it] {'loss': 0.0005, 'grad_norm': 0.9223215460494206, 'learning_rate': 9.610359309379375e-07, 'completion_length': 285.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6696429252624512, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.1449766829609871, 'kl': 0.012542724609375, 'epoch': 0.04}
4%|▍ | 168/4286 [1:11:16<27:06:05, 23.69s/it] {'loss': 0.0005, 'grad_norm': 0.4128301089654148, 'learning_rate': 9.608026131591228e-07, 'completion_length': 280.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.6041667461395264, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.04602410038933158, 'kl': 0.01336669921875, 'epoch': 0.04}
4%|▍ | 169/4286 [1:11:42<27:49:47, 24.34s/it] {'loss': 0.0006, 'grad_norm': 0.596488186397391, 'learning_rate': 9.60569295380308e-07, 'completion_length': 303.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.621131032705307, 'rewards/format_reward': 1.0, 'reward': 1.6211310029029846, 'reward_std': 0.0759124867618084, 'kl': 0.014068603515625, 'epoch': 0.04}
4%|▍ | 170/4286 [1:12:06<27:38:10, 24.17s/it] {'loss': 0.0005, 'grad_norm': 7.7448627171508875, 'learning_rate': 9.603359776014932e-07, 'completion_length': 292.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.5970238447189331, 'rewards/format_reward': 1.0, 'reward': 1.5970239043235779, 'reward_std': 0.1972348764538765, 'kl': 0.013275146484375, 'epoch': 0.04}
4%|▍ | 171/4286 [1:12:30<27:43:46, 24.26s/it] {'loss': 0.0005, 'grad_norm': 0.31596606863284177, 'learning_rate': 9.601026598226786e-07, 'completion_length': 296.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6488095223903656, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.07327024638652802, 'kl': 0.012359619140625, 'epoch': 0.04}
4%|▍ | 172/4286 [1:12:54<27:44:55, 24.28s/it] {'loss': 0.0005, 'grad_norm': 0.449304701528456, 'learning_rate': 9.598693420438637e-07, 'completion_length': 289.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.5848214626312256, 'rewards/format_reward': 1.0, 'reward': 1.5848215818405151, 'reward_std': 0.13853275403380394, 'kl': 0.0133056640625, 'epoch': 0.04}
4%|▍ | 173/4286 [1:13:18<27:30:12, 24.07s/it] {'loss': 0.0005, 'grad_norm': 1.0587016226504313, 'learning_rate': 9.59636024265049e-07, 'completion_length': 284.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6369048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.15616286545991898, 'kl': 0.013031005859375, 'epoch': 0.04}
4%|▍ | 174/4286 [1:13:43<27:44:48, 24.29s/it] {'loss': 0.0006, 'grad_norm': 1.476459508486348, 'learning_rate': 9.594027064862341e-07, 'completion_length': 298.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.5788690894842148, 'rewards/format_reward': 1.0, 'reward': 1.5788692235946655, 'reward_std': 0.07876221090555191, 'kl': 0.014312744140625, 'epoch': 0.04}
4%|▍ | 175/4286 [1:14:07<27:47:13, 24.33s/it] {'loss': 0.0007, 'grad_norm': 1.346934828024844, 'learning_rate': 9.591693887074195e-07, 'completion_length': 289.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5699405372142792, 'rewards/format_reward': 1.0, 'reward': 1.5699405670166016, 'reward_std': 0.09534517303109169, 'kl': 0.01763916015625, 'epoch': 0.04}
4%|▍ | 176/4286 [1:14:30<27:12:30, 23.83s/it] {'loss': 0.0006, 'grad_norm': 4.522503819055147, 'learning_rate': 9.589360709286048e-07, 'completion_length': 284.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.8351190984249115, 'rewards/format_reward': 1.0, 'reward': 1.8351191282272339, 'reward_std': 0.08444123342633247, 'kl': 0.014373779296875, 'epoch': 0.04}
4%|▍ | 177/4286 [1:14:55<27:33:17, 24.14s/it] {'loss': 0.0006, 'grad_norm': 1.0643623857541764, 'learning_rate': 9.5870275314979e-07, 'completion_length': 282.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.6434524357318878, 'rewards/format_reward': 1.0, 'reward': 1.643452525138855, 'reward_std': 0.09259899333119392, 'kl': 0.01409912109375, 'epoch': 0.04}
4%|▍ | 178/4286 [1:15:18<27:18:35, 23.93s/it] {'loss': 0.0006, 'grad_norm': 1.6252367912719, 'learning_rate': 9.584694353709752e-07, 'completion_length': 277.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.0948852114379406, 'kl': 0.014678955078125, 'epoch': 0.04}
4%|▍ | 179/4286 [1:15:43<27:28:46, 24.09s/it] {'loss': 0.0006, 'grad_norm': 0.5750398229956638, 'learning_rate': 9.582361175921606e-07, 'completion_length': 297.5, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6919643878936768, 'reward_std': 0.1091451346874237, 'kl': 0.014251708984375, 'epoch': 0.04}
4%|▍ | 180/4286 [1:16:05<26:51:45, 23.55s/it] {'loss': 0.0006, 'grad_norm': 0.723686630060121, 'learning_rate': 9.580027998133457e-07, 'completion_length': 286.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 1.0, 'reward': 1.5907739400863647, 'reward_std': 0.11824015155434608, 'kl': 0.014678955078125, 'epoch': 0.04}
4%|▍ | 181/4286 [1:16:29<26:54:48, 23.60s/it] {'loss': 0.0006, 'grad_norm': 1.505107643149347, 'learning_rate': 9.57769482034531e-07, 'completion_length': 281.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.733631044626236, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.13031133264303207, 'kl': 0.01483154296875, 'epoch': 0.04}
4%|▍ | 182/4286 [1:16:53<27:12:56, 23.87s/it] {'loss': 0.0005, 'grad_norm': 0.5978871773755116, 'learning_rate': 9.575361642557163e-07, 'completion_length': 295.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7127976417541504, 'rewards/format_reward': 1.0, 'reward': 1.7127978205680847, 'reward_std': 0.12463568150997162, 'kl': 0.012451171875, 'epoch': 0.04}
4%|▍ | 183/4286 [1:17:18<27:24:18, 24.05s/it] {'loss': 0.0005, 'grad_norm': 0.40409223884130363, 'learning_rate': 9.573028464769014e-07, 'completion_length': 284.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.07716727443039417, 'kl': 0.013427734375, 'epoch': 0.04}
[2025-03-02 16:15:04,825] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
4%|▍ | 184/4286 [1:17:42<27:30:59, 24.15s/it] {'loss': 0.0006, 'grad_norm': 0.4408385948294414, 'learning_rate': 9.570695286980868e-07, 'completion_length': 263.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.736607164144516, 'rewards/format_reward': 1.0, 'reward': 1.7366072535514832, 'reward_std': 0.07421152107417583, 'kl': 0.015838623046875, 'epoch': 0.04}
4%|▍ | 185/4286 [1:18:06<27:28:06, 24.11s/it] {'loss': 0.0007, 'grad_norm': 0.5100755070534629, 'learning_rate': 9.56836210919272e-07, 'completion_length': 291.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.13232120126485825, 'kl': 0.0185546875, 'epoch': 0.04}
[2025-03-02 16:15:54,276] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
4%|▍ | 186/4286 [1:18:31<27:54:33, 24.51s/it] {'loss': 0.0005, 'grad_norm': 0.378442414570804, 'learning_rate': 9.566028931404572e-07, 'completion_length': 305.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.08759675174951553, 'kl': 0.013641357421875, 'epoch': 0.04}
4%|▍ | 187/4286 [1:18:58<28:42:29, 25.21s/it] {'loss': 0.0007, 'grad_norm': 0.6224745658456908, 'learning_rate': 9.563695753616425e-07, 'completion_length': 302.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5729166865348816, 'rewards/format_reward': 1.0, 'reward': 1.5729167461395264, 'reward_std': 0.14722763001918793, 'kl': 0.01708984375, 'epoch': 0.04}
[2025-03-02 16:16:44,780] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
4%|▍ | 188/4286 [1:19:22<28:09:50, 24.74s/it] {'loss': 0.0007, 'grad_norm': 0.2511110196889832, 'learning_rate': 9.561362575828279e-07, 'completion_length': 260.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.04685881361365318, 'kl': 0.0164794921875, 'epoch': 0.04}
[2025-03-02 16:17:09,572] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
4%|▍ | 189/4286 [1:19:47<28:10:26, 24.76s/it] {'loss': 0.0006, 'grad_norm': 1.4499626237440344, 'learning_rate': 9.55902939804013e-07, 'completion_length': 287.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6830357015132904, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.09740299358963966, 'kl': 0.0147705078125, 'epoch': 0.04}
4%|▍ | 190/4286 [1:20:12<28:22:03, 24.93s/it] {'loss': 0.0005, 'grad_norm': 0.5247021726348803, 'learning_rate': 9.556696220251983e-07, 'completion_length': 296.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.6666667461395264, 'reward_std': 0.11768773570656776, 'kl': 0.01263427734375, 'epoch': 0.04}
[2025-03-02 16:18:01,178] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
4%|▍ | 191/4286 [1:20:38<28:48:53, 25.33s/it] {'loss': 0.0006, 'grad_norm': 0.3771513953780711, 'learning_rate': 9.554363042463836e-07, 'completion_length': 311.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.6738095581531525, 'rewards/format_reward': 1.0, 'reward': 1.6738095879554749, 'reward_std': 0.034523806534707546, 'kl': 0.013916015625, 'epoch': 0.04}
4%|▍ | 192/4286 [1:21:02<28:17:34, 24.88s/it] {'loss': 0.0006, 'grad_norm': 0.46452740586671315, 'learning_rate': 9.552029864675688e-07, 'completion_length': 277.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 1.0, 'reward': 1.7202382683753967, 'reward_std': 0.08466817997395992, 'kl': 0.0155029296875, 'epoch': 0.04}
5%|▍ | 193/4286 [1:21:24<27:26:50, 24.14s/it] {'loss': 0.0007, 'grad_norm': 0.3476309042401373, 'learning_rate': 9.54969668688754e-07, 'completion_length': 285.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6800595223903656, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.04053214658051729, 'kl': 0.01751708984375, 'epoch': 0.05}
5%|▍ | 194/4286 [1:21:48<27:17:49, 24.02s/it] {'loss': 0.0006, 'grad_norm': 0.6026396504201689, 'learning_rate': 9.547363509099394e-07, 'completion_length': 286.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187501192092896, 'reward_std': 0.08404048904776573, 'kl': 0.016265869140625, 'epoch': 0.05}
5%|▍ | 195/4286 [1:22:13<27:30:31, 24.21s/it] {'loss': 0.0007, 'grad_norm': 0.4677623267256978, 'learning_rate': 9.545030331311245e-07, 'completion_length': 288.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.566964328289032, 'rewards/format_reward': 1.0, 'reward': 1.566964328289032, 'reward_std': 0.0917475763708353, 'kl': 0.01715087890625, 'epoch': 0.05}
5%|▍ | 196/4286 [1:22:36<27:13:15, 23.96s/it] {'loss': 0.0008, 'grad_norm': 0.681052977163319, 'learning_rate': 9.542697153523099e-07, 'completion_length': 282.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.625, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.13268541172146797, 'kl': 0.02093505859375, 'epoch': 0.05}
[2025-03-02 16:20:23,038] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
5%|▍ | 197/4286 [1:23:00<27:10:46, 23.93s/it] {'loss': 0.0007, 'grad_norm': 1.3757088769089625, 'learning_rate': 9.54036397573495e-07, 'completion_length': 285.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7276785969734192, 'rewards/format_reward': 1.0, 'reward': 1.7276787161827087, 'reward_std': 0.10263696312904358, 'kl': 0.01806640625, 'epoch': 0.05}
5%|▍ | 198/4286 [1:23:24<27:03:23, 23.83s/it] {'loss': 0.0005, 'grad_norm': 0.32231156533346156, 'learning_rate': 9.538030797946803e-07, 'completion_length': 294.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.6592262089252472, 'rewards/format_reward': 1.0, 'reward': 1.6592262387275696, 'reward_std': 0.06781976018100977, 'kl': 0.012847900390625, 'epoch': 0.05}
5%|▍ | 199/4286 [1:23:49<27:29:37, 24.22s/it] {'loss': 0.0006, 'grad_norm': 0.44334135348836484, 'learning_rate': 9.535697620158656e-07, 'completion_length': 306.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6497024595737457, 'rewards/format_reward': 1.0, 'reward': 1.6497024297714233, 'reward_std': 0.06794260442256927, 'kl': 0.01568603515625, 'epoch': 0.05}
5%|▍ | 200/4286 [1:24:14<27:45:55, 24.46s/it] {'loss': 0.0007, 'grad_norm': 0.31101145783638956, 'learning_rate': 9.533364442370509e-07, 'completion_length': 307.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.6577381789684296, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.0416666641831398, 'kl': 0.016448974609375, 'epoch': 0.05}
5%|▍ | 201/4286 [1:29:53<134:46:56, 118.78s/it] {'loss': 0.0006, 'grad_norm': 1.5782247448109579, 'learning_rate': 9.531031264582361e-07, 'completion_length': 274.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810148239136, 'reward_std': 0.06902234628796577, 'kl': 0.015533447265625, 'epoch': 0.05}
5%|▍ | 202/4286 [1:30:18<102:46:52, 90.60s/it] {'loss': 0.0006, 'grad_norm': 1.0530839470964402, 'learning_rate': 9.528698086794213e-07, 'completion_length': 304.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.6547619700431824, 'rewards/format_reward': 1.0, 'reward': 1.6547620296478271, 'reward_std': 0.0720495954155922, 'kl': 0.015167236328125, 'epoch': 0.05}
5%|▍ | 203/4286 [1:30:42<80:20:17, 70.83s/it] {'loss': 0.0006, 'grad_norm': 0.49063271248584445, 'learning_rate': 9.526364909006066e-07, 'completion_length': 298.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.1101190485060215, 'kl': 0.015350341796875, 'epoch': 0.05}
5%|▍ | 204/4286 [1:31:05<64:05:58, 56.53s/it] {'loss': 0.0007, 'grad_norm': 0.5573648321177249, 'learning_rate': 9.524031731217919e-07, 'completion_length': 281.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.07327024638652802, 'kl': 0.0169677734375, 'epoch': 0.05}
5%|▍ | 205/4286 [1:31:29<52:55:40, 46.69s/it] {'loss': 0.0006, 'grad_norm': 0.4355745183415424, 'learning_rate': 9.521698553429771e-07, 'completion_length': 284.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7366071939468384, 'rewards/format_reward': 1.0, 'reward': 1.7366072535514832, 'reward_std': 0.026785715483129025, 'kl': 0.014129638671875, 'epoch': 0.05}
5%|▍ | 206/4286 [1:31:54<45:36:02, 40.24s/it] {'loss': 0.0005, 'grad_norm': 0.3810335770253886, 'learning_rate': 9.519365375641624e-07, 'completion_length': 321.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7559524476528168, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.07142857648432255, 'kl': 0.01336669921875, 'epoch': 0.05}
5%|▍ | 207/4286 [1:32:20<40:36:41, 35.84s/it] {'loss': 0.0005, 'grad_norm': 0.42382711650406, 'learning_rate': 9.517032197853476e-07, 'completion_length': 304.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7959184348583221, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7780612707138062, 'reward_std': 0.12244898080825806, 'kl': 0.013458251953125, 'epoch': 0.05}
5%|▍ | 208/4286 [1:32:44<36:38:12, 32.34s/it] {'loss': 0.0007, 'grad_norm': 0.5374525864333888, 'learning_rate': 9.514699020065328e-07, 'completion_length': 278.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.8145833909511566, 'rewards/format_reward': 1.0, 'reward': 1.814583420753479, 'reward_std': 0.09530104324221611, 'kl': 0.01708984375, 'epoch': 0.05}
5%|▍ | 209/4286 [1:33:10<34:31:00, 30.48s/it] {'loss': 0.0006, 'grad_norm': 0.747566255452699, 'learning_rate': 9.512365842277182e-07, 'completion_length': 312.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6294643878936768, 'reward_std': 0.17183031141757965, 'kl': 0.015899658203125, 'epoch': 0.05}
5%|▍ | 210/4286 [1:33:38<33:32:02, 29.62s/it] {'loss': 0.0007, 'grad_norm': 0.8924489067586779, 'learning_rate': 9.510032664489034e-07, 'completion_length': 315.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 1.0, 'reward': 1.5937500596046448, 'reward_std': 0.09232765063643456, 'kl': 0.01763916015625, 'epoch': 0.05}
5%|▍ | 211/4286 [1:34:03<32:06:18, 28.36s/it] {'loss': 0.0006, 'grad_norm': 0.70126378164327, 'learning_rate': 9.507699486700886e-07, 'completion_length': 319.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.5178571939468384, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.09936580061912537, 'kl': 0.013946533203125, 'epoch': 0.05}
5%|▍ | 212/4286 [1:34:29<31:06:03, 27.48s/it] {'loss': 0.0006, 'grad_norm': 0.5319347090458596, 'learning_rate': 9.505366308912739e-07, 'completion_length': 305.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.627976268529892, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6101191639900208, 'reward_std': 0.21738936007022858, 'kl': 0.016082763671875, 'epoch': 0.05}
5%|▍ | 213/4286 [1:34:53<29:56:05, 26.46s/it] {'loss': 0.0007, 'grad_norm': 0.5004360531682583, 'learning_rate': 9.503033131124592e-07, 'completion_length': 279.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.522916704416275, 'rewards/format_reward': 1.0, 'reward': 1.5229167938232422, 'reward_std': 0.09521211124956608, 'kl': 0.01788330078125, 'epoch': 0.05}
5%|▍ | 214/4286 [1:35:19<29:58:32, 26.50s/it] {'loss': 0.0006, 'grad_norm': 0.6688998835695282, 'learning_rate': 9.500699953336444e-07, 'completion_length': 328.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.576828271150589, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5589711666107178, 'reward_std': 0.19878699630498886, 'kl': 0.0155029296875, 'epoch': 0.05}
5%|▌ | 215/4286 [1:35:43<28:58:26, 25.62s/it] {'loss': 0.0006, 'grad_norm': 0.2102844551205767, 'learning_rate': 9.498366775548296e-07, 'completion_length': 274.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.03411935083568096, 'kl': 0.01617431640625, 'epoch': 0.05}
5%|▌ | 216/4286 [1:36:07<28:17:26, 25.02s/it] {'loss': 0.0007, 'grad_norm': 0.49081751837825055, 'learning_rate': 9.496033597760149e-07, 'completion_length': 303.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803572535514832, 'reward_std': 0.12172314897179604, 'kl': 0.017333984375, 'epoch': 0.05}
5%|▌ | 217/4286 [1:36:31<28:14:27, 24.99s/it] {'loss': 0.0007, 'grad_norm': 0.5984797074275124, 'learning_rate': 9.493700419972002e-07, 'completion_length': 300.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.4642857611179352, 'rewards/format_reward': 1.0, 'reward': 1.4642858505249023, 'reward_std': 0.08919291384518147, 'kl': 0.01812744140625, 'epoch': 0.05}
5%|▌ | 218/4286 [1:36:59<28:56:10, 25.61s/it] {'loss': 0.0006, 'grad_norm': 0.5569566494448792, 'learning_rate': 9.491367242183854e-07, 'completion_length': 315.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7336310744285583, 'reward_std': 0.13073870167136192, 'kl': 0.0162353515625, 'epoch': 0.05}
5%|▌ | 219/4286 [1:37:22<28:15:50, 25.02s/it] {'loss': 0.0008, 'grad_norm': 0.7187290307211056, 'learning_rate': 9.489034064395707e-07, 'completion_length': 294.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6622024476528168, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.14966127276420593, 'kl': 0.02008056640625, 'epoch': 0.05}
[2025-03-02 16:35:11,020] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
5%|▌ | 220/4286 [1:37:48<28:33:43, 25.29s/it] {'loss': 0.0007, 'grad_norm': 0.6563844213385529, 'learning_rate': 9.486700886607559e-07, 'completion_length': 298.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.1262023113667965, 'kl': 0.0179443359375, 'epoch': 0.05}
5%|▌ | 221/4286 [1:38:13<28:15:43, 25.03s/it] {'loss': 0.0006, 'grad_norm': 0.429791524460604, 'learning_rate': 9.484367708819412e-07, 'completion_length': 306.875, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562500596046448, 'reward_std': 0.04362921789288521, 'kl': 0.0146484375, 'epoch': 0.05}
5%|▌ | 222/4286 [1:38:38<28:23:54, 25.16s/it] {'loss': 0.0006, 'grad_norm': 0.9342955723467564, 'learning_rate': 9.482034531031265e-07, 'completion_length': 304.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 1.0, 'reward': 1.65029776096344, 'reward_std': 0.13504307717084885, 'kl': 0.014617919921875, 'epoch': 0.05}
[2025-03-02 16:36:26,880] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
5%|▌ | 223/4286 [1:39:04<28:40:18, 25.40s/it] {'loss': 0.0007, 'grad_norm': 2.0556883415804394, 'learning_rate': 9.479701353243117e-07, 'completion_length': 276.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.5925595760345459, 'rewards/format_reward': 1.0, 'reward': 1.592559576034546, 'reward_std': 0.11693192273378372, 'kl': 0.0181884765625, 'epoch': 0.05}
5%|▌ | 224/4286 [1:39:29<28:31:34, 25.28s/it] {'loss': 0.0007, 'grad_norm': 1.3937068958182308, 'learning_rate': 9.477368175454969e-07, 'completion_length': 314.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.5357142686843872, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.08173839934170246, 'kl': 0.0166015625, 'epoch': 0.05}
[2025-03-02 16:37:16,264] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
5%|▌ | 225/4286 [1:39:53<28:13:00, 25.01s/it] {'loss': 0.0006, 'grad_norm': 0.29808296428292663, 'learning_rate': 9.475034997666822e-07, 'completion_length': 287.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6770834028720856, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.04488959535956383, 'kl': 0.015533447265625, 'epoch': 0.05}
[2025-03-02 16:37:40,744] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
5%|▌ | 226/4286 [1:40:18<28:01:47, 24.85s/it] {'loss': 0.0006, 'grad_norm': 0.49317400850062454, 'learning_rate': 9.472701819878675e-07, 'completion_length': 297.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6830357909202576, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.08931876718997955, 'kl': 0.01519775390625, 'epoch': 0.05}
5%|▌ | 227/4286 [1:40:43<28:13:05, 25.03s/it] {'loss': 0.0006, 'grad_norm': 0.6166006238057021, 'learning_rate': 9.470368642090527e-07, 'completion_length': 293.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6919642984867096, 'rewards/format_reward': 1.0, 'reward': 1.6919643878936768, 'reward_std': 0.0626977551728487, 'kl': 0.016082763671875, 'epoch': 0.05}
5%|▌ | 228/4286 [1:41:11<28:59:49, 25.72s/it] {'loss': 0.0006, 'grad_norm': 0.7399154873236206, 'learning_rate': 9.468035464302379e-07, 'completion_length': 314.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6036706864833832, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5679564476013184, 'reward_std': 0.1325879842042923, 'kl': 0.015899658203125, 'epoch': 0.05}
5%|▌ | 229/4286 [1:41:34<28:17:50, 25.11s/it] {'loss': 0.0007, 'grad_norm': 0.899628414755704, 'learning_rate': 9.465702286514233e-07, 'completion_length': 278.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5529762208461761, 'rewards/format_reward': 1.0, 'reward': 1.5529762506484985, 'reward_std': 0.08663513325154781, 'kl': 0.016754150390625, 'epoch': 0.05}
5%|▌ | 230/4286 [1:41:59<28:08:42, 24.98s/it] {'loss': 0.0007, 'grad_norm': 1.0545251255291888, 'learning_rate': 9.463369108726085e-07, 'completion_length': 294.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.586309552192688, 'rewards/format_reward': 1.0, 'reward': 1.5863096714019775, 'reward_std': 0.1069505475461483, 'kl': 0.01702880859375, 'epoch': 0.05}
5%|▌ | 231/4286 [1:42:24<28:11:25, 25.03s/it] {'loss': 0.0008, 'grad_norm': 0.6470535830350487, 'learning_rate': 9.461035930937937e-07, 'completion_length': 310.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6636904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6636906266212463, 'reward_std': 0.09045328944921494, 'kl': 0.01873779296875, 'epoch': 0.05}
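Throughout this log the total 'reward' tracks the sum of the two logged components, 'rewards/only_full_func_accuracy_reward' plus 'rewards/format_reward', up to float32 rounding. A quick check against the step-231 entry above (values copied from the log; no other assumptions):

    # Hedged sketch: verify that the logged 'reward' is the sum of the two
    # reward columns for step 231. Variable names mirror the log keys only.
    acc = 0.6636904776096344   # rewards/only_full_func_accuracy_reward
    fmt = 1.0                  # rewards/format_reward
    print(acc + fmt)           # ~1.66369..., matching the logged 'reward' of
                               # 1.6636906266212463 up to fp32 rounding

The same relation holds for the steps where 'rewards/format_reward' dips below 1.0 (e.g. 0.9821428656578064), which is why 'reward' drops by the same amount there.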
5%|▌ | 232/4286 [1:42:50<28:30:18, 25.31s/it] {'loss': 0.0006, 'grad_norm': 0.5231922630906729, 'learning_rate': 9.45870275314979e-07, 'completion_length': 324.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.5386905521154404, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5208334922790527, 'reward_std': 0.18608662486076355, 'kl': 0.01556396484375, 'epoch': 0.05}
5%|▌ | 233/4286 [1:43:16<28:49:18, 25.60s/it] {'loss': 0.0007, 'grad_norm': 0.7877874342692456, 'learning_rate': 9.456369575361642e-07, 'completion_length': 298.375, 'rewards/only_full_func_accuracy_reward': 0.60833340883255, 'rewards/format_reward': 1.0, 'reward': 1.6083334684371948, 'reward_std': 0.09770475327968597, 'kl': 0.01715087890625, 'epoch': 0.05}
5%|▌ | 234/4286 [1:43:43<29:17:29, 26.02s/it] {'loss': 0.0006, 'grad_norm': 0.5213912124625685, 'learning_rate': 9.454036397573495e-07, 'completion_length': 317.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7553572058677673, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7375001311302185, 'reward_std': 0.10609135404229164, 'kl': 0.014404296875, 'epoch': 0.05}
5%|▌ | 235/4286 [1:44:10<29:24:19, 26.13s/it] {'loss': 0.0007, 'grad_norm': 0.4368851363524466, 'learning_rate': 9.451703219785348e-07, 'completion_length': 309.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5907738208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5729168057441711, 'reward_std': 0.12000151351094246, 'kl': 0.016815185546875, 'epoch': 0.05}
[2025-03-02 16:42:01,482] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
6%|▌ | 236/4286 [1:44:39<30:18:16, 26.94s/it] {'loss': 0.0007, 'grad_norm': 1.2439835784153563, 'learning_rate': 9.4493700419972e-07, 'completion_length': 318.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5312500298023224, 'rewards/format_reward': 1.0, 'reward': 1.5312501788139343, 'reward_std': 0.13706324994564056, 'kl': 0.01666259765625, 'epoch': 0.06}
6%|▌ | 237/4286 [1:45:03<29:18:12, 26.05s/it] {'loss': 0.0007, 'grad_norm': 1.2020032671862488, 'learning_rate': 9.447036864209052e-07, 'completion_length': 300.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.721726268529892, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.07624644041061401, 'kl': 0.01824951171875, 'epoch': 0.06}
6%|▌ | 238/4286 [1:45:27<28:44:38, 25.56s/it] {'loss': 0.0006, 'grad_norm': 2.1429145259335094, 'learning_rate': 9.444703686420905e-07, 'completion_length': 296.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5680803954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5502232909202576, 'reward_std': 0.20004902780056, 'kl': 0.01605224609375, 'epoch': 0.06}
6%|▌ | 239/4286 [1:45:51<28:21:43, 25.23s/it] {'loss': 0.0008, 'grad_norm': 1.3113961923024995, 'learning_rate': 9.442370508632758e-07, 'completion_length': 295.2678756713867, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.09204822033643723, 'kl': 0.01898193359375, 'epoch': 0.06}
6%|▌ | 240/4286 [1:46:17<28:20:52, 25.22s/it] {'loss': 0.0006, 'grad_norm': 0.6735813639177701, 'learning_rate': 9.44003733084461e-07, 'completion_length': 280.375, 'rewards/only_full_func_accuracy_reward': 0.6140731871128082, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5962159633636475, 'reward_std': 0.14453162997961044, 'kl': 0.015472412109375, 'epoch': 0.06}
6%|▌ | 241/4286 [1:46:42<28:13:48, 25.12s/it] {'loss': 0.0008, 'grad_norm': 1.075683952522367, 'learning_rate': 9.437704153056462e-07, 'completion_length': 304.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.11200924590229988, 'kl': 0.02093505859375, 'epoch': 0.06}
6%|▌ | 242/4286 [1:47:06<28:00:08, 24.93s/it] {'loss': 0.0006, 'grad_norm': 0.35347118085211837, 'learning_rate': 9.435370975268316e-07, 'completion_length': 289.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7321429550647736, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.10919768176972866, 'kl': 0.016265869140625, 'epoch': 0.06}
[2025-03-02 16:44:53,467] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
6%|▌ | 243/4286 [1:47:31<27:52:07, 24.82s/it] {'loss': 0.0006, 'grad_norm': 0.8700495970584757, 'learning_rate': 9.433037797480168e-07, 'completion_length': 268.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.09959554672241211, 'kl': 0.015533447265625, 'epoch': 0.06}
6%|▌ | 244/4286 [1:47:57<28:28:51, 25.37s/it] {'loss': 0.0007, 'grad_norm': 0.33688912299565826, 'learning_rate': 9.43070461969202e-07, 'completion_length': 297.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6473215222358704, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6294644474983215, 'reward_std': 0.12544209510087967, 'kl': 0.01666259765625, 'epoch': 0.06}
6%|▌ | 245/4286 [1:48:23<28:42:14, 25.57s/it] {'loss': 0.0006, 'grad_norm': 0.45030461645188835, 'learning_rate': 9.428371441903873e-07, 'completion_length': 302.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7648809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.09045328572392464, 'kl': 0.015350341796875, 'epoch': 0.06}
6%|▌ | 246/4286 [1:48:48<28:16:01, 25.19s/it] {'loss': 0.0007, 'grad_norm': 0.4310546450798149, 'learning_rate': 9.426038264115726e-07, 'completion_length': 283.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.6324405074119568, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.08152471296489239, 'kl': 0.016265869140625, 'epoch': 0.06}
[2025-03-02 16:46:36,421] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
6%|▌ | 247/4286 [1:49:13<28:31:06, 25.42s/it] {'loss': 0.0008, 'grad_norm': 0.843138065034853, 'learning_rate': 9.423705086327578e-07, 'completion_length': 324.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592262983322144, 'reward_std': 0.08617449924349785, 'kl': 0.019287109375, 'epoch': 0.06}
[2025-03-02 16:47:03,720] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
6%|▌ | 248/4286 [1:49:41<29:08:39, 25.98s/it] {'loss': 0.0006, 'grad_norm': 1.0050589875312474, 'learning_rate': 9.42137190853943e-07, 'completion_length': 291.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6808036267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6629465222358704, 'reward_std': 0.14052648097276688, 'kl': 0.01617431640625, 'epoch': 0.06}
[2025-03-02 16:47:29,462] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
6%|▌ | 249/4286 [1:50:07<29:03:20, 25.91s/it] {'loss': 0.0008, 'grad_norm': 0.5001216080696763, 'learning_rate': 9.419038730751283e-07, 'completion_length': 315.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5491071939468384, 'rewards/format_reward': 1.0, 'reward': 1.5491072535514832, 'reward_std': 0.08163641765713692, 'kl': 0.01885986328125, 'epoch': 0.06}
6%|▌ | 250/4286 [1:50:32<28:50:21, 25.72s/it] {'loss': 0.0007, 'grad_norm': 0.426717804224375, 'learning_rate': 9.416705552963136e-07, 'completion_length': 282.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.12282984331250191, 'kl': 0.018035888671875, 'epoch': 0.06}
6%|▌ | 251/4286 [1:50:58<29:00:24, 25.88s/it] {'loss': 0.0007, 'grad_norm': 0.5598351231566676, 'learning_rate': 9.414372375174988e-07, 'completion_length': 323.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6205357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6205357909202576, 'reward_std': 0.09275537356734276, 'kl': 0.017333984375, 'epoch': 0.06}
6%|▌ | 252/4286 [1:51:24<29:01:12, 25.90s/it] {'loss': 0.0008, 'grad_norm': 0.8593893847069828, 'learning_rate': 9.412039197386841e-07, 'completion_length': 314.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.6547618806362152, 'rewards/format_reward': 1.0, 'reward': 1.654762089252472, 'reward_std': 0.12179679051041603, 'kl': 0.01995849609375, 'epoch': 0.06}
6%|▌ | 253/4286 [1:51:51<29:25:20, 26.26s/it] {'loss': 0.0007, 'grad_norm': 1.391420006749012, 'learning_rate': 9.409706019598693e-07, 'completion_length': 318.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6748512387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6569941639900208, 'reward_std': 0.20689845830202103, 'kl': 0.01788330078125, 'epoch': 0.06}
[2025-03-02 16:49:40,519] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
6%|▌ | 254/4286 [1:52:18<29:29:02, 26.33s/it] {'loss': 0.0007, 'grad_norm': 1.0243829114251908, 'learning_rate': 9.407372841810545e-07, 'completion_length': 301.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.12024993821978569, 'kl': 0.016387939453125, 'epoch': 0.06}
6%|▌ | 255/4286 [1:52:43<29:06:11, 25.99s/it] {'loss': 0.0008, 'grad_norm': 0.2779884400731537, 'learning_rate': 9.405039664022399e-07, 'completion_length': 297.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.06047770008444786, 'kl': 0.018798828125, 'epoch': 0.06}
[2025-03-02 16:50:31,378] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
6%|▌ | 256/4286 [1:53:08<28:58:48, 25.89s/it] {'loss': 0.0007, 'grad_norm': 0.4660596383842627, 'learning_rate': 9.402706486234251e-07, 'completion_length': 299.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.7202382683753967, 'reward_std': 0.06572292372584343, 'kl': 0.01751708984375, 'epoch': 0.06}
6%|▌ | 257/4286 [1:53:34<28:49:48, 25.76s/it] {'loss': 0.0007, 'grad_norm': 0.2571122948392364, 'learning_rate': 9.400373308446103e-07, 'completion_length': 289.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.03243744093924761, 'kl': 0.01678466796875, 'epoch': 0.06}
[2025-03-02 16:51:24,854] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
6%|▌ | 258/4286 [1:54:02<29:34:45, 26.44s/it] {'loss': 0.0007, 'grad_norm': 0.3942037270757796, 'learning_rate': 9.398040130657957e-07, 'completion_length': 332.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 1.0, 'reward': 1.623512089252472, 'reward_std': 0.10680188052356243, 'kl': 0.0169677734375, 'epoch': 0.06}
[2025-03-02 16:51:52,065] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
6%|▌ | 259/4286 [1:54:29<29:49:53, 26.67s/it] {'loss': 0.0006, 'grad_norm': 0.6952753322564922, 'learning_rate': 9.395706952869809e-07, 'completion_length': 323.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803571939468384, 'reward_std': 0.15271935611963272, 'kl': 0.01611328125, 'epoch': 0.06}
6%|▌ | 260/4286 [1:54:55<29:24:32, 26.30s/it] {'loss': 0.0007, 'grad_norm': 5.519378881974041, 'learning_rate': 9.393373775081661e-07, 'completion_length': 311.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.1300976611673832, 'kl': 0.01641845703125, 'epoch': 0.06}
[2025-03-02 16:52:42,654] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
6%|▌ | 261/4286 [1:55:20<29:01:10, 25.96s/it] {'loss': 0.0007, 'grad_norm': 0.5749329626530281, 'learning_rate': 9.391040597293513e-07, 'completion_length': 302.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6190476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.09112738817930222, 'kl': 0.01702880859375, 'epoch': 0.06}
[2025-03-02 16:53:07,887] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
6%|▌ | 262/4286 [1:55:45<28:46:13, 25.74s/it] {'loss': 0.0007, 'grad_norm': 0.4823705022808955, 'learning_rate': 9.388707419505366e-07, 'completion_length': 295.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 1.0, 'reward': 1.686012089252472, 'reward_std': 0.060979549773037434, 'kl': 0.0167236328125, 'epoch': 0.06}
6%|▌ | 263/4286 [1:56:11<28:49:36, 25.80s/it] {'loss': 0.0006, 'grad_norm': 1.0183495616642901, 'learning_rate': 9.386374241717219e-07, 'completion_length': 321.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.5877976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5699405670166016, 'reward_std': 0.12775015458464622, 'kl': 0.01605224609375, 'epoch': 0.06}
6%|▌ | 264/4286 [1:56:36<28:32:54, 25.55s/it] {'loss': 0.0006, 'grad_norm': 0.534189149226421, 'learning_rate': 9.384041063929071e-07, 'completion_length': 310.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.5580357611179352, 'rewards/format_reward': 1.0, 'reward': 1.5580358505249023, 'reward_std': 0.11772308871150017, 'kl': 0.0152587890625, 'epoch': 0.06}
6%|▌ | 265/4286 [1:57:01<28:18:36, 25.35s/it] {'loss': 0.0006, 'grad_norm': 0.33529468179246863, 'learning_rate': 9.381707886140924e-07, 'completion_length': 315.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8139881491661072, 'rewards/format_reward': 1.0, 'reward': 1.813988208770752, 'reward_std': 0.0329848313704133, 'kl': 0.013763427734375, 'epoch': 0.06}
6%|▌ | 266/4286 [1:57:26<28:11:02, 25.24s/it] {'loss': 0.0008, 'grad_norm': 1.0160928244577252, 'learning_rate': 9.379374708352776e-07, 'completion_length': 290.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.5163690745830536, 'rewards/format_reward': 1.0, 'reward': 1.516369104385376, 'reward_std': 0.05066636577248573, 'kl': 0.01934814453125, 'epoch': 0.06}
6%|▌ | 267/4286 [1:57:52<28:34:11, 25.59s/it] {'loss': 0.0008, 'grad_norm': 0.44719708838122546, 'learning_rate': 9.377041530564629e-07, 'completion_length': 318.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7857143878936768, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678572535514832, 'reward_std': 0.13211995363235474, 'kl': 0.01910400390625, 'epoch': 0.06}
6%|▋ | 268/4286 [1:58:20<29:16:24, 26.23s/it] {'loss': 0.0007, 'grad_norm': 0.37171285200999604, 'learning_rate': 9.374708352776482e-07, 'completion_length': 322.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.49523812532424927, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4773810505867004, 'reward_std': 0.11666667461395264, 'kl': 0.01666259765625, 'epoch': 0.06}
[2025-03-02 16:56:08,760] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
6%|▋ | 269/4286 [1:58:46<29:10:56, 26.15s/it] {'loss': 0.0007, 'grad_norm': 0.250906177314805, 'learning_rate': 9.372375174988334e-07, 'completion_length': 303.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.05654761753976345, 'kl': 0.01690673828125, 'epoch': 0.06}
6%|▋ | 270/4286 [1:59:11<28:47:39, 25.81s/it] {'loss': 0.0007, 'grad_norm': 0.5313471037542721, 'learning_rate': 9.370041997200186e-07, 'completion_length': 309.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.5595238208770752, 'rewards/format_reward': 1.0, 'reward': 1.5595239400863647, 'reward_std': 0.09955444559454918, 'kl': 0.017181396484375, 'epoch': 0.06}
6%|▋ | 271/4286 [1:59:37<28:57:51, 25.97s/it] {'loss': 0.0007, 'grad_norm': 0.43492347329971875, 'learning_rate': 9.367708819412039e-07, 'completion_length': 283.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.5803571790456772, 'rewards/format_reward': 1.0, 'reward': 1.5803572535514832, 'reward_std': 0.06664376892149448, 'kl': 0.016571044921875, 'epoch': 0.06}
[2025-03-02 16:57:28,525] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
6%|▋ | 272/4286 [2:00:06<29:46:21, 26.70s/it] {'loss': 0.0006, 'grad_norm': 0.8585327801999778, 'learning_rate': 9.365375641623892e-07, 'completion_length': 334.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7012649178504944, 'rewards/format_reward': 1.0, 'reward': 1.7012649774551392, 'reward_std': 0.04985118843615055, 'kl': 0.0155029296875, 'epoch': 0.06}
6%|▋ | 273/4286 [2:00:32<29:30:59, 26.48s/it] {'loss': 0.0006, 'grad_norm': 0.4739853193414486, 'learning_rate': 9.363042463835744e-07, 'completion_length': 315.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6488095223903656, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.07142856903374195, 'kl': 0.014984130859375, 'epoch': 0.06}
6%|▋ | 274/4286 [2:00:58<29:29:15, 26.46s/it] {'loss': 0.0006, 'grad_norm': 0.42153736708207773, 'learning_rate': 9.360709286047596e-07, 'completion_length': 317.625, 'rewards/only_full_func_accuracy_reward': 0.6645833849906921, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6288691759109497, 'reward_std': 0.1317561287432909, 'kl': 0.015838623046875, 'epoch': 0.06}
6%|▋ | 275/4286 [2:01:25<29:35:20, 26.56s/it] {'loss': 0.0007, 'grad_norm': 0.45678837321262317, 'learning_rate': 9.35837610825945e-07, 'completion_length': 310.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6274802088737488, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5917659401893616, 'reward_std': 0.12449607998132706, 'kl': 0.016387939453125, 'epoch': 0.06}
6%|▋ | 276/4286 [2:01:50<29:14:11, 26.25s/it] {'loss': 0.0006, 'grad_norm': 0.32303765209454355, 'learning_rate': 9.356042930471302e-07, 'completion_length': 291.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6636905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.08700072020292282, 'kl': 0.015380859375, 'epoch': 0.06}
[2025-03-02 16:59:40,535] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
6%|▋ | 277/4286 [2:02:18<29:35:26, 26.57s/it] {'loss': 0.0005, 'grad_norm': 0.7742868028778133, 'learning_rate': 9.353709752683154e-07, 'completion_length': 326.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6922619342803955, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6744048595428467, 'reward_std': 0.11772088706493378, 'kl': 0.01373291015625, 'epoch': 0.06}
6%|▋ | 278/4286 [2:02:42<28:55:39, 25.98s/it] {'loss': 0.0007, 'grad_norm': 0.6822086949213847, 'learning_rate': 9.351376574895007e-07, 'completion_length': 286.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.10097679868340492, 'kl': 0.01641845703125, 'epoch': 0.06}
7%|▋ | 279/4286 [2:03:06<28:03:27, 25.21s/it] {'loss': 0.0007, 'grad_norm': 0.4961190333133767, 'learning_rate': 9.34904339710686e-07, 'completion_length': 280.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.5312500298023224, 'rewards/format_reward': 1.0, 'reward': 1.5312500596046448, 'reward_std': 0.09367622062563896, 'kl': 0.01708984375, 'epoch': 0.07}
7%|▋ | 280/4286 [2:03:33<28:39:07, 25.75s/it] {'loss': 0.0007, 'grad_norm': 0.8167141705143927, 'learning_rate': 9.346710219318712e-07, 'completion_length': 336.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.5533820986747742, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5355249643325806, 'reward_std': 0.15713933110237122, 'kl': 0.01861572265625, 'epoch': 0.07}
7%|▋ | 281/4286 [2:04:00<29:01:34, 26.09s/it] {'loss': 0.0006, 'grad_norm': 0.9101322399301028, 'learning_rate': 9.344377041530565e-07, 'completion_length': 312.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.07397740334272385, 'kl': 0.01422119140625, 'epoch': 0.07}
7%|▋ | 282/4286 [2:04:25<28:54:49, 26.00s/it] {'loss': 0.0006, 'grad_norm': 1.1606649434142895, 'learning_rate': 9.342043863742417e-07, 'completion_length': 300.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262983322144, 'reward_std': 0.1408847514539957, 'kl': 0.01556396484375, 'epoch': 0.07}
7%|▋ | 283/4286 [2:04:50<28:28:56, 25.62s/it] {'loss':
0.0006, 'grad_norm': 0.9739068362222999, 'learning_rate': 9.339710685954269e-07, 'completion_length': 315.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.6532738208770752, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.11521172523498535, 'kl': 0.016143798828125, 'epoch': 0.07} 7%|▋ | 283/4286 [2:04:50<28:28:56, 25.62s/it] 7%|▋ | 284/4286 [2:05:16<28:45:02, 25.86s/it] {'loss': 0.0006, 'grad_norm': 0.8214487972563648, 'learning_rate': 9.337377508166122e-07, 'completion_length': 327.14288330078125, 'rewards/only_full_func_accuracy_reward': 0.6413690447807312, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.09482263028621674, 'kl': 0.01556396484375, 'epoch': 0.07} 7%|▋ | 284/4286 [2:05:16<28:45:02, 25.86s/it] 7%|▋ | 285/4286 [2:05:44<29:15:13, 26.32s/it] {'loss': 0.0006, 'grad_norm': 0.7101202331460926, 'learning_rate': 9.335044330377975e-07, 'completion_length': 316.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.7767857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7767857909202576, 'reward_std': 0.12297770753502846, 'kl': 0.015411376953125, 'epoch': 0.07} 7%|▋ | 285/4286 [2:05:44<29:15:13, 26.32s/it] 7%|▋ | 286/4286 [2:06:08<28:26:39, 25.60s/it] {'loss': 0.0007, 'grad_norm': 1.4409509453277953, 'learning_rate': 9.332711152589827e-07, 'completion_length': 316.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7514881193637848, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.08165043871849775, 'kl': 0.01788330078125, 'epoch': 0.07} 7%|▋ | 286/4286 [2:06:08<28:26:39, 25.60s/it] 7%|▋ | 287/4286 [2:06:31<27:44:41, 24.98s/it] {'loss': 0.0006, 'grad_norm': 0.46295467836519333, 'learning_rate': 9.330377974801679e-07, 'completion_length': 286.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7247024178504944, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.10264620557427406, 'kl': 0.014617919921875, 'epoch': 0.07} 7%|▋ | 287/4286 [2:06:31<27:44:41, 24.98s/it][2025-03-02 17:04:20,951] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 7%|▋ | 288/4286 [2:06:58<28:19:24, 25.50s/it] {'loss': 0.0009, 'grad_norm': 0.9773511272892327, 'learning_rate': 9.328044797013533e-07, 'completion_length': 319.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5163690894842148, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4806548953056335, 'reward_std': 0.12216099724173546, 'kl': 0.02166748046875, 'epoch': 0.07} 7%|▋ | 288/4286 [2:06:58<28:19:24, 25.50s/it][2025-03-02 17:04:49,504] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
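The stage3 warnings above all point at the same remedy: a cache flush issued by every rank at the same point in the loop. A minimal sketch of that remedy, assuming a DeepSpeed engine; engine, train_loader, and the flush interval are placeholders, not code from this run:

    from deepspeed.accelerator import get_accelerator

    FLUSH_EVERY = 50  # arbitrary interval; tune to how often the warning fires

    for step, batch in enumerate(train_loader):
        loss = engine(batch)   # forward pass through the DeepSpeed engine
        engine.backward(loss)  # DeepSpeed-managed backward
        engine.step()          # optimizer step (where stage3.py:2134 logs the warning)
        if step % FLUSH_EVERY == 0:
            # every rank reaches this line at the same step, so caches are
            # flushed together rather than drifting under memory pressure
            get_accelerator().empty_cache()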
289 | 2:07:27 | 26.42 | 0.0007 | 0.4395238930176313 | 9.325711619225385e-07 | 319.14288330078125 | 0.492559552192688 | 0.9821428656578064 | 1.474702537059784 | 0.07066834159195423 | 0.01654052734375 | 0.07
290 | 2:07:53 | 26.55 | 0.0007 | 0.29542175480787897 | 9.323378441437237e-07 | 317.9107208251953 | 0.540178656578064 | 0.9821428656578064 | 1.5223215818405151 | 0.06540660560131073 | 0.0166015625 | 0.07
291 | 2:08:21 | 26.88 | 0.0008 | 1.4589625955964376 | 9.32104526364909e-07 | 302.6785888671875 | 0.5611111223697662 | 0.9642857313156128 | 1.5253969430923462 | 0.1755351424217224 | 0.01947021484375 | 0.07
292 | 2:08:47 | 26.69 | 0.0007 | 0.8458182925251155 | 9.318712085860943e-07 | 293.1428680419922 | 0.6693452596664429 | 1.0 | 1.6693453788757324 | 0.13186714053153992 | 0.01629638671875 | 0.07
293 | 2:09:11 | 25.91 | 0.0007 | 0.6464468377603085 | 9.316378908072795e-07 | 284.14288330078125 | 0.5669643431901932 | 1.0 | 1.566964328289032 | 0.07029405608773232 | 0.01708984375 | 0.07
294 | 2:09:37 | 25.90 | 0.0007 | 0.4839290671626241 | 9.314045730284647e-07 | 318.0535888671875 | 0.6175595819950104 | 0.9821428656578064 | 1.599702537059784 | 0.10837682336568832 | 0.01751708984375 | 0.07
295 | 2:10:02 | 25.41 | 0.0006 | 0.6220159350601887 | 9.3117125524965e-07 | 275.6785888671875 | 0.5997024327516556 | 1.0 | 1.599702537059784 | 0.1593991443514824 | 0.015716552734375 | 0.07
296 | 2:10:27 | 25.28 | 0.0006 | 1.8513689860235254 | 9.309379374708353e-07 | 291.08929443359375 | 0.6949404776096344 | 1.0 | 1.6949405670166016 | 0.11350258439779282 | 0.015228271484375 | 0.07
297 | 2:10:52 | 25.29 | 0.0006 | 0.33446788825358864 | 9.307046196920205e-07 | 300.6428680419922 | 0.65327388048172 | 1.0 | 1.6532739400863647 | 0.05281120166182518 | 0.01617431640625 | 0.07
298 | 2:11:17 | 25.35 | 0.0008 | 1.0706177167999456 | 9.304713019132058e-07 | 269.0714340209961 | 0.7211309969425201 | 1.0 | 1.7211310863494873 | 0.06488094944506884 | 0.01873779296875 | 0.07
299 | 2:11:44 | 25.83 | 0.0007 | 0.35450049892714713 | 9.30237984134391e-07 | 317.5357208251953 | 0.7053571343421936 | 1.0 | 1.7053572535514832 | 0.05373070016503334 | 0.016937255859375 | 0.07
300 | 2:12:09 | 25.50 | 0.0007 | 0.4397274311663669 | 9.300046663555763e-07 | 276.48216247558594 | 0.6369048058986664 | 1.0 | 1.6369048357009888 | 0.06388125754892826 | 0.01629638671875 | 0.07
[2025-03-02 17:14:02,729] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
301 | 2:16:40 | 99.09 | 0.0007 | 0.733609320705889 | 9.297713485767616e-07 | 308.85716247558594 | 0.6994048058986664 | 0.9821428656578064 | 1.6815477013587952 | 0.1409609243273735 | 0.016937255859375 | 0.07
302 | 2:17:05 | 77.02 | 0.0007 | 0.39724878442291667 | 9.295380307979468e-07 | 304.3928680419922 | 0.5937500298023224 | 1.0 | 1.5937501192092896 | 0.05116274021565914 | 0.0177001953125 | 0.07
303 | 2:17:29 | 61.02 | 0.0007 | 1.3971539252293133 | 9.29304713019132e-07 | 290.73216247558594 | 0.6803571879863739 | 1.0 | 1.6803572177886963 | 0.06667705066502094 | 0.01824951171875 | 0.07
304 | 2:17:57 | 51.25 | 0.0007 | 2.5875475999320274 | 9.290713952403174e-07 | 336.4464416503906 | 0.5669642686843872 | 0.9821428656578064 | 1.5491072535514832 | 0.10708648338913918 | 0.0186767578125 | 0.07
305 | 2:18:22 | 43.22 | 0.0007 | 5.335142525926351 | 9.288380774615026e-07 | 305.3571472167969 | 0.6294643580913544 | 1.0 | 1.6294643878936768 | 0.08931877091526985 | 0.017425537109375 | 0.07
306 | 2:18:48 | 38.09 | 0.0007 | 0.1617324724887972 | 9.286047596826878e-07 | 324.3035888671875 | 0.6934524178504944 | 1.0 | 1.693452537059784 | 0.010309826582670212 | 0.017974853515625 | 0.07
307 | 2:19:17 | 35.22 | 0.0007 | 0.47940384718964174 | 9.28371441903873e-07 | 343.3928680419922 | 0.5625000149011612 | 0.9821428656578064 | 1.544642984867096 | 0.15343885123729706 | 0.0164794921875 | 0.07
308 | 2:19:45 | 33.14 | 0.0007 | 0.47342571457917637 | 9.281381241250583e-07 | 324.0535888671875 | 0.5892857313156128 | 0.9821428656578064 | 1.571428656578064 | 0.12485412880778313 | 0.016998291015625 | 0.07
309 | 2:20:10 | 30.77 | 0.0006 | 0.3716841570411765 | 9.279048063462436e-07 | 316.73216247558594 | 0.6220238506793976 | 1.0 | 1.6220239400863647 | 0.05096161924302578 | 0.01580810546875 | 0.07
310 | 2:20:35 | 28.89 | 0.0006 | 1.048239284310524 | 9.276714885674288e-07 | 308.6428680419922 | 0.6577381789684296 | 1.0 | 1.657738208770752 | 0.08005647920072079 | 0.0147705078125 | 0.07
311 | 2:20:59 | 27.48 | 0.0007 | 0.5798521873549847 | 9.274381707886141e-07 | 267.00001525878906 | 0.674107164144516 | 1.0 | 1.6741071939468384 | 0.056547620333731174 | 0.01837158203125 | 0.07
[2025-03-02 17:18:48,178] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
312 | 2:21:25 | 27.16 | 0.0006 | 0.6619412860006596 | 9.272048530097993e-07 | 300.87501525878906 | 0.6473214626312256 | 0.9821428656578064 | 1.6294643878936768 | 0.10008022747933865 | 0.015960693359375 | 0.07
313 | 2:21:51 | 26.81 | 0.0008 | 0.9277515237177059 | 9.269715352309846e-07 | 309.73216247558594 | 0.4992559999227524 | 1.0 | 1.4992560744285583 | 0.10108364373445511 | 0.02001953125 | 0.07
314 | 2:22:18 | 26.92 | 0.0006 | 0.4379611891124945 | 9.267382174521699e-07 | 334.7857360839844 | 0.5937500894069672 | 1.0 | 1.5937501788139343 | 0.10078872740268707 | 0.015838623046875 | 0.07
315 | 2:22:44 | 26.39 | 0.0006 | 0.6045032538394417 | 9.265048996733551e-07 | 291.6785888671875 | 0.6369047909975052 | 1.0 | 1.6369048953056335 | 0.12273096293210983 | 0.0162353515625 | 0.07
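A consistency check on the reward columns: in every row, reward equals rewards/only_full_func_accuracy_reward plus rewards/format_reward, up to float32 rounding. Verified here with the step-259 values; only the numbers are taken from the log:

    import math

    acc = 0.598214328289032      # rewards/only_full_func_accuracy_reward, step 259
    fmt = 0.9821428656578064     # rewards/format_reward, step 259
    total = 1.5803571939468384   # reward, step 259

    # the logged reward is the sum of the two components; other rows
    # agree to ~1e-6, i.e. float32 rounding of the sum
    assert math.isclose(acc + fmt, total, rel_tol=1e-6)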
316 | 2:23:09 | 26.22 | 0.0006 | 0.3294828123623182 | 9.262715818945403e-07 | 303.62501525878906 | 0.735119104385376 | 1.0 | 1.7351192235946655 | 0.044429175555706024 | 0.015838623046875 | 0.07
317 | 2:23:36 | 26.23 | 0.0007 | 0.9527854938514249 | 9.260382641157256e-07 | 323.2857360839844 | 0.6443452835083008 | 0.9821428656578064 | 1.626488208770752 | 0.15193629264831543 | 0.016693115234375 | 0.07
318 | 2:24:02 | 26.21 | 0.0008 | 0.6451007552474236 | 9.258049463369109e-07 | 290.87501525878906 | 0.5662202835083008 | 0.9821428656578064 | 1.548363208770752 | 0.0997023917734623 | 0.0194091796875 | 0.07
319 | 2:24:27 | 26.04 | 0.0007 | 0.8349215192076445 | 9.255716285580961e-07 | 329.3928680419922 | 0.6160714328289032 | 1.0 | 1.6160715818405151 | 0.1260746531188488 | 0.017578125 | 0.07
320 | 2:24:51 | 25.43 | 0.0007 | 0.29782834784177603 | 9.253383107792813e-07 | 297.33929443359375 | 0.7113095819950104 | 1.0 | 1.7113096714019775 | 0.0535714328289032 | 0.018310546875 | 0.07
321 | 2:25:17 | 25.58 | 0.0008 | 0.354107241724624 | 9.251049930004667e-07 | 300.73216247558594 | 0.629464328289032 | 1.0 | 1.629464328289032 | 0.09134429693222046 | 0.01885986328125 | 0.07
322 | 2:25:43 | 25.54 | 0.0008 | 1.7498964711985032 | 9.248716752216519e-07 | 303.2678680419922 | 0.6235119700431824 | 0.9821428656578064 | 1.6056549549102783 | 0.08143774420022964 | 0.01959228515625 | 0.08
323 | 2:26:07 | 25.13 | 0.0007 | 1.0926252881886447 | 9.246383574428371e-07 | 279.8571472167969 | 0.7842262387275696 | 1.0 | 1.7842262983322144 | 0.058389293029904366 | 0.017791748046875 | 0.08
324 | 2:26:32 | 25.22 | 0.0008 | 3.1889663381415327 | 9.244050396640224e-07 | 324.0714416503906 | 0.549107164144516 | 1.0 | 1.5491071939468384 | 0.0863095298409462 | 0.01898193359375 | 0.08
325 | 2:26:56 | 24.83 | 0.0008 | 0.4611432898049414 | 9.241717218852077e-07 | 300.3928680419922 | 0.5528274178504944 | 1.0 | 1.552827537059784 | 0.07465791143476963 | 0.0198974609375 | 0.08
326 | 2:27:21 | 24.69 | 0.0009 | 0.5914226289836308 | 9.239384041063929e-07 | 274.3214416503906 | 0.5997024178504944 | 1.0 | 1.5997024774551392 | 0.036505917087197304 | 0.021484375 | 0.08
327 | 2:27:46 | 24.81 | 0.0007 | 0.5364116934290962 | 9.237050863275782e-07 | 308.2857208251953 | 0.642857164144516 | 0.9821428656578064 | 1.6250000596046448 | 0.08319672383368015 | 0.016510009765625 | 0.08
[2025-03-02 17:25:35,360] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
328 | 2:28:12 | 25.35 | 0.0008 | 0.5822703231478349 | 9.234717685487634e-07 | 332.3214416503906 | 0.5952381491661072 | 0.9821428656578064 | 1.5773810744285583 | 0.0956411100924015 | 0.01922607421875 | 0.08
329 | 2:28:38 | 25.34 | 0.0007 | 0.7452313963569934 | 9.232384507699487e-07 | 307.39288330078125 | 0.7693453133106232 | 1.0 | 1.7693453431129456 | 0.08097816072404385 | 0.01666259765625 | 0.08
330 | 2:29:02 | 25.11 | 0.0007 | 0.47395233579064366 | 9.230051329911339e-07 | 292.8214416503906 | 0.7449405491352081 | 1.0 | 1.7449406385421753 | 0.10524234175682068 | 0.0179443359375 | 0.08
331 | 2:29:30 | 25.79 | 0.0006 | 0.3399036859296842 | 9.227718152123192e-07 | 323.12501525878906 | 0.790178656578064 | 0.9821428656578064 | 1.7723215818405151 | 0.13692889362573624 | 0.0146484375 | 0.08
332 | 2:29:56 | 25.82 | 0.0007 | 0.8371896750151199 | 9.225384974335044e-07 | 299.1428680419922 | 0.6205357909202576 | 1.0 | 1.6205358505249023 | 0.04556369036436081 | 0.0164794921875 | 0.08
333 | 2:30:21 | 25.76 | 0.0007 | 3.503488859526147 | 9.223051796546896e-07 | 308.6964416503906 | 0.6113095581531525 | 1.0 | 1.6113095879554749 | 0.10330710932612419 | 0.01861572265625 | 0.08
334 | 2:30:46 | 25.48 | 0.0007 | 0.7554780680154325 | 9.22071861875875e-07 | 289.8214416503906 | 0.6264881491661072 | 1.0 | 1.626488208770752 | 0.06358060520142317 | 0.0166015625 | 0.08
335 | 2:31:10 | 25.10 | 0.0007 | 0.33692660233324373 | 9.218385440970602e-07 | 309.6964416503906 | 0.6288690865039825 | 1.0 | 1.6288691759109497 | 0.03392857313156128 | 0.017425537109375 | 0.08
336 | 2:31:36 | 25.27 | 0.0008 | 0.6257683302092946 | 9.216052263182454e-07 | 307.4464416503906 | 0.7244048118591309 | 1.0 | 1.7244048118591309 | 0.10207685828208923 | 0.0211181640625 | 0.08
337 | 2:32:00 | 24.89 | 0.0008 | 0.4274955252319907 | 9.213719085394307e-07 | 278.69644927978516 | 0.5889881551265717 | 1.0 | 1.5889882445335388 | 0.06883394159376621 | 0.021240234375 | 0.08
338 | 2:32:26 | 25.14 | 0.0007 | 0.38372182685311396 | 9.21138590760616e-07 | 326.4285888671875 | 0.6245265454053879 | 0.9464285969734192 | 1.570955216884613 | 0.08585994318127632 | 0.0166015625 | 0.08
339 | 2:32:50 | 24.81 | 0.0007 | 5.769294226367645 | 9.209052729818012e-07 | 306.9285888671875 | 0.7276786267757416 | 1.0 | 1.727678656578064 | 0.04379208851605654 | 0.01727294921875 | 0.08
340 | 2:33:14 | 24.72 | 0.0008 | 0.4519508718737422 | 9.206719552029864e-07 | 292.21429443359375 | 0.7172619700431824 | 1.0 | 1.7172620296478271 | 0.10176010429859161 | 0.01947021484375 | 0.08
341 | 2:33:38 | 24.37 | 0.0007 | 0.6774743440434242 | 9.204386374241717e-07 | 285.0178680419922 | 0.5803571492433548 | 1.0 | 1.5803572535514832 | 0.13713815808296204 | 0.017120361328125 | 0.08
342 | 2:34:03 | 24.64 | 0.0008 | 0.4556472226482377 | 9.20205319645357e-07 | 305.9107208251953 | 0.6726190745830536 | 1.0 | 1.6726191639900208 | 0.13814511895179749 | 0.01910400390625 | 0.08
343 | 2:34:30 | 25.33 | 0.0009 | 0.6110496683004201 | 9.199720018665422e-07 | 291.7321472167969 | 0.550000011920929 | 1.0 | 1.5500000715255737 | 0.08822939172387123 | 0.02227783203125 | 0.08
344 | 2:34:55 | 25.25 | 0.0008 | 0.3362240251232118 | 9.197386840877275e-07 | 317.67857360839844 | 0.5758928805589676 | 1.0 | 1.575892984867096 | 0.08529254049062729 | 0.0191650390625 | 0.08
345 | 2:35:20 | 25.08 | 0.0007 | 0.9094407747038462 | 9.195053663089127e-07 | 301.46429443359375 | 0.666666716337204 | 1.0 | 1.6666667461395264 | 0.13008152693510056 | 0.018310546875 | 0.08
346 | 2:35:45 | 25.06 | 0.0008 | 0.46782820786135637 | 9.19272048530098e-07 | 312.12501525878906 | 0.4806547909975052 | 1.0 | 1.4806548953056335 | 0.11601909250020981 | 0.0191650390625 | 0.08
347 | 2:36:09 | 24.83 | 0.0007 | 0.3981965055292523 | 9.190387307512833e-07 | 291.50001525878906 | 0.6547619998455048 | 1.0 | 1.654762089252472 | 0.08198513090610504 | 0.0167236328125 | 0.08
348 | 2:36:35 | 25.30 | 0.0007 | 3.6555920124040555 | 9.188054129724685e-07 | 310.98216247558594 | 0.641369104385376 | 1.0 | 1.6413691639900208 | 0.044642859138548374 | 0.016845703125 | 0.08
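The learning_rate column is consistent with a linear decay from 1e-06 to 0 over the 4286 scheduled steps, lr(t) = 1e-06 * (1 - t / 4286). That schedule is inferred from the logged values, not read from a config; a spot check against step 316:

    base_lr = 1e-06      # assumed peak learning rate, inferred from the per-step decrement
    total_steps = 4286   # total step count shown by the progress counter

    lr_316 = base_lr * (1 - 316 / total_steps)
    print(f"{lr_316:.6e}")  # 9.262716e-07, matching step 316's logged 9.262715818945403e-07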
349 | 2:37:00 | 25.22 | 0.0008 | 0.8063485001243222 | 9.185720951936537e-07 | 295.5535888671875 | 0.5386905074119568 | 1.0 | 1.5386905670166016 | 0.07057780586183071 | 0.02081298828125 | 0.08
350 | 2:37:27 | 25.69 | 0.0007 | 1.299367567373549 | 9.183387774148391e-07 | 272.73216247558594 | 0.6949404776096344 | 1.0 | 1.6949405670166016 | 0.0912406425923109 | 0.0177001953125 | 0.08
351 | 2:37:53 | 25.57 | 0.0008 | 1.0524137368822928 | 9.181054596360243e-07 | 284.0178680419922 | 0.641369104385376 | 1.0 | 1.6413691639900208 | 0.13853276148438454 | 0.02130126953125 | 0.08
[2025-03-02 17:35:41,065] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
352 | 2:38:18 | 25.59 | 0.0007 | 1.7468100353699645 | 9.178721418572095e-07 | 292.8393096923828 | 0.6934524476528168 | 1.0 | 1.693452537059784 | 0.07559811696410179 | 0.01837158203125 | 0.08
353 | 2:38:43 | 25.24 | 0.0007 | 0.23660458653695868 | 9.176388240783947e-07 | 279.50001525878906 | 0.8005952537059784 | 1.0 | 1.8005953431129456 | 0.07100120931863785 | 0.01708984375 | 0.08
354 | 2:39:08 | 25.31 | 0.0006 | 0.3819930566405677 | 9.1740550629958e-07 | 307.5357360839844 | 0.7321429252624512 | 1.0 | 1.7321430444717407 | 0.13647740334272385 | 0.016082763671875 | 0.08
[2025-03-02 17:36:55,667] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
355 | 2:39:33 | 25.13 | 0.001 | 0.9050877656805016 | 9.171721885207653e-07 | 281.8928756713867 | 0.6785714328289032 | 1.0 | 1.6785715818405151 | 0.22983846813440323 | 0.024169921875 | 0.08
356 | 2:39:58 | 25.16 | 0.0008 | 0.6189616263359182 | 9.169388707419505e-07 | 325.8571472167969 | 0.7336310148239136 | 1.0 | 1.7336310744285583 | 0.09267593175172806 | 0.01983642578125 | 0.08
[2025-03-02 17:37:47,024] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
357 | 2:40:24 | 25.45 | 0.0007 | 0.3257440888201459 | 9.167055529631358e-07 | 311.0357208251953 | 0.553571492433548 | 1.0 | 1.5535715818405151 | 0.03210100997239351 | 0.017547607421875 | 0.08
358 | 2:40:51 | 25.86 | 0.0007 | 0.4159294183553811 | 9.16472235184321e-07 | 320.2678680419922 | 0.7083333730697632 | 1.0 | 1.7083334922790527 | 0.06547618471086025 | 0.0181884765625 | 0.08
359 | 2:41:15 | 25.30 | 0.0008 | 0.5505348468493491 | 9.162389174055063e-07 | 297.39288330078125 | 0.6086309850215912 | 1.0 | 1.6086310148239136 | 0.07029405608773232 | 0.0198974609375 | 0.08
360 | 2:41:41 | 25.61 | 0.0008 | 0.5712504939249072 | 9.160055996266916e-07 | 319.87501525878906 | 0.7633928954601288 | 1.0 | 1.763392984867096 | 0.07522517070174217 | 0.020263671875 | 0.08
361 | 2:42:06 | 25.27 | 0.0007 | 0.8645435641010724 | 9.157722818478768e-07 | 315.37501525878906 | 0.6071428954601288 | 1.0 | 1.6071429252624512 | 0.0689128004014492 | 0.0185546875 | 0.08
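The other remedy the warning text offers, "adjusting settings to reduce memory consumption", usually means tightening the ZeRO stage-3 partitioning knobs. The key names below are real DeepSpeed config keys, but the values are illustrative only, not the configuration this run used:

    ds_config = {
        "zero_optimization": {
            "stage": 3,
            "stage3_max_live_parameters": 5e8,   # cap parameters kept resident per GPU
            "stage3_max_reuse_distance": 5e8,    # release partitions that will not be reused soon
            "stage3_prefetch_bucket_size": 5e7,  # smaller prefetch buffers
            "stage3_param_persistence_threshold": 1e5,
            "reduce_bucket_size": 5e7,           # smaller gradient-reduction buffers
        },
        "train_micro_batch_size_per_gpu": 1,     # the largest single lever on activation memory
    }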
362 | 2:42:31 | 25.21 | 0.0008 | 0.894575224317876 | 9.15538964069062e-07 | 277.1607208251953 | 0.5639881491661072 | 1.0 | 1.563988208770752 | 0.10809674859046936 | 0.02093505859375 | 0.08
363 | 2:42:55 | 24.92 | 0.0009 | 0.7709756191439878 | 9.153056462902473e-07 | 289.4821472167969 | 0.570963591337204 | 0.9821428656578064 | 1.5531064867973328 | 0.1140252985060215 | 0.02166748046875 | 0.08
364 | 2:43:20 | 24.87 | 0.0007 | 0.17494992438040635 | 9.150723285114326e-07 | 308.96429443359375 | 0.6413690447807312 | 1.0 | 1.6413691639900208 | 0.034620251506567 | 0.01788330078125 | 0.08
365 | 2:43:46 | 25.12 | 0.0008 | 0.7233589259382542 | 9.148390107326178e-07 | 310.14288330078125 | 0.572916716337204 | 1.0 | 1.5729168057441711 | 0.18017347157001495 | 0.02056884765625 | 0.09
366 | 2:44:12 | 25.38 | 0.0008 | 1.1304377097085092 | 9.14605692953803e-07 | 304.46429443359375 | 0.555059552192688 | 1.0 | 1.5550596714019775 | 0.06438315659761429 | 0.020599365234375 | 0.09
367 | 2:44:36 | 24.97 | 0.0007 | 0.49682734937319684 | 9.143723751749884e-07 | 286.0178680419922 | 0.7470238506793976 | 1.0 | 1.7470239400863647 | 0.12181013077497482 | 0.017791748046875 | 0.09
368 | 2:45:00 | 24.74 | 0.0008 | 0.5236649582983532 | 9.141390573961736e-07 | 278.0178680419922 | 0.6562500298023224 | 1.0 | 1.6562500596046448 | 0.0555737130343914 | 0.01898193359375 | 0.09
369 | 2:45:26 | 25.17 | 0.0008 | 0.6141203386110358 | 9.139057396173588e-07 | 305.8214416503906 | 0.6458333730697632 | 0.9821428656578064 | 1.6279762983322144 | 0.1161789670586586 | 0.02069091796875 | 0.09
370 | 2:45:52 | 25.43 | 0.0008 | 2.0937748536635135 | 9.136724218385441e-07 | 293.85716247558594 | 0.6314484179019928 | 0.9642857313156128 | 1.595734179019928 | 0.14189279451966286 | 0.0189208984375 | 0.09
371 | 2:46:16 | 25.17 | 0.001 | 0.6793596354418793 | 9.134391040597294e-07 | 284.58929443359375 | 0.648809552192688 | 1.0 | 1.6488096117973328 | 0.12340506166219711 | 0.0252685546875 | 0.09
[2025-03-02 17:44:06,217] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
372 | 2:46:43 | 25.66 | 0.0007 | 2.0690209134961326 | 9.132057862809146e-07 | 314.7321472167969 | 0.729166716337204 | 1.0 | 1.7291668057441711 | 0.09360114857554436 | 0.01849365234375 | 0.09
373 | 2:47:08 | 25.50 | 0.0008 | 2.379626723092922 | 9.129724685020999e-07 | 281.5535888671875 | 0.7083333730697632 | 1.0 | 1.708333432674408 | 0.08029617182910442 | 0.0206298828125 | 0.09
374 | 2:47:33 | 25.32 | 0.001 | 0.4502197886667261 | 9.127391507232851e-07 | 285.8214416503906 | 0.5178571939468384 | 1.0 | 1.517857313156128 | 0.0595238171517849 | 0.0238037109375 | 0.09
375 | 2:48:00 | 25.78 | 0.0009 | 0.8643636290222733 | 9.125058329444704e-07 | 325.0357360839844 | 0.580357164144516 | 0.9821428656578064 | 1.5625000596046448 | 0.14386197179555893 | 0.02191162109375 | 0.09
[2025-03-02 17:45:49,406] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
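One more regularity in the table: loss tracks kl at a ratio of about 0.04 throughout (e.g. step 365: 0.04 * 0.02056884765625 ≈ 0.0008, the logged loss), which is consistent with a GRPO-style objective whose printed loss is dominated by the KL term. The coefficient is inferred from the numbers, not from a config:

    beta = 0.04                # inferred KL coefficient; not a quoted setting
    kl_371 = 0.0252685546875   # kl at step 371
    print(round(beta * kl_371, 4))  # 0.001, the logged loss at step 371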
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
9%|▉ | 376/4286 [2:48:26<28:10:22, 25.94s/it] {'loss': 0.0007, 'grad_norm': 0.6599480287536537, 'learning_rate': 9.122725151656556e-07, 'completion_length': 308.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.649215430021286, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6313583850860596, 'reward_std': 0.16840333491563797, 'kl': 0.0185546875, 'epoch': 0.09}
9%|▉ | 377/4286 [2:48:52<27:56:11, 25.73s/it] {'loss': 0.0007, 'grad_norm': 0.4495905194040979, 'learning_rate': 9.120391973868409e-07, 'completion_length': 299.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6636904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.060604410246014595, 'kl': 0.0174560546875, 'epoch': 0.09}
9%|▉ | 378/4286 [2:49:17<27:37:20, 25.45s/it] {'loss': 0.001, 'grad_norm': 0.3063018964765777, 'learning_rate': 9.118058796080261e-07, 'completion_length': 308.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5735119432210922, 'rewards/format_reward': 1.0, 'reward': 1.5735119581222534, 'reward_std': 0.039341666735708714, 'kl': 0.02569580078125, 'epoch': 0.09}
9%|▉ | 379/4286 [2:49:40<27:01:28, 24.90s/it] {'loss': 0.0007, 'grad_norm': 0.46486058540975594, 'learning_rate': 9.115725618292113e-07, 'completion_length': 274.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7678572237491608, 'rewards/format_reward': 1.0, 'reward': 1.767857313156128, 'reward_std': 0.08014346286654472, 'kl': 0.0169677734375, 'epoch': 0.09}
9%|▉ | 380/4286 [2:50:06<27:18:16, 25.17s/it] {'loss': 0.001, 'grad_norm': 0.8320183223914432, 'learning_rate': 9.113392440503967e-07, 'completion_length': 312.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6803572177886963, 'rewards/format_reward': 1.0, 'reward': 1.6803573369979858, 'reward_std': 0.11593816801905632, 'kl': 0.0240478515625, 'epoch': 0.09}
9%|▉ | 381/4286 [2:50:32<27:26:55, 25.30s/it] {'loss': 0.0008, 'grad_norm': 2.062890778797357, 'learning_rate': 9.111059262715819e-07, 'completion_length': 310.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5193452686071396, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.501488208770752, 'reward_std': 0.1431669071316719, 'kl': 0.02020263671875, 'epoch': 0.09}
9%|▉ | 382/4286 [2:50:56<27:08:22, 25.03s/it] {'loss': 0.0008, 'grad_norm': 0.4156186053079649, 'learning_rate': 9.108726084927671e-07, 'completion_length': 273.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5833333432674408, 'rewards/format_reward': 1.0, 'reward': 1.583333432674408, 'reward_std': 0.08474764414131641, 'kl': 0.019775390625, 'epoch': 0.09}
9%|▉ | 383/4286 [2:51:22<27:33:47, 25.42s/it] {'loss': 0.0008, 'grad_norm': 0.2900496419147423, 'learning_rate': 9.106392907139524e-07, 'completion_length': 309.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.6738095581531525, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6380953788757324, 'reward_std': 0.12295292317867279, 'kl': 0.02099609375, 'epoch': 0.09}
9%|▉ | 384/4286 [2:51:46<26:56:31, 24.86s/it] {'loss': 0.0009, 'grad_norm': 0.6673782125959383, 'learning_rate': 9.104059729351377e-07, 'completion_length': 277.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6130952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.11677857488393784, 'kl': 0.021270751953125, 'epoch': 0.09}
9%|▉ | 385/4286 [2:52:11<27:08:09, 25.04s/it] {'loss': 0.0008, 'grad_norm': 0.5803945672080657, 'learning_rate': 9.101726551563229e-07, 'completion_length': 301.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.6401786506175995, 'rewards/format_reward': 1.0, 'reward': 1.640178620815277, 'reward_std': 0.051273198798298836, 'kl': 0.02032470703125, 'epoch': 0.09}
9%|▉ | 386/4286 [2:52:37<27:14:47, 25.15s/it] {'loss': 0.0009, 'grad_norm': 0.4315511659823431, 'learning_rate': 9.099393373775081e-07, 'completion_length': 293.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6622024476528168, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.07622811198234558, 'kl': 0.0234375, 'epoch': 0.09}
9%|▉ | 387/4286 [2:53:03<27:36:36, 25.49s/it] {'loss': 0.0009, 'grad_norm': 0.7974940565514522, 'learning_rate': 9.097060195986934e-07, 'completion_length': 314.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6562500596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.638392984867096, 'reward_std': 0.15127138048410416, 'kl': 0.02178955078125, 'epoch': 0.09}
9%|▉ | 388/4286 [2:53:28<27:18:39, 25.22s/it] {'loss': 0.0008, 'grad_norm': 0.6608990531434082, 'learning_rate': 9.094727018198787e-07, 'completion_length': 313.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6404761970043182, 'rewards/format_reward': 1.0, 'reward': 1.6404762864112854, 'reward_std': 0.08506512455642223, 'kl': 0.0211181640625, 'epoch': 0.09}
9%|▉ | 389/4286 [2:53:51<26:48:26, 24.76s/it] {'loss': 0.0007, 'grad_norm': 0.5972174140744826, 'learning_rate': 9.092393840410639e-07, 'completion_length': 276.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6556548178195953, 'rewards/format_reward': 1.0, 'reward': 1.6556548476219177, 'reward_std': 0.10679368302226067, 'kl': 0.01812744140625, 'epoch': 0.09}
9%|▉ | 390/4286 [2:54:15<26:35:04, 24.56s/it] {'loss': 0.001, 'grad_norm': 6.256291826746742, 'learning_rate': 9.090060662622492e-07, 'completion_length': 288.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.8002976775169373, 'rewards/format_reward': 1.0, 'reward': 1.8002977967262268, 'reward_std': 0.0865662470459938, 'kl': 0.024169921875, 'epoch': 0.09}
9%|▉ | 391/4286 [2:54:37<25:43:59, 23.78s/it] {'loss': 0.0006, 'grad_norm': 0.3458326874027513, 'learning_rate': 9.087727484834344e-07, 'completion_length': 237.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.8258928954601288, 'rewards/format_reward': 1.0, 'reward': 1.8258929252624512, 'reward_std': 0.03869047574698925, 'kl': 0.015380859375, 'epoch': 0.09}
9%|▉ | 392/4286 [2:55:02<25:55:39, 23.97s/it] {'loss': 0.0009, 'grad_norm': 0.530267031341631, 'learning_rate': 9.085394307046197e-07, 'completion_length': 295.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.11511712521314621, 'kl': 0.0228271484375, 'epoch': 0.09}
9%|▉ | 393/4286 [2:55:28<26:40:02, 24.66s/it] {'loss': 0.001, 'grad_norm': 0.4118765141139742, 'learning_rate': 9.08306112925805e-07, 'completion_length': 283.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5505952686071396, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.532738208770752, 'reward_std': 0.07959691435098648, 'kl': 0.0244140625, 'epoch': 0.09}
[2025-03-02 17:53:16,001] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
9%|▉ | 394/4286 [2:55:53<26:47:42, 24.78s/it] {'loss': 0.0008, 'grad_norm': 0.5787317304597545, 'learning_rate': 9.080727951469902e-07, 'completion_length': 287.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7404761910438538, 'rewards/format_reward': 1.0, 'reward': 1.7404763102531433, 'reward_std': 0.07895297929644585, 'kl': 0.01898193359375, 'epoch': 0.09}
[2025-03-02 17:53:40,162] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
9%|▉ | 395/4286 [2:56:17<26:35:09, 24.60s/it] {'loss': 0.0008, 'grad_norm': 0.2476825098631913, 'learning_rate': 9.078394773681754e-07, 'completion_length': 280.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949406266212463, 'reward_std': 0.04388680309057236, 'kl': 0.02099609375, 'epoch': 0.09}
[2025-03-02 17:54:05,747] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
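The stage3 warning above suggests flushing the allocator cache on all ranks at the same point in the loop. A minimal sketch of that suggestion, assuming a standard DeepSpeed engine loop (model_engine, dataloader, and flush_every are placeholder names, not names from this run):

    from deepspeed.accelerator import get_accelerator

    def train_loop(model_engine, dataloader, flush_every=50):
        # Flush the CUDA allocator cache periodically so that every rank
        # releases its cached blocks at the same step, as the warning advises.
        for step, batch in enumerate(dataloader):
            loss = model_engine(**batch)
            model_engine.backward(loss)
            model_engine.step()
            if step % flush_every == 0:
                get_accelerator().empty_cache()

The interval is a trade-off: empty_cache() itself costs time, so it is worth calling only about as often as the warning actually fires.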
9%|▉ | 396/4286 [2:56:43<26:53:58, 24.89s/it] {'loss': 0.001, 'grad_norm': 0.340559692209441, 'learning_rate': 9.076061595893607e-07, 'completion_length': 293.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6497024297714233, 'rewards/format_reward': 1.0, 'reward': 1.6497024893760681, 'reward_std': 0.09642763808369637, 'kl': 0.026123046875, 'epoch': 0.09}
9%|▉ | 397/4286 [2:57:06<26:16:49, 24.33s/it] {'loss': 0.0009, 'grad_norm': 0.4372181176454737, 'learning_rate': 9.07372841810546e-07, 'completion_length': 286.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.0446428582072258, 'kl': 0.02215576171875, 'epoch': 0.09}
[2025-03-02 17:54:54,164] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
9%|▉ | 398/4286 [2:57:31<26:37:29, 24.65s/it] {'loss': 0.0008, 'grad_norm': 0.5141838852890659, 'learning_rate': 9.071395240317312e-07, 'completion_length': 282.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6793154776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.661458432674408, 'reward_std': 0.15163720026612282, 'kl': 0.02105712890625, 'epoch': 0.09}
9%|▉ | 399/4286 [2:57:55<26:15:14, 24.32s/it] {'loss': 0.0007, 'grad_norm': 2.5809493642463304, 'learning_rate': 9.069062062529164e-07, 'completion_length': 268.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6607143878936768, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.09707976877689362, 'kl': 0.0184326171875, 'epoch': 0.09}
9%|▉ | 400/4286 [2:58:21<26:47:25, 24.82s/it] {'loss': 0.001, 'grad_norm': 0.5751999804965848, 'learning_rate': 9.066728884741018e-07, 'completion_length': 291.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.10596857778728008, 'kl': 0.02410888671875, 'epoch': 0.09}
9%|▉ | 401/4286 [3:02:17<95:04:25, 88.10s/it] {'loss': 0.0009, 'grad_norm': 0.3034757321802064, 'learning_rate': 9.06439570695287e-07, 'completion_length': 280.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.5580357760190964, 'rewards/format_reward': 1.0, 'reward': 1.5580358505249023, 'reward_std': 0.0508419806137681, 'kl': 0.0230712890625, 'epoch': 0.09}
[2025-03-02 18:00:04,182] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
9%|▉ | 402/4286 [3:02:41<74:32:35, 69.09s/it] {'loss': 0.0011, 'grad_norm': 0.7722767224965346, 'learning_rate': 9.062062529164722e-07, 'completion_length': 277.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.663690522313118, 'rewards/format_reward': 1.0, 'reward': 1.6636906266212463, 'reward_std': 0.03596102260053158, 'kl': 0.02685546875, 'epoch': 0.09}
9%|▉ | 403/4286 [3:03:06<60:13:37, 55.84s/it] {'loss': 0.0008, 'grad_norm': 0.5849201068405331, 'learning_rate': 9.059729351376575e-07, 'completion_length': 295.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.710416704416275, 'rewards/format_reward': 1.0, 'reward': 1.710416853427887, 'reward_std': 0.09337493404746056, 'kl': 0.019287109375, 'epoch': 0.09}
9%|▉ | 404/4286 [3:03:32<50:29:22, 46.82s/it] {'loss': 0.0006, 'grad_norm': 0.2851949045307867, 'learning_rate': 9.057396173588428e-07, 'completion_length': 288.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592262387275696, 'reward_std': 0.0267857164144516, 'kl': 0.0159912109375, 'epoch': 0.09}
9%|▉ | 405/4286 [3:03:57<43:30:52, 40.36s/it] {'loss': 0.0008, 'grad_norm': 0.46900987391225535, 'learning_rate': 9.05506299580028e-07, 'completion_length': 275.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6419642865657806, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6241071820259094, 'reward_std': 0.15771466866135597, 'kl': 0.0194091796875, 'epoch': 0.09}
9%|▉ | 406/4286 [3:04:23<38:40:41, 35.89s/it] {'loss': 0.0009, 'grad_norm': 0.4272690987048381, 'learning_rate': 9.052729818012133e-07, 'completion_length': 316.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.09254170209169388, 'kl': 0.02166748046875, 'epoch': 0.09}
9%|▉ | 407/4286 [3:04:48<35:11:16, 32.66s/it] {'loss': 0.0008, 'grad_norm': 0.6338838609607574, 'learning_rate': 9.050396640223985e-07, 'completion_length': 282.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7964286506175995, 'rewards/format_reward': 1.0, 'reward': 1.7964286804199219, 'reward_std': 0.04961633309721947, 'kl': 0.0191650390625, 'epoch': 0.09}
10%|▉ | 408/4286 [3:05:14<33:01:38, 30.66s/it] {'loss': 0.0009, 'grad_norm': 0.8792238908097187, 'learning_rate': 9.048063462435837e-07, 'completion_length': 307.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.12021518871188164, 'kl': 0.02349853515625, 'epoch': 0.1}
10%|▉ | 409/4286 [3:05:40<31:38:43, 29.38s/it] {'loss': 0.001, 'grad_norm': 0.5445984726797368, 'learning_rate': 9.04573028464769e-07, 'completion_length': 316.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6794643402099609, 'rewards/format_reward': 1.0, 'reward': 1.6794643998146057, 'reward_std': 0.09799163416028023, 'kl': 0.0244140625, 'epoch': 0.1}
[2025-03-02 18:03:30,221] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
10%|▉ | 410/4286 [3:06:07<30:53:33, 28.69s/it] {'loss': 0.0007, 'grad_norm': 0.2647329662411682, 'learning_rate': 9.043397106859543e-07, 'completion_length': 306.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.62115678191185, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5854425430297852, 'reward_std': 0.12370046973228455, 'kl': 0.01849365234375, 'epoch': 0.1}
[2025-03-02 18:03:57,537] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
10%|▉ | 411/4286 [3:06:35<30:26:24, 28.28s/it] {'loss': 0.001, 'grad_norm': 4.192318302294236, 'learning_rate': 9.041063929071395e-07, 'completion_length': 334.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6041666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6041667461395264, 'reward_std': 0.07229919172823429, 'kl': 0.0250244140625, 'epoch': 0.1}
10%|▉ | 412/4286 [3:07:03<30:30:28, 28.35s/it] {'loss': 0.0008, 'grad_norm': 0.3533047909461537, 'learning_rate': 9.038730751283247e-07, 'completion_length': 323.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.06953298300504684, 'kl': 0.0201416015625, 'epoch': 0.1}
10%|▉ | 413/4286 [3:07:29<29:35:14, 27.50s/it] {'loss': 0.0009, 'grad_norm': 0.6977282162528701, 'learning_rate': 9.036397573495101e-07, 'completion_length': 303.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.5947172939777374, 'rewards/format_reward': 1.0, 'reward': 1.5947173833847046, 'reward_std': 0.08873244374990463, 'kl': 0.02349853515625, 'epoch': 0.1}
10%|▉ | 414/4286 [3:07:55<29:08:13, 27.09s/it] {'loss': 0.0007, 'grad_norm': 1.7964492761756699, 'learning_rate': 9.034064395706953e-07, 'completion_length': 303.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.08131103683263063, 'kl': 0.018798828125, 'epoch': 0.1}
10%|▉ | 415/4286 [3:08:21<28:46:51, 26.77s/it] {'loss': 0.0009, 'grad_norm': 1.690633004743933, 'learning_rate': 9.031731217918805e-07, 'completion_length': 302.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.4645833820104599, 'rewards/format_reward': 1.0, 'reward': 1.4645834565162659, 'reward_std': 0.15864039957523346, 'kl': 0.02349853515625, 'epoch': 0.1}
10%|▉ | 416/4286 [3:08:48<28:52:29, 26.86s/it] {'loss': 0.0008, 'grad_norm': 0.6060152363742107, 'learning_rate': 9.029398040130658e-07, 'completion_length': 321.0, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5907739400863647, 'reward_std': 0.10387949645519257, 'kl': 0.0208740234375, 'epoch': 0.1}
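For reading the progress lines: tqdm's ETA is (total steps - current step) times the smoothed seconds per iteration, which is why the 88.10s/it spike at step 401 briefly drags the ETA from about 27h to about 95h until the rate estimate recovers. A quick sanity check against the step-376 line (the small discrepancy is tqdm's rate smoothing):

    # Recompute the ETA shown for step 376: 4286 total steps, 25.94 s/it.
    remaining = (4286 - 376) * 25.94            # seconds left at this rate
    h, rem = divmod(int(remaining), 3600)
    m, s = divmod(rem, 60)
    print(f"{h}:{m:02d}:{s:02d}")               # 28:10:25 vs. logged 28:10:22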
10%|▉ | 417/4286 [3:09:14<28:35:58, 26.61s/it] {'loss': 0.0007, 'grad_norm': 0.18128759020660212, 'learning_rate': 9.027064862342511e-07, 'completion_length': 294.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.020833336748182774, 'kl': 0.01873779296875, 'epoch': 0.1}
10%|▉ | 418/4286 [3:09:39<28:15:28, 26.30s/it] {'loss': 0.0007, 'grad_norm': 0.45047872935801814, 'learning_rate': 9.024731684554363e-07, 'completion_length': 306.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 1.0, 'reward': 1.6860120296478271, 'reward_std': 0.0744047574698925, 'kl': 0.01739501953125, 'epoch': 0.1}
[2025-03-02 18:07:29,548] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
10%|▉ | 419/4286 [3:10:07<28:31:29, 26.56s/it] {'loss': 0.001, 'grad_norm': 1.7333887507616512, 'learning_rate': 9.022398506766215e-07, 'completion_length': 316.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7148065567016602, 'rewards/format_reward': 1.0, 'reward': 1.714806616306305, 'reward_std': 0.10598525777459145, 'kl': 0.025146484375, 'epoch': 0.1}
10%|▉ | 420/4286 [3:10:31<27:54:35, 25.99s/it] {'loss': 0.0008, 'grad_norm': 1.0800633623374107, 'learning_rate': 9.020065328978068e-07, 'completion_length': 287.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 1.0, 'reward': 1.6860119700431824, 'reward_std': 0.04648453835397959, 'kl': 0.02069091796875, 'epoch': 0.1}
10%|▉ | 421/4286 [3:10:57<27:50:18, 25.93s/it] {'loss': 0.0008, 'grad_norm': 0.3632281324312258, 'learning_rate': 9.017732151189921e-07, 'completion_length': 329.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755954027175903, 'reward_std': 0.0637825969606638, 'kl': 0.02105712890625, 'epoch': 0.1}
[2025-03-02 18:08:46,761] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
10%|▉ | 422/4286 [3:11:24<28:05:47, 26.18s/it] {'loss': 0.0009, 'grad_norm': 0.5378112498797636, 'learning_rate': 9.015398973401773e-07, 'completion_length': 314.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.6369047462940216, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.09066440537571907, 'kl': 0.0233154296875, 'epoch': 0.1}
10%|▉ | 423/4286 [3:11:50<28:01:54, 26.12s/it] {'loss': 0.0011, 'grad_norm': 0.5693590317890035, 'learning_rate': 9.013065795613626e-07, 'completion_length': 310.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7306548953056335, 'reward_std': 0.11763132363557816, 'kl': 0.02777099609375, 'epoch': 0.1}
10%|▉ | 424/4286 [3:12:15<27:52:30, 25.98s/it] {'loss': 0.0008, 'grad_norm': 0.46490134231733315, 'learning_rate': 9.010732617825478e-07, 'completion_length': 306.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7074405252933502, 'rewards/format_reward': 1.0, 'reward': 1.7074406147003174, 'reward_std': 0.09829567559063435, 'kl': 0.01922607421875, 'epoch': 0.1}
10%|▉ | 425/4286 [3:12:41<27:35:34, 25.73s/it] {'loss': 0.0009, 'grad_norm': 0.4670292887354779, 'learning_rate': 9.008399440037331e-07, 'completion_length': 324.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.7327381372451782, 'rewards/format_reward': 1.0, 'reward': 1.732738196849823, 'reward_std': 0.0657518021762371, 'kl': 0.021484375, 'epoch': 0.1}
10%|▉ | 426/4286 [3:13:07<27:47:33, 25.92s/it] {'loss': 0.0011, 'grad_norm': 2.0503593948326984, 'learning_rate': 9.006066262249184e-07, 'completion_length': 315.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.4419643133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4241071939468384, 'reward_std': 0.10424961894750595, 'kl': 0.02752685546875, 'epoch': 0.1}
[2025-03-02 18:10:58,445] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
10%|▉ | 427/4286 [3:13:36<28:37:23, 26.70s/it] {'loss': 0.0009, 'grad_norm': 0.5672308043103825, 'learning_rate': 9.003733084461036e-07, 'completion_length': 323.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.651488184928894, 'rewards/format_reward': 1.0, 'reward': 1.651488184928894, 'reward_std': 0.029683038592338562, 'kl': 0.02178955078125, 'epoch': 0.1}
10%|▉ | 428/4286 [3:14:01<28:14:39, 26.36s/it] {'loss': 0.0008, 'grad_norm': 0.5776759343156886, 'learning_rate': 9.001399906672888e-07, 'completion_length': 332.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.0866192951798439, 'kl': 0.0201416015625, 'epoch': 0.1}
[2025-03-02 18:11:49,041] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
10%|█ | 429/4286 [3:14:26<27:49:01, 25.96s/it] {'loss': 0.0009, 'grad_norm': 0.5566164850927413, 'learning_rate': 8.999066728884742e-07, 'completion_length': 296.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6247024238109589, 'rewards/format_reward': 1.0, 'reward': 1.624702513217926, 'reward_std': 0.13348043709993362, 'kl': 0.0213623046875, 'epoch': 0.1}
10%|█ | 430/4286 [3:14:53<28:03:22, 26.19s/it] {'loss': 0.0007, 'grad_norm': 0.3532386112492327, 'learning_rate': 8.996733551096594e-07, 'completion_length': 303.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.751488208770752, 'reward_std': 0.06275240797549486, 'kl': 0.0169677734375, 'epoch': 0.1}
10%|█ | 431/4286 [3:15:18<27:38:49, 25.82s/it] {'loss': 0.0007, 'grad_norm': 0.5781851895733012, 'learning_rate': 8.994400373308446e-07, 'completion_length': 303.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6294642984867096, 'rewards/format_reward': 1.0, 'reward': 1.6294643878936768, 'reward_std': 0.08815119415521622, 'kl': 0.01654052734375, 'epoch': 0.1}
10%|█ | 432/4286 [3:15:44<27:41:13, 25.86s/it] {'loss': 0.0008, 'grad_norm': 1.087150737739483, 'learning_rate': 8.992067195520298e-07, 'completion_length': 312.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.6235119700431824, 'rewards/format_reward': 1.0, 'reward': 1.623512089252472, 'reward_std': 0.10960471071302891, 'kl': 0.02117919921875, 'epoch': 0.1}
10%|█ | 433/4286 [3:16:11<28:06:10, 26.26s/it] {'loss': 0.0011, 'grad_norm': 1.0498274547725717, 'learning_rate': 8.989734017732151e-07, 'completion_length': 330.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.6556277573108673, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6020563840866089, 'reward_std': 0.22739601135253906, 'kl': 0.0269775390625, 'epoch': 0.1}
10%|█ | 434/4286 [3:16:37<27:54:21, 26.08s/it] {'loss': 0.0011, 'grad_norm': 0.7103956576872635, 'learning_rate': 8.987400839944004e-07, 'completion_length': 320.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5690476596355438, 'rewards/format_reward': 1.0, 'reward': 1.569047749042511, 'reward_std': 0.11885679885745049, 'kl': 0.0264892578125, 'epoch': 0.1}
10%|█ | 435/4286 [3:17:03<27:58:05, 26.15s/it] {'loss': 0.0011, 'grad_norm': 0.5984772326805402, 'learning_rate': 8.985067662155856e-07, 'completion_length': 309.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.5809524059295654, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5452381372451782, 'reward_std': 0.10579781606793404, 'kl': 0.02740478515625, 'epoch': 0.1}
10%|█ | 436/4286 [3:17:28<27:45:39, 25.96s/it] {'loss': 0.0008, 'grad_norm': 0.18024633212488755, 'learning_rate': 8.982734484367709e-07, 'completion_length': 300.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.6741071939468384, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.03495405800640583, 'kl': 0.0205078125, 'epoch': 0.1}
10%|█ | 437/4286 [3:17:56<28:17:06, 26.46s/it] {'loss': 0.0009, 'grad_norm': 2.4283833254214087, 'learning_rate': 8.980401306579561e-07, 'completion_length': 317.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.674107164144516, 'rewards/format_reward': 1.0, 'reward': 1.6741071939468384, 'reward_std': 0.11128663644194603, 'kl': 0.02203369140625, 'epoch': 0.1}
[2025-03-02 18:15:45,719] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
10%|█ | 438/4286 [3:18:23<28:22:30, 26.55s/it] {'loss': 0.0008, 'grad_norm': 1.071376309248193, 'learning_rate': 8.978068128791414e-07, 'completion_length': 324.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.7827381789684296, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.06345389783382416, 'kl': 0.019775390625, 'epoch': 0.1}
[2025-03-02 18:16:13,453] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
10%|█ | 439/4286 [3:18:51<28:44:55, 26.90s/it] {'loss': 0.0009, 'grad_norm': 0.5260971394883309, 'learning_rate': 8.975734951003267e-07, 'completion_length': 350.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5459822118282318, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5281251072883606, 'reward_std': 0.17551341652870178, 'kl': 0.021240234375, 'epoch': 0.1}
10%|█ | 440/4286 [3:19:16<28:20:30, 26.53s/it] {'loss': 0.0007, 'grad_norm': 0.18580275699072435, 'learning_rate': 8.973401773215119e-07, 'completion_length': 297.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7247024178504944, 'rewards/format_reward': 1.0, 'reward': 1.724702537059784, 'reward_std': 0.047405367717146873, 'kl': 0.01739501953125, 'epoch': 0.1}
10%|█ | 441/4286 [3:19:41<27:46:18, 26.00s/it] {'loss': 0.0008, 'grad_norm': 0.16171857718760982, 'learning_rate': 8.971068595426971e-07, 'completion_length': 298.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.5848214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5848215222358704, 'reward_std': 0.008928571827709675, 'kl': 0.02081298828125, 'epoch': 0.1}
10%|█ | 442/4286 [3:20:07<27:42:52, 25.96s/it] {'loss': 0.0008, 'grad_norm': 0.6153192085963313, 'learning_rate': 8.968735417638824e-07, 'completion_length': 318.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.730654776096344, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.08360585011541843, 'kl': 0.01983642578125, 'epoch': 0.1}
10%|█ | 443/4286 [3:20:33<27:52:25, 26.11s/it] {'loss': 0.0009, 'grad_norm': 0.5783020970961217, 'learning_rate': 8.966402239850677e-07, 'completion_length': 323.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.07020125165581703, 'kl': 0.023193359375, 'epoch': 0.1}
[2025-03-02 18:18:23,541] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
10%|█ | 444/4286 [3:21:01<28:15:31, 26.48s/it] {'loss': 0.0008, 'grad_norm': 1.3717433169938889, 'learning_rate': 8.964069062062529e-07, 'completion_length': 329.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6619048118591309, 'rewards/format_reward': 1.0, 'reward': 1.6619048118591309, 'reward_std': 0.06742122024297714, 'kl': 0.0203857421875, 'epoch': 0.1}
10%|█ | 445/4286 [3:21:30<29:11:23, 27.36s/it] {'loss': 0.0009, 'grad_norm': 0.5820651704093522, 'learning_rate': 8.961735884274381e-07, 'completion_length': 323.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7559524774551392, 'reward_std': 0.14126220531761646, 'kl': 0.02178955078125, 'epoch': 0.1}
[2025-03-02 18:19:20,000] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
10%|█ | 446/4286 [3:21:57<29:05:00, 27.27s/it] {'loss': 0.0008, 'grad_norm': 0.290620522852722, 'learning_rate': 8.959402706486235e-07, 'completion_length': 347.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.71577388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6979168057441711, 'reward_std': 0.05393141880631447, 'kl': 0.0191650390625, 'epoch': 0.1}
10%|█ | 447/4286 [3:22:21<28:07:15, 26.37s/it] {'loss': 0.001, 'grad_norm': 0.7992528222067949, 'learning_rate': 8.957069528698087e-07, 'completion_length': 296.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.619047686457634, 'rewards/format_reward': 1.0, 'reward': 1.61904776096344, 'reward_std': 0.08953245729207993, 'kl': 0.0240478515625, 'epoch': 0.1}
10%|█ | 448/4286 [3:22:48<28:12:56, 26.47s/it] {'loss': 0.0009, 'grad_norm': 1.422517320813294, 'learning_rate': 8.954736350909939e-07, 'completion_length': 332.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7336310148239136, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.08361312001943588, 'kl': 0.0218505859375, 'epoch': 0.1}
[2025-03-02 18:20:36,232] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
10%|█ | 449/4286 [3:23:13<27:49:23, 26.10s/it] {'loss': 0.0007, 'grad_norm': 0.40686589107309307, 'learning_rate': 8.952403173121792e-07, 'completion_length': 313.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.07008037343621254, 'kl': 0.0179443359375, 'epoch': 0.1}
10%|█ | 450/4286 [3:23:40<28:01:00, 26.29s/it] {'loss': 0.001, 'grad_norm': 0.48887352299710735, 'learning_rate': 8.950069995333645e-07, 'completion_length': 330.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.0821827445179224, 'kl': 0.02490234375, 'epoch': 0.1}
[2025-03-02 18:21:29,367] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
11%|█ | 451/4286 [3:24:06<28:02:38, 26.33s/it] {'loss': 0.0008, 'grad_norm': 0.3581707896472147, 'learning_rate': 8.947736817545497e-07, 'completion_length': 307.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.0416666679084301, 'kl': 0.01947021484375, 'epoch': 0.11}
[2025-03-02 18:21:56,097] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
11%|█ | 452/4286 [3:24:33<28:09:57, 26.45s/it] {'loss': 0.0011, 'grad_norm': 0.3909800877703422, 'learning_rate': 8.94540363975735e-07, 'completion_length': 319.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.5595238655805588, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5416668057441711, 'reward_std': 0.12885063514113426, 'kl': 0.02728271484375, 'epoch': 0.11}
[2025-03-02 18:22:23,964] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
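The recurring stage3 warning points at allocator memory pressure; the "settings to reduce memory consumption" it refers to are the ZeRO-3 knobs in the DeepSpeed config. One illustrative direction, a sketch only (these keys exist in DeepSpeed's ZeRO config, but the values are examples, not the config used for this run), is to shrink the resident parameter working set and the communication buffers:

    # Illustrative ZeRO-3 settings that trade some speed for lower allocator pressure.
    ds_config = {
        "zero_optimization": {
            "stage": 3,
            "stage3_max_live_parameters": 5e8,   # fewer gathered params resident at once
            "stage3_max_reuse_distance": 5e8,    # release fetched partitions sooner
            "stage3_prefetch_bucket_size": 1e8,  # smaller prefetch buffers
            "reduce_bucket_size": 1e8,           # smaller gradient-reduction buffers
        }
    }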
11%|█ | 453/4286 [3:25:01<28:36:43, 26.87s/it] {'loss': 0.0008, 'grad_norm': 0.3385800918886004, 'learning_rate': 8.943070461969202e-07, 'completion_length': 339.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7306549549102783, 'reward_std': 0.11269348859786987, 'kl': 0.02056884765625, 'epoch': 0.11}
11%|█ | 454/4286 [3:25:30<29:09:38, 27.40s/it] {'loss': 0.001, 'grad_norm': 0.6452449133797735, 'learning_rate': 8.940737284181055e-07, 'completion_length': 317.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.06148636154830456, 'kl': 0.02374267578125, 'epoch': 0.11}
[2025-03-02 18:23:20,529] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
11%|█ | 455/4286 [3:25:58<29:19:50, 27.56s/it] {'loss': 0.0008, 'grad_norm': 0.4255726762023754, 'learning_rate': 8.938404106392907e-07, 'completion_length': 343.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7484694719314575, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7306123971939087, 'reward_std': 0.11807328835129738, 'kl': 0.02105712890625, 'epoch': 0.11}
11%|█ | 456/4286 [3:26:26<29:31:30, 27.75s/it] {'loss': 0.0009, 'grad_norm': 0.8144578351509676, 'learning_rate': 8.93607092860476e-07, 'completion_length': 312.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5833334922790527, 'reward_std': 0.15282289683818817, 'kl': 0.02252197265625, 'epoch': 0.11}
[2025-03-02 18:24:15,451] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
11%|█ | 457/4286 [3:26:53<29:11:25, 27.44s/it] {'loss': 0.0009, 'grad_norm': 0.7312703720189287, 'learning_rate': 8.933737750816612e-07, 'completion_length': 284.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8095238208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7916668057441711, 'reward_std': 0.059523805975914, 'kl': 0.0224609375, 'epoch': 0.11}
[2025-03-02 18:24:42,026] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
11%|█ | 458/4286 [3:27:19<28:54:17, 27.18s/it] {'loss': 0.0009, 'grad_norm': 0.6264009246861371, 'learning_rate': 8.931404573028464e-07, 'completion_length': 302.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.610119104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5922620296478271, 'reward_std': 0.044342199340462685, 'kl': 0.02166748046875, 'epoch': 0.11}
[2025-03-02 18:25:08,688] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
11%|█ | 459/4286 [3:27:46<28:43:51, 27.03s/it] {'loss': 0.0008, 'grad_norm': 0.27360811833155774, 'learning_rate': 8.929071395240318e-07, 'completion_length': 336.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7982143759727478, 'rewards/format_reward': 1.0, 'reward': 1.7982143759727478, 'reward_std': 0.04257730860263109, 'kl': 0.0201416015625, 'epoch': 0.11}
11%|█ | 460/4286 [3:28:10<27:53:27, 26.24s/it] {'loss': 0.001, 'grad_norm': 0.5278892897891729, 'learning_rate': 8.92673821745217e-07, 'completion_length': 289.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6681548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6681548357009888, 'reward_std': 0.07029405422508717, 'kl': 0.02484130859375, 'epoch': 0.11}
11%|█ | 461/4286 [3:28:35<27:31:15, 25.90s/it] {'loss': 0.0011, 'grad_norm': 0.793639509357224, 'learning_rate': 8.924405039664022e-07, 'completion_length': 294.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.4791666716337204, 'rewards/format_reward': 1.0, 'reward': 1.4791668057441711, 'reward_std': 0.0416666716337204, 'kl': 0.0281982421875, 'epoch': 0.11}
11%|█ | 462/4286 [3:29:02<27:40:33, 26.05s/it] {'loss': 0.0008, 'grad_norm': 0.4996989203621799, 'learning_rate': 8.922071861875875e-07, 'completion_length': 318.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7961310148239136, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.08471458405256271, 'kl': 0.02081298828125, 'epoch': 0.11}
11%|█ | 463/4286 [3:29:26<27:10:57, 25.60s/it] {'loss': 0.0008, 'grad_norm': 0.2900438664350367, 'learning_rate': 8.919738684087728e-07, 'completion_length': 301.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.031603582203388214, 'kl': 0.0203857421875, 'epoch': 0.11}
11%|█ | 464/4286 [3:29:51<27:02:54, 25.48s/it] {'loss': 0.0009, 'grad_norm': 0.3621679088261161, 'learning_rate': 8.91740550629958e-07, 'completion_length': 312.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.09452664852142334, 'kl': 0.02178955078125, 'epoch': 0.11}
11%|█ | 465/4286 [3:30:18<27:20:05, 25.75s/it] {'loss': 0.0008, 'grad_norm': 2.0121804821170746, 'learning_rate': 8.915072328511432e-07, 'completion_length': 311.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7229167222976685, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7050597071647644, 'reward_std': 0.19510606676340103, 'kl': 0.02099609375, 'epoch': 0.11}
11%|█ | 466/4286 [3:30:41<26:33:07, 25.02s/it] {'loss': 0.0008, 'grad_norm': 0.5802317944491948, 'learning_rate': 8.912739150723285e-07, 'completion_length': 284.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6458333432674408, 'rewards/format_reward': 1.0, 'reward': 1.6458334922790527, 'reward_std': 0.05633393442258239, 'kl': 0.02044677734375, 'epoch': 0.11}
11%|█ | 467/4286 [3:31:07<26:41:08, 25.16s/it] {'loss': 0.0007, 'grad_norm': 0.31587552237096356, 'learning_rate': 8.910405972935138e-07, 'completion_length': 304.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 1.0, 'reward': 1.7574405670166016, 'reward_std': 0.04090644046664238, 'kl': 0.017059326171875, 'epoch': 0.11}
11%|█ | 468/4286 [3:31:33<26:55:28, 25.39s/it] {'loss': 0.0008, 'grad_norm': 4.56452369485542, 'learning_rate': 8.90807279514699e-07, 'completion_length': 286.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6737554371356964, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.65589839220047, 'reward_std': 0.11266081407666206, 'kl': 0.01959228515625, 'epoch': 0.11}
11%|█ | 469/4286 [3:31:58<26:49:47, 25.30s/it] {'loss': 0.0008, 'grad_norm': 0.22260790308464246, 'learning_rate': 8.905739617358843e-07, 'completion_length': 300.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7247024476528168, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.0295482249930501, 'kl': 0.01983642578125, 'epoch': 0.11}
11%|█ | 470/4286 [3:32:22<26:32:09, 25.03s/it] {'loss': 0.0011, 'grad_norm': 0.8037124102128702, 'learning_rate': 8.903406439570695e-07, 'completion_length': 276.69644927978516, 'rewards/only_full_func_accuracy_reward': 0.6357143223285675, 'rewards/format_reward': 1.0, 'reward': 1.6357144713401794, 'reward_std': 0.1423376202583313, 'kl': 0.02716064453125, 'epoch': 0.11}
11%|█ | 471/4286 [3:32:46<26:03:00, 24.58s/it] {'loss': 0.0007, 'grad_norm': 0.5602844740094725, 'learning_rate': 8.901073261782548e-07, 'completion_length': 278.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7008929550647736, 'rewards/format_reward': 1.0, 'reward': 1.700892984867096, 'reward_std': 0.08404048904776573, 'kl': 0.0181884765625, 'epoch': 0.11}
11%|█ | 472/4286 [3:33:10<25:51:55, 24.41s/it] {'loss': 0.0009, 'grad_norm': 1.1724410609568974, 'learning_rate': 8.898740083994401e-07, 'completion_length': 296.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.025289656594395638, 'kl': 0.0224609375, 'epoch': 0.11}
[2025-03-02 18:30:57,796] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
11%|█ | 473/4286 [3:33:35<26:07:57, 24.67s/it] {'loss': 0.001, 'grad_norm': 1.0303484447140785, 'learning_rate': 8.896406906206253e-07, 'completion_length': 270.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.6651786267757416, 'rewards/format_reward': 1.0, 'reward': 1.6651785969734192, 'reward_std': 0.10600834712386131, 'kl': 0.02459716796875, 'epoch': 0.11}
[2025-03-02 18:31:24,005] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
11%|█ | 474/4286 [3:34:01<26:36:49, 25.13s/it] {'loss': 0.0009, 'grad_norm': 0.4478972888011463, 'learning_rate': 8.894073728418105e-07, 'completion_length': 288.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 1.0, 'reward': 1.598214328289032, 'reward_std': 0.08329574763774872, 'kl': 0.02325439453125, 'epoch': 0.11}
11%|█ | 475/4286 [3:34:25<26:19:35, 24.87s/it] {'loss': 0.0009, 'grad_norm': 0.3514661578014706, 'learning_rate': 8.891740550629959e-07, 'completion_length': 306.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.04442917322739959, 'kl': 0.02166748046875, 'epoch': 0.11}
[2025-03-02 18:32:12,435] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
11%|█ | 476/4286 [3:34:50<26:06:01, 24.66s/it] {'loss': 0.0008, 'grad_norm': 0.35286580275320606, 'learning_rate': 8.889407372841811e-07, 'completion_length': 285.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098215818405151, 'reward_std': 0.020833336748182774, 'kl': 0.02044677734375, 'epoch': 0.11}
11%|█ | 477/4286 [3:35:15<26:23:49, 24.95s/it] {'loss': 0.001, 'grad_norm': 0.7244965076334866, 'learning_rate': 8.887074195053663e-07, 'completion_length': 301.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.555059552192688, 'rewards/format_reward': 1.0, 'reward': 1.5550596714019775, 'reward_std': 0.06845237873494625, 'kl': 0.02618408203125, 'epoch': 0.11}
11%|█ | 478/4286 [3:35:39<25:59:02, 24.56s/it] {'loss': 0.0009, 'grad_norm': 0.598501473508207, 'learning_rate': 8.884741017265515e-07, 'completion_length': 289.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.12315377965569496, 'kl': 0.021728515625, 'epoch': 0.11}
11%|█ | 479/4286 [3:36:04<26:16:04, 24.84s/it] {'loss': 0.001, 'grad_norm': 0.4186103993250743, 'learning_rate': 8.882407839477369e-07, 'completion_length': 308.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.438988134264946, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4211310744285583, 'reward_std': 0.11934426799416542, 'kl': 0.0245361328125, 'epoch': 0.11}
11%|█ | 480/4286 [3:36:29<26:21:38, 24.93s/it] {'loss': 0.0013, 'grad_norm': 0.8416735887371124, 'learning_rate': 8.880074661689221e-07, 'completion_length': 290.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7232142984867096, 'rewards/format_reward': 1.0, 'reward': 1.7232143878936768, 'reward_std': 0.0886116186156869, 'kl': 0.0330810546875, 'epoch': 0.11}
11%|█ | 481/4286 [3:36:54<26:19:02, 24.90s/it] {'loss': 0.0012, 'grad_norm': 0.28988138754397125, 'learning_rate': 8.877741483901073e-07, 'completion_length': 271.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.5729167014360428, 'rewards/format_reward': 1.0, 'reward': 1.5729167461395264, 'reward_std': 0.04136601369827986, 'kl': 0.02947998046875, 'epoch': 0.11}
[2025-03-02 18:34:42,612] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
11%|█ | 482/4286 [3:37:20<26:28:50, 25.06s/it] {'loss': 0.0011, 'grad_norm': 0.8472297674135192, 'learning_rate': 8.875408306112926e-07, 'completion_length': 278.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6398810297250748, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.0535597475245595, 'kl': 0.02642822265625, 'epoch': 0.11}
11%|█▏ | 483/4286 [3:37:45<26:34:59, 25.16s/it] {'loss': 0.0009, 'grad_norm': 0.6172120520008634, 'learning_rate': 8.873075128324778e-07, 'completion_length': 285.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.0807027593255043, 'kl': 0.02374267578125, 'epoch': 0.11}
11%|█▏ | 484/4286 [3:38:11<26:42:29, 25.29s/it] {'loss': 0.0008, 'grad_norm': 0.7521990468753385, 'learning_rate': 8.870741950536631e-07, 'completion_length': 301.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.08441393449902534, 'kl': 0.02093505859375, 'epoch': 0.11}
11%|█▏ | 485/4286 [3:38:37<27:03:58, 25.63s/it] {'loss': 0.001, 'grad_norm': 0.39903070442976196, 'learning_rate': 8.868408772748484e-07, 'completion_length': 267.3928756713867, 'rewards/only_full_func_accuracy_reward': 0.7848639786243439, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.749149739742279, 'reward_std': 0.16079533100128174, 'kl': 0.02508544921875, 'epoch': 0.11}
11%|█▏ | 486/4286 [3:39:02<26:54:42, 25.50s/it] {'loss': 0.0009, 'grad_norm': 0.3613976585313253, 'learning_rate': 8.866075594960336e-07, 'completion_length': 310.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.8110119700431824, 'rewards/format_reward': 1.0, 'reward': 1.8110119700431824, 'reward_std': 0.0516753401607275, 'kl': 0.02264404296875, 'epoch': 0.11}
11%|█▏ | 487/4286 [3:39:26<26:25:09, 25.04s/it] {'loss': 0.0008, 'grad_norm': 0.4178696110683811, 'learning_rate': 8.863742417172188e-07, 'completion_length': 278.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6172619462013245, 'rewards/format_reward': 1.0, 'reward': 1.6172619462013245, 'reward_std': 0.10752441734075546, 'kl': 0.02099609375, 'epoch': 0.11}
11%|█▏ | 488/4286 [3:39:51<26:10:51, 24.82s/it] {'loss': 0.0009, 'grad_norm': 0.9622273137611418, 'learning_rate': 8.861409239384041e-07, 'completion_length': 297.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7008928954601288, 'rewards/format_reward': 1.0, 'reward': 1.7008929252624512, 'reward_std': 0.06977971643209457, 'kl': 0.0220947265625, 'epoch': 0.11}
11%|█▏ | 489/4286 [3:40:16<26:21:29, 24.99s/it] {'loss': 0.0009, 'grad_norm': 0.8359751173219336, 'learning_rate': 8.859076061595894e-07, 'completion_length': 322.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 1.0, 'reward': 1.6860119700431824, 'reward_std': 0.08471458777785301, 'kl': 0.02264404296875, 'epoch': 0.11}
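The learning_rate column falls by the same amount every step, and the values fit a linear decay to zero over the 4286 total steps from a 1e-06 peak, i.e. lr(step) = (4286 - step) / 4286 * 1e-06. This is an observation fit to the logged values, not a statement read from the training config; it can be spot-checked directly:

    # Verify the linear-decay fit against two logged steps.
    peak, total = 1e-06, 4286
    assert abs((total - 376) / total * peak - 9.122725151656556e-07) < 1e-15
    assert abs((total - 377) / total * peak - 9.120391973868409e-07) < 1e-15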
11%|█▏ | 490/4286 [3:40:40<26:02:11, 24.69s/it] {'loss': 0.0008, 'grad_norm': 0.2942779900377205, 'learning_rate': 8.856742883807746e-07, 'completion_length': 273.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.07284288853406906, 'kl': 0.0208740234375, 'epoch': 0.11}
11%|█▏ | 491/4286 [3:41:05<26:04:11, 24.73s/it] {'loss': 0.0008, 'grad_norm': 0.4441552366912381, 'learning_rate': 8.854409706019598e-07, 'completion_length': 299.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.09991235285997391, 'kl': 0.01904296875, 'epoch': 0.11}
11%|█▏ | 492/4286 [3:41:28<25:27:19, 24.15s/it] {'loss': 0.0011, 'grad_norm': 10.67205931431385, 'learning_rate': 8.852076528231452e-07, 'completion_length': 261.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.04936028644442558, 'kl': 0.02850341796875, 'epoch': 0.11}
12%|█▏ | 493/4286 [3:41:52<25:28:50, 24.18s/it] {'loss': 0.001, 'grad_norm': 0.31560378173231285, 'learning_rate': 8.849743350443304e-07, 'completion_length': 268.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7217261791229248, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.04090643487870693, 'kl': 0.025634765625, 'epoch': 0.12}
12%|█▏ | 494/4286 [3:42:16<25:36:00, 24.30s/it] {'loss': 0.0009, 'grad_norm': 0.5426830377752013, 'learning_rate': 8.847410172655156e-07, 'completion_length': 297.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6235119253396988, 'rewards/format_reward': 1.0, 'reward': 1.6235120296478271, 'reward_std': 0.056120261549949646, 'kl': 0.0224609375, 'epoch': 0.12}
12%|█▏ | 495/4286 [3:42:41<25:44:50, 24.45s/it] {'loss': 0.0009, 'grad_norm': 0.38448923972336946, 'learning_rate': 8.845076994867009e-07, 'completion_length': 289.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.8437500298023224, 'rewards/format_reward': 1.0, 'reward': 1.8437500596046448, 'reward_std': 0.055526357144117355, 'kl': 0.02301025390625, 'epoch': 0.12}
12%|█▏ | 496/4286 [3:43:06<25:55:48, 24.63s/it] {'loss': 0.0009, 'grad_norm': 1.8653366739763038, 'learning_rate': 8.842743817078862e-07, 'completion_length': 332.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7407738864421844, 'rewards/format_reward': 1.0, 'reward': 1.740773856639862, 'reward_std': 0.08496995642781258, 'kl': 0.02288818359375, 'epoch': 0.12}
12%|█▏ | 497/4286 [3:43:33<26:43:52, 25.40s/it] {'loss': 0.0009, 'grad_norm': 0.824388991205001, 'learning_rate': 8.840410639290714e-07, 'completion_length': 304.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6865079402923584, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6507937908172607, 'reward_std': 0.14613492041826248, 'kl': 0.0223388671875, 'epoch': 0.12}
12%|█▏ | 498/4286 [3:43:58<26:27:14, 25.14s/it] {'loss': 0.0009, 'grad_norm': 0.40642511270913423, 'learning_rate': 8.838077461502567e-07, 'completion_length': 280.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7976190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.08600886911153793, 'kl': 0.0225830078125, 'epoch': 0.12}
12%|█▏ | 499/4286 [3:44:23<26:31:21, 25.21s/it] {'loss': 0.0009, 'grad_norm': 0.6919999011012847, 'learning_rate': 8.835744283714419e-07, 'completion_length': 311.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.07716727629303932, 'kl': 0.021484375, 'epoch': 0.12}
12%|█▏ | 500/4286 [3:44:50<26:50:36, 25.52s/it] {'loss': 0.0009, 'grad_norm': 0.6614769312984814, 'learning_rate': 8.833411105926272e-07, 'completion_length': 327.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7130952775478363, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.695238173007965, 'reward_std': 0.11388125643134117, 'kl': 0.0234375, 'epoch': 0.12}
12%|█▏ | 501/4286 [3:49:49<113:13:19, 107.69s/it] {'loss': 0.001, 'grad_norm': 0.5176235969049536, 'learning_rate': 8.831077928138124e-07, 'completion_length': 308.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.5639881193637848, 'rewards/format_reward': 1.0, 'reward': 1.563988208770752, 'reward_std': 0.15402043610811234, 'kl': 0.025146484375, 'epoch': 0.12}
12%|█▏ | 502/4286 [3:50:13<86:55:26, 82.70s/it] {'loss': 0.0008, 'grad_norm': 0.39553421757264634, 'learning_rate': 8.828744750349977e-07, 'completion_length': 305.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6220238506793976, 'rewards/format_reward': 1.0, 'reward': 1.6220239400863647, 'reward_std': 0.1468617133796215, 'kl': 0.0211181640625, 'epoch': 0.12}
12%|█▏ | 503/4286 [3:50:41<69:24:00, 66.04s/it] {'loss': 0.0009, 'grad_norm': 0.37667876570440284, 'learning_rate': 8.826411572561829e-07, 'completion_length': 318.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.5565476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5386905670166016, 'reward_std': 0.08769077993929386, 'kl': 0.0218505859375, 'epoch': 0.12}
12%|█▏ | 504/4286 [3:51:05<56:21:23, 53.64s/it] {'loss': 0.0009, 'grad_norm': 2.237717200310406, 'learning_rate': 8.824078394773681e-07, 'completion_length': 303.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.0841057263314724, 'kl': 0.02215576171875, 'epoch': 0.12}
12%|█▏ | 505/4286 [3:51:30<47:21:31, 45.09s/it] {'loss': 0.001, 'grad_norm': 0.6983798624376512, 'learning_rate': 8.821745216985535e-07, 'completion_length': 311.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5491071939468384, 'rewards/format_reward': 1.0, 'reward': 1.5491072535514832, 'reward_std': 0.07624643761664629, 'kl': 0.0260009765625, 'epoch': 0.12}
12%|█▏ | 506/4286 [3:51:56<41:09:36, 39.20s/it] {'loss': 0.0009, 'grad_norm': 0.5148257332539213, 'learning_rate': 8.819412039197387e-07, 'completion_length': 319.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7836310267448425, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.765773892402649, 'reward_std': 0.11574145406484604, 'kl': 0.0223388671875, 'epoch': 0.12}
12%|█▏ | 507/4286 [3:52:20<36:27:57, 34.74s/it] {'loss': 0.0009, 'grad_norm': 0.3440265227232305, 'learning_rate': 8.817078861409239e-07, 'completion_length': 300.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6443452537059784, 'rewards/format_reward': 1.0, 'reward': 1.6443453431129456, 'reward_std': 0.04362159222364426, 'kl': 0.02166748046875, 'epoch': 0.12}
12%|█▏ | 508/4286 [3:52:45<33:27:53, 31.89s/it] {'loss': 0.0008, 'grad_norm': 0.885943724918231, 'learning_rate': 8.814745683621092e-07, 'completion_length': 322.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.65327388048172, 'reward_std': 0.05654761753976345, 'kl': 0.02069091796875, 'epoch': 0.12}
12%|█▏ | 509/4286 [3:53:12<31:45:28, 30.27s/it] {'loss': 0.001, 'grad_norm': 0.799741281381053, 'learning_rate': 8.812412505832945e-07, 'completion_length': 325.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7639881372451782, 'rewards/format_reward': 1.0, 'reward': 1.7639882564544678, 'reward_std': 0.10368956997990608, 'kl': 0.02508544921875, 'epoch': 0.12}
12%|█▏ | 510/4286 [3:53:38<30:19:24, 28.91s/it] {'loss': 0.0009, 'grad_norm': 0.4172831107240831, 'learning_rate': 8.810079328044797e-07, 'completion_length': 290.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5639881640672684, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5461310148239136, 'reward_std': 0.1357702501118183, 'kl': 0.02337646484375, 'epoch': 0.12}
12%|█▏ | 511/4286 [3:54:05<29:54:20, 28.52s/it] {'loss': 0.0009, 'grad_norm': 1.3438303598342072, 'learning_rate': 8.807746150256649e-07, 'completion_length': 305.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7488095760345459, 'rewards/format_reward': 1.0, 'reward': 1.7488096952438354, 'reward_std': 0.10999840497970581, 'kl': 0.02362060546875, 'epoch': 0.12}
12%|█▏ | 512/4286 [3:54:32<29:13:39, 27.88s/it] {'loss': 0.0009, 'grad_norm': 0.7092338089482254, 'learning_rate': 8.805412972468502e-07, 'completion_length': 290.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7422619163990021, 'rewards/format_reward': 1.0, 'reward': 1.7422619462013245, 'reward_std': 0.029272289481014013, 'kl': 0.0213623046875, 'epoch': 0.12}
12%|█▏ | 513/4286 [3:54:58<28:41:13, 27.37s/it] {'loss': 0.0009, 'grad_norm': 0.2624356194214735, 'learning_rate': 8.803079794680355e-07, 'completion_length': 316.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7494048476219177, 'rewards/format_reward': 1.0, 'reward': 1.7494048476219177, 'reward_std': 0.05476190894842148, 'kl': 0.022705078125, 'epoch': 0.12}
12%|█▏ | 514/4286 [3:55:22<27:43:23, 26.46s/it] {'loss': 0.0011, 'grad_norm': 0.5585788537105563, 'learning_rate': 8.800746616892207e-07, 'completion_length': 303.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.0476190485060215, 'kl': 0.0267333984375, 'epoch': 0.12}
12%|█▏ | 515/4286
[3:55:49<27:53:02, 26.62s/it] {'loss': 0.001, 'grad_norm': 7.6704206128210295, 'learning_rate': 8.79841343910406e-07, 'completion_length': 341.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.7041666805744171, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6863096356391907, 'reward_std': 0.14269215613603592, 'kl': 0.026123046875, 'epoch': 0.12} 12%|█▏ | 515/4286 [3:55:49<27:53:02, 26.62s/it] 12%|█▏ | 516/4286 [3:56:14<27:24:51, 26.18s/it] {'loss': 0.0011, 'grad_norm': 0.7427645470800521, 'learning_rate': 8.796080261315912e-07, 'completion_length': 331.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.09364314936101437, 'kl': 0.02825927734375, 'epoch': 0.12} 12%|█▏ | 516/4286 [3:56:14<27:24:51, 26.18s/it] 12%|█▏ | 517/4286 [3:56:41<27:26:56, 26.22s/it] {'loss': 0.0009, 'grad_norm': 0.49735943146768646, 'learning_rate': 8.793747083527765e-07, 'completion_length': 316.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7053572535514832, 'reward_std': 0.10787562653422356, 'kl': 0.0228271484375, 'epoch': 0.12} 12%|█▏ | 517/4286 [3:56:41<27:26:56, 26.22s/it] 12%|█▏ | 518/4286 [3:57:07<27:21:23, 26.14s/it] {'loss': 0.0009, 'grad_norm': 1.4975708800086034, 'learning_rate': 8.791413905739618e-07, 'completion_length': 330.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.599702462553978, 'rewards/format_reward': 1.0, 'reward': 1.599702537059784, 'reward_std': 0.1141766831278801, 'kl': 0.0224609375, 'epoch': 0.12} 12%|█▏ | 518/4286 [3:57:07<27:21:23, 26.14s/it] 12%|█▏ | 519/4286 [3:57:33<27:26:26, 26.22s/it] {'loss': 0.0009, 'grad_norm': 0.24819419973695334, 'learning_rate': 8.78908072795147e-07, 'completion_length': 341.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.7717262208461761, 'rewards/format_reward': 1.0, 'reward': 1.7717262506484985, 'reward_std': 0.03320222860202193, 'kl': 0.0216064453125, 'epoch': 0.12} 12%|█▏ | 519/4286 [3:57:33<27:26:26, 26.22s/it] 12%|█▏ | 520/4286 [3:58:01<27:49:42, 26.60s/it] {'loss': 0.0011, 'grad_norm': 0.46553381180629155, 'learning_rate': 8.786747550163322e-07, 'completion_length': 330.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5290178805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5111608505249023, 'reward_std': 0.1514144204556942, 'kl': 0.02734375, 'epoch': 0.12} 12%|█▏ | 520/4286 [3:58:01<27:49:42, 26.60s/it] 12%|█▏ | 521/4286 [3:58:27<27:49:14, 26.60s/it] {'loss': 0.0009, 'grad_norm': 0.17529163942009746, 'learning_rate': 8.784414372375176e-07, 'completion_length': 335.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6127976179122925, 'rewards/format_reward': 1.0, 'reward': 1.6127976775169373, 'reward_std': 0.032257046550512314, 'kl': 0.02215576171875, 'epoch': 0.12} 12%|█▏ | 521/4286 [3:58:27<27:49:14, 26.60s/it] 12%|█▏ | 522/4286 [3:58:54<27:50:22, 26.63s/it] {'loss': 0.0007, 'grad_norm': 0.2827739166549855, 'learning_rate': 8.782081194587028e-07, 'completion_length': 339.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.057041093707084656, 'kl': 0.01702880859375, 'epoch': 0.12} 12%|█▏ | 522/4286 [3:58:54<27:50:22, 26.63s/it] 12%|█▏ | 523/4286 [3:59:20<27:35:18, 26.39s/it] {'loss': 0.0009, 'grad_norm': 0.44622376553614307, 'learning_rate': 8.77974801679888e-07, 'completion_length': 
291.7678756713867, 'rewards/only_full_func_accuracy_reward': 0.56101194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5431548357009888, 'reward_std': 0.09502441808581352, 'kl': 0.0225830078125, 'epoch': 0.12} 12%|█▏ | 523/4286 [3:59:20<27:35:18, 26.39s/it][2025-03-02 18:57:09,031] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 12%|█▏ | 524/4286 [3:59:46<27:36:07, 26.41s/it] {'loss': 0.0009, 'grad_norm': 0.42585148620246566, 'learning_rate': 8.777414839010732e-07, 'completion_length': 324.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.10580769926309586, 'kl': 0.02288818359375, 'epoch': 0.12} 12%|█▏ | 524/4286 [3:59:46<27:36:07, 26.41s/it] 12%|█▏ | 525/4286 [4:00:12<27:17:36, 26.13s/it] {'loss': 0.0008, 'grad_norm': 0.6772363549626016, 'learning_rate': 8.775081661222586e-07, 'completion_length': 308.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.592262089252472, 'reward_std': 0.07994920760393143, 'kl': 0.02001953125, 'epoch': 0.12} 12%|█▏ | 525/4286 [4:00:12<27:17:36, 26.13s/it] 12%|█▏ | 526/4286 [4:00:39<27:35:55, 26.42s/it] {'loss': 0.0009, 'grad_norm': 0.34138319024015346, 'learning_rate': 8.772748483434438e-07, 'completion_length': 338.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5729168057441711, 'reward_std': 0.17048469930887222, 'kl': 0.02294921875, 'epoch': 0.12} 12%|█▏ | 526/4286 [4:00:39<27:35:55, 26.42s/it] 12%|█▏ | 527/4286 [4:01:05<27:33:34, 26.39s/it] {'loss': 0.0012, 'grad_norm': 0.42614776170578095, 'learning_rate': 8.77041530564629e-07, 'completion_length': 340.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6157738268375397, 'rewards/format_reward': 1.0, 'reward': 1.6157739162445068, 'reward_std': 0.08233362808823586, 'kl': 0.029052734375, 'epoch': 0.12} 12%|█▏ | 527/4286 [4:01:05<27:33:34, 26.39s/it] 12%|█▏ | 528/4286 [4:01:32<27:42:05, 26.54s/it] {'loss': 0.001, 'grad_norm': 0.542898095620541, 'learning_rate': 8.768082127858143e-07, 'completion_length': 327.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.57738097012043, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5595239400863647, 'reward_std': 0.12432590499520302, 'kl': 0.025146484375, 'epoch': 0.12} 12%|█▏ | 528/4286 [4:01:32<27:42:05, 26.54s/it] 12%|█▏ | 529/4286 [4:01:58<27:34:36, 26.42s/it] {'loss': 0.0011, 'grad_norm': 0.4788126706352359, 'learning_rate': 8.765748950069996e-07, 'completion_length': 331.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.722321480512619, 'rewards/format_reward': 1.0, 'reward': 1.7223215699195862, 'reward_std': 0.07632800936698914, 'kl': 0.02667236328125, 'epoch': 0.12} 12%|█▏ | 529/4286 [4:01:58<27:34:36, 26.42s/it] 12%|█▏ | 530/4286 [4:02:24<27:19:52, 26.20s/it] {'loss': 0.001, 'grad_norm': 0.38905560230862357, 'learning_rate': 8.763415772281848e-07, 'completion_length': 320.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6636904776096344, 
'rewards/format_reward': 1.0, 'reward': 1.6636906266212463, 'reward_std': 0.0662089604884386, 'kl': 0.0252685546875, 'epoch': 0.12} 12%|█▏ | 530/4286 [4:02:24<27:19:52, 26.20s/it] 12%|█▏ | 531/4286 [4:02:49<27:09:17, 26.03s/it] {'loss': 0.0011, 'grad_norm': 0.8650126630871073, 'learning_rate': 8.761082594493701e-07, 'completion_length': 320.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.49765412509441376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4797970056533813, 'reward_std': 0.12101073563098907, 'kl': 0.02703857421875, 'epoch': 0.12} 12%|█▏ | 531/4286 [4:02:49<27:09:17, 26.03s/it] 12%|█▏ | 532/4286 [4:03:14<26:41:21, 25.59s/it] {'loss': 0.0012, 'grad_norm': 0.6307096090843309, 'learning_rate': 8.758749416705553e-07, 'completion_length': 297.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.6495536267757416, 'rewards/format_reward': 1.0, 'reward': 1.6495537161827087, 'reward_std': 0.058035713620483875, 'kl': 0.0289306640625, 'epoch': 0.12} 12%|█▏ | 532/4286 [4:03:14<26:41:21, 25.59s/it] 12%|█▏ | 533/4286 [4:03:41<27:00:57, 25.91s/it] {'loss': 0.001, 'grad_norm': 0.2516061146422571, 'learning_rate': 8.756416238917405e-07, 'completion_length': 328.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.5498511791229248, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5319942235946655, 'reward_std': 0.06696428917348385, 'kl': 0.024169921875, 'epoch': 0.12} 12%|█▏ | 533/4286 [4:03:41<27:00:57, 25.91s/it] 12%|█▏ | 534/4286 [4:04:08<27:26:26, 26.33s/it] {'loss': 0.0009, 'grad_norm': 0.4541337235949047, 'learning_rate': 8.754083061129258e-07, 'completion_length': 316.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6455357670783997, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6276786923408508, 'reward_std': 0.10507582500576973, 'kl': 0.02203369140625, 'epoch': 0.12} 12%|█▏ | 534/4286 [4:04:08<27:26:26, 26.33s/it] 12%|█▏ | 535/4286 [4:04:33<26:55:49, 25.85s/it] {'loss': 0.0011, 'grad_norm': 0.6539198807498768, 'learning_rate': 8.751749883341111e-07, 'completion_length': 296.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.19304209202528, 'kl': 0.02874755859375, 'epoch': 0.12} 12%|█▏ | 535/4286 [4:04:33<26:55:49, 25.85s/it] 13%|█▎ | 536/4286 [4:04:57<26:36:53, 25.55s/it] {'loss': 0.0009, 'grad_norm': 0.18528591575884992, 'learning_rate': 8.749416705552962e-07, 'completion_length': 305.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7991072237491608, 'rewards/format_reward': 1.0, 'reward': 1.7991072535514832, 'reward_std': 0.020833336748182774, 'kl': 0.0225830078125, 'epoch': 0.13} 13%|█▎ | 536/4286 [4:04:57<26:36:53, 25.55s/it] 13%|█▎ | 537/4286 [4:05:22<26:24:08, 25.35s/it] {'loss': 0.0009, 'grad_norm': 0.2738823795144467, 'learning_rate': 8.747083527764814e-07, 'completion_length': 312.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.6116071939468384, 'rewards/format_reward': 1.0, 'reward': 1.6116072535514832, 'reward_std': 0.06250000186264515, 'kl': 0.0228271484375, 'epoch': 0.13} 13%|█▎ | 537/4286 [4:05:22<26:24:08, 25.35s/it] 13%|█▎ | 538/4286 [4:05:49<26:39:20, 25.60s/it] {'loss': 0.0011, 'grad_norm': 1.3059570331003403, 'learning_rate': 8.744750349976668e-07, 'completion_length': 320.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.6937500834465027, 'rewards/format_reward': 1.0, 'reward': 1.6937501430511475, 'reward_std': 0.11782415211200714, 'kl': 0.028076171875, 'epoch': 0.13} 13%|█▎ 
| 538/4286 [4:05:49<26:39:20, 25.60s/it] 13%|█▎ | 539/4286 [4:06:15<26:51:05, 25.80s/it] {'loss': 0.001, 'grad_norm': 0.3388020659801409, 'learning_rate': 8.74241717218852e-07, 'completion_length': 319.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.039611320942640305, 'kl': 0.02593994140625, 'epoch': 0.13} 13%|█▎ | 539/4286 [4:06:15<26:51:05, 25.80s/it] 13%|█▎ | 540/4286 [4:06:40<26:42:37, 25.67s/it] {'loss': 0.001, 'grad_norm': 0.21721710613023804, 'learning_rate': 8.740083994400372e-07, 'completion_length': 301.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.5494047999382019, 'rewards/format_reward': 1.0, 'reward': 1.5494049191474915, 'reward_std': 0.053081818856298923, 'kl': 0.02410888671875, 'epoch': 0.13} 13%|█▎ | 540/4286 [4:06:40<26:42:37, 25.67s/it] 13%|█▎ | 541/4286 [4:07:05<26:35:43, 25.57s/it] {'loss': 0.0011, 'grad_norm': 0.840786772189801, 'learning_rate': 8.737750816612225e-07, 'completion_length': 286.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6636904776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.645833432674408, 'reward_std': 0.16359854117035866, 'kl': 0.0274658203125, 'epoch': 0.13} 13%|█▎ | 541/4286 [4:07:05<26:35:43, 25.57s/it] 13%|█▎ | 542/4286 [4:07:31<26:28:32, 25.46s/it] {'loss': 0.0011, 'grad_norm': 2.248816632267111, 'learning_rate': 8.735417638824078e-07, 'completion_length': 305.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5952381789684296, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.0956411175429821, 'kl': 0.02813720703125, 'epoch': 0.13} 13%|█▎ | 542/4286 [4:07:31<26:28:32, 25.46s/it] 13%|█▎ | 543/4286 [4:07:55<26:10:10, 25.17s/it] {'loss': 0.0012, 'grad_norm': 0.38720076710614665, 'learning_rate': 8.73308446103593e-07, 'completion_length': 265.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.4508928656578064, 'rewards/format_reward': 1.0, 'reward': 1.450892984867096, 'reward_std': 0.03961131442338228, 'kl': 0.03021240234375, 'epoch': 0.13} 13%|█▎ | 543/4286 [4:07:55<26:10:10, 25.17s/it] 13%|█▎ | 544/4286 [4:08:19<25:48:59, 24.84s/it] {'loss': 0.0011, 'grad_norm': 2.181615393002335, 'learning_rate': 8.730751283247782e-07, 'completion_length': 282.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.4717262536287308, 'rewards/format_reward': 1.0, 'reward': 1.4717263579368591, 'reward_std': 0.07121489383280277, 'kl': 0.0286865234375, 'epoch': 0.13} 13%|█▎ | 544/4286 [4:08:19<25:48:59, 24.84s/it][2025-03-02 19:06:07,171] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 13%|█▎ | 545/4286 [4:08:44<25:51:33, 24.88s/it] {'loss': 0.0011, 'grad_norm': 0.21891753237004297, 'learning_rate': 8.728418105459635e-07, 'completion_length': 305.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.5773810148239136, 'rewards/format_reward': 1.0, 'reward': 1.5773810744285583, 'reward_std': 0.022214585915207863, 'kl': 0.027587890625, 'epoch': 0.13} 13%|█▎ | 545/4286 [4:08:44<25:51:33, 24.88s/it] 13%|█▎ | 546/4286 [4:09:09<25:45:07, 24.79s/it] {'loss': 0.0012, 'grad_norm': 0.4238878482400712, 'learning_rate': 8.726084927671488e-07, 'completion_length': 306.0, 'rewards/only_full_func_accuracy_reward': 0.5848214626312256, 'rewards/format_reward': 1.0, 'reward': 1.5848215222358704, 'reward_std': 0.07619336247444153, 'kl': 0.02960205078125, 'epoch': 0.13} 13%|█▎ | 546/4286 [4:09:09<25:45:07, 24.79s/it] 13%|█▎ | 547/4286 [4:09:33<25:27:38, 24.51s/it] {'loss': 0.001, 'grad_norm': 0.44686903077755885, 'learning_rate': 8.72375174988334e-07, 'completion_length': 289.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.04596670996397734, 'kl': 0.0247802734375, 'epoch': 0.13} 13%|█▎ | 547/4286 [4:09:33<25:27:38, 24.51s/it] 13%|█▎ | 548/4286 [4:09:58<25:39:36, 24.71s/it] {'loss': 0.0011, 'grad_norm': 0.4109574392011173, 'learning_rate': 8.721418572095193e-07, 'completion_length': 309.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7014881074428558, 'rewards/format_reward': 1.0, 'reward': 1.701488196849823, 'reward_std': 0.053591263480484486, 'kl': 0.02752685546875, 'epoch': 0.13} 13%|█▎ | 548/4286 [4:09:58<25:39:36, 24.71s/it] 13%|█▎ | 549/4286 [4:10:24<26:06:12, 25.15s/it] {'loss': 0.001, 'grad_norm': 0.5126753613556713, 'learning_rate': 8.719085394307045e-07, 'completion_length': 290.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.6577381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6398810744285583, 'reward_std': 0.1071428507566452, 'kl': 0.025634765625, 'epoch': 0.13} 13%|█▎ | 549/4286 [4:10:24<26:06:12, 25.15s/it] 13%|█▎ | 550/4286 [4:10:49<26:10:24, 25.22s/it] {'loss': 0.0011, 'grad_norm': 1.5309873528663613, 'learning_rate': 8.716752216518897e-07, 'completion_length': 323.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5967261791229248, 'rewards/format_reward': 1.0, 'reward': 1.5967263579368591, 'reward_std': 0.09929489344358444, 'kl': 0.02783203125, 'epoch': 0.13} 13%|█▎ | 550/4286 [4:10:49<26:10:24, 25.22s/it] 13%|█▎ | 551/4286 [4:11:13<25:36:47, 24.69s/it] {'loss': 0.0011, 'grad_norm': 0.7193920381777092, 'learning_rate': 8.714419038730751e-07, 'completion_length': 275.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 1.0, 'reward': 1.7633929252624512, 'reward_std': 0.0848743487149477, 'kl': 0.02801513671875, 'epoch': 0.13} 13%|█▎ | 551/4286 [4:11:13<25:36:47, 24.69s/it] 13%|█▎ | 552/4286 [4:11:37<25:25:40, 24.52s/it] {'loss': 0.0009, 'grad_norm': 0.5065208931811235, 'learning_rate': 8.712085860942603e-07, 'completion_length': 275.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.052803294733166695, 'kl': 0.02325439453125, 'epoch': 0.13} 13%|█▎ | 552/4286 [4:11:37<25:25:40, 
24.52s/it] 13%|█▎ | 553/4286 [4:12:00<24:53:10, 24.00s/it] {'loss': 0.0011, 'grad_norm': 0.4801431444278899, 'learning_rate': 8.709752683154455e-07, 'completion_length': 253.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.727678656578064, 'rewards/format_reward': 1.0, 'reward': 1.7276787161827087, 'reward_std': 0.06674263998866081, 'kl': 0.02691650390625, 'epoch': 0.13} 13%|█▎ | 553/4286 [4:12:00<24:53:10, 24.00s/it] 13%|█▎ | 554/4286 [4:12:24<25:04:47, 24.19s/it] {'loss': 0.001, 'grad_norm': 0.3889934704010488, 'learning_rate': 8.707419505366308e-07, 'completion_length': 305.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.09162481501698494, 'kl': 0.0252685546875, 'epoch': 0.13} 13%|█▎ | 554/4286 [4:12:24<25:04:47, 24.19s/it] 13%|█▎ | 555/4286 [4:12:50<25:28:05, 24.57s/it] {'loss': 0.001, 'grad_norm': 0.1703067815895014, 'learning_rate': 8.705086327578161e-07, 'completion_length': 311.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.019698817282915115, 'kl': 0.0245361328125, 'epoch': 0.13} 13%|█▎ | 555/4286 [4:12:50<25:28:05, 24.57s/it] 13%|█▎ | 556/4286 [4:13:14<25:11:29, 24.31s/it] {'loss': 0.0009, 'grad_norm': 0.3346382173831901, 'learning_rate': 8.702753149790013e-07, 'completion_length': 302.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7484375238418579, 'rewards/format_reward': 1.0, 'reward': 1.7484376430511475, 'reward_std': 0.06480802595615387, 'kl': 0.02239990234375, 'epoch': 0.13} 13%|█▎ | 556/4286 [4:13:14<25:11:29, 24.31s/it] 13%|█▎ | 557/4286 [4:13:40<25:51:15, 24.96s/it] {'loss': 0.0012, 'grad_norm': 0.977192180150908, 'learning_rate': 8.700419972001865e-07, 'completion_length': 283.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7053572535514832, 'reward_std': 0.1657290756702423, 'kl': 0.0294189453125, 'epoch': 0.13} 13%|█▎ | 557/4286 [4:13:40<25:51:15, 24.96s/it] 13%|█▎ | 558/4286 [4:14:05<25:44:31, 24.86s/it] {'loss': 0.001, 'grad_norm': 0.25898609214389856, 'learning_rate': 8.698086794213718e-07, 'completion_length': 308.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.797619104385376, 'reward_std': 0.07142857648432255, 'kl': 0.02508544921875, 'epoch': 0.13} 13%|█▎ | 558/4286 [4:14:05<25:44:31, 24.86s/it] 13%|█▎ | 559/4286 [4:14:31<26:12:49, 25.32s/it] {'loss': 0.0012, 'grad_norm': 0.7558713846945284, 'learning_rate': 8.695753616425571e-07, 'completion_length': 294.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.07419108226895332, 'kl': 0.02978515625, 'epoch': 0.13} 13%|█▎ | 559/4286 [4:14:31<26:12:49, 25.32s/it] 13%|█▎ | 560/4286 [4:14:56<26:00:43, 25.13s/it] {'loss': 0.0012, 'grad_norm': 1.582996931833819, 'learning_rate': 8.693420438637423e-07, 'completion_length': 262.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6830359101295471, 'reward_std': 0.11960749514400959, 'kl': 0.02935791015625, 'epoch': 0.13} 13%|█▎ | 560/4286 [4:14:56<26:00:43, 25.13s/it] 13%|█▎ | 561/4286 [4:15:23<26:34:28, 25.68s/it] {'loss': 0.0011, 'grad_norm': 0.4574065366629001, 'learning_rate': 8.691087260849276e-07, 'completion_length': 
336.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7220238447189331, 'rewards/format_reward': 1.0, 'reward': 1.7220239043235779, 'reward_std': 0.1090964563190937, 'kl': 0.0283203125, 'epoch': 0.13} 13%|█▎ | 561/4286 [4:15:23<26:34:28, 25.68s/it] 13%|█▎ | 562/4286 [4:15:47<26:16:55, 25.41s/it] {'loss': 0.0012, 'grad_norm': 0.5852826286992907, 'learning_rate': 8.688754083061128e-07, 'completion_length': 279.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.035714281257241964, 'kl': 0.02978515625, 'epoch': 0.13} 13%|█▎ | 562/4286 [4:15:47<26:16:55, 25.41s/it] 13%|█▎ | 563/4286 [4:16:12<25:54:19, 25.05s/it] {'loss': 0.0013, 'grad_norm': 1.19546236137236, 'learning_rate': 8.686420905272981e-07, 'completion_length': 271.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7127977311611176, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.04841067083179951, 'kl': 0.03179931640625, 'epoch': 0.13} 13%|█▎ | 563/4286 [4:16:12<25:54:19, 25.05s/it][2025-03-02 19:14:01,479] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 13%|█▎ | 564/4286 [4:16:39<26:27:22, 25.59s/it] {'loss': 0.0011, 'grad_norm': 0.6702044184694598, 'learning_rate': 8.684087727484834e-07, 'completion_length': 348.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6564485132694244, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6207342147827148, 'reward_std': 0.1862412467598915, 'kl': 0.02813720703125, 'epoch': 0.13} 13%|█▎ | 564/4286 [4:16:39<26:27:22, 25.59s/it] 13%|█▎ | 565/4286 [4:17:05<26:41:05, 25.82s/it] {'loss': 0.0012, 'grad_norm': 0.6292697537513712, 'learning_rate': 8.681754549696686e-07, 'completion_length': 319.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6826106011867523, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6647535562515259, 'reward_std': 0.13136856257915497, 'kl': 0.0306396484375, 'epoch': 0.13} 13%|█▎ | 565/4286 [4:17:05<26:41:05, 25.82s/it] 13%|█▎ | 566/4286 [4:17:29<26:10:03, 25.32s/it] {'loss': 0.0012, 'grad_norm': 0.5367253681284464, 'learning_rate': 8.679421371908538e-07, 'completion_length': 292.25, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.06924401037395, 'kl': 0.02899169921875, 'epoch': 0.13} 13%|█▎ | 566/4286 [4:17:29<26:10:03, 25.32s/it][2025-03-02 19:15:17,610] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 13%|█▎ | 567/4286 [4:17:55<26:14:58, 25.41s/it] {'loss': 0.0014, 'grad_norm': 0.5141925855018504, 'learning_rate': 8.677088194120391e-07, 'completion_length': 297.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6949405670166016, 'reward_std': 0.18328632786870003, 'kl': 0.0345458984375, 'epoch': 0.13} 13%|█▎ | 567/4286 [4:17:55<26:14:58, 25.41s/it] 13%|█▎ | 568/4286 [4:18:21<26:28:49, 25.64s/it] {'loss': 0.0013, 'grad_norm': 1.103620098601905, 'learning_rate': 8.674755016332244e-07, 'completion_length': 317.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.6357143521308899, 'rewards/format_reward': 1.0, 'reward': 1.63571435213089, 'reward_std': 0.1159205436706543, 'kl': 0.0335693359375, 'epoch': 0.13} 13%|█▎ | 568/4286 [4:18:21<26:28:49, 25.64s/it] 13%|█▎ | 569/4286 [4:18:46<26:15:05, 25.43s/it] {'loss': 0.0014, 'grad_norm': 0.5011433144993759, 'learning_rate': 8.672421838544096e-07, 'completion_length': 273.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.07465635240077972, 'kl': 0.0343017578125, 'epoch': 0.13} 13%|█▎ | 569/4286 [4:18:46<26:15:05, 25.43s/it] 13%|█▎ | 570/4286 [4:19:11<26:12:16, 25.39s/it] {'loss': 0.0015, 'grad_norm': 0.9526274946532065, 'learning_rate': 8.670088660755948e-07, 'completion_length': 314.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.605654776096344, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.084917351603508, 'kl': 0.0379638671875, 'epoch': 0.13} 13%|█▎ | 570/4286 [4:19:11<26:12:16, 25.39s/it] 13%|█▎ | 571/4286 [4:19:37<26:13:58, 25.42s/it] {'loss': 0.0009, 'grad_norm': 0.331326675039131, 'learning_rate': 8.667755482967802e-07, 'completion_length': 313.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.8333334028720856, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.02816697023808956, 'kl': 0.02197265625, 'epoch': 0.13} 13%|█▎ | 571/4286 [4:19:37<26:13:58, 25.42s/it] 13%|█▎ | 572/4286 [4:20:04<26:46:00, 25.95s/it] {'loss': 0.001, 'grad_norm': 0.36577428395482015, 'learning_rate': 8.665422305179654e-07, 'completion_length': 333.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.7944303452968597, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7408589124679565, 'reward_std': 0.16694114357233047, 'kl': 0.026123046875, 'epoch': 0.13} 13%|█▎ | 572/4286 [4:20:04<26:46:00, 25.95s/it] 13%|█▎ | 573/4286 [4:20:29<26:35:33, 25.78s/it] {'loss': 0.0011, 'grad_norm': 0.6536341298178766, 'learning_rate': 8.663089127391506e-07, 'completion_length': 328.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7309524416923523, 'rewards/format_reward': 1.0, 'reward': 1.730952501296997, 'reward_std': 0.03716716915369034, 'kl': 0.02825927734375, 'epoch': 0.13} 13%|█▎ | 573/4286 [4:20:29<26:35:33, 25.78s/it] 13%|█▎ | 574/4286 [4:20:55<26:28:52, 25.68s/it] {'loss': 0.0015, 'grad_norm': 4.346112042044458, 'learning_rate': 8.660755949603359e-07, 'completion_length': 304.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.610119104385376, 'rewards/format_reward': 1.0, 'reward': 1.610119104385376, 'reward_std': 0.11322717182338238, 'kl': 0.03662109375, 'epoch': 0.13} 13%|█▎ | 574/4286 
[4:20:55<26:28:52, 25.68s/it] 13%|█▎ | 575/4286 [4:21:21<26:45:09, 25.95s/it] {'loss': 0.0012, 'grad_norm': 0.6179695159089438, 'learning_rate': 8.658422771815211e-07, 'completion_length': 300.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.06441530212759972, 'kl': 0.03094482421875, 'epoch': 0.13} 13%|█▎ | 575/4286 [4:21:21<26:45:09, 25.95s/it] 13%|█▎ | 576/4286 [4:21:45<26:10:39, 25.40s/it] {'loss': 0.0015, 'grad_norm': 1.0170739321732436, 'learning_rate': 8.656089594027064e-07, 'completion_length': 304.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6577381491661072, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.09364316426217556, 'kl': 0.037109375, 'epoch': 0.13} 13%|█▎ | 576/4286 [4:21:45<26:10:39, 25.40s/it] 13%|█▎ | 577/4286 [4:22:13<26:58:49, 26.19s/it] {'loss': 0.0012, 'grad_norm': 0.6303986377305335, 'learning_rate': 8.653756416238917e-07, 'completion_length': 304.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.854166716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8184524774551392, 'reward_std': 0.15816230326890945, 'kl': 0.03033447265625, 'epoch': 0.13} 13%|█▎ | 577/4286 [4:22:13<26:58:49, 26.19s/it] 13%|█▎ | 578/4286 [4:22:39<26:44:15, 25.96s/it] {'loss': 0.0011, 'grad_norm': 1.0008530612141344, 'learning_rate': 8.651423238450769e-07, 'completion_length': 316.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6744047701358795, 'rewards/format_reward': 1.0, 'reward': 1.6744048595428467, 'reward_std': 0.06290648132562637, 'kl': 0.0283203125, 'epoch': 0.13} 13%|█▎ | 578/4286 [4:22:39<26:44:15, 25.96s/it] 14%|█▎ | 579/4286 [4:23:05<26:54:20, 26.13s/it] {'loss': 0.0016, 'grad_norm': 0.8139103580981875, 'learning_rate': 8.649090060662621e-07, 'completion_length': 299.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.5729166865348816, 'rewards/format_reward': 1.0, 'reward': 1.5729168057441711, 'reward_std': 0.0863095298409462, 'kl': 0.0401611328125, 'epoch': 0.14} 14%|█▎ | 579/4286 [4:23:05<26:54:20, 26.13s/it][2025-03-02 19:20:55,317] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▎ | 580/4286 [4:23:32<27:12:10, 26.42s/it] {'loss': 0.0011, 'grad_norm': 0.4739753813771147, 'learning_rate': 8.646756882874474e-07, 'completion_length': 328.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6488095223903656, 'rewards/format_reward': 1.0, 'reward': 1.6488096714019775, 'reward_std': 0.03755595628172159, 'kl': 0.028564453125, 'epoch': 0.14} 14%|█▎ | 580/4286 [4:23:32<27:12:10, 26.42s/it] 14%|█▎ | 581/4286 [4:23:59<27:12:31, 26.44s/it] {'loss': 0.0013, 'grad_norm': 0.2656767191209774, 'learning_rate': 8.644423705086327e-07, 'completion_length': 336.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.7276786267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6919643878936768, 'reward_std': 0.10188196040689945, 'kl': 0.032958984375, 'epoch': 0.14} 14%|█▎ | 581/4286 [4:23:59<27:12:31, 26.44s/it] 14%|█▎ | 582/4286 [4:24:24<26:53:50, 26.14s/it] {'loss': 0.0017, 'grad_norm': 4.2540003420649795, 'learning_rate': 8.642090527298179e-07, 'completion_length': 297.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7208333611488342, 'rewards/format_reward': 1.0, 'reward': 1.720833420753479, 'reward_std': 0.0896662250161171, 'kl': 0.04290771484375, 'epoch': 0.14} 14%|█▎ | 582/4286 [4:24:24<26:53:50, 26.14s/it] 14%|█▎ | 583/4286 [4:24:48<26:17:04, 25.55s/it] {'loss': 0.001, 'grad_norm': 0.6388500972131704, 'learning_rate': 8.639757349510031e-07, 'completion_length': 304.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.06915953382849693, 'kl': 0.0257568359375, 'epoch': 0.14} 14%|█▎ | 583/4286 [4:24:49<26:17:04, 25.55s/it] 14%|█▎ | 584/4286 [4:25:15<26:27:25, 25.73s/it] {'loss': 0.0017, 'grad_norm': 30.095768327989205, 'learning_rate': 8.637424171721885e-07, 'completion_length': 311.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.5035714358091354, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4678572416305542, 'reward_std': 0.14534377306699753, 'kl': 0.0413818359375, 'epoch': 0.14} 14%|█▎ | 584/4286 [4:25:15<26:27:25, 25.73s/it][2025-03-02 19:23:04,755] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▎ | 585/4286 [4:25:42<26:54:16, 26.17s/it] {'loss': 0.0017, 'grad_norm': 0.9304455449370906, 'learning_rate': 8.635090993933737e-07, 'completion_length': 314.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7482143044471741, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7303572297096252, 'reward_std': 0.1373608037829399, 'kl': 0.0413818359375, 'epoch': 0.14} 14%|█▎ | 585/4286 [4:25:42<26:54:16, 26.17s/it][2025-03-02 19:23:34,237] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▎ | 586/4286 [4:26:11<27:55:05, 27.16s/it] {'loss': 0.0016, 'grad_norm': 0.4849872339123791, 'learning_rate': 8.632757816145589e-07, 'completion_length': 345.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6872024238109589, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6693452596664429, 'reward_std': 0.16483522206544876, 'kl': 0.0406494140625, 'epoch': 0.14} 14%|█▎ | 586/4286 [4:26:11<27:55:05, 27.16s/it] 14%|█▎ | 587/4286 [4:26:38<27:46:02, 27.02s/it] {'loss': 0.0011, 'grad_norm': 0.4346297163450996, 'learning_rate': 8.630424638357442e-07, 'completion_length': 287.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.05495268478989601, 'kl': 0.02874755859375, 'epoch': 0.14} 14%|█▎ | 587/4286 [4:26:38<27:46:02, 27.02s/it][2025-03-02 19:24:27,363] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▎ | 588/4286 [4:27:04<27:34:33, 26.85s/it] {'loss': 0.0017, 'grad_norm': 0.5660516431892061, 'learning_rate': 8.628091460569295e-07, 'completion_length': 335.2143096923828, 'rewards/only_full_func_accuracy_reward': 0.6644345819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6465774774551392, 'reward_std': 0.13677266240119934, 'kl': 0.0413818359375, 'epoch': 0.14} 14%|█▎ | 588/4286 [4:27:04<27:34:33, 26.85s/it] 14%|█▎ | 589/4286 [4:27:30<27:09:53, 26.45s/it] {'loss': 0.0022, 'grad_norm': 2.9681931139923226, 'learning_rate': 8.625758282781147e-07, 'completion_length': 280.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6369047462940216, 'rewards/format_reward': 1.0, 'reward': 1.6369049549102783, 'reward_std': 0.06025657244026661, 'kl': 0.0556640625, 'epoch': 0.14} 14%|█▎ | 589/4286 [4:27:30<27:09:53, 26.45s/it] 14%|█▍ | 590/4286 [4:27:54<26:29:00, 25.80s/it] {'loss': 0.0017, 'grad_norm': 2.596136350872983, 'learning_rate': 8.623425104992999e-07, 'completion_length': 282.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.06664376705884933, 'kl': 0.042236328125, 'epoch': 0.14} 14%|█▍ | 590/4286 [4:27:54<26:29:00, 25.80s/it][2025-03-02 19:25:44,643] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▍ | 591/4286 [4:28:22<26:59:43, 26.30s/it] {'loss': 0.0018, 'grad_norm': 1.484341404018085, 'learning_rate': 8.621091927204852e-07, 'completion_length': 327.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.5223214626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4866072535514832, 'reward_std': 0.1428072303533554, 'kl': 0.0460205078125, 'epoch': 0.14} 14%|█▍ | 591/4286 [4:28:22<26:59:43, 26.30s/it] 14%|█▍ | 592/4286 [4:28:47<26:41:09, 26.01s/it] {'loss': 0.0012, 'grad_norm': 0.2976715151889451, 'learning_rate': 8.618758749416705e-07, 'completion_length': 311.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7180060148239136, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7001489400863647, 'reward_std': 0.07801165198907256, 'kl': 0.02886962890625, 'epoch': 0.14} 14%|█▍ | 592/4286 [4:28:47<26:41:09, 26.01s/it] 14%|█▍ | 593/4286 [4:29:12<26:21:29, 25.69s/it] {'loss': 0.0023, 'grad_norm': 0.6668606374074877, 'learning_rate': 8.616425571628557e-07, 'completion_length': 304.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6681548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6502977013587952, 'reward_std': 0.21171020716428757, 'kl': 0.0565185546875, 'epoch': 0.14} 14%|█▍ | 593/4286 [4:29:12<26:21:29, 25.69s/it] 14%|█▍ | 594/4286 [4:29:37<26:04:01, 25.42s/it] {'loss': 0.0018, 'grad_norm': 1.021791597906157, 'learning_rate': 8.61409239384041e-07, 'completion_length': 300.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7306548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7306549549102783, 'reward_std': 0.11679216846823692, 'kl': 0.0455322265625, 'epoch': 0.14} 14%|█▍ | 594/4286 [4:29:37<26:04:01, 25.42s/it][2025-03-02 19:27:24,473] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▍ | 595/4286 [4:30:02<25:51:42, 25.22s/it] {'loss': 0.002, 'grad_norm': 0.5082540319961211, 'learning_rate': 8.611759216052262e-07, 'completion_length': 277.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6450892984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6272322535514832, 'reward_std': 0.1421817559748888, 'kl': 0.0491943359375, 'epoch': 0.14} 14%|█▍ | 595/4286 [4:30:02<25:51:42, 25.22s/it] 14%|█▍ | 596/4286 [4:30:28<26:19:15, 25.68s/it] {'loss': 0.002, 'grad_norm': 0.7833543190431844, 'learning_rate': 8.609426038264115e-07, 'completion_length': 303.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6056548357009888, 'reward_std': 0.09227254800498486, 'kl': 0.0499267578125, 'epoch': 0.14} 14%|█▍ | 596/4286 [4:30:28<26:19:15, 25.68s/it] 14%|█▍ | 597/4286 [4:30:53<26:00:04, 25.37s/it] {'loss': 0.0028, 'grad_norm': 1.2314415605180498, 'learning_rate': 8.607092860475968e-07, 'completion_length': 301.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.56101194024086, 'rewards/format_reward': 1.0, 'reward': 1.5610120296478271, 'reward_std': 0.1160714291036129, 'kl': 0.0693359375, 'epoch': 0.14} 14%|█▍ | 597/4286 [4:30:53<26:00:04, 25.37s/it][2025-03-02 19:28:42,466] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▍ | 598/4286 [4:31:20<26:22:04, 25.74s/it] {'loss': 0.0015, 'grad_norm': 0.754737384540052, 'learning_rate': 8.60475968268782e-07, 'completion_length': 325.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.737500011920929, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7196429371833801, 'reward_std': 0.15524620935320854, 'kl': 0.0369873046875, 'epoch': 0.14} 14%|█▍ | 598/4286 [4:31:20<26:22:04, 25.74s/it] 14%|█▍ | 599/4286 [4:31:45<26:22:02, 25.75s/it] {'loss': 0.0023, 'grad_norm': 0.39012642838239603, 'learning_rate': 8.602426504899672e-07, 'completion_length': 308.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.5699405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5520833730697632, 'reward_std': 0.068452388048172, 'kl': 0.056640625, 'epoch': 0.14} 14%|█▍ | 599/4286 [4:31:45<26:22:02, 25.75s/it] 14%|█▍ | 600/4286 [4:32:10<25:57:58, 25.36s/it] {'loss': 0.0018, 'grad_norm': 1.9549892498556098, 'learning_rate': 8.600093327111526e-07, 'completion_length': 294.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7074405252933502, 'rewards/format_reward': 1.0, 'reward': 1.7074405550956726, 'reward_std': 0.10641280934214592, 'kl': 0.04486083984375, 'epoch': 0.14} 14%|█▍ | 600/4286 [4:32:10<25:57:58, 25.36s/it] 14%|█▍ | 601/4286 [4:37:31<116:44:06, 114.04s/it] {'loss': 0.0019, 'grad_norm': 0.567902349955534, 'learning_rate': 8.597760149323378e-07, 'completion_length': 310.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.04464286006987095, 
'kl': 0.048095703125, 'epoch': 0.14} 14%|█▍ | 601/4286 [4:37:31<116:44:06, 114.04s/it] 14%|█▍ | 602/4286 [4:37:57<89:36:52, 87.57s/it] {'loss': 0.003, 'grad_norm': 1.033080981255434, 'learning_rate': 8.59542697153523e-07, 'completion_length': 303.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.5639880895614624, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5282739400863647, 'reward_std': 0.15536291524767876, 'kl': 0.07568359375, 'epoch': 0.14} 14%|█▍ | 602/4286 [4:37:57<89:36:52, 87.57s/it] 14%|█▍ | 603/4286 [4:38:23<70:54:03, 69.30s/it] {'loss': 0.0029, 'grad_norm': 0.5820246437822034, 'learning_rate': 8.593093793747082e-07, 'completion_length': 295.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6755953133106232, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.657738208770752, 'reward_std': 0.1369047686457634, 'kl': 0.0721435546875, 'epoch': 0.14} 14%|█▍ | 603/4286 [4:38:23<70:54:03, 69.30s/it] 14%|█▍ | 604/4286 [4:38:49<57:24:42, 56.13s/it] {'loss': 0.0022, 'grad_norm': 0.5412121228820708, 'learning_rate': 8.590760615958935e-07, 'completion_length': 303.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.08908393420279026, 'kl': 0.0545654296875, 'epoch': 0.14} 14%|█▍ | 604/4286 [4:38:49<57:24:42, 56.13s/it] 14%|█▍ | 605/4286 [4:39:17<48:46:01, 47.69s/it] {'loss': 0.002, 'grad_norm': 0.7895757212139306, 'learning_rate': 8.588427438170788e-07, 'completion_length': 322.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7148809731006622, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6791667342185974, 'reward_std': 0.15792888402938843, 'kl': 0.0501708984375, 'epoch': 0.14} 14%|█▍ | 605/4286 [4:39:17<48:46:01, 47.69s/it] 14%|█▍ | 606/4286 [4:39:43<42:21:57, 41.45s/it] {'loss': 0.0021, 'grad_norm': 1.0704013471047422, 'learning_rate': 8.58609426038264e-07, 'completion_length': 326.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.59226194024086, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5565477013587952, 'reward_std': 0.1463249921798706, 'kl': 0.0513916015625, 'epoch': 0.14} 14%|█▍ | 606/4286 [4:39:43<42:21:57, 41.45s/it] 14%|█▍ | 607/4286 [4:40:08<37:06:12, 36.31s/it] {'loss': 0.0031, 'grad_norm': 0.6241818909692748, 'learning_rate': 8.583761082594493e-07, 'completion_length': 283.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7196428775787354, 'rewards/format_reward': 1.0, 'reward': 1.7196429371833801, 'reward_std': 0.09101385436952114, 'kl': 0.07666015625, 'epoch': 0.14} 14%|█▍ | 607/4286 [4:40:08<37:06:12, 36.31s/it] 14%|█▍ | 608/4286 [4:40:33<33:39:14, 32.94s/it] {'loss': 0.0019, 'grad_norm': 0.3362548365561796, 'learning_rate': 8.581427904806345e-07, 'completion_length': 286.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.8556548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8377977013587952, 'reward_std': 0.06951731257140636, 'kl': 0.0472412109375, 'epoch': 0.14} 14%|█▍ | 608/4286 [4:40:33<33:39:14, 32.94s/it] 14%|█▍ | 609/4286 [4:40:58<31:07:00, 30.47s/it] {'loss': 0.0072, 'grad_norm': 1.301462243983485, 'learning_rate': 8.579094727018198e-07, 'completion_length': 279.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6034226715564728, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.531994104385376, 'reward_std': 0.2495468631386757, 'kl': 0.17919921875, 'epoch': 0.14} 14%|█▍ | 609/4286 [4:40:58<31:07:00, 30.47s/it] 14%|█▍ | 610/4286 
[4:41:24<29:55:47, 29.31s/it] {'loss': 0.0093, 'grad_norm': 2.0243524825253574, 'learning_rate': 8.576761549230051e-07, 'completion_length': 295.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.616071492433548, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5446430444717407, 'reward_std': 0.26271943747997284, 'kl': 0.232421875, 'epoch': 0.14} 14%|█▍ | 610/4286 [4:41:24<29:55:47, 29.31s/it] 14%|█▍ | 611/4286 [4:41:51<29:17:38, 28.70s/it] {'loss': 0.0063, 'grad_norm': 1.0081881576063036, 'learning_rate': 8.574428371441903e-07, 'completion_length': 309.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5907739400863647, 'reward_std': 0.16737382858991623, 'kl': 0.15771484375, 'epoch': 0.14} 14%|█▍ | 611/4286 [4:41:51<29:17:38, 28.70s/it] 14%|█▍ | 612/4286 [4:42:19<29:04:18, 28.49s/it] {'loss': 0.0146, 'grad_norm': 1.4828050233855663, 'learning_rate': 8.572095193653755e-07, 'completion_length': 311.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.522321492433548, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.415178656578064, 'reward_std': 0.2916680574417114, 'kl': 0.3662109375, 'epoch': 0.14} 14%|█▍ | 612/4286 [4:42:19<29:04:18, 28.49s/it] 14%|█▍ | 613/4286 [4:42:46<28:26:53, 27.88s/it] {'loss': 0.0096, 'grad_norm': 0.9548852080699378, 'learning_rate': 8.569762015865608e-07, 'completion_length': 300.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.5997024029493332, 'rewards/format_reward': 1.0, 'reward': 1.599702537059784, 'reward_std': 0.025514851324260235, 'kl': 0.2412109375, 'epoch': 0.14} 14%|█▍ | 613/4286 [4:42:46<28:26:53, 27.88s/it] 14%|█▍ | 614/4286 [4:43:13<28:09:47, 27.61s/it] {'loss': 0.0094, 'grad_norm': 2.004198648072687, 'learning_rate': 8.567428838077461e-07, 'completion_length': 308.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.712266206741333, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6408376693725586, 'reward_std': 0.21818048506975174, 'kl': 0.23486328125, 'epoch': 0.14} 14%|█▍ | 614/4286 [4:43:13<28:09:47, 27.61s/it][2025-03-02 19:41:02,867] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▍ | 615/4286 [4:43:40<27:58:50, 27.44s/it] {'loss': 0.0167, 'grad_norm': 2.0223599289562593, 'learning_rate': 8.565095660289313e-07, 'completion_length': 282.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6130952835083008, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5416666865348816, 'reward_std': 0.22602581977844238, 'kl': 0.41796875, 'epoch': 0.14} 14%|█▍ | 615/4286 [4:43:40<27:58:50, 27.44s/it] 14%|█▍ | 616/4286 [4:44:07<27:56:15, 27.40s/it] {'loss': 0.0345, 'grad_norm': 4.372862623156136, 'learning_rate': 8.562762482501165e-07, 'completion_length': 328.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6952381432056427, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.5880953073501587, 'reward_std': 0.36755719035863876, 'kl': 0.859375, 'epoch': 0.14} 14%|█▍ | 616/4286 [4:44:07<27:56:15, 27.40s/it] 14%|█▍ | 617/4286 [4:44:35<28:10:52, 27.65s/it] {'loss': 0.0981, 'grad_norm': 3.1348329936593116, 'learning_rate': 8.560429304713019e-07, 'completion_length': 319.75, 'rewards/only_full_func_accuracy_reward': 0.4598214477300644, 'rewards/format_reward': 0.7321428954601288, 'reward': 1.1919643878936768, 'reward_std': 0.5312229245901108, 'kl': 2.4609375, 'epoch': 0.14} 14%|█▍ | 617/4286 [4:44:35<28:10:52, 27.65s/it][2025-03-02 19:42:26,116] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 14%|█▍ | 618/4286 [4:45:03<28:11:17, 27.67s/it] {'loss': 0.0953, 'grad_norm': 2.4227673948614914, 'learning_rate': 8.558096126924871e-07, 'completion_length': 317.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6086310148239136, 'rewards/format_reward': 0.785714328289032, 'reward': 1.3943453431129456, 'reward_std': 0.45775073766708374, 'kl': 2.3828125, 'epoch': 0.14} 14%|█▍ | 618/4286 [4:45:03<28:11:17, 27.67s/it][2025-03-02 19:42:54,259] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
[2025-03-02 19:42:54,259] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
14%|█▍ | 619/4286 [4:45:31<28:19:34, 27.81s/it] {'loss': 0.1797, 'grad_norm': 5.481585534306794, 'learning_rate': 8.555762949136723e-07, 'completion_length': 297.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.609127014875412, 'rewards/format_reward': 0.785714328289032, 'reward': 1.3948413133621216, 'reward_std': 0.3810455650091171, 'kl': 4.484375, 'epoch': 0.14}
14%|█▍ | 620/4286 [4:45:59<28:10:37, 27.67s/it] {'loss': 0.2032, 'grad_norm': 4.790939082678196, 'learning_rate': 8.553429771348576e-07, 'completion_length': 318.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.6629464626312256, 'rewards/format_reward': 0.7500000298023224, 'reward': 1.4129465818405151, 'reward_std': 0.5297921001911163, 'kl': 5.078125, 'epoch': 0.14}
[2025-03-02 19:43:49,867] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
14%|█▍ | 621/4286 [4:46:27<28:21:01, 27.85s/it] {'loss': 0.3815, 'grad_norm': 14.399826910081176, 'learning_rate': 8.551096593560429e-07, 'completion_length': 344.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.3928571715950966, 'rewards/format_reward': 0.6071428954601288, 'reward': 1.0000000298023224, 'reward_std': 0.5606023371219635, 'kl': 9.53125, 'epoch': 0.14}
15%|█▍ | 622/4286 [4:46:54<28:00:47, 27.52s/it] {'loss': 0.1758, 'grad_norm': 5.741461204489477, 'learning_rate': 8.548763415772281e-07, 'completion_length': 288.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.5980902910232544, 'rewards/format_reward': 0.7678571939468384, 'reward': 1.3659474849700928, 'reward_std': 0.37685488164424896, 'kl': 4.40625, 'epoch': 0.15}
[2025-03-02 19:44:43,181] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▍ | 623/4286 [4:47:20<27:42:24, 27.23s/it] {'loss': 0.2091, 'grad_norm': 6.1520955773946335, 'learning_rate': 8.546430237984134e-07, 'completion_length': 264.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.65476194024086, 'rewards/format_reward': 0.767857164144516, 'reward': 1.422619104385376, 'reward_std': 0.5025074779987335, 'kl': 5.21875, 'epoch': 0.15}
15%|█▍ | 624/4286 [4:47:48<27:45:51, 27.29s/it] {'loss': 0.0702, 'grad_norm': 1.362471757222333, 'learning_rate': 8.544097060195986e-07, 'completion_length': 311.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.5173611640930176, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4280754327774048, 'reward_std': 0.30359238386154175, 'kl': 1.7578125, 'epoch': 0.15}
[2025-03-02 19:45:39,608] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▍ | 625/4286 [4:48:17<28:16:19, 27.80s/it] {'loss': 0.0684, 'grad_norm': 1.4072353769582746, 'learning_rate': 8.541763882407838e-07, 'completion_length': 309.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.5178571939468384, 'reward_std': 0.31553706526756287, 'kl': 1.70703125, 'epoch': 0.15}
[2025-03-02 19:46:07,085] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▍ | 626/4286 [4:48:44<28:09:56, 27.70s/it] {'loss': 0.1097, 'grad_norm': 2.285759017345099, 'learning_rate': 8.539430704619691e-07, 'completion_length': 282.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.5472470670938492, 'rewards/format_reward': 0.803571492433548, 'reward': 1.3508185744285583, 'reward_std': 0.5223073214292526, 'kl': 2.7421875, 'epoch': 0.15}
15%|█▍ | 627/4286 [4:49:10<27:40:24, 27.23s/it] {'loss': 0.0213, 'grad_norm': 1.934458409504816, 'learning_rate': 8.537097526831544e-07, 'completion_length': 312.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.5688350796699524, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4974066615104675, 'reward_std': 0.2238302193582058, 'kl': 0.53125, 'epoch': 0.15}
15%|█▍ | 628/4286 [4:49:37<27:30:37, 27.07s/it] {'loss': 0.0417, 'grad_norm': 2.990713229175543, 'learning_rate': 8.534764349043396e-07, 'completion_length': 287.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5645461976528168, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.49311763048172, 'reward_std': 0.2849437892436981, 'kl': 1.041015625, 'epoch': 0.15}
15%|█▍ | 629/4286 [4:50:03<27:06:07, 26.68s/it] {'loss': 0.0281, 'grad_norm': 1.5962237104614834, 'learning_rate': 8.532431171255248e-07, 'completion_length': 309.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.5987246036529541, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5451532006263733, 'reward_std': 0.2489161640405655, 'kl': 0.701171875, 'epoch': 0.15}
[2025-03-02 19:47:51,324] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▍ | 630/4286 [4:50:28<26:46:48, 26.37s/it] {'loss': 0.0352, 'grad_norm': 3.043675907068077, 'learning_rate': 8.530097993467102e-07, 'completion_length': 292.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.5744048357009888, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.520833432674408, 'reward_std': 0.24034344032406807, 'kl': 0.87890625, 'epoch': 0.15}
15%|█▍ | 631/4286 [4:50:54<26:40:07, 26.27s/it] {'loss': 0.031, 'grad_norm': 13.448038325672705, 'learning_rate': 8.527764815678954e-07, 'completion_length': 306.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.5416667014360428, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.470238208770752, 'reward_std': 0.25148557871580124, 'kl': 0.77734375, 'epoch': 0.15}
15%|█▍ | 632/4286 [4:51:23<27:17:34, 26.89s/it] {'loss': 0.0748, 'grad_norm': 2.1939376440861142, 'learning_rate': 8.525431637890806e-07, 'completion_length': 308.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.706250011920929, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.5455358028411865, 'reward_std': 0.3622410148382187, 'kl': 1.8671875, 'epoch': 0.15}
[2025-03-02 19:49:13,430] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▍ | 633/4286 [4:51:51<27:32:35, 27.14s/it] {'loss': 0.1088, 'grad_norm': 2.9977580101737717, 'learning_rate': 8.523098460102659e-07, 'completion_length': 295.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.6158234477043152, 'rewards/format_reward': 0.8214285969734192, 'reward': 1.4372521042823792, 'reward_std': 0.44587841629981995, 'kl': 2.7109375, 'epoch': 0.15}
15%|█▍ | 634/4286 [4:52:17<27:21:13, 26.96s/it] {'loss': 0.0978, 'grad_norm': 2.813393512330604, 'learning_rate': 8.520765282314512e-07, 'completion_length': 252.64287567138672, 'rewards/only_full_func_accuracy_reward': 0.6875000894069672, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.5803572535514832, 'reward_std': 0.3725840747356415, 'kl': 2.44140625, 'epoch': 0.15}
[2025-03-02 19:50:06,822] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▍ | 635/4286 [4:52:44<27:18:38, 26.93s/it] {'loss': 0.1395, 'grad_norm': 6.005319097290889, 'learning_rate': 8.518432104526364e-07, 'completion_length': 288.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5515873432159424, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.4087302088737488, 'reward_std': 0.38795602321624756, 'kl': 3.484375, 'epoch': 0.15}
15%|█▍ | 636/4286 [4:53:09<26:41:00, 26.32s/it] {'loss': 0.1568, 'grad_norm': 4.02644020571608, 'learning_rate': 8.516098926738216e-07, 'completion_length': 270.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.5178571939468384, 'reward_std': 0.41197289526462555, 'kl': 3.921875, 'epoch': 0.15}
15%|█▍ | 637/4286 [4:53:35<26:40:08, 26.31s/it] {'loss': 0.2113, 'grad_norm': 8.556526245205378, 'learning_rate': 8.513765748950069e-07, 'completion_length': 288.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.476190522313118, 'rewards/format_reward': 0.7321428954601288, 'reward': 1.2083334028720856, 'reward_std': 0.5435838252305984, 'kl': 5.2734375, 'epoch': 0.15}
[2025-03-02 19:51:26,428] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▍ | 638/4286 [4:54:04<27:18:10, 26.94s/it] {'loss': 0.306, 'grad_norm': 8.28217430208253, 'learning_rate': 8.511432571161922e-07, 'completion_length': 271.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.4880952686071396, 'rewards/format_reward': 0.660714328289032, 'reward': 1.1488096117973328, 'reward_std': 0.5773670226335526, 'kl': 7.640625, 'epoch': 0.15}
15%|█▍ | 639/4286 [4:54:28<26:31:01, 26.18s/it] {'loss': 0.2091, 'grad_norm': 3.5832784755814786, 'learning_rate': 8.509099393373774e-07, 'completion_length': 259.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.4360119551420212, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.2395834922790527, 'reward_std': 0.45914414525032043, 'kl': 5.234375, 'epoch': 0.15}
15%|█▍ | 640/4286 [4:54:54<26:38:11, 26.30s/it] {'loss': 0.1713, 'grad_norm': 4.147612711698859, 'learning_rate': 8.506766215585627e-07, 'completion_length': 278.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5773810148239136, 'rewards/format_reward': 0.8214286267757416, 'reward': 1.3988096117973328, 'reward_std': 0.42821942269802094, 'kl': 4.28125, 'epoch': 0.15}
15%|█▍ | 641/4286 [4:55:18<25:47:39, 25.48s/it] {'loss': 0.1535, 'grad_norm': 3.02043132297121, 'learning_rate': 8.504433037797479e-07, 'completion_length': 266.19644927978516, 'rewards/only_full_func_accuracy_reward': 0.6145833730697632, 'rewards/format_reward': 0.7500000596046448, 'reward': 1.3645833730697632, 'reward_std': 0.3846854269504547, 'kl': 3.828125, 'epoch': 0.15}
15%|█▍ | 642/4286 [4:55:42<25:26:13, 25.13s/it] {'loss': 0.0926, 'grad_norm': 2.944387193243648, 'learning_rate': 8.502099860009332e-07, 'completion_length': 274.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6145833730697632, 'rewards/format_reward': 0.785714328289032, 'reward': 1.40029776096344, 'reward_std': 0.43638400733470917, 'kl': 2.3125, 'epoch': 0.15}
15%|█▌ | 643/4286 [4:56:07<25:19:11, 25.02s/it] {'loss': 0.1172, 'grad_norm': 3.4526506354673625, 'learning_rate': 8.499766682221185e-07, 'completion_length': 289.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.5372024029493332, 'rewards/format_reward': 0.7500000298023224, 'reward': 1.2872024178504944, 'reward_std': 0.37893305718898773, 'kl': 2.9296875, 'epoch': 0.15}
15%|█▌ | 644/4286 [4:56:32<25:25:03, 25.12s/it] {'loss': 0.1137, 'grad_norm': 2.9166814157650514, 'learning_rate': 8.497433504433037e-07, 'completion_length': 251.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.48154765367507935, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.285119116306305, 'reward_std': 0.5373871028423309, 'kl': 2.8359375, 'epoch': 0.15}
[2025-03-02 19:54:20,216] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▌ | 645/4286 [4:56:57<25:18:48, 25.03s/it] {'loss': 0.1012, 'grad_norm': 3.6887349522838506, 'learning_rate': 8.495100326644889e-07, 'completion_length': 231.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.4732144474983215, 'reward_std': 0.30611902475357056, 'kl': 2.53125, 'epoch': 0.15}
15%|█▌ | 646/4286 [4:57:21<25:03:18, 24.78s/it] {'loss': 0.1143, 'grad_norm': 9.283327316744812, 'learning_rate': 8.492767148856743e-07, 'completion_length': 231.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5639881491661072, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.438988208770752, 'reward_std': 0.4056508541107178, 'kl': 2.8671875, 'epoch': 0.15}
15%|█▌ | 647/4286 [4:57:45<24:41:37, 24.43s/it] {'loss': 0.1381, 'grad_norm': 3.8094411807489736, 'learning_rate': 8.490433971068595e-07, 'completion_length': 235.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.5565476417541504, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.395833432674408, 'reward_std': 0.36313609778881073, 'kl': 3.453125, 'epoch': 0.15}
15%|█▌ | 648/4286 [4:58:09<24:36:24, 24.35s/it] {'loss': 0.0778, 'grad_norm': 1.9865009689924888, 'learning_rate': 8.488100793280447e-07, 'completion_length': 259.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.6086310744285583, 'reward_std': 0.3267856538295746, 'kl': 1.9453125, 'epoch': 0.15}
15%|█▌ | 649/4286 [4:58:35<25:05:30, 24.84s/it] {'loss': 0.1052, 'grad_norm': 4.314062348920373, 'learning_rate': 8.485767615492299e-07, 'completion_length': 246.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5610119104385376, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4895833730697632, 'reward_std': 0.27356646955013275, 'kl': 2.62890625, 'epoch': 0.15}
15%|█▌ | 650/4286 [4:59:02<25:38:38, 25.39s/it] {'loss': 0.0733, 'grad_norm': 2.9312736385567204, 'learning_rate': 8.483434437704153e-07, 'completion_length': 308.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6517857909202576, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6160715818405151, 'reward_std': 0.20311372727155685, 'kl': 1.83203125, 'epoch': 0.15}
15%|█▌ | 651/4286 [4:59:26<25:19:31, 25.08s/it] {'loss': 0.1217, 'grad_norm': 3.6825646151011115, 'learning_rate': 8.481101259916005e-07, 'completion_length': 262.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6592262089252472, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5699405670166016, 'reward_std': 0.31904279440641403, 'kl': 3.0390625, 'epoch': 0.15}
15%|█▌ | 652/4286 [4:59:51<25:21:12, 25.12s/it] {'loss': 0.2017, 'grad_norm': 4.204003767521601, 'learning_rate': 8.478768082127857e-07, 'completion_length': 268.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.4613095670938492, 'rewards/format_reward': 0.7678571939468384, 'reward': 1.2291667759418488, 'reward_std': 0.5089395493268967, 'kl': 5.0390625, 'epoch': 0.15}
[2025-03-02 19:57:38,761] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▌ | 653/4286 [5:00:16<25:07:01, 24.89s/it] {'loss': 0.2211, 'grad_norm': 4.09626366010817, 'learning_rate': 8.47643490433971e-07, 'completion_length': 260.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5729167014360428, 'rewards/format_reward': 0.785714328289032, 'reward': 1.3586310744285583, 'reward_std': 0.5142412930727005, 'kl': 5.53125, 'epoch': 0.15}
15%|█▌ | 654/4286 [5:00:41<25:06:02, 24.88s/it] {'loss': 0.2143, 'grad_norm': 5.531961406418679, 'learning_rate': 8.474101726551562e-07, 'completion_length': 258.9643020629883, 'rewards/only_full_func_accuracy_reward': 0.6741071343421936, 'rewards/format_reward': 0.6964285969734192, 'reward': 1.3705358505249023, 'reward_std': 0.6579667329788208, 'kl': 5.359375, 'epoch': 0.15}
15%|█▌ | 655/4286 [5:01:07<25:27:12, 25.24s/it] {'loss': 0.1702, 'grad_norm': 2.5129578223564177, 'learning_rate': 8.471768548763415e-07, 'completion_length': 301.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.5960034430027008, 'rewards/format_reward': 0.8214286267757416, 'reward': 1.4174320101737976, 'reward_std': 0.537154495716095, 'kl': 4.2421875, 'epoch': 0.15}
15%|█▌ | 656/4286 [5:01:31<25:01:19, 24.82s/it] {'loss': 0.1603, 'grad_norm': 3.9669072564247156, 'learning_rate': 8.469435370975268e-07, 'completion_length': 259.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.6208333969116211, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5494049191474915, 'reward_std': 0.2805466949939728, 'kl': 4.0078125, 'epoch': 0.15}
15%|█▌ | 657/4286 [5:01:56<25:09:18, 24.95s/it] {'loss': 0.265, 'grad_norm': 155.03375525901413, 'learning_rate': 8.46710219318712e-07, 'completion_length': 265.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.5517113506793976, 'rewards/format_reward': 0.7321428954601288, 'reward': 1.2838541865348816, 'reward_std': 0.6172238886356354, 'kl': 6.6171875, 'epoch': 0.15}
15%|█▌ | 658/4286 [5:02:21<25:11:54, 25.00s/it] {'loss': 0.1587, 'grad_norm': 4.275783517248334, 'learning_rate': 8.464769015398972e-07, 'completion_length': 270.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.5401785969734192, 'reward_std': 0.30687953531742096, 'kl': 3.96875, 'epoch': 0.15}
15%|█▌ | 659/4286 [5:02:45<24:44:15, 24.55s/it] {'loss': 0.107, 'grad_norm': 2.1345197061681733, 'learning_rate': 8.462435837610825e-07, 'completion_length': 255.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.635416716337204, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5818453431129456, 'reward_std': 0.163152277469635, 'kl': 2.671875, 'epoch': 0.15}
15%|█▌ | 660/4286 [5:03:10<24:56:21, 24.76s/it] {'loss': 0.0745, 'grad_norm': 4.0023266181342825, 'learning_rate': 8.460102659822678e-07, 'completion_length': 295.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.5342262536287308, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.498512089252472, 'reward_std': 0.20254157483577728, 'kl': 1.86328125, 'epoch': 0.15}
[2025-03-02 20:00:59,764] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▌ | 661/4286 [5:03:37<25:38:19, 25.46s/it] {'loss': 0.0838, 'grad_norm': 3.4006094018281474, 'learning_rate': 8.45776948203453e-07, 'completion_length': 279.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.696428656578064, 'reward_std': 0.17708802223205566, 'kl': 2.08984375, 'epoch': 0.15}
15%|█▌ | 662/4286 [5:04:01<25:20:23, 25.17s/it] {'loss': 0.075, 'grad_norm': 1.6708173473886654, 'learning_rate': 8.455436304246382e-07, 'completion_length': 299.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7559524476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7380953431129456, 'reward_std': 0.13742605596780777, 'kl': 1.87109375, 'epoch': 0.15}
[2025-03-02 20:01:49,521] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
15%|█▌ | 663/4286 [5:04:27<25:21:35, 25.20s/it] {'loss': 0.0276, 'grad_norm': 1.5425054820921305, 'learning_rate': 8.453103126458236e-07, 'completion_length': 301.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.5714286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715818405151, 'reward_std': 0.07233373075723648, 'kl': 0.689453125, 'epoch': 0.15}
15%|█▌ | 664/4286 [5:04:51<25:02:15, 24.89s/it] {'loss': 0.067, 'grad_norm': 2.1102392354701762, 'learning_rate': 8.450769948670088e-07, 'completion_length': 278.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.5750000178813934, 'rewards/format_reward': 1.0, 'reward': 1.5750000476837158, 'reward_std': 0.04852968920022249, 'kl': 1.67578125, 'epoch': 0.15}
16%|█▌ | 665/4286 [5:05:15<24:56:56, 24.80s/it] {'loss': 0.0487, 'grad_norm': 10.082239881224275, 'learning_rate': 8.44843677088194e-07, 'completion_length': 281.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.6270833909511566, 'rewards/format_reward': 1.0, 'reward': 1.627083420753479, 'reward_std': 0.08585182577371597, 'kl': 1.21875, 'epoch': 0.16}
16%|█▌ | 666/4286 [5:05:40<24:46:23, 24.64s/it] {'loss': 0.0353, 'grad_norm': 2.7845878913298656, 'learning_rate': 8.446103593093793e-07, 'completion_length': 289.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6339287161827087, 'reward_std': 0.10646876320242882, 'kl': 0.8828125, 'epoch': 0.16}
16%|█▌ | 667/4286 [5:06:04<24:36:50, 24.48s/it] {'loss': 0.0571, 'grad_norm': 4.146826003700151, 'learning_rate': 8.443770415305646e-07, 'completion_length': 288.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6116071939468384, 'reward_std': 0.17407477647066116, 'kl': 1.4296875, 'epoch': 0.16}
16%|█▌ | 668/4286 [5:06:28<24:39:50, 24.54s/it] {'loss': 0.0162, 'grad_norm': 1.424467983707889, 'learning_rate': 8.441437237517498e-07, 'completion_length': 319.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.5505952835083008, 'rewards/format_reward': 1.0, 'reward': 1.5505953431129456, 'reward_std': 0.06612942647188902, 'kl': 0.40478515625, 'epoch': 0.16}
16%|█▌ | 669/4286 [5:06:51<24:12:05, 24.09s/it] {'loss': 0.0165, 'grad_norm': 41.79385924386062, 'learning_rate': 8.439104059729351e-07, 'completion_length': 286.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6904761791229248, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.037630438804626465, 'kl': 0.4140625, 'epoch': 0.16}
16%|█▌ | 670/4286 [5:07:16<24:19:59, 24.23s/it] {'loss': 0.1141, 'grad_norm': 1.8918072114037752, 'learning_rate': 8.436770881941203e-07, 'completion_length': 275.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7282738983631134, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6925596594810486, 'reward_std': 0.18750303983688354, 'kl': 2.8515625, 'epoch': 0.16}
16%|█▌ | 671/4286 [5:07:40<24:23:26, 24.29s/it] {'loss': 0.0529, 'grad_norm': 1.3526085253377422, 'learning_rate': 8.434437704153056e-07, 'completion_length': 295.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.07955649495124817, 'kl': 1.32275390625, 'epoch': 0.16}
16%|█▌ | 672/4286 [5:08:06<24:47:32, 24.70s/it] {'loss': 0.0297, 'grad_norm': 1.1050813605668752, 'learning_rate': 8.432104526364908e-07, 'completion_length': 309.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.712202399969101, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6943453550338745, 'reward_std': 0.10427485313266516, 'kl': 0.742919921875, 'epoch': 0.16}
16%|█▌ | 673/4286 [5:08:29<24:09:48, 24.08s/it] {'loss': 165.5258, 'grad_norm': 412788.9314775585, 'learning_rate': 8.429771348576761e-07, 'completion_length': 253.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5863096117973328, 'reward_std': 0.1896989494562149, 'kl': 4161.34375, 'epoch': 0.16}
16%|█▌ | 674/4286 [5:08:55<24:42:57, 24.63s/it] {'loss': 0.0939, 'grad_norm': 4.148160776804671, 'learning_rate': 8.427438170788613e-07, 'completion_length': 285.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7404762208461761, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.722619116306305, 'reward_std': 0.14188669621944427, 'kl': 2.33984375, 'epoch': 0.16}
16%|█▌ | 675/4286 [5:09:21<25:21:51, 25.29s/it] {'loss': 0.0968, 'grad_norm': 2.763533053884077, 'learning_rate': 8.425104993000465e-07, 'completion_length': 332.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6636906266212463, 'reward_std': 0.1011904813349247, 'kl': 2.4140625, 'epoch': 0.16}
[2025-03-02 20:07:11,641] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
16%|█▌ | 676/4286 [5:09:49<25:57:08, 25.88s/it] {'loss': 0.0712, 'grad_norm': 1.6663019509513541, 'learning_rate': 8.422771815212319e-07, 'completion_length': 308.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6523810029029846, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6166667938232422, 'reward_std': 0.1458047442138195, 'kl': 1.78515625, 'epoch': 0.16}
16%|█▌ | 677/4286 [5:10:15<26:12:52, 26.15s/it] {'loss': 0.0367, 'grad_norm': 1.4536738436637466, 'learning_rate': 8.420438637424171e-07, 'completion_length': 327.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.5773809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5595239400863647, 'reward_std': 0.09959554299712181, 'kl': 0.916015625, 'epoch': 0.16}
16%|█▌ | 678/4286 [5:10:42<26:23:39, 26.34s/it] {'loss': 0.0388, 'grad_norm': 1.230215986178126, 'learning_rate': 8.418105459636023e-07, 'completion_length': 293.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7336310744285583, 'reward_std': 0.1101190559566021, 'kl': 0.970703125, 'epoch': 0.16}
[2025-03-02 20:08:30,734] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
16%|█▌ | 679/4286 [5:11:08<26:08:58, 26.10s/it] {'loss': 0.019, 'grad_norm': 1.2949220539573034, 'learning_rate': 8.415772281847876e-07, 'completion_length': 303.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8273809552192688, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.05952381528913975, 'kl': 0.474609375, 'epoch': 0.16}
16%|█▌ | 680/4286 [5:11:33<25:53:45, 25.85s/it] {'loss': 0.0134, 'grad_norm': 2.1190621588808196, 'learning_rate': 8.413439104059729e-07, 'completion_length': 271.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6468254327774048, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6289682984352112, 'reward_std': 0.09601649083197117, 'kl': 0.3349609375, 'epoch': 0.16}
16%|█▌ | 681/4286 [5:11:59<25:56:00, 25.90s/it] {'loss': 0.003, 'grad_norm': 0.861892252428594, 'learning_rate': 8.411105926271581e-07, 'completion_length': 351.2143096923828, 'rewards/only_full_func_accuracy_reward': 0.7252976596355438, 'rewards/format_reward': 1.0, 'reward': 1.7252976298332214, 'reward_std': 0.029311808291822672, 'kl': 0.07470703125, 'epoch': 0.16}
16%|█▌ | 682/4286 [5:12:23<25:20:41, 25.32s/it] {'loss': 0.0135, 'grad_norm': 1.6489320956854188, 'learning_rate': 8.408772748483433e-07, 'completion_length': 270.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291667461395264, 'reward_std': 0.059523805975914, 'kl': 0.3369140625, 'epoch': 0.16}
16%|█▌ | 683/4286 [5:12:47<25:00:44, 24.99s/it] {'loss': 0.031, 'grad_norm': 1.201235004272562, 'learning_rate': 8.406439570695286e-07, 'completion_length': 272.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7238095700740814, 'rewards/format_reward': 1.0, 'reward': 1.7238096594810486, 'reward_std': 0.05673839896917343, 'kl': 0.775390625, 'epoch': 0.16}
16%|█▌ | 684/4286 [5:13:12<25:03:48, 25.05s/it] {'loss': 0.007, 'grad_norm': 1.5205934611800924, 'learning_rate': 8.404106392907139e-07, 'completion_length': 290.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.0208333358168602, 'kl': 0.17529296875, 'epoch': 0.16}
16%|█▌ | 685/4286 [5:13:36<24:42:28, 24.70s/it] {'loss': 0.0327, 'grad_norm': 1.3881837618338024, 'learning_rate': 8.401773215118991e-07, 'completion_length': 282.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.11493691802024841, 'kl': 0.8203125, 'epoch': 0.16}
16%|█▌ | 686/4286 [5:14:00<24:29:54, 24.50s/it] {'loss': 0.0247, 'grad_norm': 3.6437833761883307, 'learning_rate': 8.399440037330844e-07, 'completion_length': 275.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.635416716337204, 'rewards/format_reward': 1.0, 'reward': 1.6354168057441711, 'reward_std': 0.0803571492433548, 'kl': 0.615234375, 'epoch': 0.16}
16%|█▌ | 687/4286 [5:14:27<25:00:01, 25.01s/it] {'loss': 0.0024, 'grad_norm': 0.48880856962357844, 'learning_rate': 8.397106859542696e-07, 'completion_length': 353.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.021707767620682716, 'kl': 0.060302734375, 'epoch': 0.16}
16%|█▌ | 688/4286 [5:14:50<24:38:32, 24.66s/it] {'loss': 0.0087, 'grad_norm': 0.7877250136216088, 'learning_rate': 8.394773681754549e-07, 'completion_length': 298.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.023809518665075302, 'kl': 0.216796875, 'epoch': 0.16}
16%|█▌ | 689/4286 [5:15:16<24:50:12, 24.86s/it] {'loss': 0.021, 'grad_norm': 1.0809729124961605, 'learning_rate': 8.392440503966402e-07, 'completion_length': 283.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.031603582203388214, 'kl': 0.52685546875, 'epoch': 0.16}
16%|█▌ | 690/4286 [5:15:42<25:14:43, 25.27s/it] {'loss': 0.0373, 'grad_norm': 3.0569779541241817, 'learning_rate': 8.390107326178254e-07, 'completion_length': 338.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.6666667461395264, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6309524774551392, 'reward_std': 0.10352958738803864, 'kl': 0.9356689453125, 'epoch': 0.16}
16%|█▌ | 691/4286 [5:16:06<24:57:23, 24.99s/it] {'loss': 0.0578, 'grad_norm': 0.7043613247689926, 'learning_rate': 8.387774148390106e-07, 'completion_length': 321.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6056548357009888, 'reward_std': 0.09821429289877415, 'kl': 1.4453125, 'epoch': 0.16}
16%|█▌ | 692/4286 [5:16:32<25:12:42, 25.25s/it] {'loss': 0.0193, 'grad_norm': 15.311731938212445, 'learning_rate': 8.38544097060196e-07, 'completion_length': 307.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.647321492433548, 'rewards/format_reward': 1.0, 'reward': 1.6473215222358704, 'reward_std': 0.061898695304989815, 'kl': 0.4840087890625, 'epoch': 0.16}
16%|█▌ | 693/4286 [5:17:00<26:03:14, 26.10s/it] {'loss': 0.0272, 'grad_norm': 8.798379356830447, 'learning_rate': 8.383107792813812e-07, 'completion_length': 315.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.6329365670681, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5972222685813904, 'reward_std': 0.12655732035636902, 'kl': 0.6767578125, 'epoch': 0.16}
16%|█▌ | 694/4286 [5:17:25<25:32:02, 25.59s/it] {'loss': 0.0021, 'grad_norm': 2.316720816905836, 'learning_rate': 8.380774615025664e-07, 'completion_length': 319.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.011904764920473099, 'kl': 0.0531005859375, 'epoch': 0.16}
16%|█▌ | 695/4286 [5:17:50<25:20:02, 25.40s/it] {'loss': 0.0417, 'grad_norm': 1.461961717673159, 'learning_rate': 8.378441437237516e-07, 'completion_length': 321.625, 'rewards/only_full_func_accuracy_reward': 0.6517857015132904, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6339287161827087, 'reward_std': 0.0917598307132721, 'kl': 1.0413818359375, 'epoch': 0.16}
16%|█▌ | 696/4286 [5:18:14<25:02:45, 25.12s/it] {'loss': 0.0646, 'grad_norm': 1.545791242170329, 'learning_rate': 8.37610825944937e-07, 'completion_length': 309.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6369048953056335, 'reward_std': 0.16011298447847366, 'kl': 1.61328125, 'epoch': 0.16}
16%|█▋ | 697/4286 [5:18:40<25:17:03, 25.36s/it] {'loss': 0.0071, 'grad_norm': 1.0814168297146904, 'learning_rate': 8.373775081661222e-07, 'completion_length': 356.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.71726194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6994048357009888, 'reward_std': 0.1488095298409462, 'kl': 0.1767578125, 'epoch': 0.16}
16%|█▋ | 698/4286 [5:19:06<25:20:29, 25.43s/it] {'loss': 0.014, 'grad_norm': 3.1981023930650068, 'learning_rate': 8.371441903873074e-07, 'completion_length': 317.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.49345242977142334, 'rewards/format_reward': 1.0, 'reward': 1.4934524893760681, 'reward_std': 0.10041029006242752, 'kl': 0.349609375, 'epoch': 0.16}
16%|█▋ | 699/4286 [5:19:30<25:10:40, 25.27s/it] {'loss': 0.0667, 'grad_norm': 14.652847211170695, 'learning_rate': 8.369108726084927e-07, 'completion_length': 279.5178756713867, 'rewards/only_full_func_accuracy_reward': 0.5625000596046448, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5267857909202576, 'reward_std': 0.16338983178138733, 'kl': 1.6640625, 'epoch': 0.16}
16%|█▋ | 700/4286 [5:19:57<25:37:44, 25.73s/it] {'loss': 0.0259, 'grad_norm': 0.8952216792666298, 'learning_rate': 8.36677554829678e-07, 'completion_length': 307.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6473215222358704, 'reward_std': 0.10097679868340492, 'kl': 0.64453125, 'epoch': 0.16}
16%|█▋ | 701/4286 [5:23:19<78:10:39, 78.50s/it] {'loss': 0.014, 'grad_norm': 3.0848773399641622, 'learning_rate': 8.364442370508632e-07, 'completion_length': 330.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6666668057441711, 'reward_std': 0.1354695949703455, 'kl': 0.3511962890625, 'epoch': 0.16}
16%|█▋ | 702/4286 [5:23:48<63:23:13, 63.67s/it] {'loss': 0.0342, 'grad_norm': 1.4051985946868493, 'learning_rate': 8.362109192720485e-07, 'completion_length': 329.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.65327388048172, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6175596714019775, 'reward_std': 0.1379956714808941, 'kl': 0.8515625, 'epoch': 0.16}
16%|█▋ | 703/4286 [5:24:15<52:32:33, 52.79s/it] {'loss': 0.0219, 'grad_norm': 1.7546285748413393, 'learning_rate': 8.359776014932337e-07, 'completion_length': 341.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.5617559552192688, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5260418057441711, 'reward_std': 0.15306012332439423, 'kl': 0.5458984375, 'epoch': 0.16}
16%|█▋ | 704/4286 [5:24:42<44:50:56, 45.07s/it] {'loss': 0.0072, 'grad_norm': 0.6215374159351039, 'learning_rate': 8.357442837144189e-07, 'completion_length': 344.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7032313644886017, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6853743195533752, 'reward_std': 0.04591836780309677, 'kl': 0.181640625, 'epoch': 0.16}
16%|█▋ | 705/4286 [5:25:10<39:34:12, 39.78s/it] {'loss': 0.0048, 'grad_norm': 1.470019389448927, 'learning_rate': 8.355109659356042e-07, 'completion_length': 333.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7886905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7886905670166016, 'reward_std': 0.08474764227867126, 'kl': 0.1201171875, 'epoch': 0.16}
16%|█▋ | 706/4286 [5:25:36<35:31:28, 35.72s/it] {'loss': 0.0505, 'grad_norm': 30.953782742376475, 'learning_rate': 8.352776481567895e-07, 'completion_length': 285.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6458333432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6279762983322144, 'reward_std': 0.09900591894984245, 'kl': 1.263671875, 'epoch': 0.16}
16%|█▋ | 707/4286 [5:26:04<33:07:45, 33.32s/it] {'loss': 0.0113, 'grad_norm': 1.879203006900383, 'learning_rate': 8.350443303779747e-07, 'completion_length': 328.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6041666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.06116022355854511, 'kl': 0.28369140625, 'epoch': 0.16}
17%|█▋ | 708/4286 [5:26:31<31:14:23, 31.43s/it] {'loss': 0.0234, 'grad_norm': 0.5773624965933243, 'learning_rate': 8.348110125991599e-07, 'completion_length': 322.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.8318452835083008, 'rewards/format_reward': 1.0, 'reward': 1.8318453431129456, 'reward_std': 0.014880956150591373, 'kl': 0.583740234375, 'epoch': 0.17}
17%|█▋ | 709/4286 [5:26:56<29:26:56, 29.64s/it] {'loss': 0.0261, 'grad_norm': 2.9562016343563213, 'learning_rate': 8.345776948203453e-07, 'completion_length': 280.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7127976417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.677083432674408, 'reward_std': 0.19489669427275658, 'kl': 0.65234375, 'epoch': 0.17}
17%|█▋ | 710/4286 [5:27:23<28:24:21, 28.60s/it] {'loss': 0.0039, 'grad_norm': 0.3940268408829614, 'learning_rate': 8.343443770415305e-07, 'completion_length': 310.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.83928582072258, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.013746432960033417, 'kl': 0.0968017578125, 'epoch': 0.17}
17%|█▋ | 711/4286 [5:27:50<28:10:17, 28.37s/it] {'loss': 0.0095, 'grad_norm': 1.4516444403740798, 'learning_rate': 8.341110592627157e-07, 'completion_length': 318.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7250000834465027, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.707142949104309, 'reward_std': 0.10807518288493156, 'kl': 0.23876953125, 'epoch': 0.17}
17%|█▋ | 712/4286 [5:28:16<27:14:01, 27.43s/it] {'loss': 0.0178, 'grad_norm': 5.6271906139554915, 'learning_rate': 8.33877741483901e-07, 'completion_length': 295.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6696430444717407, 'reward_std': 0.06823869794607162, 'kl': 0.44580078125, 'epoch': 0.17}
17%|█▋ | 713/4286 [5:28:41<26:42:37, 26.91s/it] {'loss': 0.0346, 'grad_norm': 5.7826135656671545, 'learning_rate': 8.336444237050863e-07, 'completion_length': 282.44644927978516, 'rewards/only_full_func_accuracy_reward': 0.828869104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7931548953056335, 'reward_std': 0.14874977618455887, 'kl': 0.8662109375, 'epoch': 0.17}
17%|█▋ | 714/4286 [5:29:09<27:02:51, 27.26s/it] {'loss': 0.0091, 'grad_norm': 1.2553466420705535, 'learning_rate': 8.334111059262715e-07, 'completion_length': 320.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6510417461395264, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6153274774551392, 'reward_std': 0.1502976231276989, 'kl': 0.228271484375, 'epoch': 0.17}
17%|█▋ | 715/4286 [5:29:34<26:11:05, 26.40s/it] {'loss': 0.0151, 'grad_norm': 1.2711361011658948, 'learning_rate': 8.331777881474568e-07, 'completion_length': 300.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6577381789684296, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.06547619123011827, 'kl': 0.37841796875, 'epoch': 0.17}
17%|█▋ | 716/4286 [5:29:59<25:54:21, 26.12s/it] {'loss': 0.0108, 'grad_norm': 2.4268404545742754, 'learning_rate': 8.32944470368642e-07, 'completion_length': 324.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262983322144, 'reward_std': 0.11085737496614456, 'kl': 0.26904296875, 'epoch': 0.17}
17%|█▋ | 717/4286 [5:30:25<25:43:53, 25.96s/it] {'loss': 0.0086, 'grad_norm': 2.4268017447749917, 'learning_rate': 8.327111525898273e-07, 'completion_length': 325.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6657898128032684, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6300755739212036, 'reward_std': 0.1505141779780388, 'kl': 0.213623046875, 'epoch': 0.17}
17%|█▋ | 718/4286 [5:30:49<25:10:20, 25.40s/it] {'loss': 0.009, 'grad_norm': 2.442697880319185, 'learning_rate': 8.324778348110125e-07, 'completion_length': 288.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.610119104385376, 'rewards/format_reward': 1.0, 'reward': 1.6101191639900208, 'reward_std': 0.07022446766495705, 'kl': 0.22637939453125, 'epoch': 0.17}
17%|█▋ | 719/4286 [5:31:14<25:03:32, 25.29s/it] {'loss': 0.0147, 'grad_norm': 1.9754567564716337, 'learning_rate': 8.322445170321978e-07, 'completion_length': 314.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6815476417541504, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.07040730305016041, 'kl': 0.36669921875, 'epoch': 0.17}
17%|█▋ | 720/4286 [5:31:38<24:49:58, 25.07s/it] {'loss': 0.0063, 'grad_norm': 0.8828514451417725, 'learning_rate': 8.32011199253383e-07, 'completion_length': 311.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6577380895614624, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.02976190857589245, 'kl': 0.1585693359375, 'epoch': 0.17}
17%|█▋ | 721/4286 [5:32:06<25:29:04, 25.73s/it] {'loss': 0.0024, 'grad_norm': 1.9696577338747123, 'learning_rate': 8.317778814745683e-07, 'completion_length': 302.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.665178656578064, 'rewards/format_reward': 1.0, 'reward': 1.6651787161827087, 'reward_std': 0.06767786014825106, 'kl': 0.06103515625, 'epoch': 0.17}
17%|█▋ | 722/4286 [5:32:31<25:13:39, 25.48s/it] {'loss': 0.0178, 'grad_norm': 1.4702189362438511, 'learning_rate': 8.315445636957536e-07, 'completion_length': 308.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.632440522313118, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.08035714365541935, 'kl': 0.4443359375, 'epoch': 0.17}
17%|█▋ | 723/4286 [5:32:57<25:23:09, 25.65s/it] {'loss': 0.0056, 'grad_norm': 3.262142383050046, 'learning_rate': 8.313112459169388e-07, 'completion_length': 292.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6773810088634491, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6595239043235779, 'reward_std': 0.14718037843704224, 'kl': 0.1396484375, 'epoch': 0.17}
17%|█▋ | 724/4286 [5:33:21<25:06:19, 25.37s/it] {'loss': 0.0075, 'grad_norm': 1.4941661984682506, 'learning_rate': 8.31077928138124e-07, 'completion_length': 278.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.741071492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7053572535514832, 'reward_std': 0.14301356300711632, 'kl': 0.18701171875, 'epoch': 0.17}
17%|█▋ | 725/4286 [5:33:49<25:41:45, 25.98s/it] {'loss': 0.0034, 'grad_norm': 1.8914628521907324, 'learning_rate': 8.308446103593094e-07, 'completion_length': 301.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.1643432229757309, 'kl': 0.085205078125, 'epoch': 0.17}
17%|█▋ | 726/4286 [5:34:16<26:06:56, 26.41s/it] {'loss': 0.0091, 'grad_norm': 2.555653784595484, 'learning_rate': 8.306112925804946e-07, 'completion_length': 347.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.5758928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5580357313156128, 'reward_std': 0.07296611741185188, 'kl': 0.22705078125, 'epoch': 0.17}
17%|█▋ | 727/4286 [5:34:42<25:51:55, 26.16s/it] {'loss': 0.0036, 'grad_norm': 1.8574439095672195, 'learning_rate': 8.303779748016798e-07, 'completion_length': 336.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6857143342494965, 'rewards/format_reward': 1.0, 'reward': 1.6857144236564636, 'reward_std': 0.05570555850863457, 'kl': 0.091064453125, 'epoch': 0.17}
17%|█▋ | 728/4286 [5:35:08<25:48:23, 26.11s/it] {'loss': 0.0091, 'grad_norm': 1.6791048987164348, 'learning_rate': 8.30144657022865e-07, 'completion_length': 319.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.627604216337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6097471714019775, 'reward_std': 0.12192143499851227, 'kl': 0.2265625, 'epoch': 0.17}
17%|█▋ | 729/4286 [5:35:31<25:00:15, 25.31s/it] {'loss': 0.0349, 'grad_norm': 4.021572130850116, 'learning_rate': 8.299113392440503e-07, 'completion_length': 290.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.5104167461395264, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4925596714019775, 'reward_std': 0.15135836601257324, 'kl': 0.87109375, 'epoch': 0.17}
17%|█▋ | 730/4286 [5:35:57<25:11:13, 25.50s/it] {'loss': 0.0112, 'grad_norm': 6.513811273078627, 'learning_rate': 8.296780214652356e-07, 'completion_length': 292.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7172620296478271, 'reward_std': 0.1301671490073204, 'kl': 0.2802734375, 'epoch': 0.17}
17%|█▋ | 731/4286 [5:36:25<25:51:02, 26.18s/it] {'loss': 0.0203, 'grad_norm': 5.833753060181384, 'learning_rate': 8.294447036864208e-07, 'completion_length': 349.9464569091797, 'rewards/only_full_func_accuracy_reward': 0.380952425301075, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.345238208770752, 'reward_std': 0.0952381044626236, 'kl': 0.508544921875, 'epoch': 0.17}
17%|█▋ | 732/4286 [5:36:50<25:29:10, 25.82s/it] {'loss': 0.0333, 'grad_norm': 6.1571472456947935, 'learning_rate': 8.292113859076061e-07, 'completion_length': 327.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.74702388048172, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7113096117973328, 'reward_std': 0.1498417854309082, 'kl': 0.83203125, 'epoch': 0.17}
17%|█▋ | 733/4286 [5:37:14<25:00:03, 25.33s/it] {'loss': 0.0806, 'grad_norm': 2.8036568416080527, 'learning_rate': 8.289780681287913e-07, 'completion_length': 322.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.5270833820104599, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4556548595428467, 'reward_std': 0.2726175859570503, 'kl': 2.015625, 'epoch': 0.17}
17%|█▋ | 734/4286 [5:37:40<25:07:52, 25.47s/it] {'loss': 0.0654, 'grad_norm': 2.7475997558920415, 'learning_rate': 8.287447503499766e-07, 'completion_length': 324.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.6436012089252472, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5543155670166016, 'reward_std': 0.2842262014746666, 'kl': 1.6328125, 'epoch': 0.17}
17%|█▋ | 735/4286 [5:38:03<24:31:47, 24.87s/it] {'loss': 0.064, 'grad_norm': 3.2528300196804327, 'learning_rate': 8.285114325711619e-07, 'completion_length': 275.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6101191639900208, 'reward_std': 0.2195577695965767, 'kl': 1.6015625, 'epoch': 0.17}
17%|█▋ | 736/4286 [5:38:26<23:42:42, 24.05s/it] {'loss': 0.0572, 'grad_norm': 2.2468987937648066, 'learning_rate': 8.282781147923471e-07, 'completion_length': 265.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7008928954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6473215222358704, 'reward_std': 0.16645299270749092, 'kl': 1.427734375, 'epoch': 0.17}
17%|█▋ | 737/4286 [5:38:49<23:23:58, 23.74s/it] {'loss': 0.0236, 'grad_norm': 4.154037189316109, 'learning_rate': 8.280447970135323e-07, 'completion_length': 297.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.09502441436052322, 'kl': 0.58984375, 'epoch': 0.17}
17%|█▋ | 738/4286 [5:39:13<23:28:22, 23.82s/it] {'loss': 0.0046, 'grad_norm': 1.9249777986729841, 'learning_rate': 8.278114792347177e-07, 'completion_length': 332.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7857143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7857144474983215, 'reward_std': 0.0759457778185606, 'kl': 0.11376953125, 'epoch': 0.17}
17%|█▋ | 739/4286 [5:39:37<23:45:19, 24.11s/it] {'loss': 0.0129, 'grad_norm': 4.854513543248768, 'learning_rate': 8.275781614559029e-07, 'completion_length': 315.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7232143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7232144474983215, 'reward_std': 0.05357143096625805, 'kl': 0.32177734375, 'epoch': 0.17}
17%|█▋ | 740/4286 [5:40:02<24:01:59, 24.40s/it] {'loss': 0.0011, 'grad_norm': 1.1734891122526936, 'learning_rate': 8.273448436770881e-07, 'completion_length': 337.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.05578739009797573, 'kl': 0.02862548828125, 'epoch': 0.17}
17%|█▋ | 741/4286 [5:40:29<24:33:02, 24.93s/it] {'loss': 0.0203, 'grad_norm': 55.561246315930816, 'learning_rate': 8.271115258982733e-07, 'completion_length': 309.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.5409226715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5230656266212463, 'reward_std': 0.09302986226975918, 'kl': 0.5078125, 'epoch': 0.17}
17%|█▋ | 742/4286 [5:40:53<24:31:58, 24.92s/it] {'loss': 0.0075, 'grad_norm': 4.805796126383999, 'learning_rate': 8.268782081194587e-07, 'completion_length': 312.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6086310148239136, 'rewards/format_reward': 1.0, 'reward': 1.6086310148239136, 'reward_std': 0.08717728778719902, 'kl': 0.18701171875, 'epoch': 0.17}
17%|█▋ | 743/4286 [5:41:20<25:00:39, 25.41s/it] {'loss': 0.0045, 'grad_norm': 21.28162505899726, 'learning_rate': 8.266448903406439e-07, 'completion_length': 342.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.5833333432674408, 'rewards/format_reward': 1.0, 'reward': 1.5833334922790527, 'reward_std': 0.04123930633068085, 'kl': 0.11126708984375, 'epoch': 0.17}
17%|█▋ | 744/4286 [5:41:47<25:26:23, 25.86s/it] {'loss': 0.0018, 'grad_norm': 1.115840371478265, 'learning_rate': 8.264115725618291e-07, 'completion_length': 338.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7872024774551392, 'rewards/format_reward': 1.0, 'reward': 1.787202537059784, 'reward_std': 0.008928571827709675, 'kl': 0.04541015625, 'epoch': 0.17}
17%|█▋ | 745/4286 [5:42:14<25:52:28, 26.31s/it] {'loss': 0.0016, 'grad_norm': 0.6114803053158029, 'learning_rate': 8.261782547830144e-07, 'completion_length': 347.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.6785715818405151, 'reward_std': 0.025651197880506516, 'kl': 0.04119873046875, 'epoch': 0.17}
17%|█▋ | 746/4286 [5:42:41<25:52:10, 26.31s/it] {'loss': 1.9299, 'grad_norm': 168099.78365557297, 'learning_rate': 8.259449370041997e-07, 'completion_length': 318.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.702381044626236, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.03436608985066414, 'kl': 48.0174560546875, 'epoch': 0.17}
17%|█▋ | 747/4286 [5:43:07<26:01:51, 26.48s/it] {'loss': 0.0017, 'grad_norm': 0.5681376742468837, 'learning_rate': 8.257116192253849e-07, 'completion_length': 337.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.7063988745212555, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6349703669548035, 'reward_std': 0.025614673271775246, 'kl': 0.041259765625, 'epoch': 0.17}
17%|█▋ | 748/4286 [5:43:31<25:05:41, 25.53s/it] {'loss': 0.0065, 'grad_norm': 1.5932034484889517, 'learning_rate': 8.254783014465702e-07, 'completion_length': 296.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.5163690894842148, 'rewards/format_reward': 1.0, 'reward': 1.516369104385376, 'reward_std': 0.062072642147541046, 'kl': 0.162353515625, 'epoch': 0.17}
17%|█▋ | 749/4286 [5:43:57<25:09:02, 25.60s/it] {'loss': 0.0022, 'grad_norm': 6.472185963493129, 'learning_rate': 8.252449836677554e-07, 'completion_length': 302.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6952381432056427, 'rewards/format_reward': 1.0, 'reward': 1.6952382326126099, 'reward_std': 0.09648564457893372, 'kl': 0.0555419921875, 'epoch': 0.17}
17%|█▋ | 750/4286 [5:44:22<24:57:06, 25.40s/it] {'loss': 0.0029, 'grad_norm': 4.751646440453488, 'learning_rate': 8.250116658889406e-07, 'completion_length': 298.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.0535714328289032, 'kl': 0.0712890625, 'epoch': 0.17}
18%|█▊ | 751/4286 [5:44:47<24:51:54, 25.32s/it] {'loss': 0.0023, 'grad_norm': 23.309217359019822, 'learning_rate': 8.247783481101259e-07, 'completion_length': 335.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7678572237491608, 'rewards/format_reward': 1.0, 'reward': 1.767857313156128, 'reward_std': 0.011904759332537651, 'kl': 0.0576171875, 'epoch': 0.18}
18%|█▊ | 752/4286 [5:45:11<24:34:04, 25.03s/it] {'loss': 0.0017, 'grad_norm': 0.4182796576951116, 'learning_rate': 8.245450303313112e-07, 'completion_length': 291.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7458333969116211, 'rewards/format_reward': 1.0, 'reward': 1.7458334565162659, 'reward_std': 0.02559152338653803, 'kl': 0.04248046875, 'epoch': 0.18}
18%|█▊ | 753/4286 [5:45:33<23:40:43, 24.13s/it] {'loss': 0.0017, 'grad_norm': 0.7719780560430648, 'learning_rate': 8.243117125524964e-07, 'completion_length': 266.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6175595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6175596117973328, 'reward_std': 0.03457976970821619, 'kl': 0.04150390625,
'epoch': 0.18} 18%|█▊ | 753/4286 [5:45:33<23:40:43, 24.13s/it] 18%|█▊ | 754/4286 [5:45:58<23:51:25, 24.32s/it] {'loss': 0.0019, 'grad_norm': 0.6936873695675186, 'learning_rate': 8.240783947736816e-07, 'completion_length': 339.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.029761903919279575, 'kl': 0.046875, 'epoch': 0.18} 18%|█▊ | 754/4286 [5:45:58<23:51:25, 24.32s/it] 18%|█▊ | 755/4286 [5:46:22<23:55:04, 24.39s/it] {'loss': 0.0023, 'grad_norm': 0.576146118352893, 'learning_rate': 8.23845076994867e-07, 'completion_length': 294.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.8452380895614624, 'rewards/format_reward': 1.0, 'reward': 1.8452382683753967, 'reward_std': 0.011904762126505375, 'kl': 0.0565185546875, 'epoch': 0.18} 18%|█▊ | 755/4286 [5:46:22<23:55:04, 24.39s/it] 18%|█▊ | 756/4286 [5:46:46<23:46:13, 24.24s/it] {'loss': 0.0055, 'grad_norm': 32.70919206907161, 'learning_rate': 8.236117592160522e-07, 'completion_length': 306.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6696430444717407, 'reward_std': 0.0416666679084301, 'kl': 0.1376953125, 'epoch': 0.18} 18%|█▊ | 756/4286 [5:46:46<23:46:13, 24.24s/it] 18%|█▊ | 757/4286 [5:47:10<23:33:47, 24.04s/it] {'loss': 0.0021, 'grad_norm': 0.44607340764675285, 'learning_rate': 8.233784414372374e-07, 'completion_length': 265.0, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.025651196017861366, 'kl': 0.051513671875, 'epoch': 0.18} 18%|█▊ | 757/4286 [5:47:10<23:33:47, 24.04s/it] 18%|█▊ | 758/4286 [5:47:35<23:56:28, 24.43s/it] {'loss': 0.0037, 'grad_norm': 1.814459180644247, 'learning_rate': 8.231451236584227e-07, 'completion_length': 319.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.616071492433548, 'rewards/format_reward': 1.0, 'reward': 1.6160715818405151, 'reward_std': 0.08928571455180645, 'kl': 0.091796875, 'epoch': 0.18} 18%|█▊ | 758/4286 [5:47:35<23:56:28, 24.43s/it] 18%|█▊ | 759/4286 [5:48:03<24:48:37, 25.32s/it] {'loss': 0.0017, 'grad_norm': 1.1279973443528724, 'learning_rate': 8.22911805879608e-07, 'completion_length': 331.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7861201763153076, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.7146916389465332, 'reward_std': 0.015359078533947468, 'kl': 0.0418701171875, 'epoch': 0.18} 18%|█▊ | 759/4286 [5:48:03<24:48:37, 25.32s/it] 18%|█▊ | 760/4286 [5:48:27<24:28:46, 24.99s/it] {'loss': 0.0021, 'grad_norm': 0.7669068447867721, 'learning_rate': 8.226784881007932e-07, 'completion_length': 262.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.0357142873108387, 'kl': 0.05230712890625, 'epoch': 0.18} 18%|█▊ | 760/4286 [5:48:27<24:28:46, 24.99s/it] 18%|█▊ | 761/4286 [5:48:51<24:13:34, 24.74s/it] {'loss': 0.0015, 'grad_norm': 0.6608158050658556, 'learning_rate': 8.224451703219785e-07, 'completion_length': 309.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.5610119700431824, 'rewards/format_reward': 1.0, 'reward': 1.5610120296478271, 'reward_std': 0.0267857164144516, 'kl': 0.0382080078125, 'epoch': 0.18} 18%|█▊ | 761/4286 [5:48:51<24:13:34, 24.74s/it] 18%|█▊ | 762/4286 [5:49:17<24:29:55, 25.03s/it] {'loss': 0.0055, 'grad_norm': 329.41876562698917, 'learning_rate': 
8.222118525431637e-07, 'completion_length': 297.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.7169643044471741, 'rewards/format_reward': 1.0, 'reward': 1.7169644236564636, 'reward_std': 0.020833331160247326, 'kl': 0.136962890625, 'epoch': 0.18} 18%|█▊ | 762/4286 [5:49:17<24:29:55, 25.03s/it] 18%|█▊ | 763/4286 [5:49:42<24:37:56, 25.17s/it] {'loss': 0.0017, 'grad_norm': 7.125359352247331, 'learning_rate': 8.21978534764349e-07, 'completion_length': 300.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.815476268529892, 'rewards/format_reward': 1.0, 'reward': 1.8154763579368591, 'reward_std': 0.013746432960033417, 'kl': 0.04296875, 'epoch': 0.18} 18%|█▊ | 763/4286 [5:49:42<24:37:56, 25.17s/it] 18%|█▊ | 764/4286 [5:50:08<24:46:40, 25.33s/it] {'loss': 0.002, 'grad_norm': 68.35087904197988, 'learning_rate': 8.217452169855342e-07, 'completion_length': 350.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.610119104385376, 'rewards/format_reward': 1.0, 'reward': 1.610119104385376, 'reward_std': 0.005952381528913975, 'kl': 0.0489501953125, 'epoch': 0.18} 18%|█▊ | 764/4286 [5:50:08<24:46:40, 25.33s/it] 18%|█▊ | 765/4286 [5:50:35<25:12:06, 25.77s/it] {'loss': 0.0016, 'grad_norm': 0.4152943377980223, 'learning_rate': 8.215118992067195e-07, 'completion_length': 342.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7199404835700989, 'rewards/format_reward': 1.0, 'reward': 1.7199405431747437, 'reward_std': 0.019238397479057312, 'kl': 0.04052734375, 'epoch': 0.18} 18%|█▊ | 765/4286 [5:50:35<25:12:06, 25.77s/it][2025-03-02 20:48:24,496] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 18%|█▊ | 766/4286 [5:51:02<25:33:02, 26.13s/it] {'loss': 0.0022, 'grad_norm': 0.5046245159089385, 'learning_rate': 8.212785814279047e-07, 'completion_length': 332.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291667461395264, 'reward_std': 0.026572037022560835, 'kl': 0.0560302734375, 'epoch': 0.18} 18%|█▊ | 766/4286 [5:51:02<25:33:02, 26.13s/it] 18%|█▊ | 767/4286 [5:51:27<25:19:52, 25.91s/it] {'loss': 0.0021, 'grad_norm': 1.77638626257079, 'learning_rate': 8.2104526364909e-07, 'completion_length': 325.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6158008873462677, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5622295141220093, 'reward_std': 0.11381752230226994, 'kl': 0.051513671875, 'epoch': 0.18} 18%|█▊ | 767/4286 [5:51:27<25:19:52, 25.91s/it] 18%|█▊ | 768/4286 [5:51:54<25:32:01, 26.13s/it] {'loss': 0.0053, 'grad_norm': 1.2318253983705865, 'learning_rate': 8.208119458702753e-07, 'completion_length': 310.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6894983649253845, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6180697679519653, 'reward_std': 0.05975540913641453, 'kl': 0.1339111328125, 'epoch': 0.18} 18%|█▊ | 768/4286 [5:51:54<25:32:01, 26.13s/it] 18%|█▊ | 769/4286 [5:52:20<25:42:24, 26.31s/it] {'loss': 0.0015, 'grad_norm': 0.11827812600300222, 'learning_rate': 8.205786280914605e-07, 'completion_length': 349.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7458333671092987, 'rewards/format_reward': 1.0, 'reward': 1.7458334565162659, 'reward_std': 0.025505101308226585, 'kl': 0.03729248046875, 'epoch': 0.18} 18%|█▊ | 769/4286 [5:52:20<25:42:24, 26.31s/it] 18%|█▊ | 770/4286 [5:52:46<25:27:38, 26.07s/it] {'loss': 0.0027, 'grad_norm': 4.689536863110915, 'learning_rate': 8.203453103126457e-07, 'completion_length': 273.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6180556118488312, 'rewards/format_reward': 1.0, 'reward': 1.6180556416511536, 'reward_std': 0.03266687132418156, 'kl': 0.06787109375, 'epoch': 0.18} 18%|█▊ | 770/4286 [5:52:46<25:27:38, 26.07s/it] 18%|█▊ | 771/4286 [5:53:10<24:47:39, 25.39s/it] {'loss': 0.0017, 'grad_norm': 0.49094214826012367, 'learning_rate': 8.201119925338311e-07, 'completion_length': 277.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 1.0, 'reward': 1.7529762387275696, 'reward_std': 0.01785714365541935, 'kl': 0.042236328125, 'epoch': 0.18} 18%|█▊ | 771/4286 [5:53:10<24:47:39, 25.39s/it] 18%|█▊ | 772/4286 [5:53:35<24:38:43, 25.25s/it] {'loss': 0.0037, 'grad_norm': 46.988961880712935, 'learning_rate': 8.198786747550163e-07, 'completion_length': 341.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.5639881491661072, 'rewards/format_reward': 1.0, 'reward': 1.563988208770752, 'reward_std': 0.026785715483129025, 'kl': 0.0936279296875, 'epoch': 0.18} 18%|█▊ | 772/4286 [5:53:35<24:38:43, 25.25s/it] 18%|█▊ | 773/4286 [5:53:59<24:18:48, 24.92s/it] {'loss': 0.0016, 'grad_norm': 0.841220718097934, 'learning_rate': 8.196453569762015e-07, 'completion_length': 305.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7767857015132904, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.010309826582670212, 'kl': 0.04052734375, 'epoch': 0.18} 18%|█▊ | 773/4286 
[5:53:59<24:18:48, 24.92s/it] 18%|█▊ | 774/4286 [5:54:26<24:55:33, 25.55s/it] {'loss': 0.0017, 'grad_norm': 0.3565173816652242, 'learning_rate': 8.194120391973867e-07, 'completion_length': 352.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.5885416865348816, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.517113208770752, 'reward_std': 0.011797921732068062, 'kl': 0.0418701171875, 'epoch': 0.18} 18%|█▊ | 774/4286 [5:54:26<24:55:33, 25.55s/it] 18%|█▊ | 775/4286 [5:54:53<25:22:05, 26.01s/it] {'loss': 0.002, 'grad_norm': 1.6640930768062048, 'learning_rate': 8.19178721418572e-07, 'completion_length': 357.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6337160170078278, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6158589124679565, 'reward_std': 0.08903678506612778, 'kl': 0.04931640625, 'epoch': 0.18} 18%|█▊ | 775/4286 [5:54:53<25:22:05, 26.01s/it] 18%|█▊ | 776/4286 [5:55:19<25:20:52, 26.00s/it] {'loss': 0.002, 'grad_norm': 1.5409781505236448, 'learning_rate': 8.189454036397573e-07, 'completion_length': 326.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5267857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5267857313156128, 'reward_std': 0.005952378269284964, 'kl': 0.0504150390625, 'epoch': 0.18} 18%|█▊ | 776/4286 [5:55:19<25:20:52, 26.00s/it] 18%|█▊ | 777/4286 [5:55:43<24:54:22, 25.55s/it] {'loss': 0.0019, 'grad_norm': 33.095707175428636, 'learning_rate': 8.187120858609425e-07, 'completion_length': 284.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.0446428582072258, 'kl': 0.0472412109375, 'epoch': 0.18} 18%|█▊ | 777/4286 [5:55:43<24:54:22, 25.55s/it] 18%|█▊ | 778/4286 [5:56:11<25:32:32, 26.21s/it] {'loss': 0.0018, 'grad_norm': 0.5863997452662569, 'learning_rate': 8.184787680821278e-07, 'completion_length': 342.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7455357611179352, 'rewards/format_reward': 1.0, 'reward': 1.7455357909202576, 'reward_std': 0.03839066671207547, 'kl': 0.044189453125, 'epoch': 0.18} 18%|█▊ | 778/4286 [5:56:11<25:32:32, 26.21s/it] 18%|█▊ | 779/4286 [5:56:39<26:08:00, 26.83s/it] {'loss': 0.0017, 'grad_norm': 0.46003696455113435, 'learning_rate': 8.18245450303313e-07, 'completion_length': 326.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7767857909202576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7589287161827087, 'reward_std': 0.0687921941280365, 'kl': 0.0419921875, 'epoch': 0.18} 18%|█▊ | 779/4286 [5:56:39<26:08:00, 26.83s/it] 18%|█▊ | 780/4286 [5:57:05<25:55:52, 26.63s/it] {'loss': 32146.0039, 'grad_norm': 99568901.52059035, 'learning_rate': 8.180121325244983e-07, 'completion_length': 329.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.061365483328700066, 'kl': 798720.0191650391, 'epoch': 0.18} 18%|█▊ | 780/4286 [5:57:05<25:55:52, 26.63s/it] 18%|█▊ | 781/4286 [5:57:35<26:37:20, 27.34s/it] {'loss': 0.004, 'grad_norm': 39.27883304548436, 'learning_rate': 8.177788147456836e-07, 'completion_length': 372.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.6608090102672577, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5893804430961609, 'reward_std': 0.13894526660442352, 'kl': 0.10009765625, 'epoch': 0.18} 18%|█▊ | 781/4286 [5:57:35<26:37:20, 27.34s/it] 18%|█▊ | 782/4286 [5:57:58<25:35:47, 26.30s/it] {'loss': 0.0018, 'grad_norm': 1.5615694867313659, 
'learning_rate': 8.175454969668688e-07, 'completion_length': 268.7678756713867, 'rewards/only_full_func_accuracy_reward': 0.7276785969734192, 'rewards/format_reward': 1.0, 'reward': 1.727678656578064, 'reward_std': 0.0680250208824873, 'kl': 0.04443359375, 'epoch': 0.18} 18%|█▊ | 782/4286 [5:57:58<25:35:47, 26.30s/it] 18%|█▊ | 783/4286 [5:58:25<25:49:17, 26.54s/it] {'loss': 0.0015, 'grad_norm': 0.6443350630456025, 'learning_rate': 8.17312179188054e-07, 'completion_length': 349.89288330078125, 'rewards/only_full_func_accuracy_reward': 0.7595238387584686, 'rewards/format_reward': 1.0, 'reward': 1.7595239877700806, 'reward_std': 0.04378413036465645, 'kl': 0.036376953125, 'epoch': 0.18} 18%|█▊ | 783/4286 [5:58:25<25:49:17, 26.54s/it] 18%|█▊ | 784/4286 [5:58:51<25:31:21, 26.24s/it] {'loss': 0.1465, 'grad_norm': 3208.3164734521056, 'learning_rate': 8.170788614092394e-07, 'completion_length': 359.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.7235119044780731, 'rewards/format_reward': 1.0, 'reward': 1.7235119342803955, 'reward_std': 0.03352411463856697, 'kl': 3.6473388671875, 'epoch': 0.18} 18%|█▊ | 784/4286 [5:58:51<25:31:21, 26.24s/it] 18%|█▊ | 785/4286 [5:59:14<24:35:52, 25.29s/it] {'loss': 0.0022, 'grad_norm': 3.1702766868889882, 'learning_rate': 8.168455436304246e-07, 'completion_length': 241.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715818405151, 'reward_std': 0.07971610128879547, 'kl': 0.054931640625, 'epoch': 0.18} 18%|█▊ | 785/4286 [5:59:14<24:35:52, 25.29s/it] 18%|█▊ | 786/4286 [5:59:41<24:57:05, 25.66s/it] {'loss': 0.0027, 'grad_norm': 4.159060494099237, 'learning_rate': 8.166122258516098e-07, 'completion_length': 322.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6339287161827087, 'reward_std': 0.048439450562000275, 'kl': 0.0682373046875, 'epoch': 0.18} 18%|█▊ | 786/4286 [5:59:41<24:57:05, 25.66s/it] 18%|█▊ | 787/4286 [6:00:04<24:22:48, 25.08s/it] {'loss': 0.0076, 'grad_norm': 18.309930166438544, 'learning_rate': 8.16378908072795e-07, 'completion_length': 301.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6800595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.0982142947614193, 'kl': 0.1912841796875, 'epoch': 0.18} 18%|█▊ | 787/4286 [6:00:04<24:22:48, 25.08s/it] 18%|█▊ | 788/4286 [6:00:31<24:47:03, 25.51s/it] {'loss': 0.0018, 'grad_norm': 0.30712579046951016, 'learning_rate': 8.161455902939804e-07, 'completion_length': 331.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.008928571827709675, 'kl': 0.04400634765625, 'epoch': 0.18} 18%|█▊ | 788/4286 [6:00:31<24:47:03, 25.51s/it] 18%|█▊ | 789/4286 [6:00:56<24:35:05, 25.31s/it] {'loss': 0.002, 'grad_norm': 2.411181383636963, 'learning_rate': 8.159122725151656e-07, 'completion_length': 296.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7098214030265808, 'rewards/format_reward': 1.0, 'reward': 1.7098215222358704, 'reward_std': 0.035813162103295326, 'kl': 0.050048828125, 'epoch': 0.18} 18%|█▊ | 789/4286 [6:00:56<24:35:05, 25.31s/it] 18%|█▊ | 790/4286 [6:01:22<24:43:35, 25.46s/it] {'loss': 0.0017, 'grad_norm': 1.1605238115768088, 'learning_rate': 8.156789547363508e-07, 'completion_length': 322.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 1.0, 'reward': 
1.6294643878936768, 'reward_std': 0.04304791986942291, 'kl': 0.0433349609375, 'epoch': 0.18} 18%|█▊ | 790/4286 [6:01:22<24:43:35, 25.46s/it] 18%|█▊ | 791/4286 [6:01:46<24:26:55, 25.18s/it] {'loss': 0.0021, 'grad_norm': 1.0076310877548051, 'learning_rate': 8.154456369575361e-07, 'completion_length': 300.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7172619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7172620296478271, 'reward_std': 0.038476793095469475, 'kl': 0.0513916015625, 'epoch': 0.18} 18%|█▊ | 791/4286 [6:01:46<24:26:55, 25.18s/it] 18%|█▊ | 792/4286 [6:02:11<24:15:12, 24.99s/it] {'loss': 0.0142, 'grad_norm': 0.9721416956056353, 'learning_rate': 8.152123191787214e-07, 'completion_length': 314.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7232143580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7053572535514832, 'reward_std': 0.08928570710122585, 'kl': 0.3544921875, 'epoch': 0.18} 18%|█▊ | 792/4286 [6:02:11<24:15:12, 24.99s/it] 19%|█▊ | 793/4286 [6:02:36<24:22:20, 25.12s/it] {'loss': 0.0042, 'grad_norm': 14.151416332551454, 'learning_rate': 8.149790013999066e-07, 'completion_length': 326.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7083333432674408, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.048786623403429985, 'kl': 0.1051025390625, 'epoch': 0.19} 19%|█▊ | 793/4286 [6:02:36<24:22:20, 25.12s/it] 19%|█▊ | 794/4286 [6:02:59<23:49:19, 24.56s/it] {'loss': 0.0018, 'grad_norm': 3.6022632879577694, 'learning_rate': 8.147456836210919e-07, 'completion_length': 292.7143020629883, 'rewards/only_full_func_accuracy_reward': 0.6279762089252472, 'rewards/format_reward': 1.0, 'reward': 1.6279762387275696, 'reward_std': 0.03847679682075977, 'kl': 0.0460205078125, 'epoch': 0.19} 19%|█▊ | 794/4286 [6:02:59<23:49:19, 24.56s/it] 19%|█▊ | 795/4286 [6:03:27<24:43:07, 25.49s/it] {'loss': 0.0017, 'grad_norm': 0.4971625807216824, 'learning_rate': 8.145123658422771e-07, 'completion_length': 342.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.8323768079280853, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.7609482407569885, 'reward_std': 0.04934793524444103, 'kl': 0.04345703125, 'epoch': 0.19} 19%|█▊ | 795/4286 [6:03:27<24:43:07, 25.49s/it][2025-03-02 21:01:16,248] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▊ | 796/4286 [6:03:53<24:58:45, 25.77s/it] {'loss': 0.0027, 'grad_norm': 0.963702479232081, 'learning_rate': 8.142790480634624e-07, 'completion_length': 298.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.6474207043647766, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6117064356803894, 'reward_std': 0.11360103264451027, 'kl': 0.067138671875, 'epoch': 0.19} 19%|█▊ | 796/4286 [6:03:53<24:58:45, 25.77s/it] 19%|█▊ | 797/4286 [6:04:20<25:13:41, 26.03s/it] {'loss': 0.0017, 'grad_norm': 0.7862136328651312, 'learning_rate': 8.140457302846476e-07, 'completion_length': 333.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.7056547999382019, 'rewards/format_reward': 1.0, 'reward': 1.7056549191474915, 'reward_std': 0.026810658164322376, 'kl': 0.0428466796875, 'epoch': 0.19} 19%|█▊ | 797/4286 [6:04:20<25:13:41, 26.03s/it] 19%|█▊ | 798/4286 [6:04:47<25:32:27, 26.36s/it] {'loss': 0.0018, 'grad_norm': 0.80263337158656, 'learning_rate': 8.138124125058329e-07, 'completion_length': 327.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7096088528633118, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6381803750991821, 'reward_std': 0.12318087369203568, 'kl': 0.046142578125, 'epoch': 0.19} 19%|█▊ | 798/4286 [6:04:47<25:32:27, 26.36s/it] 19%|█▊ | 799/4286 [6:05:14<25:45:01, 26.58s/it] {'loss': 0.0016, 'grad_norm': 0.9011916143797811, 'learning_rate': 8.135790947270181e-07, 'completion_length': 330.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6190477013587952, 'reward_std': 0.14402472972869873, 'kl': 0.0399169921875, 'epoch': 0.19} 19%|█▊ | 799/4286 [6:05:14<25:45:01, 26.58s/it] 19%|█▊ | 800/4286 [6:05:42<26:10:19, 27.03s/it] {'loss': 0.0015, 'grad_norm': 0.2360094275331167, 'learning_rate': 8.133457769482033e-07, 'completion_length': 325.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8008928596973419, 'rewards/format_reward': 1.0, 'reward': 1.800892949104309, 'reward_std': 0.01726190373301506, 'kl': 0.0380859375, 'epoch': 0.19} 19%|█▊ | 800/4286 [6:05:42<26:10:19, 27.03s/it] 19%|█▊ | 801/4286 [6:09:24<82:36:02, 85.33s/it] {'loss': 0.0024, 'grad_norm': 7.928175070568725, 'learning_rate': 8.131124591693887e-07, 'completion_length': 319.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6830357909202576, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.05746845155954361, 'kl': 0.058837890625, 'epoch': 0.19} 19%|█▊ | 801/4286 [6:09:24<82:36:02, 85.33s/it] 19%|█▊ | 802/4286 [6:09:51<65:38:36, 67.83s/it] {'loss': 0.0015, 'grad_norm': 2.32247909991782, 'learning_rate': 8.128791413905739e-07, 'completion_length': 295.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.784226268529892, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7663691639900208, 'reward_std': 0.06616532057523727, 'kl': 0.03662109375, 'epoch': 0.19} 19%|█▊ | 802/4286 [6:09:51<65:38:36, 67.83s/it][2025-03-02 21:07:39,000] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▊ | 803/4286 [6:10:16<53:19:22, 55.11s/it] {'loss': 0.0025, 'grad_norm': 0.3648840061724146, 'learning_rate': 8.126458236117591e-07, 'completion_length': 287.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.01785714365541935, 'kl': 0.0628662109375, 'epoch': 0.19} 19%|█▊ | 803/4286 [6:10:16<53:19:22, 55.11s/it][2025-03-02 21:08:06,108] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 804/4286 [6:10:43<45:10:51, 46.71s/it] {'loss': 0.0018, 'grad_norm': 0.7138936621868808, 'learning_rate': 8.124125058329444e-07, 'completion_length': 335.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949405670166016, 'reward_std': 0.0884707230143249, 'kl': 0.0439453125, 'epoch': 0.19} 19%|█▉ | 804/4286 [6:10:43<45:10:51, 46.71s/it] 19%|█▉ | 805/4286 [6:11:10<39:16:15, 40.61s/it] {'loss': 0.0019, 'grad_norm': 0.3905647541641113, 'learning_rate': 8.121791880541297e-07, 'completion_length': 322.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7559524476528168, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.01877797581255436, 'kl': 0.0482177734375, 'epoch': 0.19} 19%|█▉ | 805/4286 [6:11:10<39:16:15, 40.61s/it] 19%|█▉ | 806/4286 [6:11:34<34:38:58, 35.84s/it] {'loss': 0.0044, 'grad_norm': 0.4177088310821052, 'learning_rate': 8.119458702753149e-07, 'completion_length': 303.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.02816697023808956, 'kl': 0.109375, 'epoch': 0.19} 19%|█▉ | 806/4286 [6:11:34<34:38:58, 35.84s/it][2025-03-02 21:09:23,232] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 807/4286 [6:12:00<31:47:33, 32.90s/it] {'loss': 0.002, 'grad_norm': 1.1462740176215083, 'learning_rate': 8.117125524965002e-07, 'completion_length': 266.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.5803571939468384, 'rewards/format_reward': 1.0, 'reward': 1.580357313156128, 'reward_std': 0.04602411389350891, 'kl': 0.0501708984375, 'epoch': 0.19} 19%|█▉ | 807/4286 [6:12:00<31:47:33, 32.90s/it] 19%|█▉ | 808/4286 [6:12:27<29:57:03, 31.00s/it] {'loss': 0.0186, 'grad_norm': 0.47767282248926307, 'learning_rate': 8.114792347176854e-07, 'completion_length': 324.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6488096117973328, 'reward_std': 0.08145630359649658, 'kl': 0.4671630859375, 'epoch': 0.19} 19%|█▉ | 808/4286 [6:12:27<29:57:03, 31.00s/it][2025-03-02 21:10:18,595] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 809/4286 [6:12:56<29:18:01, 30.34s/it] {'loss': 0.0017, 'grad_norm': 0.37796180311741956, 'learning_rate': 8.112459169388707e-07, 'completion_length': 336.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.77827388048172, 'rewards/format_reward': 1.0, 'reward': 1.77827388048172, 'reward_std': 0.008928571827709675, 'kl': 0.042724609375, 'epoch': 0.19} 19%|█▉ | 809/4286 [6:12:56<29:18:01, 30.34s/it][2025-03-02 21:10:45,600] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 810/4286 [6:13:23<28:19:38, 29.34s/it] {'loss': 0.3354, 'grad_norm': 56415.017065170556, 'learning_rate': 8.110125991600559e-07, 'completion_length': 366.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6830357909202576, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.07029405608773232, 'kl': 8.3983154296875, 'epoch': 0.19} 19%|█▉ | 810/4286 [6:13:23<28:19:38, 29.34s/it] 19%|█▉ | 811/4286 [6:13:47<26:45:54, 27.73s/it] {'loss': 0.0162, 'grad_norm': 3.3280486134905134, 'learning_rate': 8.107792813812412e-07, 'completion_length': 297.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529763579368591, 'reward_std': 0.02908780612051487, 'kl': 0.403076171875, 'epoch': 0.19} 19%|█▉ | 811/4286 [6:13:47<26:45:54, 27.73s/it] 19%|█▉ | 812/4286 [6:14:10<25:34:00, 26.49s/it] {'loss': 0.0016, 'grad_norm': 0.19452868042270086, 'learning_rate': 8.105459636024264e-07, 'completion_length': 295.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.815476268529892, 'rewards/format_reward': 1.0, 'reward': 1.8154763579368591, 'reward_std': 0.011904759332537651, 'kl': 0.0399169921875, 'epoch': 0.19} 19%|█▉ | 812/4286 [6:14:10<25:34:00, 26.49s/it] 19%|█▉ | 813/4286 [6:14:36<25:16:39, 26.20s/it] {'loss': 0.0023, 'grad_norm': 2.2233738707879676, 'learning_rate': 8.103126458236117e-07, 'completion_length': 328.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6294642984867096, 'rewards/format_reward': 1.0, 'reward': 1.629464328289032, 'reward_std': 0.06593661196529865, 'kl': 0.056884765625, 'epoch': 0.19} 19%|█▉ | 813/4286 [6:14:36<25:16:39, 26.20s/it] 19%|█▉ | 814/4286 [6:15:01<24:52:21, 25.79s/it] {'loss': 0.008, 'grad_norm': 2.6762963690629498, 'learning_rate': 8.10079328044797e-07, 'completion_length': 271.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.0, 'kl': 0.2000732421875, 'epoch': 0.19} 19%|█▉ | 814/4286 [6:15:01<24:52:21, 25.79s/it] 19%|█▉ | 815/4286 [6:15:26<24:37:48, 25.55s/it] {'loss': 165.9578, 'grad_norm': 2114956.200103722, 'learning_rate': 8.098460102659822e-07, 'completion_length': 291.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6116071939468384, 'rewards/format_reward': 1.0, 'reward': 1.611607313156128, 'reward_std': 0.08177145570516586, 'kl': 4128.021728515625, 'epoch': 0.19} 19%|█▉ | 815/4286 [6:15:26<24:37:48, 25.55s/it] 19%|█▉ | 816/4286 [6:15:50<24:22:26, 25.29s/it] {'loss': 0.0105, 'grad_norm': 1.3492930645547105, 'learning_rate': 8.096126924871674e-07, 'completion_length': 307.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.4508928954601288, 'rewards/format_reward': 1.0, 'reward': 1.450892984867096, 'reward_std': 0.029461252503097057, 'kl': 0.26220703125, 'epoch': 0.19} 19%|█▉ | 816/4286 [6:15:50<24:22:26, 25.29s/it][2025-03-02 21:13:38,275] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 817/4286 [6:16:15<24:18:25, 25.22s/it] {'loss': 0.0099, 'grad_norm': 2.410210236593531, 'learning_rate': 8.093793747083528e-07, 'completion_length': 301.875, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.059310128912329674, 'kl': 0.248046875, 'epoch': 0.19} 19%|█▉ | 817/4286 [6:16:15<24:18:25, 25.22s/it] 19%|█▉ | 818/4286 [6:16:41<24:20:49, 25.27s/it] {'loss': 0.0019, 'grad_norm': 1.1321374282955265, 'learning_rate': 8.09146056929538e-07, 'completion_length': 297.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6160714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.04602411389350891, 'kl': 0.04833984375, 'epoch': 0.19} 19%|█▉ | 818/4286 [6:16:41<24:20:49, 25.27s/it] 19%|█▉ | 819/4286 [6:17:06<24:11:38, 25.12s/it] {'loss': 0.0021, 'grad_norm': 1.3622463555042004, 'learning_rate': 8.089127391507232e-07, 'completion_length': 321.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7752977013587952, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.02611161395907402, 'kl': 0.053466796875, 'epoch': 0.19} 19%|█▉ | 819/4286 [6:17:06<24:11:38, 25.12s/it] 19%|█▉ | 820/4286 [6:17:30<23:57:04, 24.88s/it] {'loss': 0.0241, 'grad_norm': 1.7292964070777053, 'learning_rate': 8.086794213719084e-07, 'completion_length': 321.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.639881044626236, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.08934850618243217, 'kl': 0.6015625, 'epoch': 0.19} 19%|█▉ | 820/4286 [6:17:30<23:57:04, 24.88s/it] 19%|█▉ | 821/4286 [6:17:53<23:27:57, 24.38s/it] {'loss': 0.0055, 'grad_norm': 1.6603260161903066, 'learning_rate': 8.084461035930938e-07, 'completion_length': 274.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.09481073915958405, 'kl': 0.138427734375, 'epoch': 0.19} 19%|█▉ | 821/4286 [6:17:53<23:27:57, 24.38s/it] 19%|█▉ | 822/4286 [6:18:17<23:19:30, 24.24s/it] {'loss': 0.0048, 'grad_norm': 1.0839424300488447, 'learning_rate': 8.08212785814279e-07, 'completion_length': 293.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.04761904594488442, 'kl': 0.12109375, 'epoch': 0.19} 19%|█▉ | 822/4286 [6:18:17<23:19:30, 24.24s/it][2025-03-02 21:16:05,959] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 823/4286 [6:18:43<23:51:02, 24.79s/it] {'loss': 0.0133, 'grad_norm': 1.00990526676561, 'learning_rate': 8.079794680354642e-07, 'completion_length': 325.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.02976190857589245, 'kl': 0.331298828125, 'epoch': 0.19} 19%|█▉ | 823/4286 [6:18:43<23:51:02, 24.79s/it] 19%|█▉ | 824/4286 [6:19:07<23:43:09, 24.66s/it] {'loss': 0.0245, 'grad_norm': 1.2488887682989565, 'learning_rate': 8.077461502566495e-07, 'completion_length': 313.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.11600670218467712, 'kl': 0.6119384765625, 'epoch': 0.19} 19%|█▉ | 824/4286 [6:19:07<23:43:09, 24.66s/it][2025-03-02 21:16:57,125] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 825/4286 [6:19:34<24:19:46, 25.31s/it] {'loss': 0.0017, 'grad_norm': 0.2731190714424204, 'learning_rate': 8.075128324778347e-07, 'completion_length': 324.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 0.015801792964339256, 'kl': 0.043212890625, 'epoch': 0.19} 19%|█▉ | 825/4286 [6:19:34<24:19:46, 25.31s/it] 19%|█▉ | 826/4286 [6:20:01<24:37:15, 25.62s/it] {'loss': 0.0014, 'grad_norm': 0.275157421021684, 'learning_rate': 8.0727951469902e-07, 'completion_length': 340.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.011904762359336019, 'kl': 0.035888671875, 'epoch': 0.19} 19%|█▉ | 826/4286 [6:20:01<24:37:15, 25.62s/it][2025-03-02 21:17:48,728] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 827/4286 [6:20:26<24:30:39, 25.51s/it] {'loss': 0.002, 'grad_norm': 0.34950168963337386, 'learning_rate': 8.070461969202053e-07, 'completion_length': 321.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7008928954601288, 'rewards/format_reward': 1.0, 'reward': 1.700892984867096, 'reward_std': 0.019238397479057312, 'kl': 0.050537109375, 'epoch': 0.19} 19%|█▉ | 827/4286 [6:20:26<24:30:39, 25.51s/it] 19%|█▉ | 828/4286 [6:20:50<24:04:06, 25.06s/it] {'loss': 0.0192, 'grad_norm': 0.4787582088551123, 'learning_rate': 8.068128791413905e-07, 'completion_length': 307.5, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.65327388048172, 'reward_std': 0.039858050644397736, 'kl': 0.47705078125, 'epoch': 0.19} 19%|█▉ | 828/4286 [6:20:50<24:04:06, 25.06s/it] 19%|█▉ | 829/4286 [6:21:16<24:29:36, 25.51s/it] {'loss': 0.049, 'grad_norm': 17.4395016892256, 'learning_rate': 8.065795613625757e-07, 'completion_length': 302.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.642857313156128, 'reward_std': 0.12776251137256622, 'kl': 1.226806640625, 'epoch': 0.19} 19%|█▉ | 829/4286 [6:21:16<24:29:36, 25.51s/it][2025-03-02 21:19:04,567] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 830/4286 [6:21:42<24:25:20, 25.44s/it] {'loss': 0.0241, 'grad_norm': 1.016657229896295, 'learning_rate': 8.063462435837611e-07, 'completion_length': 301.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.049460725858807564, 'kl': 0.6044921875, 'epoch': 0.19} 19%|█▉ | 830/4286 [6:21:42<24:25:20, 25.44s/it][2025-03-02 21:19:29,757] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 831/4286 [6:22:07<24:20:34, 25.36s/it] {'loss': 0.0231, 'grad_norm': 8.630896970333097, 'learning_rate': 8.061129258049463e-07, 'completion_length': 279.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127978205680847, 'reward_std': 0.08209849894046783, 'kl': 0.57666015625, 'epoch': 0.19} 19%|█▉ | 831/4286 [6:22:07<24:20:34, 25.36s/it] 19%|█▉ | 832/4286 [6:22:34<24:55:56, 25.99s/it] {'loss': 0.0032, 'grad_norm': 1.354638671078419, 'learning_rate': 8.058796080261315e-07, 'completion_length': 335.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6130953431129456, 'reward_std': 0.10611418634653091, 'kl': 0.081298828125, 'epoch': 0.19} 19%|█▉ | 832/4286 [6:22:34<24:55:56, 25.99s/it] 19%|█▉ | 833/4286 [6:23:00<24:54:07, 25.96s/it] {'loss': 0.067, 'grad_norm': 1.6255905415310399, 'learning_rate': 8.056462902473167e-07, 'completion_length': 311.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7544643580913544, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7187501192092896, 'reward_std': 0.19738724827766418, 'kl': 1.673828125, 'epoch': 0.19} 19%|█▉ | 833/4286 [6:23:00<24:54:07, 25.96s/it][2025-03-02 21:20:50,211] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 19%|█▉ | 834/4286 [6:23:27<25:13:34, 26.31s/it] {'loss': 0.0671, 'grad_norm': 2.7449221028102393, 'learning_rate': 8.054129724685021e-07, 'completion_length': 303.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6404762268066406, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5690476894378662, 'reward_std': 0.17039990425109863, 'kl': 1.67724609375, 'epoch': 0.19} 19%|█▉ | 834/4286 [6:23:27<25:13:34, 26.31s/it] 19%|█▉ | 835/4286 [6:23:52<24:52:27, 25.95s/it] {'loss': 0.0045, 'grad_norm': 0.6160092157519204, 'learning_rate': 8.051796546896873e-07, 'completion_length': 279.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.6398810148239136, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.01785714365541935, 'kl': 0.1123046875, 'epoch': 0.19} 19%|█▉ | 835/4286 [6:23:52<24:52:27, 25.95s/it] 20%|█▉ | 836/4286 [6:24:17<24:33:53, 25.63s/it] {'loss': 0.0081, 'grad_norm': 1.7477394002071174, 'learning_rate': 8.049463369108725e-07, 'completion_length': 277.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.8809524178504944, 'rewards/format_reward': 1.0, 'reward': 1.880952537059784, 'reward_std': 0.0476190485060215, 'kl': 0.20166015625, 'epoch': 0.2} 20%|█▉ | 836/4286 [6:24:17<24:33:53, 25.63s/it] 20%|█▉ | 837/4286 [6:24:41<23:58:15, 25.02s/it] {'loss': 0.0111, 'grad_norm': 3.9153566140073566, 'learning_rate': 8.047130191320578e-07, 'completion_length': 306.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.766369104385376, 'rewards/format_reward': 1.0, 'reward': 1.7663692235946655, 'reward_std': 0.0505952388048172, 'kl': 
0.2763671875, 'epoch': 0.2} 20%|█▉ | 837/4286 [6:24:41<23:58:15, 25.02s/it] 20%|█▉ | 838/4286 [6:25:04<23:23:44, 24.43s/it] {'loss': 0.1172, 'grad_norm': 7.388825190832966, 'learning_rate': 8.044797013532431e-07, 'completion_length': 269.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6190476715564728, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.583333432674408, 'reward_std': 0.12068404257297516, 'kl': 2.9296875, 'epoch': 0.2} 20%|█▉ | 838/4286 [6:25:04<23:23:44, 24.43s/it] 20%|█▉ | 839/4286 [6:25:28<23:17:21, 24.32s/it] {'loss': 0.0066, 'grad_norm': 3.3065607754131388, 'learning_rate': 8.042463835744283e-07, 'completion_length': 310.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.10204490274190903, 'kl': 0.16455078125, 'epoch': 0.2} 20%|█▉ | 839/4286 [6:25:28<23:17:21, 24.32s/it] 20%|█▉ | 840/4286 [6:25:55<23:59:56, 25.07s/it] {'loss': 0.0401, 'grad_norm': 2.5513432178335984, 'learning_rate': 8.040130657956136e-07, 'completion_length': 332.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.49821431934833527, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.444642961025238, 'reward_std': 0.18121915310621262, 'kl': 1.0, 'epoch': 0.2} 20%|█▉ | 840/4286 [6:25:55<23:59:56, 25.07s/it] 20%|█▉ | 841/4286 [6:26:19<23:51:11, 24.93s/it] {'loss': 0.0047, 'grad_norm': 1.434831149636645, 'learning_rate': 8.037797480167988e-07, 'completion_length': 326.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.08511402644217014, 'kl': 0.1171875, 'epoch': 0.2} 20%|█▉ | 841/4286 [6:26:19<23:51:11, 24.93s/it] 20%|█▉ | 842/4286 [6:26:45<24:05:53, 25.19s/it] {'loss': 0.0028, 'grad_norm': 2.25604064267987, 'learning_rate': 8.035464302379841e-07, 'completion_length': 312.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.06412798725068569, 'kl': 0.069580078125, 'epoch': 0.2} 20%|█▉ | 842/4286 [6:26:45<24:05:53, 25.19s/it] 20%|█▉ | 843/4286 [6:27:10<23:53:16, 24.98s/it] {'loss': 0.003, 'grad_norm': 1.0079108192489479, 'learning_rate': 8.033131124591693e-07, 'completion_length': 329.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.023809521459043026, 'kl': 0.07568359375, 'epoch': 0.2} 20%|█▉ | 843/4286 [6:27:10<23:53:16, 24.98s/it] 20%|█▉ | 844/4286 [6:27:34<23:48:50, 24.91s/it] {'loss': 0.0051, 'grad_norm': 0.86014577659116, 'learning_rate': 8.030797946803546e-07, 'completion_length': 318.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7315476536750793, 'rewards/format_reward': 1.0, 'reward': 1.7315477132797241, 'reward_std': 0.02863109763711691, 'kl': 0.127197265625, 'epoch': 0.2} 20%|█▉ | 844/4286 [6:27:34<23:48:50, 24.91s/it] 20%|█▉ | 845/4286 [6:27:59<23:43:48, 24.83s/it] {'loss': 0.0014, 'grad_norm': 0.542484053120948, 'learning_rate': 8.028464769015398e-07, 'completion_length': 322.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.048079472966492176, 'kl': 0.0343017578125, 'epoch': 0.2} 20%|█▉ | 845/4286 [6:27:59<23:43:48, 24.83s/it] 20%|█▉ | 846/4286 [6:28:23<23:26:54, 24.54s/it] {'loss': 0.0021, 'grad_norm': 1.5354788764886877, 'learning_rate': 
8.02613159122725e-07, 'completion_length': 289.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 1.0, 'reward': 1.6473215222358704, 'reward_std': 0.0803571417927742, 'kl': 0.0517578125, 'epoch': 0.2} 20%|█▉ | 846/4286 [6:28:23<23:26:54, 24.54s/it] 20%|█▉ | 847/4286 [6:28:49<23:58:11, 25.09s/it] {'loss': 0.0023, 'grad_norm': 0.9541660754266965, 'learning_rate': 8.023798413439104e-07, 'completion_length': 305.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.5473901629447937, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5295330286026, 'reward_std': 0.13571173325181007, 'kl': 0.05859375, 'epoch': 0.2} 20%|█▉ | 847/4286 [6:28:49<23:58:11, 25.09s/it] 20%|█▉ | 848/4286 [6:29:15<24:04:20, 25.21s/it] {'loss': 0.0017, 'grad_norm': 0.3404149843823791, 'learning_rate': 8.021465235650956e-07, 'completion_length': 331.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.642857313156128, 'reward_std': 0.04081590101122856, 'kl': 0.042236328125, 'epoch': 0.2} 20%|█▉ | 848/4286 [6:29:15<24:04:20, 25.21s/it] 20%|█▉ | 849/4286 [6:29:41<24:18:28, 25.46s/it] {'loss': 0.0013, 'grad_norm': 0.5555013506990873, 'learning_rate': 8.019132057862808e-07, 'completion_length': 345.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7872024178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7693454027175903, 'reward_std': 0.07029405608773232, 'kl': 0.03271484375, 'epoch': 0.2} 20%|█▉ | 849/4286 [6:29:41<24:18:28, 25.46s/it] 20%|█▉ | 850/4286 [6:30:07<24:37:02, 25.79s/it] {'loss': 0.0016, 'grad_norm': 0.6944418127568203, 'learning_rate': 8.016798880074662e-07, 'completion_length': 309.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.583333358168602, 'rewards/format_reward': 1.0, 'reward': 1.583333432674408, 'reward_std': 0.08609583601355553, 'kl': 0.0399169921875, 'epoch': 0.2} 20%|█▉ | 850/4286 [6:30:07<24:37:02, 25.79s/it][2025-03-02 21:27:56,957] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 20%|█▉ | 851/4286 [6:30:34<24:50:34, 26.04s/it] {'loss': 0.0013, 'grad_norm': 0.12500198688684436, 'learning_rate': 8.014465702286514e-07, 'completion_length': 308.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7440477013587952, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.011904759332537651, 'kl': 0.03228759765625, 'epoch': 0.2} 20%|█▉ | 851/4286 [6:30:34<24:50:34, 26.04s/it] 20%|█▉ | 852/4286 [6:30:59<24:34:13, 25.76s/it] {'loss': 0.0017, 'grad_norm': 0.927723455741889, 'learning_rate': 8.012132524498366e-07, 'completion_length': 308.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6494472920894623, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6315902471542358, 'reward_std': 0.13481435179710388, 'kl': 0.04150390625, 'epoch': 0.2} 20%|█▉ | 852/4286 [6:30:59<24:34:13, 25.76s/it] 20%|█▉ | 853/4286 [6:31:25<24:38:49, 25.85s/it] {'loss': 0.0015, 'grad_norm': 0.2631233869861084, 'learning_rate': 8.009799346710219e-07, 'completion_length': 308.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.71726194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6994048953056335, 'reward_std': 0.05357143096625805, 'kl': 0.0384521484375, 'epoch': 0.2} 20%|█▉ | 853/4286 [6:31:25<24:38:49, 25.85s/it] 20%|█▉ | 854/4286 [6:31:51<24:30:00, 25.70s/it] {'loss': 0.0016, 'grad_norm': 0.4087470584940142, 'learning_rate': 8.007466168922071e-07, 'completion_length': 294.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.779762089252472, 'reward_std': 0.03388838469982147, 'kl': 0.039794921875, 'epoch': 0.2} 20%|█▉ | 854/4286 [6:31:51<24:30:00, 25.70s/it] 20%|█▉ | 855/4286 [6:32:15<24:11:25, 25.38s/it] {'loss': 0.0015, 'grad_norm': 1.5030319229089861, 'learning_rate': 8.005132991133924e-07, 'completion_length': 317.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.816964328289032, 'rewards/format_reward': 1.0, 'reward': 1.8169643878936768, 'reward_std': 0.03457976318895817, 'kl': 0.0384521484375, 'epoch': 0.2} 20%|█▉ | 855/4286 [6:32:15<24:11:25, 25.38s/it] 20%|█▉ | 856/4286 [6:32:40<23:53:33, 25.08s/it] {'loss': 0.0012, 'grad_norm': 0.4192781896720575, 'learning_rate': 8.002799813345776e-07, 'completion_length': 325.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.05952380783855915, 'kl': 0.03125, 'epoch': 0.2} 20%|█▉ | 856/4286 [6:32:40<23:53:33, 25.08s/it] 20%|█▉ | 857/4286 [6:33:04<23:37:16, 24.80s/it] {'loss': 0.0018, 'grad_norm': 0.6136203526813001, 'learning_rate': 8.000466635557629e-07, 'completion_length': 322.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.1326134279370308, 'kl': 0.0443115234375, 'epoch': 0.2} 20%|█▉ | 857/4286 [6:33:04<23:37:16, 24.80s/it] 20%|██ | 858/4286 [6:33:28<23:27:57, 24.64s/it] {'loss': 0.0015, 'grad_norm': 0.6534214996841433, 'learning_rate': 7.998133457769481e-07, 'completion_length': 294.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6979167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.05243690684437752, 'kl': 0.0384521484375, 'epoch': 0.2} 20%|██ | 858/4286 [6:33:28<23:27:57, 
20%|██ | 859/4286 [6:33:52<23:22:52, 24.56s/it] {'loss': 0.0015, 'grad_norm': 0.7230737426983599, 'learning_rate': 7.995800279981334e-07, 'completion_length': 310.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7961309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.05495268292725086, 'kl': 0.0377197265625, 'epoch': 0.2}
[2025-03-02 21:31:42,282] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
20%|██ | 860/4286 [6:34:19<24:04:12, 25.29s/it] {'loss': 0.0015, 'grad_norm': 1.5041003029301243, 'learning_rate': 7.993467102193187e-07, 'completion_length': 339.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.09463847056031227, 'kl': 0.0362548828125, 'epoch': 0.2}
20%|██ | 861/4286 [6:34:44<23:54:04, 25.12s/it] {'loss': 0.0017, 'grad_norm': 0.31366810326665473, 'learning_rate': 7.991133924405039e-07, 'completion_length': 308.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.495535746216774, 'rewards/format_reward': 1.0, 'reward': 1.4955358505249023, 'reward_std': 0.04673127271234989, 'kl': 0.04248046875, 'epoch': 0.2}
20%|██ | 862/4286 [6:35:10<24:00:43, 25.25s/it] {'loss': 0.0016, 'grad_norm': 0.5254128942976025, 'learning_rate': 7.988800746616891e-07, 'completion_length': 297.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7041666805744171, 'rewards/format_reward': 1.0, 'reward': 1.7041667699813843, 'reward_std': 0.07933454215526581, 'kl': 0.038818359375, 'epoch': 0.2}
20%|██ | 863/4286 [6:35:36<24:26:59, 25.71s/it] {'loss': 0.0019, 'grad_norm': 0.5262010009032236, 'learning_rate': 7.986467568828745e-07, 'completion_length': 350.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6714286208152771, 'rewards/format_reward': 1.0, 'reward': 1.671428620815277, 'reward_std': 0.03614550828933716, 'kl': 0.0472412109375, 'epoch': 0.2}
20%|██ | 864/4286 [6:36:01<24:13:22, 25.48s/it] {'loss': 0.0016, 'grad_norm': 0.3796766672595647, 'learning_rate': 7.984134391040597e-07, 'completion_length': 315.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.03755596000701189, 'kl': 0.0389404296875, 'epoch': 0.2}
20%|██ | 865/4286 [6:36:27<24:09:22, 25.42s/it] {'loss': 0.0022, 'grad_norm': 1.2805723800186932, 'learning_rate': 7.981801213252449e-07, 'completion_length': 306.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.5773810744285583, 'reward_std': 0.07142857112921774, 'kl': 0.0543212890625, 'epoch': 0.2}
20%|██ | 866/4286 [6:36:53<24:28:38, 25.77s/it] {'loss': 0.0029, 'grad_norm': 0.6270834938845191, 'learning_rate': 7.979468035464301e-07, 'completion_length': 317.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.5553571581840515, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5375002026557922, 'reward_std': 0.06190476659685373, 'kl': 0.0718994140625, 'epoch': 0.2}
20%|██ | 867/4286 [6:37:20<24:52:56, 26.20s/it] {'loss': 0.0026, 'grad_norm': 1.1232040788996647, 'learning_rate': 7.977134857676155e-07, 'completion_length': 317.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.6547619700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6369048357009888, 'reward_std': 0.17529671639204025, 'kl': 0.06640625, 'epoch': 0.2}
[2025-03-02 21:35:10,807] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
20%|██ | 868/4286 [6:37:48<25:14:00, 26.58s/it] {'loss': 0.0048, 'grad_norm': 1.7260949871520823, 'learning_rate': 7.974801679888007e-07, 'completion_length': 309.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6927827596664429, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6749256253242493, 'reward_std': 0.11283958703279495, 'kl': 0.12060546875, 'epoch': 0.2}
[2025-03-02 21:35:38,736] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
20%|██ | 869/4286 [6:38:16<25:36:39, 26.98s/it] {'loss': 0.0044, 'grad_norm': 2.1943026928948393, 'learning_rate': 7.972468502099859e-07, 'completion_length': 338.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6294643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6294644474983215, 'reward_std': 0.11360794305801392, 'kl': 0.11083984375, 'epoch': 0.2}
20%|██ | 870/4286 [6:38:41<25:10:10, 26.53s/it] {'loss': 0.012, 'grad_norm': 3.276573843919187, 'learning_rate': 7.970135324311712e-07, 'completion_length': 322.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5357144474983215, 'reward_std': 0.1647080034017563, 'kl': 0.30029296875, 'epoch': 0.2}
[2025-03-02 21:36:30,517] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
20%|██ | 871/4286 [6:39:08<25:06:17, 26.46s/it] {'loss': 0.0166, 'grad_norm': 1.1227456043502313, 'learning_rate': 7.967802146523565e-07, 'completion_length': 313.2143096923828, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6190477013587952, 'reward_std': 0.13275901228189468, 'kl': 0.41650390625, 'epoch': 0.2}
[2025-03-02 21:36:57,641] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
20%|██ | 872/4286 [6:39:35<25:17:04, 26.66s/it] {'loss': 0.0294, 'grad_norm': 3.455917205218313, 'learning_rate': 7.965468968735417e-07, 'completion_length': 325.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6096230745315552, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.573908805847168, 'reward_std': 0.23071300238370895, 'kl': 0.734375, 'epoch': 0.2}
20%|██ | 873/4286 [6:40:01<25:02:43, 26.42s/it] {'loss': 0.0264, 'grad_norm': 6.950874539783784, 'learning_rate': 7.96313579094727e-07, 'completion_length': 319.375, 'rewards/only_full_func_accuracy_reward': 0.7276785671710968, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7098215818405151, 'reward_std': 0.13309068977832794, 'kl': 0.662109375, 'epoch': 0.2}
20%|██ | 874/4286 [6:40:26<24:44:11, 26.10s/it] {'loss': 0.0346, 'grad_norm': 2.390319332202432, 'learning_rate': 7.960802613159122e-07, 'completion_length': 283.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7477679252624512, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7120537161827087, 'reward_std': 0.07354088872671127, 'kl': 0.8623046875, 'epoch': 0.2}
20%|██ | 875/4286 [6:40:50<24:12:41, 25.55s/it] {'loss': 0.0326, 'grad_norm': 2.3398465066950194, 'learning_rate': 7.958469435370974e-07, 'completion_length': 304.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7976190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7797620296478271, 'reward_std': 0.1367695815861225, 'kl': 0.814453125, 'epoch': 0.2}
20%|██ | 876/4286 [6:41:15<24:05:12, 25.43s/it] {'loss': 0.05, 'grad_norm': 4.809572755889409, 'learning_rate': 7.956136257582828e-07, 'completion_length': 309.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6610119640827179, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6252976059913635, 'reward_std': 0.1586013287305832, 'kl': 1.25, 'epoch': 0.2}
[2025-03-02 21:39:05,235] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
20%|██ | 877/4286 [6:41:42<24:31:06, 25.89s/it] {'loss': 0.0364, 'grad_norm': 2.5179887739808406, 'learning_rate': 7.95380307979468e-07, 'completion_length': 303.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6868235766887665, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6511094570159912, 'reward_std': 0.2366538792848587, 'kl': 0.912109375, 'epoch': 0.2}
20%|██ | 878/4286 [6:42:08<24:28:36, 25.86s/it] {'loss': 0.0421, 'grad_norm': 2.4783560060145557, 'learning_rate': 7.951469902006532e-07, 'completion_length': 298.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6662946939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.630580484867096, 'reward_std': 0.21456562727689743, 'kl': 1.05078125, 'epoch': 0.2}
21%|██ | 879/4286 [6:42:33<24:14:30, 25.62s/it] {'loss': 0.0472, 'grad_norm': 3.2703413977747804, 'learning_rate': 7.949136724218384e-07, 'completion_length': 304.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.647321492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6116072535514832, 'reward_std': 0.21364210546016693, 'kl': 1.177734375, 'epoch': 0.21}
21%|██ | 880/4286 [6:42:58<24:05:43, 25.47s/it] {'loss': 0.0212, 'grad_norm': 5.663177077725876, 'learning_rate': 7.946803546430238e-07, 'completion_length': 270.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053571939468384, 'reward_std': 0.1071428544819355, 'kl': 0.5283203125, 'epoch': 0.21}
21%|██ | 881/4286 [6:43:22<23:43:14, 25.08s/it] {'loss': 0.0145, 'grad_norm': 1.8368786277409002, 'learning_rate': 7.94447036864209e-07, 'completion_length': 266.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 1.0, 'reward': 1.6502977013587952, 'reward_std': 0.06802502274513245, 'kl': 0.361328125, 'epoch': 0.21}
[2025-03-02 21:41:11,196] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██ | 882/4286 [6:43:48<23:55:46, 25.31s/it] {'loss': 0.017, 'grad_norm': 4.074345040149567, 'learning_rate': 7.942137190853942e-07, 'completion_length': 313.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.636011928319931, 'rewards/format_reward': 1.0, 'reward': 1.6360120177268982, 'reward_std': 0.07873930037021637, 'kl': 0.4234619140625, 'epoch': 0.21}
21%|██ | 883/4286 [6:44:13<23:40:09, 25.04s/it] {'loss': 0.0251, 'grad_norm': 1.8548566144849836, 'learning_rate': 7.939804013065795e-07, 'completion_length': 282.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6711310744285583, 'reward_std': 0.13584472611546516, 'kl': 0.626953125, 'epoch': 0.21}
21%|██ | 884/4286 [6:44:36<23:11:02, 24.53s/it] {'loss': 0.0426, 'grad_norm': 4.206034340534742, 'learning_rate': 7.937470835277648e-07, 'completion_length': 297.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.62202388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6041667461395264, 'reward_std': 0.1900879144668579, 'kl': 1.0625, 'epoch': 0.21}
21%|██ | 885/4286 [6:45:03<23:47:54, 25.19s/it] {'loss': 0.0222, 'grad_norm': 2.951584879720611, 'learning_rate': 7.9351376574895e-07, 'completion_length': 311.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6666668057441711, 'reward_std': 0.05952381156384945, 'kl': 0.5556640625, 'epoch': 0.21}
[2025-03-02 21:42:49,599] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██ | 886/4286 [6:45:27<23:25:43, 24.81s/it] {'loss': 0.0135, 'grad_norm': 1.798229618391688, 'learning_rate': 7.932804479701353e-07, 'completion_length': 308.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7351190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7351192235946655, 'reward_std': 0.04398157820105553, 'kl': 0.3365478515625, 'epoch': 0.21}
21%|██ | 887/4286 [6:45:52<23:40:00, 25.07s/it] {'loss': 0.0026, 'grad_norm': 2.458900799513177, 'learning_rate': 7.930471301913205e-07, 'completion_length': 314.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7053571939468384, 'reward_std': 0.06259887851774693, 'kl': 0.06396484375, 'epoch': 0.21}
21%|██ | 888/4286 [6:46:16<23:21:53, 24.75s/it] {'loss': 0.0071, 'grad_norm': 2.0679603962815625, 'learning_rate': 7.928138124125058e-07, 'completion_length': 302.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.14271127432584763, 'kl': 0.178466796875, 'epoch': 0.21}
21%|██ | 889/4286 [6:46:40<23:07:32, 24.51s/it] {'loss': 0.0137, 'grad_norm': 2.798180464137897, 'learning_rate': 7.92580494633691e-07, 'completion_length': 302.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.5788690745830536, 'rewards/format_reward': 1.0, 'reward': 1.5788691639900208, 'reward_std': 0.0414529899135232, 'kl': 0.3428955078125, 'epoch': 0.21}
21%|██ | 890/4286 [6:47:04<22:57:44, 24.34s/it] {'loss': 0.0064, 'grad_norm': 1.4719385162318148, 'learning_rate': 7.923471768548763e-07, 'completion_length': 309.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 1.0, 'reward': 1.6860120296478271, 'reward_std': 0.046731267124414444, 'kl': 0.15966796875, 'epoch': 0.21}
21%|██ | 891/4286 [6:47:29<23:09:11, 24.55s/it] {'loss': 0.0017, 'grad_norm': 0.8229518847615297, 'learning_rate': 7.921138590760615e-07, 'completion_length': 317.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7514881789684296, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.038690476678311825, 'kl': 0.042724609375, 'epoch': 0.21}
21%|██ | 892/4286 [6:47:52<22:29:20, 23.85s/it] {'loss': 0.0188, 'grad_norm': 3.0629765531936166, 'learning_rate': 7.918805412972468e-07, 'completion_length': 284.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.05197649821639061, 'kl': 0.47216796875, 'epoch': 0.21}
21%|██ | 893/4286 [6:48:15<22:22:33, 23.74s/it] {'loss': 0.0136, 'grad_norm': 1.3826910927516851, 'learning_rate': 7.916472235184321e-07, 'completion_length': 299.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 1.0, 'reward': 1.6026787161827087, 'reward_std': 0.04464286006987095, 'kl': 0.340087890625, 'epoch': 0.21}
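A note on reading these rows: the logged 'reward' is simply the sum of the two logged reward components, and the small mismatches in the last digits are float32 accumulation noise. A spot check against the step 866 row above:

# Values copied from the step 866 row above; comments give the logged metric keys.
acc = 0.5553571581840515   # rewards/only_full_func_accuracy_reward
fmt = 0.9821428656578064   # rewards/format_reward
print(acc + fmt)           # 1.5375000238..., vs. the logged 'reward' 1.5375002026...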
21%|██ | 894/4286 [6:48:38<22:10:04, 23.53s/it] {'loss': 0.0014, 'grad_norm': 0.8244647205882166, 'learning_rate': 7.914139057396173e-07, 'completion_length': 306.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.674107164144516, 'rewards/format_reward': 1.0, 'reward': 1.6741071939468384, 'reward_std': 0.04464286006987095, 'kl': 0.033935546875, 'epoch': 0.21}
21%|██ | 895/4286 [6:49:01<22:02:05, 23.39s/it] {'loss': 0.0015, 'grad_norm': 1.0600678697574086, 'learning_rate': 7.911805879608025e-07, 'completion_length': 283.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.08885835111141205, 'kl': 0.0364990234375, 'epoch': 0.21}
21%|██ | 896/4286 [6:49:24<21:45:34, 23.11s/it] {'loss': 0.0043, 'grad_norm': 1.1638906675063119, 'learning_rate': 7.909472701819879e-07, 'completion_length': 274.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592262983322144, 'reward_std': 0.1355779469013214, 'kl': 0.107177734375, 'epoch': 0.21}
[2025-03-02 21:47:11,274] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██ | 897/4286 [6:49:48<22:13:48, 23.61s/it] {'loss': 0.0022, 'grad_norm': 2.720388267465404, 'learning_rate': 7.907139524031731e-07, 'completion_length': 294.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.6175595223903656, 'rewards/format_reward': 1.0, 'reward': 1.6175596117973328, 'reward_std': 0.0386904738843441, 'kl': 0.0552978515625, 'epoch': 0.21}
21%|██ | 898/4286 [6:50:13<22:33:48, 23.98s/it] {'loss': 0.0014, 'grad_norm': 0.7690722596923724, 'learning_rate': 7.904806346243583e-07, 'completion_length': 311.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7607143521308899, 'rewards/format_reward': 1.0, 'reward': 1.7607142925262451, 'reward_std': 0.07396222651004791, 'kl': 0.033935546875, 'epoch': 0.21}
21%|██ | 899/4286 [6:50:36<22:12:43, 23.61s/it] {'loss': 0.0057, 'grad_norm': 2.240725938222749, 'learning_rate': 7.902473168455436e-07, 'completion_length': 263.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7883928716182709, 'rewards/format_reward': 1.0, 'reward': 1.788392961025238, 'reward_std': 0.12743691354990005, 'kl': 0.141357421875, 'epoch': 0.21}
21%|██ | 900/4286 [6:50:59<22:11:13, 23.59s/it] {'loss': 0.0019, 'grad_norm': 1.7881236789942652, 'learning_rate': 7.900139990667289e-07, 'completion_length': 303.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6398810148239136, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.06547619216144085, 'kl': 0.0478515625, 'epoch': 0.21}
21%|██ | 901/4286 [6:54:33<75:41:49, 80.51s/it] {'loss': 0.0011, 'grad_norm': 0.9211645292829648, 'learning_rate': 7.897806812879141e-07, 'completion_length': 308.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 1.0, 'reward': 1.6443454027175903, 'reward_std': 0.05495268478989601, 'kl': 0.02752685546875, 'epoch': 0.21}
[2025-03-02 21:52:19,442] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██ | 902/4286 [6:54:57<59:40:05, 63.48s/it] {'loss': 0.002, 'grad_norm': 0.23103511280920871, 'learning_rate': 7.895473635090993e-07, 'completion_length': 265.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.8928572237491608, 'rewards/format_reward': 1.0, 'reward': 1.892857313156128, 'reward_std': 0.02380952052772045, 'kl': 0.0506591796875, 'epoch': 0.21}
21%|██ | 903/4286 [6:55:21<48:33:53, 51.68s/it] {'loss': 0.0039, 'grad_norm': 2.772793455767321, 'learning_rate': 7.893140457302846e-07, 'completion_length': 290.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7175595462322235, 'rewards/format_reward': 1.0, 'reward': 1.717559576034546, 'reward_std': 0.11862387508153915, 'kl': 0.0975341796875, 'epoch': 0.21}
[2025-03-02 21:53:09,396] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██ | 904/4286 [6:55:46<41:15:23, 43.92s/it] {'loss': 0.0021, 'grad_norm': 0.5333260222043534, 'learning_rate': 7.890807279514698e-07, 'completion_length': 295.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.791666716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7738096117973328, 'reward_std': 0.047619049437344074, 'kl': 0.0535888671875, 'epoch': 0.21}
[2025-03-02 21:53:36,693] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██ | 905/4286 [6:56:14<36:33:43, 38.93s/it] {'loss': 0.0097, 'grad_norm': 4.253113356392193, 'learning_rate': 7.888474101726551e-07, 'completion_length': 308.3928756713867, 'rewards/only_full_func_accuracy_reward': 0.6227678954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6049107909202576, 'reward_std': 0.06081788241863251, 'kl': 0.2421875, 'epoch': 0.21}
21%|██ | 906/4286 [6:56:41<33:09:45, 35.32s/it] {'loss': 0.0117, 'grad_norm': 2.581777302081824, 'learning_rate': 7.886140923938404e-07, 'completion_length': 278.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6339287161827087, 'reward_std': 0.1293574497103691, 'kl': 0.2919921875, 'epoch': 0.21}
21%|██ | 907/4286 [6:57:05<30:03:55, 32.03s/it] {'loss': 0.0069, 'grad_norm': 6.502450584874761, 'learning_rate': 7.883807746150256e-07, 'completion_length': 294.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.03847679682075977, 'kl': 0.1734619140625, 'epoch': 0.21}
21%|██ | 908/4286 [6:57:30<28:11:57, 30.05s/it] {'loss': 0.0301, 'grad_norm': 5.796164944703902, 'learning_rate': 7.881474568362108e-07, 'completion_length': 316.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6547619700431824, 'reward_std': 0.1319393366575241, 'kl': 0.750244140625, 'epoch': 0.21}
21%|██ | 909/4286 [6:57:58<27:28:15, 29.28s/it] {'loss': 0.0271, 'grad_norm': 2.43189253476272, 'learning_rate': 7.879141390573962e-07, 'completion_length': 316.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7276785969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6919643878936768, 'reward_std': 0.10743949562311172, 'kl': 0.6748046875, 'epoch': 0.21}
[2025-03-02 21:55:46,244] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██ | 910/4286 [6:58:23<26:21:37, 28.11s/it] {'loss': 0.025, 'grad_norm': 4.6030211169675175, 'learning_rate': 7.876808212785814e-07, 'completion_length': 281.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7023810744285583, 'reward_std': 0.11465512961149216, 'kl': 0.625, 'epoch': 0.21}
[2025-03-02 21:56:09,612] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██▏ | 911/4286 [6:58:47<25:01:08, 26.69s/it] {'loss': 0.0225, 'grad_norm': 7.774778575362253, 'learning_rate': 7.874475034997666e-07, 'completion_length': 294.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.67113097012043, 'rewards/format_reward': 1.0, 'reward': 1.6711310148239136, 'reward_std': 0.04685882292687893, 'kl': 0.5625, 'epoch': 0.21}
21%|██▏ | 912/4286 [6:59:13<24:47:46, 26.46s/it] {'loss': 0.0214, 'grad_norm': 3.128551005714224, 'learning_rate': 7.872141857209518e-07, 'completion_length': 309.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.750744104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7150299549102783, 'reward_std': 0.09998663887381554, 'kl': 0.53466796875, 'epoch': 0.21}
21%|██▏ | 913/4286 [6:59:37<24:17:53, 25.93s/it] {'loss': 0.0625, 'grad_norm': 6.264220596375894, 'learning_rate': 7.869808679421372e-07, 'completion_length': 309.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.5943452715873718, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5407739877700806, 'reward_std': 0.17831461504101753, 'kl': 1.55859375, 'epoch': 0.21}
[2025-03-02 21:57:25,078] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██▏ | 914/4286 [7:00:02<23:58:55, 25.60s/it] {'loss': 0.0915, 'grad_norm': 4.354990986115163, 'learning_rate': 7.867475501633224e-07, 'completion_length': 292.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5863096714019775, 'reward_std': 0.2700731009244919, 'kl': 2.2890625, 'epoch': 0.21}
21%|██▏ | 915/4286 [7:00:27<23:48:36, 25.43s/it] {'loss': 0.0692, 'grad_norm': 18.488591538743613, 'learning_rate': 7.865142323845076e-07, 'completion_length': 316.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.566964328289032, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4955357909202576, 'reward_std': 0.23428896069526672, 'kl': 1.73828125, 'epoch': 0.21}
21%|██▏ | 916/4286 [7:00:54<24:04:59, 25.73s/it] {'loss': 0.0536, 'grad_norm': 4.349494803221466, 'learning_rate': 7.862809146056929e-07, 'completion_length': 338.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.6488096117973328, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6130953431129456, 'reward_std': 0.17454484850168228, 'kl': 1.33984375, 'epoch': 0.21}
21%|██▏ | 917/4286 [7:01:20<24:08:15, 25.79s/it] {'loss': 0.0241, 'grad_norm': 24.390544026810648, 'learning_rate': 7.860475968268782e-07, 'completion_length': 322.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.5877976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5699405670166016, 'reward_std': 0.15219222754240036, 'kl': 0.6015625, 'epoch': 0.21}
21%|██▏ | 918/4286 [7:01:45<24:05:54, 25.76s/it] {'loss': 0.0505, 'grad_norm': 21.244993508116973, 'learning_rate': 7.858142790480634e-07, 'completion_length': 285.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6467127203941345, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.610998511314392, 'reward_std': 0.17195003852248192, 'kl': 1.265625, 'epoch': 0.21}
[2025-03-02 21:59:34,677] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
21%|██▏ | 919/4286 [7:02:12<24:18:29, 25.99s/it] {'loss': 0.0787, 'grad_norm': 3.277151745834644, 'learning_rate': 7.855809612692487e-07, 'completion_length': 328.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7065972983837128, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.635168731212616, 'reward_std': 0.2302502691745758, 'kl': 1.96484375, 'epoch': 0.21}
21%|██▏ | 920/4286 [7:02:39<24:34:04, 26.28s/it] {'loss': 0.0146, 'grad_norm': 2.300567867880099, 'learning_rate': 7.853476434904339e-07, 'completion_length': 336.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6400162577629089, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6043020486831665, 'reward_std': 0.12647845316678286, 'kl': 0.36474609375, 'epoch': 0.21}
21%|██▏ | 921/4286 [7:03:04<24:21:41, 26.06s/it] {'loss': 0.03, 'grad_norm': 6.7405629561491125, 'learning_rate': 7.851143257116192e-07, 'completion_length': 316.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7086309790611267, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6907739043235779, 'reward_std': 0.0849946141242981, 'kl': 0.75, 'epoch': 0.21}
22%|██▏ | 922/4286 [7:03:31<24:26:04, 26.15s/it] {'loss': 0.008, 'grad_norm': 1.524637499115325, 'learning_rate': 7.848810079328045e-07, 'completion_length': 326.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.6919642984867096, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.09903469681739807, 'kl': 0.2001953125, 'epoch': 0.22}
22%|██▏ | 923/4286 [7:03:58<24:48:32, 26.56s/it] {'loss': 0.0429, 'grad_norm': 6.871907877704986, 'learning_rate': 7.846476901539897e-07, 'completion_length': 334.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5949404835700989, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5413691997528076, 'reward_std': 0.16588040441274643, 'kl': 1.0693359375, 'epoch': 0.22}
22%|██▏ | 924/4286 [7:04:25<24:51:00, 26.61s/it] {'loss': 0.0391, 'grad_norm': 18.154696684398786, 'learning_rate': 7.844143723751749e-07, 'completion_length': 356.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5714285969734192, 'reward_std': 0.17395390570163727, 'kl': 0.978515625, 'epoch': 0.22}
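Across these rows the reported loss tracks the KL term almost exactly: loss ≈ 0.04 × kl (step 913: 0.04 × 1.5586 ≈ 0.0623 vs. 0.0625 logged; step 914: 0.04 × 2.2891 ≈ 0.0916 vs. 0.0915). That is consistent with a GRPO-style objective whose policy-gradient term averages near zero on-policy, leaving the KL penalty to dominate with a coefficient of roughly beta = 0.04. This is an inference from the logged numbers only; the trainer configuration is not shown in this log. A hedged check:

# Hedged sanity check, not the trainer's code: beta = 0.04 is inferred, not configured here.
rows = [  # (logged loss, logged kl)
    (0.0625, 1.55859375),  # step 913
    (0.0915, 2.2890625),   # step 914
    (0.0505, 1.265625),    # step 918
]
for loss, kl in rows:
    print(loss, round(0.04 * kl, 4))  # the two columns agree to about 1e-3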
[2025-03-02 22:02:17,417] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
22%|██▏ | 925/4286 [7:04:54<25:41:32, 27.52s/it] {'loss': 0.0307, 'grad_norm': 3.432692512242026, 'learning_rate': 7.841810545963601e-07, 'completion_length': 324.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7053571343421936, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.669642984867096, 'reward_std': 0.19669277966022491, 'kl': 0.76953125, 'epoch': 0.22}
22%|██▏ | 926/4286 [7:05:21<25:26:23, 27.26s/it] {'loss': 0.0157, 'grad_norm': 17.35042420402851, 'learning_rate': 7.839477368175455e-07, 'completion_length': 291.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.730654776096344, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.09883531369268894, 'kl': 0.390869140625, 'epoch': 0.22}
22%|██▏ | 927/4286 [7:05:48<25:13:11, 27.03s/it] {'loss': 0.035, 'grad_norm': 2.562295726067071, 'learning_rate': 7.837144190387307e-07, 'completion_length': 281.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.5163690745830536, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4806548357009888, 'reward_std': 0.12366021983325481, 'kl': 0.8779296875, 'epoch': 0.22}
22%|██▏ | 928/4286 [7:06:12<24:32:37, 26.31s/it] {'loss': 0.0473, 'grad_norm': 3.8207253847898968, 'learning_rate': 7.834811012599159e-07, 'completion_length': 323.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607144474983215, 'reward_std': 0.1071428582072258, 'kl': 1.181640625, 'epoch': 0.22}
22%|██▏ | 929/4286 [7:06:38<24:14:16, 25.99s/it] {'loss': 0.0042, 'grad_norm': 7.852882925083621, 'learning_rate': 7.832477834811012e-07, 'completion_length': 308.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7157738208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6979168057441711, 'reward_std': 0.08152472227811813, 'kl': 0.1044921875, 'epoch': 0.22}
22%|██▏ | 930/4286 [7:07:04<24:28:40, 26.26s/it] {'loss': 0.0266, 'grad_norm': 9.738462058557765, 'learning_rate': 7.830144657022865e-07, 'completion_length': 316.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7800595760345459, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.726488173007965, 'reward_std': 0.17233119904994965, 'kl': 0.666015625, 'epoch': 0.22}
22%|██▏ | 931/4286 [7:07:30<24:21:29, 26.14s/it] {'loss': 0.0143, 'grad_norm': 3.3407853881582232, 'learning_rate': 7.827811479234717e-07, 'completion_length': 292.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7651786208152771, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7116072177886963, 'reward_std': 0.13177145831286907, 'kl': 0.35546875, 'epoch': 0.22}
22%|██▏ | 932/4286 [7:07:55<23:54:06, 25.65s/it] {'loss': 0.0135, 'grad_norm': 5.624311064177997, 'learning_rate': 7.82547830144657e-07, 'completion_length': 296.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.0357142873108387, 'kl': 0.337890625, 'epoch': 0.22}
22%|██▏ | 933/4286 [7:08:19<23:37:13, 25.36s/it] {'loss': 0.0418, 'grad_norm': 2.546030145844171, 'learning_rate': 7.823145123658422e-07, 'completion_length': 288.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6041667461395264, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.568452537059784, 'reward_std': 0.1656566709280014, 'kl': 1.046875, 'epoch': 0.22}
22%|██▏ | 934/4286 [7:08:43<23:03:10, 24.76s/it] {'loss': 0.0479, 'grad_norm': 3.314914786716784, 'learning_rate': 7.820811945870275e-07, 'completion_length': 290.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7118327915668488, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.658261477947235, 'reward_std': 0.24925966933369637, 'kl': 1.1953125, 'epoch': 0.22}
22%|██▏ | 935/4286 [7:09:07<22:46:58, 24.48s/it] {'loss': 0.0213, 'grad_norm': 8.538441834442745, 'learning_rate': 7.818478768082127e-07, 'completion_length': 252.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.5918367654085159, 'rewards/format_reward': 1.0, 'reward': 1.5918368101119995, 'reward_std': 0.08793980814516544, 'kl': 0.5322265625, 'epoch': 0.22}
22%|██▏ | 936/4286 [7:09:32<23:05:50, 24.82s/it] {'loss': 0.0167, 'grad_norm': 5.956621849421741, 'learning_rate': 7.81614559029398e-07, 'completion_length': 321.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.76264888048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7447918057441711, 'reward_std': 0.15414445102214813, 'kl': 0.419189453125, 'epoch': 0.22}
22%|██▏ | 937/4286 [7:09:56<22:54:21, 24.62s/it] {'loss': 0.0198, 'grad_norm': 1.7882072863486438, 'learning_rate': 7.813812412505832e-07, 'completion_length': 332.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.617559552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5997024774551392, 'reward_std': 0.11157910153269768, 'kl': 0.498046875, 'epoch': 0.22}
22%|██▏ | 938/4286 [7:10:22<23:05:01, 24.82s/it] {'loss': 0.0431, 'grad_norm': 2.0411356551030755, 'learning_rate': 7.811479234717685e-07, 'completion_length': 277.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.6026786118745804, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.566964328289032, 'reward_std': 0.1864863783121109, 'kl': 1.080078125, 'epoch': 0.22}
22%|██▏ | 939/4286 [7:10:47<23:10:23, 24.92s/it] {'loss': 0.0408, 'grad_norm': 3.7170694631954566, 'learning_rate': 7.809146056929538e-07, 'completion_length': 290.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.706845223903656, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6711310148239136, 'reward_std': 0.1417868584394455, 'kl': 1.0234375, 'epoch': 0.22}
[2025-03-02 22:08:37,673] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
22%|██▏ | 940/4286 [7:11:15<23:59:29, 25.81s/it] {'loss': 0.0218, 'grad_norm': 4.138383768838944, 'learning_rate': 7.80681287914139e-07, 'completion_length': 292.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7233495712280273, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7054925560951233, 'reward_std': 0.10954783484339714, 'kl': 0.5458984375, 'epoch': 0.22}
22%|██▏ | 941/4286 [7:11:39<23:26:21, 25.23s/it] {'loss': 0.0141, 'grad_norm': 1.4880480459962213, 'learning_rate': 7.804479701353242e-07, 'completion_length': 280.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.03519995138049126, 'kl': 0.3505859375, 'epoch': 0.22}
22%|██▏ | 942/4286 [7:12:05<23:37:19, 25.43s/it] {'loss': 0.0144, 'grad_norm': 2.6212863461036116, 'learning_rate': 7.802146523565096e-07, 'completion_length': 323.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7452381253242493, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7095239162445068, 'reward_std': 0.11850834265351295, 'kl': 0.359619140625, 'epoch': 0.22}
22%|██▏ | 943/4286 [7:12:31<24:00:05, 25.85s/it] {'loss': 0.0085, 'grad_norm': 2.246479721216466, 'learning_rate': 7.799813345776948e-07, 'completion_length': 325.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6116072535514832, 'reward_std': 0.09066697023808956, 'kl': 0.21337890625, 'epoch': 0.22}
22%|██▏ | 944/4286 [7:12:58<24:11:13, 26.05s/it] {'loss': 0.0034, 'grad_norm': 2.813524810821871, 'learning_rate': 7.7974801679888e-07, 'completion_length': 321.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7053572237491608, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.05725477263331413, 'kl': 0.083740234375, 'epoch': 0.22}
22%|██▏ | 945/4286 [7:13:23<24:00:37, 25.87s/it] {'loss': 0.0016, 'grad_norm': 3.065130300338319, 'learning_rate': 7.795146990200653e-07, 'completion_length': 311.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7056548297405243, 'rewards/format_reward': 1.0, 'reward': 1.7056548595428467, 'reward_std': 0.03352411463856697, 'kl': 0.041259765625, 'epoch': 0.22}
22%|██▏ | 946/4286 [7:13:50<24:16:05, 26.16s/it] {'loss': 0.0346, 'grad_norm': 6.787766062226347, 'learning_rate': 7.792813812412506e-07, 'completion_length': 331.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.6755952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.07922262698411942, 'kl': 0.8642578125, 'epoch': 0.22}
22%|██▏ | 947/4286 [7:14:17<24:22:29, 26.28s/it] {'loss': 0.0074, 'grad_norm': 5.432151204536765, 'learning_rate': 7.790480634624358e-07, 'completion_length': 334.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6949405074119568, 'rewards/format_reward': 1.0, 'reward': 1.6949405670166016, 'reward_std': 0.06845238525420427, 'kl': 0.1839599609375, 'epoch': 0.22}
22%|██▏ | 948/4286 [7:14:43<24:14:06, 26.14s/it] {'loss': 0.0105, 'grad_norm': 4.244969958661573, 'learning_rate': 7.78814745683621e-07, 'completion_length': 310.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7090774476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6912203431129456, 'reward_std': 0.10565476678311825, 'kl': 0.264404296875, 'epoch': 0.22}
22%|██▏ | 949/4286 [7:15:07<23:43:46, 25.60s/it] {'loss': 0.0023, 'grad_norm': 1.8975278710786154, 'learning_rate': 7.785814279048063e-07, 'completion_length': 276.9643020629883, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.07811371795833111, 'kl': 0.0582275390625, 'epoch': 0.22}
22%|██▏ | 950/4286 [7:15:34<24:10:06, 26.08s/it] {'loss': 0.0188, 'grad_norm': 12.230946693848802, 'learning_rate': 7.783481101259915e-07, 'completion_length': 339.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7008928954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6473214626312256, 'reward_std': 0.21963614597916603, 'kl': 0.46923828125, 'epoch': 0.22}
22%|██▏ | 951/4286 [7:16:00<23:59:04, 25.89s/it] {'loss': 0.0038, 'grad_norm': 1.3809531136146287, 'learning_rate': 7.781147923471768e-07, 'completion_length': 325.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.8035715222358704, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.032524414360523224, 'kl': 0.0941162109375, 'epoch': 0.22}
22%|██▏ | 952/4286 [7:16:25<23:53:25, 25.80s/it] {'loss': 0.0159, 'grad_norm': 5.872816733685305, 'learning_rate': 7.778814745683621e-07, 'completion_length': 301.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 1.0, 'reward': 1.7098215222358704, 'reward_std': 0.05746846366673708, 'kl': 0.3984375, 'epoch': 0.22}
[2025-03-02 22:14:14,204] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
22%|██▏ | 953/4286 [7:16:51<23:59:38, 25.92s/it] {'loss': 0.006, 'grad_norm': 3.757571015100768, 'learning_rate': 7.776481567895473e-07, 'completion_length': 331.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.7657738626003265, 'rewards/format_reward': 1.0, 'reward': 1.7657739520072937, 'reward_std': 0.01607143087312579, 'kl': 0.1507568359375, 'epoch': 0.22}
22%|██▏ | 954/4286 [7:17:16<23:43:05, 25.63s/it] {'loss': 0.0351, 'grad_norm': 6.683602010584326, 'learning_rate': 7.774148390107325e-07, 'completion_length': 259.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.7053572237491608, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.07749691046774387, 'kl': 0.875, 'epoch': 0.22}
[2025-03-02 22:15:06,046] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
22%|██▏ | 955/4286 [7:17:43<24:03:46, 26.01s/it] {'loss': 0.0374, 'grad_norm': 6.435127816688616, 'learning_rate': 7.771815212319179e-07, 'completion_length': 325.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.5193452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.501488208770752, 'reward_std': 0.0565476231276989, 'kl': 0.93359375, 'epoch': 0.22}
22%|██▏ | 956/4286 [7:18:10<24:12:42, 26.17s/it] {'loss': 0.0268, 'grad_norm': 2.777146105116444, 'learning_rate': 7.769482034531031e-07, 'completion_length': 309.875, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7068454027175903, 'reward_std': 0.052506979554891586, 'kl': 0.668212890625, 'epoch': 0.22}
22%|██▏ | 957/4286 [7:18:37<24:28:25, 26.47s/it] {'loss': 0.0335, 'grad_norm': 6.128608874067733, 'learning_rate': 7.767148856742883e-07, 'completion_length': 305.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7633929252624512, 'reward_std': 0.15393789112567902, 'kl': 0.837890625, 'epoch': 0.22}
22%|██▏ | 958/4286 [7:19:03<24:31:02, 26.52s/it] {'loss': 0.0156, 'grad_norm': 4.893162173734457, 'learning_rate': 7.764815678954735e-07, 'completion_length': 311.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.5595238357782364, 'rewards/format_reward': 1.0, 'reward': 1.5595239400863647, 'reward_std': 0.03571429289877415, 'kl': 0.390625, 'epoch': 0.22}
22%|██▏ | 959/4286 [7:19:29<24:07:07, 26.10s/it] {'loss': 0.0208, 'grad_norm': 1.4881690788365123, 'learning_rate': 7.762482501166589e-07, 'completion_length': 304.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.5931548178195953, 'rewards/format_reward': 1.0, 'reward': 1.5931548476219177, 'reward_std': 0.08357786387205124, 'kl': 0.5205078125, 'epoch': 0.22}
22%|██▏ | 960/4286 [7:19:58<24:55:45, 26.98s/it] {'loss': 0.0084, 'grad_norm': 1.4573626197858673, 'learning_rate': 7.760149323378441e-07, 'completion_length': 338.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6738095581531525, 'rewards/format_reward': 1.0, 'reward': 1.6738096475601196, 'reward_std': 0.02792726643383503, 'kl': 0.2098388671875, 'epoch': 0.22}
[2025-03-02 22:17:46,596] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
22%|██▏ | 961/4286 [7:20:24<24:39:25, 26.70s/it] {'loss': 0.0246, 'grad_norm': 3.019813675237806, 'learning_rate': 7.757816145590293e-07, 'completion_length': 299.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.5659722089767456, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5481151938438416, 'reward_std': 0.09105616062879562, 'kl': 0.615234375, 'epoch': 0.22}
22%|██▏ | 962/4286 [7:20:50<24:33:30, 26.60s/it] {'loss': 0.0179, 'grad_norm': 5.458847342728964, 'learning_rate': 7.755482967802146e-07, 'completion_length': 313.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.0950244190171361, 'kl': 0.4482421875, 'epoch': 0.22}
22%|██▏ | 963/4286 [7:21:14<23:50:19, 25.83s/it] {'loss': 0.0066, 'grad_norm': 3.60931795486492, 'learning_rate': 7.753149790013999e-07, 'completion_length': 316.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7291667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.0713137686252594, 'kl': 0.16455078125, 'epoch': 0.22}
22%|██▏ | 964/4286 [7:21:41<24:08:16, 26.16s/it] {'loss': 0.032, 'grad_norm': 7.898877706569209, 'learning_rate': 7.750816612225851e-07, 'completion_length': 339.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5610119700431824, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5252977013587952, 'reward_std': 0.1553892344236374, 'kl': 0.8006591796875, 'epoch': 0.22}
23%|██▎ | 965/4286 [7:22:09<24:41:36, 26.77s/it] {'loss': 0.0085, 'grad_norm': 1.9175893452274504, 'learning_rate': 7.748483434437704e-07, 'completion_length': 329.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.657738208770752, 'reward_std': 0.11203522607684135, 'kl': 0.21240234375, 'epoch': 0.23}
23%|██▎ | 966/4286 [7:22:36<24:38:05, 26.71s/it] {'loss': 0.0131, 'grad_norm': 3.779078902060476, 'learning_rate': 7.746150256649556e-07, 'completion_length': 297.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.7008929252624512, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6651787161827087, 'reward_std': 0.10523160174489021, 'kl': 0.326416015625, 'epoch': 0.23}
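One more regularity in these rows: rewards/format_reward only ever takes values of the form k/56 (0.98214... = 55/56, 0.96428... = 54/56, 0.94642... = 53/56), which suggests 56 completions are scored per logged step; the actual batch layout (data-parallel ranks times generations per prompt) is not shown here, so treat any factorization as an assumption. Relatedly, a reward_std of 0.0, as at step 971 below, means every completion in the group earned the same reward, so that group contributes essentially no preference signal beyond the KL term. A small check of the quantization:

from fractions import Fraction
# format_reward values copied from nearby rows; limit_denominator prints them
# in lowest terms (so 54/56 appears as 27/28).
for v in (0.9821428656578064, 0.9642857313156128, 0.9464285969734192):
    print(Fraction(v).limit_denominator(64))  # 55/56, 27/28, 53/56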
23%|██▎ | 967/4286 [7:23:00<23:56:24, 25.97s/it] {'loss': 0.0035, 'grad_norm': 3.3436921050699056, 'learning_rate': 7.743817078861409e-07, 'completion_length': 268.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.633928656578064, 'rewards/format_reward': 1.0, 'reward': 1.6339287161827087, 'reward_std': 0.054153766483068466, 'kl': 0.087158203125, 'epoch': 0.23}
23%|██▎ | 968/4286 [7:23:25<23:39:55, 25.68s/it] {'loss': 0.0019, 'grad_norm': 0.6236389702716553, 'learning_rate': 7.741483901073262e-07, 'completion_length': 323.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.818452388048172, 'rewards/format_reward': 1.0, 'reward': 1.818452537059784, 'reward_std': 0.026572031434625387, 'kl': 0.0465087890625, 'epoch': 0.23}
23%|██▎ | 969/4286 [7:23:49<23:03:40, 25.03s/it] {'loss': 0.0019, 'grad_norm': 1.150245092171562, 'learning_rate': 7.739150723285114e-07, 'completion_length': 268.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666667461395264, 'reward_std': 0.04946071980521083, 'kl': 0.0472412109375, 'epoch': 0.23}
23%|██▎ | 970/4286 [7:24:13<22:59:43, 24.96s/it] {'loss': 0.0026, 'grad_norm': 1.7750104126639294, 'learning_rate': 7.736817545496966e-07, 'completion_length': 282.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.07854852825403214, 'kl': 0.063720703125, 'epoch': 0.23}
23%|██▎ | 971/4286 [7:24:38<22:55:25, 24.89s/it] {'loss': 0.0012, 'grad_norm': 0.21600034490483128, 'learning_rate': 7.734484367708819e-07, 'completion_length': 320.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.0, 'kl': 0.0308837890625, 'epoch': 0.23}
[2025-03-02 22:22:29,278] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
23%|██▎ | 972/4286 [7:25:06<23:51:18, 25.91s/it] {'loss': 0.005, 'grad_norm': 1.5379708611419625, 'learning_rate': 7.732151189920672e-07, 'completion_length': 331.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.6958333551883698, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6779762506484985, 'reward_std': 0.06710773799568415, 'kl': 0.125, 'epoch': 0.23}
23%|██▎ | 973/4286 [7:25:34<24:16:10, 26.37s/it] {'loss': 0.0047, 'grad_norm': 1.3195224621448103, 'learning_rate': 7.729818012132524e-07, 'completion_length': 329.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6696429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6517858505249023, 'reward_std': 0.056844256818294525, 'kl': 0.11700439453125, 'epoch': 0.23}
23%|██▎ | 974/4286 [7:26:00<24:19:29, 26.44s/it] {'loss': 0.008, 'grad_norm': 0.6270451730447748, 'learning_rate': 7.727484834344376e-07, 'completion_length': 322.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.735119104385376, 'reward_std': 0.053571430034935474, 'kl': 0.2005615234375, 'epoch': 0.23}
23%|██▎ | 975/4286 [7:26:26<24:12:21, 26.32s/it] {'loss': 0.005, 'grad_norm': 1.7632948729950206, 'learning_rate': 7.72515165655623e-07, 'completion_length': 317.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 1.0, 'reward': 1.7574405670166016, 'reward_std': 0.043089400976896286, 'kl': 0.12481689453125, 'epoch': 0.23}
23%|██▎ | 976/4286 [7:26:54<24:33:33, 26.71s/it] {'loss': 0.01, 'grad_norm': 1.8956523079880208, 'learning_rate': 7.722818478768082e-07, 'completion_length': 329.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6949405670166016, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6592262983322144, 'reward_std': 0.13158316537737846, 'kl': 0.2490234375, 'epoch': 0.23}
23%|██▎ | 977/4286 [7:27:19<24:03:31, 26.17s/it] {'loss': 0.0025, 'grad_norm': 6.633416702046191, 'learning_rate': 7.720485300979934e-07, 'completion_length': 322.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.061739769764244556, 'kl': 0.06329345703125, 'epoch': 0.23}
23%|██▎ | 978/4286 [7:27:49<25:11:56, 27.42s/it] {'loss': 0.0067, 'grad_norm': 2.4038905799744685, 'learning_rate': 7.718152123191787e-07, 'completion_length': 327.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5300595462322235, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.512202501296997, 'reward_std': 0.07108007930219173, 'kl': 0.167724609375, 'epoch': 0.23}
23%|██▎ | 979/4286 [7:28:16<24:52:58, 27.09s/it] {'loss': 0.0018, 'grad_norm': 0.8687764372777433, 'learning_rate': 7.715818945403639e-07, 'completion_length': 306.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.026572037022560835, 'kl':
0.0450439453125, 'epoch': 0.23} 23%|██▎ | 979/4286 [7:28:16<24:52:58, 27.09s/it] 23%|██▎ | 980/4286 [7:28:41<24:20:12, 26.50s/it] {'loss': 0.0189, 'grad_norm': 2.180643500179778, 'learning_rate': 7.713485767615492e-07, 'completion_length': 311.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5907739400863647, 'reward_std': 0.12124167010188103, 'kl': 0.4716796875, 'epoch': 0.23} 23%|██▎ | 980/4286 [7:28:41<24:20:12, 26.50s/it] 23%|██▎ | 981/4286 [7:29:06<24:06:08, 26.25s/it] {'loss': 0.0017, 'grad_norm': 24.454297538664658, 'learning_rate': 7.711152589827344e-07, 'completion_length': 288.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6720238626003265, 'rewards/format_reward': 1.0, 'reward': 1.672023892402649, 'reward_std': 0.041643764823675156, 'kl': 0.0426025390625, 'epoch': 0.23} 23%|██▎ | 981/4286 [7:29:06<24:06:08, 26.25s/it] 23%|██▎ | 982/4286 [7:29:31<23:36:45, 25.73s/it] {'loss': 0.0043, 'grad_norm': 0.993269307640593, 'learning_rate': 7.708819412039197e-07, 'completion_length': 295.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6800595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.008928571827709675, 'kl': 0.1064453125, 'epoch': 0.23} 23%|██▎ | 982/4286 [7:29:31<23:36:45, 25.73s/it] 23%|██▎ | 983/4286 [7:30:00<24:29:31, 26.69s/it] {'loss': 0.0136, 'grad_norm': 2.3232353031501414, 'learning_rate': 7.706486234251049e-07, 'completion_length': 355.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6329816579818726, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5972674489021301, 'reward_std': 0.16004224121570587, 'kl': 0.33984375, 'epoch': 0.23} 23%|██▎ | 983/4286 [7:30:00<24:29:31, 26.69s/it] 23%|██▎ | 984/4286 [7:30:28<24:58:36, 27.23s/it] {'loss': 0.0069, 'grad_norm': 5.4886633405781975, 'learning_rate': 7.704153056462902e-07, 'completion_length': 330.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.0476190485060215, 'kl': 0.1728515625, 'epoch': 0.23} 23%|██▎ | 984/4286 [7:30:28<24:58:36, 27.23s/it][2025-03-02 22:28:17,780] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 23%|██▎ | 985/4286 [7:30:55<24:45:59, 27.01s/it] {'loss': 0.0124, 'grad_norm': 1.6832499782480665, 'learning_rate': 7.701819878674755e-07, 'completion_length': 328.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7574405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.739583432674408, 'reward_std': 0.1160714365541935, 'kl': 0.309814453125, 'epoch': 0.23} 23%|██▎ | 985/4286 [7:30:55<24:45:59, 27.01s/it] 23%|██▎ | 986/4286 [7:31:17<23:17:44, 25.41s/it] {'loss': 0.0087, 'grad_norm': 3.2608372215880306, 'learning_rate': 7.699486700886607e-07, 'completion_length': 219.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.7281746566295624, 'rewards/format_reward': 1.0, 'reward': 1.7281746864318848, 'reward_std': 0.0238095261156559, 'kl': 0.217529296875, 'epoch': 0.23} 23%|██▎ | 986/4286 [7:31:17<23:17:44, 25.41s/it] 23%|██▎ | 987/4286 [7:31:45<24:12:58, 26.43s/it] {'loss': 0.0306, 'grad_norm': 3.802979273264998, 'learning_rate': 7.697153523098459e-07, 'completion_length': 290.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.566964328289032, 'reward_std': 0.11916154250502586, 'kl': 0.7646484375, 'epoch': 0.23} 23%|██▎ | 987/4286 [7:31:45<24:12:58, 26.43s/it] 23%|██▎ | 988/4286 [7:32:14<24:51:07, 27.13s/it] {'loss': 0.0159, 'grad_norm': 9.604357108740455, 'learning_rate': 7.694820345310313e-07, 'completion_length': 297.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7529762983322144, 'reward_std': 0.13426091521978378, 'kl': 0.394775390625, 'epoch': 0.23} 23%|██▎ | 988/4286 [7:32:14<24:51:07, 27.13s/it] 23%|██▎ | 989/4286 [7:32:43<25:26:33, 27.78s/it] {'loss': 0.0122, 'grad_norm': 1.2287452585345866, 'learning_rate': 7.692487167522165e-07, 'completion_length': 295.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.6309524476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6130953431129456, 'reward_std': 0.07327024452388287, 'kl': 0.30419921875, 'epoch': 0.23} 23%|██▎ | 989/4286 [7:32:43<25:26:33, 27.78s/it] 23%|██▎ | 990/4286 [7:33:09<24:47:59, 27.09s/it] {'loss': 0.0275, 'grad_norm': 2.2562835349952706, 'learning_rate': 7.690153989734017e-07, 'completion_length': 338.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7029762268066406, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6672620177268982, 'reward_std': 0.14166667684912682, 'kl': 0.6925048828125, 'epoch': 0.23} 23%|██▎ | 990/4286 [7:33:09<24:47:59, 27.09s/it] 23%|██▎ | 991/4286 [7:33:35<24:31:25, 26.79s/it] {'loss': 0.0092, 'grad_norm': 2.630775373345883, 'learning_rate': 7.68782081194587e-07, 'completion_length': 313.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6436012387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6257441639900208, 'reward_std': 0.07988189347088337, 'kl': 0.2288818359375, 'epoch': 0.23} 23%|██▎ | 991/4286 [7:33:35<24:31:25, 26.79s/it] 23%|██▎ | 992/4286 [7:34:00<24:08:02, 26.38s/it] {'loss': 0.0289, 'grad_norm': 4.103155089117636, 'learning_rate': 7.685487634157723e-07, 'completion_length': 304.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.5952381193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5773810148239136, 'reward_std': 
0.07762769609689713, 'kl': 0.722900390625, 'epoch': 0.23} 23%|██▎ | 992/4286 [7:34:00<24:08:02, 26.38s/it] 23%|██▎ | 993/4286 [7:34:26<23:57:41, 26.20s/it] {'loss': 0.0018, 'grad_norm': 1.68800296036807, 'learning_rate': 7.683154456369575e-07, 'completion_length': 344.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.014880955684930086, 'kl': 0.0455322265625, 'epoch': 0.23} 23%|██▎ | 993/4286 [7:34:26<23:57:41, 26.20s/it] 23%|██▎ | 994/4286 [7:34:51<23:33:52, 25.77s/it] {'loss': 0.0018, 'grad_norm': 2.0970496849236295, 'learning_rate': 7.680821278581427e-07, 'completion_length': 328.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.04946071840822697, 'kl': 0.046142578125, 'epoch': 0.23} 23%|██▎ | 994/4286 [7:34:51<23:33:52, 25.77s/it] 23%|██▎ | 995/4286 [7:35:17<23:33:21, 25.77s/it] {'loss': 0.0137, 'grad_norm': 1.6327504206831203, 'learning_rate': 7.67848810079328e-07, 'completion_length': 310.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6755953431129456, 'reward_std': 0.07082726433873177, 'kl': 0.341796875, 'epoch': 0.23} 23%|██▎ | 995/4286 [7:35:17<23:33:21, 25.77s/it] 23%|██▎ | 996/4286 [7:35:41<23:13:28, 25.41s/it] {'loss': 0.0019, 'grad_norm': 1.3369604457133284, 'learning_rate': 7.676154923005133e-07, 'completion_length': 264.07144927978516, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562501788139343, 'reward_std': 0.060398245230317116, 'kl': 0.0462646484375, 'epoch': 0.23} 23%|██▎ | 996/4286 [7:35:41<23:13:28, 25.41s/it] 23%|██▎ | 997/4286 [7:36:06<23:02:49, 25.23s/it] {'loss': 0.0041, 'grad_norm': 0.916649988905193, 'learning_rate': 7.673821745216985e-07, 'completion_length': 321.5714569091797, 'rewards/only_full_func_accuracy_reward': 0.7500000894069672, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.0476190485060215, 'kl': 0.1031494140625, 'epoch': 0.23} 23%|██▎ | 997/4286 [7:36:06<23:02:49, 25.23s/it] 23%|██▎ | 998/4286 [7:36:30<22:48:51, 24.98s/it] {'loss': 0.0358, 'grad_norm': 3.878789923882846, 'learning_rate': 7.671488567428838e-07, 'completion_length': 320.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7008928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6830358505249023, 'reward_std': 0.0625, 'kl': 0.900390625, 'epoch': 0.23} 23%|██▎ | 998/4286 [7:36:30<22:48:51, 24.98s/it] 23%|██▎ | 999/4286 [7:36:54<22:24:05, 24.53s/it] {'loss': 0.0442, 'grad_norm': 1.5590298901610768, 'learning_rate': 7.66915538964069e-07, 'completion_length': 284.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.54067462682724, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5049604773521423, 'reward_std': 0.15466262493282557, 'kl': 1.109375, 'epoch': 0.23} 23%|██▎ | 999/4286 [7:36:54<22:24:05, 24.53s/it] 23%|██▎ | 1000/4286 [7:37:19<22:32:50, 24.70s/it] {'loss': 0.0112, 'grad_norm': 5.495380156794959, 'learning_rate': 7.666822211852542e-07, 'completion_length': 304.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7113095819950104, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6755953431129456, 'reward_std': 0.10523692518472672, 'kl': 0.279541015625, 'epoch': 0.23} 23%|██▎ | 1000/4286 [7:37:19<22:32:50, 24.70s/it] 23%|██▎ | 1001/4286 
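A note on the reward columns in the entries above, inferred from the logged values rather than from this run's source: at every step, 'reward' is the sum of the two component columns ('rewards/only_full_func_accuracy_reward' + 'rewards/format_reward'); for step 960, 0.6738095581531525 + 1.0 matches the logged 1.6738096475601196 up to float32 rounding. 'reward_std' is then the spread of the per-completion totals (0.0 at step 971, where every sampled completion apparently scored identically). A minimal sketch of that bookkeeping; the tensors and their values are hypothetical:

    # Hypothetical per-completion component rewards for one prompt's group;
    # nothing here is read from the run's actual code or data.
    import torch

    accuracy = torch.tensor([0.50, 1.00, 0.75, 0.50])  # only_full_func_accuracy_reward
    fmt      = torch.tensor([1.00, 1.00, 1.00, 0.75])  # format_reward

    total = accuracy + fmt            # per-completion totals
    print(total.mean().item())        # logged as 'reward'
    print(total.std().item())         # logged as 'reward_std'
    print(accuracy.mean().item())     # logged as 'rewards/only_full_func_accuracy_reward'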
23%|██▎ | 1001/4286 [7:40:49<73:18:42, 80.34s/it] {'loss': 0.0194, 'grad_norm': 1.630505113905146, 'learning_rate': 7.664489034064396e-07, 'completion_length': 316.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5779762268066406, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5601191520690918, 'reward_std': 0.05649853125214577, 'kl': 0.486328125, 'epoch': 0.23}
23%|██▎ | 1002/4286 [7:41:15<58:15:03, 63.86s/it] {'loss': 0.0033, 'grad_norm': 1.3970153142905377, 'learning_rate': 7.662155856276248e-07, 'completion_length': 314.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.71577388048172, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.02862738538533449, 'kl': 0.0826416015625, 'epoch': 0.23}
23%|██▎ | 1003/4286 [7:41:40<47:36:18, 52.20s/it] {'loss': 0.0228, 'grad_norm': 3.5529857010107215, 'learning_rate': 7.6598226784881e-07, 'completion_length': 314.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6901786029338837, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6723215579986572, 'reward_std': 0.06780397333204746, 'kl': 0.57275390625, 'epoch': 0.23}
23%|██▎ | 1004/4286 [7:42:04<39:57:07, 43.82s/it] {'loss': 0.0046, 'grad_norm': 1.9429306385051, 'learning_rate': 7.657489500699952e-07, 'completion_length': 276.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.040146206971257925, 'kl': 0.1142578125, 'epoch': 0.23}
23%|██▎ | 1005/4286 [7:42:28<34:39:41, 38.03s/it] {'loss': 0.0277, 'grad_norm': 4.56539768140185, 'learning_rate': 7.655156322911806e-07, 'completion_length': 297.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.618452399969101, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.582738220691681, 'reward_std': 0.1365003138780594, 'kl': 0.69140625, 'epoch': 0.23}
23%|██▎ | 1006/4286 [7:42:56<31:41:39, 34.79s/it] {'loss': 0.0317, 'grad_norm': 2.4320229988458077, 'learning_rate': 7.652823145123658e-07, 'completion_length': 312.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6264882683753967, 'reward_std': 0.1398809589445591, 'kl': 0.79150390625, 'epoch': 0.23}
23%|██▎ | 1007/4286 [7:43:20<28:49:17, 31.64s/it] {'loss': 0.0462, 'grad_norm': 5.008291859845379, 'learning_rate': 7.65048996733551e-07, 'completion_length': 328.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5372024178504944, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.501488208770752, 'reward_std': 0.1689425315707922, 'kl': 1.154296875, 'epoch': 0.23}
24%|██▎ | 1008/4286 [7:43:46<27:16:13, 29.95s/it] {'loss': 0.0317, 'grad_norm': 63.921266530293046, 'learning_rate': 7.648156789547363e-07, 'completion_length': 326.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5848215818405151, 'reward_std': 0.12746937200427055, 'kl': 0.79296875, 'epoch': 0.24}
24%|██▎ | 1009/4286 [7:44:12<26:05:38, 28.67s/it] {'loss': 0.0473, 'grad_norm': 4.456933315625233, 'learning_rate': 7.645823611759216e-07, 'completion_length': 297.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.6774749457836151, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.606046438217163, 'reward_std': 0.17470426857471466, 'kl': 1.177734375, 'epoch': 0.24}
[2025-03-02 22:42:00,196] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
24%|██▎ | 1010/4286 [7:44:37<25:15:55, 27.76s/it] {'loss': 0.0086, 'grad_norm': 3.9796179594220797, 'learning_rate': 7.643490433971068e-07, 'completion_length': 305.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.47261907160282135, 'rewards/format_reward': 1.0, 'reward': 1.4726191759109497, 'reward_std': 0.06666667107492685, 'kl': 0.213623046875, 'epoch': 0.24}
[2025-03-02 22:42:26,330] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▎ | 1011/4286 [7:45:03<24:48:46, 27.28s/it] {'loss': 0.0109, 'grad_norm': 10.097112190902651, 'learning_rate': 7.641157256182921e-07, 'completion_length': 284.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.08724318072199821, 'kl': 0.273193359375, 'epoch': 0.24}
[2025-03-02 22:42:51,781] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▎ | 1012/4286 [7:45:29<24:18:26, 26.73s/it] {'loss': 0.0017, 'grad_norm': 2.48451708761687, 'learning_rate': 7.638824078394773e-07, 'completion_length': 308.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.8571429550647736, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.011904759332537651, 'kl': 0.0413818359375, 'epoch': 0.24}
[2025-03-02 22:43:16,317] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
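The stage3.py warnings recurring through this stretch carry their own suggested fix. A minimal sketch of where that call could sit, assuming a conventional DeepSpeed engine loop; `engine` and `train_loader` are placeholder names, not this run's actual code:

    # Flush the CUDA caching allocator at the same point on every rank, per
    # the warning's advice, so no rank stalls mid-step under memory pressure.
    # `engine` is a hypothetical deepspeed.DeepSpeedEngine.
    from deepspeed.accelerator import get_accelerator

    for step, batch in enumerate(train_loader):
        loss = engine(batch)      # forward
        engine.backward(loss)     # backward
        engine.step()             # optimizer step (where the warning fires)
        get_accelerator().empty_cache()

Flushing every step trades some allocator-reuse speed for predictable memory; flushing on a fixed interval (e.g. every N steps) is the usual compromise.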
24%|██▎ | 1013/4286 [7:45:53<23:42:07, 26.07s/it] {'loss': 0.0505, 'grad_norm': 10.838520410850263, 'learning_rate': 7.636490900606626e-07, 'completion_length': 322.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6318452656269073, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5604167580604553, 'reward_std': 0.13807334005832672, 'kl': 1.26171875, 'epoch': 0.24}
24%|██▎ | 1014/4286 [7:46:17<22:58:39, 25.28s/it] {'loss': 0.0372, 'grad_norm': 45.057073997902414, 'learning_rate': 7.634157722818479e-07, 'completion_length': 301.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5520833432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5342262983322144, 'reward_std': 0.09410358592867851, 'kl': 0.931640625, 'epoch': 0.24}
[2025-03-02 22:44:06,281] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▎ | 1015/4286 [7:46:43<23:18:34, 25.65s/it] {'loss': 0.0244, 'grad_norm': 6.936278874604113, 'learning_rate': 7.631824545030331e-07, 'completion_length': 321.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6984578371047974, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6627436876296997, 'reward_std': 0.17244210094213486, 'kl': 0.611328125, 'epoch': 0.24}
24%|██▎ | 1016/4286 [7:47:07<22:46:12, 25.07s/it] {'loss': 0.007, 'grad_norm': 47.469977915373946, 'learning_rate': 7.629491367242183e-07, 'completion_length': 305.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.669642984867096, 'reward_std': 0.03878935053944588, 'kl': 0.1748046875, 'epoch': 0.24}
24%|██▎ | 1017/4286 [7:47:32<22:48:32, 25.12s/it] {'loss': 0.0184, 'grad_norm': 4.55499030010368, 'learning_rate': 7.627158189454036e-07, 'completion_length': 306.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6729167103767395, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6550596356391907, 'reward_std': 0.11618061736226082, 'kl': 0.460205078125, 'epoch': 0.24}
24%|██▍ | 1018/4286 [7:47:58<22:54:57, 25.24s/it] {'loss': 0.012, 'grad_norm': 1.9227779049031053, 'learning_rate': 7.624825011665889e-07, 'completion_length': 291.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.723214328289032, 'reward_std': 0.08014346659183502, 'kl': 0.2996826171875, 'epoch': 0.24}
24%|██▍ | 1019/4286 [7:48:23<22:56:07, 25.27s/it] {'loss': 0.0119, 'grad_norm': 5.7269903482526185, 'learning_rate': 7.622491833877741e-07, 'completion_length': 307.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7101190686225891, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6922619938850403, 'reward_std': 0.10346660763025284, 'kl': 0.297119140625, 'epoch': 0.24}
[2025-03-02 22:46:12,204] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1020/4286 [7:48:49<23:09:20, 25.52s/it] {'loss': 0.0181, 'grad_norm': 1.6556573418994809, 'learning_rate': 7.620158656089593e-07, 'completion_length': 337.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6305272579193115, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6126701831817627, 'reward_std': 0.13686051592230797, 'kl': 0.45458984375, 'epoch': 0.24}
24%|██▍ | 1021/4286 [7:49:17<23:44:34, 26.18s/it] {'loss': 0.0422, 'grad_norm': 4.68805415351615, 'learning_rate': 7.617825478301447e-07, 'completion_length': 363.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6832058429718018, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.629634439945221, 'reward_std': 0.21342702955007553, 'kl': 1.0546875, 'epoch': 0.24}
24%|██▍ | 1022/4286 [7:49:44<23:51:59, 26.32s/it] {'loss': 0.0266, 'grad_norm': 3.4196287437436905, 'learning_rate': 7.615492300513299e-07, 'completion_length': 331.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6020834147930145, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5842262506484985, 'reward_std': 0.07891347724944353, 'kl': 0.6639404296875, 'epoch': 0.24}
24%|██▍ | 1023/4286 [7:50:09<23:32:35, 25.97s/it] {'loss': 0.0027, 'grad_norm': 4.335592568239056, 'learning_rate': 7.613159122725151e-07, 'completion_length': 302.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.01785714365541935, 'kl': 0.06787109375, 'epoch': 0.24}
24%|██▍ | 1024/4286 [7:50:34<23:27:03, 25.88s/it] {'loss': 0.0097, 'grad_norm': 0.5688924374403055, 'learning_rate': 7.610825944937004e-07, 'completion_length': 317.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187501788139343, 'reward_std': 0.020833331160247326, 'kl': 0.242919921875, 'epoch': 0.24}
24%|██▍ | 1025/4286 [7:51:01<23:43:37, 26.19s/it] {'loss': 0.0019, 'grad_norm': 6.110474518696454, 'learning_rate': 7.608492767148857e-07, 'completion_length': 341.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.020833331160247326, 'kl': 0.0477294921875, 'epoch': 0.24}
24%|██▍ | 1026/4286 [7:51:29<24:03:29, 26.57s/it] {'loss': 0.0066, 'grad_norm': 1.2199071297806177, 'learning_rate': 7.606159589360709e-07, 'completion_length': 322.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.8488096296787262, 'rewards/format_reward': 1.0, 'reward': 1.8488096594810486, 'reward_std': 0.06679030694067478, 'kl': 0.166748046875, 'epoch': 0.24}
[2025-03-02 22:49:17,111] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1027/4286 [7:51:54<23:43:14, 26.20s/it] {'loss': 0.0074, 'grad_norm': 10.475657338997708, 'learning_rate': 7.603826411572561e-07, 'completion_length': 295.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.09622006863355637, 'kl': 0.185546875, 'epoch': 0.24}
24%|██▍ | 1028/4286 [7:52:20<23:34:44, 26.05s/it] {'loss': 0.0024, 'grad_norm': 10.164640017939593, 'learning_rate': 7.601493233784414e-07, 'completion_length': 326.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.71577388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6979167461395264, 'reward_std': 0.06526251323521137, 'kl': 0.059814453125, 'epoch': 0.24}
24%|██▍ | 1029/4286 [7:52:48<24:03:32, 26.59s/it] {'loss': 0.0086, 'grad_norm': 3.6800329699063368, 'learning_rate': 7.599160055996266e-07, 'completion_length': 382.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.6170725524425507, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5992155075073242, 'reward_std': 0.11966332048177719, 'kl': 0.21484375, 'epoch': 0.24}
24%|██▍ | 1030/4286 [7:53:15<24:19:46, 26.90s/it] {'loss': 0.0176, 'grad_norm': 1.9527878077995577, 'learning_rate': 7.596826878208119e-07, 'completion_length': 355.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262983322144, 'reward_std': 0.11828738823533058, 'kl': 0.4429931640625, 'epoch': 0.24}
24%|██▍ | 1031/4286 [7:53:42<24:13:17, 26.79s/it] {'loss': 0.0215, 'grad_norm': 1.9458556737337034, 'learning_rate': 7.594493700419972e-07, 'completion_length': 334.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.0714285746216774, 'kl': 0.53515625, 'epoch': 0.24}
[2025-03-02 22:51:32,626] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
24%|██▍ | 1032/4286 [7:54:10<24:29:28, 27.10s/it] {'loss': 0.0319, 'grad_norm': 2.5773361239253854, 'learning_rate': 7.592160522631824e-07, 'completion_length': 330.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949406266212463, 'reward_std': 0.07900894992053509, 'kl': 0.798828125, 'epoch': 0.24}
[2025-03-02 22:52:00,215] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1033/4286 [7:54:37<24:37:04, 27.24s/it] {'loss': 0.0022, 'grad_norm': 7.743004487393013, 'learning_rate': 7.589827344843676e-07, 'completion_length': 323.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.10554791800677776, 'kl': 0.053955078125, 'epoch': 0.24}
[2025-03-02 22:52:28,325] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1034/4286 [7:55:05<24:50:42, 27.50s/it] {'loss': 0.005, 'grad_norm': 0.35608575906383283, 'learning_rate': 7.58749416705553e-07, 'completion_length': 309.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7351190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.005952383857220411, 'kl': 0.1251220703125, 'epoch': 0.24}
[2025-03-02 22:52:56,869] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
24%|██▍ | 1035/4286 [7:55:34<25:07:09, 27.82s/it] {'loss': 0.0026, 'grad_norm': 0.33172939113278743, 'learning_rate': 7.585160989267382e-07, 'completion_length': 357.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7195684909820557, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7017114758491516, 'reward_std': 0.07913508266210556, 'kl': 0.0655517578125, 'epoch': 0.24}
24%|██▍ | 1036/4286 [7:56:03<25:25:13, 28.16s/it] {'loss': 0.0044, 'grad_norm': 3.2614873962183815, 'learning_rate': 7.582827811479234e-07, 'completion_length': 371.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7849161624908447, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7492019534111023, 'reward_std': 0.15461242571473122, 'kl': 0.109375, 'epoch': 0.24}
[2025-03-02 22:53:53,876] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1037/4286 [7:56:31<25:23:01, 28.13s/it] {'loss': 0.0017, 'grad_norm': 4.245275116262176, 'learning_rate': 7.580494633691087e-07, 'completion_length': 329.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.8023810088634491, 'rewards/format_reward': 1.0, 'reward': 1.8023810386657715, 'reward_std': 0.03333333507180214, 'kl': 0.0435791015625, 'epoch': 0.24}
[2025-03-02 22:54:21,574] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1038/4286 [7:56:59<25:15:35, 28.00s/it] {'loss': 0.0024, 'grad_norm': 167.23837628177395, 'learning_rate': 7.57816145590294e-07, 'completion_length': 385.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6520833671092987, 'rewards/format_reward': 1.0, 'reward': 1.652083396911621, 'reward_std': 0.06923839822411537, 'kl': 0.05908203125, 'epoch': 0.24}
[2025-03-02 22:54:50,625] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
24%|██▍ | 1039/4286 [7:57:28<25:32:14, 28.31s/it] {'loss': 0.0473, 'grad_norm': 9.99779735896727, 'learning_rate': 7.575828278114792e-07, 'completion_length': 343.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.6161140203475952, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5625426173210144, 'reward_std': 0.19267379492521286, 'kl': 1.18359375, 'epoch': 0.24}
[2025-03-02 22:55:18,222] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1040/4286 [7:57:55<25:20:08, 28.10s/it] {'loss': 0.041, 'grad_norm': 4.2549239437324315, 'learning_rate': 7.573495100326644e-07, 'completion_length': 334.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5520834028720856, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5163692235946655, 'reward_std': 0.1415719836950302, 'kl': 1.025390625, 'epoch': 0.24}
24%|██▍ | 1041/4286 [7:58:21<24:41:33, 27.39s/it] {'loss': 0.0429, 'grad_norm': 15.597906243136377, 'learning_rate': 7.571161922538497e-07, 'completion_length': 310.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.75, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142858505249023, 'reward_std': 0.14721458591520786, 'kl': 1.06982421875, 'epoch': 0.24}
[2025-03-02 22:56:10,773] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1042/4286 [7:58:48<24:31:28, 27.22s/it] {'loss': 0.0077, 'grad_norm': 7.189091283164401, 'learning_rate': 7.56882874475035e-07, 'completion_length': 353.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7410715818405151, 'reward_std': 0.1375580057501793, 'kl': 0.19287109375, 'epoch': 0.24}
24%|██▍ | 1043/4286 [7:59:14<24:20:44, 27.03s/it] {'loss': 0.0292, 'grad_norm': 14.657945920213749, 'learning_rate': 7.566495566962202e-07, 'completion_length': 356.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.6803571283817291, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6625000834465027, 'reward_std': 0.127627681940794, 'kl': 0.728515625, 'epoch': 0.24}
24%|██▍ | 1044/4286 [7:59:41<24:18:49, 27.00s/it] {'loss': 0.0687, 'grad_norm': 2.528397795876625, 'learning_rate': 7.564162389174055e-07, 'completion_length': 360.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6396555304527283, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6039413213729858, 'reward_std': 0.09908289927989244, 'kl': 1.72265625, 'epoch': 0.24}
[2025-03-02 22:57:32,260] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1045/4286 [8:00:09<24:34:06, 27.29s/it] {'loss': 0.0436, 'grad_norm': 1.8360302267156605, 'learning_rate': 7.561829211385907e-07, 'completion_length': 367.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6757034659385681, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6042749881744385, 'reward_std': 0.15581009536981583, 'kl': 1.0928955078125, 'epoch': 0.24}
24%|██▍ | 1046/4286 [8:00:39<25:09:28, 27.95s/it] {'loss': 0.0182, 'grad_norm': 1.4577032724152068, 'learning_rate': 7.55949603359776e-07, 'completion_length': 332.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7602564692497253, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7245422005653381, 'reward_std': 0.14933179318904877, 'kl': 0.4532470703125, 'epoch': 0.24}
[2025-03-02 22:58:29,317] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
24%|██▍ | 1047/4286 [8:01:06<25:02:35, 27.83s/it] {'loss': 0.0169, 'grad_norm': 4.213639338339962, 'learning_rate': 7.557162855809613e-07, 'completion_length': 342.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6907738745212555, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6550596356391907, 'reward_std': 0.07710126973688602, 'kl': 0.42333984375, 'epoch': 0.24}
[2025-03-02 22:58:56,436] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1048/4286 [8:01:34<24:50:31, 27.62s/it] {'loss': 0.0212, 'grad_norm': 1.396092039884197, 'learning_rate': 7.554829678021465e-07, 'completion_length': 338.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.8122024238109589, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7586310505867004, 'reward_std': 0.10902716964483261, 'kl': 0.53192138671875, 'epoch': 0.24}
[2025-03-02 22:59:21,694] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
24%|██▍ | 1049/4286 [8:01:59<24:11:51, 26.91s/it] {'loss': 0.0153, 'grad_norm': 1.1521664214089047, 'learning_rate': 7.552496500233317e-07, 'completion_length': 293.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.0357142873108387, 'kl': 0.3812255859375, 'epoch': 0.24}
[2025-03-02 22:59:48,514] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
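One regularity in the columns, inferred purely from the numbers: every entry logged here satisfies loss ≈ 0.04 × kl (e.g. 0.0505 ≈ 0.04 × 1.26171875 at step 1013, and 0.0687 ≈ 0.04 × 1.72265625 at step 1044), consistent with a KL coefficient of roughly 0.04 and a policy-gradient term whose mean is near zero at the logged update, as a GRPO-style objective gives when the sampling policy still equals the current policy. A quick check against three logged pairs:

    # Verify loss ~= 0.04 * kl for logged (loss, kl) pairs; the 0.04
    # coefficient is an inference from the data, not a config value.
    pairs = [(0.0084, 0.2098388671875),  # step 960
             (0.0505, 1.26171875),       # step 1013
             (0.0687, 1.72265625)]       # step 1044
    for loss, kl in pairs:
        assert abs(loss - 0.04 * kl) < 1e-3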
24%|██▍ | 1050/4286 [8:02:26<24:09:55, 26.88s/it] {'loss': 0.0072, 'grad_norm': 21.3965374284416, 'learning_rate': 7.55016332244517e-07, 'completion_length': 343.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7514881789684296, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.04622611217200756, 'kl': 0.179443359375, 'epoch': 0.24}
25%|██▍ | 1051/4286 [8:02:52<24:00:23, 26.72s/it] {'loss': 0.0162, 'grad_norm': 2.8862948941448185, 'learning_rate': 7.547830144657023e-07, 'completion_length': 317.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6041666567325592, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.568452537059784, 'reward_std': 0.11745268851518631, 'kl': 0.40625, 'epoch': 0.25}
[2025-03-02 23:00:40,012] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
25%|██▍ | 1052/4286 [8:03:17<23:35:03, 26.25s/it] {'loss': 0.0138, 'grad_norm': 3.011327051507833, 'learning_rate': 7.545496966868875e-07, 'completion_length': 300.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.700892984867096, 'reward_std': 0.07339111901819706, 'kl': 0.34442138671875, 'epoch': 0.25}
25%|██▍ | 1053/4286 [8:03:41<22:57:33, 25.57s/it] {'loss': 0.0022, 'grad_norm': 20.789186172820013, 'learning_rate': 7.543163789080727e-07, 'completion_length': 319.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.06689050048589706, 'kl': 0.0555419921875, 'epoch': 0.25}
25%|██▍ | 1054/4286 [8:04:07<23:07:15, 25.75s/it] {'loss': 0.017, 'grad_norm': 1.0398750092454383, 'learning_rate': 7.54083061129258e-07, 'completion_length': 328.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.03419382870197296, 'kl': 0.42626953125, 'epoch': 0.25}
25%|██▍ | 1055/4286 [8:04:35<23:33:28, 26.25s/it] {'loss': 0.0014, 'grad_norm': 0.3256180126839006, 'learning_rate': 7.538497433504433e-07, 'completion_length': 373.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8139423131942749, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.7425137758255005, 'reward_std': 0.09682668000459671, 'kl': 0.0361328125, 'epoch': 0.25}
25%|██▍ | 1056/4286 [8:05:02<23:46:51, 26.51s/it] {'loss': 0.0083, 'grad_norm': 4.0277768680325154, 'learning_rate': 7.536164255716285e-07, 'completion_length': 350.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7065476179122925, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6886906027793884, 'reward_std': 0.0698343412950635, 'kl': 0.208251953125, 'epoch': 0.25}
25%|██▍ | 1057/4286 [8:05:28<23:41:03, 26.41s/it] {'loss': 0.0075, 'grad_norm': 1.7677723167811867, 'learning_rate': 7.533831077928138e-07, 'completion_length': 342.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.5848214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5848215222358704, 'reward_std': 0.07557234168052673, 'kl': 0.189208984375, 'epoch': 0.25}
25%|██▍ | 1058/4286 [8:05:56<24:11:43, 26.98s/it] {'loss': 0.0139, 'grad_norm': 1.5736657974428363, 'learning_rate': 7.53149790013999e-07, 'completion_length': 360.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.6142399907112122, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5963829159736633, 'reward_std': 0.11589129641652107, 'kl': 0.3466796875, 'epoch': 0.25}
25%|██▍ | 1059/4286 [8:06:23<24:11:41, 26.99s/it] {'loss': 0.0061, 'grad_norm': 1.0530626422486737, 'learning_rate': 7.529164722351843e-07, 'completion_length': 360.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6919643878936768, 'reward_std': 0.11555709317326546, 'kl': 0.15283203125, 'epoch': 0.25}
25%|██▍ | 1060/4286 [8:06:51<24:23:22, 27.22s/it] {'loss': 0.0098, 'grad_norm': 1.1081412624710298, 'learning_rate': 7.526831544563696e-07, 'completion_length': 344.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7276786267757416, 'rewards/format_reward': 1.0, 'reward': 1.7276787161827087, 'reward_std': 0.07440476305782795, 'kl': 0.244873046875, 'epoch': 0.25}
25%|██▍ | 1061/4286 [8:07:16<23:52:35, 26.65s/it] {'loss': 0.0068, 'grad_norm': 3.9316705408771604, 'learning_rate': 7.524498366775548e-07, 'completion_length': 307.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.625, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.011904764920473099, 'kl': 0.170166015625, 'epoch': 0.25}
[2025-03-02 23:05:05,733] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
25%|██▍ | 1062/4286 [8:07:43<23:49:06, 26.60s/it] {'loss': 0.0027, 'grad_norm': 4.45760067668055, 'learning_rate': 7.5221651889874e-07, 'completion_length': 344.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6636905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6458335518836975, 'reward_std': 0.10855060815811157, 'kl': 0.0667724609375, 'epoch': 0.25}
25%|██▍ | 1063/4286 [8:08:08<23:27:21, 26.20s/it] {'loss': 0.0017, 'grad_norm': 0.9159218839156671, 'learning_rate': 7.519832011199253e-07, 'completion_length': 349.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.6904762089252472, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.04123930633068085, 'kl': 0.04150390625, 'epoch': 0.25}
25%|██▍ | 1064/4286 [8:08:35<23:37:36, 26.40s/it] {'loss': 0.0023, 'grad_norm': 0.28287477375413955, 'learning_rate': 7.517498833411106e-07, 'completion_length': 301.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8154763579368591, 'reward_std': 0.0, 'kl': 0.05615234375, 'epoch': 0.25}
[2025-03-02 23:06:24,802] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
25%|██▍ | 1065/4286 [8:09:02<23:45:46, 26.56s/it] {'loss': 0.0097, 'grad_norm': 4.713661148608512, 'learning_rate': 7.515165655622958e-07, 'completion_length': 314.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.6354167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6354167461395264, 'reward_std': 0.04876218922436237, 'kl': 0.2421875, 'epoch': 0.25}
[2025-03-02 23:06:51,780] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
25%|██▍ | 1066/4286 [8:09:29<23:52:04, 26.68s/it] {'loss': 0.013, 'grad_norm': 1.662466811169711, 'learning_rate': 7.51283247783481e-07, 'completion_length': 319.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.4806548058986664, 'rewards/format_reward': 1.0, 'reward': 1.4806548357009888, 'reward_std': 0.03335912525653839, 'kl': 0.325439453125, 'epoch': 0.25}
[2025-03-02 23:07:19,367] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
25%|██▍ | 1067/4286 [8:09:56<24:06:10, 26.96s/it] {'loss': 0.0081, 'grad_norm': 1.4966426343102304, 'learning_rate': 7.510499300046664e-07, 'completion_length': 323.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7500000894069672, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.025651196017861366, 'kl': 0.203125, 'epoch': 0.25}
[2025-03-02 23:07:46,045] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
25%|██▍ | 1068/4286 [8:10:23<24:01:14, 26.87s/it] {'loss': 0.0015, 'grad_norm': 7.800581616033743, 'learning_rate': 7.508166122258516e-07, 'completion_length': 326.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.013746436685323715, 'kl': 0.0367431640625, 'epoch': 0.25}
25%|██▍ | 1069/4286 [8:10:47<23:13:24, 25.99s/it] {'loss': 0.0015, 'grad_norm': 0.8219063608171024, 'learning_rate': 7.505832944470368e-07, 'completion_length': 309.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7113095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.022214585915207863, 'kl': 0.0380859375, 'epoch': 0.25}
25%|██▍ | 1070/4286 [8:11:12<22:59:02, 25.73s/it] {'loss': 0.3569, 'grad_norm': 20298.680883193436, 'learning_rate': 7.503499766682221e-07, 'completion_length': 328.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7113095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.010309826582670212, 'kl': 8.9617919921875, 'epoch': 0.25}
25%|██▍ | 1071/4286 [8:11:38<23:07:59, 25.90s/it] {'loss': 0.0103, 'grad_norm': 1.6902917415536824, 'learning_rate': 7.501166588894074e-07, 'completion_length': 338.9464569091797, 'rewards/only_full_func_accuracy_reward': 0.6592262089252472, 'rewards/format_reward': 1.0, 'reward': 1.6592262983322144, 'reward_std': 0.02678571827709675, 'kl': 0.25830078125, 'epoch': 0.25}
25%|██▌ | 1072/4286 [8:12:04<23:06:28, 25.88s/it] {'loss': 0.0031, 'grad_norm': 5.128128909797862, 'learning_rate': 7.498833411105926e-07, 'completion_length': 317.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.07577433437108994, 'kl': 0.07861328125, 'epoch': 0.25}
25%|██▌ | 1073/4286 [8:12:30<23:08:46, 25.93s/it] {'loss': 0.0013, 'grad_norm': 0.42145438245694705, 'learning_rate': 7.496500233317778e-07, 'completion_length': 341.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.8068453371524811, 'rewards/format_reward': 1.0, 'reward': 1.8068453669548035, 'reward_std': 0.05850121518597007, 'kl': 0.032958984375, 'epoch': 0.25}
25%|██▌ | 1074/4286 [8:12:55<22:51:57, 25.63s/it] {'loss': 0.0023, 'grad_norm': 0.6161908193127619, 'learning_rate': 7.494167055529631e-07, 'completion_length': 326.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7857143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7857144474983215, 'reward_std': 0.040071725845336914, 'kl': 0.0565185546875, 'epoch': 0.25}
25%|██▌ | 1075/4286 [8:13:21<22:53:52, 25.67s/it] {'loss': 0.0065, 'grad_norm': 5.061996635380908, 'learning_rate': 7.491833877741483e-07, 'completion_length': 326.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.04007172957062721, 'kl': 0.16259765625, 'epoch': 0.25}
[2025-03-02 23:11:08,312] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
25%|██▌ | 1076/4286 [8:13:45<22:31:54, 25.27s/it] {'loss': 0.0012, 'grad_norm': 0.5771542012573718, 'learning_rate': 7.489500699953336e-07, 'completion_length': 338.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.014880950096994638, 'kl': 0.02947998046875, 'epoch': 0.25}
25%|██▌ | 1077/4286 [8:14:11<22:34:32, 25.33s/it] {'loss': 0.0012, 'grad_norm': 2.457382076504096, 'learning_rate': 7.487167522165189e-07, 'completion_length': 313.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.71726194024086, 'rewards/format_reward': 1.0, 'reward': 1.717262089252472, 'reward_std': 0.061563242226839066, 'kl': 0.02880859375, 'epoch': 0.25}
25%|██▌ | 1078/4286 [8:14:37<22:48:10, 25.59s/it] {'loss': 0.0014, 'grad_norm': 1.2948483621890103, 'learning_rate': 7.484834344377041e-07, 'completion_length': 313.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187500596046448, 'reward_std': 0.0446428582072258, 'kl': 0.03460693359375, 'epoch': 0.25}
25%|██▌ | 1079/4286 [8:15:02<22:36:15, 25.37s/it] {'loss': 0.0033, 'grad_norm': 0.3384068214633645, 'learning_rate': 7.482501166588893e-07, 'completion_length': 324.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8955357670783997, 'rewards/format_reward': 1.0, 'reward': 1.8955358266830444, 'reward_std': 0.03755657374858856, 'kl': 0.0821533203125, 'epoch': 0.25}
25%|██▌ | 1080/4286 [8:15:26<22:19:23, 25.07s/it] {'loss': 0.0023, 'grad_norm': 1.538469623602756, 'learning_rate': 7.480167988800747e-07, 'completion_length': 308.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6443452537059784, 'rewards/format_reward': 1.0, 'reward': 1.6443453431129456, 'reward_std': 0.02267500851303339, 'kl': 0.0574951171875, 'epoch': 0.25}
[8:15:53<22:37:39, 25.42s/it] {'loss': 0.0021, 'grad_norm': 2.7836943246299493, 'learning_rate': 7.477834811012599e-07, 'completion_length': 329.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.81101194024086, 'rewards/format_reward': 1.0, 'reward': 1.8110119700431824, 'reward_std': 0.037095542065799236, 'kl': 0.0533447265625, 'epoch': 0.25} 25%|██▌ | 1081/4286 [8:15:53<22:37:39, 25.42s/it] 25%|██▌ | 1082/4286 [8:16:16<22:14:15, 24.99s/it] {'loss': 0.0014, 'grad_norm': 0.17574038839618564, 'learning_rate': 7.475501633224451e-07, 'completion_length': 292.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.013746436685323715, 'kl': 0.035400390625, 'epoch': 0.25} 25%|██▌ | 1082/4286 [8:16:16<22:14:15, 24.99s/it] 25%|██▌ | 1083/4286 [8:16:41<22:07:44, 24.87s/it] {'loss': 0.0012, 'grad_norm': 5.839997869068922, 'learning_rate': 7.473168455436303e-07, 'completion_length': 323.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7410715520381927, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.04166666977107525, 'kl': 0.03057861328125, 'epoch': 0.25} 25%|██▌ | 1083/4286 [8:16:41<22:07:44, 24.87s/it] 25%|██▌ | 1084/4286 [8:17:06<22:11:02, 24.94s/it] {'loss': 0.0088, 'grad_norm': 1.5499448365292023, 'learning_rate': 7.470835277648157e-07, 'completion_length': 293.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6949405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.677083432674408, 'reward_std': 0.06250000465661287, 'kl': 0.220947265625, 'epoch': 0.25} 25%|██▌ | 1084/4286 [8:17:06<22:11:02, 24.94s/it] 25%|██▌ | 1085/4286 [8:17:31<22:03:41, 24.81s/it] {'loss': 0.0028, 'grad_norm': 0.7774531270191904, 'learning_rate': 7.468502099860009e-07, 'completion_length': 322.1964569091797, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.05952380783855915, 'kl': 0.06982421875, 'epoch': 0.25} 25%|██▌ | 1085/4286 [8:17:31<22:03:41, 24.81s/it][2025-03-02 23:15:21,599] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
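In these records 'reward' is the sum of the two component rewards, and 'rewards/format_reward' only ever moves in steps of 1/56 (1.0, 0.98214..., 0.96428..., 0.94642...), which suggests each logged step averages over 56 sampled completions. A quick sanity check on the step-1062 record, using nothing beyond the logged values:

```python
# Check on the step-1062 record: total reward = accuracy reward + format reward,
# and the format reward is quantized in units of 1/56 (56 completions per step
# is an inference from the logged values, not a config read from the run).
acc = 0.6636905074119568   # rewards/only_full_func_accuracy_reward
fmt = 0.9821428656578064   # rewards/format_reward
print(acc + fmt)           # ~1.645833, matching the logged 'reward' to fp32 noise
print((1 - fmt) * 56)      # ~1.0, i.e. exactly one completion missed the format
```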
25%|██▌ | 1086/4286 [8:17:59<22:53:50, 25.76s/it] {'loss': 0.0025, 'grad_norm': 4.162625181647598, 'learning_rate': 7.466168922071861e-07, 'completion_length': 337.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.8205783069133759, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8027212023735046, 'reward_std': 0.09369122236967087, 'kl': 0.0628662109375, 'epoch': 0.25}
25%|██▌ | 1087/4286 [8:18:25<22:59:04, 25.87s/it] {'loss': 0.0029, 'grad_norm': 2.5639849137376407, 'learning_rate': 7.463835744283714e-07, 'completion_length': 346.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7250000238418579, 'rewards/format_reward': 1.0, 'reward': 1.7250000834465027, 'reward_std': 0.08834509551525116, 'kl': 0.0721435546875, 'epoch': 0.25}
[2025-03-02 23:16:14,209] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
25%|██▌ | 1088/4286 [8:18:51<23:08:43, 26.05s/it] {'loss': 0.009, 'grad_norm': 1.073095292504922, 'learning_rate': 7.461502566495567e-07, 'completion_length': 296.375, 'rewards/only_full_func_accuracy_reward': 0.8229167461395264, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8050596714019775, 'reward_std': 0.08186878263950348, 'kl': 0.22412109375, 'epoch': 0.25}
25%|██▌ | 1089/4286 [8:19:15<22:37:30, 25.48s/it] {'loss': 0.0026, 'grad_norm': 2.353150799480056, 'learning_rate': 7.459169388707419e-07, 'completion_length': 324.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7827381193637848, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.094435915350914, 'kl': 0.0654296875, 'epoch': 0.25}
25%|██▌ | 1090/4286 [8:19:42<22:51:16, 25.74s/it] {'loss': 0.0255, 'grad_norm': 3.895187682231652, 'learning_rate': 7.456836210919272e-07, 'completion_length': 349.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.04660541191697121, 'kl': 0.638671875, 'epoch': 0.25}
25%|██▌ | 1091/4286 [8:20:07<22:46:15, 25.66s/it] {'loss': 0.0143, 'grad_norm': 2.4271181662767707, 'learning_rate': 7.454503033131124e-07, 'completion_length': 324.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.5714285671710968, 'rewards/format_reward': 1.0, 'reward': 1.5714287161827087, 'reward_std': 0.011904762126505375, 'kl': 0.35888671875, 'epoch': 0.25}
25%|██▌ | 1092/4286 [8:20:32<22:29:38, 25.35s/it] {'loss': 0.0034, 'grad_norm': 1.9349176973031321, 'learning_rate': 7.452169855342977e-07, 'completion_length': 321.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7946429550647736, 'rewards/format_reward': 1.0, 'reward': 1.794642984867096, 'reward_std': 0.02976190857589245, 'kl': 0.085205078125, 'epoch': 0.25}
26%|██▌ | 1093/4286 [8:20:58<22:38:27, 25.53s/it] {'loss': 0.0488, 'grad_norm': 2.8745627426161953, 'learning_rate': 7.44983667755483e-07, 'completion_length': 338.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.5059524029493332, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4880953431129456, 'reward_std': 0.08005648851394653, 'kl': 1.21875, 'epoch': 0.26}
26%|██▌ | 1094/4286 [8:21:22<22:19:15, 25.17s/it] {'loss': 0.0028, 'grad_norm': 3.097855861624751, 'learning_rate': 7.447503499766682e-07, 'completion_length': 294.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.791666716337204, 'rewards/format_reward': 1.0, 'reward': 1.7916668057441711, 'reward_std': 0.013746436685323715, 'kl': 0.069091796875, 'epoch': 0.26}
26%|██▌ | 1095/4286 [8:21:46<21:58:54, 24.80s/it] {'loss': 0.0124, 'grad_norm': 14.61707043774755, 'learning_rate': 7.445170321978534e-07, 'completion_length': 280.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.05197649821639061, 'kl': 0.311767578125, 'epoch': 0.26}
26%|██▌ | 1096/4286 [8:22:13<22:30:27, 25.40s/it] {'loss': 0.0324, 'grad_norm': 1.4047290886768629, 'learning_rate': 7.442837144190387e-07, 'completion_length': 364.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.743303656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7254465222358704, 'reward_std': 0.06870058923959732, 'kl': 0.811767578125, 'epoch': 0.26}
26%|██▌ | 1097/4286 [8:22:39<22:42:44, 25.64s/it] {'loss': 0.0104, 'grad_norm': 1.954560687039723, 'learning_rate': 7.44050396640224e-07, 'completion_length': 352.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6017857789993286, 'rewards/format_reward': 1.0, 'reward': 1.6017857789993286, 'reward_std': 0.06324290484189987, 'kl': 0.260009765625, 'epoch': 0.26}
26%|██▌ | 1098/4286 [8:23:07<23:20:15, 26.35s/it] {'loss': 0.0181, 'grad_norm': 21.0789680670765, 'learning_rate': 7.438170788614092e-07, 'completion_length': 346.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.02816697023808956, 'kl': 0.453857421875, 'epoch': 0.26}
26%|██▌ | 1099/4286 [8:23:34<23:32:20, 26.59s/it] {'loss': 0.0176, 'grad_norm': 9.829042633479053, 'learning_rate': 7.435837610825944e-07, 'completion_length': 354.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6160715818405151, 'reward_std': 0.09112738817930222, 'kl': 0.44091796875, 'epoch': 0.26}
26%|██▌ | 1100/4286 [8:24:01<23:28:29, 26.53s/it] {'loss': 0.0021, 'grad_norm': 4.307488532521529, 'learning_rate': 7.433504433037798e-07, 'completion_length': 334.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.5869047939777374, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5690476894378662, 'reward_std': 0.11190475896000862, 'kl': 0.05322265625, 'epoch': 0.26}
26%|██▌ | 1101/4286 [8:28:31<88:16:14, 99.77s/it] {'loss': 0.0047, 'grad_norm': 1.175398396367311, 'learning_rate': 7.43117125524965e-07, 'completion_length': 309.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6041666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.04602411389350891, 'kl': 0.117431640625, 'epoch': 0.26}
26%|██▌ | 1102/4286 [8:28:57<68:40:55, 77.66s/it] {'loss': 0.0225, 'grad_norm': 4.913027577283872, 'learning_rate': 7.428838077461502e-07, 'completion_length': 303.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6755953431129456, 'reward_std': 0.08915280178189278, 'kl': 0.5615234375, 'epoch': 0.26}
26%|██▌ | 1103/4286 [8:29:22<54:36:23, 61.76s/it] {'loss': 0.0037, 'grad_norm': 4.29917558920227, 'learning_rate': 7.426504899673355e-07, 'completion_length': 343.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8050595223903656, 'rewards/format_reward': 1.0, 'reward': 1.8050596714019775, 'reward_std': 0.02083333395421505, 'kl': 0.09185791015625, 'epoch': 0.26}
26%|██▌ | 1104/4286 [8:29:47<44:47:59, 50.68s/it] {'loss': 0.0016, 'grad_norm': 0.12100421772090446, 'learning_rate': 7.424171721885207e-07, 'completion_length': 311.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.0, 'kl': 0.03973388671875, 'epoch': 0.26}
26%|██▌ | 1105/4286 [8:30:10<37:34:03, 42.52s/it] {'loss': 0.0036, 'grad_norm': 1.1032402168394904, 'learning_rate': 7.42183854409706e-07, 'completion_length': 238.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.0535714328289032, 'kl': 0.09033203125, 'epoch': 0.26}
26%|██▌ | 1106/4286 [8:30:35<32:56:42, 37.30s/it] {'loss': 0.0066, 'grad_norm': 2.106166092970626, 'learning_rate': 7.419505366308912e-07, 'completion_length': 287.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6260822713375092, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5725109577178955, 'reward_std': 0.10319678857922554, 'kl': 0.1666259765625, 'epoch': 0.26}
26%|██▌ | 1107/4286 [8:30:59<29:09:44, 33.02s/it] {'loss': 0.0116, 'grad_norm': 2.1159636986338133, 'learning_rate': 7.417172188520765e-07, 'completion_length': 318.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6130953431129456, 'reward_std': 0.07142857182770967, 'kl': 0.2900390625, 'epoch': 0.26}
26%|██▌ | 1108/4286 [8:31:24<27:11:11, 30.80s/it] {'loss': 0.0045, 'grad_norm': 4.1754603510758095, 'learning_rate': 7.414839010732617e-07, 'completion_length': 291.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6398810148239136, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.019698821008205414, 'kl': 0.1126708984375, 'epoch': 0.26}
[2025-03-02 23:29:12,740] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
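The 'learning_rate' column falls by a constant ~2.3332e-9 per step, which is exactly what a linear decay from 1e-6 to 0 over the 4286 total steps would give. A small sketch under that assumption (the base rate and schedule are inferred from the logged values, not read from the run's config):

```python
# Inferred, not confirmed: lr decays linearly from 1e-6 to 0 over 4286 steps.
BASE_LR = 1e-6
TOTAL_STEPS = 4286

def linear_lr(step: int) -> float:
    return BASE_LR * (TOTAL_STEPS - step) / TOTAL_STEPS

print(linear_lr(1062))  # 7.5221651...e-07, matches the step-1062 record
print(linear_lr(1101))  # 7.4311712...e-07, matches the step-1101 record
```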
26%|██▌ | 1109/4286 [8:31:50<25:50:02, 29.27s/it] {'loss': 0.0017, 'grad_norm': 0.25647519450213063, 'learning_rate': 7.41250583294447e-07, 'completion_length': 292.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.008928571827709675, 'kl': 0.04254150390625, 'epoch': 0.26}
[2025-03-02 23:29:38,046] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
26%|██▌ | 1110/4286 [8:32:15<24:46:32, 28.08s/it] {'loss': 0.0012, 'grad_norm': 1.0628197413961764, 'learning_rate': 7.410172655156323e-07, 'completion_length': 299.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 1.0, 'reward': 1.6473215818405151, 'reward_std': 0.109691696241498, 'kl': 0.03076171875, 'epoch': 0.26}
26%|██▌ | 1111/4286 [8:32:40<23:54:00, 27.10s/it] {'loss': 0.0026, 'grad_norm': 10.167380186311291, 'learning_rate': 7.407839477368175e-07, 'completion_length': 304.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6845238208770752, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.05952381156384945, 'kl': 0.0645751953125, 'epoch': 0.26}
26%|██▌ | 1112/4286 [8:33:07<23:53:55, 27.11s/it] {'loss': 0.0013, 'grad_norm': 0.8623338332230525, 'learning_rate': 7.405506299580027e-07, 'completion_length': 363.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.5946158468723297, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.541044533252716, 'reward_std': 0.04843553155660629, 'kl': 0.03271484375, 'epoch': 0.26}
26%|██▌ | 1113/4286 [8:33:32<23:16:34, 26.41s/it] {'loss': 0.0011, 'grad_norm': 0.6336582390548046, 'learning_rate': 7.403173121791881e-07, 'completion_length': 323.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.6190476417541504, 'rewards/format_reward': 1.0, 'reward': 1.61904776096344, 'reward_std': 0.011904764920473099, 'kl': 0.027099609375, 'epoch': 0.26}
26%|██▌ | 1114/4286 [8:34:00<23:45:34, 26.97s/it] {'loss': 0.0019, 'grad_norm': 2.263781421225852, 'learning_rate': 7.400839944003733e-07, 'completion_length': 308.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7038690745830536, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.07136759348213673, 'kl': 0.0472412109375, 'epoch': 0.26}
26%|██▌ | 1115/4286 [8:34:27<23:49:18, 27.04s/it] {'loss': 0.0012, 'grad_norm': 1.027616196422716, 'learning_rate': 7.398506766215585e-07, 'completion_length': 366.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.03411935269832611, 'kl': 0.02978515625, 'epoch': 0.26}
26%|██▌ | 1116/4286 [8:34:53<23:27:04, 26.63s/it] {'loss': 0.0013, 'grad_norm': 1.345757323867427, 'learning_rate': 7.396173588427438e-07, 'completion_length': 303.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6979167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.0565476194024086, 'kl': 0.031494140625, 'epoch': 0.26}
[2025-03-02 23:32:43,283] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
26%|██▌ | 1117/4286 [8:35:20<23:38:15, 26.85s/it] {'loss': 0.0011, 'grad_norm': 1.7897083428077796, 'learning_rate': 7.393840410639291e-07, 'completion_length': 365.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.008928571827709675, 'kl': 0.02838134765625, 'epoch': 0.26}
26%|██▌ | 1118/4286 [8:35:49<24:03:07, 27.33s/it] {'loss': 0.0017, 'grad_norm': 2.9481548559835797, 'learning_rate': 7.391507232851143e-07, 'completion_length': 324.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.5409812778234482, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4874098896980286, 'reward_std': 0.14766617864370346, 'kl': 0.04296875, 'epoch': 0.26}
26%|██▌ | 1119/4286 [8:36:16<24:00:03, 27.28s/it] {'loss': 0.0011, 'grad_norm': 0.09621534643273968, 'learning_rate': 7.389174055062995e-07, 'completion_length': 340.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.834821492433548, 'rewards/format_reward': 1.0, 'reward': 1.8348215222358704, 'reward_std': 0.008928571827709675, 'kl': 0.02716064453125, 'epoch': 0.26}
26%|██▌ | 1120/4286 [8:36:42<23:39:34, 26.90s/it] {'loss': 0.0016, 'grad_norm': 3.627536799549427, 'learning_rate': 7.386840877274848e-07, 'completion_length': 328.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.042834240943193436, 'kl': 0.0408935546875, 'epoch': 0.26}
26%|██▌ | 1121/4286 [8:37:07<23:09:57, 26.35s/it] {'loss': 0.001, 'grad_norm': 0.26055374701635736, 'learning_rate': 7.384507699486701e-07, 'completion_length': 343.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7351190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.01785714365541935, 'kl': 0.0247802734375, 'epoch': 0.26}
26%|██▌ | 1122/4286 [8:37:36<23:46:08, 27.04s/it] {'loss': 0.0013, 'grad_norm': 0.8957480369459219, 'learning_rate': 7.382174521698553e-07, 'completion_length': 367.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7961309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.03985805157572031, 'kl': 0.03173828125, 'epoch': 0.26}
[2025-03-02 23:35:26,762] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
26%|██▌ | 1123/4286 [8:38:04<24:02:44, 27.37s/it] {'loss': 0.0017, 'grad_norm': 3.506544305248371, 'learning_rate': 7.379841343910406e-07, 'completion_length': 297.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.8035714030265808, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.0595238022506237, 'kl': 0.04296875, 'epoch': 0.26}
[2025-03-02 23:35:56,401] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
26%|██▌ | 1124/4286 [8:38:33<24:38:08, 28.05s/it] {'loss': 0.0017, 'grad_norm': 0.5362581693549682, 'learning_rate': 7.377508166122258e-07, 'completion_length': 400.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7347719371318817, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6812004446983337, 'reward_std': 0.13726655393838882, 'kl': 0.04278564453125, 'epoch': 0.26}
26%|██▌ | 1125/4286 [8:39:01<24:24:24, 27.80s/it] {'loss': 0.0011, 'grad_norm': 0.5761069541127093, 'learning_rate': 7.37517498833411e-07, 'completion_length': 347.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.839826911687851, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8219698071479797, 'reward_std': 0.0959975179284811, 'kl': 0.02734375, 'epoch': 0.26}
[2025-03-02 23:36:53,462] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
26%|██▋ | 1126/4286 [8:39:31<24:56:26, 28.41s/it] {'loss': 0.0014, 'grad_norm': 4.479023649600391, 'learning_rate': 7.372841810545964e-07, 'completion_length': 371.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.625405877828598, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5718345642089844, 'reward_std': 0.21347403526306152, 'kl': 0.0345458984375, 'epoch': 0.26}
[2025-03-02 23:37:21,582] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
26%|██▋ | 1127/4286 [8:39:59<24:51:19, 28.33s/it] {'loss': 0.0022, 'grad_norm': 2.0426408011174497, 'learning_rate': 7.370508632757816e-07, 'completion_length': 353.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.6655844748020172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6477274298667908, 'reward_std': 0.11114903911948204, 'kl': 0.0543212890625, 'epoch': 0.26}
[2025-03-02 23:37:49,050] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
26%|██▋ | 1128/4286 [8:40:26<24:37:18, 28.07s/it] {'loss': 0.0013, 'grad_norm': 0.17484763996219727, 'learning_rate': 7.368175454969668e-07, 'completion_length': 384.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6958875060081482, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6423161029815674, 'reward_std': 0.13464132882654667, 'kl': 0.03277587890625, 'epoch': 0.26}
26%|██▋ | 1129/4286 [8:40:51<23:45:18, 27.09s/it] {'loss': 0.0024, 'grad_norm': 1.959152745431263, 'learning_rate': 7.36584227718152e-07, 'completion_length': 332.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.0535714328289032, 'kl': 0.05963134765625, 'epoch': 0.26}
[2025-03-02 23:38:42,852] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
26%|██▋ | 1130/4286 [8:41:20<24:15:00, 27.66s/it] {'loss': 0.0014, 'grad_norm': 3.748003446175883, 'learning_rate': 7.363509099393374e-07, 'completion_length': 335.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6949406266212463, 'reward_std': 0.11607143748551607, 'kl': 0.035888671875, 'epoch': 0.26}
26%|██▋ | 1131/4286 [8:41:46<23:49:44, 27.19s/it] {'loss': 0.0013, 'grad_norm': 1.5284195677277648, 'learning_rate': 7.361175921605226e-07, 'completion_length': 360.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7324405610561371, 'rewards/format_reward': 1.0, 'reward': 1.7324405312538147, 'reward_std': 0.04981276113539934, 'kl': 0.031494140625, 'epoch': 0.26}
[2025-03-02 23:39:34,047] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
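Across these records the logged loss tracks the kl column at a nearly constant ratio of ~0.04 (e.g. 0.3569 ≈ 0.04 × 8.96 at the grad-norm spike of step 1070), which suggests a KL-penalty coefficient of roughly 0.04 with a near-zero policy-gradient term at this stage of training. A quick check on a few records; this is an observation about the logged numbers, not a statement about the trainer's actual loss function:

```python
# loss/kl ratio across a few logged steps; consistently ~0.04, hinting at a
# KL coefficient of 0.04 (inference from the log, not a known config value).
records = {
    1062: (0.0027, 0.0667724609375),
    1070: (0.3569, 8.9617919921875),   # the grad_norm-20298 spike
    1093: (0.0488, 1.21875),
}
for step, (loss, kl) in records.items():
    print(step, round(loss / kl, 4))
```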
26%|██▋ | 1132/4286 [8:42:11<23:16:25, 26.56s/it] {'loss': 0.0031, 'grad_norm': 2.266019798352932, 'learning_rate': 7.358842743817078e-07, 'completion_length': 318.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562501192092896, 'reward_std': 0.04479555785655975, 'kl': 0.0771484375, 'epoch': 0.26}
26%|██▋ | 1133/4286 [8:42:37<22:59:44, 26.26s/it] {'loss': 0.0025, 'grad_norm': 1.3946962305598454, 'learning_rate': 7.356509566028931e-07, 'completion_length': 318.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 1.0, 'reward': 1.6934524774551392, 'reward_std': 0.01969881122931838, 'kl': 0.0635986328125, 'epoch': 0.26}
[2025-03-02 23:40:27,702] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
26%|██▋ | 1134/4286 [8:43:05<23:28:40, 26.81s/it] {'loss': 0.0012, 'grad_norm': 1.0308336855526072, 'learning_rate': 7.354176388240784e-07, 'completion_length': 389.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.7336310148239136, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.032063992926850915, 'kl': 0.0291748046875, 'epoch': 0.26}
26%|██▋ | 1135/4286 [8:43:29<22:48:55, 26.07s/it] {'loss': 0.0017, 'grad_norm': 0.18111794879280096, 'learning_rate': 7.351843210452636e-07, 'completion_length': 284.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.011904759332537651, 'kl': 0.04315185546875, 'epoch': 0.26}
27%|██▋ | 1136/4286 [8:43:56<23:06:18, 26.41s/it] {'loss': 0.0015, 'grad_norm': 5.644152145369708, 'learning_rate': 7.349510032664489e-07, 'completion_length': 364.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.5595238357782364, 'rewards/format_reward': 1.0, 'reward': 1.55952388048172, 'reward_std': 0.12532122433185577, 'kl': 0.0372314453125, 'epoch': 0.27}
27%|██▋ | 1137/4286 [8:44:21<22:42:14, 25.96s/it] {'loss': 0.0041, 'grad_norm': 1.9491409775609194, 'learning_rate': 7.347176854876341e-07, 'completion_length': 308.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.0595238134264946, 'kl': 0.1014404296875, 'epoch': 0.27}
27%|██▋ | 1138/4286 [8:44:45<22:08:05, 25.31s/it] {'loss': 0.0016, 'grad_norm': 1.7337970607949602, 'learning_rate': 7.344843677088194e-07, 'completion_length': 305.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.691964328289032, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.022675003856420517, 'kl': 0.0396728515625, 'epoch': 0.27}
[2025-03-02 23:42:34,987] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
27%|██▋ | 1139/4286 [8:45:12<22:34:58, 25.83s/it] {'loss': 0.0032, 'grad_norm': 0.579494418004562, 'learning_rate': 7.342510499300047e-07, 'completion_length': 325.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7173972129821777, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6995400786399841, 'reward_std': 0.0514095164835453, 'kl': 0.080078125, 'epoch': 0.27}
27%|██▋ | 1140/4286 [8:45:37<22:21:22, 25.58s/it] {'loss': 0.0013, 'grad_norm': 2.2084005499648693, 'learning_rate': 7.340177321511899e-07, 'completion_length': 280.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.0355006055906415, 'kl': 0.0321044921875, 'epoch': 0.27}
27%|██▋ | 1141/4286 [8:46:02<22:18:09, 25.53s/it] {'loss': 0.0115, 'grad_norm': 1.4724402261555734, 'learning_rate': 7.337844143723751e-07, 'completion_length': 287.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7172620296478271, 'reward_std': 0.07738095847889781, 'kl': 0.28851318359375, 'epoch': 0.27}
[2025-03-02 23:43:50,193] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1142/4286 [8:46:27<22:06:20, 25.31s/it] {'loss': 0.0019, 'grad_norm': 1.0516443102550685, 'learning_rate': 7.335510965935604e-07, 'completion_length': 314.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715818405151, 'reward_std': 0.04350833594799042, 'kl': 0.047607421875, 'epoch': 0.27}
27%|██▋ | 1143/4286 [8:46:54<22:35:46, 25.88s/it] {'loss': 0.0022, 'grad_norm': 1.4721572354894958, 'learning_rate': 7.333177788147457e-07, 'completion_length': 338.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.651671290397644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6338141560554504, 'reward_std': 0.09896359778940678, 'kl': 0.054443359375, 'epoch': 0.27}
[2025-03-02 23:44:45,527] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1144/4286 [8:47:23<23:10:32, 26.55s/it] {'loss': 0.0015, 'grad_norm': 0.3482312320787857, 'learning_rate': 7.330844610359309e-07, 'completion_length': 351.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.759523868560791, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7238096594810486, 'reward_std': 0.09829914756119251, 'kl': 0.036865234375, 'epoch': 0.27}
27%|██▋ | 1145/4286 [8:47:49<23:04:34, 26.45s/it] {'loss': 0.0026, 'grad_norm': 1.1072926431148598, 'learning_rate': 7.328511432571161e-07, 'completion_length': 330.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.049460720270872116, 'kl': 0.064697265625, 'epoch': 0.27}
[2025-03-02 23:45:39,290] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1146/4286 [8:48:16<23:21:36, 26.78s/it] {'loss': 0.0016, 'grad_norm': 6.116811357355882, 'learning_rate': 7.326178254783015e-07, 'completion_length': 344.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6297348737716675, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6118777394294739, 'reward_std': 0.12171928957104683, 'kl': 0.03924560546875, 'epoch': 0.27}
27%|██▋ | 1147/4286 [8:48:45<23:46:38, 27.27s/it] {'loss': 0.0021, 'grad_norm': 4.692849942397902, 'learning_rate': 7.323845076994867e-07, 'completion_length': 355.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.06388125568628311, 'kl': 0.05267333984375, 'epoch': 0.27}
[2025-03-02 23:46:35,420] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
27%|██▋ | 1148/4286 [8:49:12<23:53:19, 27.41s/it] {'loss': 0.0067, 'grad_norm': 2.609582211442685, 'learning_rate': 7.321511899206719e-07, 'completion_length': 324.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7667410969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7310269474983215, 'reward_std': 0.061831921339035034, 'kl': 0.1673583984375, 'epoch': 0.27}
27%|██▋ | 1149/4286 [8:49:40<23:52:57, 27.41s/it] {'loss': 0.0043, 'grad_norm': 3.1893640935117027, 'learning_rate': 7.319178721418572e-07, 'completion_length': 310.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.042834240943193436, 'kl': 0.106201171875, 'epoch': 0.27}
27%|██▋ | 1150/4286 [8:50:06<23:31:10, 27.00s/it] {'loss': 0.013, 'grad_norm': 1.1755485262627918, 'learning_rate': 7.316845543630425e-07, 'completion_length': 318.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857144474983215, 'reward_std': 0.0535714291036129, 'kl': 0.3262939453125, 'epoch': 0.27}
27%|██▋ | 1151/4286 [8:50:31<22:58:21, 26.38s/it] {'loss': 0.0042, 'grad_norm': 11.201458490785205, 'learning_rate': 7.314512365842277e-07, 'completion_length': 316.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.12347954511642456, 'kl': 0.105712890625, 'epoch': 0.27}
[2025-03-02 23:48:21,224] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1152/4286 [8:50:58<23:14:02, 26.69s/it] {'loss': 0.018, 'grad_norm': 6.4577194912268885, 'learning_rate': 7.312179188054129e-07, 'completion_length': 357.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7325893640518188, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6968750953674316, 'reward_std': 0.14196429587900639, 'kl': 0.45068359375, 'epoch': 0.27}
27%|██▋ | 1153/4286 [8:51:24<22:52:43, 26.29s/it] {'loss': 0.003, 'grad_norm': 1.7418846016598135, 'learning_rate': 7.309846010265982e-07, 'completion_length': 332.89288330078125, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0357142873108387, 'kl': 0.073974609375, 'epoch': 0.27}
27%|██▋ | 1154/4286 [8:51:48<22:26:33, 25.80s/it] {'loss': 0.0088, 'grad_norm': 1.388525345526124, 'learning_rate': 7.307512832477834e-07, 'completion_length': 292.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.020619653165340424, 'kl': 0.21875, 'epoch': 0.27}
27%|██▋ | 1155/4286 [8:52:15<22:44:26, 26.15s/it] {'loss': 0.0047, 'grad_norm': 2.106309556782885, 'learning_rate': 7.305179654689687e-07, 'completion_length': 333.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7217262387275696, 'reward_std': 0.02083333395421505, 'kl': 0.118408203125, 'epoch': 0.27}
27%|██▋ | 1156/4286 [8:52:40<22:18:13, 25.65s/it] {'loss': 0.0275, 'grad_norm': 3.3722300625779136, 'learning_rate': 7.30284647690154e-07, 'completion_length': 305.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.533482164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4977679252624512, 'reward_std': 0.08870278298854828, 'kl': 0.6884765625, 'epoch': 0.27}
27%|██▋ | 1157/4286 [8:53:04<22:00:22, 25.32s/it] {'loss': 0.0451, 'grad_norm': 10.452958188676975, 'learning_rate': 7.300513299113392e-07, 'completion_length': 303.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6116072535514832, 'reward_std': 0.1841229423880577, 'kl': 1.12890625, 'epoch': 0.27}
27%|██▋ | 1158/4286 [8:53:30<21:57:59, 25.28s/it] {'loss': 0.0142, 'grad_norm': 4.607568986561481, 'learning_rate': 7.298180121325244e-07, 'completion_length': 287.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.03411934711039066, 'kl': 0.35791015625, 'epoch': 0.27}
[2025-03-02 23:51:18,139] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1159/4286 [8:53:55<22:04:21, 25.41s/it] {'loss': 0.0136, 'grad_norm': 3.833302724007084, 'learning_rate': 7.295846943537098e-07, 'completion_length': 311.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7552084028720856, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7016369700431824, 'reward_std': 0.08041498064994812, 'kl': 0.34130859375, 'epoch': 0.27}
[2025-03-02 23:51:46,548] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
27%|██▋ | 1160/4286 [8:54:24<22:50:46, 26.31s/it] {'loss': 0.0171, 'grad_norm': 2.28148051806348, 'learning_rate': 7.29351376574895e-07, 'completion_length': 329.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6668793261051178, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6490222811698914, 'reward_std': 0.13924211263656616, 'kl': 0.42578125, 'epoch': 0.27}
27%|██▋ | 1161/4286 [8:54:51<23:06:25, 26.62s/it] {'loss': 0.0095, 'grad_norm': 2.133527010501141, 'learning_rate': 7.291180587960802e-07, 'completion_length': 300.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5848214477300644, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5491072535514832, 'reward_std': 0.1107763946056366, 'kl': 0.238037109375, 'epoch': 0.27}
[2025-03-02 23:52:38,457] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1162/4286 [8:55:16<22:33:56, 26.00s/it] {'loss': 0.0068, 'grad_norm': 3.27471604105479, 'learning_rate': 7.288847410172655e-07, 'completion_length': 283.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.029750222340226173, 'kl': 0.170166015625, 'epoch': 0.27}
27%|██▋ | 1163/4286 [8:55:43<22:51:56, 26.36s/it] {'loss': 0.0114, 'grad_norm': 10.273571263409016, 'learning_rate': 7.286514232384508e-07, 'completion_length': 311.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.07578602060675621, 'kl': 0.2841796875, 'epoch': 0.27}
[2025-03-02 23:53:31,338] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1164/4286 [8:56:08<22:41:11, 26.16s/it] {'loss': 0.0037, 'grad_norm': 8.753365803579477, 'learning_rate': 7.28418105459636e-07, 'completion_length': 315.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5982142686843872, 'rewards/format_reward': 1.0, 'reward': 1.5982143878936768, 'reward_std': 0.0892857238650322, 'kl': 0.093017578125, 'epoch': 0.27}
27%|██▋ | 1165/4286 [8:56:33<22:16:59, 25.70s/it] {'loss': 0.003, 'grad_norm': 1.5616854073002409, 'learning_rate': 7.281847876808212e-07, 'completion_length': 311.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.5699404776096344, 'rewards/format_reward': 1.0, 'reward': 1.5699405670166016, 'reward_std': 0.06845237873494625, 'kl': 0.074951171875, 'epoch': 0.27}
27%|██▋ | 1166/4286 [8:57:00<22:37:20, 26.10s/it] {'loss': 0.0077, 'grad_norm': 3.1645644574638783, 'learning_rate': 7.279514699020065e-07, 'completion_length': 342.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.09493744373321533, 'kl': 0.192138671875, 'epoch': 0.27}
[2025-03-02 23:54:50,002] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
27%|██▋ | 1167/4286 [8:57:27<22:50:47, 26.37s/it] {'loss': 0.0023, 'grad_norm': 2.027710333351665, 'learning_rate': 7.277181521231918e-07, 'completion_length': 332.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 1.0, 'reward': 1.7202382683753967, 'reward_std': 0.056333938613533974, 'kl': 0.0576171875, 'epoch': 0.27}
27%|██▋ | 1168/4286 [8:57:55<23:11:10, 26.77s/it] {'loss': 0.0035, 'grad_norm': 1.5488268553638493, 'learning_rate': 7.27484834344377e-07, 'completion_length': 331.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7403274178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7224703431129456, 'reward_std': 0.08152406848967075, 'kl': 0.086669921875, 'epoch': 0.27}
[2025-03-02 23:55:44,903] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1169/4286 [8:58:22<23:17:19, 26.90s/it] {'loss': 0.0094, 'grad_norm': 11.90342996436904, 'learning_rate': 7.272515165655623e-07, 'completion_length': 349.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7026786208152771, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6848215460777283, 'reward_std': 0.10811495035886765, 'kl': 0.234619140625, 'epoch': 0.27}
[2025-03-02 23:56:10,091] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1170/4286 [8:58:47<22:50:14, 26.38s/it] {'loss': 0.0056, 'grad_norm': 1.8460318784901277, 'learning_rate': 7.270181987867475e-07, 'completion_length': 330.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.6151785850524902, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.579464316368103, 'reward_std': 0.09669449180364609, 'kl': 0.1402587890625, 'epoch': 0.27}
27%|██▋ | 1171/4286 [8:59:13<22:34:20, 26.09s/it] {'loss': 0.0178, 'grad_norm': 2.425415870174815, 'learning_rate': 7.267848810079328e-07, 'completion_length': 313.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.614583432674408, 'reward_std': 0.06845238897949457, 'kl': 0.44482421875, 'epoch': 0.27}
27%|██▋ | 1172/4286 [8:59:36<21:48:23, 25.21s/it] {'loss': 0.0026, 'grad_norm': 2.052869480759497, 'learning_rate': 7.265515632291181e-07, 'completion_length': 274.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.04761904664337635, 'kl': 0.0640869140625, 'epoch': 0.27}
[2025-03-02 23:57:24,487] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1173/4286 [9:00:02<21:57:47, 25.40s/it] {'loss': 0.006, 'grad_norm': 5.93211295327878, 'learning_rate': 7.263182454503033e-07, 'completion_length': 312.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6577381491661072, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.15114467591047287, 'kl': 0.151123046875, 'epoch': 0.27}
27%|██▋ | 1174/4286 [9:00:28<22:13:03, 25.70s/it] {'loss': 0.0177, 'grad_norm': 2.815608271288451, 'learning_rate': 7.260849276714885e-07, 'completion_length': 278.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.7095238268375397, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6916667819023132, 'reward_std': 0.10343670472502708, 'kl': 0.444091796875, 'epoch': 0.27}
27%|██▋ | 1175/4286 [9:00:52<21:50:54, 25.28s/it] {'loss': 0.0347, 'grad_norm': 1.6989909875253988, 'learning_rate': 7.258516098926737e-07, 'completion_length': 261.3928756713867, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.638392984867096, 'reward_std': 0.056547620333731174, 'kl': 0.869140625, 'epoch': 0.27}
27%|██▋ | 1176/4286 [9:01:18<21:58:05, 25.43s/it] {'loss': 0.0271, 'grad_norm': 6.391685561272474, 'learning_rate': 7.256182921138591e-07, 'completion_length': 325.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6904762983322144, 'reward_std': 0.08173839934170246, 'kl': 0.67724609375, 'epoch': 0.27}
27%|██▋ | 1177/4286 [9:01:43<21:57:26, 25.43s/it] {'loss': 0.0274, 'grad_norm': 2.7378654550143606, 'learning_rate': 7.253849743350443e-07, 'completion_length': 338.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 0.06915953941643238, 'kl': 0.685546875, 'epoch': 0.27}
[2025-03-02 23:59:32,746] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
27%|██▋ | 1178/4286 [9:02:10<22:11:33, 25.71s/it] {'loss': 0.0182, 'grad_norm': 3.3162542105188093, 'learning_rate': 7.251516565562295e-07, 'completion_length': 322.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.10067614167928696, 'kl': 0.455078125, 'epoch': 0.27}
28%|██▊ | 1179/4286 [9:02:36<22:24:24, 25.96s/it] {'loss': 0.0036, 'grad_norm': 0.9155201892241216, 'learning_rate': 7.249183387774148e-07, 'completion_length': 311.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.860119104385376, 'rewards/format_reward': 1.0, 'reward': 1.8601192235946655, 'reward_std': 0.017857137601822615, 'kl': 0.0906982421875, 'epoch': 0.28}
28%|██▊ | 1180/4286 [9:03:02<22:24:04, 25.96s/it] {'loss': 0.009, 'grad_norm': 1.0178290097937925, 'learning_rate': 7.246850209986001e-07, 'completion_length': 330.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7395833134651184, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262983322144, 'reward_std': 0.08404048904776573, 'kl': 0.2257080078125, 'epoch': 0.28}
28%|██▊ | 1181/4286 [9:03:27<22:08:22, 25.67s/it] {'loss': 0.0048, 'grad_norm': 1.0859774226566457, 'learning_rate': 7.244517032197853e-07, 'completion_length': 310.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6726190894842148, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547619700431824, 'reward_std': 0.0357142873108387, 'kl': 0.119140625, 'epoch': 0.28}
28%|██▊ | 1182/4286 [9:03:52<21:48:04, 25.28s/it] {'loss': 0.0225, 'grad_norm': 5.1329990788321505, 'learning_rate': 7.242183854409706e-07, 'completion_length': 334.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.696428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715818405151, 'reward_std': 0.11745269224047661, 'kl': 0.560546875, 'epoch': 0.28}
28%|██▊ | 1183/4286 [9:04:16<21:28:23, 24.91s/it] {'loss': 0.01, 'grad_norm': 1.4208672875675512, 'learning_rate': 7.239850676621558e-07, 'completion_length': 322.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.7574405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.739583432674408, 'reward_std': 0.06685744412243366, 'kl': 0.25, 'epoch': 0.28}
28%|██▊ | 1184/4286 [9:04:41<21:37:34, 25.10s/it] {'loss': 0.0034, 'grad_norm': 0.8437462980643745, 'learning_rate': 7.237517498833411e-07, 'completion_length': 300.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.04123930633068085, 'kl': 0.084228515625, 'epoch': 0.28}
28%|██▊ | 1185/4286 [9:05:06<21:37:45, 25.11s/it] {'loss': 0.0049, 'grad_norm': 1.457764605964446, 'learning_rate': 7.235184321045264e-07, 'completion_length': 296.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.5816558599472046, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5637988448143005, 'reward_std': 0.10375901847146451, 'kl': 0.12158203125, 'epoch': 0.28}
28%|██▊ | 1186/4286 [9:05:30<21:20:28, 24.78s/it] {'loss': 0.0018, 'grad_norm': 3.2967375182396945, 'learning_rate': 7.232851143257116e-07, 'completion_length': 287.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.03273809980601072, 'kl': 0.04541015625, 'epoch': 0.28}
28%|██▊ | 1187/4286 [9:05:55<21:15:18, 24.69s/it] {'loss': 0.0026, 'grad_norm': 0.6811628158531244, 'learning_rate': 7.230517965468968e-07, 'completion_length': 290.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.025651199743151665, 'kl': 0.06488037109375, 'epoch': 0.28}
28%|██▊ | 1188/4286 [9:06:20<21:16:59, 24.73s/it] {'loss': 0.0059, 'grad_norm': 1.9834626620085138, 'learning_rate': 7.228184787680821e-07, 'completion_length': 339.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.7321430444717407, 'reward_std': 0.011904759332537651, 'kl': 0.1470947265625, 'epoch': 0.28}
28%|██▊ | 1189/4286 [9:06:45<21:31:39, 25.02s/it] {'loss': 0.0117, 'grad_norm': 4.748993088712447, 'learning_rate': 7.225851609892674e-07, 'completion_length': 345.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5752976685762405, 'rewards/format_reward': 1.0, 'reward': 1.5752976536750793, 'reward_std': 0.018600046634674072, 'kl': 0.29052734375, 'epoch': 0.28}
28%|██▊ | 1190/4286 [9:07:12<21:54:24, 25.47s/it] {'loss': 0.0187, 'grad_norm': 3.6920887675972325, 'learning_rate': 7.223518432104526e-07, 'completion_length': 345.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6612103879451752, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6254960894584656, 'reward_std': 0.06937122065573931, 'kl': 0.468505859375, 'epoch': 0.28}
28%|██▊ | 1191/4286 [9:07:36<21:27:57, 24.97s/it] {'loss': 0.0156, 'grad_norm': 1.9805006041193696, 'learning_rate': 7.221185254316378e-07, 'completion_length': 305.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6860120296478271, 'reward_std': 0.14642563089728355, 'kl': 0.391357421875, 'epoch': 0.28}
28%|██▊ | 1192/4286 [9:08:02<21:42:02, 25.25s/it] {'loss': 0.0332, 'grad_norm': 1.362157508930963, 'learning_rate': 7.218852076528232e-07, 'completion_length': 324.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7113096117973328, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.693452537059784, 'reward_std': 0.10235805064439774, 'kl': 0.83203125, 'epoch': 0.28}
28%|██▊ | 1193/4286 [9:08:27<21:42:51, 25.27s/it] {'loss': 0.0022, 'grad_norm': 1.818790076605534, 'learning_rate': 7.216518898740084e-07, 'completion_length': 309.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.589285746216774, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.10151169821619987, 'kl': 0.0555419921875, 'epoch': 0.28}
28%|██▊ | 1194/4286 [9:08:52<21:31:11, 25.06s/it] {'loss': 0.0083, 'grad_norm': 2.192956462958126, 'learning_rate': 7.214185720951936e-07, 'completion_length': 316.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7886905670166016, 'rewards/format_reward': 1.0, 'reward': 1.7886906266212463, 'reward_std': 0.018271165899932384, 'kl': 0.20654296875, 'epoch': 0.28}
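One consistency worth noting in the step dicts above: 'reward' is always the sum of the two reward components, 'rewards/only_full_func_accuracy_reward' and 'rewards/format_reward'. This is an observation about the logged numbers, not a setting read from the trainer config; a quick check against step 1187:

# Check (values copied from step 1187 above) that the total reward is the
# sum of the two component rewards, up to float32 rounding.
step_1187 = {'rewards/only_full_func_accuracy_reward': 0.7619048058986664,
             'rewards/format_reward': 1.0,
             'reward': 1.7619048953056335}
total = (step_1187['rewards/only_full_func_accuracy_reward']
         + step_1187['rewards/format_reward'])
assert abs(total - step_1187['reward']) < 1e-6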
28%|██▊ | 1195/4286 [9:09:15<21:10:59, 24.67s/it] {'loss': 0.0036, 'grad_norm': 2.1434727647024086, 'learning_rate': 7.211852543163789e-07, 'completion_length': 282.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7455357909202576, 'rewards/format_reward': 1.0, 'reward': 1.7455358505249023, 'reward_std': 0.031143159605562687, 'kl': 0.089599609375, 'epoch': 0.28}
28%|██▊ | 1196/4286 [9:09:42<21:36:50, 25.18s/it] {'loss': 0.0273, 'grad_norm': 1.7463757897120606, 'learning_rate': 7.209519365375642e-07, 'completion_length': 333.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.6839286386966705, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6660715341567993, 'reward_std': 0.16699498146772385, 'kl': 0.6796875, 'epoch': 0.28}
28%|██▊ | 1197/4286 [9:10:06<21:24:37, 24.95s/it] {'loss': 0.031, 'grad_norm': 2.163032690060113, 'learning_rate': 7.207186187587494e-07, 'completion_length': 314.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.09800060838460922, 'kl': 0.77734375, 'epoch': 0.28}
28%|██▊ | 1198/4286 [9:10:30<21:10:17, 24.68s/it] {'loss': 0.0152, 'grad_norm': 4.29282034000564, 'learning_rate': 7.204853009799346e-07, 'completion_length': 297.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.6991071701049805, 'rewards/format_reward': 1.0, 'reward': 1.6991072297096252, 'reward_std': 0.08711556904017925, 'kl': 0.3780517578125, 'epoch': 0.28}
28%|██▊ | 1199/4286 [9:10:56<21:22:03, 24.92s/it] {'loss': 0.0294, 'grad_norm': 4.502225012794045, 'learning_rate': 7.202519832011199e-07, 'completion_length': 328.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6205357313156128, 'rewards/format_reward': 1.0, 'reward': 1.6205357909202576, 'reward_std': 0.058389294892549515, 'kl': 0.734130859375, 'epoch': 0.28}
28%|██▊ | 1200/4286 [9:11:21<21:25:16, 24.99s/it] {'loss': 0.0103, 'grad_norm': 2.974317489768195, 'learning_rate': 7.200186654223051e-07, 'completion_length': 311.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7449405491352081, 'rewards/format_reward': 1.0, 'reward': 1.7449405789375305, 'reward_std': 0.04721366195008159, 'kl': 0.255615234375, 'epoch': 0.28}
28%|██▊ | 1201/4286 [9:15:49<83:58:00, 97.98s/it] {'loss': 0.0119, 'grad_norm': 4.129864394813738, 'learning_rate': 7.197853476434904e-07, 'completion_length': 315.2143096923828, 'rewards/only_full_func_accuracy_reward': 0.6365327835083008, 'rewards/format_reward': 1.0, 'reward': 1.6365329027175903, 'reward_std': 0.0840773805975914, 'kl': 0.2978515625, 'epoch': 0.28}
28%|██▊ | 1202/4286 [9:16:14<65:15:44, 76.18s/it] {'loss': 0.0086, 'grad_norm': 2.6404199963758197, 'learning_rate': 7.195520298646757e-07, 'completion_length': 290.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.07412620075047016, 'kl': 0.2164306640625, 'epoch': 0.28}
[2025-03-03 00:14:03,556] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1203/4286 [9:16:41<52:24:18, 61.19s/it] {'loss': 0.0043, 'grad_norm': 2.5778406664271056, 'learning_rate': 7.193187120858609e-07, 'completion_length': 301.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.8333333730697632, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.06388125941157341, 'kl': 0.108154296875, 'epoch': 0.28}
28%|██▊ | 1204/4286 [9:17:04<42:38:29, 49.81s/it] {'loss': 0.0012, 'grad_norm': 0.3459749537820859, 'learning_rate': 7.190853943070461e-07, 'completion_length': 293.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.7455357909202576, 'rewards/format_reward': 1.0, 'reward': 1.7455357909202576, 'reward_std': 0.020833336748182774, 'kl': 0.03106689453125, 'epoch': 0.28}
[2025-03-03 00:14:51,743] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1205/4286 [9:17:29<36:14:36, 42.35s/it] {'loss': 0.0036, 'grad_norm': 1.2340505289436678, 'learning_rate': 7.188520765282315e-07, 'completion_length': 305.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6130952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.023809523787349463, 'kl': 0.091064453125, 'epoch': 0.28}
[2025-03-03 00:15:16,285] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1206/4286 [9:17:53<31:39:41, 37.01s/it] {'loss': 0.0014, 'grad_norm': 1.8295330941464594, 'learning_rate': 7.186187587494167e-07, 'completion_length': 275.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7455357313156128, 'rewards/format_reward': 1.0, 'reward': 1.7455357909202576, 'reward_std': 0.0267857164144516, 'kl': 0.0360107421875, 'epoch': 0.28}
28%|██▊ | 1207/4286 [9:18:17<28:11:03, 32.95s/it] {'loss': 0.0023, 'grad_norm': 3.1164128620261766, 'learning_rate': 7.183854409706019e-07, 'completion_length': 293.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.8065476715564728, 'rewards/format_reward': 1.0, 'reward': 1.80654776096344, 'reward_std': 0.050381558015942574, 'kl': 0.05810546875, 'epoch': 0.28}
[2025-03-03 00:16:05,657] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1208/4286 [9:18:43<26:21:35, 30.83s/it] {'loss': 0.006, 'grad_norm': 1.386210179163915, 'learning_rate': 7.181521231917872e-07, 'completion_length': 333.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6651786267757416, 'rewards/format_reward': 1.0, 'reward': 1.6651787161827087, 'reward_std': 0.03114316239953041, 'kl': 0.14947509765625, 'epoch': 0.28}
28%|██▊ | 1209/4286 [9:19:08<25:02:39, 29.30s/it] {'loss': 0.0107, 'grad_norm': 3.475166229082998, 'learning_rate': 7.179188054129725e-07, 'completion_length': 298.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6931548118591309, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6752976775169373, 'reward_std': 0.08645058795809746, 'kl': 0.269287109375, 'epoch': 0.28}
28%|██▊ | 1210/4286 [9:19:33<23:50:32, 27.90s/it] {'loss': 0.0022, 'grad_norm': 3.72815717058101, 'learning_rate': 7.176854876341577e-07, 'completion_length': 297.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.6639881432056427, 'rewards/format_reward': 1.0, 'reward': 1.6639882922172546, 'reward_std': 0.06386731378734112, 'kl': 0.0537109375, 'epoch': 0.28}
28%|██▊ | 1211/4286 [9:20:00<23:32:24, 27.56s/it] {'loss': 0.0022, 'grad_norm': 2.1602871418343117, 'learning_rate': 7.174521698553429e-07, 'completion_length': 328.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6488096117973328, 'reward_std': 0.06823870539665222, 'kl': 0.054931640625, 'epoch': 0.28}
28%|██▊ | 1212/4286 [9:20:26<23:13:27, 27.20s/it] {'loss': 0.0014, 'grad_norm': 1.3584121473527877, 'learning_rate': 7.172188520765282e-07, 'completion_length': 349.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.6668020188808441, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.648944914340973, 'reward_std': 0.11239731684327126, 'kl': 0.0340576171875, 'epoch': 0.28}
28%|██▊ | 1213/4286 [9:20:52<22:53:41, 26.82s/it] {'loss': 0.0017, 'grad_norm': 1.476562726039098, 'learning_rate': 7.169855342977135e-07, 'completion_length': 324.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.754464328289032, 'reward_std': 0.04312239959836006, 'kl': 0.0419921875, 'epoch': 0.28}
28%|██▊ | 1214/4286 [9:21:17<22:27:54, 26.33s/it] {'loss': 0.01, 'grad_norm': 0.6192015262299426, 'learning_rate': 7.167522165188987e-07, 'completion_length': 322.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8422619700431824, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.0297619067132473, 'kl': 0.2509765625, 'epoch': 0.28}
28%|██▊ | 1215/4286 [9:21:42<21:59:41, 25.78s/it] {'loss': 0.0025, 'grad_norm': 0.3020684907294154, 'learning_rate': 7.16518898740084e-07, 'completion_length': 311.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8437500298023224, 'rewards/format_reward': 1.0, 'reward': 1.8437500596046448, 'reward_std': 0.008928571827709675, 'kl': 0.063232421875, 'epoch': 0.28}
28%|██▊ | 1216/4286 [9:22:06<21:32:11, 25.25s/it] {'loss': 0.0067, 'grad_norm': 1.8146899745269836, 'learning_rate': 7.162855809612692e-07, 'completion_length': 292.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8244048357009888, 'rewards/format_reward': 1.0, 'reward': 1.8244048357009888, 'reward_std': 0.045350007712841034, 'kl': 0.166259765625, 'epoch': 0.28}
28%|██▊ | 1217/4286 [9:22:32<21:44:58, 25.51s/it] {'loss': 0.0118, 'grad_norm': 3.807367373835497, 'learning_rate': 7.160522631824545e-07, 'completion_length': 344.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.04602411389350891, 'kl': 0.295166015625, 'epoch': 0.28}
28%|██▊ | 1218/4286 [9:22:58<21:45:33, 25.53s/it] {'loss': 0.0015, 'grad_norm': 1.2535183093996305, 'learning_rate': 7.158189454036398e-07, 'completion_length': 321.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7702381312847137, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7523809671401978, 'reward_std': 0.08746998757123947, 'kl': 0.0367431640625, 'epoch': 0.28}
28%|██▊ | 1219/4286 [9:23:25<22:18:42, 26.19s/it] {'loss': 0.0035, 'grad_norm': 2.059314909192807, 'learning_rate': 7.15585627624825e-07, 'completion_length': 295.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6636905670166016, 'reward_std': 0.11862026620656252, 'kl': 0.08807373046875, 'epoch': 0.28}
28%|██▊ | 1220/4286 [9:23:52<22:19:35, 26.22s/it] {'loss': 0.0019, 'grad_norm': 0.5578430078449416, 'learning_rate': 7.153523098460102e-07, 'completion_length': 322.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7023810744285583, 'reward_std': 0.07355843484401703, 'kl': 0.046875, 'epoch': 0.28}
28%|██▊ | 1221/4286 [9:24:17<22:00:28, 25.85s/it] {'loss': 0.0033, 'grad_norm': 4.266347255670636, 'learning_rate': 7.151189920671955e-07, 'completion_length': 324.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7633929252624512, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.10692917928099632, 'kl': 0.0833740234375, 'epoch': 0.28}
29%|██▊ | 1222/4286 [9:24:44<22:16:56, 26.18s/it] {'loss': 0.0012, 'grad_norm': 1.4899635147158825, 'learning_rate': 7.148856742883808e-07, 'completion_length': 309.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7702381610870361, 'rewards/format_reward': 1.0, 'reward': 1.7702381610870361, 'reward_std': 0.06266787904314697, 'kl': 0.029052734375, 'epoch': 0.29}
29%|██▊ | 1223/4286 [9:25:10<22:19:24, 26.24s/it] {'loss': 0.0017, 'grad_norm': 0.2515742590842144, 'learning_rate': 7.14652356509566e-07, 'completion_length': 347.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.02514437772333622, 'kl': 0.0430908203125, 'epoch': 0.29}
[2025-03-03 00:22:57,504] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▊ | 1224/4286 [9:25:35<21:55:26, 25.78s/it] {'loss': 0.0035, 'grad_norm': 0.7676321090723315, 'learning_rate': 7.144190387307512e-07, 'completion_length': 289.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.013746436685323715, 'kl': 0.08837890625, 'epoch': 0.29}
[2025-03-03 00:23:25,828] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
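A similar pattern links 'loss' and 'kl': across these steps the logged loss is almost exactly 0.04 × the logged KL, consistent with a KL penalty of the form loss ≈ β·KL with β = 0.04 (the common GRPO default in TRL). This is an inference from the numbers below, not a value read from the run's config:

# Estimate the KL coefficient from logged (loss, kl) pairs; values copied
# from steps 1175, 1196 and 1221 above.
pairs = [(0.0347, 0.869140625),      # step 1175
         (0.0273, 0.6796875),        # step 1196
         (0.0033, 0.0833740234375)]  # step 1221
for loss, kl in pairs:
    print(f'beta ~ {loss / kl:.4f}')  # each prints approximately 0.040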
29%|██▊ | 1225/4286 [9:26:03<22:34:00, 26.54s/it] {'loss': 0.0014, 'grad_norm': 1.175294076749533, 'learning_rate': 7.141857209519366e-07, 'completion_length': 335.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.008928571827709675, 'kl': 0.03387451171875, 'epoch': 0.29}
29%|██▊ | 1226/4286 [9:26:29<22:29:56, 26.47s/it] {'loss': 0.001, 'grad_norm': 0.7491032645175477, 'learning_rate': 7.139524031731218e-07, 'completion_length': 312.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.708333432674408, 'reward_std': 0.08726342022418976, 'kl': 0.024658203125, 'epoch': 0.29}
29%|██▊ | 1227/4286 [9:26:57<22:42:42, 26.73s/it] {'loss': 0.0013, 'grad_norm': 0.7147353590823483, 'learning_rate': 7.13719085394307e-07, 'completion_length': 311.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7961309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.026785715483129025, 'kl': 0.03179931640625, 'epoch': 0.29}
29%|██▊ | 1228/4286 [9:27:21<22:14:25, 26.18s/it] {'loss': 0.0019, 'grad_norm': 0.1432075125850464, 'learning_rate': 7.134857676154923e-07, 'completion_length': 291.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6562500596046448, 'rewards/format_reward': 1.0, 'reward': 1.6562501788139343, 'reward_std': 0.008928571827709675, 'kl': 0.04638671875, 'epoch': 0.29}
29%|██▊ | 1229/4286 [9:27:48<22:26:43, 26.43s/it] {'loss': 0.0011, 'grad_norm': 0.3537088405910966, 'learning_rate': 7.132524498366775e-07, 'completion_length': 324.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7720238566398621, 'rewards/format_reward': 1.0, 'reward': 1.7720239758491516, 'reward_std': 0.02738095633685589, 'kl': 0.02752685546875, 'epoch': 0.29}
[2025-03-03 00:25:38,608] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▊ | 1230/4286 [9:28:16<22:38:18, 26.67s/it] {'loss': 0.0024, 'grad_norm': 0.981423210802198, 'learning_rate': 7.130191320578628e-07, 'completion_length': 330.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.46547622978687286, 'rewards/format_reward': 1.0, 'reward': 1.4654762744903564, 'reward_std': 0.07262943871319294, 'kl': 0.058837890625, 'epoch': 0.29}
29%|██▊ | 1231/4286 [9:28:42<22:29:02, 26.49s/it] {'loss': 0.0012, 'grad_norm': 3.1445045578631077, 'learning_rate': 7.127858142790481e-07, 'completion_length': 295.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6773810386657715, 'rewards/format_reward': 1.0, 'reward': 1.6773810982704163, 'reward_std': 0.1085956059396267, 'kl': 0.03045654296875, 'epoch': 0.29}
[2025-03-03 00:26:31,784] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▊ | 1232/4286 [9:29:09<22:37:37, 26.67s/it] {'loss': 0.0011, 'grad_norm': 0.13411052699811665, 'learning_rate': 7.125524965002333e-07, 'completion_length': 321.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7127976417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6949405670166016, 'reward_std': 0.03959564119577408, 'kl': 0.027587890625, 'epoch': 0.29}
[2025-03-03 00:26:58,604] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1233/4286 [9:29:36<22:39:25, 26.72s/it] {'loss': 0.0012, 'grad_norm': 0.5370827174486852, 'learning_rate': 7.123191787214185e-07, 'completion_length': 264.7678756713867, 'rewards/only_full_func_accuracy_reward': 0.74702388048172, 'rewards/format_reward': 1.0, 'reward': 1.74702388048172, 'reward_std': 0.029761905781924725, 'kl': 0.02923583984375, 'epoch': 0.29}
[2025-03-03 00:27:24,661] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1234/4286 [9:30:02<22:28:54, 26.52s/it] {'loss': 0.0025, 'grad_norm': 1.120447294839636, 'learning_rate': 7.120858609426038e-07, 'completion_length': 301.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6830357313156128, 'rewards/format_reward': 1.0, 'reward': 1.6830357909202576, 'reward_std': 0.06434167735278606, 'kl': 0.063232421875, 'epoch': 0.29}
29%|██▉ | 1235/4286 [9:30:28<22:30:13, 26.55s/it] {'loss': 0.0015, 'grad_norm': 3.411353225948458, 'learning_rate': 7.118525431637891e-07, 'completion_length': 322.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8809524178504944, 'rewards/format_reward': 1.0, 'reward': 1.8809524774551392, 'reward_std': 0.07008037157356739, 'kl': 0.0384521484375, 'epoch': 0.29}
[2025-03-03 00:28:19,814] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1236/4286 [9:30:57<22:59:46, 27.14s/it] {'loss': 0.0069, 'grad_norm': 0.9413758608719189, 'learning_rate': 7.116192253849743e-07, 'completion_length': 328.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7541667222976685, 'rewards/format_reward': 1.0, 'reward': 1.7541667819023132, 'reward_std': 0.08506932854652405, 'kl': 0.17236328125, 'epoch': 0.29}
29%|██▉ | 1237/4286 [9:31:22<22:29:32, 26.56s/it] {'loss': 0.0015, 'grad_norm': 0.40034828719113813, 'learning_rate': 7.113859076061595e-07, 'completion_length': 287.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7842262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.019238398410379887, 'kl': 0.038330078125, 'epoch': 0.29}
[2025-03-03 00:29:12,756] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1238/4286 [9:31:50<22:47:19, 26.92s/it] {'loss': 0.0031, 'grad_norm': 1.0841121826650435, 'learning_rate': 7.111525898273449e-07, 'completion_length': 372.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6857993602752686, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6500851511955261, 'reward_std': 0.11490166652947664, 'kl': 0.0765380859375, 'epoch': 0.29}
29%|██▉ | 1239/4286 [9:32:16<22:32:43, 26.64s/it] {'loss': 0.001, 'grad_norm': 0.6106888403264462, 'learning_rate': 7.109192720485301e-07, 'completion_length': 306.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.657738134264946, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.059373158030211926, 'kl': 0.025634765625, 'epoch': 0.29}
[2025-03-03 00:30:06,845] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1240/4286 [9:32:44<22:54:35, 27.08s/it] {'loss': 0.0021, 'grad_norm': 0.8479440905864118, 'learning_rate': 7.106859542697153e-07, 'completion_length': 334.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.01785714365541935, 'kl': 0.052001953125, 'epoch': 0.29}
29%|██▉ | 1241/4286 [9:33:12<23:11:51, 27.43s/it] {'loss': 0.0012, 'grad_norm': 2.930115982986524, 'learning_rate': 7.104526364909006e-07, 'completion_length': 323.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.8049745261669159, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7514032125473022, 'reward_std': 0.13406570628285408, 'kl': 0.03009033203125, 'epoch': 0.29}
[2025-03-03 00:31:02,053] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1242/4286 [9:33:39<23:04:24, 27.29s/it] {'loss': 0.0012, 'grad_norm': 0.5122803481096664, 'learning_rate': 7.102193187120859e-07, 'completion_length': 317.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.723809540271759, 'rewards/format_reward': 1.0, 'reward': 1.7238095998764038, 'reward_std': 0.06893923878669739, 'kl': 0.02960205078125, 'epoch': 0.29}
29%|██▉ | 1243/4286 [9:34:05<22:49:05, 26.99s/it] {'loss': 0.0012, 'grad_norm': 1.1266152424422242, 'learning_rate': 7.099860009332711e-07, 'completion_length': 298.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7812500596046448, 'reward_std': 0.08907202631235123, 'kl': 0.029296875, 'epoch': 0.29}
[2025-03-03 00:31:54,471] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1244/4286 [9:34:32<22:35:08, 26.73s/it] {'loss': 0.0019, 'grad_norm': 1.208884342352427, 'learning_rate': 7.097526831544563e-07, 'completion_length': 317.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6562500447034836, 'rewards/format_reward': 1.0, 'reward': 1.6562500596046448, 'reward_std': 0.023595841601490974, 'kl': 0.04803466796875, 'epoch': 0.29}
[2025-03-03 00:32:21,224] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1245/4286 [9:34:58<22:35:04, 26.74s/it] {'loss': 0.001, 'grad_norm': 0.2512289505177899, 'learning_rate': 7.095193653756416e-07, 'completion_length': 314.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.006873216480016708, 'kl': 0.02618408203125, 'epoch': 0.29}
29%|██▉ | 1246/4286 [9:35:25<22:32:32, 26.70s/it] {'loss': 0.0017, 'grad_norm': 0.9304333430681352, 'learning_rate': 7.092860475968269e-07, 'completion_length': 327.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6937500834465027, 'rewards/format_reward': 1.0, 'reward': 1.6937500834465027, 'reward_std': 0.02235108893364668, 'kl': 0.041259765625, 'epoch': 0.29}
29%|██▉ | 1247/4286 [9:35:50<22:08:07, 26.22s/it] {'loss': 0.0019, 'grad_norm': 1.1098195352383895, 'learning_rate': 7.090527298180121e-07, 'completion_length': 275.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 1.0, 'reward': 1.6860119700431824, 'reward_std': 0.04304792732000351, 'kl': 0.047119140625, 'epoch': 0.29}
29%|██▉ | 1248/4286 [9:36:16<22:04:22, 26.16s/it] {'loss': 0.0016, 'grad_norm': 1.8877160038103902, 'learning_rate': 7.088194120391974e-07, 'completion_length': 332.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.791666716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.755952537059784, 'reward_std': 0.1242227628827095, 'kl': 0.0396728515625, 'epoch': 0.29}
29%|██▉ | 1249/4286 [9:36:41<21:41:41, 25.72s/it] {'loss': 0.0016, 'grad_norm': 3.077119415453418, 'learning_rate': 7.085860942603826e-07, 'completion_length': 283.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.7071428894996643, 'rewards/format_reward': 1.0, 'reward': 1.7071428894996643, 'reward_std': 0.030344204045832157, 'kl': 0.0400390625, 'epoch': 0.29}
29%|██▉ | 1250/4286 [9:37:07<21:43:30, 25.76s/it] {'loss': 0.0011, 'grad_norm': 1.4178355123428543, 'learning_rate': 7.083527764815678e-07, 'completion_length': 318.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6208333671092987, 'rewards/format_reward': 1.0, 'reward': 1.620833396911621, 'reward_std': 0.04807508364319801, 'kl': 0.0272216796875, 'epoch': 0.29}
[2025-03-03 00:34:56,510] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1251/4286 [9:37:34<22:02:02, 26.14s/it] {'loss': 0.0039, 'grad_norm': 23.314422171969223, 'learning_rate': 7.081194587027532e-07, 'completion_length': 296.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6595413684844971, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6416842937469482, 'reward_std': 0.11573553644120693, 'kl': 0.0966796875, 'epoch': 0.29}
29%|██▉ | 1252/4286 [9:37:59<21:53:05, 25.97s/it] {'loss': 0.0022, 'grad_norm': 1.520320665810869, 'learning_rate': 7.078861409239384e-07, 'completion_length': 326.64288330078125, 'rewards/only_full_func_accuracy_reward': 0.7130952775478363, 'rewards/format_reward': 1.0, 'reward': 1.7130953073501587, 'reward_std': 0.021994300186634064, 'kl': 0.05450439453125, 'epoch': 0.29}
29%|██▉ | 1253/4286 [9:38:24<21:36:30, 25.65s/it] {'loss': 0.0023, 'grad_norm': 0.46232669198274334, 'learning_rate': 7.076528231451236e-07, 'completion_length': 311.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.6770834028720856, 'rewards/format_reward': 1.0, 'reward': 1.6770834922790527, 'reward_std': 0.04136601183563471, 'kl': 0.05609130859375, 'epoch': 0.29}
29%|██▉ | 1254/4286 [9:38:51<21:56:19, 26.05s/it] {'loss': 0.0013, 'grad_norm': 1.9470320358028976, 'learning_rate': 7.07419505366309e-07, 'completion_length': 335.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7738096117973328, 'reward_std': 0.05314406845718622, 'kl': 0.031982421875, 'epoch': 0.29}
29%|██▉ | 1255/4286 [9:39:17<21:51:09, 25.95s/it] {'loss': 0.0021, 'grad_norm': 21.266587139797487, 'learning_rate': 7.071861875874942e-07, 'completion_length': 301.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.054391831159591675, 'kl': 0.05352783203125, 'epoch': 0.29}
29%|██▉ | 1256/4286 [9:39:41<21:26:38, 25.48s/it] {'loss': 0.0015, 'grad_norm': 4.611055679888883, 'learning_rate': 7.069528698086794e-07, 'completion_length': 288.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.574404776096344, 'rewards/format_reward': 1.0, 'reward': 1.5744048357009888, 'reward_std': 0.06983364373445511, 'kl': 0.037109375, 'epoch': 0.29}
29%|██▉ | 1257/4286 [9:40:06<21:16:16, 25.28s/it] {'loss': 0.001, 'grad_norm': 0.9463316629919863, 'learning_rate': 7.067195520298646e-07, 'completion_length': 301.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001788139343, 'reward_std': 0.028166969306766987, 'kl': 0.02593994140625, 'epoch': 0.29}
29%|██▉ | 1258/4286 [9:40:32<21:20:12, 25.37s/it] {'loss': 0.0025, 'grad_norm': 1.1743733839623722, 'learning_rate': 7.064862342510499e-07, 'completion_length': 342.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098215222358704, 'reward_std': 0.08655625954270363, 'kl': 0.06353759765625, 'epoch': 0.29}
29%|██▉ | 1259/4286 [9:40:57<21:16:46, 25.31s/it] {'loss': 0.0012, 'grad_norm': 4.92257738724081, 'learning_rate': 7.062529164722352e-07, 'completion_length': 319.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7648810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.01785714365541935, 'kl': 0.02880859375, 'epoch': 0.29}
29%|██▉ | 1260/4286 [9:41:22<21:09:15, 25.17s/it] {'loss': 0.0017, 'grad_norm': 1.8469323798908204, 'learning_rate': 7.060195986934204e-07, 'completion_length': 287.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7247024178504944, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.0386904738843441, 'kl': 0.041259765625, 'epoch': 0.29}
[2025-03-03 00:39:10,630] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
29%|██▉ | 1261/4286 [9:41:48<21:23:51, 25.47s/it] {'loss': 0.0013, 'grad_norm': 0.8382580590226447, 'learning_rate': 7.057862809146057e-07, 'completion_length': 292.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8851191103458405, 'rewards/format_reward': 1.0, 'reward': 1.8851191997528076, 'reward_std': 0.025000002700835466, 'kl': 0.03363037109375, 'epoch': 0.29}
29%|██▉ | 1262/4286 [9:42:16<22:06:41, 26.32s/it] {'loss': 0.0015, 'grad_norm': 2.0542273601086403, 'learning_rate': 7.055529631357909e-07, 'completion_length': 346.14288330078125, 'rewards/only_full_func_accuracy_reward': 0.6979166865348816, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.041366010904312134, 'kl': 0.0362548828125, 'epoch': 0.29}
29%|██▉ | 1263/4286 [9:42:40<21:34:30, 25.69s/it] {'loss': 0.0042, 'grad_norm': 1.4579376287522556, 'learning_rate': 7.053196453569762e-07, 'completion_length': 284.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7172620296478271, 'reward_std': 0.09732650220394135, 'kl': 0.1053466796875, 'epoch': 0.29}
29%|██▉ | 1264/4286 [9:43:06<21:38:38, 25.78s/it] {'loss': 0.0077, 'grad_norm': 1.2737899284428171, 'learning_rate': 7.050863275781615e-07, 'completion_length': 276.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7398809790611267, 'rewards/format_reward': 1.0, 'reward': 1.7398810386657715, 'reward_std': 0.10988231375813484, 'kl': 0.191650390625, 'epoch': 0.29}
30%|██▉ | 1265/4286 [9:43:33<21:51:52, 26.06s/it] {'loss': 0.0202, 'grad_norm': 0.9768611020671045, 'learning_rate': 7.048530097993467e-07, 'completion_length': 284.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6294643878936768, 'reward_std': 0.06518351286649704, 'kl': 0.50537109375, 'epoch': 0.3}
30%|██▉ | 1266/4286 [9:43:57<21:27:10, 25.57s/it] {'loss': 0.0031, 'grad_norm': 1.3646543060371188, 'learning_rate': 7.046196920205319e-07, 'completion_length': 280.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8174603879451752, 'rewards/format_reward': 1.0, 'reward': 1.8174604177474976, 'reward_std': 0.01944039575755596, 'kl': 0.07720947265625, 'epoch': 0.3}
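Since each training step above ends in a flat Python-literal dict, the metrics can be pulled out of the raw log for plotting. A minimal sketch, assuming the log is saved to a file ('train.log' is a hypothetical name):

import ast
import re

metrics = []
with open('train.log') as fh:
    for line in fh:
        # Each step line embeds one flat dict starting with 'loss'; the dict
        # contains no nested braces, so a non-greedy match is sufficient.
        m = re.search(r"\{'loss':.*?\}", line)
        if m:
            metrics.append(ast.literal_eval(m.group(0)))

rewards = [row['reward'] for row in metrics]  # e.g. a series for plotting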
30%|██▉ | 1267/4286 [9:44:24<21:35:05, 25.74s/it] {'loss': 0.0111, 'grad_norm': 1.7389398466524135, 'learning_rate': 7.043863742417172e-07, 'completion_length': 332.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6354167461395264, 'reward_std': 0.07674386166036129, 'kl': 0.2777099609375, 'epoch': 0.3}
30%|██▉ | 1268/4286 [9:44:49<21:32:48, 25.70s/it] {'loss': 0.0013, 'grad_norm': 0.32920003616796933, 'learning_rate': 7.041530564629025e-07, 'completion_length': 316.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.5714285671710968, 'rewards/format_reward': 1.0, 'reward': 1.5714287161827087, 'reward_std': 0.056333936750888824, 'kl': 0.0325927734375, 'epoch': 0.3}
30%|██▉ | 1269/4286 [9:45:16<21:46:21, 25.98s/it] {'loss': 0.016, 'grad_norm': 1.5477468561088001, 'learning_rate': 7.039197386840877e-07, 'completion_length': 330.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.026025486178696156, 'kl': 0.4007568359375, 'epoch': 0.3}
30%|██▉ | 1270/4286 [9:45:41<21:40:03, 25.86s/it] {'loss': 0.0014, 'grad_norm': 2.3018281569263594, 'learning_rate': 7.036864209052729e-07, 'completion_length': 322.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.7440018951892853, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7082876563072205, 'reward_std': 0.12320792302489281, 'kl': 0.0343017578125, 'epoch': 0.3}
30%|██▉ | 1271/4286 [9:46:05<21:07:13, 25.22s/it] {'loss': 0.0013, 'grad_norm': 1.1659213215418263, 'learning_rate': 7.034531031264583e-07, 'completion_length': 296.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.05262723006308079, 'kl': 0.031982421875, 'epoch': 0.3}
30%|██▉ | 1272/4286 [9:46:30<21:04:49, 25.18s/it] {'loss': 0.0015, 'grad_norm': 3.5666705667504353, 'learning_rate': 7.032197853476435e-07, 'completion_length': 294.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7425595223903656, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.04362159222364426, 'kl': 0.0382080078125, 'epoch': 0.3}
30%|██▉ | 1273/4286 [9:46:54<20:38:56, 24.67s/it] {'loss': 0.0015, 'grad_norm': 1.0320822533111922, 'learning_rate': 7.029864675688287e-07, 'completion_length': 287.125, 'rewards/only_full_func_accuracy_reward': 0.6711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.056547620333731174, 'kl': 0.03729248046875, 'epoch': 0.3}
30%|██▉ | 1274/4286 [9:47:19<20:55:24, 25.01s/it] {'loss': 0.0011, 'grad_norm': 0.5573222909571258, 'learning_rate': 7.02753149790014e-07, 'completion_length': 277.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.854166716337204, 'rewards/format_reward': 1.0, 'reward': 1.8541667461395264, 'reward_std': 0.04609858803451061, 'kl': 0.02716064453125, 'epoch': 0.3}
30%|██▉ | 1275/4286 [9:47:44<20:54:10, 24.99s/it] {'loss': 0.0014, 'grad_norm': 0.3788668767196535, 'learning_rate': 7.025198320111993e-07, 'completion_length': 274.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7306548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.008928571827709675, 'kl': 0.03546142578125, 'epoch': 0.3}
30%|██▉ | 1276/4286 [9:48:08<20:26:54, 24.46s/it] {'loss': 0.0013, 'grad_norm': 0.5569044066000755, 'learning_rate': 7.022865142323845e-07, 'completion_length': 265.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.04761904664337635, 'kl': 0.03204345703125, 'epoch': 0.3}
[2025-03-03 00:45:56,672] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
30%|██▉ | 1277/4286 [9:48:34<20:52:04, 24.97s/it] {'loss': 0.0013, 'grad_norm': 0.9101298216928199, 'learning_rate': 7.020531964535698e-07, 'completion_length': 296.125, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7113096117973328, 'reward_std': 0.14427145570516586, 'kl': 0.032470703125, 'epoch': 0.3}
[2025-03-03 00:46:22,550] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
30%|██▉ | 1278/4286 [9:49:00<21:05:21, 25.24s/it] {'loss': 0.0019, 'grad_norm': 1.0786801196282538, 'learning_rate': 7.01819878674755e-07, 'completion_length': 295.5893020629883, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 1.0, 'reward': 1.7633929252624512, 'reward_std': 0.019238398410379887, 'kl': 0.0484619140625, 'epoch': 0.3}
30%|██▉ | 1279/4286 [9:49:25<21:08:18, 25.31s/it] {'loss': 0.0181, 'grad_norm': 3.2642619074883683, 'learning_rate': 7.015865608959402e-07, 'completion_length': 269.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.616071492433548, 'rewards/format_reward': 1.0, 'reward': 1.6160715818405151, 'reward_std': 0.14476493000984192, 'kl': 0.455078125, 'epoch': 0.3}
[2025-03-03 00:47:12,085] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
30%|██▉ | 1280/4286 [9:49:49<20:49:19, 24.94s/it] {'loss': 0.0014, 'grad_norm': 6.520172711430901, 'learning_rate': 7.013532431171255e-07, 'completion_length': 248.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.610119104385376, 'rewards/format_reward': 1.0, 'reward': 1.6101191639900208, 'reward_std': 0.024056265130639076, 'kl': 0.0343017578125, 'epoch': 0.3}
30%|██▉ | 1281/4286 [9:50:13<20:37:09, 24.70s/it] {'loss': 0.0181, 'grad_norm': 0.5320021998956359, 'learning_rate': 7.011199253383108e-07, 'completion_length': 305.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7663691639900208, 'reward_std': 0.08311965316534042, 'kl': 0.453125, 'epoch': 0.3}
[2025-03-03 00:48:02,427] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
30%|██▉ | 1282/4286 [9:50:40<20:59:02, 25.15s/it] {'loss': 0.0064, 'grad_norm': 1.7246295775501737, 'learning_rate': 7.00886607559496e-07, 'completion_length': 302.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.5853316336870193, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5674745440483093, 'reward_std': 0.1419898420572281, 'kl': 0.157958984375, 'epoch': 0.3}
[2025-03-03 00:48:28,459] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
30%|██▉ | 1283/4286 [9:51:06<21:11:55, 25.41s/it] {'loss': 0.0036, 'grad_norm': 1.3957288439695363, 'learning_rate': 7.006532897806812e-07, 'completion_length': 305.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6889881193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6711310744285583, 'reward_std': 0.07557233795523643, 'kl': 0.09027099609375, 'epoch': 0.3}
[2025-03-03 00:48:53,905] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
30%|██▉ | 1284/4286 [9:51:31<21:11:58, 25.42s/it] {'loss': 0.0023, 'grad_norm': 2.59819328226215, 'learning_rate': 7.004199720018666e-07, 'completion_length': 268.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.03436608985066414, 'kl': 0.056884765625, 'epoch': 0.3}
30%|██▉ | 1285/4286 [9:51:59<21:46:59, 26.13s/it] {'loss': 0.0032, 'grad_norm': 14.000052965042286, 'learning_rate': 7.001866542230518e-07, 'completion_length': 304.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.583333432674408, 'reward_std': 0.10209564119577408, 'kl': 0.0789794921875, 'epoch': 0.3}
[2025-03-03 00:49:48,220] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
30%|███ | 1286/4286 [9:52:25<21:52:33, 26.25s/it] {'loss': 0.0023, 'grad_norm': 2.5066682130519973, 'learning_rate': 6.99953336444237e-07, 'completion_length': 285.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.0762145146727562, 'kl': 0.0584716796875, 'epoch': 0.3}
30%|███ | 1287/4286 [9:52:49<21:15:00, 25.51s/it] {'loss': 0.0016, 'grad_norm': 1.8939605444428553, 'learning_rate': 6.997200186654223e-07, 'completion_length': 282.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.7842261791229248, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.10054942965507507, 'kl': 0.04083251953125, 'epoch': 0.3}
30%|███ | 1288/4286 [9:53:13<20:45:52, 24.93s/it] {'loss': 0.0013, 'grad_norm': 1.9520757461717297, 'learning_rate': 6.994867008866076e-07, 'completion_length': 294.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.07158833369612694, 'kl': 0.0325927734375, 'epoch': 0.3}
30%|███ | 1289/4286 [9:53:39<20:59:33, 25.22s/it] {'loss': 0.0067, 'grad_norm': 2.234547994962412, 'learning_rate': 6.992533831077928e-07, 'completion_length': 311.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7976191639900208, 'reward_std': 0.07142857648432255, 'kl': 0.167724609375, 'epoch': 0.3}
[2025-03-03 00:51:25,995] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 30%|███ | 1290/4286 [9:54:03<20:48:51, 25.01s/it] {'loss': 0.0155, 'grad_norm': 1.9647549587586692, 'learning_rate': 6.99020065328978e-07, 'completion_length': 257.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.60007444024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5822174549102783, 'reward_std': 0.07824434153735638, 'kl': 0.385498046875, 'epoch': 0.3} 30%|███ | 1290/4286 [9:54:03<20:48:51, 25.01s/it][2025-03-03 00:51:52,081] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 30%|███ | 1291/4286 [9:54:29<21:04:33, 25.33s/it] {'loss': 0.0033, 'grad_norm': 1.0598976629323835, 'learning_rate': 6.987867475501633e-07, 'completion_length': 300.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.09437813330441713, 'kl': 0.082763671875, 'epoch': 0.3} 30%|███ | 1291/4286 [9:54:29<21:04:33, 25.33s/it] 30%|███ | 1292/4286 [9:54:54<20:57:52, 25.21s/it] {'loss': 0.0023, 'grad_norm': 1.066728918701241, 'learning_rate': 6.985534297713486e-07, 'completion_length': 325.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.7229167222976685, 'rewards/format_reward': 1.0, 'reward': 1.7229167222976685, 'reward_std': 0.03744495287537575, 'kl': 0.058837890625, 'epoch': 0.3} 30%|███ | 1292/4286 [9:54:54<20:57:52, 25.21s/it][2025-03-03 00:52:43,683] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 30%|███ | 1293/4286 [9:55:21<21:19:34, 25.65s/it] {'loss': 0.0025, 'grad_norm': 2.7907542025020677, 'learning_rate': 6.983201119925338e-07, 'completion_length': 323.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.026485062204301357, 'kl': 0.0615234375, 'epoch': 0.3} 30%|███ | 1293/4286 [9:55:21<21:19:34, 25.65s/it] 30%|███ | 1294/4286 [9:55:45<21:01:35, 25.30s/it] {'loss': 0.003, 'grad_norm': 1.8339982765532452, 'learning_rate': 6.980867942137191e-07, 'completion_length': 293.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.65327388048172, 'reward_std': 0.03709554299712181, 'kl': 0.0758056640625, 'epoch': 0.3} 30%|███ | 1294/4286 [9:55:45<21:01:35, 25.30s/it][2025-03-03 00:53:33,186] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 30%|███ | 1295/4286 [9:56:10<20:57:04, 25.22s/it] {'loss': 0.0052, 'grad_norm': 1.3625142161871509, 'learning_rate': 6.978534764349043e-07, 'completion_length': 286.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7053572535514832, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.04031846486032009, 'kl': 0.130859375, 'epoch': 0.3} 30%|███ | 1295/4286 [9:56:10<20:57:04, 25.22s/it] 30%|███ | 1296/4286 [9:56:35<20:48:30, 25.05s/it] {'loss': 0.0178, 'grad_norm': 11.455729529159429, 'learning_rate': 6.976201586560896e-07, 'completion_length': 273.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.886904776096344, 'rewards/format_reward': 1.0, 'reward': 1.8869048953056335, 'reward_std': 0.0357142873108387, 'kl': 0.4453125, 'epoch': 0.3} 30%|███ | 1296/4286 [9:56:35<20:48:30, 25.05s/it] 30%|███ | 1297/4286 [9:56:59<20:26:36, 24.62s/it] {'loss': 0.012, 'grad_norm': 2.522519687550932, 'learning_rate': 6.973868408772749e-07, 'completion_length': 293.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.06100394017994404, 'kl': 0.30029296875, 'epoch': 0.3} 30%|███ | 1297/4286 [9:56:59<20:26:36, 24.62s/it][2025-03-03 00:54:46,555] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 30%|███ | 1298/4286 [9:57:24<20:33:02, 24.76s/it] {'loss': 0.0019, 'grad_norm': 2.0109052743768694, 'learning_rate': 6.971535230984601e-07, 'completion_length': 299.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.6413690745830536, 'rewards/format_reward': 1.0, 'reward': 1.6413692235946655, 'reward_std': 0.032738097012043, 'kl': 0.0469970703125, 'epoch': 0.3} 30%|███ | 1298/4286 [9:57:24<20:33:02, 24.76s/it] 30%|███ | 1299/4286 [9:57:48<20:27:56, 24.67s/it] {'loss': 0.003, 'grad_norm': 1.5937391681911885, 'learning_rate': 6.969202053196453e-07, 'completion_length': 312.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6303571611642838, 'rewards/format_reward': 1.0, 'reward': 1.6303572058677673, 'reward_std': 0.04646032862365246, 'kl': 0.0753173828125, 'epoch': 0.3} 30%|███ | 1299/4286 [9:57:48<20:27:56, 24.67s/it][2025-03-03 00:55:39,169] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 30%|███ | 1300/4286 [9:58:16<21:19:49, 25.72s/it] {'loss': 0.0324, 'grad_norm': 5.602169936058919, 'learning_rate': 6.966868875408307e-07, 'completion_length': 288.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6747024357318878, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6211310625076294, 'reward_std': 0.18321173265576363, 'kl': 0.8125, 'epoch': 0.3} 30%|███ | 1300/4286 [9:58:16<21:19:49, 25.72s/it][2025-03-03 00:59:02,561] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 30%|███ | 1301/4286 [10:01:40<65:31:11, 79.02s/it] {'loss': 0.0045, 'grad_norm': 1.6277639173304883, 'learning_rate': 6.964535697620159e-07, 'completion_length': 296.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.042747270315885544, 'kl': 0.1123046875, 'epoch': 0.3} 30%|███ | 1301/4286 [10:01:40<65:31:11, 79.02s/it] 30%|███ | 1302/4286 [10:02:02<51:22:53, 61.99s/it] {'loss': 0.0045, 'grad_norm': 1.4183511160563858, 'learning_rate': 6.962202519832011e-07, 'completion_length': 282.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.029761902987957, 'kl': 0.11328125, 'epoch': 0.3} 30%|███ | 1302/4286 [10:02:02<51:22:53, 61.99s/it][2025-03-03 00:59:50,150] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 30%|███ | 1303/4286 [10:02:27<42:15:12, 50.99s/it] {'loss': 0.0117, 'grad_norm': 6.313803816666947, 'learning_rate': 6.959869342043863e-07, 'completion_length': 299.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7795387208461761, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7438244819641113, 'reward_std': 0.16023226454854012, 'kl': 0.2916259765625, 'epoch': 0.3} 30%|███ | 1303/4286 [10:02:27<42:15:12, 50.99s/it] 30%|███ | 1304/4286 [10:02:52<35:36:36, 42.99s/it] {'loss': 0.0157, 'grad_norm': 1.894898228305978, 'learning_rate': 6.957536164255716e-07, 'completion_length': 316.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7101190984249115, 'rewards/format_reward': 1.0, 'reward': 1.7101191282272339, 'reward_std': 0.050684958696365356, 'kl': 0.3916015625, 'epoch': 0.3} 30%|███ | 1304/4286 [10:02:52<35:36:36, 42.99s/it] 30%|███ | 1305/4286 [10:03:16<31:06:40, 37.57s/it] {'loss': 0.0079, 'grad_norm': 2.685004746015528, 'learning_rate': 6.955202986467569e-07, 'completion_length': 256.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.7514881193637848, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.0267857164144516, 'kl': 0.1983642578125, 'epoch': 0.3} 30%|███ | 1305/4286 [10:03:16<31:06:40, 37.57s/it] 30%|███ | 1306/4286 [10:03:42<27:59:31, 33.82s/it] {'loss': 0.0229, 'grad_norm': 1.4959747481793089, 'learning_rate': 6.952869808679421e-07, 'completion_length': 307.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6889881789684296, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6711310744285583, 'reward_std': 0.1101190447807312, 'kl': 0.5748291015625, 'epoch': 0.3} 30%|███ | 1306/4286 [10:03:42<27:59:31, 33.82s/it] 30%|███ | 1307/4286 [10:04:05<25:29:20, 30.80s/it] {'loss': 0.0072, 'grad_norm': 1.366786119263946, 'learning_rate': 6.950536630891274e-07, 'completion_length': 272.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.7916666865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7738096714019775, 'reward_std': 0.0476190522313118, 'kl': 0.179443359375, 'epoch': 0.3} 30%|███ | 1307/4286 [10:04:05<25:29:20, 30.80s/it] 31%|███ | 1308/4286 [10:04:30<23:55:54, 28.93s/it] {'loss': 0.0143, 'grad_norm': 4.131435015201763, 'learning_rate': 6.948203453103126e-07, 'completion_length': 260.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.4791666865348816, 'rewards/format_reward': 1.0, 'reward': 1.4791667461395264, 'reward_std': 0.06413594260811806, 'kl': 0.357421875, 'epoch': 0.31} 31%|███ | 1308/4286 [10:04:30<23:55:54, 28.93s/it] 31%|███ | 1309/4286 [10:04:54<22:49:57, 27.61s/it] {'loss': 0.0221, 'grad_norm': 3.8085461079441316, 'learning_rate': 6.945870275314979e-07, 'completion_length': 316.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6824405193328857, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.664583444595337, 'reward_std': 0.1210826225578785, 'kl': 0.5498046875, 'epoch': 0.31} 31%|███ | 1309/4286 [10:04:54<22:49:57, 27.61s/it] 31%|███ | 1310/4286 [10:05:18<21:46:54, 26.35s/it] {'loss': 0.0197, 'grad_norm': 1.923505614560452, 'learning_rate': 6.943537097526832e-07, 'completion_length': 257.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.68452388048172, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.08826445043087006, 
'kl': 0.4921875, 'epoch': 0.31} 31%|███ | 1310/4286 [10:05:18<21:46:54, 26.35s/it] 31%|███ | 1311/4286 [10:05:43<21:30:43, 26.03s/it] {'loss': 0.0053, 'grad_norm': 1.5984341392835306, 'learning_rate': 6.941203919738684e-07, 'completion_length': 312.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7931548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7931549549102783, 'reward_std': 0.038690474815666676, 'kl': 0.132080078125, 'epoch': 0.31} 31%|███ | 1311/4286 [10:05:43<21:30:43, 26.03s/it] 31%|███ | 1312/4286 [10:06:08<21:10:44, 25.64s/it] {'loss': 0.019, 'grad_norm': 2.8597185413286854, 'learning_rate': 6.938870741950536e-07, 'completion_length': 283.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.11160699650645256, 'kl': 0.47314453125, 'epoch': 0.31} 31%|███ | 1312/4286 [10:06:08<21:10:44, 25.64s/it] 31%|███ | 1313/4286 [10:06:33<21:10:01, 25.63s/it] {'loss': 0.0028, 'grad_norm': 1.1852152141311965, 'learning_rate': 6.936537564162389e-07, 'completion_length': 263.9643020629883, 'rewards/only_full_func_accuracy_reward': 0.6502976417541504, 'rewards/format_reward': 1.0, 'reward': 1.65029776096344, 'reward_std': 0.008928571827709675, 'kl': 0.0694580078125, 'epoch': 0.31} 31%|███ | 1313/4286 [10:06:33<21:10:01, 25.63s/it] 31%|███ | 1314/4286 [10:06:59<21:11:24, 25.67s/it] {'loss': 0.0552, 'grad_norm': 22.24160101746536, 'learning_rate': 6.934204386374242e-07, 'completion_length': 324.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6160289645195007, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5803147554397583, 'reward_std': 0.2204648107290268, 'kl': 1.384765625, 'epoch': 0.31} 31%|███ | 1314/4286 [10:06:59<21:11:24, 25.67s/it] 31%|███ | 1315/4286 [10:07:25<21:17:47, 25.81s/it] {'loss': 0.0071, 'grad_norm': 4.018266766647404, 'learning_rate': 6.931871208586094e-07, 'completion_length': 321.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6452381014823914, 'rewards/format_reward': 1.0, 'reward': 1.645238220691681, 'reward_std': 0.05729053169488907, 'kl': 0.177978515625, 'epoch': 0.31} 31%|███ | 1315/4286 [10:07:25<21:17:47, 25.81s/it] 31%|███ | 1316/4286 [10:07:51<21:22:56, 25.92s/it] {'loss': 0.0146, 'grad_norm': 2.263773159274623, 'learning_rate': 6.929538030797946e-07, 'completion_length': 312.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7279762625694275, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7101191282272339, 'reward_std': 0.11427320167422295, 'kl': 0.3623046875, 'epoch': 0.31} 31%|███ | 1316/4286 [10:07:51<21:22:56, 25.92s/it] 31%|███ | 1317/4286 [10:08:17<21:17:37, 25.82s/it] {'loss': 0.0025, 'grad_norm': 1.100074864033287, 'learning_rate': 6.9272048530098e-07, 'completion_length': 330.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.922619104385376, 'rewards/format_reward': 1.0, 'reward': 1.922619104385376, 'reward_std': 0.011904759332537651, 'kl': 0.0626220703125, 'epoch': 0.31} 31%|███ | 1317/4286 [10:08:17<21:17:37, 25.82s/it] 31%|███ | 1318/4286 [10:08:42<21:06:39, 25.61s/it] {'loss': 0.0113, 'grad_norm': 5.002047983219602, 'learning_rate': 6.924871675221652e-07, 'completion_length': 290.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7242063879966736, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7063493132591248, 'reward_std': 0.10356379672884941, 'kl': 0.28350830078125, 'epoch': 0.31} 31%|███ | 1318/4286 [10:08:42<21:06:39, 25.61s/it] 31%|███ | 1319/4286 
[10:09:07<20:53:11, 25.34s/it] {'loss': 0.0167, 'grad_norm': 2.7342066404699317, 'learning_rate': 6.922538497433504e-07, 'completion_length': 302.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7455358505249023, 'reward_std': 0.0922619067132473, 'kl': 0.4169921875, 'epoch': 0.31} 31%|███ | 1319/4286 [10:09:07<20:53:11, 25.34s/it] 31%|███ | 1320/4286 [10:09:31<20:39:04, 25.07s/it] {'loss': 0.025, 'grad_norm': 1.2446644159093811, 'learning_rate': 6.920205319645357e-07, 'completion_length': 298.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949406266212463, 'reward_std': 0.05746845994144678, 'kl': 0.6195068359375, 'epoch': 0.31} 31%|███ | 1320/4286 [10:09:31<20:39:04, 25.07s/it] 31%|███ | 1321/4286 [10:09:54<20:05:14, 24.39s/it] {'loss': 0.0333, 'grad_norm': 2.9226384084739663, 'learning_rate': 6.91787214185721e-07, 'completion_length': 282.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.13800392672419548, 'kl': 0.830078125, 'epoch': 0.31} 31%|███ | 1321/4286 [10:09:54<20:05:14, 24.39s/it] 31%|███ | 1322/4286 [10:10:19<20:05:25, 24.40s/it] {'loss': 0.0289, 'grad_norm': 4.993818916612436, 'learning_rate': 6.915538964069062e-07, 'completion_length': 326.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.626488208770752, 'reward_std': 0.12797620333731174, 'kl': 0.72412109375, 'epoch': 0.31} 31%|███ | 1322/4286 [10:10:19<20:05:25, 24.40s/it][2025-03-03 01:08:07,069] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 31%|███ | 1323/4286 [10:10:44<20:22:30, 24.76s/it] {'loss': 0.0124, 'grad_norm': 4.332325006020521, 'learning_rate': 6.913205786280915e-07, 'completion_length': 310.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6747024208307266, 'rewards/format_reward': 1.0, 'reward': 1.674702525138855, 'reward_std': 0.11466451361775398, 'kl': 0.310546875, 'epoch': 0.31} 31%|███ | 1323/4286 [10:10:44<20:22:30, 24.76s/it] 31%|███ | 1324/4286 [10:11:08<20:09:31, 24.50s/it] {'loss': 0.0068, 'grad_norm': 2.912490373545279, 'learning_rate': 6.910872608492767e-07, 'completion_length': 280.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7247024774551392, 'reward_std': 0.12134971842169762, 'kl': 0.170654296875, 'epoch': 0.31} 31%|███ | 1324/4286 [10:11:08<20:09:31, 24.50s/it] 31%|███ | 1325/4286 [10:11:35<20:41:48, 25.16s/it] {'loss': 0.0325, 'grad_norm': 4.348664637164406, 'learning_rate': 6.90853943070462e-07, 'completion_length': 293.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6148955821990967, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5434670448303223, 'reward_std': 0.19820769131183624, 'kl': 0.810546875, 'epoch': 0.31} 31%|███ | 1325/4286 [10:11:35<20:41:48, 25.16s/it] 31%|███ | 1326/4286 [10:11:58<20:08:45, 24.50s/it] {'loss': 0.0291, 'grad_norm': 2.549478874695474, 'learning_rate': 6.906206252916472e-07, 'completion_length': 256.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.7287415564060211, 'rewards/format_reward': 1.0, 'reward': 1.7287415862083435, 'reward_std': 0.0410783477127552, 'kl': 0.728515625, 'epoch': 0.31} 31%|███ | 1326/4286 [10:11:58<20:08:45, 24.50s/it] 31%|███ | 1327/4286 [10:12:21<19:44:37, 24.02s/it] {'loss': 0.0209, 'grad_norm': 3.731082391710463, 'learning_rate': 6.903873075128325e-07, 'completion_length': 273.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.5758928954601288, 'rewards/format_reward': 1.0, 'reward': 1.575892984867096, 'reward_std': 0.05059523694217205, 'kl': 0.52001953125, 'epoch': 0.31} 31%|███ | 1327/4286 [10:12:21<19:44:37, 24.02s/it] 31%|███ | 1328/4286 [10:12:44<19:30:28, 23.74s/it] {'loss': 0.0486, 'grad_norm': 3.2138564550739166, 'learning_rate': 6.901539897340177e-07, 'completion_length': 285.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7752976417541504, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.08041993249207735, 'kl': 1.21484375, 'epoch': 0.31} 31%|███ | 1328/4286 [10:12:44<19:30:28, 23.74s/it] 31%|███ | 1329/4286 [10:13:09<19:48:05, 24.11s/it] {'loss': 0.0265, 'grad_norm': 6.351023530377551, 'learning_rate': 6.899206719552029e-07, 'completion_length': 311.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.035714288242161274, 'kl': 0.66064453125, 'epoch': 0.31} 31%|███ | 1329/4286 [10:13:09<19:48:05, 24.11s/it] 31%|███ | 1330/4286 [10:13:34<20:03:30, 24.43s/it] {'loss': 0.0245, 'grad_norm': 4.9134972690633125, 'learning_rate': 6.896873541763883e-07, 'completion_length': 314.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7333333790302277, 'rewards/format_reward': 1.0, 'reward': 1.7333334684371948, 'reward_std': 0.0948440209031105, 'kl': 0.61083984375, 'epoch': 0.31} 
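The stage3.py warning that recurs throughout this run names its own remedy: either reduce memory pressure via the DeepSpeed config, or flush the PyTorch allocator cache on all ranks at the same step. A minimal sketch of the second option follows, assuming an already-initialized DeepSpeed engine; `engine`, `loader`, and `flush_every` are illustrative placeholders, not values taken from this run.

from deepspeed.accelerator import get_accelerator

def train(engine, loader, flush_every: int = 50):
    for step, batch in enumerate(loader):
        loss = engine(batch)   # forward through the DeepSpeed-wrapped model (placeholder call)
        engine.backward(loss)  # DeepSpeed-managed backward
        engine.step()          # optimizer step; stage3.py emits the cache-flush warning here

        if step % flush_every == 0:
            # Flush the allocator cache on every rank at the same step, as the
            # warning suggests, rather than letting ranks flush at random times.
            get_accelerator().empty_cache()

Because empty_cache() itself stalls the device, flushing too often can slow the run further; the warning's first suggestion, reducing memory consumption, usually means tightening ZeRO-3 settings such as stage3_max_live_parameters or stage3_prefetch_bucket_size in the DeepSpeed config (named here as an assumption about the config in use, which this log does not show).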
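For reading runs like this one after the fact, note that the per-step metrics are plain Python dict literals printed after each tqdm stamp, so they can be recovered directly from the raw log text. The sketch below is illustrative tooling, not part of the training code; `log_text` and the file name are placeholders.

import ast
import re

# Each step logs a flat dict literal from 'loss' through 'epoch'; there are no
# nested braces, so a non-greedy match up to the closing brace is sufficient.
# re.DOTALL lets a dict that was wrapped across captured lines still match.
DICT_RE = re.compile(r"\{'loss':.*?'epoch': [0-9.]+\}", flags=re.DOTALL)

def parse_metrics(log_text: str) -> list[dict]:
    return [ast.literal_eval(m.group(0)) for m in DICT_RE.finditer(log_text)]

# Example usage (hypothetical file name):
# records = parse_metrics(open("train.log").read())
# print(sum(r['reward'] for r in records) / len(records))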
31%|███ | 1330/4286 [10:13:34<20:03:30, 24.43s/it] 31%|███ | 1331/4286 [10:13:59<20:12:19, 24.62s/it] {'loss': 0.0024, 'grad_norm': 1.5659485933636685, 'learning_rate': 6.894540363975735e-07, 'completion_length': 304.625, 'rewards/only_full_func_accuracy_reward': 0.8127976655960083, 'rewards/format_reward': 1.0, 'reward': 1.8127976655960083, 'reward_std': 0.033524114172905684, 'kl': 0.060546875, 'epoch': 0.31} 31%|███ | 1331/4286 [10:13:59<20:12:19, 24.62s/it] 31%|███ | 1332/4286 [10:14:22<19:46:09, 24.09s/it] {'loss': 0.0043, 'grad_norm': 4.415673371231028, 'learning_rate': 6.892207186187587e-07, 'completion_length': 268.0, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 1.0, 'reward': 1.641369104385376, 'reward_std': 0.08311965502798557, 'kl': 0.107666015625, 'epoch': 0.31} 31%|███ | 1332/4286 [10:14:22<19:46:09, 24.09s/it] 31%|███ | 1333/4286 [10:14:46<19:47:46, 24.13s/it] {'loss': 0.0044, 'grad_norm': 2.8171548951898466, 'learning_rate': 6.88987400839944e-07, 'completion_length': 319.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6651785969734192, 'rewards/format_reward': 1.0, 'reward': 1.6651787161827087, 'reward_std': 0.08907203748822212, 'kl': 0.109619140625, 'epoch': 0.31} 31%|███ | 1333/4286 [10:14:46<19:47:46, 24.13s/it] 31%|███ | 1334/4286 [10:15:10<19:47:48, 24.14s/it] {'loss': 0.0099, 'grad_norm': 3.9758975405587527, 'learning_rate': 6.887540830611293e-07, 'completion_length': 252.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.703869104385376, 'reward_std': 0.04802638664841652, 'kl': 0.24658203125, 'epoch': 0.31} 31%|███ | 1334/4286 [10:15:10<19:47:48, 24.14s/it] 31%|███ | 1335/4286 [10:15:34<19:42:17, 24.04s/it] {'loss': 0.0136, 'grad_norm': 11.73103225825875, 'learning_rate': 6.885207652823145e-07, 'completion_length': 300.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6898810267448425, 'rewards/format_reward': 1.0, 'reward': 1.6898810863494873, 'reward_std': 0.053740641102194786, 'kl': 0.339111328125, 'epoch': 0.31} 31%|███ | 1335/4286 [10:15:34<19:42:17, 24.04s/it] 31%|███ | 1336/4286 [10:15:56<19:11:58, 23.43s/it] {'loss': 0.0196, 'grad_norm': 5.328914606745517, 'learning_rate': 6.882874475034997e-07, 'completion_length': 264.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7812500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.763392984867096, 'reward_std': 0.07280982472002506, 'kl': 0.48828125, 'epoch': 0.31} 31%|███ | 1336/4286 [10:15:56<19:11:58, 23.43s/it] 31%|███ | 1337/4286 [10:16:19<19:11:15, 23.42s/it] {'loss': 0.0303, 'grad_norm': 5.282925518704042, 'learning_rate': 6.88054129724685e-07, 'completion_length': 249.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.07820136100053787, 'kl': 0.7578125, 'epoch': 0.31} 31%|███ | 1337/4286 [10:16:19<19:11:15, 23.42s/it] 31%|███ | 1338/4286 [10:16:43<19:20:04, 23.61s/it] {'loss': 0.0253, 'grad_norm': 2.226916487584575, 'learning_rate': 6.878208119458703e-07, 'completion_length': 303.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6577381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6398810744285583, 'reward_std': 0.0844139358960092, 'kl': 0.6328125, 'epoch': 0.31} 31%|███ | 1338/4286 [10:16:43<19:20:04, 23.61s/it][2025-03-03 01:14:30,427] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. 
this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 31%|███ | 1339/4286 [10:17:08<19:26:33, 23.75s/it] {'loss': 0.0121, 'grad_norm': 4.960686191495942, 'learning_rate': 6.875874941670555e-07, 'completion_length': 270.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.08332165516912937, 'kl': 0.302734375, 'epoch': 0.31} 31%|███ | 1339/4286 [10:17:08<19:26:33, 23.75s/it] 31%|███▏ | 1340/4286 [10:17:31<19:25:11, 23.73s/it] {'loss': 0.0221, 'grad_norm': 13.263920043850415, 'learning_rate': 6.873541763882408e-07, 'completion_length': 296.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.1061689518392086, 'kl': 0.552734375, 'epoch': 0.31} 31%|███▏ | 1340/4286 [10:17:31<19:25:11, 23.73s/it] 31%|███▏ | 1341/4286 [10:17:54<19:08:45, 23.40s/it] {'loss': 0.0256, 'grad_norm': 3.2859361472325634, 'learning_rate': 6.87120858609426e-07, 'completion_length': 274.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.0773809514939785, 'kl': 0.638671875, 'epoch': 0.31} 31%|███▏ | 1341/4286 [10:17:54<19:08:45, 23.40s/it] 31%|███▏ | 1342/4286 [10:18:19<19:35:57, 23.97s/it] {'loss': 0.0414, 'grad_norm': 2.9883841012347463, 'learning_rate': 6.868875408306113e-07, 'completion_length': 262.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6157913506031036, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5622199773788452, 'reward_std': 0.1707277074456215, 'kl': 1.03125, 'epoch': 0.31} 31%|███▏ | 1342/4286 [10:18:19<19:35:57, 23.97s/it] 31%|███▏ | 1343/4286 [10:18:43<19:31:44, 23.89s/it] {'loss': 0.005, 'grad_norm': 14.258116939647156, 'learning_rate': 6.866542230517966e-07, 'completion_length': 307.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.6982143223285675, 'rewards/format_reward': 1.0, 'reward': 1.69821435213089, 'reward_std': 0.11494078114628792, 'kl': 0.12451171875, 'epoch': 0.31} 31%|███▏ | 1343/4286 [10:18:43<19:31:44, 23.89s/it] 31%|███▏ | 1344/4286 [10:19:06<19:15:35, 23.57s/it] {'loss': 0.0035, 'grad_norm': 6.722792925389998, 'learning_rate': 6.864209052729818e-07, 'completion_length': 252.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8511905074119568, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.02816697023808956, 'kl': 0.08740234375, 'epoch': 0.31} 31%|███▏ | 1344/4286 [10:19:06<19:15:35, 23.57s/it] 31%|███▏ | 1345/4286 [10:19:31<19:47:23, 24.22s/it] {'loss': 0.0171, 'grad_norm': 34.82372446804567, 'learning_rate': 6.86187587494167e-07, 'completion_length': 297.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7910715043544769, 'rewards/format_reward': 1.0, 'reward': 1.7910715341567993, 'reward_std': 0.1051434688270092, 'kl': 0.427734375, 'epoch': 0.31} 31%|███▏ | 1345/4286 [10:19:31<19:47:23, 24.22s/it] 31%|███▏ | 1346/4286 [10:19:55<19:41:49, 24.12s/it] {'loss': 0.0432, 'grad_norm': 5.4978963604506985, 'learning_rate': 6.859542697153524e-07, 'completion_length': 298.26788330078125, 
'rewards/only_full_func_accuracy_reward': 0.6555060148239136, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6197917461395264, 'reward_std': 0.1695360280573368, 'kl': 1.076171875, 'epoch': 0.31} 31%|███▏ | 1346/4286 [10:19:55<19:41:49, 24.12s/it] 31%|███▏ | 1347/4286 [10:20:20<19:44:04, 24.17s/it] {'loss': 0.0231, 'grad_norm': 28.395108959146324, 'learning_rate': 6.857209519365376e-07, 'completion_length': 286.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6101191639900208, 'reward_std': 0.1424297858029604, 'kl': 0.5791015625, 'epoch': 0.31} 31%|███▏ | 1347/4286 [10:20:20<19:44:04, 24.17s/it] 31%|███▏ | 1348/4286 [10:20:43<19:32:08, 23.94s/it] {'loss': 0.0074, 'grad_norm': 4.573876903551425, 'learning_rate': 6.854876341577228e-07, 'completion_length': 267.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.7767857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.02380952052772045, 'kl': 0.18359375, 'epoch': 0.31} 31%|███▏ | 1348/4286 [10:20:43<19:32:08, 23.94s/it] 31%|███▏ | 1349/4286 [10:21:07<19:39:53, 24.10s/it] {'loss': 0.0348, 'grad_norm': 1.485369922650233, 'learning_rate': 6.85254316378908e-07, 'completion_length': 306.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7675595879554749, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7318453192710876, 'reward_std': 0.11097152531147003, 'kl': 0.87158203125, 'epoch': 0.31} 31%|███▏ | 1349/4286 [10:21:07<19:39:53, 24.10s/it] 31%|███▏ | 1350/4286 [10:21:31<19:24:48, 23.80s/it] {'loss': 0.0043, 'grad_norm': 20.71731550299562, 'learning_rate': 6.850209986000934e-07, 'completion_length': 232.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7931548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7931548953056335, 'reward_std': 0.032738096080720425, 'kl': 0.107177734375, 'epoch': 0.31} 31%|███▏ | 1350/4286 [10:21:31<19:24:48, 23.80s/it] 32%|███▏ | 1351/4286 [10:21:55<19:38:12, 24.09s/it] {'loss': 0.0141, 'grad_norm': 22.258432805276808, 'learning_rate': 6.847876808212786e-07, 'completion_length': 297.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.5074405074119568, 'rewards/format_reward': 1.0, 'reward': 1.5074406266212463, 'reward_std': 0.05654761753976345, 'kl': 0.3525390625, 'epoch': 0.32} 32%|███▏ | 1351/4286 [10:21:55<19:38:12, 24.09s/it] 32%|███▏ | 1352/4286 [10:22:19<19:37:23, 24.08s/it] {'loss': 0.0127, 'grad_norm': 8.743367900246835, 'learning_rate': 6.845543630424638e-07, 'completion_length': 274.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.0833333358168602, 'kl': 0.31787109375, 'epoch': 0.32} 32%|███▏ | 1352/4286 [10:22:19<19:37:23, 24.08s/it] 32%|███▏ | 1353/4286 [10:22:43<19:25:43, 23.85s/it] {'loss': 0.0244, 'grad_norm': 5.131892974987004, 'learning_rate': 6.843210452636491e-07, 'completion_length': 261.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7276785969734192, 'rewards/format_reward': 1.0, 'reward': 1.727678656578064, 'reward_std': 0.0803571455180645, 'kl': 0.609375, 'epoch': 0.32} 32%|███▏ | 1353/4286 [10:22:43<19:25:43, 23.85s/it] 32%|███▏ | 1354/4286 [10:23:06<19:13:01, 23.60s/it] {'loss': 0.0192, 'grad_norm': 2.9153626966757313, 'learning_rate': 6.840877274848343e-07, 'completion_length': 266.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7886905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7886906266212463, 
'reward_std': 0.06173977255821228, 'kl': 0.47998046875, 'epoch': 0.32} 32%|███▏ | 1354/4286 [10:23:06<19:13:01, 23.60s/it] 32%|███▏ | 1355/4286 [10:23:30<19:17:55, 23.70s/it] {'loss': 0.021, 'grad_norm': 4.222943390780748, 'learning_rate': 6.838544097060196e-07, 'completion_length': 312.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.6630952954292297, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.645238220691681, 'reward_std': 0.11883765645325184, 'kl': 0.52587890625, 'epoch': 0.32} 32%|███▏ | 1355/4286 [10:23:30<19:17:55, 23.70s/it] 32%|███▏ | 1356/4286 [10:23:54<19:34:39, 24.05s/it] {'loss': 0.0097, 'grad_norm': 19.66650537390135, 'learning_rate': 6.836210919272049e-07, 'completion_length': 283.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.574404776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5565477013587952, 'reward_std': 0.07988504506647587, 'kl': 0.24334716796875, 'epoch': 0.32} 32%|███▏ | 1356/4286 [10:23:55<19:34:39, 24.05s/it] 32%|███▏ | 1357/4286 [10:24:21<20:05:33, 24.70s/it] {'loss': 0.0204, 'grad_norm': 3.8091539755000077, 'learning_rate': 6.833877741483901e-07, 'completion_length': 306.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6562500596046448, 'rewards/format_reward': 1.0, 'reward': 1.6562501192092896, 'reward_std': 0.07782968878746033, 'kl': 0.51171875, 'epoch': 0.32} 32%|███▏ | 1357/4286 [10:24:21<20:05:33, 24.70s/it] 32%|███▏ | 1358/4286 [10:24:45<19:58:08, 24.55s/it] {'loss': 0.0129, 'grad_norm': 10.346928320272184, 'learning_rate': 6.831544563695753e-07, 'completion_length': 265.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.8720238208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8541668057441711, 'reward_std': 0.13584475964307785, 'kl': 0.321044921875, 'epoch': 0.32} 32%|███▏ | 1358/4286 [10:24:45<19:58:08, 24.55s/it] 32%|███▏ | 1359/4286 [10:25:10<20:01:07, 24.62s/it] {'loss': 0.0553, 'grad_norm': 4.9027503405337365, 'learning_rate': 6.829211385907606e-07, 'completion_length': 306.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6946429014205933, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6410715579986572, 'reward_std': 0.20683162659406662, 'kl': 1.38671875, 'epoch': 0.32} 32%|███▏ | 1359/4286 [10:25:10<20:01:07, 24.62s/it] 32%|███▏ | 1360/4286 [10:25:34<19:58:43, 24.58s/it] {'loss': 0.0211, 'grad_norm': 1.9671886818926358, 'learning_rate': 6.826878208119459e-07, 'completion_length': 307.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.04007173515856266, 'kl': 0.525390625, 'epoch': 0.32} 32%|███▏ | 1360/4286 [10:25:34<19:58:43, 24.58s/it] 32%|███▏ | 1361/4286 [10:26:02<20:52:50, 25.70s/it] {'loss': 0.0244, 'grad_norm': 3.1673687903072767, 'learning_rate': 6.824545030331311e-07, 'completion_length': 301.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6976877450942993, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6619735956192017, 'reward_std': 0.07972212694585323, 'kl': 0.609619140625, 'epoch': 0.32} 32%|███▏ | 1361/4286 [10:26:02<20:52:50, 25.70s/it] 32%|███▏ | 1362/4286 [10:26:29<21:06:20, 25.99s/it] {'loss': 0.0285, 'grad_norm': 14.792153139712365, 'learning_rate': 6.822211852543163e-07, 'completion_length': 304.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7711309790611267, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7532739043235779, 'reward_std': 0.11499597132205963, 'kl': 
0.712890625, 'epoch': 0.32} 32%|███▏ | 1362/4286 [10:26:29<21:06:20, 25.99s/it] 32%|███▏ | 1363/4286 [10:26:53<20:29:55, 25.25s/it] {'loss': 0.0104, 'grad_norm': 3.042992760930332, 'learning_rate': 6.819878674755017e-07, 'completion_length': 294.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6502976268529892, 'rewards/format_reward': 1.0, 'reward': 1.65029776096344, 'reward_std': 0.03869047947227955, 'kl': 0.25927734375, 'epoch': 0.32} 32%|███▏ | 1363/4286 [10:26:53<20:29:55, 25.25s/it] 32%|███▏ | 1364/4286 [10:27:15<19:51:42, 24.47s/it] {'loss': 0.0127, 'grad_norm': 3.836180926195307, 'learning_rate': 6.817545496966869e-07, 'completion_length': 258.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6383929252624512, 'reward_std': 0.06777828186750412, 'kl': 0.31787109375, 'epoch': 0.32} 32%|███▏ | 1364/4286 [10:27:15<19:51:42, 24.47s/it] 32%|███▏ | 1365/4286 [10:27:40<19:56:35, 24.58s/it] {'loss': 0.0148, 'grad_norm': 1.4522969962246688, 'learning_rate': 6.815212319178721e-07, 'completion_length': 279.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7446429431438446, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7089287042617798, 'reward_std': 0.10662448033690453, 'kl': 0.37060546875, 'epoch': 0.32} 32%|███▏ | 1365/4286 [10:27:40<19:56:35, 24.58s/it] 32%|███▏ | 1366/4286 [10:28:04<19:44:57, 24.35s/it] {'loss': 0.0133, 'grad_norm': 2.3189932710303958, 'learning_rate': 6.812879141390574e-07, 'completion_length': 281.89288330078125, 'rewards/only_full_func_accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803571939468384, 'reward_std': 0.017857140861451626, 'kl': 0.3310546875, 'epoch': 0.32} 32%|███▏ | 1366/4286 [10:28:04<19:44:57, 24.35s/it] 32%|███▏ | 1367/4286 [10:28:27<19:29:42, 24.04s/it] {'loss': 0.0293, 'grad_norm': 2.130073347872499, 'learning_rate': 6.810545963602427e-07, 'completion_length': 265.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7916666865348816, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.755952537059784, 'reward_std': 0.0833333395421505, 'kl': 0.73095703125, 'epoch': 0.32} 32%|███▏ | 1367/4286 [10:28:27<19:29:42, 24.04s/it] 32%|███▏ | 1368/4286 [10:28:50<19:14:08, 23.73s/it] {'loss': 0.0088, 'grad_norm': 2.640223877709104, 'learning_rate': 6.808212785814279e-07, 'completion_length': 297.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6934524774551392, 'reward_std': 0.05222323164343834, 'kl': 0.220703125, 'epoch': 0.32} 32%|███▏ | 1368/4286 [10:28:50<19:14:08, 23.73s/it] 32%|███▏ | 1369/4286 [10:29:14<19:14:26, 23.75s/it] {'loss': 0.0172, 'grad_norm': 1.7664267754294747, 'learning_rate': 6.805879608026132e-07, 'completion_length': 303.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.09204822406172752, 'kl': 0.427734375, 'epoch': 0.32} 32%|███▏ | 1369/4286 [10:29:14<19:14:26, 23.75s/it] 32%|███▏ | 1370/4286 [10:29:38<19:15:21, 23.77s/it] {'loss': 0.0344, 'grad_norm': 8.46774575891623, 'learning_rate': 6.803546430237984e-07, 'completion_length': 276.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.6092262268066406, 'rewards/format_reward': 1.0, 'reward': 1.6092262864112854, 'reward_std': 0.05535714514553547, 'kl': 0.861328125, 'epoch': 0.32} 32%|███▏ | 1370/4286 [10:29:38<19:15:21, 23.77s/it] 32%|███▏ | 
1371/4286 [10:30:05<19:57:26, 24.65s/it] {'loss': 0.0076, 'grad_norm': 1.9863099949376875, 'learning_rate': 6.801213252449837e-07, 'completion_length': 319.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.5297619253396988, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5119048953056335, 'reward_std': 0.09007243998348713, 'kl': 0.1884765625, 'epoch': 0.32} 32%|███▏ | 1371/4286 [10:30:05<19:57:26, 24.65s/it][2025-03-03 01:27:53,393] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 32%|███▏ | 1372/4286 [10:30:30<20:14:50, 25.01s/it] {'loss': 0.0144, 'grad_norm': 1.8030421942757118, 'learning_rate': 6.798880074661689e-07, 'completion_length': 314.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.57440485060215, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5565477013587952, 'reward_std': 0.1011904776096344, 'kl': 0.361328125, 'epoch': 0.32} 32%|███▏ | 1372/4286 [10:30:30<20:14:50, 25.01s/it] 32%|███▏ | 1373/4286 [10:30:53<19:45:17, 24.41s/it] {'loss': 0.0025, 'grad_norm': 1.6912165979302891, 'learning_rate': 6.796546896873542e-07, 'completion_length': 245.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6770834028720856, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.07121489197015762, 'kl': 0.0623779296875, 'epoch': 0.32} 32%|███▏ | 1373/4286 [10:30:53<19:45:17, 24.41s/it] 32%|███▏ | 1374/4286 [10:31:17<19:36:39, 24.24s/it] {'loss': 0.0033, 'grad_norm': 2.166609307274838, 'learning_rate': 6.794213719085394e-07, 'completion_length': 311.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.09661935456097126, 'kl': 0.0833740234375, 'epoch': 0.32} 32%|███▏ | 1374/4286 [10:31:17<19:36:39, 24.24s/it] 32%|███▏ | 1375/4286 [10:31:39<19:03:42, 23.57s/it] {'loss': 0.01, 'grad_norm': 2.4623968298420396, 'learning_rate': 6.791880541297246e-07, 'completion_length': 269.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 1.0, 'reward': 1.6502977013587952, 'reward_std': 0.06526250159367919, 'kl': 0.2481689453125, 'epoch': 0.32} 32%|███▏ | 1375/4286 [10:31:39<19:03:42, 23.57s/it][2025-03-03 01:29:28,321] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 32%|███▏ | 1376/4286 [10:32:05<19:39:27, 24.32s/it] {'loss': 0.0048, 'grad_norm': 1.4192089793651093, 'learning_rate': 6.7895473635091e-07, 'completion_length': 286.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6294642984867096, 'rewards/format_reward': 1.0, 'reward': 1.629464328289032, 'reward_std': 0.020833331160247326, 'kl': 0.1190185546875, 'epoch': 0.32} 32%|███▏ | 1376/4286 [10:32:05<19:39:27, 24.32s/it][2025-03-03 01:29:54,081] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 32%|███▏ | 1377/4286 [10:32:31<20:00:00, 24.75s/it] {'loss': 0.0017, 'grad_norm': 3.1557712487078486, 'learning_rate': 6.787214185720952e-07, 'completion_length': 318.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.8839286267757416, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.03847679682075977, 'kl': 0.04345703125, 'epoch': 0.32} 32%|███▏ | 1377/4286 [10:32:31<20:00:00, 24.75s/it][2025-03-03 01:30:16,932] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 32%|███▏ | 1378/4286 [10:32:54<19:31:58, 24.18s/it] {'loss': 0.0058, 'grad_norm': 2.522220899528347, 'learning_rate': 6.784881007932804e-07, 'completion_length': 254.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.766369104385376, 'rewards/format_reward': 1.0, 'reward': 1.7663691639900208, 'reward_std': 0.06250000465661287, 'kl': 0.146240234375, 'epoch': 0.32} 32%|███▏ | 1378/4286 [10:32:54<19:31:58, 24.18s/it][2025-03-03 01:30:44,523] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 32%|███▏ | 1379/4286 [10:33:22<20:21:08, 25.20s/it] {'loss': 0.0188, 'grad_norm': 5.348288447430705, 'learning_rate': 6.782547830144657e-07, 'completion_length': 326.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.571428656578064, 'reward_std': 0.2646516114473343, 'kl': 0.4697265625, 'epoch': 0.32} 32%|███▏ | 1379/4286 [10:33:22<20:21:08, 25.20s/it][2025-03-03 01:31:07,878] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. 
if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 32%|███▏ | 1380/4286 [10:33:45<19:53:51, 24.65s/it] {'loss': 0.0029, 'grad_norm': 4.721277044135586, 'learning_rate': 6.78021465235651e-07, 'completion_length': 255.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6190476417541504, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.06388125661760569, 'kl': 0.071533203125, 'epoch': 0.32} 32%|███▏ | 1380/4286 [10:33:45<19:53:51, 24.65s/it][2025-03-03 01:31:34,247] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 32%|███▏ | 1381/4286 [10:34:11<20:18:24, 25.17s/it] {'loss': 0.0048, 'grad_norm': 0.6586188707829539, 'learning_rate': 6.777881474568362e-07, 'completion_length': 316.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.0, 'kl': 0.1195068359375, 'epoch': 0.32} 32%|███▏ | 1381/4286 [10:34:11<20:18:24, 25.17s/it][2025-03-03 01:31:59,532] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 32%|███▏ | 1382/4286 [10:34:37<20:19:45, 25.20s/it] {'loss': 0.0055, 'grad_norm': 5.338614378502678, 'learning_rate': 6.775548296780214e-07, 'completion_length': 295.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8422619104385376, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.03818839695304632, 'kl': 0.138427734375, 'epoch': 0.32} 32%|███▏ | 1382/4286 [10:34:37<20:19:45, 25.20s/it] 32%|███▏ | 1383/4286 [10:35:01<20:13:16, 25.08s/it] {'loss': 0.0105, 'grad_norm': 1.8754561567417805, 'learning_rate': 6.773215118992067e-07, 'completion_length': 284.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678572535514832, 'reward_std': 0.08038126677274704, 'kl': 0.2607421875, 'epoch': 0.32} 32%|███▏ | 1383/4286 [10:35:01<20:13:16, 25.08s/it] 32%|███▏ | 1384/4286 [10:35:26<20:08:58, 25.00s/it] {'loss': 0.0018, 'grad_norm': 1.3496450039787733, 'learning_rate': 6.77088194120392e-07, 'completion_length': 326.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6279762983322144, 'reward_std': 0.029761902987957, 'kl': 0.04608154296875, 'epoch': 0.32} 32%|███▏ | 1384/4286 [10:35:26<20:08:58, 25.00s/it] 32%|███▏ | 1385/4286 [10:35:51<20:02:36, 24.87s/it] {'loss': 0.0063, 'grad_norm': 1.6233846593915924, 'learning_rate': 6.768548763415772e-07, 'completion_length': 276.5893020629883, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5982143878936768, 'reward_std': 0.07650598883628845, 'kl': 0.15771484375, 'epoch': 0.32} 32%|███▏ | 1385/4286 [10:35:51<20:02:36, 24.87s/it] 32%|███▏ | 1386/4286 [10:36:15<19:46:45, 24.55s/it] {'loss': 0.0074, 'grad_norm': 14.434626536463803, 'learning_rate': 6.766215585627625e-07, 'completion_length': 280.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7872024774551392, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.03709554299712181, 'kl': 0.1854248046875, 'epoch': 0.32} 32%|███▏ | 1386/4286 [10:36:15<19:46:45, 24.55s/it] 32%|███▏ | 1387/4286 [10:36:40<20:04:43, 24.93s/it] {'loss': 0.0016, 'grad_norm': 0.8277226221678108, 'learning_rate': 6.763882407839477e-07, 'completion_length': 294.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6300595998764038, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.612202525138855, 'reward_std': 0.10205161198973656, 'kl': 0.0408935546875, 'epoch': 0.32} 32%|███▏ | 1387/4286 [10:36:40<20:04:43, 24.93s/it] 32%|███▏ | 1388/4286 [10:37:05<20:04:29, 24.94s/it] {'loss': 0.019, 'grad_norm': 2.7651489578840813, 'learning_rate': 6.76154923005133e-07, 'completion_length': 304.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7291667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7291667461395264, 'reward_std': 0.06044464744627476, 'kl': 0.475830078125, 'epoch': 0.32} 32%|███▏ | 1388/4286 [10:37:05<20:04:29, 24.94s/it] 32%|███▏ | 1389/4286 [10:37:30<20:00:46, 24.87s/it] {'loss': 0.0075, 'grad_norm': 3.130719759735214, 'learning_rate': 6.759216052263183e-07, 'completion_length': 318.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6979167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.035413630306720734, 'kl': 
0.18865966796875, 'epoch': 0.32} 32%|███▏ | 1389/4286 [10:37:30<20:00:46, 24.87s/it] 32%|███▏ | 1390/4286 [10:37:56<20:12:37, 25.12s/it] {'loss': 0.0017, 'grad_norm': 4.3398667640660795, 'learning_rate': 6.756882874475035e-07, 'completion_length': 264.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.05541309528052807, 'kl': 0.0423583984375, 'epoch': 0.32} 32%|███▏ | 1390/4286 [10:37:56<20:12:37, 25.12s/it] 32%|███▏ | 1391/4286 [10:38:21<20:18:24, 25.25s/it] {'loss': 0.0081, 'grad_norm': 3.1776332543557264, 'learning_rate': 6.754549696686887e-07, 'completion_length': 312.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.5476190596818924, 'rewards/format_reward': 1.0, 'reward': 1.5476192235946655, 'reward_std': 0.08352040126919746, 'kl': 0.201904296875, 'epoch': 0.32} 32%|███▏ | 1391/4286 [10:38:21<20:18:24, 25.25s/it] 32%|███▏ | 1392/4286 [10:38:46<20:08:10, 25.05s/it] {'loss': 0.003, 'grad_norm': 2.5132630514949748, 'learning_rate': 6.752216518898741e-07, 'completion_length': 317.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7217263579368591, 'reward_std': 0.053357748314738274, 'kl': 0.0745849609375, 'epoch': 0.32} 32%|███▏ | 1392/4286 [10:38:46<20:08:10, 25.05s/it] 33%|███▎ | 1393/4286 [10:39:12<20:15:50, 25.22s/it] {'loss': 0.0118, 'grad_norm': 2.0643177043123346, 'learning_rate': 6.749883341110593e-07, 'completion_length': 268.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6636905670166016, 'reward_std': 0.0854631932452321, 'kl': 0.294921875, 'epoch': 0.33} 33%|███▎ | 1393/4286 [10:39:12<20:15:50, 25.22s/it][2025-03-03 01:36:59,661] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 33%|███▎ | 1394/4286 [10:39:37<20:15:22, 25.22s/it] {'loss': 0.0035, 'grad_norm': 1.5637759375333868, 'learning_rate': 6.747550163322445e-07, 'completion_length': 285.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6443452537059784, 'rewards/format_reward': 1.0, 'reward': 1.6443453431129456, 'reward_std': 0.06090506445616484, 'kl': 0.0885009765625, 'epoch': 0.33} 33%|███▎ | 1394/4286 [10:39:37<20:15:22, 25.22s/it] 33%|███▎ | 1395/4286 [10:40:03<20:26:41, 25.46s/it] {'loss': 0.0167, 'grad_norm': 47.44655836822487, 'learning_rate': 6.745216985534297e-07, 'completion_length': 318.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6294643878936768, 'reward_std': 0.17871829494833946, 'kl': 0.41796875, 'epoch': 0.33} 33%|███▎ | 1395/4286 [10:40:03<20:26:41, 25.46s/it][2025-03-03 01:37:52,026] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 33%|███▎ | 1396/4286 [10:40:29<20:38:58, 25.72s/it] {'loss': 0.0032, 'grad_norm': 0.39364189838687125, 'learning_rate': 6.742883807746151e-07, 'completion_length': 311.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.760416716337204, 'rewards/format_reward': 1.0, 'reward': 1.7604168057441711, 'reward_std': 0.008928571827709675, 'kl': 0.081298828125, 'epoch': 0.33} 33%|███▎ | 1396/4286 [10:40:29<20:38:58, 25.72s/it][2025-03-03 01:38:18,654] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 33%|███▎ | 1397/4286 [10:40:56<20:51:37, 25.99s/it] {'loss': 0.0246, 'grad_norm': 4.762537882088138, 'learning_rate': 6.740550629958003e-07, 'completion_length': 352.3214569091797, 'rewards/only_full_func_accuracy_reward': 0.7303571999073029, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.694642961025238, 'reward_std': 0.13434407487511635, 'kl': 0.61328125, 'epoch': 0.33} 33%|███▎ | 1397/4286 [10:40:56<20:51:37, 25.99s/it] 33%|███▎ | 1398/4286 [10:41:20<20:32:46, 25.61s/it] {'loss': 0.0112, 'grad_norm': 2.0964715308996276, 'learning_rate': 6.738217452169855e-07, 'completion_length': 307.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7351191639900208, 'rewards/format_reward': 1.0, 'reward': 1.7351192235946655, 'reward_std': 0.06228632293641567, 'kl': 0.2802734375, 'epoch': 0.33} 33%|███▎ | 1398/4286 [10:41:20<20:32:46, 25.61s/it] 33%|███▎ | 1399/4286 [10:41:47<20:44:17, 25.86s/it] {'loss': 0.0033, 'grad_norm': 2.4194901006912297, 'learning_rate': 6.735884274381708e-07, 'completion_length': 345.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.5744048058986664, 'rewards/format_reward': 1.0, 'reward': 1.5744048953056335, 'reward_std': 0.09388989955186844, 'kl': 0.083251953125, 'epoch': 0.33} 33%|███▎ | 1399/4286 [10:41:47<20:44:17, 25.86s/it][2025-03-03 01:39:36,815] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
33%|███▎ | 1400/4286 [10:42:14<21:00:22, 26.20s/it] {'loss': 0.0089, 'grad_norm': 7.163501290852484, 'learning_rate': 6.73355109659356e-07, 'completion_length': 304.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6514881253242493, 'rewards/format_reward': 1.0, 'reward': 1.6514881253242493, 'reward_std': 0.028361600823700428, 'kl': 0.22216796875, 'epoch': 0.33}
33%|███▎ | 1401/4286 [10:45:40<64:16:06, 80.20s/it] {'loss': 0.0034, 'grad_norm': 1.8080495881756289, 'learning_rate': 6.731217918805413e-07, 'completion_length': 323.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.02380952751263976, 'kl': 0.084716796875, 'epoch': 0.33}
[2025-03-03 01:43:29,945] [WARNING] [stage3.py:2134:step] 3 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
33%|███▎ | 1402/4286 [10:46:07<51:26:57, 64.22s/it] {'loss': 0.0193, 'grad_norm': 14.469236292745052, 'learning_rate': 6.728884741017266e-07, 'completion_length': 294.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.8050595223903656, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7872024774551392, 'reward_std': 0.11119965091347694, 'kl': 0.482421875, 'epoch': 0.33}
33%|███▎ | 1403/4286 [10:46:34<42:22:08, 52.91s/it] {'loss': 0.0181, 'grad_norm': 2.4975279701095014, 'learning_rate': 6.726551563229118e-07, 'completion_length': 341.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7241072058677673, 'rewards/format_reward': 1.0, 'reward': 1.7241072058677673, 'reward_std': 0.03392856940627098, 'kl': 0.451171875, 'epoch': 0.33}
33%|███▎ | 1404/4286 [10:46:59<35:43:59, 44.64s/it] {'loss': 0.0048, 'grad_norm': 1.2717669802038314, 'learning_rate': 6.72421838544097e-07, 'completion_length': 289.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.8288690745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8110120296478271, 'reward_std': 0.07979566231369972, 'kl': 0.1202392578125, 'epoch': 0.33}
33%|███▎ | 1405/4286 [10:47:24<31:02:09, 38.78s/it] {'loss': 0.0028, 'grad_norm': 5.157418002566759, 'learning_rate': 6.721885207652823e-07, 'completion_length': 323.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 1.0, 'reward': 1.7514882683753967, 'reward_std': 0.050595229491591454, 'kl': 0.0703125, 'epoch': 0.33}
33%|███▎ | 1406/4286 [10:47:50<27:51:46, 34.83s/it] {'loss': 0.0041, 'grad_norm': 1.1957492033358514, 'learning_rate': 6.719552029864676e-07, 'completion_length': 307.625, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7857143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.1033935546875, 'epoch': 0.33}
33%|███▎ | 1407/4286 [10:48:14<25:27:31, 31.83s/it] {'loss': 0.0173, 'grad_norm': 1.9746562065611235, 'learning_rate': 6.717218852076528e-07, 'completion_length': 322.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041667461395264, 'reward_std': 0.0829059686511755, 'kl': 0.43505859375, 'epoch': 0.33}
[2025-03-03 01:46:04,115] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
33%|███▎ | 1408/4286 [10:48:41<24:13:53, 30.31s/it] {'loss': 0.0021, 'grad_norm': 2.575584675260256, 'learning_rate': 6.71488567428838e-07, 'completion_length': 296.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7005952596664429, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6648811101913452, 'reward_std': 0.08076310902833939, 'kl': 0.0526123046875, 'epoch': 0.33}
33%|███▎ | 1409/4286 [10:49:06<22:53:07, 28.64s/it] {'loss': 0.0076, 'grad_norm': 2.369143864241049, 'learning_rate': 6.712552496500234e-07, 'completion_length': 294.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8258929252624512, 'rewards/format_reward': 1.0, 'reward': 1.8258929252624512, 'reward_std': 0.06250000465661287, 'kl': 0.18896484375, 'epoch': 0.33}
33%|███▎ | 1410/4286 [10:49:30<21:49:59, 27.33s/it] {'loss': 0.008, 'grad_norm': 1.6102891552219798, 'learning_rate': 6.710219318712086e-07, 'completion_length': 305.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7354167103767395, 'rewards/format_reward': 1.0, 'reward': 1.735416829586029, 'reward_std': 0.005357143934816122, 'kl': 0.200927734375, 'epoch': 0.33}
33%|███▎ | 1411/4286 [10:49:54<21:00:02, 26.30s/it] {'loss': 0.0064, 'grad_norm': 1.8402053965466414, 'learning_rate': 6.707886140923938e-07, 'completion_length': 304.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7294643521308899, 'rewards/format_reward': 1.0, 'reward': 1.7294644117355347, 'reward_std': 0.02629610151052475, 'kl': 0.161376953125, 'epoch': 0.33}
33%|███▎ | 1412/4286 [10:50:20<20:58:49, 26.28s/it] {'loss': 0.0067, 'grad_norm': 3.971763443201261, 'learning_rate': 6.705552963135791e-07, 'completion_length': 317.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6264881640672684, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.06845238246023655, 'kl': 0.168701171875, 'epoch': 0.33}
33%|███▎ | 1413/4286 [10:50:46<20:53:49, 26.19s/it] {'loss': 0.0181, 'grad_norm': 3.287957427454435, 'learning_rate': 6.703219785347644e-07, 'completion_length': 289.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.6803572177886963, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6625001430511475, 'reward_std': 0.12261904776096344, 'kl': 0.451171875, 'epoch': 0.33}
33%|███▎ | 1414/4286 [10:51:13<20:55:56, 26.24s/it] {'loss': 0.0117, 'grad_norm': 4.9349133934022955, 'learning_rate': 6.700886607559496e-07, 'completion_length': 342.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6502976417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6324406266212463, 'reward_std': 0.10597576946020126, 'kl': 0.291015625, 'epoch': 0.33}
33%|███▎ | 1415/4286 [10:51:39<20:55:35, 26.24s/it] {'loss': 0.0244, 'grad_norm': 6.832902641220339, 'learning_rate': 6.698553429771349e-07, 'completion_length': 319.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6530448794364929, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6351879239082336, 'reward_std': 0.09595859982073307, 'kl': 0.6083984375, 'epoch': 0.33}
33%|███▎ | 1416/4286 [10:52:04<20:32:53, 25.77s/it] {'loss': 0.0174, 'grad_norm': 6.341055890343373, 'learning_rate': 6.696220251983201e-07, 'completion_length': 263.94644927978516, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7366072535514832, 'reward_std': 0.10648379102349281, 'kl': 0.4326171875, 'epoch': 0.33}
33%|███▎ | 1417/4286 [10:52:29<20:23:50, 25.59s/it] {'loss': 0.0299, 'grad_norm': 6.075004156045097, 'learning_rate': 6.693887074195054e-07, 'completion_length': 319.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.657738208770752, 'reward_std': 0.13548234105110168, 'kl': 0.7470703125, 'epoch': 0.33}
33%|███▎ | 1418/4286 [10:52:54<20:20:19, 25.53s/it] {'loss': 0.0111, 'grad_norm': 3.991753716233017, 'learning_rate': 6.691553896406906e-07, 'completion_length': 326.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.1041666716337204, 'kl': 0.2763671875, 'epoch': 0.33}
33%|███▎ | 1419/4286 [10:53:19<20:14:02, 25.41s/it] {'loss': 0.0519, 'grad_norm': 8.700438134452803, 'learning_rate': 6.689220718618759e-07, 'completion_length': 311.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7023810744285583, 'reward_std': 0.1142389103770256, 'kl': 1.29296875, 'epoch': 0.33}
33%|███▎ | 1420/4286 [10:53:46<20:28:30, 25.72s/it] {'loss': 0.0248, 'grad_norm': 2.8850057027028906, 'learning_rate': 6.686887540830611e-07, 'completion_length': 326.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.760416716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7425596117973328, 'reward_std': 0.0982142873108387, 'kl': 0.6201171875, 'epoch': 0.33}
33%|███▎ | 1421/4286 [10:54:09<19:56:49, 25.06s/it] {'loss': 0.0144, 'grad_norm': 20.927396264254654, 'learning_rate': 6.684554363042464e-07, 'completion_length': 281.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.041666666977107525, 'kl': 0.359375, 'epoch': 0.33}
33%|███▎ | 1422/4286 [10:54:33<19:39:54, 24.72s/it] {'loss': 0.0494, 'grad_norm': 4.598414223854135, 'learning_rate': 6.682221185254317e-07, 'completion_length': 281.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7588542103767395, 'rewards/format_reward': 1.0, 'reward': 1.7588542699813843, 'reward_std': 0.08249357715249062, 'kl': 1.232421875, 'epoch': 0.33}
33%|███▎ | 1423/4286 [10:54:58<19:38:23, 24.70s/it] {'loss': 0.0175, 'grad_norm': 5.841554464985433, 'learning_rate': 6.679888007466169e-07, 'completion_length': 308.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7175595760345459, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6997024416923523, 'reward_std': 0.1161002479493618, 'kl': 0.4365234375, 'epoch': 0.33}
33%|███▎ | 1424/4286 [10:55:24<19:54:09, 25.03s/it] {'loss': 0.0166, 'grad_norm': 2.1559964729938903, 'learning_rate': 6.677554829678021e-07, 'completion_length': 275.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.41650390625, 'epoch': 0.33}
33%|███▎ | 1425/4286 [10:55:49<19:51:33, 24.99s/it] {'loss': 0.0446, 'grad_norm': 8.18032240827003, 'learning_rate': 6.675221651889875e-07, 'completion_length': 316.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6895833909511566, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6538691520690918, 'reward_std': 0.2489754632115364, 'kl': 1.115234375, 'epoch': 0.33}
33%|███▎ | 1426/4286 [10:56:15<20:07:46, 25.34s/it] {'loss': 0.0085, 'grad_norm': 4.847474864851728, 'learning_rate': 6.672888474101727e-07, 'completion_length': 314.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7872024178504944, 'rewards/format_reward': 1.0, 'reward': 1.787202537059784, 'reward_std': 0.08143725246191025, 'kl': 0.2119140625, 'epoch': 0.33}
33%|███▎ | 1427/4286 [10:56:41<20:26:52, 25.75s/it] {'loss': 0.0063, 'grad_norm': 4.139090842996341, 'learning_rate': 6.670555296313579e-07, 'completion_length': 298.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.04761904664337635, 'kl': 0.1563720703125, 'epoch': 0.33}
33%|███▎ | 1428/4286 [10:57:06<20:06:15, 25.32s/it] {'loss': 0.003, 'grad_norm': 6.826753452337434, 'learning_rate': 6.668222118525431e-07, 'completion_length': 308.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7678572237491608, 'rewards/format_reward': 1.0, 'reward': 1.767857313156128, 'reward_std': 0.011904764920473099, 'kl': 0.0760498046875, 'epoch': 0.33}
33%|███▎ | 1429/4286 [10:57:32<20:13:32, 25.49s/it] {'loss': 0.0098, 'grad_norm': 2.235210029050278, 'learning_rate': 6.665888940737284e-07, 'completion_length': 271.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 1.0, 'reward': 1.5907739400863647, 'reward_std': 0.03792233765125275, 'kl': 0.246826171875, 'epoch': 0.33}
33%|███▎ | 1430/4286 [10:57:58<20:32:20, 25.89s/it] {'loss': 0.0143, 'grad_norm': 1.7750059370838391, 'learning_rate': 6.663555762949137e-07, 'completion_length': 300.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7857143580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.767857313156128, 'reward_std': 0.11266788095235825, 'kl': 0.357421875, 'epoch': 0.33}
33%|███▎ | 1431/4286 [10:58:24<20:24:01, 25.72s/it] {'loss': 0.0081, 'grad_norm': 27.83016560891457, 'learning_rate': 6.661222585160989e-07, 'completion_length': 281.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.791666716337204, 'rewards/format_reward': 1.0, 'reward': 1.7916668057441711, 'reward_std': 0.0476190485060215, 'kl': 0.20263671875, 'epoch': 0.33}
33%|███▎ | 1432/4286 [10:58:48<19:58:35, 25.20s/it] {'loss': 0.0052, 'grad_norm': 3.5475250526347795, 'learning_rate': 6.658889407372842e-07, 'completion_length': 316.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.758928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7410715818405151, 'reward_std': 0.11208159849047661, 'kl': 0.1298828125, 'epoch': 0.33}
33%|███▎ | 1433/4286 [10:59:14<20:14:06, 25.53s/it] {'loss': 0.0057, 'grad_norm': 3.9308346952735134, 'learning_rate': 6.656556229584694e-07, 'completion_length': 343.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.8014611303806305, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7657468914985657, 'reward_std': 0.056018401868641376, 'kl': 0.14208984375, 'epoch': 0.33}
33%|███▎ | 1434/4286 [10:59:39<20:01:15, 25.27s/it] {'loss': 0.0187, 'grad_norm': 2.409399896496872, 'learning_rate': 6.654223051796547e-07, 'completion_length': 293.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.6735119521617889, 'rewards/format_reward': 1.0, 'reward': 1.6735119819641113, 'reward_std': 0.05582009721547365, 'kl': 0.46728515625, 'epoch': 0.33}
33%|███▎ | 1435/4286 [11:00:04<20:05:28, 25.37s/it] {'loss': 0.0022, 'grad_norm': 2.030071226993537, 'learning_rate': 6.6518898740084e-07, 'completion_length': 323.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.643750011920929, 'rewards/format_reward': 1.0, 'reward': 1.6437500715255737, 'reward_std': 0.06350706145167351, 'kl': 0.0540771484375, 'epoch': 0.33}
34%|███▎ | 1436/4286 [11:00:30<20:05:20, 25.38s/it] {'loss': 0.0097, 'grad_norm': 1.776250239974365, 'learning_rate': 6.649556696220252e-07, 'completion_length': 290.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.10671549290418625, 'kl': 0.24169921875, 'epoch': 0.34}
[2025-03-03 01:58:19,415] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
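Note: across the steps logged here, 'loss' tracks 0.04 * 'kl' almost exactly (e.g. step 1436: 0.04 * 0.24169921875 = 0.0097; step 1444 below: 0.04 * 0.80078125 = 0.032). That pattern is consistent with an objective dominated by a KL penalty with coefficient beta = 0.04, which happens to be the default beta in TRL's GRPOTrainer -- though the trainer is an assumption; the log itself never names it. A quick illustrative check:

    # Values copied from two metric dicts in this log; the assertion holds for both.
    for step, (loss, kl) in {1436: (0.0097, 0.24169921875),
                             1444: (0.0320, 0.80078125)}.items():
        assert abs(loss - 0.04 * kl) < 1e-3, step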
34%|███▎ | 1437/4286 [11:00:56<20:25:33, 25.81s/it] {'loss': 0.0075, 'grad_norm': 0.8568910459164732, 'learning_rate': 6.647223518432104e-07, 'completion_length': 285.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8883928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8705357909202576, 'reward_std': 0.08219882100820541, 'kl': 0.18701171875, 'epoch': 0.34}
34%|███▎ | 1438/4286 [11:01:21<20:01:21, 25.31s/it] {'loss': 0.003, 'grad_norm': 1.1185626266114566, 'learning_rate': 6.644890340643958e-07, 'completion_length': 282.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.8556548058986664, 'rewards/format_reward': 1.0, 'reward': 1.8556548357009888, 'reward_std': 0.07029405236244202, 'kl': 0.07470703125, 'epoch': 0.34}
34%|███▎ | 1439/4286 [11:01:45<19:44:58, 24.97s/it] {'loss': 0.0113, 'grad_norm': 6.622219034199977, 'learning_rate': 6.64255716285581e-07, 'completion_length': 294.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7395834177732468, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.0799297746270895, 'kl': 0.28125, 'epoch': 0.34}
34%|███▎ | 1440/4286 [11:02:08<19:25:13, 24.57s/it] {'loss': 0.003, 'grad_norm': 0.5771309169784387, 'learning_rate': 6.640223985067662e-07, 'completion_length': 249.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.6428571790456772, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.0357142873108387, 'kl': 0.0750732421875, 'epoch': 0.34}
34%|███▎ | 1441/4286 [11:02:34<19:32:16, 24.72s/it] {'loss': 0.0132, 'grad_norm': 3.631146328347155, 'learning_rate': 6.637890807279514e-07, 'completion_length': 331.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.5940476655960083, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5761905908584595, 'reward_std': 0.0828437302261591, 'kl': 0.3291015625, 'epoch': 0.34}
34%|███▎ | 1442/4286 [11:02:58<19:28:07, 24.64s/it] {'loss': 0.007, 'grad_norm': 8.299260437619425, 'learning_rate': 6.635557629491368e-07, 'completion_length': 304.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.06547619216144085, 'kl': 0.1767578125, 'epoch': 0.34}
34%|███▎ | 1443/4286 [11:03:24<19:47:54, 25.07s/it] {'loss': 0.0157, 'grad_norm': 4.607471184319958, 'learning_rate': 6.63322445170322e-07, 'completion_length': 339.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7514881193637848, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.09957961365580559, 'kl': 0.39111328125, 'epoch': 0.34}
34%|███▎ | 1444/4286 [11:03:50<19:59:09, 25.32s/it] {'loss': 0.032, 'grad_norm': 18.98661165943654, 'learning_rate': 6.630891273915072e-07, 'completion_length': 332.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.4764881432056427, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4586310386657715, 'reward_std': 0.12090916559100151, 'kl': 0.80078125, 'epoch': 0.34}
34%|███▎ | 1445/4286 [11:04:14<19:42:17, 24.97s/it] {'loss': 0.009, 'grad_norm': 8.655128488571554, 'learning_rate': 6.628558096126925e-07, 'completion_length': 275.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.52976194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5119048357009888, 'reward_std': 0.08659757021814585, 'kl': 0.2257080078125, 'epoch': 0.34}
34%|███▎ | 1446/4286 [11:04:39<19:45:32, 25.05s/it] {'loss': 0.0169, 'grad_norm': 3.6825748858830623, 'learning_rate': 6.626224918338778e-07, 'completion_length': 320.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.636904776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6190477013587952, 'reward_std': 0.14016074687242508, 'kl': 0.4228515625, 'epoch': 0.34}
34%|███▍ | 1447/4286 [11:05:04<19:45:22, 25.05s/it] {'loss': 0.0376, 'grad_norm': 4.787785140374847, 'learning_rate': 6.62389174055063e-07, 'completion_length': 305.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.5818452537059784, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.563988208770752, 'reward_std': 0.11890842020511627, 'kl': 0.94140625, 'epoch': 0.34}
34%|███▍ | 1448/4286 [11:05:31<20:01:39, 25.41s/it] {'loss': 0.0131, 'grad_norm': 21.241943363547765, 'learning_rate': 6.621558562762483e-07, 'completion_length': 328.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.8398809731006622, 'rewards/format_reward': 1.0, 'reward': 1.8398810625076294, 'reward_std': 0.03888125531375408, 'kl': 0.32586669921875, 'epoch': 0.34}
34%|███▍ | 1449/4286 [11:05:59<20:38:17, 26.19s/it] {'loss': 0.0438, 'grad_norm': 6.474868347823434, 'learning_rate': 6.619225384974335e-07, 'completion_length': 336.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.6436012089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6257442235946655, 'reward_std': 0.11809826269745827, 'kl': 1.0947265625, 'epoch': 0.34}
[2025-03-03 02:03:46,395] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
34%|███▍ | 1450/4286 [11:06:23<20:18:38, 25.78s/it] {'loss': 0.0069, 'grad_norm': 40.93312958258147, 'learning_rate': 6.616892207186187e-07, 'completion_length': 271.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6549745500087738, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.619260311126709, 'reward_std': 0.17495639249682426, 'kl': 0.17138671875, 'epoch': 0.34}
34%|███▍ | 1451/4286 [11:06:49<20:17:36, 25.77s/it] {'loss': 0.016, 'grad_norm': 4.12306590839641, 'learning_rate': 6.61455902939804e-07, 'completion_length': 335.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.6636905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.010309826582670212, 'kl': 0.400390625, 'epoch': 0.34}
34%|███▍ | 1452/4286 [11:07:14<19:59:20, 25.39s/it] {'loss': 0.0038, 'grad_norm': 3.6180611227300377, 'learning_rate': 6.612225851609893e-07, 'completion_length': 300.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8303571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8303572535514832, 'reward_std': 0.06388125568628311, 'kl': 0.094970703125, 'epoch': 0.34}
34%|███▍ | 1453/4286 [11:07:39<19:56:36, 25.34s/it] {'loss': 0.0294, 'grad_norm': 2.408421340816735, 'learning_rate': 6.609892673821745e-07, 'completion_length': 307.89288330078125, 'rewards/only_full_func_accuracy_reward': 0.70783731341362, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6721230745315552, 'reward_std': 0.11111470405012369, 'kl': 0.73486328125, 'epoch': 0.34}
34%|███▍ | 1454/4286 [11:08:06<20:15:01, 25.74s/it] {'loss': 0.0069, 'grad_norm': 1.4935392008144635, 'learning_rate': 6.607559496033597e-07, 'completion_length': 276.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7723215222358704, 'reward_std': 0.04740536957979202, 'kl': 0.173583984375, 'epoch': 0.34}
34%|███▍ | 1455/4286 [11:08:31<20:16:23, 25.78s/it] {'loss': 0.0157, 'grad_norm': 13.809560583479552, 'learning_rate': 6.605226318245451e-07, 'completion_length': 322.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5967262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5788691639900208, 'reward_std': 0.06558204162865877, 'kl': 0.3916015625, 'epoch': 0.34}
34%|███▍ | 1456/4286 [11:08:58<20:24:50, 25.97s/it] {'loss': 0.0078, 'grad_norm': 15.122751759571472, 'learning_rate': 6.602893140457303e-07, 'completion_length': 291.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6648809909820557, 'rewards/format_reward': 1.0, 'reward': 1.6648810505867004, 'reward_std': 0.08586078137159348, 'kl': 0.195556640625, 'epoch': 0.34}
[2025-03-03 02:06:45,387] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
34%|███▍ | 1457/4286 [11:09:22<20:04:30, 25.55s/it] {'loss': 0.0046, 'grad_norm': 2.1898050456363825, 'learning_rate': 6.600559962669155e-07, 'completion_length': 277.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6428571343421936, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.095238097012043, 'kl': 0.11474609375, 'epoch': 0.34}
[2025-03-03 02:07:11,594] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
34%|███▍ | 1458/4286 [11:09:49<20:13:26, 25.74s/it] {'loss': 0.008, 'grad_norm': 0.7121980616739031, 'learning_rate': 6.598226784881008e-07, 'completion_length': 306.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7440477013587952, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.0, 'kl': 0.199462890625, 'epoch': 0.34}
34%|███▍ | 1459/4286 [11:10:14<20:09:49, 25.68s/it] {'loss': 0.0136, 'grad_norm': 1.8147947746576172, 'learning_rate': 6.595893607092861e-07, 'completion_length': 282.44644927978516, 'rewards/only_full_func_accuracy_reward': 0.7931548655033112, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.77529776096344, 'reward_std': 0.06845238246023655, 'kl': 0.338623046875, 'epoch': 0.34}
[2025-03-03 02:08:02,602] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
34%|███▍ | 1460/4286 [11:10:40<20:06:43, 25.62s/it] {'loss': 0.0032, 'grad_norm': 4.025434236422631, 'learning_rate': 6.593560429304713e-07, 'completion_length': 331.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6931548118591309, 'rewards/format_reward': 1.0, 'reward': 1.6931548118591309, 'reward_std': 0.07189898937940598, 'kl': 0.079833984375, 'epoch': 0.34}
34%|███▍ | 1461/4286 [11:11:06<20:14:30, 25.79s/it] {'loss': 0.022, 'grad_norm': 3.3228593898401138, 'learning_rate': 6.591227251516566e-07, 'completion_length': 313.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7247024476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7068453431129456, 'reward_std': 0.13073870167136192, 'kl': 0.55078125, 'epoch': 0.34}
34%|███▍ | 1462/4286 [11:11:31<20:02:07, 25.54s/it] {'loss': 0.0034, 'grad_norm': 9.78196212879438, 'learning_rate': 6.588894073728418e-07, 'completion_length': 326.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6735119521617889, 'rewards/format_reward': 1.0, 'reward': 1.6735119819641113, 'reward_std': 0.012681029038503766, 'kl': 0.084716796875, 'epoch': 0.34}
34%|███▍ | 1463/4286 [11:11:55<19:36:37, 25.01s/it] {'loss': 0.0066, 'grad_norm': 1.242366015297605, 'learning_rate': 6.586560895940271e-07, 'completion_length': 271.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.10293934494256973, 'kl': 0.165771484375, 'epoch': 0.34}
[2025-03-03 02:09:42,127] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
34%|███▍ | 1464/4286 [11:12:19<19:30:35, 24.89s/it] {'loss': 0.0014, 'grad_norm': 6.04426018029304, 'learning_rate': 6.584227718152123e-07, 'completion_length': 321.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7247024476528168, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.05059524439275265, 'kl': 0.0361328125, 'epoch': 0.34}
34%|███▍ | 1465/4286 [11:12:44<19:25:59, 24.80s/it] {'loss': 0.0038, 'grad_norm': 0.60035471839938, 'learning_rate': 6.581894540363976e-07, 'completion_length': 309.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261906266212463, 'reward_std': 0.020619653165340424, 'kl': 0.095458984375, 'epoch': 0.34}
34%|███▍ | 1466/4286 [11:13:09<19:27:12, 24.83s/it] {'loss': 0.0102, 'grad_norm': 2.346857725870908, 'learning_rate': 6.579561362575828e-07, 'completion_length': 298.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.7428571879863739, 'rewards/format_reward': 1.0, 'reward': 1.742857277393341, 'reward_std': 0.07889259047806263, 'kl': 0.2548828125, 'epoch': 0.34}
34%|███▍ | 1467/4286 [11:13:32<19:06:40, 24.41s/it] {'loss': 0.0025, 'grad_norm': 4.485727425026551, 'learning_rate': 6.577228184787681e-07, 'completion_length': 259.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7175595760345459, 'rewards/format_reward': 1.0, 'reward': 1.7175596952438354, 'reward_std': 0.070405974984169, 'kl': 0.0615234375, 'epoch': 0.34}
34%|███▍ | 1468/4286 [11:13:57<19:09:10, 24.47s/it] {'loss': 0.0012, 'grad_norm': 6.4837489749502675, 'learning_rate': 6.574895006999534e-07, 'completion_length': 307.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.06345389038324356, 'kl': 0.02984619140625, 'epoch': 0.34}
34%|███▍ | 1469/4286 [11:14:22<19:18:19, 24.67s/it] {'loss': 0.0038, 'grad_norm': 2.731758608350112, 'learning_rate': 6.572561829211386e-07, 'completion_length': 308.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592263579368591, 'reward_std': 0.020326515659689903, 'kl': 0.0938720703125, 'epoch': 0.34}
34%|███▍ | 1470/4286 [11:14:46<19:03:12, 24.36s/it] {'loss': 0.0018, 'grad_norm': 3.177563900223841, 'learning_rate': 6.570228651423238e-07, 'completion_length': 293.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5625000447034836, 'rewards/format_reward': 1.0, 'reward': 1.5625001192092896, 'reward_std': 0.01785714365541935, 'kl': 0.0458984375, 'epoch': 0.34}
34%|███▍ | 1471/4286 [11:15:11<19:13:57, 24.60s/it] {'loss': 0.0013, 'grad_norm': 0.6307534822629127, 'learning_rate': 6.567895473635092e-07, 'completion_length': 319.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 1.0, 'reward': 1.6443453431129456, 'reward_std': 0.01580178737640381, 'kl': 0.03216552734375, 'epoch': 0.34}
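Note: in every metrics dict in this log, 'reward' is the sum of the two components 'rewards/only_full_func_accuracy_reward' and 'rewards/format_reward' (up to float32 rounding), i.e. the total reward combines an answer-accuracy term with a format-compliance term. Illustrated with values copied from the step-1471 dict above:

    # Values copied from the step-1471 dict; the sum reproduces 'reward'
    # up to float32 rounding.
    acc = 0.6443452835083008   # rewards/only_full_func_accuracy_reward
    fmt = 1.0                  # rewards/format_reward
    print(acc + fmt)           # 1.6443452835083008, vs logged 'reward' 1.6443453431129456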
34%|███▍ | 1472/4286 [11:15:35<19:11:05, 24.54s/it] {'loss': 0.0011, 'grad_norm': 0.44709279715554395, 'learning_rate': 6.565562295846944e-07, 'completion_length': 274.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244048953056335, 'reward_std': 0.02405625954270363, 'kl': 0.02691650390625, 'epoch': 0.34}
34%|███▍ | 1473/4286 [11:15:59<18:59:21, 24.30s/it] {'loss': 0.0023, 'grad_norm': 2.811797980834645, 'learning_rate': 6.563229118058796e-07, 'completion_length': 305.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7767857909202576, 'reward_std': 0.04761905036866665, 'kl': 0.0584716796875, 'epoch': 0.34}
34%|███▍ | 1474/4286 [11:16:25<19:26:17, 24.89s/it] {'loss': 0.0016, 'grad_norm': 4.3976593052803, 'learning_rate': 6.560895940270648e-07, 'completion_length': 327.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7193452715873718, 'rewards/format_reward': 1.0, 'reward': 1.7193452715873718, 'reward_std': 0.0709925964474678, 'kl': 0.0390625, 'epoch': 0.34}
[2025-03-03 02:14:13,126] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
34%|███▍ | 1475/4286 [11:16:50<19:29:30, 24.96s/it] {'loss': 0.0028, 'grad_norm': 0.1809907453044085, 'learning_rate': 6.558562762482502e-07, 'completion_length': 260.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.636904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6369048953056335, 'reward_std': 0.0, 'kl': 0.0699462890625, 'epoch': 0.34}
34%|███▍ | 1476/4286 [11:17:16<19:34:03, 25.07s/it] {'loss': 0.0019, 'grad_norm': 6.318351201494396, 'learning_rate': 6.556229584694354e-07, 'completion_length': 305.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.735119104385376, 'reward_std': 0.07216878980398178, 'kl': 0.0479736328125, 'epoch': 0.34}
[2025-03-03 02:15:04,627] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
34%|███▍ | 1477/4286 [11:17:42<19:49:18, 25.40s/it] {'loss': 0.0042, 'grad_norm': 10.874910337006424, 'learning_rate': 6.553896406906206e-07, 'completion_length': 336.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053571939468384, 'reward_std': 0.06388125941157341, 'kl': 0.1044921875, 'epoch': 0.34}
34%|███▍ | 1478/4286 [11:18:07<19:42:12, 25.26s/it] {'loss': 0.0012, 'grad_norm': 0.45104854928695876, 'learning_rate': 6.551563229118059e-07, 'completion_length': 292.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.02380952052772045, 'kl': 0.030029296875, 'epoch': 0.34}
[2025-03-03 02:15:54,404] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
35%|███▍ | 1479/4286 [11:18:31<19:35:59, 25.14s/it] {'loss': 0.0056, 'grad_norm': 5.5337095054139205, 'learning_rate': 6.549230051329911e-07, 'completion_length': 329.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.6666666567325592, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.013746436685323715, 'kl': 0.13916015625, 'epoch': 0.35}
35%|███▍ | 1480/4286 [11:18:58<19:49:45, 25.44s/it] {'loss': 0.0025, 'grad_norm': 0.8376968891093267, 'learning_rate': 6.546896873541764e-07, 'completion_length': 284.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7827381789684296, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.05357143096625805, 'kl': 0.0616455078125, 'epoch': 0.35}
35%|███▍ | 1481/4286 [11:19:23<19:45:41, 25.36s/it] {'loss': 0.0021, 'grad_norm': 0.26084539561038084, 'learning_rate': 6.544563695753617e-07, 'completion_length': 323.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7366071939468384, 'rewards/format_reward': 1.0, 'reward': 1.7366072535514832, 'reward_std': 0.01580178737640381, 'kl': 0.0516357421875, 'epoch': 0.35}
[2025-03-03 02:17:10,451] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
35%|███▍ | 1482/4286 [11:19:48<19:36:14, 25.17s/it] {'loss': 0.0034, 'grad_norm': 1.0132694948977028, 'learning_rate': 6.542230517965469e-07, 'completion_length': 296.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 1.0, 'reward': 1.7485119700431824, 'reward_std': 0.008928571827709675, 'kl': 0.0838623046875, 'epoch': 0.35}
35%|███▍ | 1483/4286 [11:20:13<19:37:03, 25.20s/it] {'loss': 0.0014, 'grad_norm': 7.499901496647147, 'learning_rate': 6.539897340177321e-07, 'completion_length': 289.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.06504883244633675, 'kl': 0.0361328125, 'epoch': 0.35}
35%|███▍ | 1484/4286 [11:20:38<19:38:09, 25.23s/it] {'loss': 0.0045, 'grad_norm': 1.149999192941412, 'learning_rate': 6.537564162389175e-07, 'completion_length': 300.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7217262089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7038691639900208, 'reward_std': 0.06177530437707901, 'kl': 0.113037109375, 'epoch': 0.35}
35%|███▍ | 1485/4286 [11:21:02<19:19:16, 24.83s/it] {'loss': 0.0041, 'grad_norm': 2.025582807469172, 'learning_rate': 6.535230984601027e-07, 'completion_length': 288.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.08517500758171082, 'kl': 0.1031494140625, 'epoch': 0.35}
[2025-03-03 02:18:50,295] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
35%|███▍ | 1486/4286 [11:21:27<19:26:26, 25.00s/it] {'loss': 0.0051, 'grad_norm': 8.28859820623147, 'learning_rate': 6.532897806812879e-07, 'completion_length': 289.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7380953431129456, 'reward_std': 0.10512056574225426, 'kl': 0.1280517578125, 'epoch': 0.35}
[2025-03-03 02:19:17,486] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
35%|███▍ | 1487/4286 [11:21:55<19:56:44, 25.65s/it] {'loss': 0.0019, 'grad_norm': 1.127786433742057, 'learning_rate': 6.530564629024731e-07, 'completion_length': 332.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.014580297283828259, 'kl': 0.048583984375, 'epoch': 0.35}
[2025-03-03 02:19:40,656] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
35%|███▍ | 1488/4286 [11:22:18<19:21:34, 24.91s/it] {'loss': 0.0029, 'grad_norm': 13.703806325753465, 'learning_rate': 6.528231451236585e-07, 'completion_length': 285.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.8139881491661072, 'rewards/format_reward': 1.0, 'reward': 1.813988208770752, 'reward_std': 0.026785715483129025, 'kl': 0.0732421875, 'epoch': 0.35}
35%|███▍ | 1489/4286 [11:22:43<19:29:54, 25.10s/it] {'loss': 0.0031, 'grad_norm': 2.26689317903543, 'learning_rate': 6.525898273448437e-07, 'completion_length': 311.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.0650488305836916, 'kl': 0.0780029296875, 'epoch': 0.35}
[2025-03-03 02:20:32,104] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
35%|███▍ | 1490/4286 [11:23:09<19:40:55, 25.34s/it] {'loss': 0.0214, 'grad_norm': 2.3142807981185163, 'learning_rate': 6.523565095660289e-07, 'completion_length': 301.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.7464286088943481, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.728571593761444, 'reward_std': 0.10546423494815826, 'kl': 0.53466796875, 'epoch': 0.35}
35%|███▍ | 1491/4286 [11:23:34<19:37:51, 25.29s/it] {'loss': 0.0039, 'grad_norm': 5.151456513361863, 'learning_rate': 6.521231917872142e-07, 'completion_length': 293.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.6547619998455048, 'rewards/format_reward': 1.0, 'reward': 1.654762089252472, 'reward_std': 0.06067970208823681, 'kl': 0.09619140625, 'epoch': 0.35}
35%|███▍ | 1492/4286 [11:23:59<19:35:40, 25.25s/it] {'loss': 0.0083, 'grad_norm': 3.3192287645718426, 'learning_rate': 6.518898740083995e-07, 'completion_length': 319.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7708333432674408, 'rewards/format_reward': 1.0, 'reward': 1.7708335518836975, 'reward_std': 0.06596073880791664, 'kl': 0.2066650390625, 'epoch': 0.35}
35%|███▍ | 1493/4286 [11:24:23<19:16:30, 24.84s/it] {'loss': 0.0056, 'grad_norm': 3.760995586910472, 'learning_rate': 6.516565562295847e-07, 'completion_length': 283.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.877976268529892, 'rewards/format_reward': 1.0, 'reward': 1.8779762983322144, 'reward_std': 0.0357142873108387, 'kl': 0.1402587890625, 'epoch': 0.35}
35%|███▍ | 1494/4286 [11:24:48<19:15:10, 24.82s/it] {'loss': 0.0241, 'grad_norm': 1.9271186151880944, 'learning_rate': 6.5142323845077e-07, 'completion_length': 285.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.044741734862327576, 'kl': 0.603515625, 'epoch': 0.35}
35%|███▍ | 1495/4286 [11:25:13<19:10:15, 24.73s/it] {'loss': 0.0124, 'grad_norm': 2.631844105788816, 'learning_rate': 6.511899206719552e-07, 'completion_length': 288.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.0744378250092268, 'kl': 0.3092041015625, 'epoch': 0.35}
35%|███▍ | 1496/4286 [11:25:36<18:55:01, 24.41s/it] {'loss': 0.002, 'grad_norm': 0.5656922891376851, 'learning_rate': 6.509566028931405e-07, 'completion_length': 298.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.8422619700431824, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.005952383857220411, 'kl': 0.05029296875, 'epoch': 0.35}
35%|███▍ | 1497/4286 [11:25:59<18:32:09, 23.93s/it] {'loss': 0.0256, 'grad_norm': 3.65361354341941, 'learning_rate': 6.507232851143257e-07, 'completion_length': 277.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6863095164299011, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6684524416923523, 'reward_std': 0.10522664152085781, 'kl': 0.640625, 'epoch': 0.35}
'epoch': 0.35} 35%|███▍ | 1497/4286 [11:25:59<18:32:09, 23.93s/it] 35%|███▍ | 1498/4286 [11:26:25<19:01:08, 24.56s/it] {'loss': 0.0043, 'grad_norm': 3.696335385747372, 'learning_rate': 6.50489967335511e-07, 'completion_length': 314.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6443454027175903, 'reward_std': 0.13417532108724117, 'kl': 0.107666015625, 'epoch': 0.35} 35%|███▍ | 1498/4286 [11:26:25<19:01:08, 24.56s/it] 35%|███▍ | 1499/4286 [11:26:52<19:28:59, 25.17s/it] {'loss': 0.0152, 'grad_norm': 6.268987148272501, 'learning_rate': 6.502566495566962e-07, 'completion_length': 298.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6884566843509674, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6527424454689026, 'reward_std': 0.18843808770179749, 'kl': 0.3818359375, 'epoch': 0.35} 35%|███▍ | 1499/4286 [11:26:52<19:28:59, 25.17s/it] 35%|███▍ | 1500/4286 [11:27:16<19:20:20, 24.99s/it] {'loss': 0.0108, 'grad_norm': 2.5232462562522073, 'learning_rate': 6.500233317778814e-07, 'completion_length': 284.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.680059552192688, 'rewards/format_reward': 1.0, 'reward': 1.6800596714019775, 'reward_std': 0.044642859138548374, 'kl': 0.270263671875, 'epoch': 0.35} 35%|███▍ | 1500/4286 [11:27:16<19:20:20, 24.99s/it] 35%|███▌ | 1501/4286 [11:31:19<69:57:43, 90.44s/it] {'loss': 0.0028, 'grad_norm': 1.9183407753897574, 'learning_rate': 6.497900139990668e-07, 'completion_length': 271.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.6324405074119568, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.058389291167259216, 'kl': 0.0689697265625, 'epoch': 0.35} 35%|███▌ | 1501/4286 [11:31:19<69:57:43, 90.44s/it][2025-03-03 02:29:06,797] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 35%|███▌ | 1502/4286 [11:31:44<54:36:53, 70.62s/it] {'loss': 0.0094, 'grad_norm': 1.7938647172630293, 'learning_rate': 6.49556696220252e-07, 'completion_length': 283.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.65476194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6369048953056335, 'reward_std': 0.0535714328289032, 'kl': 0.2353515625, 'epoch': 0.35} 35%|███▌ | 1502/4286 [11:31:44<54:36:53, 70.62s/it] 35%|███▌ | 1503/4286 [11:32:08<43:55:12, 56.81s/it] {'loss': 0.0221, 'grad_norm': 10.767711961681089, 'learning_rate': 6.493233784414372e-07, 'completion_length': 315.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7455358505249023, 'reward_std': 0.11791310459375381, 'kl': 0.5537109375, 'epoch': 0.35} 35%|███▌ | 1503/4286 [11:32:08<43:55:12, 56.81s/it] 35%|███▌ | 1504/4286 [11:32:34<36:32:56, 47.30s/it] {'loss': 0.0154, 'grad_norm': 3.7893543251130932, 'learning_rate': 6.490900606626225e-07, 'completion_length': 302.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6577381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6398810148239136, 'reward_std': 0.10771035961806774, 'kl': 0.385498046875, 'epoch': 0.35} 35%|███▌ | 1504/4286 [11:32:34<36:32:56, 47.30s/it] 35%|███▌ | 1505/4286 [11:33:00<31:42:35, 41.05s/it] {'loss': 0.0122, 'grad_norm': 5.532097067146918, 'learning_rate': 6.488567428838078e-07, 'completion_length': 325.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.754464328289032, 'reward_std': 0.09904232248663902, 'kl': 0.3037109375, 'epoch': 0.35} 35%|███▌ | 1505/4286 [11:33:00<31:42:35, 41.05s/it] 35%|███▌ | 1506/4286 [11:33:24<27:47:32, 35.99s/it] {'loss': 0.0086, 'grad_norm': 3.489639095689171, 'learning_rate': 6.48623425104993e-07, 'completion_length': 316.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6294642984867096, 'rewards/format_reward': 1.0, 'reward': 1.6294643878936768, 'reward_std': 0.04958159103989601, 'kl': 0.2158203125, 'epoch': 0.35} 35%|███▌ | 1506/4286 [11:33:24<27:47:32, 35.99s/it] 35%|███▌ | 1507/4286 [11:33:49<25:17:53, 32.77s/it] {'loss': 0.0145, 'grad_norm': 1.4246696660846818, 'learning_rate': 6.483901073261783e-07, 'completion_length': 301.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.6215986907482147, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6037415862083435, 'reward_std': 0.04251701012253761, 'kl': 0.36328125, 'epoch': 0.35} 35%|███▌ | 1507/4286 [11:33:49<25:17:53, 32.77s/it][2025-03-03 02:31:36,952] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 35%|███▌ | 1508/4286 [11:34:14<23:23:10, 30.31s/it] {'loss': 0.0163, 'grad_norm': 9.089659679472291, 'learning_rate': 6.481567895473635e-07, 'completion_length': 280.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.6696428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.633928656578064, 'reward_std': 0.09045329131186008, 'kl': 0.4063720703125, 'epoch': 0.35} 35%|███▌ | 1508/4286 [11:34:14<23:23:10, 30.31s/it][2025-03-03 02:32:02,991] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 35%|███▌ | 1509/4286 [11:34:40<22:23:25, 29.03s/it] {'loss': 0.0141, 'grad_norm': 4.369740961300113, 'learning_rate': 6.479234717685488e-07, 'completion_length': 305.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7247024476528168, 'rewards/format_reward': 1.0, 'reward': 1.724702537059784, 'reward_std': 0.04556369874626398, 'kl': 0.3525390625, 'epoch': 0.35} 35%|███▌ | 1509/4286 [11:34:40<22:23:25, 29.03s/it] 35%|███▌ | 1510/4286 [11:35:05<21:20:47, 27.68s/it] {'loss': 0.0414, 'grad_norm': 8.143914904420766, 'learning_rate': 6.47690153989734e-07, 'completion_length': 311.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.4836309850215912, 'rewards/format_reward': 1.0, 'reward': 1.4836310148239136, 'reward_std': 0.07820397801697254, 'kl': 1.037109375, 'epoch': 0.35} 35%|███▌ | 1510/4286 [11:35:05<21:20:47, 27.68s/it] 35%|███▌ | 1511/4286 [11:35:31<21:07:09, 27.40s/it] {'loss': 0.0193, 'grad_norm': 5.828017898009417, 'learning_rate': 6.474568362109193e-07, 'completion_length': 313.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7023810744285583, 'reward_std': 0.11715288832783699, 'kl': 0.482421875, 'epoch': 0.35} 35%|███▌ | 1511/4286 [11:35:31<21:07:09, 27.40s/it] 35%|███▌ | 1512/4286 [11:35:57<20:37:28, 26.77s/it] {'loss': 0.0132, 'grad_norm': 1.4825796029646476, 'learning_rate': 6.472235184321045e-07, 'completion_length': 273.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7678572535514832, 'reward_std': 0.10714286379516125, 'kl': 0.3323974609375, 'epoch': 0.35} 35%|███▌ | 1512/4286 [11:35:57<20:37:28, 26.77s/it][2025-03-03 02:33:47,113] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
 35%|███▌ | 1513/4286 [11:36:24<20:47:52, 27.00s/it] {'loss': 0.0323, 'grad_norm': 11.109302003350253, 'learning_rate': 6.469902006532898e-07, 'completion_length': 286.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 1.0, 'reward': 1.633928656578064, 'reward_std': 0.06731786299496889, 'kl': 0.80810546875, 'epoch': 0.35}
 35%|███▌ | 1514/4286 [11:36:51<20:51:23, 27.09s/it] {'loss': 0.0147, 'grad_norm': 10.459495082378043, 'learning_rate': 6.467568828744751e-07, 'completion_length': 281.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6413690745830536, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.08035714365541935, 'kl': 0.3670654296875, 'epoch': 0.35}
[2025-03-03 02:34:40,804] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 35%|███▌ | 1515/4286 [11:37:18<20:41:30, 26.88s/it] {'loss': 0.0466, 'grad_norm': 4.018119128200975, 'learning_rate': 6.465235650956603e-07, 'completion_length': 252.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5036139786243439, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.467899739742279, 'reward_std': 0.1948356181383133, 'kl': 1.166015625, 'epoch': 0.35}
[2025-03-03 02:35:05,874] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 35%|███▌ | 1516/4286 [11:37:43<20:15:57, 26.34s/it] {'loss': 0.0412, 'grad_norm': 4.78663699136112, 'learning_rate': 6.462902473168455e-07, 'completion_length': 303.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7589287161827087, 'reward_std': 0.0773809477686882, 'kl': 1.0283203125, 'epoch': 0.35}
 35%|███▌ | 1517/4286 [11:38:09<20:12:03, 26.26s/it] {'loss': 0.0649, 'grad_norm': 14.804691174757293, 'learning_rate': 6.460569295380309e-07, 'completion_length': 319.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6995748281478882, 'rewards/format_reward': 0.910714328289032, 'reward': 1.6102891564369202, 'reward_std': 0.21901244670152664, 'kl': 1.625, 'epoch': 0.35}
[2025-03-03 02:35:57,754] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 35%|███▌ | 1518/4286 [11:38:35<20:05:05, 26.12s/it] {'loss': 0.0239, 'grad_norm': 3.50686070833727, 'learning_rate': 6.458236117592161e-07, 'completion_length': 278.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6339287161827087, 'reward_std': 0.04166667256504297, 'kl': 0.595703125, 'epoch': 0.35}
 35%|███▌ | 1519/4286 [11:38:59<19:39:28, 25.58s/it] {'loss': 0.017, 'grad_norm': 22.55580996051519, 'learning_rate': 6.455902939804013e-07, 'completion_length': 291.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6502976417541504, 'rewards/format_reward': 1.0, 'reward': 1.65029776096344, 'reward_std': 0.055622491985559464, 'kl': 0.42626953125, 'epoch': 0.35}
[2025-03-03 02:36:45,231] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 35%|███▌ | 1520/4286 [11:39:22<19:05:50, 24.86s/it] {'loss': 0.0403, 'grad_norm': 2.2744514060778367, 'learning_rate': 6.453569762015865e-07, 'completion_length': 264.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.8169642984867096, 'rewards/format_reward': 1.0, 'reward': 1.8169643878936768, 'reward_std': 0.0505952388048172, 'kl': 1.0107421875, 'epoch': 0.35}
 35%|███▌ | 1521/4286 [11:39:48<19:12:50, 25.02s/it] {'loss': 0.0425, 'grad_norm': 5.154504497129337, 'learning_rate': 6.451236584227719e-07, 'completion_length': 290.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7071287333965302, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.653557300567627, 'reward_std': 0.15716829895973206, 'kl': 1.060546875, 'epoch': 0.35}
 36%|███▌ | 1522/4286 [11:40:12<18:56:57, 24.68s/it] {'loss': 0.0287, 'grad_norm': 3.8490585515896827, 'learning_rate': 6.448903406439571e-07, 'completion_length': 305.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592262387275696, 'reward_std': 0.07111446000635624, 'kl': 0.716796875, 'epoch': 0.36}
 36%|███▌ | 1523/4286 [11:40:38<19:13:48, 25.06s/it] {'loss': 0.0433, 'grad_norm': 9.301741217318012, 'learning_rate': 6.446570228651423e-07, 'completion_length': 324.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6413690745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6235120296478271, 'reward_std': 0.14770828187465668, 'kl': 1.0859375, 'epoch': 0.36}
 36%|███▌ | 1524/4286 [11:41:02<19:02:34, 24.82s/it] {'loss': 0.0255, 'grad_norm': 3.4821673630396015, 'learning_rate': 6.444237050863276e-07, 'completion_length': 308.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7038689851760864, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6860119700431824, 'reward_std': 0.16142144054174423, 'kl': 0.6337890625, 'epoch': 0.36}
 36%|███▌ | 1525/4286 [11:41:28<19:26:22, 25.35s/it] {'loss': 0.0431, 'grad_norm': 3.3135126891737436, 'learning_rate': 6.441903873075129e-07, 'completion_length': 302.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6434884667396545, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.607774257659912, 'reward_std': 0.1224487517029047, 'kl': 1.078125, 'epoch': 0.36}
 36%|███▌ | 1526/4286 [11:41:53<19:15:12, 25.11s/it] {'loss': 0.0228, 'grad_norm': 1.9919545038098563, 'learning_rate': 6.439570695286981e-07, 'completion_length': 322.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7648810744285583, 'reward_std': 0.15809809416532516, 'kl': 0.572265625, 'epoch': 0.36}
 36%|███▌ | 1527/4286 [11:42:19<19:34:00, 25.53s/it] {'loss': 0.0662, 'grad_norm': 9.509045446581121, 'learning_rate': 6.437237517498834e-07, 'completion_length': 301.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5818452686071396, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.52827388048172, 'reward_std': 0.205897256731987, 'kl': 1.65234375, 'epoch': 0.36}
 36%|███▌ | 1528/4286 [11:42:46<19:45:35, 25.79s/it] {'loss': 0.0072, 'grad_norm': 7.3852746814908, 'learning_rate': 6.434904339710686e-07, 'completion_length': 310.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7500001192092896, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.732142984867096, 'reward_std': 0.09523809514939785, 'kl': 0.1796875, 'epoch': 0.36}
 36%|███▌ | 1529/4286 [11:43:10<19:24:38, 25.35s/it] {'loss': 0.0103, 'grad_norm': 3.3076701845949805, 'learning_rate': 6.432571161922538e-07, 'completion_length': 295.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7961310148239136, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.05059523694217205, 'kl': 0.25732421875, 'epoch': 0.36}
 36%|███▌ | 1530/4286 [11:43:35<19:18:32, 25.22s/it] {'loss': 0.0245, 'grad_norm': 2.768028788048043, 'learning_rate': 6.430237984134392e-07, 'completion_length': 305.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.7773809731006622, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.759523868560791, 'reward_std': 0.08124320395290852, 'kl': 0.611328125, 'epoch': 0.36}
[2025-03-03 02:41:24,521] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 36%|███▌ | 1531/4286 [11:44:02<19:35:50, 25.61s/it] {'loss': 0.0018, 'grad_norm': 0.9231663701346263, 'learning_rate': 6.427904806346244e-07, 'completion_length': 267.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7604166865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7425596714019775, 'reward_std': 0.06038287654519081, 'kl': 0.045654296875, 'epoch': 0.36}
[2025-03-03 02:41:50,826] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 36%|███▌ | 1532/4286 [11:44:28<19:45:01, 25.82s/it] {'loss': 0.0046, 'grad_norm': 5.925342790138304, 'learning_rate': 6.425571628558096e-07, 'completion_length': 300.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035714626312256, 'reward_std': 0.01877797581255436, 'kl': 0.11376953125, 'epoch': 0.36}
 36%|███▌ | 1533/4286 [11:44:54<19:42:10, 25.76s/it] {'loss': 0.0275, 'grad_norm': 3.532474371667918, 'learning_rate': 6.423238450769948e-07, 'completion_length': 301.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262983322144, 'reward_std': 0.07876220531761646, 'kl': 0.685546875, 'epoch': 0.36}
 36%|███▌ | 1534/4286 [11:45:18<19:28:03, 25.47s/it] {'loss': 0.0061, 'grad_norm': 2.2576521106977463, 'learning_rate': 6.420905272981802e-07, 'completion_length': 303.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7738095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.049460720270872116, 'kl': 0.1519775390625, 'epoch': 0.36}
 36%|███▌ | 1535/4286 [11:45:42<19:04:05, 24.95s/it] {'loss': 0.04, 'grad_norm': 2.3315737223470125, 'learning_rate': 6.418572095193654e-07, 'completion_length': 271.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7247024774551392, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.07111651077866554, 'kl': 0.9993896484375, 'epoch': 0.36}
 36%|███▌ | 1536/4286 [11:46:08<19:20:27, 25.32s/it] {'loss': 0.0033, 'grad_norm': 1.6342229508778103, 'learning_rate': 6.416238917405506e-07, 'completion_length': 308.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7693453431129456, 'rewards/format_reward': 1.0, 'reward': 1.7693454027175903, 'reward_std': 0.022675009444355965, 'kl': 0.083251953125, 'epoch': 0.36}
 36%|███▌ | 1537/4286 [11:46:34<19:25:02, 25.43s/it] {'loss': 0.0047, 'grad_norm': 3.970677074728662, 'learning_rate': 6.413905739617359e-07, 'completion_length': 330.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7023809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.0476190447807312, 'kl': 0.1177978515625, 'epoch': 0.36}
 36%|███▌ | 1538/4286 [11:46:59<19:15:21, 25.23s/it] {'loss': 0.0163, 'grad_norm': 2.802428957344311, 'learning_rate': 6.411572561829212e-07, 'completion_length': 302.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.046452607959508896, 'kl': 0.406494140625, 'epoch': 0.36}
 36%|███▌ | 1539/4286 [11:47:24<19:11:59, 25.16s/it] {'loss': 0.0083, 'grad_norm': 2.218021717146661, 'learning_rate': 6.409239384041064e-07, 'completion_length': 329.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7684524357318878, 'rewards/format_reward': 1.0, 'reward': 1.7684524655342102, 'reward_std': 0.08354990556836128, 'kl': 0.20794677734375, 'epoch': 0.36}
 36%|███▌ | 1540/4286 [11:47:48<18:57:47, 24.86s/it] {'loss': 0.0068, 'grad_norm': 1.9059255250866383, 'learning_rate': 6.406906206252917e-07, 'completion_length': 273.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.09548483975231647, 'kl': 0.16986083984375, 'epoch': 0.36}
[2025-03-03 02:45:36,038] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 36%|███▌ | 1541/4286 [11:48:13<19:02:54, 24.98s/it] {'loss': 0.0098, 'grad_norm': 2.227052925797731, 'learning_rate': 6.404573028464769e-07, 'completion_length': 326.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.008928571827709675, 'kl': 0.2445068359375, 'epoch': 0.36}
 36%|███▌ | 1542/4286 [11:48:39<19:08:37, 25.12s/it] {'loss': 0.0023, 'grad_norm': 3.530821051296371, 'learning_rate': 6.402239850676622e-07, 'completion_length': 320.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.84226194024086, 'rewards/format_reward': 1.0, 'reward': 1.8422619700431824, 'reward_std': 0.03847679682075977, 'kl': 0.056396484375, 'epoch': 0.36}
 36%|███▌ | 1543/4286 [11:49:04<19:14:05, 25.24s/it] {'loss': 0.0028, 'grad_norm': 2.4869931483292422, 'learning_rate': 6.399906672888474e-07, 'completion_length': 347.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.09611427411437035, 'kl': 0.0692138671875, 'epoch': 0.36}
 36%|███▌ | 1544/4286 [11:49:30<19:21:14, 25.41s/it] {'loss': 0.0012, 'grad_norm': 0.2537804160188863, 'learning_rate': 6.397573495100327e-07, 'completion_length': 311.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.8324405252933502, 'rewards/format_reward': 1.0, 'reward': 1.8324405550956726, 'reward_std': 0.0017857126658782363, 'kl': 0.029296875, 'epoch': 0.36}
 36%|███▌ | 1545/4286 [11:49:54<19:01:32, 24.99s/it] {'loss': 0.0013, 'grad_norm': 0.4570359071926258, 'learning_rate': 6.395240317312179e-07, 'completion_length': 287.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7872024774551392, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.032738096080720425, 'kl': 0.03216552734375, 'epoch': 0.36}
 36%|███▌ | 1546/4286 [11:50:18<18:46:53, 24.68s/it] {'loss': 0.0092, 'grad_norm': 5.078526004715215, 'learning_rate': 6.392907139524032e-07, 'completion_length': 304.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6383929252624512, 'reward_std': 0.12645572796463966, 'kl': 0.23095703125, 'epoch': 0.36}
 36%|███▌ | 1547/4286 [11:50:44<19:05:13, 25.09s/it] {'loss': 0.0118, 'grad_norm': 1.4047676740952455, 'learning_rate': 6.390573961735885e-07, 'completion_length': 327.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7809523642063141, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7630954384803772, 'reward_std': 0.11521804332733154, 'kl': 0.295166015625, 'epoch': 0.36}
 36%|███▌ | 1548/4286 [11:51:10<19:21:09, 25.45s/it] {'loss': 0.0012, 'grad_norm': 2.1200677694529104, 'learning_rate': 6.388240783947737e-07, 'completion_length': 290.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6919644474983215, 'reward_std': 0.08630952425301075, 'kl': 0.02886962890625, 'epoch': 0.36}
[2025-03-03 02:48:58,775] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 36%|███▌ | 1549/4286 [11:51:36<19:24:02, 25.52s/it] {'loss': 0.0221, 'grad_norm': 5.241855175178882, 'learning_rate': 6.385907606159589e-07, 'completion_length': 297.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6949405670166016, 'rewards/format_reward': 1.0, 'reward': 1.6949406266212463, 'reward_std': 0.09539870172739029, 'kl': 0.552734375, 'epoch': 0.36}
 36%|███▌ | 1550/4286 [11:52:01<19:16:26, 25.36s/it] {'loss': 0.0144, 'grad_norm': 0.9672482441720052, 'learning_rate': 6.383574428371443e-07, 'completion_length': 315.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6488095819950104, 'rewards/format_reward': 1.0, 'reward': 1.6488096714019775, 'reward_std': 0.011904764920473099, 'kl': 0.361083984375, 'epoch': 0.36}
 36%|███▌ | 1551/4286 [11:52:29<19:47:49, 26.06s/it] {'loss': 0.009, 'grad_norm': 1.1759098176327258, 'learning_rate': 6.381241250583295e-07, 'completion_length': 328.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7261905670166016, 'reward_std': 0.12340506538748741, 'kl': 0.2254638671875, 'epoch': 0.36}
[2025-03-03 02:50:17,187] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 36%|███▌ | 1552/4286 [11:52:54<19:42:56, 25.96s/it] {'loss': 0.0081, 'grad_norm': 2.0192336913783566, 'learning_rate': 6.378908072795147e-07, 'completion_length': 300.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7517007887363434, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7338436841964722, 'reward_std': 0.08545699715614319, 'kl': 0.2034912109375, 'epoch': 0.36}
 36%|███▌ | 1553/4286 [11:53:20<19:34:24, 25.78s/it] {'loss': 0.0081, 'grad_norm': 1.9163739341040757, 'learning_rate': 6.376574895007e-07, 'completion_length': 295.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7172619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7172620296478271, 'reward_std': 0.01785714365541935, 'kl': 0.20184326171875, 'epoch': 0.36}
 36%|███▋ | 1554/4286 [11:53:43<19:05:32, 25.16s/it] {'loss': 0.0089, 'grad_norm': 2.6165065386865787, 'learning_rate': 6.374241717218852e-07, 'completion_length': 305.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6041666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.05808864161372185, 'kl': 0.22314453125, 'epoch': 0.36}
 36%|███▋ | 1555/4286 [11:54:07<18:43:25, 24.68s/it] {'loss': 0.012, 'grad_norm': 3.867433200346498, 'learning_rate': 6.371908539430705e-07, 'completion_length': 277.0893020629883, 'rewards/only_full_func_accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.019440393894910812, 'kl': 0.29931640625, 'epoch': 0.36}
 36%|███▋ | 1556/4286 [11:54:31<18:30:22, 24.40s/it] {'loss': 0.0019, 'grad_norm': 1.2340050974264372, 'learning_rate': 6.369575361642557e-07, 'completion_length': 295.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0357142873108387, 'kl': 0.047119140625, 'epoch': 0.36}
 36%|███▋ | 1557/4286 [11:54:54<18:16:49, 24.11s/it] {'loss': 0.0256, 'grad_norm': 2.2705812321081815, 'learning_rate': 6.36724218385441e-07, 'completion_length': 292.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6666667461395264, 'rewards/format_reward': 1.0, 'reward': 1.6666667461395264, 'reward_std': 0.09639318287372589, 'kl': 0.6396484375, 'epoch': 0.36}
[2025-03-03 02:52:44,472] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 36%|███▋ | 1558/4286 [11:55:22<19:01:55, 25.12s/it] {'loss': 0.0026, 'grad_norm': 3.2678585384261734, 'learning_rate': 6.364909006066262e-07, 'completion_length': 337.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7425595819950104, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.06250000186264515, 'kl': 0.064697265625, 'epoch': 0.36}
 36%|███▋ | 1559/4286 [11:55:47<19:06:06, 25.22s/it] {'loss': 0.0063, 'grad_norm': 2.803172503443507, 'learning_rate': 6.362575828278115e-07, 'completion_length': 342.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 1.0, 'reward': 1.5907739400863647, 'reward_std': 0.0446428582072258, 'kl': 0.1580810546875, 'epoch': 0.36}
 36%|███▋ | 1560/4286 [11:56:12<19:03:15, 25.16s/it] {'loss': 0.0218, 'grad_norm': 0.939418520882824, 'learning_rate': 6.360242650489968e-07, 'completion_length': 306.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.8660714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8482143878936768, 'reward_std': 0.05357142724096775, 'kl': 0.54168701171875, 'epoch': 0.36}
 36%|███▋ | 1561/4286 [11:56:38<19:11:53, 25.36s/it] {'loss': 0.0183, 'grad_norm': 2.7608535847771263, 'learning_rate': 6.35790947270182e-07, 'completion_length': 291.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7687500417232513, 'rewards/format_reward': 1.0, 'reward': 1.7687501311302185, 'reward_std': 0.05411786213517189, 'kl': 0.458740234375, 'epoch': 0.36}
[2025-03-03 02:54:28,774] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1562/4286 [11:57:06<19:47:08, 26.15s/it] {'loss': 0.0556, 'grad_norm': 5.279297580976489, 'learning_rate': 6.355576294913672e-07, 'completion_length': 317.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.59670689702034, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5431355834007263, 'reward_std': 0.15293484926223755, 'kl': 1.39453125, 'epoch': 0.36}
 36%|███▋ | 1563/4286 [11:57:32<19:50:08, 26.22s/it] {'loss': 0.0071, 'grad_norm': 0.6622405192877041, 'learning_rate': 6.353243117125526e-07, 'completion_length': 325.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.8690476417541504, 'rewards/format_reward': 1.0, 'reward': 1.86904776096344, 'reward_std': 0.027492869645357132, 'kl': 0.177001953125, 'epoch': 0.36}
 36%|███▋ | 1564/4286 [11:57:55<18:58:16, 25.09s/it] {'loss': 0.0109, 'grad_norm': 2.644440460262721, 'learning_rate': 6.350909939337378e-07, 'completion_length': 249.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.9098640084266663, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8920068740844727, 'reward_std': 0.061224497854709625, 'kl': 0.27099609375, 'epoch': 0.36}
 37%|███▋ | 1565/4286 [11:58:21<19:12:41, 25.42s/it] {'loss': 0.0256, 'grad_norm': 2.757706050119525, 'learning_rate': 6.34857676154923e-07, 'completion_length': 341.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7750000357627869, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7392858266830444, 'reward_std': 0.14185301586985588, 'kl': 0.640625, 'epoch': 0.37}
[2025-03-03 02:56:10,909] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1566/4286 [11:58:48<19:35:15, 25.92s/it] {'loss': 0.0466, 'grad_norm': 2.345328228003302, 'learning_rate': 6.346243583761082e-07, 'completion_length': 332.875, 'rewards/only_full_func_accuracy_reward': 0.6481718122959137, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6124576330184937, 'reward_std': 0.1237271772697568, 'kl': 1.162109375, 'epoch': 0.37}
 37%|███▋ | 1567/4286 [11:59:15<19:53:41, 26.34s/it] {'loss': 0.0042, 'grad_norm': 2.652034987634343, 'learning_rate': 6.343910405972936e-07, 'completion_length': 355.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6988095641136169, 'rewards/format_reward': 1.0, 'reward': 1.6988096237182617, 'reward_std': 0.03617841750383377, 'kl': 0.10498046875, 'epoch': 0.37}
 37%|███▋ | 1568/4286 [11:59:40<19:33:11, 25.90s/it] {'loss': 0.0031, 'grad_norm': 4.321621057176361, 'learning_rate': 6.341577228184788e-07, 'completion_length': 307.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592262983322144, 'reward_std': 0.06685745343565941, 'kl': 0.0767822265625, 'epoch': 0.37}
 37%|███▋ | 1569/4286 [12:00:06<19:38:40, 26.03s/it] {'loss': 0.0025, 'grad_norm': 2.2837462985997456, 'learning_rate': 6.33924405039664e-07, 'completion_length': 324.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6758928894996643, 'rewards/format_reward': 1.0, 'reward': 1.6758928894996643, 'reward_std': 0.08250274509191513, 'kl': 0.06298828125, 'epoch': 0.37}
 37%|███▋ | 1570/4286 [12:00:32<19:27:56, 25.80s/it] {'loss': 0.0014, 'grad_norm': 0.6382127543107147, 'learning_rate': 6.336910872608493e-07, 'completion_length': 317.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.03114316239953041, 'kl': 0.03515625, 'epoch': 0.37}
 37%|███▋ | 1571/4286 [12:00:58<19:32:38, 25.91s/it] {'loss': 0.0124, 'grad_norm': 3.8141098801851108, 'learning_rate': 6.334577694820346e-07, 'completion_length': 315.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7369048297405243, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7190477848052979, 'reward_std': 0.10258116573095322, 'kl': 0.3089599609375, 'epoch': 0.37}
[2025-03-03 02:58:47,225] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1572/4286 [12:01:24<19:38:10, 26.05s/it] {'loss': 0.01, 'grad_norm': 4.604003904049378, 'learning_rate': 6.332244517032198e-07, 'completion_length': 328.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142858505249023, 'reward_std': 0.05189165845513344, 'kl': 0.2508544921875, 'epoch': 0.37}
 37%|███▋ | 1573/4286 [12:01:49<19:16:37, 25.58s/it] {'loss': 0.0187, 'grad_norm': 2.9932051211182014, 'learning_rate': 6.329911339244051e-07, 'completion_length': 282.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7342261672019958, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6985120177268982, 'reward_std': 0.13936903700232506, 'kl': 0.468017578125, 'epoch': 0.37}
 37%|███▋ | 1574/4286 [12:02:15<19:18:30, 25.63s/it] {'loss': 0.0214, 'grad_norm': 2.043456225591505, 'learning_rate': 6.327578161455903e-07, 'completion_length': 298.69644927978516, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.703869104385376, 'reward_std': 0.12577905505895615, 'kl': 0.53619384765625, 'epoch': 0.37}
 37%|███▋ | 1575/4286 [12:02:43<19:56:13, 26.48s/it] {'loss': 0.0016, 'grad_norm': 10.159385867175642, 'learning_rate': 6.325244983667755e-07, 'completion_length': 321.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187500596046448, 'reward_std': 0.07280982844531536, 'kl': 0.0396728515625, 'epoch': 0.37}
 37%|███▋ | 1576/4286 [12:03:08<19:41:57, 26.17s/it] {'loss': 0.0061, 'grad_norm': 1.7338659210680456, 'learning_rate': 6.322911805879609e-07, 'completion_length': 321.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6636906266212463, 'reward_std': 0.029761902056634426, 'kl': 0.15301513671875, 'epoch': 0.37}
 37%|███▋ | 1577/4286 [12:03:34<19:32:30, 25.97s/it] {'loss': 0.0059, 'grad_norm': 1.38132957771377, 'learning_rate': 6.320578628091461e-07, 'completion_length': 303.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6741072237491608, 'rewards/format_reward': 1.0, 'reward': 1.6741071939468384, 'reward_std': 0.0918345432728529, 'kl': 0.1474609375, 'epoch': 0.37}
 37%|███▋ | 1578/4286 [12:03:58<19:02:59, 25.32s/it] {'loss': 0.0015, 'grad_norm': 1.584370206823835, 'learning_rate': 6.318245450303313e-07, 'completion_length': 303.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.07078753132373095, 'kl': 0.03802490234375, 'epoch': 0.37}
[2025-03-03 03:01:47,898] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1579/4286 [12:04:25<19:28:04, 25.89s/it] {'loss': 0.013, 'grad_norm': 2.650539460861701, 'learning_rate': 6.315912272515165e-07, 'completion_length': 309.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.71875, 'rewards/format_reward': 1.0, 'reward': 1.7187501192092896, 'reward_std': 0.06031543388962746, 'kl': 0.325927734375, 'epoch': 0.37}
 37%|███▋ | 1580/4286 [12:04:50<19:16:30, 25.64s/it] {'loss': 0.0165, 'grad_norm': 2.5050830249014644, 'learning_rate': 6.313579094727019e-07, 'completion_length': 314.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.0297619067132473, 'kl': 0.412841796875, 'epoch': 0.37}
 37%|███▋ | 1581/4286 [12:05:15<19:05:01, 25.40s/it] {'loss': 0.0023, 'grad_norm': 7.550618667703159, 'learning_rate': 6.311245916938871e-07, 'completion_length': 282.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.0680250208824873, 'kl': 0.0579833984375, 'epoch': 0.37}
 37%|███▋ | 1582/4286 [12:05:41<19:12:25, 25.57s/it] {'loss': 0.0017, 'grad_norm': 2.1414112867927924, 'learning_rate': 6.308912739150723e-07, 'completion_length': 305.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7207483351230621, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7028912901878357, 'reward_std': 0.10550883412361145, 'kl': 0.0421142578125, 'epoch': 0.37}
[2025-03-03 03:03:30,055] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1583/4286 [12:06:07<19:21:40, 25.79s/it] {'loss': 0.0057, 'grad_norm': 4.0968442003406285, 'learning_rate': 6.306579561362576e-07, 'completion_length': 292.0893020629883, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.784226417541504, 'reward_std': 0.10629495047032833, 'kl': 0.1439208984375, 'epoch': 0.37}
 37%|███▋ | 1584/4286 [12:06:32<19:08:46, 25.51s/it] {'loss': 0.011, 'grad_norm': 2.7073159100349846, 'learning_rate': 6.304246383574429e-07, 'completion_length': 323.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.758928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.0773809589445591, 'kl': 0.2750244140625, 'epoch': 0.37}
 37%|███▋ | 1585/4286 [12:06:56<18:46:21, 25.02s/it] {'loss': 0.0047, 'grad_norm': 4.549874773993563, 'learning_rate': 6.301913205786281e-07, 'completion_length': 299.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.013746432960033417, 'kl': 0.1173095703125, 'epoch': 0.37}
[2025-03-03 03:04:45,592] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1586/4286 [12:07:23<19:09:52, 25.55s/it] {'loss': 0.0066, 'grad_norm': 4.1427451420648405, 'learning_rate': 6.299580027998134e-07, 'completion_length': 310.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6156888008117676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5799745321273804, 'reward_std': 0.1525387093424797, 'kl': 0.1658935546875, 'epoch': 0.37}
 37%|███▋ | 1587/4286 [12:07:48<19:08:03, 25.52s/it] {'loss': 0.0019, 'grad_norm': 0.24341103424009503, 'learning_rate': 6.297246850209986e-07, 'completion_length': 287.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.6880411803722382, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.670184075832367, 'reward_std': 0.05315990746021271, 'kl': 0.0474853515625, 'epoch': 0.37}
 37%|███▋ | 1588/4286 [12:08:13<18:52:30, 25.19s/it] {'loss': 0.0014, 'grad_norm': 0.3787343054994942, 'learning_rate': 6.294913672421839e-07, 'completion_length': 316.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7315476536750793, 'rewards/format_reward': 1.0, 'reward': 1.7315477132797241, 'reward_std': 0.02115892805159092, 'kl': 0.0347900390625, 'epoch': 0.37}
[2025-03-03 03:06:02,460] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1589/4286 [12:08:40<19:16:47, 25.73s/it] {'loss': 0.0012, 'grad_norm': 0.18404284506183108, 'learning_rate': 6.292580494633691e-07, 'completion_length': 339.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.611607313156128, 'reward_std': 0.08035714644938707, 'kl': 0.03057861328125, 'epoch': 0.37}
 37%|███▋ | 1590/4286 [12:09:07<19:33:20, 26.11s/it] {'loss': 0.0095, 'grad_norm': 0.9367556523274501, 'learning_rate': 6.290247316845544e-07, 'completion_length': 332.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.8139881789684296, 'rewards/format_reward': 1.0, 'reward': 1.813988208770752, 'reward_std': 0.06802502274513245, 'kl': 0.2354736328125, 'epoch': 0.37}
 37%|███▋ | 1591/4286 [12:09:33<19:42:39, 26.33s/it] {'loss': 0.02, 'grad_norm': 0.9518528673302576, 'learning_rate': 6.287914139057396e-07, 'completion_length': 315.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6770834922790527, 'reward_std': 0.0922619067132473, 'kl': 0.501953125, 'epoch': 0.37}
 37%|███▋ | 1592/4286 [12:09:58<19:18:51, 25.81s/it] {'loss': 0.0138, 'grad_norm': 4.7801566297778555, 'learning_rate': 6.285580961269249e-07, 'completion_length': 300.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7827380895614624, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7648810744285583, 'reward_std': 0.14398439228534698, 'kl': 0.34307861328125, 'epoch': 0.37}
[2025-03-03 03:07:45,137] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1593/4286 [12:10:22<18:57:25, 25.34s/it] {'loss': 0.0027, 'grad_norm': 1.3813733576359548, 'learning_rate': 6.283247783481102e-07, 'completion_length': 301.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.05314406752586365, 'kl': 0.068359375, 'epoch': 0.37}
 37%|███▋ | 1594/4286 [12:10:46<18:38:59, 24.94s/it] {'loss': 0.0219, 'grad_norm': 1.337782558097461, 'learning_rate': 6.280914605692954e-07, 'completion_length': 294.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.54464291036129, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.0750191193073988, 'kl': 0.54833984375, 'epoch': 0.37}
 37%|███▋ | 1595/4286 [12:11:11<18:37:01, 24.91s/it] {'loss': 0.0128, 'grad_norm': 3.2839572596630027, 'learning_rate': 6.278581427904806e-07, 'completion_length': 325.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.672619104385376, 'reward_std': 0.05038155708462, 'kl': 0.319091796875, 'epoch': 0.37}
 37%|███▋ | 1596/4286 [12:11:35<18:30:07, 24.76s/it] {'loss': 0.0114, 'grad_norm': 1.9714322235497546, 'learning_rate': 6.27624825011666e-07, 'completion_length': 258.57144927978516, 'rewards/only_full_func_accuracy_reward': 0.7291667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.05096162483096123, 'kl': 0.28564453125, 'epoch': 0.37}
 37%|███▋ | 1597/4286 [12:12:01<18:42:50, 25.05s/it] {'loss': 0.0042, 'grad_norm': 44.46263954838904, 'learning_rate': 6.273915072328512e-07, 'completion_length': 285.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.03869047574698925, 'kl': 0.10516357421875, 'epoch': 0.37}
 37%|███▋ | 1598/4286 [12:12:27<18:48:49, 25.20s/it] {'loss': 0.0014, 'grad_norm': 2.152515254085433, 'learning_rate': 6.271581894540364e-07, 'completion_length': 328.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.784226268529892, 'rewards/format_reward': 1.0, 'reward': 1.7842263579368591, 'reward_std': 0.02678571455180645, 'kl': 0.03399658203125, 'epoch': 0.37}
 37%|███▋ | 1599/4286 [12:12:54<19:19:19, 25.89s/it] {'loss': 0.0066, 'grad_norm': 1.7669451683615063, 'learning_rate': 6.269248716752217e-07, 'completion_length': 340.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.5567602813243866, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5210460424423218, 'reward_std': 0.16213078796863556, 'kl': 0.16552734375, 'epoch': 0.37}
[2025-03-03 03:10:43,229] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1600/4286 [12:13:20<19:21:23, 25.94s/it] {'loss': 0.0082, 'grad_norm': 3.0284651820045285, 'learning_rate': 6.26691553896407e-07, 'completion_length': 309.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.689583420753479, 'rewards/format_reward': 1.0, 'reward': 1.689583420753479, 'reward_std': 0.10383668541908264, 'kl': 0.205078125, 'epoch': 0.37}
[2025-03-03 03:16:11,519] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1601/4286 [12:18:49<86:59:57, 116.65s/it] {'loss': 0.0343, 'grad_norm': 4.31446194814152, 'learning_rate': 6.264582361175922e-07, 'completion_length': 311.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7050595283508301, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6693453788757324, 'reward_std': 0.10497413948178291, 'kl': 0.854736328125, 'epoch': 0.37}
[2025-03-03 03:16:36,205] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1602/4286 [12:19:13<66:23:54, 89.06s/it] {'loss': 0.0296, 'grad_norm': 19.01952948403328, 'learning_rate': 6.262249183387774e-07, 'completion_length': 331.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7446428835391998, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7089287042617798, 'reward_std': 0.14594642259180546, 'kl': 0.74072265625, 'epoch': 0.37}
 37%|███▋ | 1603/4286 [12:19:36<51:38:31, 69.29s/it] {'loss': 0.0196, 'grad_norm': 3.2983324485777374, 'learning_rate': 6.259916005599627e-07, 'completion_length': 283.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.8458758890628815, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.828018844127655, 'reward_std': 0.05600804463028908, 'kl': 0.48974609375, 'epoch': 0.37}
[2025-03-03 03:17:22,991] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 37%|███▋ | 1604/4286 [12:20:00<41:24:50, 55.59s/it] {'loss': 0.0046, 'grad_norm': 2.5100836472638095, 'learning_rate': 6.257582827811479e-07, 'completion_length': 297.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529763579368591, 'reward_std': 0.06644324585795403, 'kl': 0.1142578125, 'epoch': 0.37}
 37%|███▋ | 1605/4286 [12:20:23<34:12:17, 45.93s/it] {'loss': 0.0258, 'grad_norm': 2.222670618018338, 'learning_rate': 6.255249650023332e-07, 'completion_length': 331.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.6267857253551483, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.608928620815277, 'reward_std': 0.06701068766415119, 'kl': 0.645263671875, 'epoch': 0.37}
 37%|███▋ | 1606/4286 [12:20:46<28:54:03, 38.82s/it] {'loss': 0.0127, 'grad_norm': 5.3270079764710845, 'learning_rate': 6.252916472235185e-07, 'completion_length': 322.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.74702388048172, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7113096117973328, 'reward_std': 0.049658652395009995, 'kl': 0.3177490234375, 'epoch': 0.37}
 37%|███▋ | 1607/4286 [12:21:07<25:00:01, 33.60s/it] {'loss': 0.006, 'grad_norm': 0.8798488433868464, 'learning_rate': 6.250583294447037e-07, 'completion_length': 267.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.14990234375, 'epoch': 0.37}
 38%|███▊ | 1608/4286 [12:21:32<23:09:16, 31.13s/it] {'loss': 0.0119, 'grad_norm': 1.5916944246380034, 'learning_rate': 6.248250116658888e-07, 'completion_length': 299.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 1.0, 'reward': 1.641369104385376, 'reward_std': 0.026785715483129025, 'kl': 0.2958984375, 'epoch': 0.38}
 38%|███▊ | 1609/4286 [12:21:55<21:19:24, 28.68s/it] {'loss': 0.021, 'grad_norm': 2.3523086968242666, 'learning_rate': 6.245916938870742e-07, 'completion_length': 257.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.7648809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.065476194024086, 'kl': 0.5244140625, 'epoch': 0.38}
 38%|███▊ | 1610/4286 [12:22:20<20:25:15, 27.47s/it] {'loss': 0.0016, 'grad_norm': 0.4973741860080748, 'learning_rate': 6.243583761082594e-07, 'completion_length': 300.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7038690745830536, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.05838929582387209, 'kl': 0.04022216796875, 'epoch': 0.38}
 38%|███▊ | 1611/4286 [12:22:43<19:28:10, 26.20s/it] {'loss': 0.0184, 'grad_norm': 1.1181517157215437, 'learning_rate': 6.241250583294446e-07, 'completion_length': 274.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.7886905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7886906266212463, 'reward_std': 0.0535714328289032, 'kl': 0.4580078125, 'epoch': 0.38}
 38%|███▊ | 1612/4286 [12:23:07<18:54:32, 25.46s/it] {'loss': 0.0089, 'grad_norm': 4.556309976607242, 'learning_rate': 6.238917405506298e-07, 'completion_length': 299.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6577381193637848, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.05792887136340141, 'kl': 0.22265625, 'epoch': 0.38}
 38%|███▊ | 1613/4286 [12:23:31<18:32:43, 24.98s/it] {'loss': 0.0037, 'grad_norm': 2.060276751115015, 'learning_rate': 6.236584227718152e-07, 'completion_length': 282.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.0833333358168602, 'kl': 0.0933837890625, 'epoch': 0.38}
 38%|███▊ | 1614/4286 [12:23:55<18:14:49, 24.58s/it] {'loss': 0.0014, 'grad_norm': 0.289828947156178, 'learning_rate': 6.234251049930004e-07, 'completion_length': 290.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.7711310386657715, 'rewards/format_reward': 1.0, 'reward': 1.7711310982704163, 'reward_std': 0.005357143934816122, 'kl': 0.03564453125, 'epoch': 0.38}
 38%|███▊ | 1615/4286 [12:24:22<18:48:29, 25.35s/it] {'loss': 0.0014, 'grad_norm': 3.342857469766493, 'learning_rate': 6.231917872141856e-07, 'completion_length': 314.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8612352013587952, 'rewards/format_reward': 1.0, 'reward': 1.86123526096344, 'reward_std': 0.07375163212418556, 'kl': 0.035400390625, 'epoch': 0.38}
 38%|███▊ | 1616/4286 [12:24:46<18:40:18, 25.18s/it] {'loss': 0.0024, 'grad_norm': 4.79217975538718, 'learning_rate': 6.229584694353709e-07, 'completion_length': 294.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.7440478205680847, 'reward_std': 0.07167530618607998, 'kl': 0.058837890625, 'epoch': 0.38}
 38%|███▊ | 1617/4286 [12:25:12<18:51:05, 25.43s/it] {'loss': 0.0082, 'grad_norm': 2.44047904197119, 'learning_rate': 6.227251516565562e-07, 'completion_length': 323.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6651785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6473215818405151, 'reward_std': 0.12012907490134239, 'kl': 0.205322265625, 'epoch': 0.38}
 38%|███▊ | 1618/4286 [12:25:36<18:27:57, 24.92s/it] {'loss': 0.0058, 'grad_norm': 2.2570673444196823, 'learning_rate': 6.224918338777414e-07, 'completion_length': 294.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.669642984867096, 'reward_std': 0.08609584905207157, 'kl': 0.145263671875, 'epoch': 0.38}
 38%|███▊ | 1619/4286 [12:26:01<18:29:22, 24.96s/it] {'loss': 0.0019, 'grad_norm': 5.379006882249907, 'learning_rate': 6.222585160989267e-07, 'completion_length': 327.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6235119700431824, 'rewards/format_reward': 1.0, 'reward': 1.623512089252472, 'reward_std': 0.07028236985206604, 'kl': 0.047607421875, 'epoch': 0.38}
 38%|███▊ | 1620/4286 [12:26:25<18:18:38, 24.73s/it] {'loss': 0.0081, 'grad_norm': 9.079290246984284, 'learning_rate': 6.220251983201119e-07, 'completion_length': 284.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.5818453133106232, 'rewards/format_reward': 1.0, 'reward': 1.5818453431129456, 'reward_std': 0.12301075085997581, 'kl': 0.203125, 'epoch': 0.38}
 38%|███▊ | 1621/4286 [12:26:49<18:02:57, 24.38s/it] {'loss': 0.0041, 'grad_norm': 0.47344730320259243, 'learning_rate': 6.217918805412971e-07, 'completion_length': 306.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.779762089252472, 'reward_std': 0.013746436685323715, 'kl': 0.1024169921875, 'epoch': 0.38}
 38%|███▊ | 1622/4286 [12:27:15<18:19:44, 24.77s/it] {'loss': 0.0025, 'grad_norm': 5.35376846015191, 'learning_rate': 6.215585627624824e-07, 'completion_length': 316.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.013746436685323715, 'kl': 0.062255859375, 'epoch': 0.38}
 38%|███▊ | 1623/4286 [12:27:38<18:06:13, 24.47s/it] {'loss': 0.0124, 'grad_norm': 12.103090202506724, 'learning_rate': 6.213252449836677e-07, 'completion_length': 290.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7040391862392426, 'rewards/format_reward': 1.0, 'reward': 1.7040392756462097, 'reward_std': 0.04618912562727928, 'kl': 0.3095703125, 'epoch': 0.38}
 38%|███▊ | 1624/4286 [12:28:05<18:28:55, 24.99s/it] {'loss': 0.0114, 'grad_norm': 2.4067999996388756, 'learning_rate': 6.210919272048529e-07, 'completion_length': 297.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 1.0, 'reward': 1.6607144474983215, 'reward_std': 0.056408412754535675, 'kl': 0.28369140625, 'epoch': 0.38}
[2025-03-03 03:25:54,304] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 38%|███▊ | 1625/4286 [12:28:31<18:51:01, 25.50s/it] {'loss': 0.012, 'grad_norm': 7.389684325813447, 'learning_rate': 6.208586094260381e-07, 'completion_length': 313.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7127977013587952, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.0963195376098156, 'kl': 0.30078125, 'epoch': 0.38}
[2025-03-03 03:26:20,197] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 38%|███▊ | 1626/4286 [12:28:57<18:55:46, 25.62s/it] {'loss': 0.0115, 'grad_norm': 4.329089923004802, 'learning_rate': 6.206252916472235e-07, 'completion_length': 327.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6071429252624512, 'rewards/format_reward': 1.0, 'reward': 1.6071430444717407, 'reward_std': 0.08352699875831604, 'kl': 0.288330078125, 'epoch': 0.38} 38%|███▊ | 1626/4286 [12:28:57<18:55:46, 25.62s/it] 38%|███▊ | 1627/4286 [12:29:21<18:24:07, 24.91s/it] {'loss': 0.0098, 'grad_norm': 2.8863660866045895, 'learning_rate': 6.203919738684087e-07, 'completion_length': 285.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7172619700431824, 'reward_std': 0.06388125754892826, 'kl': 0.244140625, 'epoch': 0.38} 38%|███▊ | 1627/4286 [12:29:21<18:24:07, 24.91s/it] 38%|███▊ | 1628/4286 [12:29:45<18:19:37, 24.82s/it] {'loss': 0.0342, 'grad_norm': 3.0650676653286366, 'learning_rate': 6.201586560895939e-07, 'completion_length': 317.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.12346810474991798, 'kl': 0.85205078125, 'epoch': 0.38} 38%|███▊ | 1628/4286 [12:29:45<18:19:37, 24.82s/it] 38%|███▊ | 1629/4286 [12:30:08<17:52:54, 24.23s/it] {'loss': 0.0506, 'grad_norm': 5.104950932036499, 'learning_rate': 6.199253383107792e-07, 'completion_length': 274.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6913690865039825, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.673512041568756, 'reward_std': 0.15937994793057442, 'kl': 1.265625, 'epoch': 0.38} 38%|███▊ | 1629/4286 [12:30:08<17:52:54, 24.23s/it] 38%|███▊ | 1630/4286 [12:30:32<17:53:03, 24.24s/it] {'loss': 0.0498, 'grad_norm': 4.622972614428331, 'learning_rate': 6.196920205319645e-07, 'completion_length': 299.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7552083730697632, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7016369700431824, 'reward_std': 0.2292562946677208, 'kl': 1.24609375, 'epoch': 0.38} 38%|███▊ | 1630/4286 [12:30:32<17:53:03, 24.24s/it] 38%|███▊ | 1631/4286 [12:30:58<18:14:09, 24.73s/it] {'loss': 0.0591, 'grad_norm': 9.783120374290593, 'learning_rate': 6.194587027531497e-07, 'completion_length': 320.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7253402173519135, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6717687845230103, 'reward_std': 0.22338998317718506, 'kl': 1.4765625, 'epoch': 0.38} 38%|███▊ | 1631/4286 [12:30:58<18:14:09, 24.73s/it] 38%|███▊ | 1632/4286 [12:31:22<18:00:19, 24.42s/it] {'loss': 0.042, 'grad_norm': 23.07660031767014, 'learning_rate': 6.19225384974335e-07, 'completion_length': 283.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6458334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6279762983322144, 'reward_std': 0.14052540063858032, 'kl': 1.0498046875, 'epoch': 0.38} 38%|███▊ | 1632/4286 [12:31:22<18:00:19, 24.42s/it] 38%|███▊ | 1633/4286 [12:31:48<18:18:26, 24.84s/it] {'loss': 0.0854, 'grad_norm': 6.224903117309679, 'learning_rate': 6.189920671955202e-07, 'completion_length': 298.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7687970697879791, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.715225636959076, 'reward_std': 
0.2574795335531235, 'kl': 2.134765625, 'epoch': 0.38} 38%|███▊ | 1633/4286 [12:31:48<18:18:26, 24.84s/it] 38%|███▊ | 1634/4286 [12:32:11<17:58:05, 24.39s/it] {'loss': 0.0438, 'grad_norm': 6.285503745675198, 'learning_rate': 6.187587494167055e-07, 'completion_length': 285.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7461310029029846, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.728273868560791, 'reward_std': 0.11386480182409286, 'kl': 1.09765625, 'epoch': 0.38} 38%|███▊ | 1634/4286 [12:32:11<17:58:05, 24.39s/it] 38%|███▊ | 1635/4286 [12:32:35<17:54:27, 24.32s/it] {'loss': 0.1009, 'grad_norm': 4.700823346022992, 'learning_rate': 6.185254316378907e-07, 'completion_length': 305.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.6294643580913544, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.575892984867096, 'reward_std': 0.18940778821706772, 'kl': 2.5234375, 'epoch': 0.38} 38%|███▊ | 1635/4286 [12:32:35<17:54:27, 24.32s/it][2025-03-03 03:30:24,856] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 38%|███▊ | 1636/4286 [12:33:02<18:26:45, 25.06s/it] {'loss': 0.0904, 'grad_norm': 5.778635466321421, 'learning_rate': 6.18292113859076e-07, 'completion_length': 297.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5677083432674408, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5141370296478271, 'reward_std': 0.23806288093328476, 'kl': 2.2578125, 'epoch': 0.38} 38%|███▊ | 1636/4286 [12:33:02<18:26:45, 25.06s/it][2025-03-03 03:30:50,979] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 38%|███▊ | 1637/4286 [12:33:28<18:40:27, 25.38s/it] {'loss': 0.066, 'grad_norm': 4.466988984713998, 'learning_rate': 6.180587960802612e-07, 'completion_length': 323.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6759673357009888, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6223958730697632, 'reward_std': 0.20903357863426208, 'kl': 1.6484375, 'epoch': 0.38} 38%|███▊ | 1637/4286 [12:33:28<18:40:27, 25.38s/it][2025-03-03 03:31:17,252] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
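The stage3 warnings above suggest adding get_accelerator().empty_cache() calls to the training loop so that all ranks flush their allocator caches together. A minimal sketch of that suggestion, assuming a standard DeepSpeed engine; `engine`, `dataloader`, and FLUSH_EVERY are illustrative placeholders, not values from this run:

    # Sketch of the warning's advice: flush the allocator cache on every
    # rank at the same step, so no rank stalls inside a collective while
    # another rank is busy flushing. `engine` is a deepspeed.initialize()
    # engine; FLUSH_EVERY is an illustrative interval.
    from deepspeed.accelerator import get_accelerator

    FLUSH_EVERY = 50

    for step, batch in enumerate(dataloader):
        loss = engine(batch)
        engine.backward(loss)
        engine.step()
        if step % FLUSH_EVERY == 0:
            get_accelerator().empty_cache()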
step  time      s/it   loss    grad_n  lr      len     acc     fmt     reward  std     kl      epoch
1638  12:33:54  25.65  0.0156  8.455   6.1783  320.75  0.8676  0.9643  1.8318  0.1022  0.3896  0.38
1639  12:34:18  25.13  0.0160  3.952   6.1759  292.11  0.6830  1.0000  1.6830  0.0405  0.4014  0.38
[2025-03-03 03:32:07,747] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1640  12:34:45  25.56  0.0490  4.575   6.1736  335.73  0.6945  0.9286  1.6231  0.2053  1.2246  0.38
1641  12:35:09  25.23  0.0393  35.751  6.1713  310.57  0.6749  0.9821  1.6570  0.1439  0.9863  0.38
1642  12:35:33  24.65  0.0088  17.482  6.1689  295.70  0.6935  1.0000  1.6935  0.0522  0.2202  0.38
1643  12:35:58  24.85  0.0288  5.046   6.1666  328.77  0.6946  1.0000  1.6946  0.0975  0.7188  0.38
1644  12:36:23  24.97  0.0296  13.729  6.1643  299.54  0.6930  0.9643  1.6573  0.1521  0.7391  0.38
1645  12:36:49  25.29  0.0089  5.161   6.1619  323.39  0.6860  1.0000  1.6860  0.0549  0.2214  0.38
1646  12:37:15  25.46  0.0284  4.362   6.1596  302.70  0.6704  1.0000  1.6704  0.0510  0.7109  0.38
1647  12:37:40  25.40  0.0126  1.851   6.1573  308.70  0.7708  1.0000  1.7708  0.0388  0.3154  0.38
[2025-03-03 03:35:29,323] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step (advisory as above)
1648  12:38:06  25.61  0.0025  3.693   6.1549  292.34  0.6860  1.0000  1.6860  0.0208  0.0629  0.38
[2025-03-03 03:35:55,583] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1649  12:38:33  25.81  0.0168  1.068   6.1526  304.20  0.5714  1.0000  1.5714  0.0257  0.4180  0.38
1650  12:38:59  25.85  0.0304  3.669   6.1503  296.45  0.7292  0.9643  1.6935  0.1098  0.7622  0.38
[2025-03-03 03:36:47,404] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1651  12:39:24  25.85  0.0369  4.437   6.1479  315.70  0.7387  1.0000  1.7387  0.0236  0.9242  0.39
[2025-03-03 03:37:12,374] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1652  12:39:49  25.59  0.0106  2.405   6.1456  288.13  0.7244  1.0000  1.7244  0.0369  0.2640  0.39
[2025-03-03 03:37:38,187] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1653  12:40:15  25.66  0.0369  6.332   6.1433  324.09  0.7235  0.9821  1.7057  0.0905  0.9219  0.39
1654  12:40:39  25.04  0.0057  1.972   6.1409  260.45  0.6964  1.0000  1.6964  0.0558  0.1409  0.39
1655  12:41:03  24.84  0.0037  1.256   6.1386  302.14  0.7158  1.0000  1.7158  0.0208  0.0923  0.39
1656  12:41:29  25.00  0.0043  2.555   6.1363  314.46  0.6890  1.0000  1.6890  0.0744  0.1063  0.39
1657  12:41:53  24.73  0.0013  0.319   6.1339  330.66  0.6548  1.0000  1.6548  0.0146  0.0327  0.39
1658  12:42:18  24.78  0.0066  16.642  6.1316  294.27  0.7411  1.0000  1.7411  0.0585  0.1655  0.39
[2025-03-03 03:40:05,521] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
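In every record above, reward is the sum of the two reward components, and reward_std presumably tracks the spread of that total across the sampled completions; the component names suggest a multi-reward GRPO setup with one accuracy and one format reward. A quick check against step 1650, using the raw values from the log:

    # reward = only_full_func_accuracy_reward + format_reward, step 1650:
    acc = 0.729166716337204       # rewards/only_full_func_accuracy_reward
    fmt = 0.9642857313156128      # rewards/format_reward
    print(acc + fmt)              # 1.6934524..., matching reward ~ 1.6935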
step  time      s/it   loss    grad_n  lr      len     acc     fmt     reward  std     kl      epoch
1659  12:42:43  24.84  0.0170  1.642   6.1293  257.82  0.8259  0.9821  1.8080  0.0625  0.4263  0.39
1660  12:43:08  24.97  0.0295  12.259  6.1269  296.29  0.6830  0.9821  1.6652  0.0828  0.7358  0.39
1661  12:43:33  25.09  0.0185  3.042   6.1246  297.16  0.6429  1.0000  1.6429  0.1469  0.4631  0.39
1662  12:43:58  24.85  0.0207  6.873   6.1223  303.93  0.6503  1.0000  1.6503  0.0534  0.5212  0.39
1663  12:44:22  24.78  0.0140  9.781   6.1199  310.96  0.6667  0.9821  1.6488  0.1012  0.3486  0.39
1664  12:44:45  24.06  0.0285  5.325   6.1176  262.23  0.6429  0.9821  1.6250  0.1372  0.7070  0.39
1665  12:45:10  24.36  0.0075  4.035   6.1153  294.98  0.8601  1.0000  1.8601  0.0282  0.1860  0.39
1666  12:45:34  24.33  0.0032  4.166   6.1129  305.14  0.7143  1.0000  1.7143  0.0444  0.0808  0.39
1667  12:45:59  24.69  0.0033  88.776  6.1106  325.77  0.8125  1.0000  1.8125  0.0728  0.0835  0.39
[2025-03-03 03:43:48,211] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step (advisory as above)
1668  12:46:25  25.06  0.0047  7.850   6.1083  307.86  0.6979  1.0000  1.6979  0.0439  0.1160  0.39
[2025-03-03 03:44:13,718] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1669  12:46:51  25.19  0.0170  4.172   6.1059  301.46  0.5982  1.0000  1.5982  0.0764  0.4243  0.39
1670  12:47:16  25.18  0.0230  3.398   6.1036  327.93  0.7789  1.0000  1.7789  0.0683  0.5742  0.39
1671  12:47:39  24.57  0.0371  15.371  6.1013  275.11  0.6991  0.9643  1.6634  0.1303  0.9277  0.39
1672  12:48:03  24.51  0.0034  4.730   6.0989  274.11  0.8125  1.0000  1.8125  0.0417  0.0847  0.39
1673  12:48:30  24.97  0.0224  1.777   6.0966  338.05  0.6772  1.0000  1.6772  0.0365  0.5586  0.39
1674  12:48:54  24.85  0.0293  13.196  6.0943  298.04  0.7202  0.9821  1.7024  0.1344  0.7344  0.39
1675  12:49:20  25.26  0.0044  0.534   6.0919  289.59  0.6577  1.0000  1.6577  0.0060  0.1105  0.39
1676  12:49:46  25.25  0.0053  6.582   6.0896  324.20  0.6324  1.0000  1.6324  0.1053  0.1333  0.39
1677  12:50:10  25.12  0.0060  4.116   6.0873  320.55  0.8036  1.0000  1.8036  0.0316  0.1494  0.39
1678  12:50:36  25.37  0.0333  3.212   6.0849  320.23  0.7314  0.9821  1.7135  0.1215  0.8312  0.39
1679  12:51:00  24.96  0.0103  1.154   6.0826  302.43  0.7396  1.0000  1.7396  0.0225  0.2582  0.39
1680  12:51:23  24.43  0.0087  3.068   6.0803  266.57  0.7292  1.0000  1.7292  0.0985  0.2178  0.39
[2025-03-03 03:49:11,064] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1681  12:51:48  24.50  0.0044  4.469   6.0779  318.54  0.8348  1.0000  1.8348  0.0524  0.1096  0.39
[2025-03-03 03:49:36,972] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1682  12:52:14  24.92  0.0040  3.498   6.0756  328.25  0.8149  1.0000  1.8149  0.0335  0.0986  0.39
1683  12:52:41  25.57  0.0261  5.336   6.0733  295.36  0.7211  0.9643  1.6854  0.0733  0.6553  0.39
1684  12:53:07  25.68  0.0183  1.977   6.0709  296.20  0.7952  0.9821  1.7774  0.1055  0.4561  0.39
1685  12:53:33  25.74  0.0094  2.026   6.0686  337.55  0.7426  1.0000  1.7426  0.0268  0.2334  0.39
[2025-03-03 03:51:20,904] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
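The progress fragments follow tqdm's format, percent | step/total [elapsed&lt;remaining, s/it], and the remaining-time estimate is simply (total - step) times the smoothed step time. Reproducing step 1680's "&lt;17:41:17" from its printed rate:

    # tqdm-style ETA: (total - step) * s/it, for step 1680 at 24.43 s/it
    total, step, rate = 4286, 1680, 24.43
    remaining = (total - step) * rate        # 63664.6 s
    h, r = divmod(int(remaining), 3600)
    m, s = divmod(r, 60)
    print(f"{h}:{m:02d}:{s:02d}")            # 17:41:04, vs the logged 17:41:17
                                             # (tqdm smooths the rate)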
step  time      s/it   loss    grad_n  lr      len     acc     fmt     reward  std     kl      epoch
1686  12:53:58  25.53  0.0212  13.962  6.0663  272.27  0.7039  0.9821  1.6860  0.0728  0.5303  0.39
1687  12:54:23  25.44  0.0125  4.162   6.0639  337.80  0.7991  1.0000  1.7991  0.0701  0.3138  0.39
[2025-03-03 03:52:11,901] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1688  12:54:49  25.54  0.0366  19.008  6.0616  314.55  0.7917  0.9464  1.7381  0.1304  0.9141  0.39
1689  12:55:15  25.64  0.0176  3.668   6.0593  313.18  0.5888  0.9821  1.5709  0.1425  0.4395  0.39
1690  12:55:38  24.93  0.0143  4.043   6.0569  297.30  0.7321  1.0000  1.7321  0.0595  0.3577  0.39
[2025-03-03 03:53:26,166] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1691  12:56:03  24.98  0.0120  7.663   6.0546  312.30  0.6012  0.9821  1.5833  0.1008  0.2998  0.39
[2025-03-03 03:53:51,936] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step (advisory as above)
1692  12:56:29  25.22  0.0167  1.163   6.0523  311.18  0.6786  1.0000  1.6786  0.0454  0.4167  0.39
1693  12:56:54  25.18  0.0050  3.873   6.0499  298.25  0.8036  1.0000  1.8036  0.0770  0.1255  0.40
1694  12:57:18  24.93  0.0018  4.388   6.0476  281.55  0.6711  1.0000  1.6711  0.0550  0.0456  0.40
[2025-03-03 03:55:06,901] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1695  12:57:44  25.11  0.0028  1.768   6.0453  346.34  0.6860  1.0000  1.6860  0.0227  0.0688  0.40
1696  12:58:10  25.31  0.0030  0.839   6.0429  334.07  0.6548  1.0000  1.6548  0.0495  0.0764  0.40
1697  12:58:34  24.85  0.0163  4.995   6.0406  268.29  0.7813  1.0000  1.7813  0.0227  0.4077  0.40
[2025-03-03 03:56:22,191] [WARNING] [stage3.py:2134:step] 3 pytorch allocator cache flushes since last step (advisory as above)
1698  12:58:59  25.11  0.0040  4.858   6.0383  295.14  0.7545  1.0000  1.7545  0.0625  0.0989  0.40
1699  12:59:26  25.63  0.0032  2.415   6.0359  330.27  0.7981  0.9821  1.7802  0.0728  0.0797  0.40
1700  12:59:51  25.51  0.0028  0.838   6.0336  310.04  0.6726  1.0000  1.6726  0.0669  0.0706  0.40
1701  13:03:17  79.45  0.0093  7.000   6.0313  322.73  0.7128  1.0000  1.7128  0.0208  0.2329  0.40
1702  13:03:42  63.31  0.0190  3.749   6.0289  326.25  0.6756  1.0000  1.6756  0.0357  0.4751  0.40
1703  13:04:09  52.25  0.0153  3.457   6.0266  330.36  0.6860  0.9643  1.6503  0.2000  0.3823  0.40
1704  13:04:34  44.10  0.0093  5.347   6.0243  315.16  0.7842  1.0000  1.7842  0.0744  0.2319  0.40
1705  13:04:57  37.83  0.0099  1.619   6.0219  280.77  0.6689  0.9821  1.6510  0.1201  0.2480  0.40
1706  13:05:23  34.25  0.0087  2.026   6.0196  292.68  0.7164  0.9643  1.6807  0.1257  0.2178  0.40
1707  13:05:49  31.71  0.0223  4.135   6.0173  309.66  0.6220  0.9821  1.6042  0.0792  0.5586  0.40
1708  13:06:14  29.84  0.0043  0.587   6.0149  282.30  0.8750  1.0000  1.8750  0.0398  0.1071  0.40
[2025-03-03 04:04:02,473] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step (advisory as above)
1709  13:06:40  28.51  0.0252  3.384   6.0126  321.05  0.7662  0.9643  1.7305  0.1372  0.6289  0.40
1710  13:07:05  27.59  0.0163  6.051   6.0103  320.20  0.5744  1.0000  1.5744  0.0435  0.4063  0.40
1711  13:07:30  26.66  0.0174  2.983   6.0079  287.80  0.8393  0.9821  1.8214  0.1328  0.4336  0.40
[2025-03-03 04:05:18,381] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1712  13:07:55  26.45  0.0060  1.291   6.0056  285.86  0.6741  0.9821  1.6563  0.0625  0.1493  0.40
1713  13:08:21  26.32  0.0057  1.416   6.0033  310.07  0.8601  1.0000  1.8601  0.0103  0.1431  0.40
1714  13:08:48  26.47  0.0240  5.075   6.0009  338.82  0.8155  1.0000  1.8155  0.0493  0.6006  0.40
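Across these records the logged loss is almost exactly 0.04 x kl (e.g. step 1612: 0.04 x 0.2227 = 0.0089; step 1635: 0.04 x 2.5234 = 0.1009), which suggests the loss is dominated by a KL penalty term beta * KL with beta = 0.04. The coefficient is inferred from the ratio, not stated anywhere in the log:

    # Inferred, not logged: loss ~ beta * kl with beta = 0.04.
    for step, loss, kl in [(1612, 0.0089, 0.22265625),
                           (1629, 0.0506, 1.265625),
                           (1635, 0.1009, 2.5234375)]:
        print(step, loss, round(0.04 * kl, 4))  # pairs match to 4 decimals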
step  time      s/it   loss    grad_n  lr      len     acc     fmt     reward  std     kl      epoch
1715  13:09:14  26.23  0.0082  2.050   5.9986  311.27  0.7232  0.9821  1.7054  0.0602  0.2046  0.40
[2025-03-03 04:07:01,111] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1716  13:09:38  25.63  0.0110  30.766  5.9963  311.23  0.7619  1.0000  1.7619  0.0857  0.2749  0.40
[2025-03-03 04:07:28,818] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1717  13:10:06  26.25  0.0206  2.700   5.9939  326.63  0.7964  0.9821  1.7786  0.0985  0.5137  0.40
1718  13:10:31  25.94  0.0138  2.406   5.9916  300.32  0.7232  1.0000  1.7232  0.0179  0.3447  0.40
1719  13:10:56  25.48  0.0224  2.399   5.9893  317.86  0.6890  1.0000  1.6890  0.0745  0.5615  0.40
1720  13:11:19  24.89  0.0100  1.646   5.9869  258.30  0.6384  1.0000  1.6384  0.0089  0.2498  0.40
1721  13:11:42  24.43  0.0068  5.051   5.9846  273.98  0.5818  1.0000  1.5818  0.0729  0.1707  0.40
1722  13:12:08  24.68  0.0091  10.954  5.9823  294.55  0.7351  0.9821  1.7173  0.1153  0.2275  0.40
1723  13:12:31  24.25  0.0032  1.412   5.9799  262.27  0.7560  1.0000  1.7560  0.0257  0.0789  0.40
[2025-03-03 04:10:20,331] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1724  13:12:57  24.93  0.0017  0.932   5.9776  308.30  0.7887  1.0000  1.7887  0.0460  0.0416  0.40
[2025-03-03 04:10:45,595] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1725  13:13:23  25.03  0.0068  14.524  5.9753  321.57  0.8140  1.0000  1.8140  0.0595  0.1704  0.40
1726  13:13:48  25.01  0.0021  4.739   5.9729  321.45  0.8586  1.0000  1.8586  0.0304  0.0521  0.40
1727  13:14:13  25.05  0.0018  1.612   5.9706  316.34  0.8854  1.0000  1.8854  0.0268  0.0445  0.40
1728  13:14:36  24.58  0.0041  1.579   5.9683  286.55  0.6637  1.0000  1.6637  0.0060  0.1025  0.40
1729  13:14:59  24.13  0.0109  6.628   5.9659  285.91  0.7753  1.0000  1.7753  0.0371  0.2715  0.40
1730  13:15:23  23.97  0.0011  0.282   5.9636  301.27  0.7190  1.0000  1.7190  0.0082  0.0276  0.40
[2025-03-03 04:13:11,047] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1731  13:15:48  24.34  0.0023  7.130   5.9613  278.79  0.6458  1.0000  1.6458  0.0103  0.0575  0.40
[2025-03-03 04:13:35,308] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1732  13:16:12  24.31  0.0068  7.764   5.9589  309.95  0.7155  1.0000  1.7155  0.0855  0.1707  0.40
1733  13:16:37  24.48  0.0029  13.018  5.9566  308.86  0.7336  1.0000  1.7336  0.0327  0.0731  0.40
1734  13:17:01  24.36  0.0078  0.881   5.9543  308.86  0.6104  0.9821  1.5926  0.0343  0.1949  0.40
1735  13:17:25  24.21  0.0018  1.103   5.9519  266.95  0.8125  1.0000  1.8125  0.0811  0.0461  0.40
1736  13:17:49  24.20  0.0167  7.364   5.9496  317.46  0.6533  1.0000  1.6533  0.0268  0.4182  0.41
[2025-03-03 04:15:36,723] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flush since last step (advisory as above)
1737  13:18:14  24.27  0.0062  3.022   5.9473  290.05  0.7217  1.0000  1.7217  0.0870  0.1545  0.41
1738  13:18:39  24.65  0.0047  1.752   5.9449  307.04  0.7336  1.0000  1.7336  0.0327  0.1167  0.41
1739  13:19:05  24.82  0.0055  10.242  5.9426  293.05  0.7368  0.9821  1.7190  0.0762  0.1383  0.41
1740  13:19:30  25.14  0.0046  0.664   5.9403  300.02  0.6667  1.0000  1.6667  0.0000  0.1140  0.41
1741  13:19:55  24.83  0.0033  0.436   5.9379  286.46  0.7063  1.0000  1.7063  0.0264  0.0835  0.41
1742  13:20:17  24.18  0.0084  2.289   5.9356  285.48  0.7411  1.0000  1.7411  0.0536  0.2080  0.41
1743  13:20:40  23.88  0.0072  18.551  5.9333  303.61  0.7083  1.0000  1.7083  0.0137  0.1802  0.41
1744  13:21:03  23.57  0.0017  0.860   5.9309  270.89  0.7277  1.0000  1.7277  0.0371  0.0432  ...
0.41} 41%|████ | 1744/4286 [13:21:03<16:38:32, 23.57s/it][2025-03-03 04:18:52,941] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 41%|████ | 1745/4286 [13:21:30<17:18:56, 24.53s/it] {'loss': 0.0023, 'grad_norm': 0.8038533551975673, 'learning_rate': 5.928604759682687e-07, 'completion_length': 294.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.766369104385376, 'rewards/format_reward': 1.0, 'reward': 1.766369104385376, 'reward_std': 0.0295482249930501, 'kl': 0.0589599609375, 'epoch': 0.41} 41%|████ | 1745/4286 [13:21:30<17:18:56, 24.53s/it] 41%|████ | 1746/4286 [13:21:54<17:06:47, 24.25s/it] {'loss': 0.0065, 'grad_norm': 2.800328401866785, 'learning_rate': 5.92627158189454e-07, 'completion_length': 266.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.01785714365541935, 'kl': 0.1617431640625, 'epoch': 0.41} 41%|████ | 1746/4286 [13:21:54<17:06:47, 24.25s/it][2025-03-03 04:19:40,500] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 41%|████ | 1747/4286 [13:22:18<17:02:31, 24.16s/it] {'loss': 0.0072, 'grad_norm': 3.2406264215914256, 'learning_rate': 5.923938404106393e-07, 'completion_length': 300.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7752977013587952, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.020833336748182774, 'kl': 0.181396484375, 'epoch': 0.41} 41%|████ | 1747/4286 [13:22:18<17:02:31, 24.16s/it] 41%|████ | 1748/4286 [13:22:42<17:00:12, 24.12s/it] {'loss': 0.012, 'grad_norm': 2.0465229467653874, 'learning_rate': 5.921605226318245e-07, 'completion_length': 304.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6190476417541504, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.054739005863666534, 'kl': 0.2998046875, 'epoch': 0.41} 41%|████ | 1748/4286 [13:22:42<17:00:12, 24.12s/it] 41%|████ | 1749/4286 [13:23:06<16:57:28, 24.06s/it] {'loss': 0.0018, 'grad_norm': 0.21859431198073487, 'learning_rate': 5.919272048530097e-07, 'completion_length': 304.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7095238566398621, 'rewards/format_reward': 1.0, 'reward': 1.7095239162445068, 'reward_std': 0.012710805982351303, 'kl': 0.0457763671875, 'epoch': 0.41} 41%|████ | 1749/4286 [13:23:06<16:57:28, 24.06s/it] 41%|████ | 1750/4286 [13:23:30<17:02:51, 24.20s/it] {'loss': 0.0088, 'grad_norm': 4.502714627053602, 'learning_rate': 5.916938870741949e-07, 'completion_length': 291.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7818453013896942, 'rewards/format_reward': 1.0, 'reward': 1.7818453907966614, 'reward_std': 0.03988710232079029, 'kl': 0.21923828125, 'epoch': 0.41} 41%|████ | 1750/4286 [13:23:30<17:02:51, 
24.20s/it] 41%|████ | 1751/4286 [13:23:54<16:56:44, 24.06s/it] {'loss': 0.0048, 'grad_norm': 1.7533117497282122, 'learning_rate': 5.914605692953803e-07, 'completion_length': 305.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.787202388048172, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.03642144054174423, 'kl': 0.119140625, 'epoch': 0.41} 41%|████ | 1751/4286 [13:23:54<16:56:44, 24.06s/it] 41%|████ | 1752/4286 [13:24:16<16:32:13, 23.49s/it] {'loss': 0.0017, 'grad_norm': 1.0960114269804524, 'learning_rate': 5.912272515165655e-07, 'completion_length': 259.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.0267857164144516, 'kl': 0.042724609375, 'epoch': 0.41} 41%|████ | 1752/4286 [13:24:16<16:32:13, 23.49s/it] 41%|████ | 1753/4286 [13:24:38<16:16:40, 23.13s/it] {'loss': 0.0015, 'grad_norm': 0.3564849746426929, 'learning_rate': 5.909939337377507e-07, 'completion_length': 295.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.8169643580913544, 'rewards/format_reward': 1.0, 'reward': 1.8169644474983215, 'reward_std': 0.008928571827709675, 'kl': 0.037353515625, 'epoch': 0.41} 41%|████ | 1753/4286 [13:24:38<16:16:40, 23.13s/it] 41%|████ | 1754/4286 [13:25:02<16:28:36, 23.43s/it] {'loss': 0.0016, 'grad_norm': 0.038675754117869644, 'learning_rate': 5.90760615958936e-07, 'completion_length': 307.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.0, 'kl': 0.039306640625, 'epoch': 0.41} 41%|████ | 1754/4286 [13:25:02<16:28:36, 23.43s/it] 41%|████ | 1755/4286 [13:25:27<16:39:59, 23.71s/it] {'loss': 0.0086, 'grad_norm': 3.171990854656608, 'learning_rate': 5.905272981801213e-07, 'completion_length': 295.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.635416716337204, 'rewards/format_reward': 1.0, 'reward': 1.6354168057441711, 'reward_std': 0.059310125187039375, 'kl': 0.21435546875, 'epoch': 0.41} 41%|████ | 1755/4286 [13:25:27<16:39:59, 23.71s/it] 41%|████ | 1756/4286 [13:25:50<16:36:57, 23.64s/it] {'loss': 0.0022, 'grad_norm': 5.504242013713042, 'learning_rate': 5.902939804013065e-07, 'completion_length': 302.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6773809790611267, 'rewards/format_reward': 1.0, 'reward': 1.6773810386657715, 'reward_std': 0.060485430993139744, 'kl': 0.0540771484375, 'epoch': 0.41} 41%|████ | 1756/4286 [13:25:50<16:36:57, 23.64s/it] 41%|████ | 1757/4286 [13:26:14<16:37:13, 23.66s/it] {'loss': 0.0023, 'grad_norm': 0.5699927766696786, 'learning_rate': 5.900606626224918e-07, 'completion_length': 302.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6349207162857056, 'rewards/format_reward': 1.0, 'reward': 1.6349207162857056, 'reward_std': 0.036981185898184776, 'kl': 0.056640625, 'epoch': 0.41} 41%|████ | 1757/4286 [13:26:14<16:37:13, 23.66s/it] 41%|████ | 1758/4286 [13:26:36<16:18:31, 23.22s/it] {'loss': 0.0022, 'grad_norm': 0.11208110462128901, 'learning_rate': 5.89827344843677e-07, 'completion_length': 251.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.0, 'kl': 0.0555419921875, 'epoch': 0.41} 41%|████ | 1758/4286 [13:26:36<16:18:31, 23.22s/it] 41%|████ | 1759/4286 [13:26:58<16:06:50, 22.96s/it] {'loss': 0.007, 'grad_norm': 1.687931139293302, 'learning_rate': 5.895940270648623e-07, 
'completion_length': 266.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7976190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7976192235946655, 'reward_std': 0.023809521924704313, 'kl': 0.1748046875, 'epoch': 0.41} 41%|████ | 1759/4286 [13:26:58<16:06:50, 22.96s/it] 41%|████ | 1760/4286 [13:27:22<16:11:15, 23.07s/it] {'loss': 0.0014, 'grad_norm': 0.9596216937696465, 'learning_rate': 5.893607092860475e-07, 'completion_length': 258.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8556547462940216, 'rewards/format_reward': 1.0, 'reward': 1.8556548953056335, 'reward_std': 0.0386904776096344, 'kl': 0.03485107421875, 'epoch': 0.41} 41%|████ | 1760/4286 [13:27:22<16:11:15, 23.07s/it] 41%|████ | 1761/4286 [13:27:44<16:04:20, 22.91s/it] {'loss': 0.0015, 'grad_norm': 0.041518834656272825, 'learning_rate': 5.891273915072328e-07, 'completion_length': 265.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.0, 'kl': 0.03857421875, 'epoch': 0.41} 41%|████ | 1761/4286 [13:27:44<16:04:20, 22.91s/it] 41%|████ | 1762/4286 [13:28:08<16:14:28, 23.16s/it] {'loss': 0.0015, 'grad_norm': 0.2141825881931344, 'learning_rate': 5.88894073728418e-07, 'completion_length': 275.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.761904776096344, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.006873216480016708, 'kl': 0.0364990234375, 'epoch': 0.41} 41%|████ | 1762/4286 [13:28:08<16:14:28, 23.16s/it] 41%|████ | 1763/4286 [13:28:32<16:18:15, 23.26s/it] {'loss': 0.002, 'grad_norm': 2.238527204498185, 'learning_rate': 5.886607559496033e-07, 'completion_length': 277.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098215818405151, 'reward_std': 0.043122400529682636, 'kl': 0.0496826171875, 'epoch': 0.41} 41%|████ | 1763/4286 [13:28:32<16:18:15, 23.26s/it] 41%|████ | 1764/4286 [13:28:53<15:58:21, 22.80s/it] {'loss': 0.0026, 'grad_norm': 0.28515231609396324, 'learning_rate': 5.884274381707886e-07, 'completion_length': 225.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.011904759332537651, 'kl': 0.0660400390625, 'epoch': 0.41} 41%|████ | 1764/4286 [13:28:53<15:58:21, 22.80s/it] 41%|████ | 1765/4286 [13:29:17<16:04:23, 22.95s/it] {'loss': 0.0015, 'grad_norm': 0.07695757442710228, 'learning_rate': 5.881941203919738e-07, 'completion_length': 297.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.7738095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7738096117973328, 'reward_std': 0.0, 'kl': 0.0380859375, 'epoch': 0.41} 41%|████ | 1765/4286 [13:29:17<16:04:23, 22.95s/it] 41%|████ | 1766/4286 [13:29:38<15:46:52, 22.54s/it] {'loss': 0.0018, 'grad_norm': 0.057505405157633555, 'learning_rate': 5.87960802613159e-07, 'completion_length': 267.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.8273810148239136, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.0, 'kl': 0.0455322265625, 'epoch': 0.41} 41%|████ | 1766/4286 [13:29:38<15:46:52, 22.54s/it] 41%|████ | 1767/4286 [13:30:01<15:46:23, 22.54s/it] {'loss': 0.002, 'grad_norm': 10.288615209355354, 'learning_rate': 5.877274848343444e-07, 'completion_length': 260.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5833333730697632, 'rewards/format_reward': 1.0, 'reward': 1.583333432674408, 
'reward_std': 0.12946413457393646, 'kl': 0.050537109375, 'epoch': 0.41} 41%|████ | 1767/4286 [13:30:01<15:46:23, 22.54s/it] 41%|████▏ | 1768/4286 [13:30:24<15:56:21, 22.79s/it] {'loss': 0.0129, 'grad_norm': 0.6987637987432679, 'learning_rate': 5.874941670555296e-07, 'completion_length': 315.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7931548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7931548953056335, 'reward_std': 0.050841979682445526, 'kl': 0.3212890625, 'epoch': 0.41} 41%|████▏ | 1768/4286 [13:30:24<15:56:21, 22.79s/it] 41%|████▏ | 1769/4286 [13:30:47<15:58:10, 22.84s/it] {'loss': 0.0044, 'grad_norm': 4.589369562711272, 'learning_rate': 5.872608492767148e-07, 'completion_length': 289.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7633929252624512, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.032738094218075275, 'kl': 0.1092529296875, 'epoch': 0.41} 41%|████▏ | 1769/4286 [13:30:47<15:58:10, 22.84s/it] 41%|████▏ | 1770/4286 [13:31:10<15:55:35, 22.79s/it] {'loss': 0.0016, 'grad_norm': 0.5689399353333148, 'learning_rate': 5.870275314979e-07, 'completion_length': 286.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8041667342185974, 'rewards/format_reward': 1.0, 'reward': 1.804166853427887, 'reward_std': 0.05061507550999522, 'kl': 0.039306640625, 'epoch': 0.41} 41%|████▏ | 1770/4286 [13:31:10<15:55:35, 22.79s/it] 41%|████▏ | 1771/4286 [13:31:33<15:55:24, 22.79s/it] {'loss': 0.0038, 'grad_norm': 2.1145254450535758, 'learning_rate': 5.867942137190854e-07, 'completion_length': 282.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8244048357009888, 'rewards/format_reward': 1.0, 'reward': 1.8244048357009888, 'reward_std': 0.017857137601822615, 'kl': 0.094970703125, 'epoch': 0.41} 41%|████▏ | 1771/4286 [13:31:33<15:55:24, 22.79s/it] 41%|████▏ | 1772/4286 [13:31:56<16:00:27, 22.92s/it] {'loss': 0.0019, 'grad_norm': 2.971304175498167, 'learning_rate': 5.865608959402706e-07, 'completion_length': 227.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.7232143878936768, 'reward_std': 0.01785714365541935, 'kl': 0.0472412109375, 'epoch': 0.41} 41%|████▏ | 1772/4286 [13:31:56<16:00:27, 22.92s/it] 41%|████▏ | 1773/4286 [13:32:18<15:53:11, 22.76s/it] {'loss': 0.0013, 'grad_norm': 0.3779087255149699, 'learning_rate': 5.863275781614558e-07, 'completion_length': 284.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.019238397479057312, 'kl': 0.03204345703125, 'epoch': 0.41} 41%|████▏ | 1773/4286 [13:32:18<15:53:11, 22.76s/it] 41%|████▏ | 1774/4286 [13:32:41<15:58:05, 22.88s/it] {'loss': 0.0018, 'grad_norm': 0.11237518674311339, 'learning_rate': 5.860942603826411e-07, 'completion_length': 304.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.0, 'kl': 0.0460205078125, 'epoch': 0.41} 41%|████▏ | 1774/4286 [13:32:41<15:58:05, 22.88s/it] 41%|████▏ | 1775/4286 [13:33:04<15:51:48, 22.74s/it] {'loss': 0.0016, 'grad_norm': 5.119208000676783, 'learning_rate': 5.858609426038263e-07, 'completion_length': 283.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7857143878936768, 'rewards/format_reward': 1.0, 'reward': 1.7857144474983215, 'reward_std': 0.0476190485060215, 'kl': 0.04010009765625, 'epoch': 0.41} 41%|████▏ | 1775/4286 [13:33:04<15:51:48, 22.74s/it] 
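The recurring stage3.py warnings above recommend synchronized get_accelerator().empty_cache() calls in the training loop. A minimal sketch of how that advice could be applied, assuming a standard DeepSpeed engine loop; engine, dataloader, and the flush interval are illustrative placeholders, not taken from this run:

# Sketch only: flush the allocator cache on all ranks at the same step,
# as the stage3.py warning suggests, instead of letting each rank flush
# ad hoc under memory pressure. `engine` and `dataloader` are placeholders.
from deepspeed.accelerator import get_accelerator

def train_epoch(engine, dataloader, flush_every=100):
    for step, batch in enumerate(dataloader):
        loss = engine(batch)      # forward pass
        engine.backward(loss)     # ZeRO-aware backward
        engine.step()             # optimizer step
        if step % flush_every == 0:
            # Every rank reaches this branch at the same step, so the
            # cache flushes happen together across the world_size ranks.
            get_accelerator().empty_cache()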
41%|████▏ | 1776/4286 [13:33:26<15:46:33, 22.63s/it] {'loss': 0.0053, 'grad_norm': 50.540775327721214, 'learning_rate': 5.856276248250116e-07, 'completion_length': 283.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.022675003856420517, 'kl': 0.1334228515625, 'epoch': 0.41}
41%|████▏ | 1777/4286 [13:33:48<15:31:50, 22.28s/it] {'loss': 0.0096, 'grad_norm': 1.5473104252912877, 'learning_rate': 5.853943070461969e-07, 'completion_length': 233.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.0, 'kl': 0.240478515625, 'epoch': 0.41}
41%|████▏ | 1778/4286 [13:34:10<15:34:46, 22.36s/it] {'loss': 0.0118, 'grad_norm': 5.83629167144717, 'learning_rate': 5.851609892673821e-07, 'completion_length': 278.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 1.0, 'reward': 1.5937501192092896, 'reward_std': 0.043047917541116476, 'kl': 0.2943115234375, 'epoch': 0.41}
42%|████▏ | 1779/4286 [13:34:33<15:46:55, 22.66s/it] {'loss': 0.0016, 'grad_norm': 7.288008251696495, 'learning_rate': 5.849276714885673e-07, 'completion_length': 258.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875001192092896, 'reward_std': 0.010309826582670212, 'kl': 0.0389404296875, 'epoch': 0.42}
42%|████▏ | 1780/4286 [13:34:56<15:47:13, 22.68s/it] {'loss': 0.0014, 'grad_norm': 0.35317061935981475, 'learning_rate': 5.846943537097527e-07, 'completion_length': 279.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.05633394047617912, 'kl': 0.0340576171875, 'epoch': 0.42}
42%|████▏ | 1781/4286 [13:35:17<15:28:56, 22.25s/it] {'loss': 0.0012, 'grad_norm': 4.227774985124232, 'learning_rate': 5.844610359309379e-07, 'completion_length': 231.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.8125001192092896, 'rewards/format_reward': 1.0, 'reward': 1.8125000596046448, 'reward_std': 0.054739005863666534, 'kl': 0.0296630859375, 'epoch': 0.42}
42%|████▏ | 1782/4286 [13:35:39<15:22:21, 22.10s/it] {'loss': 0.0017, 'grad_norm': 1.568364885643669, 'learning_rate': 5.842277181521231e-07, 'completion_length': 280.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.630952388048172, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.07404043339192867, 'kl': 0.0416259765625, 'epoch': 0.42}
42%|████▏ | 1783/4286 [13:36:02<15:25:05, 22.18s/it] {'loss': 0.0025, 'grad_norm': 2.01101612819439, 'learning_rate': 5.839944003733083e-07, 'completion_length': 280.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6086310148239136, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.0295482249930501, 'kl': 0.061767578125, 'epoch': 0.42}
42%|████▏ | 1784/4286 [13:36:24<15:33:47, 22.39s/it] {'loss': 0.0017, 'grad_norm': 0.8344216639219755, 'learning_rate': 5.837610825944937e-07, 'completion_length': 272.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187501788139343, 'reward_std': 0.03114316239953041, 'kl': 0.0419921875, 'epoch': 0.42}
42%|████▏ | 1785/4286 [13:36:47<15:39:39, 22.54s/it] {'loss': 0.0023, 'grad_norm': 0.6349465085975725, 'learning_rate': 5.835277648156789e-07, 'completion_length': 257.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7886905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7886905670166016, 'reward_std': 0.01785714365541935, 'kl': 0.0577392578125, 'epoch': 0.42}
42%|████▏ | 1786/4286 [13:37:09<15:33:02, 22.39s/it] {'loss': 0.0016, 'grad_norm': 0.36611527551575035, 'learning_rate': 5.832944470368641e-07, 'completion_length': 294.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.8342262804508209, 'rewards/format_reward': 1.0, 'reward': 1.8342263102531433, 'reward_std': 0.00865892879664898, 'kl': 0.038818359375, 'epoch': 0.42}
42%|████▏ | 1787/4286 [13:37:33<15:53:01, 22.88s/it] {'loss': 0.0032, 'grad_norm': 0.7743351812916957, 'learning_rate': 5.830611292580494e-07, 'completion_length': 286.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6636905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6636906862258911, 'reward_std': 0.03289870172739029, 'kl': 0.0787353515625, 'epoch': 0.42}
42%|████▏ | 1788/4286 [13:37:56<15:46:34, 22.74s/it] {'loss': 0.0014, 'grad_norm': 2.272746973867589, 'learning_rate': 5.828278114792347e-07, 'completion_length': 268.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.014880949631333351, 'kl': 0.03515625, 'epoch': 0.42}
42%|████▏ | 1789/4286 [13:38:18<15:43:24, 22.67s/it] {'loss': 0.0015, 'grad_norm': 12.688893195356858, 'learning_rate': 5.825944937004199e-07, 'completion_length': 281.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.8678571879863739, 'rewards/format_reward': 1.0, 'reward': 1.867857277393341, 'reward_std': 0.0749770924448967, 'kl': 0.0384521484375, 'epoch': 0.42}
42%|████▏ | 1790/4286 [13:38:41<15:41:47, 22.64s/it] {'loss': 0.0017, 'grad_norm': 0.1369494577586139, 'learning_rate': 5.823611759216052e-07, 'completion_length': 251.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.0, 'kl': 0.0428466796875, 'epoch': 0.42}
42%|████▏ | 1791/4286 [13:39:05<16:05:50, 23.23s/it] {'loss': 0.0017, 'grad_norm': 1.2368604931773557, 'learning_rate': 5.821278581427904e-07, 'completion_length': 300.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7544643878936768, 'reward_std': 0.10257173422724009, 'kl': 0.0428466796875, 'epoch': 0.42}
42%|████▏ | 1792/4286 [13:39:29<16:04:43, 23.21s/it] {'loss': 0.003, 'grad_norm': 0.582490180346856, 'learning_rate': 5.818945403639757e-07, 'completion_length': 270.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.01626221090555191, 'kl': 0.0740966796875, 'epoch': 0.42}
42%|████▏ | 1793/4286 [13:39:51<15:59:50, 23.10s/it] {'loss': 0.0021, 'grad_norm': 0.9367496014503339, 'learning_rate': 5.816612225851609e-07, 'completion_length': 284.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.8214287161827087, 'reward_std': 0.049460723996162415, 'kl': 0.0535888671875, 'epoch': 0.42}
42%|████▏ | 1794/4286 [13:40:15<16:00:44, 23.13s/it] {'loss': 0.0049, 'grad_norm': 0.9259689437928922, 'learning_rate': 5.814279048063462e-07, 'completion_length': 260.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5863095372915268, 'rewards/format_reward': 1.0, 'reward': 1.5863096117973328, 'reward_std': 0.01785714365541935, 'kl': 0.122802734375, 'epoch': 0.42}
42%|████▏ | 1795/4286 [13:40:37<15:50:43, 22.90s/it] {'loss': 0.007, 'grad_norm': 6.973902534997177, 'learning_rate': 5.811945870275314e-07, 'completion_length': 281.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7589286863803864, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.07603275403380394, 'kl': 0.17578125, 'epoch': 0.42}
42%|████▏ | 1796/4286 [13:41:01<15:58:53, 23.11s/it] {'loss': 0.0037, 'grad_norm': 8.55705717280165, 'learning_rate': 5.809612692487166e-07, 'completion_length': 270.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.71577388048172, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.03869047574698925, 'kl': 0.093505859375, 'epoch': 0.42}
42%|████▏ | 1797/4286 [13:41:23<15:50:41, 22.92s/it] {'loss': 0.0046, 'grad_norm': 14.091069912133964, 'learning_rate': 5.80727951469902e-07, 'completion_length': 298.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6681548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6681548953056335, 'reward_std': 0.019238397479057312, 'kl': 0.1148681640625, 'epoch': 0.42}
42%|████▏ | 1798/4286 [13:41:46<15:45:00, 22.79s/it] {'loss': 0.0017, 'grad_norm': 0.7756547933640495, 'learning_rate': 5.804946336910872e-07, 'completion_length': 266.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.032524414360523224, 'kl': 0.041748046875, 'epoch': 0.42}
42%|████▏ | 1799/4286 [13:42:06<15:19:00, 22.17s/it] {'loss': 0.002, 'grad_norm': 0.2448009938128064, 'learning_rate': 5.802613159122724e-07, 'completion_length': 230.33930206298828, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797619700431824, 'reward_std': 0.0, 'kl': 0.0499267578125, 'epoch': 0.42}
42%|████▏ | 1800/4286 [13:42:30<15:42:42, 22.75s/it] {'loss': 0.0064, 'grad_norm': 0.8963770095976569, 'learning_rate': 5.800279981334577e-07, 'completion_length': 299.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.0, 'kl': 0.160400390625, 'epoch': 0.42}
42%|████▏ | 1801/4286 [13:46:30<60:40:44, 87.91s/it] {'loss': 0.0035, 'grad_norm': 0.520554851655414, 'learning_rate': 5.79794680354643e-07, 'completion_length': 279.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.018777981400489807, 'kl': 0.0870361328125, 'epoch': 0.42}
42%|████▏ | 1802/4286 [13:46:53<47:08:09, 68.31s/it] {'loss': 0.0014, 'grad_norm': 0.38758846815603215, 'learning_rate': 5.795613625758282e-07, 'completion_length': 248.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.7931548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7931548357009888, 'reward_std': 0.0327381007373333, 'kl': 0.03424072265625, 'epoch': 0.42}
42%|████▏ | 1803/4286 [13:47:16<37:42:44, 54.68s/it] {'loss': 0.0042, 'grad_norm': 3.357312423650618, 'learning_rate': 5.793280447970135e-07, 'completion_length': 289.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.6815476417541504, 'rewards/format_reward': 1.0, 'reward': 1.68154776096344, 'reward_std': 0.05289733596146107, 'kl': 0.105712890625, 'epoch': 0.42}
42%|████▏ | 1804/4286 [13:47:38<31:03:19, 45.04s/it] {'loss': 0.004, 'grad_norm': 4.347244593405955, 'learning_rate': 5.790947270181987e-07, 'completion_length': 277.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.06388125382363796, 'kl': 0.1009521484375, 'epoch': 0.42}
42%|████▏ | 1805/4286 [13:48:02<26:36:12, 38.60s/it] {'loss': 0.0031, 'grad_norm': 4.735865789865303, 'learning_rate': 5.78861409239384e-07, 'completion_length': 305.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6833333969116211, 'rewards/format_reward': 1.0, 'reward': 1.6833335161209106, 'reward_std': 0.04892143979668617, 'kl': 0.0767822265625, 'epoch': 0.42}
42%|████▏ | 1806/4286 [13:48:25<23:21:54, 33.92s/it] {'loss': 0.0025, 'grad_norm': 4.068335512586853, 'learning_rate': 5.786280914605692e-07, 'completion_length': 290.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 1.0, 'reward': 1.7098215818405151, 'reward_std': 0.06710418313741684, 'kl': 0.063720703125, 'epoch': 0.42}
42%|████▏ | 1807/4286 [13:48:48<21:09:56, 30.74s/it] {'loss': 0.0061, 'grad_norm': 5.582670180860772, 'learning_rate': 5.783947736817545e-07, 'completion_length': 283.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.01785714365541935, 'kl': 0.1517333984375, 'epoch': 0.42}
42%|████▏ | 1808/4286 [13:49:11<19:31:02, 28.35s/it] {'loss': 0.0044, 'grad_norm': 4.530389343810457, 'learning_rate': 5.781614559029397e-07, 'completion_length': 302.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7455357611179352, 'rewards/format_reward': 1.0, 'reward': 1.7455357909202576, 'reward_std': 0.05084197223186493, 'kl': 0.11083984375, 'epoch': 0.42}
42%|████▏ | 1809/4286 [13:49:33<18:11:41, 26.44s/it] {'loss': 0.0022, 'grad_norm': 1.187048192107145, 'learning_rate': 5.77928138124125e-07, 'completion_length': 267.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7946429252624512, 'reward_std': 0.01785714365541935, 'kl': 0.0538330078125, 'epoch': 0.42}
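In each of these records the total 'reward' is, up to float rounding, the sum of the two component rewards. A quick consistency check against the step 1722 values logged above (illustration only, not part of the training code):

# 'reward' == accuracy component + format component, e.g. at step 1722:
acc = 0.7351190745830536  # rewards/only_full_func_accuracy_reward
fmt = 0.9821428656578064  # rewards/format_reward
assert abs((acc + fmt) - 1.7172619700431824) < 1e-6  # logged 'reward'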
42%|████▏ | 1810/4286 [13:49:57<17:36:41, 25.61s/it] {'loss': 0.0046, 'grad_norm': 2.407280871385769, 'learning_rate': 5.776948203453103e-07, 'completion_length': 277.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.03709553740918636, 'kl': 0.115966796875, 'epoch': 0.42}
42%|████▏ | 1811/4286 [13:50:20<17:01:43, 24.77s/it] {'loss': 0.002, 'grad_norm': 0.1591240318974089, 'learning_rate': 5.774615025664955e-07, 'completion_length': 297.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.8333334028720856, 'rewards/format_reward': 1.0, 'reward': 1.8333334922790527, 'reward_std': 0.011904759332537651, 'kl': 0.0496826171875, 'epoch': 0.42}
42%|████▏ | 1812/4286 [13:50:43<16:44:56, 24.37s/it] {'loss': 0.005, 'grad_norm': 6.073469990453766, 'learning_rate': 5.772281847876807e-07, 'completion_length': 269.2143020629883, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.07762768864631653, 'kl': 0.1253662109375, 'epoch': 0.42}
42%|████▏ | 1813/4286 [13:51:06<16:25:55, 23.92s/it] {'loss': 0.0031, 'grad_norm': 0.20520290818214695, 'learning_rate': 5.769948670088661e-07, 'completion_length': 290.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.0, 'kl': 0.076904296875, 'epoch': 0.42}
42%|████▏ | 1814/4286 [13:51:29<16:15:25, 23.68s/it] {'loss': 0.0051, 'grad_norm': 3.6171144359478498, 'learning_rate': 5.767615492300513e-07, 'completion_length': 290.75, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.08885835483670235, 'kl': 0.12841796875, 'epoch': 0.42}
42%|████▏ | 1815/4286 [13:51:52<16:13:24, 23.64s/it] {'loss': 0.0054, 'grad_norm': 2.6847411792225855, 'learning_rate': 5.765282314512365e-07, 'completion_length': 260.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.1039699986577034, 'kl': 0.13525390625, 'epoch': 0.42}
42%|████▏ | 1816/4286 [13:52:16<16:07:55, 23.51s/it] {'loss': 0.0055, 'grad_norm': 2.017142665290356, 'learning_rate': 5.762949136724217e-07, 'completion_length': 305.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.8318452835083008, 'rewards/format_reward': 1.0, 'reward': 1.8318454027175903, 'reward_std': 0.051762811839580536, 'kl': 0.13671875, 'epoch': 0.42}
42%|████▏ | 1817/4286 [13:52:38<15:50:48, 23.11s/it] {'loss': 0.0044, 'grad_norm': 2.0212989910663057, 'learning_rate': 5.760615958936071e-07, 'completion_length': 283.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.8065476417541504, 'rewards/format_reward': 1.0, 'reward': 1.8065477013587952, 'reward_std': 0.01785714365541935, 'kl': 0.10888671875, 'epoch': 0.42}
42%|████▏ | 1818/4286 [13:53:01<15:52:42, 23.16s/it] {'loss': 0.0089, 'grad_norm': 1.9197984557387946, 'learning_rate': 5.758282781147923e-07, 'completion_length': 258.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.7324405014514923, 'rewards/format_reward': 1.0, 'reward': 1.7324405908584595, 'reward_std': 0.05605800449848175, 'kl': 0.2218017578125, 'epoch': 0.42}
42%|████▏ | 1819/4286 [13:53:25<16:05:30, 23.48s/it] {'loss': 0.0014, 'grad_norm': 1.8032748880222451, 'learning_rate': 5.755949603359775e-07, 'completion_length': 286.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7053572535514832, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.04413222521543503, 'kl': 0.03369140625, 'epoch': 0.42}
42%|████▏ | 1820/4286 [13:53:49<16:12:09, 23.65s/it] {'loss': 0.0018, 'grad_norm': 0.25377619629993975, 'learning_rate': 5.753616425571628e-07, 'completion_length': 298.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053571939468384, 'reward_std': 0.01785714365541935, 'kl': 0.045654296875, 'epoch': 0.42}
42%|████▏ | 1821/4286 [13:54:12<15:59:39, 23.36s/it] {'loss': 0.0024, 'grad_norm': 0.33627431885659415, 'learning_rate': 5.75128324778348e-07, 'completion_length': 272.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7157738208770752, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.019238397479057312, 'kl': 0.0594482421875, 'epoch': 0.42}
43%|████▎ | 1822/4286 [13:54:35<15:56:03, 23.28s/it] {'loss': 0.0088, 'grad_norm': 3.021715140828276, 'learning_rate': 5.748950069995333e-07, 'completion_length': 288.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7678572237491608, 'rewards/format_reward': 1.0, 'reward': 1.767857313156128, 'reward_std': 0.037930249236524105, 'kl': 0.218994140625, 'epoch': 0.43}
43%|████▎ | 1823/4286 [13:54:59<15:59:24, 23.37s/it] {'loss': 0.0027, 'grad_norm': 1.0869315072745687, 'learning_rate': 5.746616892207186e-07, 'completion_length': 292.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7083333432674408, 'rewards/format_reward': 1.0, 'reward': 1.7083334922790527, 'reward_std': 0.039397625252604485, 'kl': 0.0673828125, 'epoch': 0.43}
43%|████▎ | 1824/4286 [13:55:23<16:03:46, 23.49s/it] {'loss': 0.0059, 'grad_norm': 3.4749637430168883, 'learning_rate': 5.744283714419038e-07, 'completion_length': 279.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.089432034175843, 'kl': 0.1484375, 'epoch': 0.43}
43%|████▎ | 1825/4286 [13:55:46<15:57:36, 23.35s/it] {'loss': 0.006, 'grad_norm': 2.2208382660373345, 'learning_rate': 5.74195053663089e-07, 'completion_length': 290.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6726191341876984, 'rewards/format_reward': 1.0, 'reward': 1.672619104385376, 'reward_std': 0.0654761902987957, 'kl': 0.149169921875, 'epoch': 0.43}
43%|████▎ | 1826/4286 [13:56:09<15:57:04, 23.34s/it] {'loss': 0.0087, 'grad_norm': 4.658717763832951, 'learning_rate': 5.739617358842744e-07, 'completion_length': 284.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.8511904776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.833333432674408, 'reward_std': 0.07142856903374195, 'kl': 0.218505859375, 'epoch': 0.43}
43%|████▎ | 1827/4286 [13:56:34<16:15:55, 23.81s/it] {'loss': 0.02, 'grad_norm': 6.8986359629390615, 'learning_rate': 5.737284181054596e-07, 'completion_length': 327.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7053572237491608, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.07327024638652802, 'kl': 0.5, 'epoch': 0.43}
43%|████▎ | 1828/4286 [13:56:57<16:05:37, 23.57s/it] {'loss': 0.0047, 'grad_norm': 7.30058638479648, 'learning_rate': 5.734951003266448e-07, 'completion_length': 290.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7872024476528168, 'rewards/format_reward': 1.0, 'reward': 1.787202537059784, 'reward_std': 0.07876220345497131, 'kl': 0.117431640625, 'epoch': 0.43}
43%|████▎ | 1829/4286 [13:57:20<15:54:52, 23.32s/it] {'loss': 0.0046, 'grad_norm': 0.8537058605683554, 'learning_rate': 5.7326178254783e-07, 'completion_length': 281.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7738096117973328, 'reward_std': 0.02380952052772045, 'kl': 0.1158447265625, 'epoch': 0.43}
43%|████▎ | 1830/4286 [13:57:43<15:54:16, 23.31s/it] {'loss': 0.0162, 'grad_norm': 2.4840963614731013, 'learning_rate': 5.730284647690154e-07, 'completion_length': 288.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.662202388048172, 'rewards/format_reward': 1.0, 'reward': 1.662202537059784, 'reward_std': 0.020833331160247326, 'kl': 0.404296875, 'epoch': 0.43}
43%|████▎ | 1831/4286 [13:58:05<15:43:58, 23.07s/it] {'loss': 0.0109, 'grad_norm': 1.1046448030454064, 'learning_rate': 5.727951469902006e-07, 'completion_length': 279.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.011904762359336019, 'kl': 0.271484375, 'epoch': 0.43}
43%|████▎ | 1832/4286 [13:58:28<15:41:40, 23.02s/it] {'loss': 0.0124, 'grad_norm': 3.9348459571699235, 'learning_rate': 5.725618292113858e-07, 'completion_length': 259.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.8511905670166016, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.833333432674408, 'reward_std': 0.06639703689143062, 'kl': 0.30810546875, 'epoch': 0.43}
43%|████▎ | 1833/4286 [13:58:52<15:48:53, 23.21s/it] {'loss': 0.0099, 'grad_norm': 8.066963872586948, 'learning_rate': 5.723285114325711e-07, 'completion_length': 281.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.06388125568628311, 'kl': 0.24658203125, 'epoch': 0.43}
43%|████▎ | 1834/4286 [13:59:15<15:46:57, 23.17s/it] {'loss': 0.006, 'grad_norm': 4.446684285625363, 'learning_rate': 5.720951936537564e-07, 'completion_length': 292.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.07578602060675621, 'kl': 0.150390625, 'epoch': 0.43}
43%|████▎ | 1835/4286 [13:59:37<15:37:26, 22.95s/it] {'loss': 0.0029, 'grad_norm': 3.484792730970527, 'learning_rate': 5.718618758749416e-07, 'completion_length': 272.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.029761902987957, 'kl': 0.073486328125, 'epoch': 0.43}
43%|████▎ | 1836/4286 [13:59:59<15:24:34, 22.64s/it] {'loss': 0.0296, 'grad_norm': 8.411100694017193, 'learning_rate': 5.716285580961269e-07, 'completion_length': 285.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6324405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6145833730697632, 'reward_std': 0.12478631921112537, 'kl': 0.74072265625, 'epoch': 0.43}
43%|████▎ | 1837/4286 [14:00:22<15:24:49, 22.66s/it] {'loss': 0.0445, 'grad_norm': 11.331118531493983, 'learning_rate': 5.713952403173121e-07, 'completion_length': 254.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.49412205815315247, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.440550684928894, 'reward_std': 0.16788744181394577, 'kl': 1.115234375, 'epoch': 0.43}
43%|████▎ | 1838/4286 [14:00:46<15:43:22, 23.12s/it] {'loss': 0.0182, 'grad_norm': 4.762576161902549, 'learning_rate': 5.711619225384974e-07, 'completion_length': 311.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857144474983215, 'reward_std': 0.0357142873108387, 'kl': 0.453125, 'epoch': 0.43}
43%|████▎ | 1839/4286 [14:01:09<15:38:02, 23.00s/it] {'loss': 0.0231, 'grad_norm': 8.271872389680805, 'learning_rate': 5.709286047596826e-07, 'completion_length': 302.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7802580296993256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7445437908172607, 'reward_std': 0.13867521658539772, 'kl': 0.576171875, 'epoch': 0.43}
43%|████▎ | 1840/4286 [14:01:33<15:47:11, 23.23s/it] {'loss': 0.0262, 'grad_norm': 6.260126978269522, 'learning_rate': 5.706952869808679e-07, 'completion_length': 298.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.026785715483129025, 'kl': 0.654296875, 'epoch': 0.43}
43%|████▎ | 1841/4286 [14:01:56<15:41:10, 23.10s/it] {'loss': 0.0238, 'grad_norm': 5.986206455687373, 'learning_rate': 5.704619692020531e-07, 'completion_length': 293.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6800595223903656, 'rewards/format_reward': 1.0, 'reward': 1.6800596714019775, 'reward_std': 0.06391431298106909, 'kl': 0.59375, 'epoch': 0.43}
43%|████▎ | 1842/4286 [14:02:20<15:51:55, 23.37s/it] {'loss': 0.0408, 'grad_norm': 12.165298813831633, 'learning_rate': 5.702286514232384e-07, 'completion_length': 305.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5967262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5788691639900208, 'reward_std': 0.08451610058546066, 'kl': 1.017578125, 'epoch': 0.43}
43%|████▎ | 1843/4286 [14:02:43<15:56:29, 23.49s/it] {'loss': 0.0158, 'grad_norm': 13.439998496817088, 'learning_rate': 5.699953336444237e-07, 'completion_length': 295.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6681548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6502977013587952, 'reward_std': 0.12326417304575443, 'kl': 0.3955078125, 'epoch': 0.43}
43%|████▎ | 1844/4286 [14:03:06<15:49:12, 23.32s/it] {'loss': 0.0143, 'grad_norm': 1.539328245983175, 'learning_rate': 5.697620158656089e-07, 'completion_length': 273.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7752976417541504, 'rewards/format_reward': 1.0, 'reward': 1.7752978205680847, 'reward_std': 0.049000298604369164, 'kl': 0.3583984375, 'epoch': 0.43}
43%|████▎ | 1845/4286 [14:03:28<15:34:41, 22.97s/it] {'loss': 0.0246, 'grad_norm': 8.22507557039817, 'learning_rate': 5.695286980867941e-07, 'completion_length': 272.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 1.0, 'reward': 1.8020834922790527, 'reward_std': 0.008928571827709675, 'kl': 0.6142578125, 'epoch': 0.43}
43%|████▎ | 1846/4286 [14:03:51<15:35:14, 23.00s/it] {'loss': 0.0107, 'grad_norm': 14.964802658887928, 'learning_rate': 5.692953803079795e-07, 'completion_length': 290.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.6424851715564728, 'rewards/format_reward': 1.0, 'reward': 1.6424852013587952, 'reward_std': 0.0721726231276989, 'kl': 0.266357421875, 'epoch': 0.43}
43%|████▎ | 1847/4286 [14:04:15<15:45:53, 23.27s/it] {'loss': 0.0191, 'grad_norm': 9.300700478829418, 'learning_rate': 5.690620625291647e-07, 'completion_length': 231.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.04311095643788576, 'kl': 0.47802734375, 'epoch': 0.43}
43%|████▎ | 1848/4286 [14:04:39<15:51:12, 23.41s/it] {'loss': 0.01, 'grad_norm': 3.351968890049943, 'learning_rate': 5.688287447503499e-07, 'completion_length': 281.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.04007173515856266, 'kl': 0.25, 'epoch': 0.43}
43%|████▎ | 1849/4286 [14:05:04<16:03:19, 23.72s/it] {'loss': 0.0028, 'grad_norm': 28.253360696815374, 'learning_rate': 5.685954269715352e-07, 'completion_length': 278.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7470238506793976, 'rewards/format_reward': 1.0, 'reward': 1.74702388048172, 'reward_std': 0.038476794958114624, 'kl': 0.070068359375, 'epoch': 0.43}
43%|████▎ | 1850/4286 [14:05:28<16:10:45, 23.91s/it] {'loss': 0.029, 'grad_norm': 2.0113592973881644, 'learning_rate': 5.683621091927204e-07, 'completion_length': 258.2678756713867, 'rewards/only_full_func_accuracy_reward': 0.5800595581531525, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.562202513217926, 'reward_std': 0.09859584271907806, 'kl': 0.7265625, 'epoch': 0.43}
43%|████▎ | 1851/4286 [14:05:52<16:12:10, 23.96s/it] {'loss': 0.0271, 'grad_norm': 2.2132022347466287, 'learning_rate': 5.681287914139057e-07, 'completion_length': 300.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6264881193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6086310744285583, 'reward_std': 0.07360392808914185, 'kl': 0.6776123046875, 'epoch': 0.43}
43%|████▎ | 1852/4286 [14:06:15<15:55:05, 23.54s/it] {'loss': 0.0063, 'grad_norm': 5.7806190087806355, 'learning_rate': 5.678954736350909e-07, 'completion_length': 291.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.7232142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7053572535514832, 'reward_std': 0.10352562740445137, 'kl': 0.15869140625, 'epoch': 0.43}
43%|████▎ | 1853/4286 [14:06:39<16:08:19, 23.88s/it] {'loss': 0.0171, 'grad_norm': 3.5742957525450803, 'learning_rate': 5.676621558562762e-07, 'completion_length': 296.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.81101194024086, 'rewards/format_reward': 1.0, 'reward': 1.8110119104385376, 'reward_std': 0.03869048226624727, 'kl': 0.426513671875, 'epoch': 0.43}
43%|████▎ | 1854/4286 [14:07:03<16:04:26, 23.79s/it] {'loss': 0.0097, 'grad_norm': 2.899580508765272, 'learning_rate': 5.674288380774614e-07, 'completion_length': 295.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.4386904835700989, 'rewards/format_reward': 1.0, 'reward': 1.4386906027793884, 'reward_std': 0.024191079661250114, 'kl': 0.24365234375, 'epoch': 0.43}
43%|████▎ | 1855/4286 [14:07:25<15:49:13, 23.43s/it] {'loss': 0.0078, 'grad_norm': 4.271542794871575, 'learning_rate': 5.671955202986467e-07, 'completion_length': 251.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.8556548357009888, 'rewards/format_reward': 1.0, 'reward': 1.8556548953056335, 'reward_std': 0.026785715483129025, 'kl': 0.1956787109375, 'epoch': 0.43}
[2025-03-03 05:05:12,056] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
43%|████▎ | 1856/4286 [14:07:49<15:53:01, 23.53s/it] {'loss': 0.0128, 'grad_norm': 2.6904966832220922, 'learning_rate': 5.66962202519832e-07, 'completion_length': 221.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.017183048650622368, 'kl': 0.3173828125, 'epoch': 0.43}
43%|████▎ | 1857/4286 [14:08:14<16:06:58, 23.89s/it] {'loss': 0.0094, 'grad_norm': 10.159014075376426, 'learning_rate': 5.667288847410172e-07, 'completion_length': 291.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.0713137723505497, 'kl': 0.234130859375, 'epoch': 0.43}
43%|████▎ | 1858/4286 [14:08:38<16:05:56, 23.87s/it] {'loss': 0.0016, 'grad_norm': 0.5733960116789553, 'learning_rate': 5.664955669622024e-07, 'completion_length': 279.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.025651201605796814, 'kl': 0.0401611328125, 'epoch': 0.43}
43%|████▎ | 1859/4286 [14:09:02<16:06:32, 23.89s/it] {'loss': 0.0252, 'grad_norm': 15.814869969422226, 'learning_rate': 5.662622491833878e-07, 'completion_length': 309.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949406266212463, 'reward_std': 0.07460252195596695, 'kl': 0.62890625, 'epoch': 0.43}
43%|████▎ | 1860/4286 [14:09:26<16:07:53, 23.94s/it] {'loss': 0.0077, 'grad_norm': 3.3962778921127534, 'learning_rate': 5.66028931404573e-07, 'completion_length': 269.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.0416666716337204, 'kl': 0.193359375, 'epoch': 0.43}
43%|████▎ | 1861/4286 [14:09:51<16:26:39, 24.41s/it] {'loss': 0.0039, 'grad_norm': 7.310561525345591, 'learning_rate': 5.657956136257582e-07, 'completion_length': 336.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7886905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7886906862258911, 'reward_std': 0.035961028188467026, 'kl': 0.09765625, 'epoch': 0.43}
43%|████▎ | 1862/4286 [14:10:15<16:23:19, 24.34s/it] {'loss': 0.008, 'grad_norm': 3.106858760632372, 'learning_rate': 5.655622958469434e-07, 'completion_length': 289.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6205357909202576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6026787161827087, 'reward_std': 0.09082325361669064, 'kl': 0.2001953125, 'epoch': 0.43}
43%|████▎ | 1863/4286 [14:10:39<16:11:59, 24.07s/it] {'loss': 0.002, 'grad_norm': 3.715557824331764, 'learning_rate': 5.653289780681288e-07, 'completion_length': 296.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.74702388048172, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.08290597144514322, 'kl': 0.0506591796875, 'epoch': 0.43}
43%|████▎ | 1864/4286 [14:11:03<16:12:01, 24.08s/it] {'loss': 0.0123, 'grad_norm': 2.7225327812807447, 'learning_rate': 5.65095660289314e-07, 'completion_length': 282.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7648810744285583, 'reward_std': 0.1479678526520729, 'kl': 0.30615234375, 'epoch': 0.43}
44%|████▎ | 1865/4286 [14:11:26<16:01:41, 23.83s/it] {'loss': 0.017, 'grad_norm': 5.589726187540069, 'learning_rate': 5.648623425104992e-07, 'completion_length': 294.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7306548953056335, 'reward_std': 0.09226190485060215, 'kl': 0.42431640625, 'epoch': 0.44}
44%|████▎ | 1866/4286 [14:11:50<16:02:31, 23.86s/it] {'loss': 0.0141, 'grad_norm': 27.644080569648263, 'learning_rate': 5.646290247316845e-07, 'completion_length': 320.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.0773809514939785, 'kl': 0.353515625, 'epoch': 0.44}
44%|████▎ | 1867/4286 [14:12:14<15:56:54, 23.73s/it] {'loss': 0.0019, 'grad_norm': 1.090780796282284, 'learning_rate': 5.643957069528698e-07, 'completion_length': 318.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7000000476837158, 'rewards/format_reward': 1.0, 'reward': 1.7000001072883606, 'reward_std': 0.027457598596811295, 'kl': 0.048828125, 'epoch': 0.44}
44%|████▎ | 1868/4286 [14:12:36<15:45:27, 23.46s/it] {'loss': 0.0074, 'grad_norm': 1.75352197259574, 'learning_rate': 5.64162389174055e-07, 'completion_length': 275.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.492559552192688, 'rewards/format_reward': 1.0, 'reward': 1.4925596117973328, 'reward_std': 0.04627927392721176, 'kl': 0.1856689453125, 'epoch': 0.44}
44%|████▎ | 1869/4286 [14:12:59<15:38:41, 23.30s/it] {'loss': 0.0032, 'grad_norm': 0.21420229116366787, 'learning_rate': 5.639290713952403e-07, 'completion_length': 268.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.785714328289032, 'reward_std': 0.0, 'kl': 0.078857421875, 'epoch': 0.44}
44%|████▎ | 1870/4286 [14:13:24<15:58:34, 23.81s/it] {'loss': 0.0049, 'grad_norm': 2.154860576595919, 'learning_rate': 5.636957536164255e-07, 'completion_length': 311.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7351191341876984, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.05357142724096775, 'kl': 0.123291015625, 'epoch': 0.44}
44%|████▎ | 1871/4286 [14:13:50<16:23:01, 24.42s/it] {'loss': 0.0245, 'grad_norm': 20.22771542796451, 'learning_rate': 5.634624358376107e-07, 'completion_length': 273.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7300595641136169, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7122024297714233, 'reward_std': 0.11761565878987312, 'kl': 0.6123046875, 'epoch': 0.44}
44%|████▎ | 1872/4286
[14:14:15<16:29:23, 24.59s/it] {'loss': 0.0187, 'grad_norm': 7.0488633857274365, 'learning_rate': 5.632291180587961e-07, 'completion_length': 305.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 1.0, 'reward': 1.6607144474983215, 'reward_std': 0.05609631724655628, 'kl': 0.46630859375, 'epoch': 0.44} 44%|████▎ | 1872/4286 [14:14:15<16:29:23, 24.59s/it] 44%|████▎ | 1873/4286 [14:14:39<16:24:06, 24.47s/it] {'loss': 0.0026, 'grad_norm': 7.376563828490063, 'learning_rate': 5.629958002799813e-07, 'completion_length': 301.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.04946071654558182, 'kl': 0.0650634765625, 'epoch': 0.44} 44%|████▎ | 1873/4286 [14:14:39<16:24:06, 24.47s/it] 44%|████▎ | 1874/4286 [14:15:03<16:09:42, 24.12s/it] {'loss': 0.0109, 'grad_norm': 3.7189132869772292, 'learning_rate': 5.627624825011665e-07, 'completion_length': 288.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038692235946655, 'reward_std': 0.0744047649204731, 'kl': 0.271484375, 'epoch': 0.44} 44%|████▎ | 1874/4286 [14:15:03<16:09:42, 24.12s/it] 44%|████▎ | 1875/4286 [14:15:27<16:10:33, 24.15s/it] {'loss': 0.0195, 'grad_norm': 2.924633875965462, 'learning_rate': 5.625291647223517e-07, 'completion_length': 288.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.5669642984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5491071939468384, 'reward_std': 0.055216461420059204, 'kl': 0.486328125, 'epoch': 0.44} 44%|████▎ | 1875/4286 [14:15:27<16:10:33, 24.15s/it] 44%|████▍ | 1876/4286 [14:15:51<16:07:03, 24.08s/it] {'loss': 0.0118, 'grad_norm': 3.8681445432442834, 'learning_rate': 5.622958469435371e-07, 'completion_length': 273.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.029761902987957, 'kl': 0.29541015625, 'epoch': 0.44} 44%|████▍ | 1876/4286 [14:15:51<16:07:03, 24.08s/it] 44%|████▍ | 1877/4286 [14:16:15<16:13:26, 24.25s/it] {'loss': 0.0019, 'grad_norm': 1.1887928028099606, 'learning_rate': 5.620625291647223e-07, 'completion_length': 299.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8258929252624512, 'rewards/format_reward': 1.0, 'reward': 1.8258930444717407, 'reward_std': 0.008928571827709675, 'kl': 0.0477294921875, 'epoch': 0.44} 44%|████▍ | 1877/4286 [14:16:15<16:13:26, 24.25s/it] 44%|████▍ | 1878/4286 [14:16:39<16:09:09, 24.15s/it] {'loss': 0.0132, 'grad_norm': 2.8604772979081976, 'learning_rate': 5.618292113859075e-07, 'completion_length': 310.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.5907738655805588, 'rewards/format_reward': 1.0, 'reward': 1.5907739400863647, 'reward_std': 0.05654762126505375, 'kl': 0.3289794921875, 'epoch': 0.44} 44%|████▍ | 1878/4286 [14:16:39<16:09:09, 24.15s/it] 44%|████▍ | 1879/4286 [14:17:04<16:11:09, 24.21s/it] {'loss': 0.0119, 'grad_norm': 2.3726385613338294, 'learning_rate': 5.615958936070928e-07, 'completion_length': 290.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098215222358704, 'reward_std': 0.0922619104385376, 'kl': 0.297119140625, 'epoch': 0.44} 44%|████▍ | 1879/4286 [14:17:04<16:11:09, 24.21s/it] 44%|████▍ | 1880/4286 [14:17:29<16:25:21, 24.57s/it] {'loss': 0.0078, 'grad_norm': 1.7603772697163615, 'learning_rate': 
5.613625758282781e-07, 'completion_length': 294.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6815477013587952, 'reward_std': 0.1186202634125948, 'kl': 0.1962890625, 'epoch': 0.44} 44%|████▍ | 1880/4286 [14:17:29<16:25:21, 24.57s/it] 44%|████▍ | 1881/4286 [14:17:52<16:06:56, 24.12s/it] {'loss': 0.0031, 'grad_norm': 1.0141028860614791, 'learning_rate': 5.611292580494633e-07, 'completion_length': 296.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.0680250208824873, 'kl': 0.0770263671875, 'epoch': 0.44} 44%|████▍ | 1881/4286 [14:17:52<16:06:56, 24.12s/it] 44%|████▍ | 1882/4286 [14:18:17<16:12:37, 24.28s/it] {'loss': 0.0058, 'grad_norm': 1.3684872238305195, 'learning_rate': 5.608959402706486e-07, 'completion_length': 302.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8053571879863739, 'rewards/format_reward': 1.0, 'reward': 1.8053572177886963, 'reward_std': 0.04076577536761761, 'kl': 0.14501953125, 'epoch': 0.44} 44%|████▍ | 1882/4286 [14:18:17<16:12:37, 24.28s/it] 44%|████▍ | 1883/4286 [14:18:40<16:00:22, 23.98s/it] {'loss': 0.0031, 'grad_norm': 8.981710157363226, 'learning_rate': 5.606626224918338e-07, 'completion_length': 260.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.818452537059784, 'reward_std': 0.029761903919279575, 'kl': 0.07666015625, 'epoch': 0.44} 44%|████▍ | 1883/4286 [14:18:40<16:00:22, 23.98s/it] 44%|████▍ | 1884/4286 [14:19:03<15:50:55, 23.75s/it] {'loss': 0.0087, 'grad_norm': 8.635838666301003, 'learning_rate': 5.604293047130191e-07, 'completion_length': 294.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.63839291036129, 'rewards/format_reward': 1.0, 'reward': 1.638392984867096, 'reward_std': 0.05495268292725086, 'kl': 0.21728515625, 'epoch': 0.44} 44%|████▍ | 1884/4286 [14:19:03<15:50:55, 23.75s/it][2025-03-03 05:16:50,943] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 44%|████▍ | 1885/4286 [14:19:28<16:02:20, 24.05s/it] {'loss': 0.0273, 'grad_norm': 0.8246092923093737, 'learning_rate': 5.601959869342043e-07, 'completion_length': 256.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7752976417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.739583432674408, 'reward_std': 0.10744538903236389, 'kl': 0.681640625, 'epoch': 0.44} 44%|████▍ | 1885/4286 [14:19:28<16:02:20, 24.05s/it] 44%|████▍ | 1886/4286 [14:19:53<16:09:07, 24.23s/it] {'loss': 0.0256, 'grad_norm': 9.099290299984618, 'learning_rate': 5.599626691553896e-07, 'completion_length': 326.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.656808078289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.638951063156128, 'reward_std': 0.1149553656578064, 'kl': 0.63671875, 'epoch': 0.44} 44%|████▍ | 1886/4286 [14:19:53<16:09:07, 24.23s/it] 44%|████▍ | 1887/4286 [14:20:19<16:28:13, 24.72s/it] {'loss': 0.0183, 'grad_norm': 11.637529851675072, 'learning_rate': 5.597293513765748e-07, 'completion_length': 305.2143096923828, 'rewards/only_full_func_accuracy_reward': 0.6696429252624512, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6160715222358704, 'reward_std': 0.13364067301154137, 'kl': 0.45751953125, 'epoch': 0.44} 44%|████▍ | 1887/4286 [14:20:19<16:28:13, 24.72s/it] 44%|████▍ | 1888/4286 [14:20:43<16:29:12, 24.75s/it] {'loss': 0.0096, 'grad_norm': 4.6156080965617905, 'learning_rate': 5.594960335977601e-07, 'completion_length': 297.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7446429133415222, 'rewards/format_reward': 1.0, 'reward': 1.744642972946167, 'reward_std': 0.10845697671175003, 'kl': 0.24072265625, 'epoch': 0.44} 44%|████▍ | 1888/4286 [14:20:43<16:29:12, 24.75s/it] 44%|████▍ | 1889/4286 [14:21:09<16:39:52, 25.03s/it] {'loss': 0.0224, 'grad_norm': 59.57957722594043, 'learning_rate': 5.592627158189454e-07, 'completion_length': 293.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.8822511434555054, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.846536934375763, 'reward_std': 0.0956951156258583, 'kl': 0.560546875, 'epoch': 0.44} 44%|████▍ | 1889/4286 [14:21:09<16:39:52, 25.03s/it][2025-03-03 05:18:54,932] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 44%|████▍ | 1890/4286 [14:21:32<16:14:55, 24.41s/it] {'loss': 0.03, 'grad_norm': 6.829243306456333, 'learning_rate': 5.590293980401306e-07, 'completion_length': 294.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7648810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.0535714328289032, 'kl': 0.75, 'epoch': 0.44} 44%|████▍ | 1890/4286 [14:21:32<16:14:55, 24.41s/it] 44%|████▍ | 1891/4286 [14:21:57<16:18:03, 24.50s/it] {'loss': 0.0055, 'grad_norm': 3.0565939583327615, 'learning_rate': 5.587960802613158e-07, 'completion_length': 314.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.7559524476528168, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.03755595162510872, 'kl': 0.138671875, 'epoch': 0.44} 44%|████▍ | 1891/4286 [14:21:57<16:18:03, 24.50s/it] 44%|████▍ | 1892/4286 [14:22:21<16:14:23, 24.42s/it] {'loss': 0.0285, 'grad_norm': 3.6015521299816755, 'learning_rate': 5.585627624825012e-07, 'completion_length': 274.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.7693452537059784, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7336310744285583, 'reward_std': 0.10743949934840202, 'kl': 0.712890625, 'epoch': 0.44} 44%|████▍ | 1892/4286 [14:22:21<16:14:23, 24.42s/it][2025-03-03 05:20:09,166] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 44%|████▍ | 1893/4286 [14:22:46<16:24:25, 24.68s/it] {'loss': 0.0176, 'grad_norm': 26.980415570928308, 'learning_rate': 5.583294447036864e-07, 'completion_length': 284.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6800596117973328, 'reward_std': 0.08219881914556026, 'kl': 0.43896484375, 'epoch': 0.44} 44%|████▍ | 1893/4286 [14:22:46<16:24:25, 24.68s/it] 44%|████▍ | 1894/4286 [14:23:09<15:59:29, 24.07s/it] {'loss': 0.0304, 'grad_norm': 4.146584909231128, 'learning_rate': 5.580961269248716e-07, 'completion_length': 272.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.05357143096625805, 'kl': 0.759765625, 'epoch': 0.44} 44%|████▍ | 1894/4286 [14:23:09<15:59:29, 24.07s/it][2025-03-03 05:20:54,747] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 44%|████▍ | 1895/4286 [14:23:32<15:45:43, 23.73s/it] {'loss': 0.0178, 'grad_norm': 3.0391599749764566, 'learning_rate': 5.578628091460569e-07, 'completion_length': 246.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666667461395264, 'reward_std': 0.1062086820602417, 'kl': 0.442626953125, 'epoch': 0.44} 44%|████▍ | 1895/4286 [14:23:32<15:45:43, 23.73s/it] 44%|████▍ | 1896/4286 [14:23:56<15:49:35, 23.84s/it] {'loss': 0.0123, 'grad_norm': 3.9035719018812705, 'learning_rate': 5.576294913672421e-07, 'completion_length': 288.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.782738208770752, 'reward_std': 0.04602411389350891, 'kl': 0.30859375, 'epoch': 0.44} 44%|████▍ | 1896/4286 [14:23:56<15:49:35, 23.84s/it][2025-03-03 05:21:43,424] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 44%|████▍ | 1897/4286 [14:24:21<15:58:08, 24.06s/it] {'loss': 0.037, 'grad_norm': 8.80548133979222, 'learning_rate': 5.573961735884274e-07, 'completion_length': 273.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547620296478271, 'reward_std': 0.11082620546221733, 'kl': 0.9248046875, 'epoch': 0.44} 44%|████▍ | 1897/4286 [14:24:21<15:58:08, 24.06s/it] 44%|████▍ | 1898/4286 [14:24:44<15:49:03, 23.85s/it] {'loss': 0.0115, 'grad_norm': 58.142381386919986, 'learning_rate': 5.571628558096126e-07, 'completion_length': 273.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 1.0, 'reward': 1.8020834922790527, 'reward_std': 0.026785715483129025, 'kl': 0.287353515625, 'epoch': 0.44} 44%|████▍ | 1898/4286 [14:24:44<15:49:03, 23.85s/it] 44%|████▍ | 1899/4286 [14:25:09<16:03:08, 24.21s/it] {'loss': 0.0241, 'grad_norm': 5.850686473523072, 'learning_rate': 5.569295380307979e-07, 'completion_length': 318.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.042443971149623394, 'kl': 0.6015625, 'epoch': 0.44} 44%|████▍ | 1899/4286 [14:25:09<16:03:08, 24.21s/it] 44%|████▍ | 1900/4286 [14:25:32<15:54:46, 24.01s/it] {'loss': 0.0273, 'grad_norm': 4.57051669969506, 'learning_rate': 5.566962202519831e-07, 'completion_length': 277.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7797620296478271, 'reward_std': 0.1428571529686451, 'kl': 0.68359375, 'epoch': 0.44} 44%|████▍ | 1900/4286 [14:25:32<15:54:46, 24.01s/it] 44%|████▍ | 1901/4286 [14:29:11<54:37:42, 82.46s/it] {'loss': 0.0862, 'grad_norm': 6.080867661240323, 'learning_rate': 5.564629024731684e-07, 'completion_length': 309.375, 'rewards/only_full_func_accuracy_reward': 0.5669642984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 
1.5312500596046448, 'reward_std': 0.1675058901309967, 'kl': 2.15625, 'epoch': 0.44} 44%|████▍ | 1901/4286 [14:29:11<54:37:42, 82.46s/it] 44%|████▍ | 1902/4286 [14:29:35<42:51:42, 64.72s/it] {'loss': 0.0082, 'grad_norm': 4.06579145896965, 'learning_rate': 5.562295846943537e-07, 'completion_length': 279.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.677083358168602, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6592263579368591, 'reward_std': 0.09226190485060215, 'kl': 0.20556640625, 'epoch': 0.44} 44%|████▍ | 1902/4286 [14:29:35<42:51:42, 64.72s/it] 44%|████▍ | 1903/4286 [14:30:00<34:59:15, 52.86s/it] {'loss': 0.0083, 'grad_norm': 4.633155480197665, 'learning_rate': 5.559962669155389e-07, 'completion_length': 275.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7235119044780731, 'rewards/format_reward': 1.0, 'reward': 1.723512053489685, 'reward_std': 0.027080299332737923, 'kl': 0.20849609375, 'epoch': 0.44} 44%|████▍ | 1903/4286 [14:30:00<34:59:15, 52.86s/it] 44%|████▍ | 1904/4286 [14:30:25<29:26:10, 44.49s/it] {'loss': 0.0463, 'grad_norm': 3.2612281312828797, 'learning_rate': 5.557629491367241e-07, 'completion_length': 321.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.771428644657135, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.717857301235199, 'reward_std': 0.10899937897920609, 'kl': 1.1533203125, 'epoch': 0.44} 44%|████▍ | 1904/4286 [14:30:25<29:26:10, 44.49s/it] 44%|████▍ | 1905/4286 [14:30:48<25:14:42, 38.17s/it] {'loss': 0.0222, 'grad_norm': 24.717802562163445, 'learning_rate': 5.555296313579095e-07, 'completion_length': 289.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.7946428656578064, 'rewards/format_reward': 1.0, 'reward': 1.794642984867096, 'reward_std': 0.06703080236911774, 'kl': 0.5537109375, 'epoch': 0.44} 44%|████▍ | 1905/4286 [14:30:48<25:14:42, 38.17s/it] 44%|████▍ | 1906/4286 [14:31:11<22:14:29, 33.64s/it] {'loss': 0.0025, 'grad_norm': 4.441429415358993, 'learning_rate': 5.552963135790947e-07, 'completion_length': 285.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7321428656578064, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.04761904664337635, 'kl': 0.0615234375, 'epoch': 0.44} 44%|████▍ | 1906/4286 [14:31:11<22:14:29, 33.64s/it] 44%|████▍ | 1907/4286 [14:31:36<20:28:39, 30.99s/it] {'loss': 0.0096, 'grad_norm': 3.289984279359882, 'learning_rate': 5.550629958002799e-07, 'completion_length': 283.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.796131044626236, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7782739400863647, 'reward_std': 0.098214291036129, 'kl': 0.2398681640625, 'epoch': 0.44} 44%|████▍ | 1907/4286 [14:31:36<20:28:39, 30.99s/it] 45%|████▍ | 1908/4286 [14:32:00<18:59:05, 28.74s/it] {'loss': 0.0029, 'grad_norm': 1.2109574665067433, 'learning_rate': 5.548296780214651e-07, 'completion_length': 279.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.834821492433548, 'rewards/format_reward': 1.0, 'reward': 1.8348215818405151, 'reward_std': 0.008928571827709675, 'kl': 0.072509765625, 'epoch': 0.45} 45%|████▍ | 1908/4286 [14:32:00<18:59:05, 28.74s/it] 45%|████▍ | 1909/4286 [14:32:24<18:01:47, 27.31s/it] {'loss': 0.0212, 'grad_norm': 2.0123197213691886, 'learning_rate': 5.545963602426505e-07, 'completion_length': 261.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.752976268529892, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7351192235946655, 'reward_std': 0.1109987236559391, 'kl': 0.528076171875, 
'epoch': 0.45} 45%|████▍ | 1909/4286 [14:32:24<18:01:47, 27.31s/it] 45%|████▍ | 1910/4286 [14:32:46<17:09:47, 26.00s/it] {'loss': 0.0023, 'grad_norm': 0.6536834451172998, 'learning_rate': 5.543630424638357e-07, 'completion_length': 279.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.8690477013587952, 'rewards/format_reward': 1.0, 'reward': 1.8690477013587952, 'reward_std': 0.0357142873108387, 'kl': 0.057373046875, 'epoch': 0.45} 45%|████▍ | 1910/4286 [14:32:46<17:09:47, 26.00s/it] 45%|████▍ | 1911/4286 [14:33:10<16:40:21, 25.27s/it] {'loss': 0.0076, 'grad_norm': 1.945784082241073, 'learning_rate': 5.541297246850209e-07, 'completion_length': 289.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6830357760190964, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.05495268199592829, 'kl': 0.1895751953125, 'epoch': 0.45} 45%|████▍ | 1911/4286 [14:33:10<16:40:21, 25.27s/it] 45%|████▍ | 1912/4286 [14:33:35<16:31:48, 25.07s/it] {'loss': 0.0129, 'grad_norm': 2.601596536131032, 'learning_rate': 5.538964069062062e-07, 'completion_length': 285.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.818452537059784, 'reward_std': 0.0297619067132473, 'kl': 0.322998046875, 'epoch': 0.45} 45%|████▍ | 1912/4286 [14:33:35<16:31:48, 25.07s/it] 45%|████▍ | 1913/4286 [14:33:59<16:22:58, 24.85s/it] {'loss': 0.0169, 'grad_norm': 2.370861678560923, 'learning_rate': 5.536630891273915e-07, 'completion_length': 261.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 1.0, 'reward': 1.6294644474983215, 'reward_std': 0.059622690081596375, 'kl': 0.4215087890625, 'epoch': 0.45} 45%|████▍ | 1913/4286 [14:33:59<16:22:58, 24.85s/it] 45%|████▍ | 1914/4286 [14:34:23<16:09:38, 24.53s/it] {'loss': 0.0038, 'grad_norm': 1.458651537733721, 'learning_rate': 5.534297713485767e-07, 'completion_length': 296.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.68154776096344, 'reward_std': 0.07167530432343483, 'kl': 0.095458984375, 'epoch': 0.45} 45%|████▍ | 1914/4286 [14:34:23<16:09:38, 24.53s/it] 45%|████▍ | 1915/4286 [14:34:47<16:03:19, 24.38s/it] {'loss': 0.0029, 'grad_norm': 7.461342663158304, 'learning_rate': 5.53196453569762e-07, 'completion_length': 315.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7514880895614624, 'rewards/format_reward': 1.0, 'reward': 1.7514882683753967, 'reward_std': 0.04685881733894348, 'kl': 0.072509765625, 'epoch': 0.45} 45%|████▍ | 1915/4286 [14:34:47<16:03:19, 24.38s/it] 45%|████▍ | 1916/4286 [14:35:11<16:00:48, 24.32s/it] {'loss': 0.0125, 'grad_norm': 3.716384568950554, 'learning_rate': 5.529631357909472e-07, 'completion_length': 280.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.8098214268684387, 'rewards/format_reward': 1.0, 'reward': 1.8098214864730835, 'reward_std': 0.05904166214168072, 'kl': 0.3125, 'epoch': 0.45} 45%|████▍ | 1916/4286 [14:35:11<16:00:48, 24.32s/it] 45%|████▍ | 1917/4286 [14:35:36<16:11:50, 24.61s/it] {'loss': 0.0054, 'grad_norm': 1.5898815024140014, 'learning_rate': 5.527298180121325e-07, 'completion_length': 287.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8184524178504944, 'rewards/format_reward': 1.0, 'reward': 1.8184524774551392, 'reward_std': 0.06345389038324356, 'kl': 0.1353759765625, 'epoch': 0.45} 45%|████▍ | 1917/4286 [14:35:36<16:11:50, 24.61s/it][2025-03-03 05:33:23,227] [WARNING] [stage3.py:2134:step] 1 
pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 45%|████▍ | 1918/4286 [14:36:00<16:04:39, 24.44s/it] {'loss': 0.0034, 'grad_norm': 1.2097974742624136, 'learning_rate': 5.524965002333178e-07, 'completion_length': 236.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6943452954292297, 'rewards/format_reward': 1.0, 'reward': 1.6943453550338745, 'reward_std': 0.029335035011172295, 'kl': 0.085205078125, 'epoch': 0.45} 45%|████▍ | 1918/4286 [14:36:00<16:04:39, 24.44s/it] 45%|████▍ | 1919/4286 [14:36:23<15:40:11, 23.83s/it] {'loss': 0.0051, 'grad_norm': 1.564126381898633, 'learning_rate': 5.52263182454503e-07, 'completion_length': 262.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.5863095670938492, 'rewards/format_reward': 1.0, 'reward': 1.5863096117973328, 'reward_std': 0.0590964499861002, 'kl': 0.127685546875, 'epoch': 0.45} 45%|████▍ | 1919/4286 [14:36:23<15:40:11, 23.83s/it] 45%|████▍ | 1920/4286 [14:36:46<15:36:19, 23.74s/it] {'loss': 0.0137, 'grad_norm': 2.4162800591741913, 'learning_rate': 5.520298646756882e-07, 'completion_length': 314.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5937501788139343, 'reward_std': 0.1688869558274746, 'kl': 0.3411865234375, 'epoch': 0.45} 45%|████▍ | 1920/4286 [14:36:46<15:36:19, 23.74s/it] 45%|████▍ | 1921/4286 [14:37:11<15:42:05, 23.90s/it] {'loss': 0.0021, 'grad_norm': 1.7097926190370738, 'learning_rate': 5.517965468968734e-07, 'completion_length': 298.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.00297618773765862, 'kl': 0.053466796875, 'epoch': 0.45} 45%|████▍ | 1921/4286 [14:37:11<15:42:05, 23.90s/it][2025-03-03 05:34:57,367] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 45%|████▍ | 1922/4286 [14:37:34<15:41:59, 23.91s/it] {'loss': 0.0046, 'grad_norm': 1.2963120889705513, 'learning_rate': 5.515632291180588e-07, 'completion_length': 278.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.05800335668027401, 'kl': 0.1156005859375, 'epoch': 0.45} 45%|████▍ | 1922/4286 [14:37:34<15:41:59, 23.91s/it] 45%|████▍ | 1923/4286 [14:37:57<15:28:09, 23.57s/it] {'loss': 0.0043, 'grad_norm': 0.7134304801993859, 'learning_rate': 5.51329911339244e-07, 'completion_length': 277.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6904762089252472, 'rewards/format_reward': 1.0, 'reward': 1.6904763579368591, 'reward_std': 0.06504883337765932, 'kl': 0.1087646484375, 'epoch': 0.45} 45%|████▍ | 1923/4286 [14:37:57<15:28:09, 23.57s/it] 45%|████▍ | 1924/4286 [14:38:21<15:28:32, 23.59s/it] {'loss': 0.0124, 'grad_norm': 0.5584638334849511, 'learning_rate': 5.510965935604292e-07, 'completion_length': 316.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6904762089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6726191639900208, 'reward_std': 0.0357142873108387, 'kl': 0.3070068359375, 'epoch': 0.45} 45%|████▍ | 1924/4286 [14:38:21<15:28:32, 23.59s/it] 45%|████▍ | 1925/4286 [14:38:45<15:39:01, 23.86s/it] {'loss': 0.0018, 'grad_norm': 0.6238227299844248, 'learning_rate': 5.508632757816145e-07, 'completion_length': 269.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.06136547867208719, 'kl': 0.045654296875, 'epoch': 0.45} 45%|████▍ | 1925/4286 [14:38:45<15:39:01, 23.86s/it][2025-03-03 05:36:32,113] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 45%|████▍ | 1926/4286 [14:39:09<15:38:15, 23.85s/it] {'loss': 0.006, 'grad_norm': 3.106030165589177, 'learning_rate': 5.506299580027998e-07, 'completion_length': 260.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.042747266590595245, 'kl': 0.15087890625, 'epoch': 0.45} 45%|████▍ | 1926/4286 [14:39:09<15:38:15, 23.85s/it] 45%|████▍ | 1927/4286 [14:39:34<15:53:11, 24.24s/it] {'loss': 0.003, 'grad_norm': 4.062920753073198, 'learning_rate': 5.50396640223985e-07, 'completion_length': 295.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.828869104385376, 'rewards/format_reward': 1.0, 'reward': 1.828869104385376, 'reward_std': 0.08269228786230087, 'kl': 0.0760498046875, 'epoch': 0.45} 45%|████▍ | 1927/4286 [14:39:34<15:53:11, 24.24s/it][2025-03-03 05:37:22,260] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 45%|████▍ | 1928/4286 [14:39:59<16:01:37, 24.47s/it] {'loss': 0.002, 'grad_norm': 0.3505354972403038, 'learning_rate': 5.501633224451703e-07, 'completion_length': 284.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.9122023582458496, 'rewards/format_reward': 1.0, 'reward': 1.9122024774551392, 'reward_std': 0.02317243255674839, 'kl': 0.0498046875, 'epoch': 0.45} 45%|████▍ | 1928/4286 [14:39:59<16:01:37, 24.47s/it] 45%|████▌ | 1929/4286 [14:40:24<16:00:57, 24.46s/it] {'loss': 0.0031, 'grad_norm': 3.416727575893517, 'learning_rate': 5.499300046663555e-07, 'completion_length': 297.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.704464316368103, 'rewards/format_reward': 1.0, 'reward': 1.7044644355773926, 'reward_std': 0.03219882398843765, 'kl': 0.0771484375, 'epoch': 0.45} 45%|████▌ | 1929/4286 [14:40:24<16:00:57, 24.46s/it] 45%|████▌ | 1930/4286 [14:40:47<15:44:54, 24.06s/it] {'loss': 0.0027, 'grad_norm': 3.935328453347979, 'learning_rate': 5.496966868875408e-07, 'completion_length': 292.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 1.0, 'reward': 1.6860119700431824, 'reward_std': 0.0863095261156559, 'kl': 0.0670166015625, 'epoch': 0.45} 45%|████▌ | 1930/4286 [14:40:47<15:44:54, 24.06s/it] 45%|████▌ | 1931/4286 [14:41:10<15:32:08, 23.75s/it] {'loss': 0.0044, 'grad_norm': 8.5879119436495, 'learning_rate': 5.49463369108726e-07, 'completion_length': 247.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.5877976715564728, 'rewards/format_reward': 1.0, 'reward': 1.5877977013587952, 'reward_std': 0.035413133911788464, 'kl': 0.111083984375, 'epoch': 0.45} 45%|████▌ | 1931/4286 [14:41:10<15:32:08, 23.75s/it] 45%|████▌ | 1932/4286 [14:41:35<15:44:25, 24.07s/it] {'loss': 0.0041, 'grad_norm': 2.8649531780077533, 'learning_rate': 5.492300513299113e-07, 'completion_length': 317.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7889881432056427, 'rewards/format_reward': 1.0, 'reward': 1.788988173007965, 'reward_std': 0.06285865372046828, 'kl': 0.101318359375, 'epoch': 0.45} 45%|████▌ | 1932/4286 [14:41:35<15:44:25, 24.07s/it] 45%|████▌ | 1933/4286 [14:41:59<15:51:05, 24.25s/it] {'loss': 0.0041, 'grad_norm': 7.819171585147129, 'learning_rate': 5.489967335510965e-07, 'completion_length': 272.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.8407738208770752, 'rewards/format_reward': 1.0, 'reward': 1.8407739400863647, 'reward_std': 0.008928571827709675, 'kl': 0.103271484375, 'epoch': 0.45} 45%|████▌ | 1933/4286 [14:41:59<15:51:05, 24.25s/it] 45%|████▌ | 1934/4286 [14:42:25<16:02:06, 24.54s/it] {'loss': 0.0069, 'grad_norm': 1.6386002642344264, 'learning_rate': 5.487634157722818e-07, 'completion_length': 295.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053571939468384, 'reward_std': 0.010309826582670212, 'kl': 0.17333984375, 'epoch': 0.45} 45%|████▌ | 1934/4286 [14:42:25<16:02:06, 24.54s/it] 45%|████▌ | 1935/4286 [14:42:48<15:50:30, 24.26s/it] {'loss': 0.0018, 'grad_norm': 4.813100190983491, 'learning_rate': 5.485300979934671e-07, 'completion_length': 302.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7247024178504944, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.031143165193498135, 'kl': 0.044921875, 
'epoch': 0.45} 45%|████▌ | 1935/4286 [14:42:48<15:50:30, 24.26s/it] 45%|████▌ | 1936/4286 [14:43:13<15:51:30, 24.29s/it] {'loss': 0.0028, 'grad_norm': 0.8286852071654237, 'learning_rate': 5.482967802146523e-07, 'completion_length': 277.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.7961309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7961311340332031, 'reward_std': 0.04053214751183987, 'kl': 0.0701904296875, 'epoch': 0.45} 45%|████▌ | 1936/4286 [14:43:13<15:51:30, 24.29s/it] 45%|████▌ | 1937/4286 [14:43:36<15:34:34, 23.87s/it] {'loss': 0.0037, 'grad_norm': 0.3661921644148594, 'learning_rate': 5.480634624358375e-07, 'completion_length': 259.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.8020833432674408, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.0029761905316263437, 'kl': 0.091552734375, 'epoch': 0.45} 45%|████▌ | 1937/4286 [14:43:36<15:34:34, 23.87s/it][2025-03-03 05:41:24,119] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 45%|████▌ | 1938/4286 [14:44:01<15:55:28, 24.42s/it] {'loss': 0.0033, 'grad_norm': 1.5600495432884713, 'learning_rate': 5.478301446570229e-07, 'completion_length': 294.89288330078125, 'rewards/only_full_func_accuracy_reward': 0.7059524655342102, 'rewards/format_reward': 1.0, 'reward': 1.705952525138855, 'reward_std': 0.06439300812780857, 'kl': 0.0830078125, 'epoch': 0.45} 45%|████▌ | 1938/4286 [14:44:01<15:55:28, 24.42s/it] 45%|████▌ | 1939/4286 [14:44:26<15:55:18, 24.42s/it] {'loss': 0.0067, 'grad_norm': 3.093516662442533, 'learning_rate': 5.475968268782081e-07, 'completion_length': 276.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7291667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7291667461395264, 'reward_std': 0.060764169320464134, 'kl': 0.1680908203125, 'epoch': 0.45} 45%|████▌ | 1939/4286 [14:44:26<15:55:18, 24.42s/it] 45%|████▌ | 1940/4286 [14:44:49<15:44:23, 24.15s/it] {'loss': 0.0059, 'grad_norm': 2.1052130422927924, 'learning_rate': 5.473635090993933e-07, 'completion_length': 253.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 1.0, 'reward': 1.7098215818405151, 'reward_std': 0.058389294892549515, 'kl': 0.1483154296875, 'epoch': 0.45} 45%|████▌ | 1940/4286 [14:44:49<15:44:23, 24.15s/it] 45%|████▌ | 1941/4286 [14:45:14<15:49:23, 24.29s/it] {'loss': 0.0031, 'grad_norm': 1.9505995845995, 'learning_rate': 5.471301913205786e-07, 'completion_length': 293.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5520833730697632, 'rewards/format_reward': 1.0, 'reward': 1.5520833730697632, 'reward_std': 0.026785715483129025, 'kl': 0.077392578125, 'epoch': 0.45} 45%|████▌ | 1941/4286 [14:45:14<15:49:23, 24.29s/it] 45%|████▌ | 1942/4286 [14:45:38<15:53:55, 24.42s/it] {'loss': 0.0168, 'grad_norm': 4.012107476524624, 'learning_rate': 5.468968735417639e-07, 'completion_length': 292.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6889880895614624, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.038690478540956974, 'kl': 0.4217529296875, 'epoch': 0.45} 45%|████▌ | 1942/4286 [14:45:38<15:53:55, 24.42s/it] 
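The stage3.py step warnings scattered through this run keep pointing at the same remedy, so here is a minimal sketch of where that call would sit. It assumes an ordinary DeepSpeed engine loop; `engine` and `train_loader` are placeholder names, not taken from this run's actual training script:

```python
# Minimal sketch of the fix suggested by the recurring stage3.py warnings:
# flush the allocator cache at the same point in the loop on every rank.
# `engine` (a DeepSpeed engine) and `train_loader` are placeholders.
from deepspeed.accelerator import get_accelerator

for step, batch in enumerate(train_loader):
    loss = engine(batch)   # forward
    engine.backward(loss)  # backward
    engine.step()          # ZeRO-3 optimizer step
    # Synchronized flush, as the warning recommends under memory pressure;
    # calling it on every rank at the same spot keeps the ranks aligned.
    get_accelerator().empty_cache()
```

The flush itself costs throughput, which is why the warning only suggests it when the "pytorch allocator cache flushes since last step" messages keep recurring, as they do above every few dozen steps.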
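Since every step above prints one dict with the same keys, the metrics are easy to recover from a saved copy of this output. A small sketch, assuming the log was written to a file (`train.log` is an assumed name, not from this run); it also checks the relationship visible in the records above, where `reward` equals `rewards/only_full_func_accuracy_reward` plus `rewards/format_reward` up to float rounding:

```python
import ast
import re

LOG_PATH = "train.log"  # assumed filename for a saved copy of this output

# Each step prints one Python dict literal, starting at 'loss' and ending
# at 'epoch'; re.DOTALL lets a record span wrapped lines.
RECORD = re.compile(r"\{'loss':.*?'epoch': [0-9.]+\}", re.DOTALL)

with open(LOG_PATH) as f:
    records = [ast.literal_eval(m) for m in RECORD.findall(f.read())]

for r in records:
    parts = (r["rewards/only_full_func_accuracy_reward"]
             + r["rewards/format_reward"])
    # Holds for every record in this section, up to float rounding.
    assert abs(r["reward"] - parts) < 1e-4

print(len(records), "steps, mean reward",
      sum(r["reward"] for r in records) / len(records))
```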
45%|████▌ | 1943/4286 [14:46:03<15:50:39, 24.34s/it] {'loss': 0.0019, 'grad_norm': 3.9002537642889332, 'learning_rate': 5.466635557629491e-07, 'completion_length': 299.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.836309552192688, 'rewards/format_reward': 1.0, 'reward': 1.8363096714019775, 'reward_std': 0.06793888658285141, 'kl': 0.0487060546875, 'epoch': 0.45} 45%|████▌ | 1943/4286 [14:46:03<15:50:39, 24.34s/it][2025-03-03 05:43:49,864] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 45%|████▌ | 1944/4286 [14:46:27<15:49:30, 24.33s/it] {'loss': 0.0053, 'grad_norm': 3.818272794505143, 'learning_rate': 5.464302379841343e-07, 'completion_length': 261.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.03298483043909073, 'kl': 0.13232421875, 'epoch': 0.45} 45%|████▌ | 1944/4286 [14:46:27<15:49:30, 24.33s/it] 45%|████▌ | 1945/4286 [14:46:52<15:56:30, 24.52s/it] {'loss': 0.0166, 'grad_norm': 3.318006141481176, 'learning_rate': 5.461969202053196e-07, 'completion_length': 304.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.699999988079071, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.682142972946167, 'reward_std': 0.08306369557976723, 'kl': 0.416015625, 'epoch': 0.45} 45%|████▌ | 1945/4286 [14:46:52<15:56:30, 24.52s/it][2025-03-03 05:44:39,616] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 45%|████▌ | 1946/4286 [14:47:17<15:59:21, 24.60s/it] {'loss': 0.0039, 'grad_norm': 2.0897281025632592, 'learning_rate': 5.459636024265048e-07, 'completion_length': 317.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6205357313156128, 'rewards/format_reward': 1.0, 'reward': 1.6205357313156128, 'reward_std': 0.029548224061727524, 'kl': 0.096435546875, 'epoch': 0.45} 45%|████▌ | 1946/4286 [14:47:17<15:59:21, 24.60s/it] 45%|████▌ | 1947/4286 [14:47:41<15:50:19, 24.38s/it] {'loss': 0.0039, 'grad_norm': 2.269731324482638, 'learning_rate': 5.457302846476901e-07, 'completion_length': 325.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7380953133106232, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.720238208770752, 'reward_std': 0.07142857182770967, 'kl': 0.09765625, 'epoch': 0.45} 45%|████▌ | 1947/4286 [14:47:41<15:50:19, 24.38s/it] 45%|████▌ | 1948/4286 [14:48:05<15:45:36, 24.27s/it] {'loss': 0.0024, 'grad_norm': 4.3390460298222, 'learning_rate': 5.454969668688754e-07, 'completion_length': 286.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7261905670166016, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.0476190522313118, 'kl': 0.0599365234375, 'epoch': 0.45} 45%|████▌ | 1948/4286 [14:48:05<15:45:36, 24.27s/it] 45%|████▌ | 1949/4286 [14:48:29<15:42:58, 24.21s/it] {'loss': 0.005, 'grad_norm': 2.3753994848627666, 'learning_rate': 5.452636490900606e-07, 'completion_length': 289.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.818452537059784, 'reward_std': 0.06731786392629147, 'kl': 0.1234130859375, 'epoch': 0.45} 45%|████▌ | 1949/4286 [14:48:29<15:42:58, 24.21s/it] 45%|████▌ | 1950/4286 [14:48:53<15:49:57, 24.40s/it] {'loss': 0.0115, 'grad_norm': 7.824949023717367, 'learning_rate': 5.450303313112458e-07, 'completion_length': 304.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.05154913291335106, 'kl': 0.28662109375, 'epoch': 0.45} 45%|████▌ | 1950/4286 [14:48:53<15:49:57, 24.40s/it] 46%|████▌ | 1951/4286 [14:49:18<15:45:36, 24.30s/it] {'loss': 0.0028, 'grad_norm': 2.0321335249438808, 'learning_rate': 5.447970135324312e-07, 'completion_length': 288.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.848214328289032, 'rewards/format_reward': 1.0, 'reward': 1.8482143878936768, 'reward_std': 0.04350833687931299, 'kl': 0.070068359375, 'epoch': 0.46} 46%|████▌ | 1951/4286 [14:49:18<15:45:36, 24.30s/it] 46%|████▌ | 1952/4286 [14:49:42<15:51:43, 24.47s/it] {'loss': 0.0029, 'grad_norm': 2.7776456467905044, 'learning_rate': 5.445636957536164e-07, 'completion_length': 325.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7276786267757416, 'rewards/format_reward': 1.0, 'reward': 1.7276787161827087, 'reward_std': 0.026785715483129025, 'kl': 0.072998046875, 'epoch': 0.46} 46%|████▌ | 1952/4286 [14:49:42<15:51:43, 24.47s/it] 46%|████▌ | 1953/4286 [14:50:08<16:01:22, 24.72s/it] {'loss': 0.008, 'grad_norm': 70.41769570587408, 'learning_rate': 5.443303779748016e-07, 'completion_length': 298.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7398810088634491, 'rewards/format_reward': 1.0, 'reward': 1.7398810386657715, 'reward_std': 0.04404762387275696, 'kl': 
0.19970703125, 'epoch': 0.46} 46%|████▌ | 1953/4286 [14:50:08<16:01:22, 24.72s/it] 46%|████▌ | 1954/4286 [14:50:31<15:46:28, 24.35s/it] {'loss': 0.0066, 'grad_norm': 2.0912910658272352, 'learning_rate': 5.440970601959868e-07, 'completion_length': 305.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.62202388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6041667461395264, 'reward_std': 0.07971610128879547, 'kl': 0.1654052734375, 'epoch': 0.46} 46%|████▌ | 1954/4286 [14:50:31<15:46:28, 24.35s/it][2025-03-03 05:48:18,727] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 46%|████▌ | 1955/4286 [14:50:56<15:48:51, 24.42s/it] {'loss': 0.0176, 'grad_norm': 2.1243254652858603, 'learning_rate': 5.438637424171722e-07, 'completion_length': 294.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7812500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7812500596046448, 'reward_std': 0.08520806953310966, 'kl': 0.4375, 'epoch': 0.46} 46%|████▌ | 1955/4286 [14:50:56<15:48:51, 24.42s/it] 46%|████▌ | 1956/4286 [14:51:23<16:25:42, 25.38s/it] {'loss': 0.0092, 'grad_norm': 1.0315303835016418, 'learning_rate': 5.436304246383574e-07, 'completion_length': 317.375, 'rewards/only_full_func_accuracy_reward': 0.7767857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.029761902987957, 'kl': 0.2296142578125, 'epoch': 0.46} 46%|████▌ | 1956/4286 [14:51:23<16:25:42, 25.38s/it] 46%|████▌ | 1957/4286 [14:51:50<16:37:30, 25.70s/it] {'loss': 0.0025, 'grad_norm': 0.20222647028044047, 'learning_rate': 5.433971068595426e-07, 'completion_length': 310.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.0, 'kl': 0.0633544921875, 'epoch': 0.46} 46%|████▌ | 1957/4286 [14:51:50<16:37:30, 25.70s/it] 46%|████▌ | 1958/4286 [14:52:15<16:26:50, 25.43s/it] {'loss': 0.0168, 'grad_norm': 8.500872120811584, 'learning_rate': 5.431637890807279e-07, 'completion_length': 312.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038692235946655, 'reward_std': 0.046484531834721565, 'kl': 0.4208984375, 'epoch': 0.46} 46%|████▌ | 1958/4286 [14:52:15<16:26:50, 25.43s/it] 46%|████▌ | 1959/4286 [14:52:40<16:25:27, 25.41s/it] {'loss': 0.0071, 'grad_norm': 7.400013266004214, 'learning_rate': 5.429304713019132e-07, 'completion_length': 312.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.034119345247745514, 'kl': 0.176513671875, 'epoch': 0.46} 46%|████▌ | 1959/4286 [14:52:40<16:25:27, 25.41s/it] 46%|████▌ | 1960/4286 [14:53:05<16:14:10, 25.13s/it] {'loss': 0.01, 'grad_norm': 35.48870012132727, 'learning_rate': 5.426971535230984e-07, 'completion_length': 270.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.785714328289032, 'reward_std': 0.07008037343621254, 'kl': 0.2490234375, 'epoch': 0.46} 46%|████▌ | 1960/4286 [14:53:05<16:14:10, 25.13s/it] 46%|████▌ | 1961/4286 
[14:53:28<15:56:49, 24.69s/it] {'loss': 0.0175, 'grad_norm': 11.230749745056233, 'learning_rate': 5.424638357442837e-07, 'completion_length': 317.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.8671343922615051, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8492772579193115, 'reward_std': 0.07961255498230457, 'kl': 0.4375, 'epoch': 0.46} 46%|████▌ | 1961/4286 [14:53:28<15:56:49, 24.69s/it] 46%|████▌ | 1962/4286 [14:53:53<15:59:21, 24.77s/it] {'loss': 0.0603, 'grad_norm': 3.263918509220819, 'learning_rate': 5.422305179654689e-07, 'completion_length': 272.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.8306547999382019, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7949405908584595, 'reward_std': 0.1530887670814991, 'kl': 1.515625, 'epoch': 0.46} 46%|████▌ | 1962/4286 [14:53:53<15:59:21, 24.77s/it] 46%|████▌ | 1963/4286 [14:54:17<15:53:24, 24.63s/it] {'loss': 0.0323, 'grad_norm': 1.2520641333524587, 'learning_rate': 5.419972001866542e-07, 'completion_length': 305.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.8095238506793976, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7738096117973328, 'reward_std': 0.11958652548491955, 'kl': 0.8062744140625, 'epoch': 0.46} 46%|████▌ | 1963/4286 [14:54:17<15:53:24, 24.63s/it] 46%|████▌ | 1964/4286 [14:54:43<16:05:34, 24.95s/it] {'loss': 0.02, 'grad_norm': 2.8807087722170923, 'learning_rate': 5.417638824078395e-07, 'completion_length': 319.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6726191341876984, 'rewards/format_reward': 1.0, 'reward': 1.672619104385376, 'reward_std': 0.025651192292571068, 'kl': 0.5, 'epoch': 0.46} 46%|████▌ | 1964/4286 [14:54:43<16:05:34, 24.95s/it] 46%|████▌ | 1965/4286 [14:55:08<16:02:57, 24.89s/it] {'loss': 0.0225, 'grad_norm': 5.755175873806527, 'learning_rate': 5.415305646290247e-07, 'completion_length': 317.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.77976194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7619048357009888, 'reward_std': 0.1011904776096344, 'kl': 0.5625, 'epoch': 0.46} 46%|████▌ | 1965/4286 [14:55:08<16:02:57, 24.89s/it][2025-03-03 05:52:56,277] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 46%|████▌ | 1966/4286 [14:55:33<16:09:15, 25.07s/it] {'loss': 0.0175, 'grad_norm': 2.127908106425377, 'learning_rate': 5.412972468502099e-07, 'completion_length': 274.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.7738095819950104, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7380953431129456, 'reward_std': 0.14778826385736465, 'kl': 0.435302734375, 'epoch': 0.46} 46%|████▌ | 1966/4286 [14:55:33<16:09:15, 25.07s/it] 46%|████▌ | 1967/4286 [14:55:59<16:14:05, 25.20s/it] {'loss': 0.0278, 'grad_norm': 11.87268741353337, 'learning_rate': 5.410639290713952e-07, 'completion_length': 325.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7187500596046448, 'reward_std': 0.098371721804142, 'kl': 0.693359375, 'epoch': 0.46} 46%|████▌ | 1967/4286 [14:55:59<16:14:05, 25.20s/it] 46%|████▌ | 1968/4286 [14:56:23<16:06:24, 25.01s/it] {'loss': 0.0034, 'grad_norm': 1.7567119440917054, 'learning_rate': 5.408306112925805e-07, 'completion_length': 316.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.755952388048172, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.05314406845718622, 'kl': 0.0845947265625, 'epoch': 0.46} 46%|████▌ | 1968/4286 [14:56:23<16:06:24, 25.01s/it] 46%|████▌ | 1969/4286 [14:56:47<15:47:01, 24.52s/it] {'loss': 0.0056, 'grad_norm': 16.448894293020114, 'learning_rate': 5.405972935137657e-07, 'completion_length': 296.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6279762089252472, 'rewards/format_reward': 1.0, 'reward': 1.6279763579368591, 'reward_std': 0.11508527025580406, 'kl': 0.13916015625, 'epoch': 0.46} 46%|████▌ | 1969/4286 [14:56:47<15:47:01, 24.52s/it] 46%|████▌ | 1970/4286 [14:57:13<16:02:46, 24.94s/it] {'loss': 0.0097, 'grad_norm': 9.478133221646159, 'learning_rate': 5.403639757349509e-07, 'completion_length': 342.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7886905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7708334922790527, 'reward_std': 0.095238097012043, 'kl': 0.240966796875, 'epoch': 0.46} 46%|████▌ | 1970/4286 [14:57:13<16:02:46, 24.94s/it] 46%|████▌ | 1971/4286 [14:57:36<15:47:06, 24.55s/it] {'loss': 0.0184, 'grad_norm': 2.421360288039515, 'learning_rate': 5.401306579561363e-07, 'completion_length': 255.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.04711223021149635, 'kl': 0.458984375, 'epoch': 0.46} 46%|████▌ | 1971/4286 [14:57:36<15:47:06, 24.55s/it] 46%|████▌ | 1972/4286 [14:58:00<15:37:31, 24.31s/it] {'loss': 0.0209, 'grad_norm': 2.1769041843420895, 'learning_rate': 5.398973401773215e-07, 'completion_length': 282.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7619048953056335, 'reward_std': 0.09791731461882591, 'kl': 0.521728515625, 'epoch': 0.46} 46%|████▌ | 1972/4286 [14:58:00<15:37:31, 24.31s/it] 46%|████▌ | 1973/4286 [14:58:24<15:35:53, 24.28s/it] {'loss': 0.0118, 'grad_norm': 3.083959376869142, 'learning_rate': 5.396640223985067e-07, 'completion_length': 313.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7544643878936768, 'rewards/format_reward': 0.9821428656578064, 'reward': 
1.7366072535514832, 'reward_std': 0.0744047649204731, 'kl': 0.294921875, 'epoch': 0.46} 46%|████▌ | 1973/4286 [14:58:24<15:35:53, 24.28s/it] 46%|████▌ | 1974/4286 [14:58:51<16:02:04, 24.97s/it] {'loss': 0.0085, 'grad_norm': 4.124514982798725, 'learning_rate': 5.39430704619692e-07, 'completion_length': 332.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7607143223285675, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7428572177886963, 'reward_std': 0.07262943685054779, 'kl': 0.213623046875, 'epoch': 0.46} 46%|████▌ | 1974/4286 [14:58:51<16:02:04, 24.97s/it][2025-03-03 05:56:41,140] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 46%|████▌ | 1975/4286 [14:59:18<16:28:43, 25.67s/it] {'loss': 0.0203, 'grad_norm': 2.3980964950560297, 'learning_rate': 5.391973868408772e-07, 'completion_length': 330.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7008928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6830358505249023, 'reward_std': 0.0625, 'kl': 0.50732421875, 'epoch': 0.46} 46%|████▌ | 1975/4286 [14:59:18<16:28:43, 25.67s/it] 46%|████▌ | 1976/4286 [14:59:43<16:20:15, 25.46s/it] {'loss': 0.0106, 'grad_norm': 2.1867972889540503, 'learning_rate': 5.389640690620625e-07, 'completion_length': 292.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.732142984867096, 'reward_std': 0.07142857648432255, 'kl': 0.265625, 'epoch': 0.46} 46%|████▌ | 1976/4286 [14:59:43<16:20:15, 25.46s/it] 46%|████▌ | 1977/4286 [15:00:08<16:12:17, 25.27s/it] {'loss': 0.0052, 'grad_norm': 4.577088363352835, 'learning_rate': 5.387307512832477e-07, 'completion_length': 281.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.7857144474983215, 'reward_std': 0.07511191815137863, 'kl': 0.1307373046875, 'epoch': 0.46} 46%|████▌ | 1977/4286 [15:00:08<16:12:17, 25.27s/it][2025-03-03 05:57:58,620] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 46%|████▌ | 1978/4286 [15:00:36<16:39:57, 26.00s/it] {'loss': 0.0057, 'grad_norm': 5.086681714692386, 'learning_rate': 5.38497433504433e-07, 'completion_length': 303.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7633929252624512, 'rewards/format_reward': 1.0, 'reward': 1.7633929252624512, 'reward_std': 0.05587352253496647, 'kl': 0.14111328125, 'epoch': 0.46} 46%|████▌ | 1978/4286 [15:00:36<16:39:57, 26.00s/it][2025-03-03 05:58:23,629] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
46%|████▌ | 1979/4286 [15:01:01<16:28:07, 25.70s/it] {'loss': 0.0066, 'grad_norm': 3.452489698236838, 'learning_rate': 5.382641157256182e-07, 'completion_length': 261.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262983322144, 'reward_std': 0.06845238897949457, 'kl': 0.1656494140625, 'epoch': 0.46}
46%|████▌ | 1980/4286 [15:01:27<16:30:38, 25.78s/it] {'loss': 0.0058, 'grad_norm': 2.027291756895921, 'learning_rate': 5.380307979468035e-07, 'completion_length': 307.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6889881789684296, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.03335912525653839, 'kl': 0.145263671875, 'epoch': 0.46}
46%|████▌ | 1981/4286 [15:01:51<16:11:42, 25.29s/it] {'loss': 0.0062, 'grad_norm': 1.8078581085431922, 'learning_rate': 5.377974801679888e-07, 'completion_length': 296.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125000596046448, 'reward_std': 0.031603576615452766, 'kl': 0.154296875, 'epoch': 0.46}
46%|████▌ | 1982/4286 [15:02:17<16:19:00, 25.49s/it] {'loss': 0.0027, 'grad_norm': 4.971823135885081, 'learning_rate': 5.37564162389174e-07, 'completion_length': 301.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.8276785910129547, 'rewards/format_reward': 1.0, 'reward': 1.8276786804199219, 'reward_std': 0.015666970517486334, 'kl': 0.06640625, 'epoch': 0.46}
46%|████▋ | 1983/4286 [15:02:41<16:06:54, 25.19s/it] {'loss': 0.01, 'grad_norm': 1.731815449362314, 'learning_rate': 5.373308446103592e-07, 'completion_length': 303.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.754464328289032, 'reward_std': 0.023172441869974136, 'kl': 0.2491455078125, 'epoch': 0.46}
[2025-03-03 06:00:29,115] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
46%|████▋ | 1984/4286 [15:03:06<16:03:20, 25.11s/it] {'loss': 0.0019, 'grad_norm': 1.678757696572436, 'learning_rate': 5.370975268315446e-07, 'completion_length': 289.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7886905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7886905670166016, 'reward_std': 0.03780269995331764, 'kl': 0.04833984375, 'epoch': 0.46}
[2025-03-03 06:00:53,689] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
46%|████▋ | 1985/4286 [15:03:31<15:56:46, 24.95s/it] {'loss': 0.0016, 'grad_norm': 1.5697358661772116, 'learning_rate': 5.368642090527298e-07, 'completion_length': 282.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.020619653165340424, 'kl': 0.040283203125, 'epoch': 0.46}
46%|████▋ | 1986/4286 [15:03:55<15:48:06, 24.73s/it] {'loss': 0.0024, 'grad_norm': 29.954230090296058, 'learning_rate': 5.36630891273915e-07, 'completion_length': 318.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.8377976417541504, 'rewards/format_reward': 1.0, 'reward': 1.83779776096344, 'reward_std': 0.031143158674240112, 'kl': 0.059326171875, 'epoch': 0.46}
46%|████▋ | 1987/4286 [15:04:19<15:39:40, 24.52s/it] {'loss': 0.0059, 'grad_norm': 4.780941981082429, 'learning_rate': 5.363975734951003e-07, 'completion_length': 303.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7708333730697632, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.04007172957062721, 'kl': 0.147705078125, 'epoch': 0.46}
[2025-03-03 06:02:06,853] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
46%|████▋ | 1988/4286 [15:04:44<15:43:33, 24.64s/it] {'loss': 0.0086, 'grad_norm': 1.9154212245678672, 'learning_rate': 5.361642557162856e-07, 'completion_length': 298.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.13690477050840855, 'kl': 0.21630859375, 'epoch': 0.46}
46%|████▋ | 1989/4286 [15:05:07<15:29:51, 24.29s/it] {'loss': 0.0066, 'grad_norm': 12.987553461762051, 'learning_rate': 5.359309379374708e-07, 'completion_length': 281.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.8065477013587952, 'rewards/format_reward': 1.0, 'reward': 1.8065477013587952, 'reward_std': 0.08106430247426033, 'kl': 0.1640625, 'epoch': 0.46}
46%|████▋ | 1990/4286 [15:05:32<15:29:11, 24.28s/it] {'loss': 0.0076, 'grad_norm': 5.781206707830476, 'learning_rate': 5.35697620158656e-07, 'completion_length': 290.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.056333936750888824, 'kl': 0.189453125, 'epoch': 0.46}
[2025-03-03 06:03:20,465] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
46%|████▋ | 1991/4286 [15:05:58<15:46:58, 24.76s/it] {'loss': 0.0104, 'grad_norm': 1.5118174316355586, 'learning_rate': 5.354643023798413e-07, 'completion_length': 261.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.7348214685916901, 'rewards/format_reward': 1.0, 'reward': 1.7348214983940125, 'reward_std': 0.010664566420018673, 'kl': 0.26123046875, 'epoch': 0.46}
46%|████▋ | 1992/4286 [15:06:23<15:51:10, 24.88s/it] {'loss': 0.005, 'grad_norm': 2.2010618376275426, 'learning_rate': 5.352309846010266e-07, 'completion_length': 312.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.05038155056536198, 'kl': 0.124267578125, 'epoch': 0.46}
47%|████▋ | 1993/4286 [15:06:47<15:40:08, 24.60s/it] {'loss': 0.0024, 'grad_norm': 7.140011221789588, 'learning_rate': 5.349976668222118e-07, 'completion_length': 314.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.698511928319931, 'rewards/format_reward': 1.0, 'reward': 1.6985120177268982, 'reward_std': 0.08191166818141937, 'kl': 0.058837890625, 'epoch': 0.47}
47%|████▋ | 1994/4286 [15:07:10<15:30:17, 24.35s/it] {'loss': 0.0143, 'grad_norm': 5.888791400169087, 'learning_rate': 5.347643490433971e-07, 'completion_length': 270.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.77827388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7604168057441711, 'reward_std': 0.1455996371805668, 'kl': 0.356689453125, 'epoch': 0.47}
47%|████▋ | 1995/4286 [15:07:35<15:35:23, 24.50s/it] {'loss': 0.01, 'grad_norm': 10.25446987631741, 'learning_rate': 5.345310312645823e-07, 'completion_length': 320.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7247024178504944, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.07414468005299568, 'kl': 0.2501220703125, 'epoch': 0.47}
47%|████▋ | 1996/4286 [15:08:00<15:37:04, 24.55s/it] {'loss': 0.0045, 'grad_norm': 2.7197109337732424, 'learning_rate': 5.342977134857675e-07, 'completion_length': 262.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.7089286148548126, 'rewards/format_reward': 1.0, 'reward': 1.7089287042617798, 'reward_std': 0.0035714309196919203, 'kl': 0.1129150390625, 'epoch': 0.47}
47%|████▋ | 1997/4286 [15:08:26<15:57:32, 25.10s/it] {'loss': 0.0131, 'grad_norm': 0.8145936110974654, 'learning_rate': 5.340643957069529e-07, 'completion_length': 330.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.8809524178504944, 'rewards/format_reward': 1.0, 'reward': 1.880952537059784, 'reward_std': 0.0, 'kl': 0.327392578125, 'epoch': 0.47}
47%|████▋ | 1998/4286 [15:08:49<15:25:43, 24.28s/it] {'loss': 0.0015, 'grad_norm': 0.32189800918388967, 'learning_rate': 5.338310779281381e-07, 'completion_length': 259.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.0, 'kl': 0.0380859375, 'epoch': 0.47}
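Throughout these entries 'loss' tracks 'kl' at an almost exact 0.04 ratio (step 1966: 0.435302734375 x 0.04 = 0.01741 ~ 0.0175; step 1967: 0.693359375 x 0.04 = 0.02773 ~ 0.0278; step 2000 below: 0.114990234375 x 0.04 = 0.0046), consistent with a GRPO-style objective whose reported loss is dominated by a KL penalty with coefficient beta = 0.04. That coefficient is an inference from the numbers, not something this log states.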
47%|████▋ | 1999/4286 [15:09:13<15:21:09, 24.17s/it] {'loss': 0.0017, 'grad_norm': 4.960625344452757, 'learning_rate': 5.335977601493233e-07, 'completion_length': 284.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.8080357313156128, 'rewards/format_reward': 1.0, 'reward': 1.8080357909202576, 'reward_std': 0.08471458591520786, 'kl': 0.0428466796875, 'epoch': 0.47}
47%|████▋ | 2000/4286 [15:09:36<15:14:59, 24.02s/it] {'loss': 0.0046, 'grad_norm': 6.0898616616934165, 'learning_rate': 5.333644423705085e-07, 'completion_length': 297.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.06961995735764503, 'kl': 0.114990234375, 'epoch': 0.47}
47%|████▋ | 2001/4286 [15:13:41<57:16:39, 90.24s/it] {'loss': 0.0077, 'grad_norm': 4.635211038558845, 'learning_rate': 5.331311245916939e-07, 'completion_length': 304.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5684524178504944, 'rewards/format_reward': 1.0, 'reward': 1.5684524774551392, 'reward_std': 0.09858068078756332, 'kl': 0.19091796875, 'epoch': 0.47}
47%|████▋ | 2002/4286 [15:14:05<44:40:45, 70.42s/it] {'loss': 0.0103, 'grad_norm': 54.93296147822901, 'learning_rate': 5.328978068128791e-07, 'completion_length': 313.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.03709554299712181, 'kl': 0.2578125, 'epoch': 0.47}
47%|████▋ | 2003/4286 [15:14:30<35:55:04, 56.64s/it] {'loss': 0.0021, 'grad_norm': 0.5616273419813855, 'learning_rate': 5.326644890340643e-07, 'completion_length': 291.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8098214566707611, 'rewards/format_reward': 1.0, 'reward': 1.8098215460777283, 'reward_std': 0.04462423548102379, 'kl': 0.05126953125, 'epoch': 0.47}
[2025-03-03 06:12:18,824] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
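The jump at step 2001 (90.24s/it against the usual 24-26s/it) reflects roughly 245 s of wall time between steps 2000 and 2001 (elapsed goes from 15:09:36 to 15:13:41), which inflates the tqdm ETA to 57:16:39 until the moving average recovers; the same stall repeats right after step 2100 below. A pause of this shape at round-numbered steps is consistent with a periodic checkpoint save or evaluation every 100 steps, though that is an assumption; the log itself does not say.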
47%|████▋ | 2004/4286 [15:14:56<30:07:10, 47.52s/it] {'loss': 0.0038, 'grad_norm': 5.937217650768928, 'learning_rate': 5.324311712552496e-07, 'completion_length': 339.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7172619104385376, 'rewards/format_reward': 1.0, 'reward': 1.717262089252472, 'reward_std': 0.06547618471086025, 'kl': 0.0947265625, 'epoch': 0.47}
47%|████▋ | 2005/4286 [15:15:21<25:50:30, 40.78s/it] {'loss': 0.0145, 'grad_norm': 4.118796032146068, 'learning_rate': 5.321978534764349e-07, 'completion_length': 319.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6830357909202576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6651787161827087, 'reward_std': 0.07708030007779598, 'kl': 0.361572265625, 'epoch': 0.47}
[2025-03-03 06:13:06,300] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
47%|████▋ | 2006/4286 [15:15:43<22:20:11, 35.27s/it] {'loss': 0.005, 'grad_norm': 4.0072548603018, 'learning_rate': 5.319645356976201e-07, 'completion_length': 268.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.10076311323791742, 'kl': 0.1259765625, 'epoch': 0.47}
47%|████▋ | 2007/4286 [15:16:21<22:45:22, 35.95s/it] {'loss': 0.0024, 'grad_norm': 1.0942298246344524, 'learning_rate': 5.317312179188054e-07, 'completion_length': 306.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.03507719375193119, 'kl': 0.0587158203125, 'epoch': 0.47}
47%|████▋ | 2008/4286 [15:16:45<20:29:12, 32.38s/it] {'loss': 0.007, 'grad_norm': 0.4739927543411482, 'learning_rate': 5.314979001399906e-07, 'completion_length': 297.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.5416666716337204, 'rewards/format_reward': 1.0, 'reward': 1.5416668057441711, 'reward_std': 0.0, 'kl': 0.17333984375, 'epoch': 0.47}
47%|████▋ | 2009/4286 [15:17:08<18:45:14, 29.65s/it] {'loss': 0.0042, 'grad_norm': 5.9824730549996445, 'learning_rate': 5.312645823611759e-07, 'completion_length': 267.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.05357143096625805, 'kl': 0.10400390625, 'epoch': 0.47}
47%|████▋ | 2010/4286 [15:17:33<17:50:41, 28.23s/it] {'loss': 0.0029, 'grad_norm': 3.5317072768726114, 'learning_rate': 5.310312645823612e-07, 'completion_length': 266.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.04556369408965111, 'kl': 0.0733642578125, 'epoch': 0.47}
47%|████▋ | 2011/4286 [15:17:58<17:07:21, 27.10s/it] {'loss': 0.0032, 'grad_norm': 22.813187865497977, 'learning_rate': 5.307979468035464e-07, 'completion_length': 293.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7663690745830536, 'rewards/format_reward': 1.0, 'reward': 1.766369104385376, 'reward_std': 0.029461245983839035, 'kl': 0.079345703125, 'epoch': 0.47}
47%|████▋ | 2012/4286 [15:18:22<16:32:41, 26.19s/it] {'loss': 0.0019, 'grad_norm': 2.565039866244797, 'learning_rate': 5.305646290247316e-07, 'completion_length': 320.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.735119104385376, 'reward_std': 0.04722520709037781, 'kl': 0.046630859375, 'epoch': 0.47}
47%|████▋ | 2013/4286 [15:18:46<16:06:13, 25.51s/it] {'loss': 0.0018, 'grad_norm': 4.821716814699268, 'learning_rate': 5.303313112459169e-07, 'completion_length': 286.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.8913690745830536, 'rewards/format_reward': 1.0, 'reward': 1.8913691639900208, 'reward_std': 0.03808916639536619, 'kl': 0.044921875, 'epoch': 0.47}
47%|████▋ | 2014/4286 [15:19:10<15:52:47, 25.16s/it] {'loss': 0.0021, 'grad_norm': 2.944446636798176, 'learning_rate': 5.300979934671022e-07, 'completion_length': 302.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.031603580340743065, 'kl': 0.0517578125, 'epoch': 0.47}
47%|████▋ | 2015/4286 [15:19:34<15:44:16, 24.95s/it] {'loss': 0.0052, 'grad_norm': 5.873130617620867, 'learning_rate': 5.298646756882874e-07, 'completion_length': 271.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.025347060058265924, 'kl': 0.131103515625, 'epoch': 0.47}
[2025-03-03 06:17:23,280] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
47%|████▋ | 2016/4286 [15:20:00<15:55:21, 25.25s/it] {'loss': 0.0053, 'grad_norm': 3.921428700419989, 'learning_rate': 5.296313579094726e-07, 'completion_length': 279.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.62202388048172, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5863096117973328, 'reward_std': 0.0859102662652731, 'kl': 0.1337890625, 'epoch': 0.47}
47%|████▋ | 2017/4286 [15:20:25<15:49:07, 25.10s/it] {'loss': 0.01, 'grad_norm': 9.05861505499918, 'learning_rate': 5.29398040130658e-07, 'completion_length': 309.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.6413690447807312, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.07440476305782795, 'kl': 0.2490234375, 'epoch': 0.47}
[2025-03-03 06:18:14,025] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
47%|████▋ | 2018/4286 [15:20:51<15:58:59, 25.37s/it] {'loss': 0.0037, 'grad_norm': 5.583475419478234, 'learning_rate': 5.291647223518432e-07, 'completion_length': 302.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.04635532200336456, 'kl': 0.093017578125, 'epoch': 0.47}
[2025-03-03 06:18:38,830] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
47%|████▋ | 2019/4286 [15:21:16<15:52:09, 25.20s/it] {'loss': 0.0019, 'grad_norm': 2.536685538876342, 'learning_rate': 5.289314045730284e-07, 'completion_length': 303.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.8705357611179352, 'rewards/format_reward': 1.0, 'reward': 1.8705358505249023, 'reward_std': 0.03273809142410755, 'kl': 0.0469970703125, 'epoch': 0.47}
47%|████▋ | 2020/4286 [15:21:40<15:40:35, 24.91s/it] {'loss': 0.0103, 'grad_norm': 2.5312043700667224, 'learning_rate': 5.286980867942137e-07, 'completion_length': 284.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 1.0, 'reward': 1.602678656578064, 'reward_std': 0.03495405614376068, 'kl': 0.2564697265625, 'epoch': 0.47}
47%|████▋ | 2021/4286 [15:22:05<15:37:05, 24.82s/it] {'loss': 0.0058, 'grad_norm': 2.6065279558128847, 'learning_rate': 5.28464769015399e-07, 'completion_length': 331.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.7574404776096344, 'rewards/format_reward': 1.0, 'reward': 1.7574406266212463, 'reward_std': 0.057715192437171936, 'kl': 0.1455078125, 'epoch': 0.47}
47%|████▋ | 2022/4286 [15:22:31<15:47:28, 25.11s/it] {'loss': 0.004, 'grad_norm': 2.0990467821908716, 'learning_rate': 5.282314512365842e-07, 'completion_length': 333.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.7842262983322144, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7663691639900208, 'reward_std': 0.1324934009462595, 'kl': 0.09912109375, 'epoch': 0.47}
47%|████▋ | 2023/4286 [15:22:55<15:40:18, 24.93s/it] {'loss': 0.0029, 'grad_norm': 5.751304284867882, 'learning_rate': 5.279981334577694e-07, 'completion_length': 315.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8080357313156128, 'rewards/format_reward': 1.0, 'reward': 1.8080357909202576, 'reward_std': 0.031143157742917538, 'kl': 0.0716552734375, 'epoch': 0.47}
47%|████▋ | 2024/4286 [15:23:19<15:32:35, 24.74s/it] {'loss': 0.007, 'grad_norm': 4.601463739805156, 'learning_rate': 5.277648156789547e-07, 'completion_length': 283.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.04602411109954119, 'kl': 0.174072265625, 'epoch': 0.47}
47%|████▋ | 2025/4286 [15:23:45<15:38:47, 24.91s/it] {'loss': 0.0119, 'grad_norm': 3.978347558704246, 'learning_rate': 5.275314979001399e-07, 'completion_length': 315.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 1.0, 'reward': 1.5907739400863647, 'reward_std': 0.05059524066746235, 'kl': 0.296875, 'epoch': 0.47}
47%|████▋ | 2026/4286 [15:24:09<15:34:33, 24.81s/it] {'loss': 0.0111, 'grad_norm': 21.312961971504123, 'learning_rate': 5.272981801213252e-07, 'completion_length': 303.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5877976715564728, 'rewards/format_reward': 1.0, 'reward': 1.5877977013587952, 'reward_std': 0.13517062366008759, 'kl': 0.278564453125, 'epoch': 0.47}
47%|████▋ | 2027/4286 [15:24:34<15:30:09, 24.71s/it] {'loss': 0.013, 'grad_norm': 5.661234699889932, 'learning_rate': 5.270648623425105e-07, 'completion_length': 308.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.01785714365541935, 'kl': 0.325439453125, 'epoch': 0.47}
47%|████▋ | 2028/4286 [15:25:00<15:45:26, 25.12s/it] {'loss': 0.005, 'grad_norm': 2.4782517869677423, 'learning_rate': 5.268315445636957e-07, 'completion_length': 324.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.05243690870702267, 'kl': 0.124755859375, 'epoch': 0.47}
47%|████▋ | 2029/4286 [15:25:24<15:34:19, 24.84s/it] {'loss': 0.005, 'grad_norm': 7.813354158165314, 'learning_rate': 5.265982267848809e-07, 'completion_length': 316.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.5818452686071396, 'rewards/format_reward': 1.0, 'reward': 1.5818453431129456, 'reward_std': 0.030222328379750252, 'kl': 0.1260986328125, 'epoch': 0.47}
47%|████▋ | 2030/4286 [15:25:49<15:31:57, 24.79s/it] {'loss': 0.0032, 'grad_norm': 2.884768098351041, 'learning_rate': 5.263649090060663e-07, 'completion_length': 297.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7113095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.04946072772145271, 'kl': 0.0791015625, 'epoch': 0.47}
47%|████▋ | 2031/4286 [15:26:13<15:24:50, 24.61s/it] {'loss': 0.0035, 'grad_norm': 7.062605550754036, 'learning_rate': 5.261315912272515e-07, 'completion_length': 269.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 1.0, 'reward': 1.602678656578064, 'reward_std': 0.026785709895193577, 'kl': 0.0869140625, 'epoch': 0.47}
47%|████▋ | 2032/4286 [15:26:39<15:37:40, 24.96s/it] {'loss': 0.0127, 'grad_norm': 2.7806161045332347, 'learning_rate': 5.258982734484367e-07, 'completion_length': 284.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7604166865348816, 'rewards/format_reward': 1.0, 'reward': 1.7604168057441711, 'reward_std': 0.019238398410379887, 'kl': 0.318359375, 'epoch': 0.47}
47%|████▋ | 2033/4286 [15:27:03<15:29:32, 24.75s/it] {'loss': 0.0046, 'grad_norm': 2.232103182480702, 'learning_rate': 5.25664955669622e-07, 'completion_length': 287.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7752976417541504, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 0.11855553090572357, 'kl': 0.1162109375, 'epoch': 0.47}
47%|████▋ | 2034/4286 [15:27:29<15:39:37, 25.03s/it] {'loss': 0.0137, 'grad_norm': 4.232937722365116, 'learning_rate': 5.254316378908073e-07, 'completion_length': 319.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7782739102840424, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.04362921416759491, 'kl': 0.3446044921875, 'epoch': 0.47}
[2025-03-03 06:25:18,847] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
47%|████▋ | 2035/4286 [15:27:56<16:05:27, 25.73s/it] {'loss': 0.0226, 'grad_norm': 18.19048434548084, 'learning_rate': 5.251983201119925e-07, 'completion_length': 335.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7351191639900208, 'reward_std': 0.12940751761198044, 'kl': 0.564453125, 'epoch': 0.47}
48%|████▊ | 2036/4286 [15:28:22<16:05:42, 25.75s/it] {'loss': 0.0292, 'grad_norm': 1.9731151086515546, 'learning_rate': 5.249650023331777e-07, 'completion_length': 310.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6092687547206879, 'rewards/format_reward': 1.0, 'reward': 1.6092687845230103, 'reward_std': 0.0579476491548121, 'kl': 0.728271484375, 'epoch': 0.48}
48%|████▊ | 2037/4286 [15:28:45<15:42:13, 25.14s/it] {'loss': 0.0015, 'grad_norm': 12.186587014035167, 'learning_rate': 5.24731684554363e-07, 'completion_length': 299.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8610120415687561, 'rewards/format_reward': 1.0, 'reward': 1.861012041568756, 'reward_std': 0.029817864298820496, 'kl': 0.0362548828125, 'epoch': 0.48}
[2025-03-03 06:26:34,262] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
48%|████▊ | 2038/4286 [15:29:11<15:50:34, 25.37s/it] {'loss': 0.0089, 'grad_norm': 2.7642937975752107, 'learning_rate': 5.244983667755483e-07, 'completion_length': 340.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.7648810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7648811340332031, 'reward_std': 0.05255015939474106, 'kl': 0.222900390625, 'epoch': 0.48}
48%|████▊ | 2039/4286 [15:29:35<15:34:21, 24.95s/it] {'loss': 0.0069, 'grad_norm': 0.9421555566589823, 'learning_rate': 5.242650489967335e-07, 'completion_length': 299.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.741071492433548, 'rewards/format_reward': 1.0, 'reward': 1.7410715818405151, 'reward_std': 0.0178571417927742, 'kl': 0.1727294921875, 'epoch': 0.48}
48%|████▊ | 2040/4286 [15:30:00<15:33:02, 24.93s/it] {'loss': 0.015, 'grad_norm': 7.426704158854483, 'learning_rate': 5.240317312179188e-07, 'completion_length': 332.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.7232143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7232144474983215, 'reward_std': 0.054064907133579254, 'kl': 0.3740234375, 'epoch': 0.48}
48%|████▊ | 2041/4286 [15:30:25<15:34:15, 24.97s/it] {'loss': 0.0415, 'grad_norm': 10.352776408378482, 'learning_rate': 5.23798413439104e-07, 'completion_length': 324.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6997024416923523, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6818453073501587, 'reward_std': 0.10222826525568962, 'kl': 1.03662109375, 'epoch': 0.48}
48%|████▊ | 2042/4286 [15:30:51<15:44:37, 25.26s/it] {'loss': 0.0159, 'grad_norm': 2.861164536170724, 'learning_rate': 5.235650956602893e-07, 'completion_length': 311.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 1.0, 'reward': 1.5907739400863647, 'reward_std': 0.11132809519767761, 'kl': 0.3955078125, 'epoch': 0.48}
48%|████▊ | 2043/4286 [15:31:16<15:38:52, 25.11s/it] {'loss': 0.0118, 'grad_norm': 1.0675674395596764, 'learning_rate': 5.233317778814746e-07, 'completion_length': 315.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.05888672545552254, 'kl': 0.295166015625, 'epoch': 0.48}
48%|████▊ | 2044/4286 [15:31:42<15:48:02, 25.37s/it] {'loss': 0.0322, 'grad_norm': 9.551295578215253, 'learning_rate': 5.230984601026598e-07, 'completion_length': 344.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6799320578575134, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.644217848777771, 'reward_std': 0.1281980238854885, 'kl': 0.8056640625, 'epoch': 0.48}
48%|████▊ | 2045/4286 [15:32:08<15:52:50, 25.51s/it] {'loss': 0.0049, 'grad_norm': 2.834359865587399, 'learning_rate': 5.22865142323845e-07, 'completion_length': 293.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7425596117973328, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.06366758048534393, 'kl': 0.123046875, 'epoch': 0.48}
48%|████▊ | 2046/4286 [15:32:33<15:46:35, 25.36s/it] {'loss': 0.0053, 'grad_norm': 2.6160934335115575, 'learning_rate': 5.226318245450302e-07, 'completion_length': 324.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.059523805975914, 'kl': 0.132080078125, 'epoch': 0.48}
48%|████▊ | 2047/4286 [15:32:58<15:47:02, 25.38s/it] {'loss': 0.0051, 'grad_norm': 4.197938461734982, 'learning_rate': 5.223985067662156e-07, 'completion_length': 323.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.6383928954601288, 'rewards/format_reward': 1.0, 'reward': 1.638392984867096, 'reward_std': 0.045563699677586555, 'kl': 0.1279296875, 'epoch': 0.48}
48%|████▊ | 2048/4286 [15:33:22<15:26:57, 24.85s/it] {'loss': 0.0094, 'grad_norm': 2.0225777790070816, 'learning_rate': 5.221651889874008e-07, 'completion_length': 299.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7172620296478271, 'reward_std': 0.0535714328289032, 'kl': 0.234130859375, 'epoch': 0.48}
[2025-03-03 06:31:09,323] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
48%|████▊ | 2049/4286 [15:33:46<15:23:37, 24.77s/it] {'loss': 0.0128, 'grad_norm': 2.171350011003429, 'learning_rate': 5.21931871208586e-07, 'completion_length': 297.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.84077388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8229168057441711, 'reward_std': 0.07213572412729263, 'kl': 0.31884765625, 'epoch': 0.48}
48%|████▊ | 2050/4286 [15:34:10<15:10:31, 24.43s/it] {'loss': 0.004, 'grad_norm': 3.881574517717347, 'learning_rate': 5.216985534297713e-07, 'completion_length': 289.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8511905074119568, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.01877797581255436, 'kl': 0.10009765625, 'epoch': 0.48}
[2025-03-03 06:31:56,866] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
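For offline analysis it helps to recover the metric dicts from output like this. A minimal sketch, assuming the exact line shape above (a tqdm progress prefix followed by a Python dict literal); the file name is hypothetical:

    import ast
    import re

    # A metrics entry looks like "<step>/<total> [<elapsed><eta, rate>] {<dict>}".
    STEP_RE = re.compile(r"(\d+)/\d+ \[([^\]]+)\] (\{.*?\})")

    def parse_log(path):
        """Yield (step, timing, metrics) for every metrics entry in the log."""
        with open(path) as f:
            text = f.read()
        for m in STEP_RE.finditer(text):
            step, timing, metrics = m.groups()
            yield int(step), timing, ast.literal_eval(metrics)

    for step, timing, metrics in parse_log("train.log"):  # hypothetical path
        print(step, metrics['loss'], metrics['kl'], metrics['reward'])

The non-greedy match is safe here because the dicts contain no nested braces; warning lines and bare progress-bar redraws simply fail to match and are skipped.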
48%|████▊ | 2051/4286 [15:34:34<15:04:12, 24.27s/it] {'loss': 0.0021, 'grad_norm': 0.38486570685989924, 'learning_rate': 5.214652356509566e-07, 'completion_length': 278.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.818452537059784, 'reward_std': 0.01785714365541935, 'kl': 0.0521240234375, 'epoch': 0.48}
48%|████▊ | 2052/4286 [15:34:58<14:57:47, 24.11s/it] {'loss': 0.0098, 'grad_norm': 7.035606592393388, 'learning_rate': 5.212319178721418e-07, 'completion_length': 287.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.09661934897303581, 'kl': 0.24560546875, 'epoch': 0.48}
48%|████▊ | 2053/4286 [15:35:23<15:06:38, 24.36s/it] {'loss': 0.0276, 'grad_norm': 4.024278550393879, 'learning_rate': 5.209986000933271e-07, 'completion_length': 283.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7495039999485016, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7316468954086304, 'reward_std': 0.09304028935730457, 'kl': 0.689453125, 'epoch': 0.48}
48%|████▊ | 2054/4286 [15:35:48<15:20:10, 24.74s/it] {'loss': 0.0125, 'grad_norm': 26.849290975635647, 'learning_rate': 5.207652823145123e-07, 'completion_length': 294.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.8467262089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8288691639900208, 'reward_std': 0.12202381668612361, 'kl': 0.3115234375, 'epoch': 0.48}
48%|████▊ | 2055/4286 [15:36:14<15:26:30, 24.92s/it] {'loss': 0.0213, 'grad_norm': 4.036229900234627, 'learning_rate': 5.205319645356976e-07, 'completion_length': 284.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548357009888, 'reward_std': 0.0555737130343914, 'kl': 0.5322265625, 'epoch': 0.48}
48%|████▊ | 2056/4286 [15:36:38<15:18:40, 24.72s/it] {'loss': 0.0427, 'grad_norm': 12.230292551469264, 'learning_rate': 5.202986467568829e-07, 'completion_length': 289.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6636905074119568, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6279762983322144, 'reward_std': 0.1607142835855484, 'kl': 1.072265625, 'epoch': 0.48}
48%|████▊ | 2057/4286 [15:37:02<15:09:14, 24.48s/it] {'loss': 0.0254, 'grad_norm': 4.124728846394252, 'learning_rate': 5.200653289780681e-07, 'completion_length': 298.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.4642857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.446428656578064, 'reward_std': 0.08083896338939667, 'kl': 0.63671875, 'epoch': 0.48}
48%|████▊ | 2058/4286 [15:37:27<15:17:22, 24.70s/it] {'loss': 0.0429, 'grad_norm': 2.554788510325197, 'learning_rate': 5.198320111992533e-07, 'completion_length': 321.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.5364583730697632, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4828870296478271, 'reward_std': 0.1244736835360527, 'kl': 1.076171875, 'epoch': 0.48}
48%|████▊ | 2059/4286 [15:37:52<15:24:50, 24.92s/it] {'loss': 0.029, 'grad_norm': 10.20351742975922, 'learning_rate': 5.195986934204386e-07, 'completion_length': 274.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6441163122653961, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6084020733833313, 'reward_std': 0.11149775981903076, 'kl': 0.724609375, 'epoch': 0.48}
48%|████▊ | 2060/4286 [15:38:18<15:32:40, 25.14s/it] {'loss': 0.0083, 'grad_norm': 5.174308671110795, 'learning_rate': 5.193653756416239e-07, 'completion_length': 328.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.662202388048172, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.068452388048172, 'kl': 0.2060546875, 'epoch': 0.48}
48%|████▊ | 2061/4286 [15:38:44<15:40:49, 25.37s/it] {'loss': 0.0743, 'grad_norm': 4.875206946925307, 'learning_rate': 5.191320578628091e-07, 'completion_length': 361.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7927296161651611, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7570154666900635, 'reward_std': 0.18098211660981178, 'kl': 1.85546875, 'epoch': 0.48}
48%|████▊ | 2062/4286 [15:39:09<15:32:50, 25.17s/it] {'loss': 0.008, 'grad_norm': 3.0127434269208697, 'learning_rate': 5.188987400839943e-07, 'completion_length': 313.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.04007172957062721, 'kl': 0.19921875, 'epoch': 0.48}
48%|████▊ | 2063/4286 [15:39:33<15:20:08, 24.83s/it] {'loss': 0.051, 'grad_norm': 11.55248158471607, 'learning_rate': 5.186654223051797e-07, 'completion_length': 275.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.7422619760036469, 'rewards/format_reward': 1.0, 'reward': 1.7422620058059692, 'reward_std': 0.047612499445676804, 'kl': 1.27734375, 'epoch': 0.48}
48%|████▊ | 2064/4286 [15:39:58<15:28:16, 25.07s/it] {'loss': 0.0765, 'grad_norm': 25.979818036334866, 'learning_rate': 5.184321045263649e-07, 'completion_length': 339.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7311650216579437, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.695450782775879, 'reward_std': 0.15160714834928513, 'kl': 1.9140625, 'epoch': 0.48}
48%|████▊ | 2065/4286 [15:40:22<15:14:53, 24.72s/it] {'loss': 0.0239, 'grad_norm': 2.9056816688267335, 'learning_rate': 5.181987867475501e-07, 'completion_length': 266.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7485119104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7306548357009888, 'reward_std': 0.09410357847809792, 'kl': 0.595703125, 'epoch': 0.48}
48%|████▊ | 2066/4286 [15:40:47<15:13:40, 24.69s/it] {'loss': 0.0384, 'grad_norm': 25.304080871678273, 'learning_rate': 5.179654689687354e-07, 'completion_length': 329.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7708333134651184, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7529762983322144, 'reward_std': 0.0722704054787755, 'kl': 0.9561767578125, 'epoch': 0.48}
48%|████▊ | 2067/4286 [15:41:12<15:21:35, 24.92s/it] {'loss': 0.0093, 'grad_norm': 2.100564934695893, 'learning_rate': 5.177321511899207e-07, 'completion_length': 340.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.758928656578064, 'reward_std': 0.1116466149687767, 'kl': 0.233154296875, 'epoch': 0.48}
48%|████▊ | 2068/4286 [15:41:38<15:25:37, 25.04s/it] {'loss': 0.0328, 'grad_norm': 3.230608130461877, 'learning_rate': 5.174988334111059e-07, 'completion_length': 284.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7184524238109589, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7005953788757324, 'reward_std': 0.08230667212046683, 'kl': 0.8203125, 'epoch': 0.48}
48%|████▊ | 2069/4286 [15:42:03<15:33:23, 25.26s/it] {'loss': 0.0417, 'grad_norm': 4.085287994226372, 'learning_rate': 5.172655156322911e-07, 'completion_length': 277.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7604168057441711, 'reward_std': 0.09821428917348385, 'kl': 1.04296875, 'epoch': 0.48}
48%|████▊ | 2070/4286 [15:42:28<15:29:18, 25.16s/it] {'loss': 0.0138, 'grad_norm': 2.757973796368982, 'learning_rate': 5.170321978534764e-07, 'completion_length': 280.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.7200758159160614, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7022187113761902, 'reward_std': 0.08014263724908233, 'kl': 0.345947265625, 'epoch': 0.48}
48%|████▊ | 2071/4286 [15:42:55<15:43:56, 25.57s/it] {'loss': 0.0057, 'grad_norm': 6.250488823909818, 'learning_rate': 5.167988800746616e-07, 'completion_length': 286.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125001788139343, 'reward_std': 0.01785714365541935, 'kl': 0.143310546875, 'epoch': 0.48}
48%|████▊ | 2072/4286 [15:43:20<15:41:54, 25.53s/it] {'loss': 0.0231, 'grad_norm': 1.9556729424179176, 'learning_rate': 5.165655622958469e-07, 'completion_length': 298.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.6800595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.029548222199082375, 'kl': 0.576171875, 'epoch': 0.48}
48%|████▊ | 2073/4286 [15:43:46<15:39:25, 25.47s/it] {'loss': 0.0221, 'grad_norm': 2.654866507936553, 'learning_rate': 5.163322445170322e-07, 'completion_length': 280.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5982143878936768, 'reward_std': 0.10872611775994301, 'kl': 0.55224609375, 'epoch': 0.48}
[2025-03-03 06:41:33,448] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
48%|████▊ | 2074/4286 [15:44:11<15:32:51, 25.30s/it] {'loss': 0.0055, 'grad_norm': 6.17244253061911, 'learning_rate': 5.160989267382174e-07, 'completion_length': 278.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 1.0, 'reward': 1.7395835518836975, 'reward_std': 0.023595844628289342, 'kl': 0.138671875, 'epoch': 0.48}
48%|████▊ | 2075/4286 [15:44:36<15:39:10, 25.49s/it] {'loss': 0.0035, 'grad_norm': 13.869288056636423, 'learning_rate': 5.158656089594026e-07, 'completion_length': 314.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.029761902987957, 'kl': 0.0877685546875, 'epoch': 0.48}
48%|████▊ | 2076/4286 [15:45:01<15:28:58, 25.22s/it] {'loss': 0.0061, 'grad_norm': 0.6724982998994979, 'learning_rate': 5.15632291180588e-07, 'completion_length': 320.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.01709691435098648, 'kl': 0.152099609375, 'epoch': 0.48}
48%|████▊ | 2077/4286 [15:45:27<15:37:50, 25.47s/it] {'loss': 0.0162, 'grad_norm': 3.862173136677753, 'learning_rate': 5.153989734017732e-07, 'completion_length': 318.75, 'rewards/only_full_func_accuracy_reward': 0.6907738447189331, 'rewards/format_reward': 1.0, 'reward': 1.6907739639282227, 'reward_std': 0.02470116876065731, 'kl': 0.404296875, 'epoch': 0.48}
48%|████▊ | 2078/4286 [15:45:54<15:49:02, 25.79s/it] {'loss': 0.0053, 'grad_norm': 3.689654528840596, 'learning_rate': 5.151656556229584e-07, 'completion_length': 304.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8809524774551392, 'rewards/format_reward': 1.0, 'reward': 1.8809524774551392, 'reward_std': 0.05817560479044914, 'kl': 0.1326904296875, 'epoch': 0.48}
49%|████▊ | 2079/4286 [15:46:20<15:50:38, 25.84s/it] {'loss': 0.0026, 'grad_norm': 0.41871559540089454, 'learning_rate': 5.149323378441437e-07, 'completion_length': 296.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 1.0, 'reward': 1.6934524774551392, 'reward_std': 0.01785714365541935, 'kl': 0.064453125, 'epoch': 0.49}
49%|████▊ | 2080/4286 [15:46:47<16:02:19, 26.17s/it] {'loss': 0.0013, 'grad_norm': 0.6447104547823225, 'learning_rate': 5.14699020065329e-07, 'completion_length': 305.875, 'rewards/only_full_func_accuracy_reward': 0.761904776096344, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7261906266212463, 'reward_std': 0.13328753039240837, 'kl': 0.0323486328125, 'epoch': 0.49}
49%|████▊ | 2081/4286 [15:47:14<16:11:08, 26.43s/it] {'loss': 0.0015, 'grad_norm': 0.3262848652770224, 'learning_rate': 5.144657022865142e-07, 'completion_length': 343.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.833333432674408, 'reward_std': 0.0357142873108387, 'kl': 0.0382080078125, 'epoch': 0.49}
49%|████▊ | 2082/4286 [15:47:38<15:47:13, 25.79s/it] {'loss': 0.0056, 'grad_norm': 5.289516736238011, 'learning_rate': 5.142323845076994e-07, 'completion_length': 306.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6369048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.028166969306766987, 'kl': 0.141845703125, 'epoch': 0.49}
49%|████▊ | 2083/4286 [15:48:03<15:43:10, 25.69s/it] {'loss': 0.005, 'grad_norm': 1.8709078173445925, 'learning_rate': 5.139990667288847e-07, 'completion_length': 325.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.802083432674408, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.025103310123085976, 'kl': 0.12646484375, 'epoch': 0.49}
49%|████▊ | 2084/4286 [15:48:28<15:28:43, 25.31s/it] {'loss': 0.0094, 'grad_norm': 15.438663950801784, 'learning_rate': 5.1376574895007e-07, 'completion_length': 292.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.04123930633068085, 'kl': 0.234375, 'epoch': 0.49}
49%|████▊ | 2085/4286 [15:48:56<15:57:52, 26.11s/it] {'loss': 0.0035, 'grad_norm': 3.5598787599737483, 'learning_rate': 5.135324311712552e-07, 'completion_length': 333.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6860119700431824, 'reward_std': 0.10441340878605843, 'kl': 0.08642578125, 'epoch': 0.49}
49%|████▊ | 2086/4286 [15:49:21<15:44:48, 25.77s/it] {'loss': 0.0033, 'grad_norm': 9.454165493206386, 'learning_rate': 5.132991133924405e-07, 'completion_length': 274.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.8095238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.06388125941157341, 'kl': 0.082763671875, 'epoch': 0.49}
49%|████▊ | 2087/4286 [15:49:46<15:43:11, 25.74s/it] {'loss': 0.0017, 'grad_norm': 1.3721670438426334, 'learning_rate': 5.130657956136257e-07, 'completion_length': 317.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7425595819950104, 'rewards/format_reward': 1.0, 'reward': 1.7425596117973328, 'reward_std': 0.06526251137256622, 'kl': 0.041748046875, 'epoch': 0.49}
49%|████▊ | 2088/4286 [15:50:12<15:38:13, 25.61s/it] {'loss': 0.0036, 'grad_norm': 2.1339362183303816, 'learning_rate': 5.12832477834811e-07, 'completion_length': 308.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 1.0, 'reward': 1.7098215222358704, 'reward_std': 0.04329466260969639, 'kl': 0.090576171875, 'epoch': 0.49}
49%|████▊ | 2089/4286 [15:50:36<15:22:42, 25.20s/it] {'loss': 0.0079, 'grad_norm': 2.3207344354969126, 'learning_rate': 5.125991600559963e-07, 'completion_length': 312.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592262387275696, 'reward_std': 0.05393781512975693, 'kl': 0.19677734375, 'epoch': 0.49}
49%|████▉ | 2090/4286 [15:51:01<15:21:34, 25.18s/it] {'loss': 0.0101, 'grad_norm': 7.293470767928485, 'learning_rate': 5.123658422771815e-07, 'completion_length': 291.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.01877797581255436, 'kl': 0.251953125, 'epoch': 0.49}
49%|████▉ | 2091/4286 [15:51:26<15:19:22, 25.13s/it] {'loss': 0.0059, 'grad_norm': 6.499325435200206, 'learning_rate': 5.121325244983667e-07, 'completion_length': 319.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7812500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7812501192092896, 'reward_std': 0.044642859138548374, 'kl': 0.146240234375, 'epoch': 0.49}
49%|████▉ | 2092/4286 [15:51:51<15:13:15, 24.98s/it] {'loss': 0.0037, 'grad_norm': 5.907810667440947, 'learning_rate': 5.11899206719552e-07, 'completion_length': 310.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.8258928954601288, 'rewards/format_reward': 1.0, 'reward': 1.825892984867096, 'reward_std': 0.04329465702176094, 'kl': 0.0924072265625, 'epoch': 0.49}
49%|████▉ | 2093/4286 [15:52:17<15:28:14, 25.40s/it] {'loss': 0.0162, 'grad_norm': 2.3699961265252414, 'learning_rate': 5.116658889407373e-07, 'completion_length': 319.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7886905670166016, 'rewards/format_reward': 1.0, 'reward': 1.7886906862258911, 'reward_std': 0.09162086620926857, 'kl': 0.403564453125, 'epoch': 0.49}
49%|████▉ | 2094/4286 [15:52:44<15:44:31, 25.85s/it] {'loss': 0.0048, 'grad_norm': 1.594773126181517, 'learning_rate': 5.114325711619225e-07, 'completion_length': 320.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7261905670166016, 'reward_std': 0.08517501130700111, 'kl': 0.1202392578125, 'epoch': 0.49}
49%|████▉ | 2095/4286 [15:53:08<15:22:32, 25.26s/it] {'loss': 0.0073, 'grad_norm': 10.630383405607091, 'learning_rate': 5.111992533831077e-07, 'completion_length': 293.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6205357760190964, 'rewards/format_reward': 1.0, 'reward': 1.6205358505249023, 'reward_std': 0.04304791707545519, 'kl': 0.181640625, 'epoch': 0.49}
49%|████▉ | 2096/4286 [15:53:33<15:18:07, 25.15s/it] {'loss': 0.0025, 'grad_norm': 6.033086477051546, 'learning_rate': 5.10965935604293e-07, 'completion_length': 280.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.07091423496603966, 'kl': 0.0634765625, 'epoch': 0.49}
49%|████▉ | 2097/4286 [15:53:59<15:34:50, 25.62s/it] {'loss': 0.0048, 'grad_norm': 1.0204482359411144, 'learning_rate': 5.107326178254783e-07, 'completion_length': 313.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.626488208770752, 'reward_std': 0.10859859734773636, 'kl': 0.1204833984375, 'epoch': 0.49}
[2025-03-03 06:51:48,442] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 49%|████▉ | 2098/4286 [15:54:26<15:39:05, 25.75s/it] {'loss': 0.007, 'grad_norm': 0.7730583578130223, 'learning_rate': 5.104993000466635e-07, 'completion_length': 307.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7946429252624512, 'reward_std': 0.01785714365541935, 'kl': 0.1749267578125, 'epoch': 0.49} 49%|████▉ | 2098/4286 [15:54:26<15:39:05, 25.75s/it] 49%|████▉ | 2099/4286 [15:54:51<15:31:19, 25.55s/it] {'loss': 0.0133, 'grad_norm': 0.8873255157327105, 'learning_rate': 5.102659822678488e-07, 'completion_length': 295.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.0446428582072258, 'kl': 0.33154296875, 'epoch': 0.49} 49%|████▉ | 2099/4286 [15:54:51<15:31:19, 25.55s/it] 49%|████▉ | 2100/4286 [15:55:18<15:50:19, 26.08s/it] {'loss': 0.0193, 'grad_norm': 7.052171432365992, 'learning_rate': 5.10032664489034e-07, 'completion_length': 331.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5714287161827087, 'reward_std': 0.05825009196996689, 'kl': 0.48291015625, 'epoch': 0.49} 49%|████▉ | 2100/4286 [15:55:18<15:50:19, 26.08s/it] 49%|████▉ | 2101/4286 [15:58:53<50:15:29, 82.81s/it] {'loss': 0.0018, 'grad_norm': 0.6872656042166267, 'learning_rate': 5.097993467102193e-07, 'completion_length': 264.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8616072237491608, 'rewards/format_reward': 1.0, 'reward': 1.8616072535514832, 'reward_std': 0.038690474815666676, 'kl': 0.04412841796875, 'epoch': 0.49} 49%|████▉ | 2101/4286 [15:58:53<50:15:29, 82.81s/it] 49%|████▉ | 2102/4286 [15:59:17<39:33:04, 65.19s/it] {'loss': 0.0084, 'grad_norm': 17.54905810946222, 'learning_rate': 5.095660289314046e-07, 'completion_length': 316.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.6398809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6220239400863647, 'reward_std': 0.07525601610541344, 'kl': 0.21142578125, 'epoch': 0.49} 49%|████▉ | 2102/4286 [15:59:17<39:33:04, 65.19s/it] 49%|████▉ | 2103/4286 [15:59:40<31:48:06, 52.44s/it] {'loss': 0.0023, 'grad_norm': 0.3092493511772873, 'learning_rate': 5.093327111525898e-07, 'completion_length': 325.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.77976194024086, 'rewards/format_reward': 1.0, 'reward': 1.779762089252472, 'reward_std': 0.0, 'kl': 0.056640625, 'epoch': 0.49} 49%|████▉ | 2103/4286 [15:59:40<31:48:06, 52.44s/it] 49%|████▉ | 2104/4286 [16:00:03<26:21:49, 43.50s/it] {'loss': 0.0024, 'grad_norm': 4.421175711038021, 'learning_rate': 5.09099393373775e-07, 'completion_length': 318.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7872024476528168, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.019238398410379887, 'kl': 0.060302734375, 'epoch': 0.49} 49%|████▉ | 2104/4286 [16:00:03<26:21:49, 43.50s/it][2025-03-03 06:57:48,701] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. 
this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 49%|████▉ | 2105/4286 [16:00:26<22:40:38, 37.43s/it] {'loss': 0.0027, 'grad_norm': 4.0042819934985445, 'learning_rate': 5.088660755949603e-07, 'completion_length': 310.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.02816697023808956, 'kl': 0.068603515625, 'epoch': 0.49} 49%|████▉ | 2105/4286 [16:00:26<22:40:38, 37.43s/it] 49%|████▉ | 2106/4286 [16:00:50<20:12:03, 33.36s/it] {'loss': 0.0039, 'grad_norm': 7.115115045122577, 'learning_rate': 5.086327578161456e-07, 'completion_length': 298.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.719642847776413, 'rewards/format_reward': 1.0, 'reward': 1.7196429371833801, 'reward_std': 0.036178416572511196, 'kl': 0.0987548828125, 'epoch': 0.49} 49%|████▉ | 2106/4286 [16:00:50<20:12:03, 33.36s/it] 49%|████▉ | 2107/4286 [16:01:13<18:26:13, 30.46s/it] {'loss': 0.0026, 'grad_norm': 4.924099092698647, 'learning_rate': 5.083994400373308e-07, 'completion_length': 318.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7931548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7931548953056335, 'reward_std': 0.039858050644397736, 'kl': 0.064453125, 'epoch': 0.49} 49%|████▉ | 2107/4286 [16:01:13<18:26:13, 30.46s/it] 49%|████▉ | 2108/4286 [16:01:37<17:07:15, 28.30s/it] {'loss': 0.0022, 'grad_norm': 1.614765724254108, 'learning_rate': 5.08166122258516e-07, 'completion_length': 294.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7470238208770752, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.01785714365541935, 'kl': 0.055419921875, 'epoch': 0.49} 49%|████▉ | 2108/4286 [16:01:37<17:07:15, 28.30s/it] 49%|████▉ | 2109/4286 [16:02:02<16:38:54, 27.53s/it] {'loss': 0.0062, 'grad_norm': 2.1336842739221975, 'learning_rate': 5.079328044797014e-07, 'completion_length': 309.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7083333432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6904763579368591, 'reward_std': 0.09523810259997845, 'kl': 0.155517578125, 'epoch': 0.49} 49%|████▉ | 2109/4286 [16:02:02<16:38:54, 27.53s/it] 49%|████▉ | 2110/4286 [16:02:29<16:30:28, 27.31s/it] {'loss': 0.0064, 'grad_norm': 1.0257887758232316, 'learning_rate': 5.076994867008866e-07, 'completion_length': 314.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.836309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8184524774551392, 'reward_std': 0.04708904027938843, 'kl': 0.1591796875, 'epoch': 0.49} 49%|████▉ | 2110/4286 [16:02:29<16:30:28, 27.31s/it] 49%|████▉ | 2111/4286 [16:02:54<16:00:17, 26.49s/it] {'loss': 0.0139, 'grad_norm': 1.9420520916510404, 'learning_rate': 5.074661689220718e-07, 'completion_length': 263.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.6607143133878708, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.020619653165340424, 'kl': 0.3475341796875, 'epoch': 0.49} 49%|████▉ | 2111/4286 [16:02:54<16:00:17, 26.49s/it] 49%|████▉ | 2112/4286 [16:03:17<15:25:29, 25.54s/it] {'loss': 0.0063, 'grad_norm': 1.5917845956010366, 'learning_rate': 5.072328511432571e-07, 
'completion_length': 288.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.06388125754892826, 'kl': 0.158447265625, 'epoch': 0.49} 49%|████▉ | 2112/4286 [16:03:17<15:25:29, 25.54s/it] 49%|████▉ | 2113/4286 [16:03:41<15:05:27, 25.00s/it] {'loss': 0.0027, 'grad_norm': 1.8250916210772317, 'learning_rate': 5.069995333644424e-07, 'completion_length': 309.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.828869104385376, 'rewards/format_reward': 1.0, 'reward': 1.8288691639900208, 'reward_std': 0.0295482249930501, 'kl': 0.06689453125, 'epoch': 0.49} 49%|████▉ | 2113/4286 [16:03:41<15:05:27, 25.00s/it] 49%|████▉ | 2114/4286 [16:04:06<15:04:46, 24.99s/it] {'loss': 0.0037, 'grad_norm': 4.193473538547764, 'learning_rate': 5.067662155856276e-07, 'completion_length': 302.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592263579368591, 'reward_std': 0.06685744784772396, 'kl': 0.0911865234375, 'epoch': 0.49} 49%|████▉ | 2114/4286 [16:04:06<15:04:46, 24.99s/it] 49%|████▉ | 2115/4286 [16:04:31<15:03:53, 24.98s/it] {'loss': 0.0059, 'grad_norm': 1.454707956984543, 'learning_rate': 5.065328978068128e-07, 'completion_length': 309.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.482142984867096, 'reward_std': 0.0357142873108387, 'kl': 0.147705078125, 'epoch': 0.49} 49%|████▉ | 2115/4286 [16:04:31<15:03:53, 24.98s/it] 49%|████▉ | 2116/4286 [16:04:56<15:04:00, 25.00s/it] {'loss': 0.0038, 'grad_norm': 4.05864074804139, 'learning_rate': 5.062995800279981e-07, 'completion_length': 310.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.782738208770752, 'reward_std': 0.05541309341788292, 'kl': 0.094482421875, 'epoch': 0.49} 49%|████▉ | 2116/4286 [16:04:56<15:04:00, 25.00s/it] 49%|████▉ | 2117/4286 [16:05:21<15:02:35, 24.97s/it] {'loss': 0.0018, 'grad_norm': 23.513745568327312, 'learning_rate': 5.060662622491834e-07, 'completion_length': 315.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7514880895614624, 'rewards/format_reward': 1.0, 'reward': 1.7514882683753967, 'reward_std': 0.06845238618552685, 'kl': 0.0458984375, 'epoch': 0.49} 49%|████▉ | 2117/4286 [16:05:21<15:02:35, 24.97s/it] 49%|████▉ | 2118/4286 [16:05:45<14:53:06, 24.72s/it] {'loss': 0.003, 'grad_norm': 3.146359382390257, 'learning_rate': 5.058329444703686e-07, 'completion_length': 306.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7217262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7217263579368591, 'reward_std': 0.09023960679769516, 'kl': 0.0740966796875, 'epoch': 0.49} 49%|████▉ | 2118/4286 [16:05:45<14:53:06, 24.72s/it] 49%|████▉ | 2119/4286 [16:06:08<14:34:05, 24.20s/it] {'loss': 0.0043, 'grad_norm': 2.4840039165209844, 'learning_rate': 5.055996266915539e-07, 'completion_length': 238.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.04123930633068085, 'kl': 0.1065673828125, 'epoch': 0.49} 49%|████▉ | 2119/4286 [16:06:08<14:34:05, 24.20s/it] 49%|████▉ | 2120/4286 [16:06:32<14:29:05, 24.07s/it] {'loss': 0.0042, 'grad_norm': 5.1160698959173985, 'learning_rate': 5.053663089127391e-07, 'completion_length': 282.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7961310148239136, 
'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.05495268106460571, 'kl': 0.104736328125, 'epoch': 0.49} 49%|████▉ | 2120/4286 [16:06:32<14:29:05, 24.07s/it] 49%|████▉ | 2121/4286 [16:06:55<14:25:05, 23.97s/it] {'loss': 0.0022, 'grad_norm': 2.809356859221309, 'learning_rate': 5.051329911339243e-07, 'completion_length': 303.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7425595223903656, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.026785715483129025, 'kl': 0.05615234375, 'epoch': 0.49} 49%|████▉ | 2121/4286 [16:06:55<14:25:05, 23.97s/it] 50%|████▉ | 2122/4286 [16:07:21<14:44:52, 24.53s/it] {'loss': 0.0071, 'grad_norm': 1.4553829655584944, 'learning_rate': 5.048996733551097e-07, 'completion_length': 329.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.8324829936027527, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8146259188652039, 'reward_std': 0.07312925904989243, 'kl': 0.1767578125, 'epoch': 0.5} 50%|████▉ | 2122/4286 [16:07:21<14:44:52, 24.53s/it] 50%|████▉ | 2123/4286 [16:07:46<14:43:32, 24.51s/it] {'loss': 0.0018, 'grad_norm': 2.4426063119353305, 'learning_rate': 5.046663555762949e-07, 'completion_length': 315.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7898065745830536, 'rewards/format_reward': 1.0, 'reward': 1.7898066639900208, 'reward_std': 0.0334821455180645, 'kl': 0.044189453125, 'epoch': 0.5} 50%|████▉ | 2123/4286 [16:07:46<14:43:32, 24.51s/it] 50%|████▉ | 2124/4286 [16:08:11<14:47:45, 24.64s/it] {'loss': 0.0024, 'grad_norm': 6.274897463562961, 'learning_rate': 5.044330377974801e-07, 'completion_length': 277.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.5788690745830536, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5431548953056335, 'reward_std': 0.11236722767353058, 'kl': 0.0606689453125, 'epoch': 0.5} 50%|████▉ | 2124/4286 [16:08:11<14:47:45, 24.64s/it] 50%|████▉ | 2125/4286 [16:08:35<14:48:37, 24.67s/it] {'loss': 0.0026, 'grad_norm': 0.6890558659736178, 'learning_rate': 5.041997200186654e-07, 'completion_length': 314.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.7574404776096344, 'rewards/format_reward': 1.0, 'reward': 1.7574406266212463, 'reward_std': 0.008928571827709675, 'kl': 0.06494140625, 'epoch': 0.5} 50%|████▉ | 2125/4286 [16:08:35<14:48:37, 24.67s/it] 50%|████▉ | 2126/4286 [16:09:00<14:45:19, 24.59s/it] {'loss': 0.0035, 'grad_norm': 1.1920681702968658, 'learning_rate': 5.039664022398507e-07, 'completion_length': 269.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6666667461395264, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6488096117973328, 'reward_std': 0.0357142873108387, 'kl': 0.0875244140625, 'epoch': 0.5} 50%|████▉ | 2126/4286 [16:09:00<14:45:19, 24.59s/it] 50%|████▉ | 2127/4286 [16:09:23<14:30:53, 24.20s/it] {'loss': 0.0019, 'grad_norm': 3.4050213667121576, 'learning_rate': 5.037330844610359e-07, 'completion_length': 290.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.02380952052772045, 'kl': 0.0478515625, 'epoch': 0.5} 50%|████▉ | 2127/4286 [16:09:23<14:30:53, 24.20s/it] 50%|████▉ | 2128/4286 [16:09:46<14:20:28, 23.92s/it] {'loss': 0.0024, 'grad_norm': 0.8661995019484701, 'learning_rate': 5.034997666822211e-07, 'completion_length': 280.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.7752976417541504, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 
0.014880955684930086, 'kl': 0.059326171875, 'epoch': 0.5} 50%|████▉ | 2128/4286 [16:09:46<14:20:28, 23.92s/it] 50%|████▉ | 2129/4286 [16:10:10<14:15:40, 23.80s/it] {'loss': 0.0021, 'grad_norm': 9.354158385225585, 'learning_rate': 5.032664489034064e-07, 'completion_length': 297.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.053144071251153946, 'kl': 0.0533447265625, 'epoch': 0.5} 50%|████▉ | 2129/4286 [16:10:10<14:15:40, 23.80s/it] 50%|████▉ | 2130/4286 [16:10:34<14:18:09, 23.88s/it] {'loss': 0.0062, 'grad_norm': 2.373267804314717, 'learning_rate': 5.030331311245917e-07, 'completion_length': 289.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.8482142984867096, 'rewards/format_reward': 1.0, 'reward': 1.848214328289032, 'reward_std': 0.062286313623189926, 'kl': 0.1553955078125, 'epoch': 0.5} 50%|████▉ | 2130/4286 [16:10:34<14:18:09, 23.88s/it] 50%|████▉ | 2131/4286 [16:10:58<14:16:13, 23.84s/it] {'loss': 0.0021, 'grad_norm': 0.37492542175771515, 'learning_rate': 5.027998133457769e-07, 'completion_length': 298.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.648809552192688, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.025651192292571068, 'kl': 0.0521240234375, 'epoch': 0.5} 50%|████▉ | 2131/4286 [16:10:58<14:16:13, 23.84s/it] 50%|████▉ | 2132/4286 [16:11:21<14:12:54, 23.76s/it] {'loss': 0.0032, 'grad_norm': 2.666899289612495, 'learning_rate': 5.025664955669622e-07, 'completion_length': 293.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.690476268529892, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.07082725688815117, 'kl': 0.0791015625, 'epoch': 0.5} 50%|████▉ | 2132/4286 [16:11:21<14:12:54, 23.76s/it] 50%|████▉ | 2133/4286 [16:11:44<14:07:22, 23.61s/it] {'loss': 0.0052, 'grad_norm': 1.7131722985563773, 'learning_rate': 5.023331777881474e-07, 'completion_length': 271.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.0357142873108387, 'kl': 0.129638671875, 'epoch': 0.5} 50%|████▉ | 2133/4286 [16:11:44<14:07:22, 23.61s/it] 50%|████▉ | 2134/4286 [16:12:09<14:18:44, 23.94s/it] {'loss': 0.0036, 'grad_norm': 0.47209588744173253, 'learning_rate': 5.020998600093327e-07, 'completion_length': 303.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7261904776096344, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.016262203454971313, 'kl': 0.0904541015625, 'epoch': 0.5} 50%|████▉ | 2134/4286 [16:12:09<14:18:44, 23.94s/it] 50%|████▉ | 2135/4286 [16:12:31<14:00:31, 23.45s/it] {'loss': 0.0017, 'grad_norm': 0.5103240984650872, 'learning_rate': 5.01866542230518e-07, 'completion_length': 255.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.023809521459043026, 'kl': 0.043212890625, 'epoch': 0.5} 50%|████▉ | 2135/4286 [16:12:31<14:00:31, 23.45s/it] 50%|████▉ | 2136/4286 [16:12:56<14:07:22, 23.65s/it] {'loss': 0.0032, 'grad_norm': 2.8740471557897713, 'learning_rate': 5.016332244517032e-07, 'completion_length': 306.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.6532737910747528, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.026785715483129025, 'kl': 0.078857421875, 'epoch': 0.5} 50%|████▉ | 2136/4286 [16:12:56<14:07:22, 23.65s/it] 50%|████▉ 
| 2137/4286 [16:13:19<14:02:58, 23.54s/it] {'loss': 0.0019, 'grad_norm': 3.335701749963117, 'learning_rate': 5.013999066728884e-07, 'completion_length': 287.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.04274726752191782, 'kl': 0.048095703125, 'epoch': 0.5} 50%|████▉ | 2137/4286 [16:13:19<14:02:58, 23.54s/it] 50%|████▉ | 2138/4286 [16:13:42<13:56:49, 23.37s/it] {'loss': 0.0018, 'grad_norm': 3.375009256703748, 'learning_rate': 5.011665888940737e-07, 'completion_length': 286.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.05633394047617912, 'kl': 0.044189453125, 'epoch': 0.5} 50%|████▉ | 2138/4286 [16:13:42<13:56:49, 23.37s/it] 50%|████▉ | 2139/4286 [16:14:06<14:04:10, 23.59s/it] {'loss': 0.0024, 'grad_norm': 0.7079388844666131, 'learning_rate': 5.00933271115259e-07, 'completion_length': 286.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001788139343, 'reward_std': 0.03504018671810627, 'kl': 0.0611572265625, 'epoch': 0.5} 50%|████▉ | 2139/4286 [16:14:06<14:04:10, 23.59s/it] 50%|████▉ | 2140/4286 [16:14:28<13:42:47, 23.00s/it] {'loss': 0.0016, 'grad_norm': 0.06692871644470924, 'learning_rate': 5.006999533364442e-07, 'completion_length': 229.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.827381044626236, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.0, 'kl': 0.04046630859375, 'epoch': 0.5} 50%|████▉ | 2140/4286 [16:14:28<13:42:47, 23.00s/it] 50%|████▉ | 2141/4286 [16:14:51<13:50:20, 23.23s/it] {'loss': 0.0049, 'grad_norm': 4.032274347998471, 'learning_rate': 5.004666355576294e-07, 'completion_length': 288.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485119700431824, 'reward_std': 0.019238398410379887, 'kl': 0.1240234375, 'epoch': 0.5} 50%|████▉ | 2141/4286 [16:14:51<13:50:20, 23.23s/it] 50%|████▉ | 2142/4286 [16:15:14<13:45:56, 23.11s/it] {'loss': 0.0016, 'grad_norm': 2.88442246413385, 'learning_rate': 5.002333177788148e-07, 'completion_length': 242.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.7366072237491608, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.019238397479057312, 'kl': 0.0394287109375, 'epoch': 0.5} 50%|████▉ | 2142/4286 [16:15:14<13:45:56, 23.11s/it] 50%|█████ | 2143/4286 [16:15:37<13:41:18, 23.00s/it] {'loss': 0.0017, 'grad_norm': 2.177383284711662, 'learning_rate': 5e-07, 'completion_length': 243.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.672619104385376, 'reward_std': 0.04123930633068085, 'kl': 0.043701171875, 'epoch': 0.5} 50%|█████ | 2143/4286 [16:15:37<13:41:18, 23.00s/it] 50%|█████ | 2144/4286 [16:16:01<13:55:09, 23.39s/it] {'loss': 0.005, 'grad_norm': 0.7145488732031355, 'learning_rate': 4.997666822211852e-07, 'completion_length': 290.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.8333334028720856, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.020619653165340424, 'kl': 0.125732421875, 'epoch': 0.5} 50%|█████ | 2144/4286 [16:16:01<13:55:09, 23.39s/it] 50%|█████ | 2145/4286 [16:16:25<13:57:02, 23.46s/it] {'loss': 0.0124, 'grad_norm': 2.520880397339074, 'learning_rate': 4.995333644423705e-07, 'completion_length': 
294.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.8050595819950104, 'rewards/format_reward': 1.0, 'reward': 1.8050596714019775, 'reward_std': 0.03709554113447666, 'kl': 0.3115234375, 'epoch': 0.5} 50%|█████ | 2145/4286 [16:16:25<13:57:02, 23.46s/it] 50%|█████ | 2146/4286 [16:16:48<13:53:23, 23.37s/it] {'loss': 0.003, 'grad_norm': 1.9777766884486165, 'learning_rate': 4.993000466635557e-07, 'completion_length': 285.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.8110119700431824, 'rewards/format_reward': 1.0, 'reward': 1.8110120296478271, 'reward_std': 0.05495268478989601, 'kl': 0.073974609375, 'epoch': 0.5} 50%|█████ | 2146/4286 [16:16:48<13:53:23, 23.37s/it] 50%|█████ | 2147/4286 [16:17:11<13:45:04, 23.14s/it] {'loss': 0.0047, 'grad_norm': 0.5787050468691928, 'learning_rate': 4.99066728884741e-07, 'completion_length': 270.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.011904762126505375, 'kl': 0.1181640625, 'epoch': 0.5} 50%|█████ | 2147/4286 [16:17:11<13:45:04, 23.14s/it] 50%|█████ | 2148/4286 [16:17:37<14:16:07, 24.03s/it] {'loss': 0.0027, 'grad_norm': 1.0661441691324614, 'learning_rate': 4.988334111059263e-07, 'completion_length': 308.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6746032238006592, 'rewards/format_reward': 1.0, 'reward': 1.674603283405304, 'reward_std': 0.05515468493103981, 'kl': 0.068115234375, 'epoch': 0.5} 50%|█████ | 2148/4286 [16:17:37<14:16:07, 24.03s/it] 50%|█████ | 2149/4286 [16:18:01<14:23:54, 24.26s/it] {'loss': 0.007, 'grad_norm': 0.4650398014079085, 'learning_rate': 4.986000933271115e-07, 'completion_length': 323.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.0, 'kl': 0.1751708984375, 'epoch': 0.5} 50%|█████ | 2149/4286 [16:18:01<14:23:54, 24.26s/it] 50%|█████ | 2150/4286 [16:18:26<14:25:16, 24.31s/it] {'loss': 0.0034, 'grad_norm': 0.4341889225334534, 'learning_rate': 4.983667755482967e-07, 'completion_length': 300.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.008928571827709675, 'kl': 0.083984375, 'epoch': 0.5} 50%|█████ | 2150/4286 [16:18:26<14:25:16, 24.31s/it] 50%|█████ | 2151/4286 [16:18:49<14:16:54, 24.08s/it] {'loss': 0.0097, 'grad_norm': 14.001481995871924, 'learning_rate': 4.98133457769482e-07, 'completion_length': 303.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5004058629274368, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4825487732887268, 'reward_std': 0.0946731511503458, 'kl': 0.241943359375, 'epoch': 0.5} 50%|█████ | 2151/4286 [16:18:49<14:16:54, 24.08s/it] 50%|█████ | 2152/4286 [16:19:14<14:23:01, 24.26s/it] {'loss': 0.0028, 'grad_norm': 2.8555983333266366, 'learning_rate': 4.979001399906673e-07, 'completion_length': 302.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949405670166016, 'reward_std': 0.052436910569667816, 'kl': 0.0697021484375, 'epoch': 0.5} 50%|█████ | 2152/4286 [16:19:14<14:23:01, 24.26s/it] 50%|█████ | 2153/4286 [16:19:37<14:04:03, 23.74s/it] {'loss': 0.0034, 'grad_norm': 1.7330166803763218, 'learning_rate': 4.976668222118525e-07, 'completion_length': 295.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7127976417541504, 'rewards/format_reward': 1.0, 'reward': 
1.7127977013587952, 'reward_std': 0.008928571827709675, 'kl': 0.0858154296875, 'epoch': 0.5} 50%|█████ | 2153/4286 [16:19:37<14:04:03, 23.74s/it] 50%|█████ | 2154/4286 [16:20:01<14:08:56, 23.89s/it] {'loss': 0.009, 'grad_norm': 4.604006790744046, 'learning_rate': 4.974335044330377e-07, 'completion_length': 302.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.026785715483129025, 'kl': 0.22509765625, 'epoch': 0.5} 50%|█████ | 2154/4286 [16:20:01<14:08:56, 23.89s/it] 50%|█████ | 2155/4286 [16:20:25<14:12:52, 24.01s/it] {'loss': 0.0028, 'grad_norm': 2.353727565819259, 'learning_rate': 4.972001866542231e-07, 'completion_length': 291.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.65327388048172, 'reward_std': 0.03989280201494694, 'kl': 0.070068359375, 'epoch': 0.5} 50%|█████ | 2155/4286 [16:20:25<14:12:52, 24.01s/it] 50%|█████ | 2156/4286 [16:20:50<14:20:39, 24.24s/it] {'loss': 0.0019, 'grad_norm': 0.08744453915564658, 'learning_rate': 4.969668688754083e-07, 'completion_length': 302.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797619700431824, 'reward_std': 0.0, 'kl': 0.047119140625, 'epoch': 0.5} 50%|█████ | 2156/4286 [16:20:50<14:20:39, 24.24s/it] 50%|█████ | 2157/4286 [16:21:14<14:16:17, 24.13s/it] {'loss': 0.0029, 'grad_norm': 6.344056093264139, 'learning_rate': 4.967335510965935e-07, 'completion_length': 323.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.668154776096344, 'rewards/format_reward': 1.0, 'reward': 1.6681549549102783, 'reward_std': 0.04464286006987095, 'kl': 0.0712890625, 'epoch': 0.5} 50%|█████ | 2157/4286 [16:21:14<14:16:17, 24.13s/it] 50%|█████ | 2158/4286 [16:21:39<14:23:00, 24.33s/it] {'loss': 0.0043, 'grad_norm': 1.0761557805725444, 'learning_rate': 4.965002333177788e-07, 'completion_length': 308.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.77827388048172, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.026785715483129025, 'kl': 0.1085205078125, 'epoch': 0.5} 50%|█████ | 2158/4286 [16:21:39<14:23:00, 24.33s/it] 50%|█████ | 2159/4286 [16:22:03<14:23:28, 24.36s/it] {'loss': 0.0021, 'grad_norm': 4.138530182031538, 'learning_rate': 4.962669155389641e-07, 'completion_length': 311.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7827381789684296, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.029761911369860172, 'kl': 0.0526123046875, 'epoch': 0.5} 50%|█████ | 2159/4286 [16:22:03<14:23:28, 24.36s/it] 50%|█████ | 2160/4286 [16:22:28<14:28:58, 24.52s/it] {'loss': 0.0144, 'grad_norm': 9.783197492944808, 'learning_rate': 4.960335977601493e-07, 'completion_length': 333.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.730654776096344, 'rewards/format_reward': 1.0, 'reward': 1.7306548357009888, 'reward_std': 0.07876220531761646, 'kl': 0.3603515625, 'epoch': 0.5} 50%|█████ | 2160/4286 [16:22:28<14:28:58, 24.52s/it] 50%|█████ | 2161/4286 [16:22:51<14:17:02, 24.20s/it] {'loss': 0.0032, 'grad_norm': 1.2219730608620896, 'learning_rate': 4.958002799813345e-07, 'completion_length': 298.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.8169642984867096, 'rewards/format_reward': 1.0, 'reward': 1.8169644474983215, 'reward_std': 0.025347060058265924, 'kl': 0.078857421875, 'epoch': 0.5} 50%|█████ | 2161/4286 [16:22:51<14:17:02, 24.20s/it] 
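The stage3.py warnings above name two remedies for the allocator cache flushes: reduce memory consumption through settings, and add get_accelerator().empty_cache() calls so every rank flushes at the same point. A minimal sketch of the second remedy, with the engine and dataloader supplied by the caller (the function name and the 50-step interval are illustrative, not taken from this run's code):

from deepspeed.accelerator import get_accelerator

def train_with_synchronized_flushes(model_engine, train_dataloader, flush_interval=50):
    for step, batch in enumerate(train_dataloader):
        loss = model_engine(**batch).loss   # forward on the DeepSpeed engine
        model_engine.backward(loss)         # ZeRO stage-3 partitioned backward
        model_engine.step()                 # optimizer step
        if step % flush_interval == 0:
            # Flush the CUDA caching allocator at the same point on every rank,
            # instead of letting each rank flush ad hoc under memory pressure.
            get_accelerator().empty_cache()

For the first remedy, the usual ZeRO stage-3 levers are the stage3_* limits and the bucket sizes. A fragment of a DeepSpeed config showing those knobs; the values are examples to illustrate the direction of the change (smaller = less peak memory), not this run's actual settings:

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_max_live_parameters": 6e8,   # cap parameters kept materialized at once
        "stage3_max_reuse_distance": 6e8,    # release params that will not be reused soon
        "stage3_prefetch_bucket_size": 2e8,  # smaller buckets lower peak memory, cost some speed
        "reduce_bucket_size": 2e8,
    },
    "train_micro_batch_size_per_gpu": 1,     # first lever to pull when pressure persists
}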
50%|█████ | 2162/4286 [16:23:16<14:19:29, 24.28s/it] {'loss': 0.002, 'grad_norm': 5.303776863516637, 'learning_rate': 4.955669622025198e-07, 'completion_length': 286.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.02279588207602501, 'kl': 0.0494384765625, 'epoch': 0.5} 50%|█████ | 2162/4286 [16:23:16<14:19:29, 24.28s/it] 50%|█████ | 2163/4286 [16:23:39<14:10:53, 24.05s/it] {'loss': 0.0022, 'grad_norm': 1.87390550771914, 'learning_rate': 4.953336444237051e-07, 'completion_length': 278.07144927978516, 'rewards/only_full_func_accuracy_reward': 0.677083432674408, 'rewards/format_reward': 1.0, 'reward': 1.6770834922790527, 'reward_std': 0.020833336748182774, 'kl': 0.055419921875, 'epoch': 0.5} 50%|█████ | 2163/4286 [16:23:39<14:10:53, 24.05s/it] 50%|█████ | 2164/4286 [16:24:03<14:08:24, 23.99s/it] {'loss': 0.0039, 'grad_norm': 1.3301732303203606, 'learning_rate': 4.951003266448903e-07, 'completion_length': 293.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.8675596117973328, 'rewards/format_reward': 1.0, 'reward': 1.8675596117973328, 'reward_std': 0.03298483043909073, 'kl': 0.09716796875, 'epoch': 0.5} 50%|█████ | 2164/4286 [16:24:03<14:08:24, 23.99s/it] 51%|█████ | 2165/4286 [16:24:27<14:03:36, 23.86s/it] {'loss': 0.0073, 'grad_norm': 3.1517521012458407, 'learning_rate': 4.948670088660756e-07, 'completion_length': 295.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.04602411389350891, 'kl': 0.182861328125, 'epoch': 0.51} 51%|█████ | 2165/4286 [16:24:27<14:03:36, 23.86s/it] 51%|█████ | 2166/4286 [16:24:51<14:06:50, 23.97s/it] {'loss': 0.0037, 'grad_norm': 1.5514716373173434, 'learning_rate': 4.946336910872608e-07, 'completion_length': 295.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7797619998455048, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.011904764920473099, 'kl': 0.09368896484375, 'epoch': 0.51} 51%|█████ | 2166/4286 [16:24:51<14:06:50, 23.97s/it] 51%|█████ | 2167/4286 [16:25:15<14:07:17, 23.99s/it] {'loss': 0.0112, 'grad_norm': 29.06095194915721, 'learning_rate': 4.94400373308446e-07, 'completion_length': 287.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.050660968758165836, 'kl': 0.2783203125, 'epoch': 0.51} 51%|█████ | 2167/4286 [16:25:15<14:07:17, 23.99s/it] 51%|█████ | 2168/4286 [16:25:39<14:02:00, 23.85s/it] {'loss': 0.0092, 'grad_norm': 2.0226722347305213, 'learning_rate': 4.941670555296314e-07, 'completion_length': 261.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.026785715483129025, 'kl': 0.2294921875, 'epoch': 0.51} 51%|█████ | 2168/4286 [16:25:39<14:02:00, 23.85s/it] 51%|█████ | 2169/4286 [16:26:04<14:15:58, 24.26s/it] {'loss': 0.0054, 'grad_norm': 9.28078387745592, 'learning_rate': 4.939337377508166e-07, 'completion_length': 318.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6354167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6354167461395264, 'reward_std': 0.08772383257746696, 'kl': 0.1337890625, 'epoch': 0.51} 51%|█████ | 2169/4286 [16:26:04<14:15:58, 24.26s/it] 51%|█████ | 2170/4286 [16:26:27<14:06:50, 24.01s/it] {'loss': 0.0104, 'grad_norm': 7.143478347043306, 'learning_rate': 
4.937004199720018e-07, 'completion_length': 284.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6681548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6681549549102783, 'reward_std': 0.06685744598507881, 'kl': 0.25927734375, 'epoch': 0.51} 51%|█████ | 2170/4286 [16:26:27<14:06:50, 24.01s/it] 51%|█████ | 2171/4286 [16:26:51<14:02:50, 23.91s/it] {'loss': 0.0039, 'grad_norm': 3.250753830392051, 'learning_rate': 4.934671021931872e-07, 'completion_length': 277.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7410715222358704, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.08464721404016018, 'kl': 0.097900390625, 'epoch': 0.51} 51%|█████ | 2171/4286 [16:26:51<14:02:50, 23.91s/it] 51%|█████ | 2172/4286 [16:27:15<14:06:10, 24.02s/it] {'loss': 0.0027, 'grad_norm': 0.7987897898671055, 'learning_rate': 4.932337844143724e-07, 'completion_length': 316.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098214626312256, 'reward_std': 0.008928571827709675, 'kl': 0.0687255859375, 'epoch': 0.51} 51%|█████ | 2172/4286 [16:27:15<14:06:10, 24.02s/it] 51%|█████ | 2173/4286 [16:27:41<14:24:38, 24.55s/it] {'loss': 0.0041, 'grad_norm': 4.74297076568734, 'learning_rate': 4.930004666355576e-07, 'completion_length': 284.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.07804170437157154, 'kl': 0.10205078125, 'epoch': 0.51} 51%|█████ | 2173/4286 [16:27:41<14:24:38, 24.55s/it] 51%|█████ | 2174/4286 [16:28:06<14:28:23, 24.67s/it] {'loss': 0.0022, 'grad_norm': 1.642542951742342, 'learning_rate': 4.927671488567428e-07, 'completion_length': 310.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592263579368591, 'reward_std': 0.03869047574698925, 'kl': 0.0545654296875, 'epoch': 0.51} 51%|█████ | 2174/4286 [16:28:06<14:28:23, 24.67s/it] 51%|█████ | 2175/4286 [16:28:29<14:15:28, 24.31s/it] {'loss': 0.0033, 'grad_norm': 2.368971609558933, 'learning_rate': 4.925338310779281e-07, 'completion_length': 296.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.05654761753976345, 'kl': 0.08154296875, 'epoch': 0.51} 51%|█████ | 2175/4286 [16:28:29<14:15:28, 24.31s/it] 51%|█████ | 2176/4286 [16:28:53<14:08:25, 24.13s/it] {'loss': 0.0089, 'grad_norm': 8.660956671782593, 'learning_rate': 4.923005132991134e-07, 'completion_length': 298.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.7068452537059784, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.008928571827709675, 'kl': 0.2227783203125, 'epoch': 0.51} 51%|█████ | 2176/4286 [16:28:53<14:08:25, 24.13s/it] 51%|█████ | 2177/4286 [16:29:18<14:11:25, 24.22s/it] {'loss': 0.0019, 'grad_norm': 4.208734016947475, 'learning_rate': 4.920671955202986e-07, 'completion_length': 322.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8898809552192688, 'rewards/format_reward': 1.0, 'reward': 1.8898810744285583, 'reward_std': 0.02976190485060215, 'kl': 0.04833984375, 'epoch': 0.51} 51%|█████ | 2177/4286 [16:29:18<14:11:25, 24.22s/it] 51%|█████ | 2178/4286 [16:29:42<14:11:37, 24.24s/it] {'loss': 0.0019, 'grad_norm': 55.912996303990276, 'learning_rate': 4.918338777414839e-07, 'completion_length': 295.71429443359375, 'rewards/only_full_func_accuracy_reward': 
0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125001788139343, 'reward_std': 0.09044160321354866, 'kl': 0.0482177734375, 'epoch': 0.51} 51%|█████ | 2178/4286 [16:29:42<14:11:37, 24.24s/it] 51%|█████ | 2179/4286 [16:30:06<14:08:16, 24.16s/it] {'loss': 0.0075, 'grad_norm': 3.172091243857536, 'learning_rate': 4.916005599626691e-07, 'completion_length': 304.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6741072237491608, 'rewards/format_reward': 1.0, 'reward': 1.674107313156128, 'reward_std': 0.037669211626052856, 'kl': 0.1884765625, 'epoch': 0.51} 51%|█████ | 2179/4286 [16:30:06<14:08:16, 24.16s/it] 51%|█████ | 2180/4286 [16:30:30<14:11:13, 24.25s/it] {'loss': 0.0061, 'grad_norm': 1.633463419593703, 'learning_rate': 4.913672421838544e-07, 'completion_length': 307.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6502976417541504, 'rewards/format_reward': 1.0, 'reward': 1.6502977013587952, 'reward_std': 0.06250000465661287, 'kl': 0.153076171875, 'epoch': 0.51} 51%|█████ | 2180/4286 [16:30:30<14:11:13, 24.25s/it] 51%|█████ | 2181/4286 [16:30:55<14:18:28, 24.47s/it] {'loss': 0.0047, 'grad_norm': 0.7315754920083062, 'learning_rate': 4.911339244050397e-07, 'completion_length': 298.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.721726268529892, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.05016787815839052, 'kl': 0.117919921875, 'epoch': 0.51} 51%|█████ | 2181/4286 [16:30:55<14:18:28, 24.47s/it] 51%|█████ | 2182/4286 [16:31:19<14:15:02, 24.38s/it] {'loss': 0.002, 'grad_norm': 4.057950452033725, 'learning_rate': 4.909006066262249e-07, 'completion_length': 285.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.8184524774551392, 'reward_std': 0.05792887136340141, 'kl': 0.0499267578125, 'epoch': 0.51} 51%|█████ | 2182/4286 [16:31:19<14:15:02, 24.38s/it] 51%|█████ | 2183/4286 [16:31:44<14:12:43, 24.33s/it] {'loss': 0.0018, 'grad_norm': 5.907163490784608, 'learning_rate': 4.906672888474101e-07, 'completion_length': 332.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262387275696, 'reward_std': 0.031143159605562687, 'kl': 0.0460205078125, 'epoch': 0.51} 51%|█████ | 2183/4286 [16:31:44<14:12:43, 24.33s/it] 51%|█████ | 2184/4286 [16:32:07<14:07:41, 24.20s/it] {'loss': 0.0028, 'grad_norm': 14.816315137989717, 'learning_rate': 4.904339710685954e-07, 'completion_length': 288.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.796131044626236, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.02267500851303339, 'kl': 0.0704345703125, 'epoch': 0.51} 51%|█████ | 2184/4286 [16:32:07<14:07:41, 24.20s/it] 51%|█████ | 2185/4286 [16:32:32<14:11:17, 24.31s/it] {'loss': 0.0049, 'grad_norm': 1.7394444653688612, 'learning_rate': 4.902006532897807e-07, 'completion_length': 319.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7604166865348816, 'rewards/format_reward': 1.0, 'reward': 1.7604168057441711, 'reward_std': 0.03869047574698925, 'kl': 0.121337890625, 'epoch': 0.51} 51%|█████ | 2185/4286 [16:32:32<14:11:17, 24.31s/it] 51%|█████ | 2186/4286 [16:32:57<14:17:14, 24.49s/it] {'loss': 0.0035, 'grad_norm': 4.377478683204706, 'learning_rate': 4.899673355109659e-07, 'completion_length': 311.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6755953431129456, 'reward_std': 
0.1083104433491826, 'kl': 0.08837890625, 'epoch': 0.51} 51%|█████ | 2186/4286 [16:32:57<14:17:14, 24.49s/it] 51%|█████ | 2187/4286 [16:33:22<14:19:32, 24.57s/it] {'loss': 0.0018, 'grad_norm': 3.58202036399688, 'learning_rate': 4.897340177321511e-07, 'completion_length': 300.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7361820042133331, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.718324899673462, 'reward_std': 0.05739795882254839, 'kl': 0.04443359375, 'epoch': 0.51} 51%|█████ | 2187/4286 [16:33:22<14:19:32, 24.57s/it] 51%|█████ | 2188/4286 [16:33:47<14:27:11, 24.80s/it] {'loss': 0.0021, 'grad_norm': 2.2288383056840475, 'learning_rate': 4.895006999533365e-07, 'completion_length': 252.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750001192092896, 'reward_std': 0.04123930633068085, 'kl': 0.05224609375, 'epoch': 0.51} 51%|█████ | 2188/4286 [16:33:47<14:27:11, 24.80s/it] 51%|█████ | 2189/4286 [16:34:11<14:16:41, 24.51s/it] {'loss': 0.0092, 'grad_norm': 0.6854883482116669, 'learning_rate': 4.892673821745217e-07, 'completion_length': 293.375, 'rewards/only_full_func_accuracy_reward': 0.8660714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8660715222358704, 'reward_std': 0.010309826582670212, 'kl': 0.231689453125, 'epoch': 0.51} 51%|█████ | 2189/4286 [16:34:11<14:16:41, 24.51s/it] 51%|█████ | 2190/4286 [16:34:35<14:15:10, 24.48s/it] {'loss': 0.0018, 'grad_norm': 0.9934131711884441, 'learning_rate': 4.890340643957069e-07, 'completion_length': 313.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.5952381193637848, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.02380952425301075, 'kl': 0.0438232421875, 'epoch': 0.51} 51%|█████ | 2190/4286 [16:34:35<14:15:10, 24.48s/it] 51%|█████ | 2191/4286 [16:34:58<14:00:02, 24.06s/it] {'loss': 0.003, 'grad_norm': 3.4528549876372936, 'learning_rate': 4.888007466168922e-07, 'completion_length': 258.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.8065476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8065477013587952, 'reward_std': 0.05792887508869171, 'kl': 0.07470703125, 'epoch': 0.51} 51%|█████ | 2191/4286 [16:34:58<14:00:02, 24.06s/it] 51%|█████ | 2192/4286 [16:35:23<14:10:29, 24.37s/it] {'loss': 0.0017, 'grad_norm': 1.1860575860636113, 'learning_rate': 4.885674288380775e-07, 'completion_length': 339.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.764881044626236, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.005952383857220411, 'kl': 0.0421142578125, 'epoch': 0.51} 51%|█████ | 2192/4286 [16:35:23<14:10:29, 24.37s/it] 51%|█████ | 2193/4286 [16:35:48<14:08:55, 24.34s/it] {'loss': 0.0028, 'grad_norm': 0.5478094591456052, 'learning_rate': 4.883341110592627e-07, 'completion_length': 299.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.008928571827709675, 'kl': 0.0697021484375, 'epoch': 0.51} 51%|█████ | 2193/4286 [16:35:48<14:08:55, 24.34s/it] 51%|█████ | 2194/4286 [16:36:10<13:47:58, 23.75s/it] {'loss': 0.0034, 'grad_norm': 12.411851510132568, 'learning_rate': 4.88100793280448e-07, 'completion_length': 259.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.025651196017861366, 'kl': 0.0843505859375, 'epoch': 0.51} 51%|█████ | 2194/4286 [16:36:10<13:47:58, 23.75s/it] 
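Two regularities make the raw records easier to scan. First, 'reward' is the sum of the two reward terms, 'rewards/only_full_func_accuracy_reward' plus 'rewards/format_reward', agreeing to float32 precision (at step 2194 above: 0.7440476417541504 + 1.0 ≈ 1.7440477013587952). Second, 'learning_rate' falls by about 2.3332e-10 per step and passes exactly 5e-07 at step 2143 of 4286 (epoch 0.5), consistent with a linear decay to zero from a 1e-06 peak. A small sketch of that implied schedule; the peak is inferred from the logged values, not read from the run's config:

def implied_lr(step: int, total_steps: int = 4286, peak: float = 1e-06) -> float:
    # Linear decay to zero; reproduces e.g. 5.123658e-07 at step 2090
    # and exactly 5e-07 at the midpoint, step 2143.
    return peak * (1.0 - step / total_steps)

assert implied_lr(2143) == 5e-07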
51%|█████ | 2195/4286 [16:36:34<13:51:06, 23.85s/it] {'loss': 0.0067, 'grad_norm': 7.516225060713164, 'learning_rate': 4.878674755016332e-07, 'completion_length': 308.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 1.0, 'reward': 1.5937501192092896, 'reward_std': 0.05495268478989601, 'kl': 0.16796875, 'epoch': 0.51} 51%|█████ | 2195/4286 [16:36:34<13:51:06, 23.85s/it] 51%|█████ | 2196/4286 [16:36:58<13:47:55, 23.77s/it] {'loss': 0.006, 'grad_norm': 2.222383165513169, 'learning_rate': 4.876341577228184e-07, 'completion_length': 291.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.023809521924704313, 'kl': 0.14892578125, 'epoch': 0.51} 51%|█████ | 2196/4286 [16:36:58<13:47:55, 23.77s/it] 51%|█████▏ | 2197/4286 [16:37:23<13:59:49, 24.12s/it] {'loss': 0.0056, 'grad_norm': 4.626969068909566, 'learning_rate': 4.874008399440037e-07, 'completion_length': 308.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7770833671092987, 'rewards/format_reward': 1.0, 'reward': 1.7770835161209106, 'reward_std': 0.05046095699071884, 'kl': 0.14013671875, 'epoch': 0.51} 51%|█████▏ | 2197/4286 [16:37:23<13:59:49, 24.12s/it] 51%|█████▏ | 2198/4286 [16:37:47<13:56:10, 24.03s/it] {'loss': 0.0041, 'grad_norm': 3.6019237751657665, 'learning_rate': 4.87167522165189e-07, 'completion_length': 298.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.74702388048172, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.0535714328289032, 'kl': 0.1026611328125, 'epoch': 0.51} 51%|█████▏ | 2198/4286 [16:37:47<13:56:10, 24.03s/it] 51%|█████▏ | 2199/4286 [16:38:10<13:47:43, 23.80s/it] {'loss': 0.0015, 'grad_norm': 9.701044822231431, 'learning_rate': 4.869342043863742e-07, 'completion_length': 279.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548357009888, 'reward_std': 0.025190773652866483, 'kl': 0.037841796875, 'epoch': 0.51} 51%|█████▏ | 2199/4286 [16:38:10<13:47:43, 23.80s/it] 51%|█████▏ | 2200/4286 [16:38:33<13:37:27, 23.51s/it] {'loss': 0.003, 'grad_norm': 0.5964877323877218, 'learning_rate': 4.867008866075594e-07, 'completion_length': 249.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6964285671710968, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.0357142873108387, 'kl': 0.07568359375, 'epoch': 0.51} 51%|█████▏ | 2200/4286 [16:38:33<13:37:27, 23.51s/it] 51%|█████▏ | 2201/4286 [16:42:01<45:43:35, 78.95s/it] {'loss': 0.0045, 'grad_norm': 4.039577829545489, 'learning_rate': 4.864675688287448e-07, 'completion_length': 305.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7559524774551392, 'reward_std': 0.14090932719409466, 'kl': 0.113037109375, 'epoch': 0.51} 51%|█████▏ | 2201/4286 [16:42:01<45:43:35, 78.95s/it] 51%|█████▏ | 2202/4286 [16:42:26<36:20:50, 62.79s/it] {'loss': 0.0035, 'grad_norm': 5.9305648331763905, 'learning_rate': 4.8623425104993e-07, 'completion_length': 297.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.75, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.0357142835855484, 'kl': 0.087890625, 'epoch': 0.51} 51%|█████▏ | 2202/4286 [16:42:26<36:20:50, 62.79s/it] 51%|█████▏ | 2203/4286 [16:42:49<29:24:04, 50.81s/it] {'loss': 0.0014, 'grad_norm': 0.30395703035692834, 
'learning_rate': 4.860009332711152e-07, 'completion_length': 265.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.011904764920473099, 'kl': 0.0357666015625, 'epoch': 0.51} 51%|█████▏ | 2203/4286 [16:42:49<29:24:04, 50.81s/it] 51%|█████▏ | 2204/4286 [16:43:13<24:45:27, 42.81s/it] {'loss': 0.0067, 'grad_norm': 3.63482756716758, 'learning_rate': 4.857676154923005e-07, 'completion_length': 293.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7035714983940125, 'rewards/format_reward': 1.0, 'reward': 1.7035715579986572, 'reward_std': 0.06560760550200939, 'kl': 0.16748046875, 'epoch': 0.51} 51%|█████▏ | 2204/4286 [16:43:13<24:45:27, 42.81s/it] 51%|█████▏ | 2205/4286 [16:43:40<22:01:17, 38.10s/it] {'loss': 0.0046, 'grad_norm': 3.532844090900496, 'learning_rate': 4.855342977134858e-07, 'completion_length': 340.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7907738387584686, 'rewards/format_reward': 1.0, 'reward': 1.790773868560791, 'reward_std': 0.04298301041126251, 'kl': 0.113525390625, 'epoch': 0.51} 51%|█████▏ | 2205/4286 [16:43:40<22:01:17, 38.10s/it] 51%|█████▏ | 2206/4286 [16:44:04<19:33:21, 33.85s/it] {'loss': 0.0071, 'grad_norm': 10.314957187610219, 'learning_rate': 4.85300979934671e-07, 'completion_length': 283.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.5595238506793976, 'rewards/format_reward': 1.0, 'reward': 1.55952388048172, 'reward_std': 0.08229769766330719, 'kl': 0.17724609375, 'epoch': 0.51} 51%|█████▏ | 2206/4286 [16:44:04<19:33:21, 33.85s/it] 51%|█████▏ | 2207/4286 [16:44:28<17:52:42, 30.96s/it] {'loss': 0.0018, 'grad_norm': 0.30306962478663685, 'learning_rate': 4.850676621558562e-07, 'completion_length': 266.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.674107164144516, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.008928571827709675, 'kl': 0.0443115234375, 'epoch': 0.51} 51%|█████▏ | 2207/4286 [16:44:28<17:52:42, 30.96s/it] 52%|█████▏ | 2208/4286 [16:44:53<16:44:17, 29.00s/it] {'loss': 0.0035, 'grad_norm': 0.7009832424087207, 'learning_rate': 4.848343443770415e-07, 'completion_length': 308.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7663690745830536, 'rewards/format_reward': 1.0, 'reward': 1.7663691639900208, 'reward_std': 0.008928571827709675, 'kl': 0.0872802734375, 'epoch': 0.52} 52%|█████▏ | 2208/4286 [16:44:53<16:44:17, 29.00s/it] 52%|█████▏ | 2209/4286 [16:45:17<15:55:21, 27.60s/it] {'loss': 0.0017, 'grad_norm': 0.8771991856620291, 'learning_rate': 4.846010265982268e-07, 'completion_length': 291.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.02976190485060215, 'kl': 0.0416259765625, 'epoch': 0.52} 52%|█████▏ | 2209/4286 [16:45:17<15:55:21, 27.60s/it] 52%|█████▏ | 2210/4286 [16:45:41<15:18:06, 26.53s/it] {'loss': 0.0088, 'grad_norm': 4.155947627598998, 'learning_rate': 4.84367708819412e-07, 'completion_length': 280.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7362245619297028, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7183674573898315, 'reward_std': 0.11640937998890877, 'kl': 0.21923828125, 'epoch': 0.52} 52%|█████▏ | 2210/4286 [16:45:41<15:18:06, 26.53s/it] 52%|█████▏ | 2211/4286 [16:46:05<14:53:55, 25.85s/it] {'loss': 0.0045, 'grad_norm': 3.280331128612286, 'learning_rate': 4.841343910405973e-07, 'completion_length': 299.9464416503906, 
'rewards/only_full_func_accuracy_reward': 0.8258929252624512, 'rewards/format_reward': 1.0, 'reward': 1.825892984867096, 'reward_std': 0.02083333395421505, 'kl': 0.11376953125, 'epoch': 0.52} 52%|█████▏ | 2211/4286 [16:46:05<14:53:55, 25.85s/it] 52%|█████▏ | 2212/4286 [16:46:30<14:46:01, 25.63s/it] {'loss': 0.0091, 'grad_norm': 5.62770552339174, 'learning_rate': 4.839010732617825e-07, 'completion_length': 340.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.5449405014514923, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.527083396911621, 'reward_std': 0.10931013897061348, 'kl': 0.22705078125, 'epoch': 0.52} 52%|█████▏ | 2212/4286 [16:46:30<14:46:01, 25.63s/it] 52%|█████▏ | 2213/4286 [16:46:54<14:21:39, 24.94s/it] {'loss': 0.0047, 'grad_norm': 1.1655706062239026, 'learning_rate': 4.836677554829678e-07, 'completion_length': 299.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7812500894069672, 'rewards/format_reward': 1.0, 'reward': 1.7812500596046448, 'reward_std': 0.02900167927145958, 'kl': 0.1187744140625, 'epoch': 0.52} 52%|█████▏ | 2213/4286 [16:46:54<14:21:39, 24.94s/it] 52%|█████▏ | 2214/4286 [16:47:18<14:15:39, 24.78s/it] {'loss': 0.0027, 'grad_norm': 1.0745032647426394, 'learning_rate': 4.834344377041531e-07, 'completion_length': 304.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7288690507411957, 'rewards/format_reward': 1.0, 'reward': 1.728869080543518, 'reward_std': 0.022569325752556324, 'kl': 0.068603515625, 'epoch': 0.52} 52%|█████▏ | 2214/4286 [16:47:18<14:15:39, 24.78s/it] 52%|█████▏ | 2215/4286 [16:47:44<14:29:33, 25.19s/it] {'loss': 0.0028, 'grad_norm': 5.5375531737539445, 'learning_rate': 4.832011199253383e-07, 'completion_length': 331.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.06618334539234638, 'kl': 0.070556640625, 'epoch': 0.52} 52%|█████▏ | 2215/4286 [16:47:44<14:29:33, 25.19s/it] 52%|█████▏ | 2216/4286 [16:48:09<14:19:32, 24.91s/it] {'loss': 0.0026, 'grad_norm': 2.0245147845682867, 'learning_rate': 4.829678021465235e-07, 'completion_length': 285.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.1365518793463707, 'kl': 0.0648193359375, 'epoch': 0.52} 52%|█████▏ | 2216/4286 [16:48:09<14:19:32, 24.91s/it] 52%|█████▏ | 2217/4286 [16:48:34<14:23:25, 25.04s/it] {'loss': 0.007, 'grad_norm': 2.123396798305494, 'learning_rate': 4.827344843677089e-07, 'completion_length': 309.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7289540767669678, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.711097002029419, 'reward_std': 0.13732993975281715, 'kl': 0.1748046875, 'epoch': 0.52} 52%|█████▏ | 2217/4286 [16:48:34<14:23:25, 25.04s/it] 52%|█████▏ | 2218/4286 [16:48:58<14:12:17, 24.73s/it] {'loss': 0.0018, 'grad_norm': 0.8395716376185216, 'learning_rate': 4.825011665888941e-07, 'completion_length': 249.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.011904764920473099, 'kl': 0.0460205078125, 'epoch': 0.52} 52%|█████▏ | 2218/4286 [16:48:58<14:12:17, 24.73s/it] 52%|█████▏ | 2219/4286 [16:49:22<14:09:43, 24.67s/it] {'loss': 0.0022, 'grad_norm': 1.643850047227849, 'learning_rate': 4.822678488100793e-07, 'completion_length': 304.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6186756789684296, 
'rewards/format_reward': 1.0, 'reward': 1.618675708770752, 'reward_std': 0.027098647318780422, 'kl': 0.054931640625, 'epoch': 0.52} 52%|█████▏ | 2219/4286 [16:49:22<14:09:43, 24.67s/it] 52%|█████▏ | 2220/4286 [16:49:47<14:03:26, 24.49s/it] {'loss': 0.0054, 'grad_norm': 15.734624491355621, 'learning_rate': 4.820345310312645e-07, 'completion_length': 294.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.71726194024086, 'rewards/format_reward': 1.0, 'reward': 1.7172619700431824, 'reward_std': 0.042834239080548286, 'kl': 0.135986328125, 'epoch': 0.52} 52%|█████▏ | 2220/4286 [16:49:47<14:03:26, 24.49s/it] 52%|█████▏ | 2221/4286 [16:50:13<14:20:54, 25.01s/it] {'loss': 0.011, 'grad_norm': 3.6947850087776954, 'learning_rate': 4.818012132524499e-07, 'completion_length': 289.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 1.0, 'reward': 1.6026785969734192, 'reward_std': 0.02267500851303339, 'kl': 0.274169921875, 'epoch': 0.52} 52%|█████▏ | 2221/4286 [16:50:13<14:20:54, 25.01s/it] 52%|█████▏ | 2222/4286 [16:50:38<14:21:40, 25.05s/it] {'loss': 0.0089, 'grad_norm': 3.6679260281325985, 'learning_rate': 4.815678954736351e-07, 'completion_length': 306.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.07578601315617561, 'kl': 0.221923828125, 'epoch': 0.52} 52%|█████▏ | 2222/4286 [16:50:38<14:21:40, 25.05s/it] 52%|█████▏ | 2223/4286 [16:51:04<14:36:50, 25.50s/it] {'loss': 0.0075, 'grad_norm': 0.8045929014881158, 'learning_rate': 4.813345776948203e-07, 'completion_length': 301.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.8515964150428772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8337392807006836, 'reward_std': 0.0706168869510293, 'kl': 0.1859130859375, 'epoch': 0.52} 52%|█████▏ | 2223/4286 [16:51:04<14:36:50, 25.50s/it] 52%|█████▏ | 2224/4286 [16:51:29<14:25:37, 25.19s/it] {'loss': 0.0054, 'grad_norm': 1.3663328601487381, 'learning_rate': 4.811012599160056e-07, 'completion_length': 281.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 1.0, 'reward': 1.59077388048172, 'reward_std': 0.04177374858409166, 'kl': 0.134765625, 'epoch': 0.52} 52%|█████▏ | 2224/4286 [16:51:29<14:25:37, 25.19s/it] 52%|█████▏ | 2225/4286 [16:51:54<14:19:48, 25.03s/it] {'loss': 0.0162, 'grad_norm': 2.2775447914631526, 'learning_rate': 4.808679421371908e-07, 'completion_length': 311.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053571939468384, 'reward_std': 0.03243744093924761, 'kl': 0.4072265625, 'epoch': 0.52} 52%|█████▏ | 2225/4286 [16:51:54<14:19:48, 25.03s/it] 52%|█████▏ | 2226/4286 [16:52:19<14:22:30, 25.12s/it] {'loss': 0.0192, 'grad_norm': 3.0582186181421203, 'learning_rate': 4.806346243583761e-07, 'completion_length': 327.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7663690745830536, 'rewards/format_reward': 1.0, 'reward': 1.7663692235946655, 'reward_std': 0.06090506538748741, 'kl': 0.4794921875, 'epoch': 0.52} 52%|█████▏ | 2226/4286 [16:52:19<14:22:30, 25.12s/it] 52%|█████▏ | 2227/4286 [16:52:43<14:14:30, 24.90s/it] {'loss': 0.0175, 'grad_norm': 6.598013822344466, 'learning_rate': 4.804013065795614e-07, 'completion_length': 303.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8156251013278961, 'rewards/format_reward': 1.0, 'reward': 1.8156251311302185, 'reward_std': 
0.0888526439666748, 'kl': 0.435546875, 'epoch': 0.52} 52%|█████▏ | 2227/4286 [16:52:43<14:14:30, 24.90s/it] 52%|█████▏ | 2228/4286 [16:53:08<14:08:57, 24.75s/it] {'loss': 0.009, 'grad_norm': 2.4747682575545094, 'learning_rate': 4.801679888007466e-07, 'completion_length': 286.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875001192092896, 'reward_std': 0.08333333302289248, 'kl': 0.22509765625, 'epoch': 0.52} 52%|█████▏ | 2228/4286 [16:53:08<14:08:57, 24.75s/it] 52%|█████▏ | 2229/4286 [16:53:34<14:20:01, 25.09s/it] {'loss': 0.004, 'grad_norm': 1.294886997373875, 'learning_rate': 4.799346710219318e-07, 'completion_length': 341.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.008928571827709675, 'kl': 0.099609375, 'epoch': 0.52} 52%|█████▏ | 2229/4286 [16:53:34<14:20:01, 25.09s/it] 52%|█████▏ | 2230/4286 [16:53:58<14:09:40, 24.80s/it] {'loss': 0.0074, 'grad_norm': 3.0764003739524237, 'learning_rate': 4.797013532431171e-07, 'completion_length': 295.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187500596046448, 'reward_std': 0.0625, 'kl': 0.18310546875, 'epoch': 0.52} 52%|█████▏ | 2230/4286 [16:53:58<14:09:40, 24.80s/it] 52%|█████▏ | 2231/4286 [16:54:23<14:12:58, 24.90s/it] {'loss': 0.0067, 'grad_norm': 5.439556823631734, 'learning_rate': 4.794680354643024e-07, 'completion_length': 310.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6922619342803955, 'rewards/format_reward': 1.0, 'reward': 1.692262053489685, 'reward_std': 0.08072172850370407, 'kl': 0.16650390625, 'epoch': 0.52} 52%|█████▏ | 2231/4286 [16:54:23<14:12:58, 24.90s/it] 52%|█████▏ | 2232/4286 [16:54:48<14:14:54, 24.97s/it] {'loss': 0.0167, 'grad_norm': 9.54151248717004, 'learning_rate': 4.792347176854876e-07, 'completion_length': 295.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.846428632736206, 'rewards/format_reward': 1.0, 'reward': 1.8464286923408508, 'reward_std': 0.03161357529461384, 'kl': 0.4189453125, 'epoch': 0.52} 52%|█████▏ | 2232/4286 [16:54:48<14:14:54, 24.97s/it] 52%|█████▏ | 2233/4286 [16:55:13<14:19:12, 25.11s/it] {'loss': 0.0102, 'grad_norm': 2.454385948899849, 'learning_rate': 4.790013999066728e-07, 'completion_length': 296.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7500000596046448, 'reward_std': 0.08021794259548187, 'kl': 0.2548828125, 'epoch': 0.52} 52%|█████▏ | 2233/4286 [16:55:13<14:19:12, 25.11s/it] 52%|█████▏ | 2234/4286 [16:55:40<14:36:04, 25.62s/it] {'loss': 0.0243, 'grad_norm': 3.282037975582373, 'learning_rate': 4.787680821278582e-07, 'completion_length': 308.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6443453431129456, 'reward_std': 0.1636904925107956, 'kl': 0.60546875, 'epoch': 0.52} 52%|█████▏ | 2234/4286 [16:55:40<14:36:04, 25.62s/it][2025-03-03 07:53:29,400] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
52%|█████▏ | 2235/4286 [16:56:06<14:42:22, 25.81s/it] {'loss': 0.0083, 'grad_norm': 36.98616625898741, 'learning_rate': 4.785347643490434e-07, 'completion_length': 323.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0531440656632185, 'kl': 0.20849609375, 'epoch': 0.52}
52%|█████▏ | 2236/4286 [16:56:31<14:33:17, 25.56s/it] {'loss': 0.0132, 'grad_norm': 12.858823616644955, 'learning_rate': 4.783014465702286e-07, 'completion_length': 295.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.8005951941013336, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.07582749426364899, 'kl': 0.32861328125, 'epoch': 0.52}
52%|█████▏ | 2237/4286 [16:56:58<14:40:29, 25.78s/it] {'loss': 0.004, 'grad_norm': 1.0652410087559625, 'learning_rate': 4.780681287914139e-07, 'completion_length': 323.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7008928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7008929252624512, 'reward_std': 0.019238397479057312, 'kl': 0.1002197265625, 'epoch': 0.52}
52%|█████▏ | 2238/4286 [16:57:22<14:25:21, 25.35s/it] {'loss': 0.0053, 'grad_norm': 0.7669145718150989, 'learning_rate': 4.778348110125992e-07, 'completion_length': 311.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.824404776096344, 'rewards/format_reward': 1.0, 'reward': 1.8244049549102783, 'reward_std': 0.003436605678871274, 'kl': 0.1314697265625, 'epoch': 0.52}
52%|█████▏ | 2239/4286 [16:57:46<14:14:27, 25.04s/it] {'loss': 0.0167, 'grad_norm': 6.996076027243976, 'learning_rate': 4.776014932337844e-07, 'completion_length': 267.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7116071879863739, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6937500834465027, 'reward_std': 0.11330237053334713, 'kl': 0.4169921875, 'epoch': 0.52}
52%|█████▏ | 2240/4286 [16:58:13<14:33:02, 25.60s/it] {'loss': 0.0074, 'grad_norm': 13.205076528145034, 'learning_rate': 4.773681754549697e-07, 'completion_length': 351.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.691964328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6562501192092896, 'reward_std': 0.1667200569063425, 'kl': 0.185791015625, 'epoch': 0.52}
52%|█████▏ | 2241/4286 [16:58:40<14:42:17, 25.89s/it] {'loss': 0.0112, 'grad_norm': 6.979663990003733, 'learning_rate': 4.771348576761549e-07, 'completion_length': 351.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547619700431824, 'reward_std': 0.09239229280501604, 'kl': 0.2802734375, 'epoch': 0.52}
52%|█████▏ | 2242/4286 [16:59:07<14:56:41, 26.32s/it] {'loss': 0.007, 'grad_norm': 2.4327392526974845, 'learning_rate': 4.769015398973402e-07, 'completion_length': 325.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.6517857909202576, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.017857138067483902, 'kl': 0.17578125, 'epoch': 0.52}
[2025-03-03 07:56:59,390] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
52%|█████▏ | 2243/4286 [16:59:36<15:26:07, 27.20s/it] {'loss': 0.006, 'grad_norm': 2.59959086084116, 'learning_rate': 4.7666822211852543e-07, 'completion_length': 350.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7202382683753967, 'reward_std': 0.15172100067138672, 'kl': 0.1507568359375, 'epoch': 0.52}
52%|█████▏ | 2244/4286 [17:00:05<15:34:28, 27.46s/it] {'loss': 0.0022, 'grad_norm': 2.5506014894842313, 'learning_rate': 4.7643490433971065e-07, 'completion_length': 338.0714569091797, 'rewards/only_full_func_accuracy_reward': 0.6532738208770752, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.040532153099775314, 'kl': 0.0543212890625, 'epoch': 0.52}
[2025-03-03 07:57:53,489] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
52%|█████▏ | 2245/4286 [17:00:31<15:19:31, 27.03s/it] {'loss': 0.009, 'grad_norm': 6.763753476029972, 'learning_rate': 4.762015865608959e-07, 'completion_length': 339.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.7678572535514832, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.09364316426217556, 'kl': 0.22509765625, 'epoch': 0.52}
52%|█████▏ | 2246/4286 [17:00:57<15:10:27, 26.78s/it] {'loss': 0.0063, 'grad_norm': 1.4474518100547138, 'learning_rate': 4.759682687820812e-07, 'completion_length': 340.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.81101194024086, 'rewards/format_reward': 1.0, 'reward': 1.8110119700431824, 'reward_std': 0.026785715483129025, 'kl': 0.15771484375, 'epoch': 0.52}
52%|█████▏ | 2247/4286 [17:01:22<14:56:52, 26.39s/it] {'loss': 0.0065, 'grad_norm': 1.7656065729644606, 'learning_rate': 4.757349510032664e-07, 'completion_length': 290.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.01785714365541935, 'kl': 0.16162109375, 'epoch': 0.52}
52%|█████▏ | 2248/4286 [17:01:49<15:00:18, 26.51s/it] {'loss': 0.004, 'grad_norm': 1.3690910212048117, 'learning_rate': 4.755016332244517e-07, 'completion_length': 336.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.7404762208461761, 'rewards/format_reward': 1.0, 'reward': 1.7404762506484985, 'reward_std': 0.019047623965889215, 'kl': 0.100830078125, 'epoch': 0.52}
52%|█████▏ | 2249/4286 [17:02:16<15:03:19, 26.61s/it] {'loss': 0.0059, 'grad_norm': 2.9004978019304657, 'learning_rate': 4.7526831544563697e-07, 'completion_length': 327.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.01785714365541935, 'kl': 0.148681640625, 'epoch': 0.52}
52%|█████▏ | 2250/4286 [17:02:40<14:37:18, 25.85s/it] {'loss': 0.0103, 'grad_norm': 3.9696069892452885, 'learning_rate': 4.750349976668222e-07, 'completion_length': 272.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.0833333395421505, 'kl': 0.2578125, 'epoch': 0.52}
53%|█████▎ | 2251/4286 [17:03:05<14:30:36, 25.67s/it] {'loss': 0.0028, 'grad_norm': 0.22835659579582954, 'learning_rate': 4.7480167988800747e-07, 'completion_length': 308.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.8214287161827087, 'reward_std': 0.0, 'kl': 0.07080078125, 'epoch': 0.53}
53%|█████▎ | 2252/4286 [17:03:31<14:28:13, 25.61s/it] {'loss': 0.0135, 'grad_norm': 7.213247122636072, 'learning_rate': 4.745683621091927e-07, 'completion_length': 317.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6145834028720856, 'rewards/format_reward': 1.0, 'reward': 1.614583432674408, 'reward_std': 0.014880955684930086, 'kl': 0.33837890625, 'epoch': 0.53}
53%|█████▎ | 2253/4286 [17:03:57<14:36:37, 25.87s/it] {'loss': 0.0037, 'grad_norm': 2.6343032678597704, 'learning_rate': 4.7433504433037797e-07, 'completion_length': 311.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.04719168785959482, 'kl': 0.0927734375, 'epoch': 0.53}
53%|█████▎ | 2254/4286 [17:04:23<14:34:32, 25.82s/it] {'loss': 0.0048, 'grad_norm': 3.410917212477572, 'learning_rate': 4.7410172655156324e-07, 'completion_length': 279.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.525297686457634, 'rewards/format_reward': 1.0, 'reward': 1.5252977013587952, 'reward_std': 0.05732116661965847, 'kl': 0.1209716796875, 'epoch': 0.53}
[2025-03-03 08:02:14,628] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
53%|█████▎ | 2255/4286 [17:04:52<15:04:48, 26.73s/it] {'loss': 0.0196, 'grad_norm': 7.674012979728326, 'learning_rate': 4.7386840877274847e-07, 'completion_length': 366.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.8184524774551392, 'reward_std': 0.0357142798602581, 'kl': 0.491455078125, 'epoch': 0.53}
53%|█████▎ | 2256/4286 [17:05:19<15:10:05, 26.90s/it] {'loss': 0.002, 'grad_norm': 1.3655393716031785, 'learning_rate': 4.7363509099393374e-07, 'completion_length': 363.6964569091797, 'rewards/only_full_func_accuracy_reward': 0.815476268529892, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.797619104385376, 'reward_std': 0.08014346659183502, 'kl': 0.049560546875, 'epoch': 0.53}
[2025-03-03 08:03:08,653] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
53%|█████▎ | 2257/4286 [17:05:46<15:07:56, 26.85s/it] {'loss': 0.0036, 'grad_norm': 2.997909749868015, 'learning_rate': 4.7340177321511896e-07, 'completion_length': 335.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7797619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7619048357009888, 'reward_std': 0.09204822778701782, 'kl': 0.09033203125, 'epoch': 0.53}
[2025-03-03 08:03:35,692] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
53%|█████▎ | 2258/4286 [17:06:13<15:09:25, 26.91s/it] {'loss': 0.0023, 'grad_norm': 0.28971601557670457, 'learning_rate': 4.7316845543630424e-07, 'completion_length': 329.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.8690476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8511905074119568, 'reward_std': 0.0714285746216774, 'kl': 0.05810546875, 'epoch': 0.53}
53%|█████▎ | 2259/4286 [17:06:38<14:47:29, 26.27s/it] {'loss': 0.0045, 'grad_norm': 9.013386010119941, 'learning_rate': 4.729351376574895e-07, 'completion_length': 336.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.7113096117973328, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.04350833781063557, 'kl': 0.11181640625, 'epoch': 0.53}
53%|█████▎ | 2260/4286 [17:07:03<14:40:20, 26.07s/it] {'loss': 0.0149, 'grad_norm': 57.955243090863625, 'learning_rate': 4.7270181987867473e-07, 'completion_length': 337.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.8089286088943481, 'rewards/format_reward': 1.0, 'reward': 1.8089286088943481, 'reward_std': 0.04657791554927826, 'kl': 0.371826171875, 'epoch': 0.53}
53%|█████▎ | 2261/4286 [17:07:30<14:48:17, 26.32s/it] {'loss': 0.0104, 'grad_norm': 186.4178492914944, 'learning_rate': 4.7246850209986e-07, 'completion_length': 311.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7393707633018494, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7215136885643005, 'reward_std': 0.10089514777064323, 'kl': 0.25830078125, 'epoch': 0.53}
53%|█████▎ | 2262/4286 [17:07:56<14:41:20, 26.13s/it] {'loss': 0.0017, 'grad_norm': 0.27399898572781034, 'learning_rate': 4.7223518432104523e-07, 'completion_length': 317.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.02380952425301075, 'kl': 0.0416259765625, 'epoch': 0.53}
53%|█████▎ | 2263/4286 [17:08:20<14:21:39, 25.56s/it] {'loss': 0.0018, 'grad_norm': 0.4927324462013925, 'learning_rate': 4.720018665422305e-07, 'completion_length': 277.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.5892857909202576, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.013746436685323715, 'kl': 0.0452880859375, 'epoch': 0.53}
[2025-03-03 08:06:08,514] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
53%|█████▎ | 2264/4286 [17:08:46<14:21:57, 25.58s/it] {'loss': 0.0013, 'grad_norm': 1.6321649065662622, 'learning_rate': 4.717685487634158e-07, 'completion_length': 336.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7976190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7976192235946655, 'reward_std': 0.0595238134264946, 'kl': 0.0333251953125, 'epoch': 0.53}
53%|█████▎ | 2265/4286 [17:09:11<14:21:20, 25.57s/it] {'loss': 0.0029, 'grad_norm': 16.77606834715035, 'learning_rate': 4.71535230984601e-07, 'completion_length': 262.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.80952388048172, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.08021793887019157, 'kl': 0.072021484375, 'epoch': 0.53}
53%|█████▎ | 2266/4286 [17:09:36<14:10:42, 25.27s/it] {'loss': 0.01, 'grad_norm': 22.235012758110358, 'learning_rate': 4.713019132057863e-07, 'completion_length': 310.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261906862258911, 'reward_std': 0.013746436685323715, 'kl': 0.24920654296875, 'epoch': 0.53}
53%|█████▎ | 2267/4286 [17:10:01<14:11:20, 25.30s/it] {'loss': 0.0014, 'grad_norm': 0.9205983991414439, 'learning_rate': 4.710685954269715e-07, 'completion_length': 311.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.71577388048172, 'rewards/format_reward': 1.0, 'reward': 1.71577388048172, 'reward_std': 0.008928571827709675, 'kl': 0.035400390625, 'epoch': 0.53}
[2025-03-03 08:07:50,733] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
53%|█████▎ | 2268/4286 [17:10:28<14:25:18, 25.73s/it] {'loss': 0.0189, 'grad_norm': 2.549175827529325, 'learning_rate': 4.708352776481568e-07, 'completion_length': 320.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7521008849143982, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7342438101768494, 'reward_std': 0.0665527917444706, 'kl': 0.4716796875, 'epoch': 0.53}
53%|█████▎ | 2269/4286 [17:10:53<14:19:07, 25.56s/it] {'loss': 0.002, 'grad_norm': 2.7679532634088284, 'learning_rate': 4.7060195986934205e-07, 'completion_length': 331.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.691964328289032, 'rewards/format_reward': 1.0, 'reward': 1.6919643878936768, 'reward_std': 0.041765548288822174, 'kl': 0.048828125, 'epoch': 0.53}
53%|█████▎ | 2270/4286 [17:11:18<14:17:14, 25.51s/it] {'loss': 0.0043, 'grad_norm': 4.995441544117287, 'learning_rate': 4.703686420905273e-07, 'completion_length': 326.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6696429550647736, 'rewards/format_reward': 1.0, 'reward': 1.669642984867096, 'reward_std': 0.07364453375339508, 'kl': 0.1065673828125, 'epoch': 0.53}
53%|█████▎ | 2271/4286 [17:11:44<14:18:05, 25.55s/it] {'loss': 0.0029, 'grad_norm': 7.536973597121662, 'learning_rate': 4.7013532431171255e-07, 'completion_length': 279.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7544642984867096, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.031143159605562687, 'kl': 0.0718994140625, 'epoch': 0.53}
53%|█████▎ | 2272/4286 [17:12:10<14:26:37, 25.82s/it] {'loss': 0.0042, 'grad_norm': 1.7209584974168137, 'learning_rate': 4.699020065328978e-07, 'completion_length': 292.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7767857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7767857909202576, 'reward_std': 0.005952383857220411, 'kl': 0.103271484375, 'epoch': 0.53}
53%|█████▎ | 2273/4286 [17:12:36<14:25:49, 25.81s/it] {'loss': 0.004, 'grad_norm': 7.692583431789794, 'learning_rate': 4.6966868875408305e-07, 'completion_length': 328.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7458333671092987, 'rewards/format_reward': 1.0, 'reward': 1.7458335161209106, 'reward_std': 0.06579059921205044, 'kl': 0.10052490234375, 'epoch': 0.53}
53%|█████▎ | 2274/4286 [17:13:01<14:13:54, 25.46s/it] {'loss': 0.0019, 'grad_norm': 1.2207915918183632, 'learning_rate': 4.694353709752683e-07, 'completion_length': 295.7678756713867, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755954027175903, 'reward_std': 0.01785713993012905, 'kl': 0.0467529296875, 'epoch': 0.53}
53%|█████▎ | 2275/4286 [17:13:28<14:25:48, 25.83s/it] {'loss': 0.0059, 'grad_norm': 5.701356317822562, 'learning_rate': 4.6920205319645354e-07, 'completion_length': 304.75, 'rewards/only_full_func_accuracy_reward': 0.6880952715873718, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.670238196849823, 'reward_std': 0.07115893252193928, 'kl': 0.1474609375, 'epoch': 0.53}
53%|█████▎ | 2276/4286 [17:13:54<14:29:08, 25.94s/it] {'loss': 0.0035, 'grad_norm': 6.557808948424518, 'learning_rate': 4.689687354176388e-07, 'completion_length': 308.625, 'rewards/only_full_func_accuracy_reward': 0.699404776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6815477013587952, 'reward_std': 0.13226891309022903, 'kl': 0.088623046875, 'epoch': 0.53}
53%|█████▎ | 2277/4286 [17:14:21<14:39:34, 26.27s/it] {'loss': 0.003, 'grad_norm': 0.4590074032940889, 'learning_rate': 4.687354176388241e-07, 'completion_length': 362.5000305175781, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6369048953056335, 'reward_std': 0.04123930633068085, 'kl': 0.075927734375, 'epoch': 0.53}
53%|█████▎ | 2278/4286 [17:14:45<14:15:43, 25.57s/it] {'loss': 0.0021, 'grad_norm': 0.5707974103597023, 'learning_rate': 4.685020998600093e-07, 'completion_length': 307.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904763579368591, 'reward_std': 0.011904759332537651, 'kl': 0.05224609375, 'epoch': 0.53}
53%|█████▎ | 2279/4286 [17:15:11<14:17:39, 25.64s/it] {'loss': 0.0035, 'grad_norm': 2.054481868639015, 'learning_rate': 4.682687820811946e-07, 'completion_length': 326.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.020833336748182774, 'kl': 0.08837890625, 'epoch': 0.53}
53%|█████▎ | 2280/4286 [17:15:36<14:12:19, 25.49s/it] {'loss': 0.0016, 'grad_norm': 1.052843547893897, 'learning_rate': 4.680354643023798e-07, 'completion_length': 254.26787567138672, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.0, 'kl': 0.04052734375, 'epoch': 0.53}
53%|█████▎ | 2281/4286 [17:16:01<14:12:52, 25.52s/it] {'loss': 0.0055, 'grad_norm': 2.9787703388330384, 'learning_rate': 4.678021465235651e-07, 'completion_length': 300.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.038476798217743635, 'kl': 0.13818359375, 'epoch': 0.53}
53%|█████▎ | 2282/4286 [17:16:25<13:58:06, 25.09s/it] {'loss': 0.0043, 'grad_norm': 1.7888641579788835, 'learning_rate': 4.6756882874475036e-07, 'completion_length': 310.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.8348215222358704, 'rewards/format_reward': 1.0, 'reward': 1.8348215818405151, 'reward_std': 0.031055688858032227, 'kl': 0.1083984375, 'epoch': 0.53}
53%|█████▎ | 2283/4286 [17:16:51<14:06:40, 25.36s/it] {'loss': 0.0077, 'grad_norm': 2.3748097171252613, 'learning_rate': 4.673355109659356e-07, 'completion_length': 351.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7380952537059784, 'rewards/format_reward': 1.0, 'reward': 1.7380954027175903, 'reward_std': 0.03335827589035034, 'kl': 0.194580078125, 'epoch': 0.53}
53%|█████▎ | 2284/4286 [17:17:16<13:58:26, 25.13s/it] {'loss': 0.004, 'grad_norm': 0.6636271398828412, 'learning_rate': 4.6710219318712086e-07, 'completion_length': 288.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.77827388048172, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.032738092355430126, 'kl': 0.1011962890625, 'epoch': 0.53}
53%|█████▎ | 2285/4286 [17:17:42<14:08:34, 25.44s/it] {'loss': 0.0019, 'grad_norm': 0.389373315261528, 'learning_rate': 4.668688754083061e-07, 'completion_length': 301.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.60714291036129, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.025651194155216217, 'kl': 0.0487060546875, 'epoch': 0.53}
53%|█████▎ | 2286/4286 [17:18:08<14:10:00, 25.50s/it] {'loss': 0.0017, 'grad_norm': 0.8194668995024629, 'learning_rate': 4.6663555762949136e-07, 'completion_length': 296.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.04627084918320179, 'kl': 0.041259765625, 'epoch': 0.53}
53%|█████▎ | 2287/4286 [17:18:33<14:09:38, 25.50s/it] {'loss': 0.0031, 'grad_norm': 50.2036779109612, 'learning_rate': 4.6640223985067663e-07, 'completion_length': 307.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.735119104385376, 'reward_std': 0.026572035625576973, 'kl': 0.07666015625, 'epoch': 0.53}
53%|█████▎ | 2288/4286 [17:19:00<14:23:49, 25.94s/it] {'loss': 0.0046, 'grad_norm': 2.1602869124148265, 'learning_rate': 4.6616892207186186e-07, 'completion_length': 356.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.7681547999382019, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7324405312538147, 'reward_std': 0.1176844909787178, 'kl': 0.115478515625, 'epoch': 0.53}
53%|█████▎ | 2289/4286 [17:19:27<14:31:00, 26.17s/it] {'loss': 0.0015, 'grad_norm': 1.2135172923405129, 'learning_rate': 4.6593560429304713e-07, 'completion_length': 331.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7351190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7172620296478271, 'reward_std': 0.06388125568628311, 'kl': 0.03759765625, 'epoch': 0.53}
53%|█████▎ | 2290/4286 [17:19:55<14:48:25, 26.71s/it] {'loss': 0.0022, 'grad_norm': 1.4659703233648902, 'learning_rate': 4.6570228651423235e-07, 'completion_length': 324.8214569091797, 'rewards/only_full_func_accuracy_reward': 0.8690476417541504, 'rewards/format_reward': 1.0, 'reward': 1.86904776096344, 'reward_std': 0.028166968375444412, 'kl': 0.0557861328125, 'epoch': 0.53}
53%|█████▎ | 2291/4286 [17:20:19<14:26:33, 26.06s/it] {'loss': 0.0015, 'grad_norm': 2.9975837252567006, 'learning_rate': 4.6546896873541763e-07, 'completion_length': 257.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6398809552192688, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.005952381528913975, 'kl': 0.0374755859375, 'epoch': 0.53}
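(The learning_rate column is consistent with linear decay to zero over the 4286 scheduled steps from a peak of 1e-6: for example, 1e-6 * (4286 - 2220) / 4286 = 4.820345e-07, matching the value logged at step 2220. This is inferred from the logged values, not confirmed from the training config.)

def lr_at(step, total_steps=4286, peak_lr=1e-6):
    # Linear decay to zero, matching the logged schedule to the digits shown.
    return peak_lr * (total_steps - step) / total_steps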
53%|█████▎ | 2292/4286 [17:20:45<14:23:37, 25.99s/it] {'loss': 0.0021, 'grad_norm': 1.2848113432740162, 'learning_rate': 4.652356509566029e-07, 'completion_length': 332.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6666667461395264, 'rewards/format_reward': 1.0, 'reward': 1.6666667461395264, 'reward_std': 0.03318683058023453, 'kl': 0.0526123046875, 'epoch': 0.53}
53%|█████▎ | 2293/4286 [17:21:11<14:17:27, 25.81s/it] {'loss': 0.0043, 'grad_norm': 6.3720889965541465, 'learning_rate': 4.6500233317778813e-07, 'completion_length': 318.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.6815477013587952, 'rewards/format_reward': 1.0, 'reward': 1.68154776096344, 'reward_std': 0.0357142873108387, 'kl': 0.107666015625, 'epoch': 0.53}
54%|█████▎ | 2294/4286 [17:21:37<14:25:08, 26.06s/it] {'loss': 0.0015, 'grad_norm': 2.341230367616, 'learning_rate': 4.647690153989734e-07, 'completion_length': 327.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.023809518665075302, 'kl': 0.0377197265625, 'epoch': 0.54}
54%|█████▎ | 2295/4286 [17:22:02<14:14:02, 25.74s/it] {'loss': 0.0017, 'grad_norm': 0.7010319633995696, 'learning_rate': 4.645356976201587e-07, 'completion_length': 291.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7276786267757416, 'rewards/format_reward': 1.0, 'reward': 1.7276787161827087, 'reward_std': 0.008928571827709675, 'kl': 0.04296875, 'epoch': 0.54}
54%|█████▎ | 2296/4286 [17:22:28<14:08:42, 25.59s/it] {'loss': 0.0022, 'grad_norm': 1.6795488666560772, 'learning_rate': 4.643023798413439e-07, 'completion_length': 312.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7738095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.02380952052772045, 'kl': 0.055908203125, 'epoch': 0.54}
54%|█████▎ | 2297/4286 [17:22:52<13:53:10, 25.13s/it] {'loss': 0.0013, 'grad_norm': 15.717781769336106, 'learning_rate': 4.640690620625292e-07, 'completion_length': 282.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.013746432960033417, 'kl': 0.03265380859375, 'epoch': 0.54}
54%|█████▎ | 2298/4286 [17:23:17<13:58:03, 25.29s/it] {'loss': 0.0014, 'grad_norm': 1.5032100617495041, 'learning_rate': 4.638357442837144e-07, 'completion_length': 284.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715818405151, 'reward_std': 0.038476791232824326, 'kl': 0.0357666015625, 'epoch': 0.54}
54%|█████▎ | 2299/4286 [17:23:43<14:04:06, 25.49s/it] {'loss': 0.0077, 'grad_norm': 1.439161493842005, 'learning_rate': 4.6360242650489967e-07, 'completion_length': 323.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.020833336748182774, 'kl': 0.19287109375, 'epoch': 0.54}
54%|█████▎ | 2300/4286 [17:24:09<14:07:01, 25.59s/it] {'loss': 0.0073, 'grad_norm': 1.9024624031687127, 'learning_rate': 4.6336910872608495e-07, 'completion_length': 316.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.758928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.029761906247586012, 'kl': 0.1827392578125, 'epoch': 0.54}
54%|█████▎ | 2301/4286 [17:27:41<44:54:54, 81.46s/it] {'loss': 0.0053, 'grad_norm': 1.1971295834272258, 'learning_rate': 4.6313579094727017e-07, 'completion_length': 328.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.05892249755561352, 'kl': 0.13330078125, 'epoch': 0.54}
54%|█████▎ | 2302/4286 [17:28:05<35:25:01, 64.26s/it] {'loss': 0.0027, 'grad_norm': 3.1372273804609745, 'learning_rate': 4.6290247316845544e-07, 'completion_length': 321.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.13440870866179466, 'kl': 0.0684814453125, 'epoch': 0.54}
54%|█████▎ | 2303/4286 [17:28:30<28:53:38, 52.45s/it] {'loss': 0.002, 'grad_norm': 0.7263225552679462, 'learning_rate': 4.6266915538964067e-07, 'completion_length': 342.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562500596046448, 'reward_std': 0.034579768776893616, 'kl': 0.0499267578125, 'epoch': 0.54}
54%|█████▍ | 2304/4286 [17:28:55<24:25:16, 44.36s/it] {'loss': 0.0022, 'grad_norm': 4.87567305444728, 'learning_rate': 4.6243583761082594e-07, 'completion_length': 304.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.8601191639900208, 'rewards/format_reward': 1.0, 'reward': 1.8601192235946655, 'reward_std': 0.035961026325821877, 'kl': 0.053955078125, 'epoch': 0.54}
54%|█████▍ | 2305/4286 [17:29:19<21:00:09, 38.17s/it] {'loss': 0.0042, 'grad_norm': 0.36206339371595114, 'learning_rate': 4.622025198320112e-07, 'completion_length': 260.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.0068732211366295815, 'kl': 0.103759765625, 'epoch': 0.54}
54%|█████▍ | 2306/4286 [17:29:44<18:49:17, 34.22s/it] {'loss': 0.0098, 'grad_norm': 7.27998284194444, 'learning_rate': 4.6196920205319644e-07, 'completion_length': 293.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.5386905074119568, 'rewards/format_reward': 1.0, 'reward': 1.5386905670166016, 'reward_std': 0.0267283134162426, 'kl': 0.24365234375, 'epoch': 0.54}
54%|█████▍ | 2307/4286 [17:30:09<17:12:33, 31.31s/it] {'loss': 0.0046, 'grad_norm': 3.503829719757714, 'learning_rate': 4.617358842743817e-07, 'completion_length': 295.5893020629883, 'rewards/only_full_func_accuracy_reward': 0.6755953133106232, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.05867303907871246, 'kl': 0.11572265625, 'epoch': 0.54}
54%|█████▍ | 2308/4286 [17:30:34<16:08:34, 29.38s/it] {'loss': 0.0029, 'grad_norm': 0.22389665316375784, 'learning_rate': 4.6150256649556694e-07, 'completion_length': 317.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6309524774551392, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.0, 'kl': 0.07373046875, 'epoch': 0.54}
54%|█████▍ | 2309/4286 [17:30:59<15:29:54, 28.22s/it] {'loss': 0.0034, 'grad_norm': 3.7705212577327964, 'learning_rate': 4.612692487167522e-07, 'completion_length': 324.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.0029761905316263437, 'kl': 0.084228515625, 'epoch': 0.54}
54%|█████▍ | 2310/4286 [17:31:23<14:48:35, 26.98s/it] {'loss': 0.0103, 'grad_norm': 1.398597641793693, 'learning_rate': 4.610359309379375e-07, 'completion_length': 307.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.8125000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.0178571455180645, 'kl': 0.2572021484375, 'epoch': 0.54}
54%|█████▍ | 2311/4286 [17:31:49<14:32:28, 26.51s/it] {'loss': 0.0041, 'grad_norm': 3.2309170400772045, 'learning_rate': 4.608026131591227e-07, 'completion_length': 348.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.775297611951828, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 0.07121489569544792, 'kl': 0.1019287109375, 'epoch': 0.54}
54%|█████▍ | 2312/4286 [17:32:14<14:19:13, 26.12s/it] {'loss': 0.0025, 'grad_norm': 1.2557479013076318, 'learning_rate': 4.60569295380308e-07, 'completion_length': 301.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6681548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6681549549102783, 'reward_std': 0.04638410173356533, 'kl': 0.063232421875, 'epoch': 0.54}
54%|█████▍ | 2313/4286 [17:32:38<14:03:10, 25.64s/it] {'loss': 0.0031, 'grad_norm': 1.0856675169025995, 'learning_rate': 4.603359776014932e-07, 'completion_length': 282.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.724702388048172, 'rewards/format_reward': 1.0, 'reward': 1.724702537059784, 'reward_std': 0.008928571827709675, 'kl': 0.0777587890625, 'epoch': 0.54}
54%|█████▍ | 2314/4286 [17:33:04<14:00:48, 25.58s/it] {'loss': 0.0029, 'grad_norm': 5.3884129344247045, 'learning_rate': 4.601026598226785e-07, 'completion_length': 288.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6815478205680847, 'reward_std': 0.1071428582072258, 'kl': 0.072509765625, 'epoch': 0.54}
54%|█████▍ | 2315/4286 [17:33:29<13:55:51, 25.44s/it] {'loss': 0.0022, 'grad_norm': 1.98459894514783, 'learning_rate': 4.5986934204386376e-07, 'completion_length': 317.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.0535714328289032, 'kl': 0.0562744140625, 'epoch': 0.54}
54%|█████▍ | 2316/4286 [17:33:54<13:57:08, 25.50s/it] {'loss': 0.0021, 'grad_norm': 3.6000179480216015, 'learning_rate': 4.59636024265049e-07, 'completion_length': 302.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.8675595819950104, 'rewards/format_reward': 1.0, 'reward': 1.8675596714019775, 'reward_std': 0.019238398410379887, 'kl': 0.05255126953125, 'epoch': 0.54}
54%|█████▍ | 2317/4286 [17:34:20<14:01:16, 25.64s/it] {'loss': 0.0027, 'grad_norm': 1.7351430863801696, 'learning_rate': 4.5940270648623425e-07, 'completion_length': 333.0714569091797, 'rewards/only_full_func_accuracy_reward': 0.760416716337204, 'rewards/format_reward': 1.0, 'reward': 1.7604168057441711, 'reward_std': 0.05931013077497482, 'kl': 0.068359375, 'epoch': 0.54}
54%|█████▍ | 2318/4286 [17:34:45<13:54:49, 25.45s/it] {'loss': 0.0016, 'grad_norm': 17.930939994445563, 'learning_rate': 4.5916938870741953e-07, 'completion_length': 299.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.8247024714946747, 'rewards/format_reward': 1.0, 'reward': 1.8247024416923523, 'reward_std': 0.08633596450090408, 'kl': 0.0406494140625, 'epoch': 0.54}
54%|█████▍ | 2319/4286 [17:35:11<13:52:54, 25.41s/it] {'loss': 0.0028, 'grad_norm': 3.5115201818067177, 'learning_rate': 4.5893607092860475e-07, 'completion_length': 297.25, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187501192092896, 'reward_std': 0.06685744412243366, 'kl': 0.070556640625, 'epoch': 0.54}
54%|█████▍ | 2320/4286 [17:35:36<13:50:26, 25.34s/it] {'loss': 0.007, 'grad_norm': 1.3077647987514096, 'learning_rate': 4.5870275314979e-07, 'completion_length': 294.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.8309524655342102, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8130953311920166, 'reward_std': 0.11303605511784554, 'kl': 0.1749267578125, 'epoch': 0.54}
54%|█████▍ | 2321/4286 [17:36:02<13:55:21, 25.51s/it] {'loss': 0.004, 'grad_norm': 1.4485574973133908, 'learning_rate': 4.5846943537097525e-07, 'completion_length': 301.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.702381044626236, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.028166964650154114, 'kl': 0.10009765625, 'epoch': 0.54}
54%|█████▍ | 2322/4286 [17:36:26<13:43:25, 25.16s/it] {'loss': 0.0015, 'grad_norm': 1.1302874671855045, 'learning_rate': 4.582361175921605e-07, 'completion_length': 318.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7931548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7931548357009888, 'reward_std': 0.008928571827709675, 'kl': 0.0372314453125, 'epoch': 0.54}
54%|█████▍ | 2323/4286 [17:36:51<13:43:19, 25.17s/it] {'loss': 0.0035, 'grad_norm': 7.027501536194473, 'learning_rate': 4.580027998133458e-07, 'completion_length': 277.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.8095238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.030682741664350033, 'kl': 0.087890625, 'epoch': 0.54}
54%|█████▍ | 2324/4286 [17:37:17<13:43:38, 25.19s/it] {'loss': 0.0034, 'grad_norm': 1.2945371876254477, 'learning_rate': 4.57769482034531e-07, 'completion_length': 286.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.0208333320915699, 'kl': 0.0859375, 'epoch': 0.54}
54%|█████▍ | 2325/4286 [17:37:42<13:49:43, 25.39s/it] {'loss': 0.003, 'grad_norm': 3.0794254621516077, 'learning_rate': 4.575361642557163e-07, 'completion_length': 306.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7872024178504944, 'rewards/format_reward': 1.0, 'reward': 1.787202537059784, 'reward_std': 0.038690474815666676, 'kl': 0.0750732421875, 'epoch': 0.54}
54%|█████▍ | 2326/4286 [17:38:08<13:51:21, 25.45s/it] {'loss': 0.0019, 'grad_norm': 6.669012457923552, 'learning_rate': 4.573028464769015e-07, 'completion_length': 311.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.766369104385376, 'rewards/format_reward': 1.0, 'reward': 1.7663691639900208, 'reward_std': 0.0500884223729372, 'kl': 0.0482177734375, 'epoch': 0.54}
54%|█████▍ | 2327/4286 [17:38:33<13:49:37, 25.41s/it] {'loss': 0.0013, 'grad_norm': 0.6053187200335227, 'learning_rate': 4.570695286980868e-07, 'completion_length': 291.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.008928571827709675, 'kl': 0.03369140625, 'epoch': 0.54}
[2025-03-03 08:36:23,459] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
54%|█████▍ | 2328/4286 [17:39:01<14:06:38, 25.94s/it] {'loss': 0.0023, 'grad_norm': 0.812762721095722, 'learning_rate': 4.5683621091927207e-07, 'completion_length': 320.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.659226268529892, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.641369104385376, 'reward_std': 0.10105127841234207, 'kl': 0.0587158203125, 'epoch': 0.54}
[2025-03-03 08:36:51,329] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
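(The warning's first remedy, "adjusting settings to reduce memory consumption", refers to the DeepSpeed config. A sketch of ZeRO stage-3 knobs that commonly relieve allocator pressure follows; the values are illustrative starting points, not this run's actual config.)

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "stage3_max_live_parameters": 5e8,   # hold fewer gathered params live
        "stage3_max_reuse_distance": 5e8,    # release params sooner after use
        "stage3_prefetch_bucket_size": 2e8,  # smaller prefetch buckets
        "reduce_bucket_size": 2e8,           # smaller gradient-reduce buckets
    },
}
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)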
54%|█████▍ | 2329/4286 [17:39:28<14:25:02, 26.52s/it] {'loss': 0.0033, 'grad_norm': 4.939779231535531, 'learning_rate': 4.566028931404573e-07, 'completion_length': 340.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6413690745830536, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.0301282936707139, 'kl': 0.0814208984375, 'epoch': 0.54}
54%|█████▍ | 2330/4286 [17:39:54<14:14:52, 26.22s/it] {'loss': 0.0026, 'grad_norm': 7.025197489332266, 'learning_rate': 4.5636957536164257e-07, 'completion_length': 341.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.055142397060990334, 'kl': 0.065185546875, 'epoch': 0.54}
54%|█████▍ | 2331/4286 [17:40:19<14:03:53, 25.90s/it] {'loss': 0.0041, 'grad_norm': 87.90025899964314, 'learning_rate': 4.561362575828278e-07, 'completion_length': 306.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.7455357611179352, 'rewards/format_reward': 1.0, 'reward': 1.7455357909202576, 'reward_std': 0.07066834159195423, 'kl': 0.101806640625, 'epoch': 0.54}
54%|█████▍ | 2332/4286 [17:40:44<13:52:01, 25.55s/it] {'loss': 0.0018, 'grad_norm': 1.1201750469697518, 'learning_rate': 4.5590293980401306e-07, 'completion_length': 295.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.690476268529892, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.0714285671710968, 'kl': 0.0462646484375, 'epoch': 0.54}
54%|█████▍ | 2333/4286 [17:41:09<13:44:57, 25.34s/it] {'loss': 0.004, 'grad_norm': 5.597180941327242, 'learning_rate': 4.5566962202519834e-07, 'completion_length': 340.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.8061756193637848, 'rewards/format_reward': 1.0, 'reward': 1.8061757683753967, 'reward_std': 0.042410717345774174, 'kl': 0.10107421875, 'epoch': 0.54}
54%|█████▍ | 2334/4286 [17:41:33<13:38:14, 25.15s/it] {'loss': 0.0063, 'grad_norm': 1.968283432510679, 'learning_rate': 4.5543630424638356e-07, 'completion_length': 311.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.8038420081138611, 'rewards/format_reward': 1.0, 'reward': 1.8038421273231506, 'reward_std': 0.10022198595106602, 'kl': 0.157958984375, 'epoch': 0.54}
54%|█████▍ | 2335/4286 [17:41:58<13:36:31, 25.11s/it] {'loss': 0.0019, 'grad_norm': 2.147543015782224, 'learning_rate': 4.5520298646756884e-07, 'completion_length': 316.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.723214328289032, 'reward_std': 0.0416666679084301, 'kl': 0.04833984375, 'epoch': 0.54}
[2025-03-03 08:39:48,060] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
55%|█████▍ | 2336/4286 [17:42:25<13:52:02, 25.60s/it] {'loss': 0.013, 'grad_norm': 7.425708081010652, 'learning_rate': 4.5496966868875406e-07, 'completion_length': 335.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.042771173641085625, 'kl': 0.32568359375, 'epoch': 0.55}
55%|█████▍ | 2337/4286 [17:42:50<13:42:52, 25.33s/it] {'loss': 0.0047, 'grad_norm': 0.8812448265264297, 'learning_rate': 4.5473635090993933e-07, 'completion_length': 328.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8214287161827087, 'reward_std': 0.011904759332537651, 'kl': 0.1177978515625, 'epoch': 0.55}
55%|█████▍ | 2338/4286 [17:43:14<13:33:02, 25.04s/it] {'loss': 0.0039, 'grad_norm': 1.8559164603631688, 'learning_rate': 4.545030331311246e-07, 'completion_length': 311.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.01785714365541935, 'kl': 0.097412109375, 'epoch': 0.55}
55%|█████▍ | 2339/4286 [17:43:39<13:31:53, 25.02s/it] {'loss': 0.0037, 'grad_norm': 5.152591790183934, 'learning_rate': 4.5426971535230983e-07, 'completion_length': 273.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7113096117973328, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.061365481466054916, 'kl': 0.0931396484375, 'epoch': 0.55}
55%|█████▍ | 2340/4286 [17:44:04<13:28:16, 24.92s/it] {'loss': 0.0047, 'grad_norm': 3.6494384553145265, 'learning_rate': 4.540363975734951e-07, 'completion_length': 283.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.06250000186264515, 'kl': 0.11669921875, 'epoch': 0.55}
55%|█████▍ | 2341/4286 [17:44:30<13:39:00, 25.27s/it] {'loss': 0.0023, 'grad_norm': 2.253982974562028, 'learning_rate': 4.5380307979468033e-07, 'completion_length': 307.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6681548357009888, 'rewards/format_reward': 1.0, 'reward': 1.6681548357009888, 'reward_std': 0.060905066318809986, 'kl': 0.056884765625, 'epoch': 0.55}
55%|█████▍ | 2342/4286 [17:44:55<13:34:11, 25.13s/it] {'loss': 0.0075, 'grad_norm': 5.951226057171253, 'learning_rate': 4.535697620158656e-07, 'completion_length': 309.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.68452388048172, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.07653018832206726, 'kl': 0.1866455078125, 'epoch': 0.55}
55%|█████▍ | 2343/4286 [17:45:20<13:37:24, 25.24s/it] {'loss': 0.0326, 'grad_norm': 2.6892893312691317, 'learning_rate': 4.533364442370509e-07, 'completion_length': 328.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6443453431129456, 'reward_std': 0.07118699513375759, 'kl': 0.815185546875, 'epoch': 0.55}
55%|█████▍ | 2344/4286 [17:45:44<13:26:21, 24.91s/it] {'loss': 0.004, 'grad_norm': 1.2462702848814677, 'learning_rate': 4.531031264582361e-07, 'completion_length': 293.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7217262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7217262387275696, 'reward_std': 0.00297618773765862, 'kl': 0.10107421875, 'epoch': 0.55}
55%|█████▍ | 2345/4286 [17:46:10<13:29:06, 25.01s/it] {'loss': 0.0038, 'grad_norm': 3.8481431501624885, 'learning_rate': 4.528698086794214e-07, 'completion_length': 312.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7892857193946838, 'rewards/format_reward': 1.0, 'reward': 1.7892858386039734, 'reward_std': 0.016291260719299316, 'kl': 0.0943603515625, 'epoch': 0.55}
55%|█████▍ | 2346/4286 [17:46:35<13:34:04, 25.18s/it] {'loss': 0.0031, 'grad_norm': 0.334564409661523, 'learning_rate': 4.5263649090060665e-07, 'completion_length': 323.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.5431548058986664, 'rewards/format_reward': 1.0, 'reward': 1.5431548357009888, 'reward_std': 0.008928571827709675, 'kl': 0.07666015625, 'epoch': 0.55}
55%|█████▍ | 2347/4286 [17:47:01<13:41:41, 25.43s/it] {'loss': 0.0177, 'grad_norm': 5.67174565690248, 'learning_rate': 4.5240317312179187e-07, 'completion_length': 330.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6586061716079712, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6407491564750671, 'reward_std': 0.16656696796417236, 'kl': 0.44140625, 'epoch': 0.55}
55%|█████▍ | 2348/4286 [17:47:28<13:49:51, 25.69s/it] {'loss': 0.0096, 'grad_norm': 144.35889584926574, 'learning_rate': 4.5216985534297715e-07, 'completion_length': 314.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127978205680847, 'reward_std': 0.0565476231276989, 'kl': 0.240234375, 'epoch': 0.55}
55%|█████▍ | 2349/4286 [17:47:54<14:00:49, 26.05s/it] {'loss': 0.0225, 'grad_norm': 9.237286309896637, 'learning_rate': 4.5193653756416237e-07, 'completion_length': 309.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7360119521617889, 'rewards/format_reward': 1.0, 'reward': 1.7360119819641113, 'reward_std': 0.07697555795311928, 'kl': 0.5634765625, 'epoch': 0.55}
55%|█████▍ | 2350/4286 [17:48:19<13:47:49, 25.66s/it] {'loss': 0.0056, 'grad_norm': 6.8768245354034185, 'learning_rate': 4.5170321978534765e-07, 'completion_length': 340.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.7827382683753967, 'reward_std': 0.10990538075566292, 'kl': 0.139404296875, 'epoch': 0.55}
55%|█████▍ | 2351/4286 [17:48:44<13:38:46, 25.39s/it] {'loss': 0.003, 'grad_norm': 0.13971272756669187, 'learning_rate': 4.514699020065329e-07, 'completion_length': 313.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.0, 'kl': 0.0750732421875, 'epoch': 0.55}
55%|█████▍ | 2352/4286 [17:49:08<13:28:40, 25.09s/it] {'loss': 0.0123, 'grad_norm': 7.367637648931763, 'learning_rate': 4.5123658422771814e-07, 'completion_length': 303.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.05792887508869171, 'kl': 0.30810546875, 'epoch': 0.55}
55%|█████▍ | 2353/4286 [17:49:32<13:17:51, 24.77s/it] {'loss': 0.0106, 'grad_norm': 6.339062464762586, 'learning_rate': 4.510032664489034e-07, 'completion_length': 313.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7675595879554749, 'rewards/format_reward': 1.0, 'reward': 1.7675596475601196, 'reward_std': 0.024404765106737614, 'kl': 0.26513671875, 'epoch': 0.55}
55%|█████▍ | 2354/4286 [17:49:57<13:16:06, 24.72s/it] {'loss': 0.0053, 'grad_norm': 8.582526984100747, 'learning_rate': 4.5076994867008864e-07, 'completion_length': 287.75, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.046645132126286626, 'kl': 0.132568359375, 'epoch': 0.55}
55%|█████▍ | 2355/4286 [17:50:20<12:58:29, 24.19s/it] {'loss': 0.0035, 'grad_norm': 0.40025224009993715, 'learning_rate': 4.505366308912739e-07, 'completion_length': 294.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.011904759332537651, 'kl': 0.086669921875, 'epoch': 0.55}
55%|█████▍ | 2356/4286 [17:50:46<13:13:27, 24.67s/it] {'loss': 0.0024, 'grad_norm': 0.5448351941232119, 'learning_rate': 4.503033131124592e-07, 'completion_length': 294.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005954027175903, 'reward_std': 0.01785714365541935, 'kl': 0.05908203125, 'epoch': 0.55}
55%|█████▍ | 2357/4286 [17:51:11<13:23:40, 25.00s/it] {'loss': 0.0138, 'grad_norm': 3.004485547418689, 'learning_rate': 4.500699953336444e-07, 'completion_length': 325.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6622024476528168, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.027706551365554333, 'kl': 0.34521484375, 'epoch': 0.55}
55%|█████▌ | 2358/4286 [17:51:36<13:18:20, 24.84s/it] {'loss': 0.0162, 'grad_norm': 24.797815445892592, 'learning_rate': 4.498366775548297e-07, 'completion_length': 317.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.05222323536872864, 'kl': 0.4052734375, 'epoch': 0.55}
55%|█████▌ | 2359/4286 [17:52:01<13:22:15, 24.98s/it] {'loss': 0.0024, 'grad_norm': 2.265424511921134, 'learning_rate': 4.496033597760149e-07, 'completion_length': 312.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.8928572237491608, 'rewards/format_reward': 1.0, 'reward': 1.892857313156128, 'reward_std': 0.013746432960033417, 'kl': 0.060546875, 'epoch': 0.55}
55%|█████▌ | 2360/4286 [17:52:25<13:14:32, 24.75s/it] {'loss': 0.0025, 'grad_norm': 8.92540196649504, 'learning_rate': 4.493700419972002e-07, 'completion_length': 297.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.8214287161827087, 'reward_std': 0.07578601315617561, 'kl': 0.063720703125, 'epoch': 0.55}
[2025-03-03 08:50:13,524] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
55%|█████▌ | 2361/4286 [17:52:51<13:18:12, 24.88s/it] {'loss': 0.0039, 'grad_norm': 14.372097327765424, 'learning_rate': 4.4913672421838546e-07, 'completion_length': 307.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.04166666232049465, 'kl': 0.0980224609375, 'epoch': 0.55}
55%|█████▌ | 2362/4286 [17:53:14<13:04:04, 24.45s/it] {'loss': 0.0093, 'grad_norm': 1.125044028277394, 'learning_rate': 4.489034064395707e-07, 'completion_length': 288.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8127126097679138, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.794855535030365, 'reward_std': 0.07879745680838823, 'kl': 0.232421875, 'epoch': 0.55}
55%|█████▌ | 2363/4286 [17:53:37<12:49:50, 24.02s/it] {'loss': 0.0039, 'grad_norm': 18.04478688431349, 'learning_rate': 4.4867008866075596e-07, 'completion_length': 289.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.0357142835855484, 'kl': 0.096435546875, 'epoch': 0.55}
55%|█████▌ | 2364/4286 [17:54:01<12:44:25, 23.86s/it] {'loss': 0.0183, 'grad_norm': 3.9236394899816243, 'learning_rate': 4.484367708819412e-07, 'completion_length': 303.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7529763579368591, 'reward_std': 0.06836476922035217, 'kl': 0.458984375, 'epoch': 0.55}
55%|█████▌ | 2365/4286 [17:54:26<12:56:09, 24.24s/it] {'loss': 0.0286, 'grad_norm': 12.879561041212835, 'learning_rate': 4.4820345310312646e-07, 'completion_length': 308.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5654762387275696, 'rewards/format_reward': 1.0, 'reward': 1.5654762983322144, 'reward_std': 0.09858068078756332, 'kl': 0.71484375, 'epoch': 0.55}
[2025-03-03 08:52:13,579] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
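Note: the stage3.py warning above recurs throughout this run whenever memory pressure forces PyTorch allocator cache flushes, and its own suggested remedy is a synchronized get_accelerator().empty_cache() call in the training loop. A minimal sketch of that remedy follows, assuming a standard DeepSpeed engine loop; the function name, the flush_every interval, and the engine/loader arguments are illustrative stand-ins, not code from this run:

    from deepspeed.accelerator import get_accelerator

    def train_with_periodic_flush(engine, loader, flush_every=50):
        # engine: a deepspeed.DeepSpeedEngine; loader: yields ready batches.
        for step, batch in enumerate(loader):
            loss = engine(batch)    # forward; assumes the model returns a scalar loss
            engine.backward(loss)   # backward (ZeRO stage 3 in this run)
            engine.step()           # optimizer step
            if step % flush_every == 0:
                # Flush at the same step on every rank, so ranks do not
                # flush at random points and stall each other's collectives.
                get_accelerator().empty_cache()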
55%|█████▌ | 2366/4286 [17:54:51<13:02:40, 24.46s/it] {'loss': 0.0045, 'grad_norm': 3.7494234143830343, 'learning_rate': 4.4797013532431173e-07, 'completion_length': 281.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.068452388048172, 'kl': 0.11328125, 'epoch': 0.55}
[2025-03-03 08:52:38,997] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
55%|█████▌ | 2367/4286 [17:55:16<13:11:28, 24.75s/it] {'loss': 0.0033, 'grad_norm': 0.9409107092013036, 'learning_rate': 4.4773681754549695e-07, 'completion_length': 307.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.9345238208770752, 'rewards/format_reward': 1.0, 'reward': 1.93452388048172, 'reward_std': 0.011904764920473099, 'kl': 0.0819091796875, 'epoch': 0.55}
55%|█████▌ | 2368/4286 [17:55:40<13:03:48, 24.52s/it] {'loss': 0.0094, 'grad_norm': 6.265613124925348, 'learning_rate': 4.4750349976668223e-07, 'completion_length': 298.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035714626312256, 'reward_std': 0.059523806907236576, 'kl': 0.23486328125, 'epoch': 0.55}
55%|█████▌ | 2369/4286 [17:56:05<13:02:41, 24.50s/it] {'loss': 0.0026, 'grad_norm': 1.7755429752193677, 'learning_rate': 4.472701819878675e-07, 'completion_length': 298.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.0357142873108387, 'kl': 0.064208984375, 'epoch': 0.55}
55%|█████▌ | 2370/4286 [17:56:28<12:48:41, 24.07s/it] {'loss': 0.0156, 'grad_norm': 1.5052867496977556, 'learning_rate': 4.470368642090527e-07, 'completion_length': 259.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.8630952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8630953431129456, 'reward_std': 0.020619653165340424, 'kl': 0.388671875, 'epoch': 0.55}
55%|█████▌ | 2371/4286 [17:56:51<12:40:34, 23.83s/it] {'loss': 0.0096, 'grad_norm': 11.019346705543839, 'learning_rate': 4.46803546430238e-07, 'completion_length': 265.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.04064540006220341, 'kl': 0.23974609375, 'epoch': 0.55}
55%|█████▌ | 2372/4286 [17:57:15<12:44:58, 23.98s/it] {'loss': 0.0092, 'grad_norm': 7.371873272427389, 'learning_rate': 4.465702286514232e-07, 'completion_length': 270.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 1.0, 'reward': 1.6235119700431824, 'reward_std': 0.07439307495951653, 'kl': 0.23095703125, 'epoch': 0.55}
55%|█████▌ | 2373/4286 [17:57:40<12:48:13, 24.09s/it] {'loss': 0.0053, 'grad_norm': 1.321815486700535, 'learning_rate': 4.463369108726085e-07, 'completion_length': 340.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.712797611951828, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.0625, 'kl': 0.133056640625, 'epoch': 0.55}
55%|█████▌ | 2374/4286 [17:58:04<12:50:10, 24.17s/it] {'loss': 0.004, 'grad_norm': 3.9653130235745824, 'learning_rate': 4.4610359309379377e-07, 'completion_length': 318.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.8333333730697632, 'rewards/format_reward': 1.0, 'reward': 1.8333334922790527, 'reward_std': 0.0357142873108387, 'kl': 0.099609375, 'epoch': 0.55}
55%|█████▌ | 2375/4286 [17:58:28<12:51:35, 24.23s/it] {'loss': 0.0031, 'grad_norm': 13.6355158077803, 'learning_rate': 4.45870275314979e-07, 'completion_length': 297.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.01626221090555191, 'kl': 0.0765380859375, 'epoch': 0.55}
55%|█████▌ | 2376/4286 [17:58:53<12:59:55, 24.50s/it] {'loss': 0.0021, 'grad_norm': 6.783559640920275, 'learning_rate': 4.4563695753616427e-07, 'completion_length': 323.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.8026244640350342, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7847674489021301, 'reward_std': 0.14069904386997223, 'kl': 0.0537109375, 'epoch': 0.55}
55%|█████▌ | 2377/4286 [17:59:19<13:07:20, 24.75s/it] {'loss': 0.0065, 'grad_norm': 3.3693881767855225, 'learning_rate': 4.454036397573495e-07, 'completion_length': 325.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.8630953133106232, 'rewards/format_reward': 1.0, 'reward': 1.8630953431129456, 'reward_std': 0.025651192292571068, 'kl': 0.1630859375, 'epoch': 0.55}
55%|█████▌ | 2378/4286 [17:59:42<12:55:31, 24.39s/it] {'loss': 0.0078, 'grad_norm': 7.21584041749608, 'learning_rate': 4.4517032197853477e-07, 'completion_length': 252.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7276786267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7098215818405151, 'reward_std': 0.08588216453790665, 'kl': 0.19580078125, 'epoch': 0.55}
56%|█████▌ | 2379/4286 [18:00:07<12:58:42, 24.50s/it] {'loss': 0.0069, 'grad_norm': 1.19789495636514, 'learning_rate': 4.4493700419972004e-07, 'completion_length': 309.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7470239102840424, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.024056263267993927, 'kl': 0.171630859375, 'epoch': 0.56}
56%|█████▌ | 2380/4286 [18:00:32<12:59:45, 24.55s/it] {'loss': 0.0135, 'grad_norm': 1.615833020512239, 'learning_rate': 4.4470368642090527e-07, 'completion_length': 305.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7455357611179352, 'rewards/format_reward': 1.0, 'reward': 1.7455359101295471, 'reward_std': 0.05818403139710426, 'kl': 0.3365478515625, 'epoch': 0.56}
56%|█████▌ | 2381/4286 [18:00:56<12:59:54, 24.56s/it] {'loss': 0.0022, 'grad_norm': 0.44740272089988414, 'learning_rate': 4.4447036864209054e-07, 'completion_length': 295.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 1.0, 'reward': 1.71577388048172, 'reward_std': 0.019238398410379887, 'kl': 0.0538330078125, 'epoch': 0.56}
56%|█████▌ | 2382/4286 [18:01:21<13:05:13, 24.74s/it] {'loss': 0.0033, 'grad_norm': 26.272611549801706, 'learning_rate': 4.4423705086327576e-07, 'completion_length': 294.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7490699887275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7312129139900208, 'reward_std': 0.08703512419015169, 'kl': 0.082275390625, 'epoch': 0.56}
56%|█████▌ | 2383/4286 [18:01:45<12:58:01, 24.53s/it] {'loss': 0.0028, 'grad_norm': 1.5772876332366552, 'learning_rate': 4.4400373308446104e-07, 'completion_length': 285.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.011904759332537651, 'kl': 0.0692138671875, 'epoch': 0.56}
56%|█████▌ | 2384/4286 [18:02:10<12:55:26, 24.46s/it] {'loss': 0.0254, 'grad_norm': 5.9308001315096845, 'learning_rate': 4.437704153056463e-07, 'completion_length': 322.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6190477013587952, 'reward_std': 0.06388125522062182, 'kl': 0.63671875, 'epoch': 0.56}
[2025-03-03 08:59:59,093] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
56%|█████▌ | 2385/4286 [18:02:36<13:13:23, 25.04s/it] {'loss': 0.0094, 'grad_norm': 2.395232593963216, 'learning_rate': 4.4353709752683153e-07, 'completion_length': 281.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6700487434864044, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6521916389465332, 'reward_std': 0.0991665031760931, 'kl': 0.2359619140625, 'epoch': 0.56}
56%|█████▌ | 2386/4286 [18:02:59<12:55:50, 24.50s/it] {'loss': 0.0036, 'grad_norm': 0.28272120716888405, 'learning_rate': 4.433037797480168e-07, 'completion_length': 276.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.0, 'kl': 0.08935546875, 'epoch': 0.56}
56%|█████▌ | 2387/4286 [18:03:23<12:45:48, 24.20s/it] {'loss': 0.0037, 'grad_norm': 6.911288698610724, 'learning_rate': 4.4307046196920203e-07, 'completion_length': 318.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 1.0, 'reward': 1.7574405670166016, 'reward_std': 0.019238398410379887, 'kl': 0.09326171875, 'epoch': 0.56}
56%|█████▌ | 2388/4286 [18:03:46<12:38:15, 23.97s/it] {'loss': 0.0018, 'grad_norm': 3.101996082297431, 'learning_rate': 4.428371441903873e-07, 'completion_length': 302.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7916667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7916668057441711, 'reward_std': 0.04735896736383438, 'kl': 0.04534912109375, 'epoch': 0.56}
56%|█████▌ | 2389/4286 [18:04:10<12:38:34, 23.99s/it] {'loss': 0.0032, 'grad_norm': 3.4927056084303065, 'learning_rate': 4.426038264115726e-07, 'completion_length': 279.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.05952380783855915, 'kl': 0.0804443359375, 'epoch': 0.56}
56%|█████▌ | 2390/4286 [18:04:34<12:35:37, 23.91s/it] {'loss': 0.0087, 'grad_norm': 1.9131513151086736, 'learning_rate': 4.423705086327578e-07, 'completion_length': 300.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.11447649449110031, 'kl': 0.2177734375, 'epoch': 0.56}
56%|█████▌ | 2391/4286 [18:04:58<12:35:22, 23.92s/it] {'loss': 0.0122, 'grad_norm': 14.983751078242388, 'learning_rate': 4.421371908539431e-07, 'completion_length': 250.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.70634925365448, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6706349849700928, 'reward_std': 0.14701906964182854, 'kl': 0.3056640625, 'epoch': 0.56}
56%|█████▌ | 2392/4286 [18:05:24<12:50:22, 24.40s/it] {'loss': 0.0018, 'grad_norm': 5.900130207893677, 'learning_rate': 4.4190387307512836e-07, 'completion_length': 316.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7419643104076385, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7241072058677673, 'reward_std': 0.11726191267371178, 'kl': 0.04437255859375, 'epoch': 0.56}
56%|█████▌ | 2393/4286 [18:05:47<12:42:07, 24.16s/it] {'loss': 0.006, 'grad_norm': 3.507832052814977, 'learning_rate': 4.416705552963136e-07, 'completion_length': 276.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.636904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.03139831963926554, 'kl': 0.149169921875, 'epoch': 0.56}
56%|█████▌ | 2394/4286 [18:06:11<12:35:35, 23.96s/it] {'loss': 0.0275, 'grad_norm': 20.112499514493685, 'learning_rate': 4.4143723751749885e-07, 'completion_length': 274.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7485119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.10441341251134872, 'kl': 0.68798828125, 'epoch': 0.56}
56%|█████▌ | 2395/4286 [18:06:35<12:42:14, 24.19s/it] {'loss': 0.0113, 'grad_norm': 3.1228430607049473, 'learning_rate': 4.412039197386841e-07, 'completion_length': 322.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7053572237491608, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.05649021826684475, 'kl': 0.28125, 'epoch': 0.56}
56%|█████▌ | 2396/4286 [18:07:00<12:43:08, 24.23s/it] {'loss': 0.0042, 'grad_norm': 1.2570288140272916, 'learning_rate': 4.4097060195986935e-07, 'completion_length': 305.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.04123930633068085, 'kl': 0.1064453125, 'epoch': 0.56}
56%|█████▌ | 2397/4286 [18:07:24<12:40:49, 24.17s/it] {'loss': 0.0056, 'grad_norm': 2.4469008794669658, 'learning_rate': 4.407372841810546e-07, 'completion_length': 305.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6651785969734192, 'rewards/format_reward': 1.0, 'reward': 1.665178656578064, 'reward_std': 0.11362177692353725, 'kl': 0.139892578125, 'epoch': 0.56}
56%|█████▌ | 2398/4286 [18:07:49<12:53:06, 24.57s/it] {'loss': 0.0027, 'grad_norm': 21.39638847983665, 'learning_rate': 4.4050396640223985e-07, 'completion_length': 288.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7083334624767303, 'rewards/format_reward': 1.0, 'reward': 1.7083334922790527, 'reward_std': 0.030682736076414585, 'kl': 0.06689453125, 'epoch': 0.56}
56%|█████▌ | 2399/4286 [18:08:14<12:50:22, 24.50s/it] {'loss': 0.0058, 'grad_norm': 16.312069845095518, 'learning_rate': 4.402706486234251e-07, 'completion_length': 318.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.03000863641500473, 'kl': 0.145751953125, 'epoch': 0.56}
56%|█████▌ | 2400/4286 [18:08:38<12:46:15, 24.38s/it] {'loss': 0.0074, 'grad_norm': 8.346859929472881, 'learning_rate': 4.4003733084461034e-07, 'completion_length': 282.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 1.0, 'reward': 1.8511906862258911, 'reward_std': 0.023809521459043026, 'kl': 0.183837890625, 'epoch': 0.56}
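Note: the learning_rate column is consistent with a linear decay to zero from a peak of 1e-6 over all 4286 steps; the peak value and the schedule shape are inferred from the logged numbers here, not read from the run's config. A quick check in Python against two logged steps:

    TOTAL_STEPS = 4286
    PEAK_LR = 1e-6  # assumed peak, inferred from the logged values

    def linear_lr(step: int) -> float:
        # lr(t) = peak * (remaining steps / total steps)
        return PEAK_LR * (TOTAL_STEPS - step) / TOTAL_STEPS

    assert abs(linear_lr(2344) - 4.531031264582361e-07) < 1e-15   # step 2344 above
    assert abs(linear_lr(2400) - 4.4003733084461034e-07) < 1e-15  # step 2400 above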
[2025-03-03 09:09:28,996] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
56%|█████▌ | 2401/4286 [18:12:06<41:40:25, 79.59s/it] {'loss': 0.0058, 'grad_norm': 20.028034506102703, 'learning_rate': 4.398040130657956e-07, 'completion_length': 250.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7767857909202576, 'reward_std': 0.08045602217316628, 'kl': 0.1441650390625, 'epoch': 0.56}
[2025-03-03 09:09:54,110] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
56%|█████▌ | 2402/4286 [18:12:31<33:05:56, 63.25s/it] {'loss': 0.0037, 'grad_norm': 5.578870976894055, 'learning_rate': 4.395706952869809e-07, 'completion_length': 286.5, 'rewards/only_full_func_accuracy_reward': 0.7455357313156128, 'rewards/format_reward': 1.0, 'reward': 1.7455358505249023, 'reward_std': 0.026785715483129025, 'kl': 0.091552734375, 'epoch': 0.56}
56%|█████▌ | 2403/4286 [18:12:56<27:07:08, 51.85s/it] {'loss': 0.0038, 'grad_norm': 1.8370158360560638, 'learning_rate': 4.393373775081661e-07, 'completion_length': 327.6964569091797, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351192235946655, 'reward_std': 0.02976190485060215, 'kl': 0.09521484375, 'epoch': 0.56}
56%|█████▌ | 2404/4286 [18:13:23<23:05:18, 44.17s/it] {'loss': 0.0044, 'grad_norm': 19.890734119337118, 'learning_rate': 4.391040597293514e-07, 'completion_length': 322.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8214287161827087, 'reward_std': 0.03755595721304417, 'kl': 0.10986328125, 'epoch': 0.56}
56%|█████▌ | 2405/4286 [18:13:48<20:07:02, 38.50s/it] {'loss': 0.005, 'grad_norm': 3.685496554457308, 'learning_rate': 4.388707419505366e-07, 'completion_length': 299.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6264881193637848, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.05243691150099039, 'kl': 0.125732421875, 'epoch': 0.56}
56%|█████▌ | 2406/4286 [18:14:12<17:51:48, 34.21s/it] {'loss': 0.0121, 'grad_norm': 2.919616769541071, 'learning_rate': 4.386374241717219e-07, 'completion_length': 312.0, 'rewards/only_full_func_accuracy_reward': 0.6577381491661072, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.012825604993849993, 'kl': 0.302978515625, 'epoch': 0.56}
56%|█████▌ | 2407/4286 [18:14:38<16:33:58, 31.74s/it] {'loss': 0.0185, 'grad_norm': 15.480367916099759, 'learning_rate': 4.3840410639290716e-07, 'completion_length': 294.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6633929014205933, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6276786923408508, 'reward_std': 0.1460820473730564, 'kl': 0.462890625, 'epoch': 0.56}
56%|█████▌ | 2408/4286 [18:15:04<15:35:52, 29.90s/it] {'loss': 0.0067, 'grad_norm': 3.173562551762739, 'learning_rate': 4.381707886140924e-07, 'completion_length': 287.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6636905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6636906266212463, 'reward_std': 0.03596102260053158, 'kl': 0.16845703125, 'epoch': 0.56}
56%|█████▌ | 2409/4286 [18:15:28<14:44:22, 28.27s/it] {'loss': 0.0076, 'grad_norm': 3.8178370594909135, 'learning_rate': 4.3793747083527766e-07, 'completion_length': 331.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.035714288242161274, 'kl': 0.189453125, 'epoch': 0.56}
56%|█████▌ | 2410/4286 [18:15:54<14:18:20, 27.45s/it] {'loss': 0.0097, 'grad_norm': 9.682244832271886, 'learning_rate': 4.377041530564629e-07, 'completion_length': 280.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6925595700740814, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6747024059295654, 'reward_std': 0.10297618806362152, 'kl': 0.242919921875, 'epoch': 0.56}
56%|█████▋ | 2411/4286 [18:16:18<13:43:32, 26.35s/it] {'loss': 0.005, 'grad_norm': 2.285229131887075, 'learning_rate': 4.374708352776481e-07, 'completion_length': 287.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.054985739290714264, 'kl': 0.12548828125, 'epoch': 0.56}
56%|█████▋ | 2412/4286 [18:16:42<13:29:39, 25.92s/it] {'loss': 0.0091, 'grad_norm': 83.23025310909286, 'learning_rate': 4.372375174988334e-07, 'completion_length': 305.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.05281119421124458, 'kl': 0.228515625, 'epoch': 0.56}
56%|█████▋ | 2413/4286 [18:17:06<13:09:29, 25.29s/it] {'loss': 0.0069, 'grad_norm': 36.34136595715844, 'learning_rate': 4.370041997200186e-07, 'completion_length': 269.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.6559523940086365, 'rewards/format_reward': 1.0, 'reward': 1.6559525728225708, 'reward_std': 0.04959554225206375, 'kl': 0.173828125, 'epoch': 0.56}
56%|█████▋ | 2414/4286 [18:17:31<13:01:26, 25.05s/it] {'loss': 0.0189, 'grad_norm': 7.178231738363875, 'learning_rate': 4.367708819412039e-07, 'completion_length': 287.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6830357909202576, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.11631817370653152, 'kl': 0.470703125, 'epoch': 0.56}
56%|█████▋ | 2415/4286 [18:17:55<12:51:11, 24.73s/it] {'loss': 0.0086, 'grad_norm': 7.66300385891697, 'learning_rate': 4.365375641623891e-07, 'completion_length': 294.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7529762983322144, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7351191639900208, 'reward_std': 0.06547619588673115, 'kl': 0.2138671875, 'epoch': 0.56}
56%|█████▋ | 2416/4286 [18:18:19<12:47:56, 24.64s/it] {'loss': 0.0082, 'grad_norm': 5.63953650870007, 'learning_rate': 4.363042463835744e-07, 'completion_length': 293.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 1.0, 'reward': 1.7708334922790527, 'reward_std': 0.0710012074559927, 'kl': 0.205078125, 'epoch': 0.56}
56%|█████▋ | 2417/4286 [18:18:45<13:03:11, 25.14s/it] {'loss': 0.0105, 'grad_norm': 15.436668195740834, 'learning_rate': 4.3607092860475965e-07, 'completion_length': 339.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6592262387275696, 'reward_std': 0.12334034219384193, 'kl': 0.26318359375, 'epoch': 0.56}
56%|█████▋ | 2418/4286 [18:19:10<12:57:32, 24.97s/it] {'loss': 0.0091, 'grad_norm': 1.187232946024065, 'learning_rate': 4.358376108259449e-07, 'completion_length': 322.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.0, 'kl': 0.2275390625, 'epoch': 0.56}
56%|█████▋ | 2419/4286 [18:19:33<12:39:02, 24.39s/it] {'loss': 0.0118, 'grad_norm': 8.612784216658806, 'learning_rate': 4.3560429304713015e-07, 'completion_length': 256.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.7886905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7886905670166016, 'reward_std': 0.0416666716337204, 'kl': 0.2957763671875, 'epoch': 0.56}
56%|█████▋ | 2420/4286 [18:19:58<12:43:39, 24.55s/it] {'loss': 0.0109, 'grad_norm': 3.575563458111469, 'learning_rate': 4.353709752683154e-07, 'completion_length': 298.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.815476268529892, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.04761904664337635, 'kl': 0.27392578125, 'epoch': 0.56}
56%|█████▋ | 2421/4286 [18:20:23<12:43:12, 24.55s/it] {'loss': 0.0082, 'grad_norm': 8.223216827105286, 'learning_rate': 4.3513765748950065e-07, 'completion_length': 295.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.816964328289032, 'rewards/format_reward': 1.0, 'reward': 1.8169644474983215, 'reward_std': 0.04053214658051729, 'kl': 0.20458984375, 'epoch': 0.56}
57%|█████▋ | 2422/4286 [18:20:45<12:26:31, 24.03s/it] {'loss': 0.0041, 'grad_norm': 0.6504517249126951, 'learning_rate': 4.349043397106859e-07, 'completion_length': 251.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.0, 'kl': 0.101318359375, 'epoch': 0.57}
57%|█████▋ | 2423/4286 [18:21:10<12:35:25, 24.33s/it] {'loss': 0.0071, 'grad_norm': 20.765346636734055, 'learning_rate': 4.3467102193187114e-07, 'completion_length': 315.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6994048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6994048953056335, 'reward_std': 0.02489012386649847, 'kl': 0.1790771484375, 'epoch': 0.57}
57%|█████▋ | 2424/4286 [18:21:35<12:36:04, 24.36s/it] {'loss': 0.0092, 'grad_norm': 18.34426574345975, 'learning_rate': 4.344377041530564e-07, 'completion_length': 321.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7127976417541504, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.09908205270767212, 'kl': 0.229248046875, 'epoch': 0.57}
57%|█████▋ | 2425/4286 [18:21:58<12:23:36, 23.97s/it] {'loss': 0.016, 'grad_norm': 2.558892347271383, 'learning_rate': 4.342043863742417e-07, 'completion_length': 265.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7782739102840424, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7425596714019775, 'reward_std': 0.10373931378126144, 'kl': 0.3983154296875, 'epoch': 0.57}
57%|█████▋ | 2426/4286 [18:22:22<12:27:15, 24.11s/it] {'loss': 0.0062, 'grad_norm': 3.4898296809202547, 'learning_rate': 4.339710685954269e-07, 'completion_length': 257.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.6354167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6354167461395264, 'reward_std': 0.03869047947227955, 'kl': 0.15576171875, 'epoch': 0.57}
57%|█████▋ | 2427/4286 [18:22:47<12:33:09, 24.31s/it] {'loss': 0.0078, 'grad_norm': 5.801224430412166, 'learning_rate': 4.337377508166122e-07, 'completion_length': 301.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.770833432674408, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.034119345247745514, 'kl': 0.195556640625, 'epoch': 0.57}
57%|█████▋ | 2428/4286 [18:23:13<12:43:33, 24.66s/it] {'loss': 0.0067, 'grad_norm': 7.30875163974107, 'learning_rate': 4.335044330377974e-07, 'completion_length': 323.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.028580989688634872, 'kl': 0.16650390625, 'epoch': 0.57}
57%|█████▋ | 2429/4286 [18:23:36<12:33:09, 24.33s/it] {'loss': 0.0054, 'grad_norm': 4.435458071333365, 'learning_rate': 4.332711152589827e-07, 'completion_length': 284.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7351190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7351192235946655, 'reward_std': 0.030164924450218678, 'kl': 0.13525390625, 'epoch': 0.57}
57%|█████▋ | 2430/4286 [18:24:00<12:24:06, 24.06s/it] {'loss': 0.0112, 'grad_norm': 5.600157344963297, 'learning_rate': 4.3303779748016796e-07, 'completion_length': 307.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.07337505742907524, 'kl': 0.28076171875, 'epoch': 0.57}
57%|█████▋ | 2431/4286 [18:24:23<12:17:12, 23.85s/it] {'loss': 0.0112, 'grad_norm': 1.8391079374290789, 'learning_rate': 4.328044797013532e-07, 'completion_length': 294.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.8490647077560425, 'rewards/format_reward': 1.0, 'reward': 1.8490647673606873, 'reward_std': 0.028061222285032272, 'kl': 0.28076171875, 'epoch': 0.57}
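Note: in every entry the total 'reward' is, up to float32 rounding, the sum of the two component columns, 'rewards/only_full_func_accuracy_reward' and 'rewards/format_reward', i.e. the reward functions appear to be combined additively. A quick sanity check using step 2424's values from the entry above:

    # Values copied from step 2424; the 1e-4 tolerance absorbs float32 rounding.
    accuracy_reward = 0.7127976417541504
    format_reward = 1.0
    total_reward = 1.71279776096344

    assert abs((accuracy_reward + format_reward) - total_reward) < 1e-4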
57%|█████▋ | 2432/4286 [18:24:47<12:18:00, 23.88s/it] {'loss': 0.0048, 'grad_norm': 22.567561685628718, 'learning_rate': 4.3257116192253846e-07, 'completion_length': 293.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7619048655033112, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.043508339673280716, 'kl': 0.1187744140625, 'epoch': 0.57}
57%|█████▋ | 2433/4286 [18:25:11<12:19:21, 23.94s/it] {'loss': 0.0023, 'grad_norm': 2.644833274431865, 'learning_rate': 4.323378441437237e-07, 'completion_length': 274.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6250000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.10935882106423378, 'kl': 0.05810546875, 'epoch': 0.57}
[2025-03-03 09:23:00,089] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
57%|█████▋ | 2434/4286 [18:25:37<12:39:42, 24.61s/it] {'loss': 0.005, 'grad_norm': 1.603831154431575, 'learning_rate': 4.3210452636490896e-07, 'completion_length': 337.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.7157738208770752, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.01580178737640381, 'kl': 0.12451171875, 'epoch': 0.57}
57%|█████▋ | 2435/4286 [18:26:01<12:32:23, 24.39s/it] {'loss': 0.0015, 'grad_norm': 5.710931161241423, 'learning_rate': 4.3187120858609423e-07, 'completion_length': 290.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.8630952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8630954027175903, 'reward_std': 0.0357142873108387, 'kl': 0.0367431640625, 'epoch': 0.57}
57%|█████▋ | 2436/4286 [18:26:25<12:25:53, 24.19s/it] {'loss': 0.0098, 'grad_norm': 4.389553515538942, 'learning_rate': 4.3163789080727946e-07, 'completion_length': 299.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6755952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6755954027175903, 'reward_std': 0.0384767958894372, 'kl': 0.24609375, 'epoch': 0.57}
57%|█████▋ | 2437/4286 [18:26:50<12:32:25, 24.42s/it] {'loss': 0.006, 'grad_norm': 1.8529372927419248, 'learning_rate': 4.3140457302846473e-07, 'completion_length': 320.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6919642984867096, 'rewards/format_reward': 1.0, 'reward': 1.6919643878936768, 'reward_std': 0.0414529861882329, 'kl': 0.1488037109375, 'epoch': 0.57}
57%|█████▋ | 2438/4286 [18:27:16<12:46:15, 24.88s/it] {'loss': 0.0071, 'grad_norm': 58.680875536150296, 'learning_rate': 4.3117125524964995e-07, 'completion_length': 319.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.5431548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.52529776096344, 'reward_std': 0.06685744412243366, 'kl': 0.17822265625, 'epoch': 0.57}
57%|█████▋ | 2439/4286 [18:27:40<12:44:28, 24.83s/it] {'loss': 0.0078, 'grad_norm': 10.348750454871645, 'learning_rate': 4.3093793747083523e-07, 'completion_length': 288.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.8169643580913544, 'rewards/format_reward': 1.0, 'reward': 1.8169644474983215, 'reward_std': 0.0446428582072258, 'kl': 0.19482421875, 'epoch': 0.57}
57%|█████▋ | 2440/4286 [18:28:06<12:49:16, 25.00s/it] {'loss': 0.0026, 'grad_norm': 9.284599555961968, 'learning_rate': 4.307046196920205e-07, 'completion_length': 312.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.7755953073501587, 'rewards/format_reward': 1.0, 'reward': 1.7755953073501587, 'reward_std': 0.04001205787062645, 'kl': 0.06463623046875, 'epoch': 0.57}
57%|█████▋ | 2441/4286 [18:28:31<12:51:46, 25.10s/it] {'loss': 0.012, 'grad_norm': 2.5163497077441073, 'learning_rate': 4.304713019132057e-07, 'completion_length': 334.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7924107611179352, 'rewards/format_reward': 1.0, 'reward': 1.7924109101295471, 'reward_std': 0.0491071455180645, 'kl': 0.29931640625, 'epoch': 0.57}
57%|█████▋ | 2442/4286 [18:28:56<12:46:24, 24.94s/it] {'loss': 0.0059, 'grad_norm': 0.7667109405967882, 'learning_rate': 4.30237984134391e-07, 'completion_length': 290.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.8184524178504944, 'rewards/format_reward': 1.0, 'reward': 1.818452537059784, 'reward_std': 0.014979830011725426, 'kl': 0.146728515625, 'epoch': 0.57}
57%|█████▋ | 2443/4286 [18:29:20<12:44:10, 24.88s/it] {'loss': 0.0052, 'grad_norm': 10.901420082299174, 'learning_rate': 4.300046663555763e-07, 'completion_length': 294.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.730654776096344, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.06250000325962901, 'kl': 0.12939453125, 'epoch': 0.57}
57%|█████▋ | 2444/4286 [18:29:46<12:47:58, 25.02s/it] {'loss': 0.004, 'grad_norm': 1.4499782347874015, 'learning_rate': 4.297713485767615e-07, 'completion_length': 335.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.8288690745830536, 'rewards/format_reward': 1.0, 'reward': 1.8288692235946655, 'reward_std': 0.02267500478774309, 'kl': 0.098876953125, 'epoch': 0.57}
[2025-03-03 09:27:34,896] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
57%|█████▋ | 2445/4286 [18:30:12<12:58:41, 25.38s/it] {'loss': 0.0044, 'grad_norm': 7.7184180157114435, 'learning_rate': 4.2953803079794677e-07, 'completion_length': 306.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.6502977013587952, 'rewards/format_reward': 1.0, 'reward': 1.6502977013587952, 'reward_std': 0.04479556903243065, 'kl': 0.110107421875, 'epoch': 0.57}
57%|█████▋ | 2446/4286 [18:30:37<12:56:10, 25.31s/it] {'loss': 0.0136, 'grad_norm': 4.741509097555432, 'learning_rate': 4.29304713019132e-07, 'completion_length': 293.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7113095223903656, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6934524774551392, 'reward_std': 0.08136101067066193, 'kl': 0.34130859375, 'epoch': 0.57}
[2025-03-03 09:28:23,240] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
57%|█████▋ | 2447/4286 [18:31:00<12:36:17, 24.67s/it] {'loss': 0.0021, 'grad_norm': 0.27175280770029686, 'learning_rate': 4.2907139524031727e-07, 'completion_length': 276.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.052001953125, 'epoch': 0.57}
57%|█████▋ | 2448/4286 [18:31:26<12:45:31, 24.99s/it] {'loss': 0.0026, 'grad_norm': 4.7527944500450285, 'learning_rate': 4.2883807746150255e-07, 'completion_length': 334.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6193452775478363, 'rewards/format_reward': 1.0, 'reward': 1.6193453669548035, 'reward_std': 0.04683734476566315, 'kl': 0.0662841796875, 'epoch': 0.57}
57%|█████▋ | 2449/4286 [18:31:50<12:39:36, 24.81s/it] {'loss': 0.0022, 'grad_norm': 4.592496045735509, 'learning_rate': 4.2860475968268777e-07, 'completion_length': 310.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.06738010607659817, 'kl': 0.0550537109375, 'epoch': 0.57}
57%|█████▋ | 2450/4286 [18:32:16<12:50:38, 25.18s/it] {'loss': 0.0161, 'grad_norm': 1.3020529646290795, 'learning_rate': 4.2837144190387304e-07, 'completion_length': 322.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7068453133106232, 'rewards/format_reward': 1.0, 'reward': 1.7068454027175903, 'reward_std': 0.04090643860399723, 'kl': 0.404052734375, 'epoch': 0.57}
57%|█████▋ | 2451/4286 [18:32:41<12:43:56, 24.98s/it] {'loss': 0.0127, 'grad_norm': 9.74883376083202, 'learning_rate': 4.2813812412505827e-07, 'completion_length': 310.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.06769215408712626, 'kl': 0.318359375, 'epoch': 0.57}
57%|█████▋ | 2452/4286 [18:33:07<12:50:49, 25.22s/it] {'loss': 0.0093, 'grad_norm': 6.642159929649346, 'learning_rate': 4.2790480634624354e-07, 'completion_length': 301.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.04191340692341328, 'kl': 0.23388671875, 'epoch': 0.57}
57%|█████▋ | 2453/4286 [18:33:31<12:42:59, 24.97s/it] {'loss': 0.002, 'grad_norm': 4.512572638250926, 'learning_rate': 4.276714885674288e-07, 'completion_length': 294.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7842262983322144, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.04136601369827986, 'kl': 0.0511474609375, 'epoch': 0.57}
[2025-03-03 09:31:19,128] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
57%|█████▋ | 2454/4286 [18:33:56<12:43:05, 24.99s/it] {'loss': 0.0085, 'grad_norm': 6.239106366420394, 'learning_rate': 4.2743817078861404e-07, 'completion_length': 282.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.050841979682445526, 'kl': 0.2138671875, 'epoch': 0.57}
57%|█████▋ | 2455/4286 [18:34:21<12:36:23, 24.79s/it] {'loss': 0.0066, 'grad_norm': 4.266246383282296, 'learning_rate': 4.272048530097993e-07, 'completion_length': 296.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.06980599462985992, 'kl': 0.1650390625, 'epoch': 0.57}
57%|█████▋ | 2456/4286 [18:34:45<12:33:19, 24.70s/it] {'loss': 0.0053, 'grad_norm': 4.345907653445281, 'learning_rate': 4.2697153523098454e-07, 'completion_length': 319.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6830357313156128, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.026785715483129025, 'kl': 0.132080078125, 'epoch': 0.57}
57%|█████▋ | 2457/4286 [18:35:10<12:37:05, 24.84s/it] {'loss': 0.0079, 'grad_norm': 12.980034549717704, 'learning_rate': 4.267382174521698e-07, 'completion_length': 310.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.755952388048172, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.07762768864631653, 'kl': 0.1982421875, 'epoch': 0.57}
[2025-03-03 09:32:58,475] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
57%|█████▋ | 2458/4286 [18:35:36<12:41:43, 25.00s/it] {'loss': 0.0034, 'grad_norm': 1.460965241767252, 'learning_rate': 4.265048996733551e-07, 'completion_length': 278.32144927978516, 'rewards/only_full_func_accuracy_reward': 0.8258929252624512, 'rewards/format_reward': 1.0, 'reward': 1.8258929252624512, 'reward_std': 0.059310127049684525, 'kl': 0.083984375, 'epoch': 0.57}
[2025-03-03 09:33:23,707] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
57%|█████▋ | 2459/4286 [18:36:01<12:43:24, 25.07s/it] {'loss': 0.0076, 'grad_norm': 8.385827617000379, 'learning_rate': 4.262715818945403e-07, 'completion_length': 306.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.68601194024086, 'rewards/format_reward': 1.0, 'reward': 1.6860120296478271, 'reward_std': 0.055573709309101105, 'kl': 0.1904296875, 'epoch': 0.57}
57%|█████▋ | 2460/4286 [18:36:26<12:45:59, 25.17s/it] {'loss': 0.0262, 'grad_norm': 3.093109235254708, 'learning_rate': 4.260382641157256e-07, 'completion_length': 295.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.6309524476528168, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.035714288242161274, 'kl': 0.6533203125, 'epoch': 0.57}
57%|█████▋ | 2461/4286 [18:36:51<12:45:00, 25.15s/it] {'loss': 0.0027, 'grad_norm': 4.4479502735039365, 'learning_rate': 4.258049463369108e-07, 'completion_length': 283.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.06173977069556713, 'kl': 0.06640625, 'epoch': 0.57}
57%|█████▋ | 2462/4286 [18:37:14<12:22:49, 24.44s/it] {'loss': 0.0034, 'grad_norm': 1.3534855106692758, 'learning_rate': 4.255716285580961e-07, 'completion_length': 237.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.02816697023808956, 'kl': 0.0849609375, 'epoch': 0.57}
57%|█████▋ | 2463/4286 [18:37:39<12:30:32, 24.70s/it] {'loss': 0.0044, 'grad_norm': 2.4912410715440263, 'learning_rate': 4.2533831077928136e-07, 'completion_length': 258.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.038690474815666676, 'kl': 0.111083984375, 'epoch': 0.57}
57%|█████▋ | 2464/4286 [18:38:03<12:22:58, 24.47s/it] {'loss': 0.0099, 'grad_norm': 10.595222343316141, 'learning_rate': 4.251049930004666e-07, 'completion_length': 309.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.060017285868525505, 'kl': 0.2490234375, 'epoch': 0.57}
58%|█████▊ | 2465/4286 [18:38:29<12:37:09, 24.95s/it] {'loss': 0.0107, 'grad_norm': 48.7024210647293, 'learning_rate': 4.2487167522165185e-07, 'completion_length': 311.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142858505249023, 'reward_std': 0.1177637130022049, 'kl': 0.266357421875, 'epoch': 0.58}
[2025-03-03 09:36:16,841] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
58%|█████▊ | 2466/4286 [18:38:54<12:33:07, 24.83s/it] {'loss': 0.009, 'grad_norm': 2.8302967596568713, 'learning_rate': 4.2463835744283713e-07, 'completion_length': 307.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.8407738506793976, 'rewards/format_reward': 1.0, 'reward': 1.8407739400863647, 'reward_std': 0.008928571827709675, 'kl': 0.225341796875, 'epoch': 0.58}
58%|█████▊ | 2467/4286 [18:39:19<12:35:55, 24.93s/it] {'loss': 0.0069, 'grad_norm': 1.3478891355591702, 'learning_rate': 4.2440503966402235e-07, 'completion_length': 318.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.7812501192092896, 'rewards/format_reward': 1.0, 'reward': 1.7812501788139343, 'reward_std': 0.04900030232965946, 'kl': 0.173583984375, 'epoch': 0.58}
58%|█████▊ | 2468/4286 [18:39:43<12:27:13, 24.66s/it] {'loss': 0.0378, 'grad_norm': 9.78956177449912, 'learning_rate': 4.241717218852076e-07, 'completion_length': 293.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7659970819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7481399774551392, 'reward_std': 0.10806078091263771, 'kl': 0.943603515625, 'epoch': 0.58}
58%|█████▊ | 2469/4286 [18:40:08<12:30:14, 24.77s/it] {'loss': 0.0135, 'grad_norm': 0.5929787575905437, 'learning_rate': 4.2393840410639285e-07, 'completion_length': 304.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.77976194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7619048953056335, 'reward_std': 0.0595238134264946, 'kl': 0.3389892578125, 'epoch': 0.58}
58%|█████▊ | 2470/4286 [18:40:33<12:29:39, 24.77s/it] {'loss': 0.0123, 'grad_norm': 14.01075380864642, 'learning_rate': 4.237050863275781e-07, 'completion_length': 269.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8690476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8690477013587952, 'reward_std': 0.044342199340462685, 'kl': 0.3056640625, 'epoch': 0.58}
58%|█████▊ | 2471/4286 [18:40:57<12:26:58, 24.69s/it] {'loss': 0.0327, 'grad_norm': 3.516879653174441, 'learning_rate': 4.234717685487634e-07, 'completion_length': 316.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.642857313156128, 'reward_std': 0.02816697023808956, 'kl': 0.81640625, 'epoch': 0.58}
58%|█████▊ | 2472/4286 [18:41:22<12:25:27, 24.66s/it] {'loss': 0.0167, 'grad_norm': 8.412324295090807, 'learning_rate': 4.232384507699486e-07, 'completion_length': 278.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.04136601276695728, 'kl': 0.41650390625, 'epoch': 0.58}
58%|█████▊ | 2473/4286 [18:41:45<12:10:07, 24.16s/it] {'loss': 0.0025, 'grad_norm': 0.5238738919336724, 'learning_rate': 4.230051329911339e-07, 'completion_length': 277.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7767857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.010309826582670212, 'kl': 0.0633544921875, 'epoch': 0.58}
58%|█████▊ | 2474/4286 [18:42:08<12:01:50, 23.90s/it] {'loss': 0.0044, 'grad_norm': 2.7236934107873036, 'learning_rate': 4.227718152123191e-07, 'completion_length': 288.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8333333432674408, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.0357142873108387, 'kl': 0.109130859375, 'epoch': 0.58}
58%|█████▊ | 2475/4286 [18:42:34<12:16:23, 24.40s/it] {'loss': 0.0058, 'grad_norm': 12.77649357103232, 'learning_rate': 4.225384974335044e-07, 'completion_length': 301.0, 'rewards/only_full_func_accuracy_reward': 0.6964286863803864, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.07762768864631653, 'kl': 0.14501953125, 'epoch': 0.58}
58%|█████▊ | 2476/4286 [18:42:58<12:15:34, 24.38s/it] {'loss': 0.0361, 'grad_norm': 3.389540195024546, 'learning_rate': 4.2230517965468967e-07, 'completion_length': 269.94644927978516, 'rewards/only_full_func_accuracy_reward': 0.7490079998970032, 'rewards/format_reward': 1.0, 'reward': 1.7490081191062927, 'reward_std': 0.055128199979662895, 'kl': 0.9033203125, 'epoch': 0.58}
58%|█████▊ | 2477/4286 [18:43:24<12:23:18, 24.65s/it] {'loss': 0.0114, 'grad_norm': 8.571368017548274, 'learning_rate': 4.220718618758749e-07, 'completion_length': 313.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6450892984867096, 'rewards/format_reward': 1.0, 'reward': 1.645089328289032, 'reward_std': 0.07843663915991783, 'kl': 0.2861328125, 'epoch': 0.58}
58%|█████▊ | 2478/4286 [18:43:49<12:28:58, 24.86s/it] {'loss': 0.0052, 'grad_norm': 19.875295048788256, 'learning_rate': 4.2183854409706017e-07, 'completion_length': 335.2143096923828, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7336310744285583, 'reward_std': 0.0831196503713727, 'kl': 0.1300048828125, 'epoch': 0.58}
58%|█████▊ | 2479/4286 [18:44:13<12:19:58, 24.57s/it] {'loss': 0.0069, 'grad_norm': 31.218526236006365, 'learning_rate': 4.216052263182454e-07, 'completion_length': 264.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7752976417541504, 'rewards/format_reward': 1.0, 'reward': 1.7752978205680847, 'reward_std': 0.02678571827709675, 'kl': 0.1728515625, 'epoch': 0.58}
24.57s/it] 58%|█████▊ | 2480/4286 [18:44:38<12:29:32, 24.90s/it] {'loss': 0.0082, 'grad_norm': 102.50866785696498, 'learning_rate': 4.2137190853943066e-07, 'completion_length': 289.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 1.0, 'reward': 1.633928656578064, 'reward_std': 0.02976190857589245, 'kl': 0.2060546875, 'epoch': 0.58} 58%|█████▊ | 2480/4286 [18:44:38<12:29:32, 24.90s/it] 58%|█████▊ | 2481/4286 [18:45:04<12:35:46, 25.12s/it] {'loss': 0.0038, 'grad_norm': 2.995541368183255, 'learning_rate': 4.2113859076061594e-07, 'completion_length': 351.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.8497024476528168, 'rewards/format_reward': 1.0, 'reward': 1.8497024774551392, 'reward_std': 0.037095542065799236, 'kl': 0.0946044921875, 'epoch': 0.58} 58%|█████▊ | 2481/4286 [18:45:04<12:35:46, 25.12s/it] 58%|█████▊ | 2482/4286 [18:45:29<12:35:23, 25.12s/it] {'loss': 0.0063, 'grad_norm': 6.352333422683298, 'learning_rate': 4.2090527298180116e-07, 'completion_length': 316.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.8050596117973328, 'rewards/format_reward': 1.0, 'reward': 1.8050596714019775, 'reward_std': 0.022675003856420517, 'kl': 0.156494140625, 'epoch': 0.58} 58%|█████▊ | 2482/4286 [18:45:29<12:35:23, 25.12s/it] 58%|█████▊ | 2483/4286 [18:45:54<12:32:42, 25.05s/it] {'loss': 0.072, 'grad_norm': 34.61267505558539, 'learning_rate': 4.2067195520298644e-07, 'completion_length': 251.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.7626488506793976, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6912203431129456, 'reward_std': 0.19149275124073029, 'kl': 1.80078125, 'epoch': 0.58} 58%|█████▊ | 2483/4286 [18:45:54<12:32:42, 25.05s/it] 58%|█████▊ | 2484/4286 [18:46:17<12:12:53, 24.40s/it] {'loss': 0.0337, 'grad_norm': 2.5322292869357925, 'learning_rate': 4.2043863742417166e-07, 'completion_length': 223.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.7366071939468384, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.08938459306955338, 'kl': 0.84375, 'epoch': 0.58} 58%|█████▊ | 2484/4286 [18:46:17<12:12:53, 24.40s/it][2025-03-03 09:44:05,995] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 58%|█████▊ | 2485/4286 [18:46:43<12:28:04, 24.92s/it] {'loss': 0.0212, 'grad_norm': 3.6393158427047223, 'learning_rate': 4.2020531964535693e-07, 'completion_length': 312.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7288233041763306, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.710966169834137, 'reward_std': 0.11007784307003021, 'kl': 0.531005859375, 'epoch': 0.58} 58%|█████▊ | 2485/4286 [18:46:43<12:28:04, 24.92s/it] 58%|█████▊ | 2486/4286 [18:47:09<12:39:16, 25.31s/it] {'loss': 0.0035, 'grad_norm': 4.027031709628246, 'learning_rate': 4.199720018665422e-07, 'completion_length': 345.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.8110119700431824, 'rewards/format_reward': 1.0, 'reward': 1.8110120296478271, 'reward_std': 0.029548224061727524, 'kl': 0.087646484375, 'epoch': 0.58} 58%|█████▊ | 2486/4286 [18:47:09<12:39:16, 25.31s/it] 58%|█████▊ | 2487/4286 [18:47:33<12:26:18, 24.89s/it] {'loss': 0.0076, 'grad_norm': 5.273083501493979, 'learning_rate': 4.1973868408772743e-07, 'completion_length': 278.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.9047619104385376, 'rewards/format_reward': 1.0, 'reward': 1.9047619700431824, 'reward_std': 0.02380952052772045, 'kl': 0.187255859375, 'epoch': 0.58} 58%|█████▊ | 2487/4286 [18:47:33<12:26:18, 24.89s/it] 58%|█████▊ | 2488/4286 [18:47:58<12:21:04, 24.73s/it] {'loss': 0.0613, 'grad_norm': 7.108329489187708, 'learning_rate': 4.195053663089127e-07, 'completion_length': 289.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5528274178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5349704027175903, 'reward_std': 0.15251358225941658, 'kl': 1.53173828125, 'epoch': 0.58} 58%|█████▊ | 2488/4286 [18:47:58<12:21:04, 24.73s/it] 58%|█████▊ | 2489/4286 [18:48:22<12:15:52, 24.57s/it] {'loss': 0.0096, 'grad_norm': 26.715493201603458, 'learning_rate': 4.19272048530098e-07, 'completion_length': 285.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8154763579368591, 'reward_std': 0.08634257316589355, 'kl': 0.23828125, 'epoch': 0.58} 58%|█████▊ | 2489/4286 [18:48:22<12:15:52, 24.57s/it][2025-03-03 09:46:11,003] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 58%|█████▊ | 2490/4286 [18:48:48<12:31:15, 25.10s/it] {'loss': 0.0273, 'grad_norm': 3.101540123925205, 'learning_rate': 4.190387307512832e-07, 'completion_length': 329.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.05081737972795963, 'kl': 0.68359375, 'epoch': 0.58} 58%|█████▊ | 2490/4286 [18:48:48<12:31:15, 25.10s/it] 58%|█████▊ | 2491/4286 [18:49:13<12:24:49, 24.90s/it] {'loss': 0.0212, 'grad_norm': 5.052955193410931, 'learning_rate': 4.188054129724685e-07, 'completion_length': 315.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.07142857648432255, 'kl': 0.53125, 'epoch': 0.58} 58%|█████▊ | 2491/4286 [18:49:13<12:24:49, 24.90s/it] 58%|█████▊ | 2492/4286 [18:49:35<12:05:18, 24.26s/it] {'loss': 0.0039, 'grad_norm': 0.9278895295620909, 'learning_rate': 4.185720951936537e-07, 'completion_length': 259.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.8511905074119568, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.011904759332537651, 'kl': 0.09765625, 'epoch': 0.58} 58%|█████▊ | 2492/4286 [18:49:35<12:05:18, 24.26s/it] 58%|█████▊ | 2493/4286 [18:50:00<12:09:39, 24.42s/it] {'loss': 0.0169, 'grad_norm': 2.280169801187521, 'learning_rate': 4.18338777414839e-07, 'completion_length': 319.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.8244048357009888, 'rewards/format_reward': 1.0, 'reward': 1.8244048357009888, 'reward_std': 0.04602411389350891, 'kl': 0.423583984375, 'epoch': 0.58} 58%|█████▊ | 2493/4286 [18:50:00<12:09:39, 24.42s/it] 58%|█████▊ | 2494/4286 [18:50:25<12:14:01, 24.58s/it] {'loss': 0.02, 'grad_norm': 1.4296756882146482, 'learning_rate': 4.1810545963602425e-07, 'completion_length': 293.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.782738208770752, 'reward_std': 0.1250000074505806, 'kl': 0.498046875, 'epoch': 0.58} 58%|█████▊ | 2494/4286 [18:50:25<12:14:01, 24.58s/it] 58%|█████▊ | 2495/4286 [18:50:52<12:33:06, 25.23s/it] {'loss': 0.028, 'grad_norm': 16.7751252075287, 'learning_rate': 4.1787214185720947e-07, 'completion_length': 312.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.7893032729625702, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7535890340805054, 'reward_std': 0.15637341141700745, 'kl': 0.7021484375, 'epoch': 0.58} 58%|█████▊ | 2495/4286 [18:50:52<12:33:06, 25.23s/it] 58%|█████▊ | 2496/4286 [18:51:17<12:28:55, 25.10s/it] {'loss': 0.0311, 'grad_norm': 1.5347462684801898, 'learning_rate': 4.1763882407839475e-07, 'completion_length': 347.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.8267857730388641, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7910715341567993, 'reward_std': 0.07915207743644714, 'kl': 0.77734375, 'epoch': 0.58} 58%|█████▊ | 2496/4286 [18:51:17<12:28:55, 25.10s/it] 58%|█████▊ | 2497/4286 [18:51:42<12:27:01, 25.05s/it] {'loss': 0.0025, 'grad_norm': 8.39614553524936, 'learning_rate': 4.1740550629957997e-07, 'completion_length': 283.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.8050595819950104, 'rewards/format_reward': 1.0, 'reward': 1.8050596714019775, 'reward_std': 
0.008928566006943583, 'kl': 0.0611572265625, 'epoch': 0.58} 58%|█████▊ | 2497/4286 [18:51:42<12:27:01, 25.05s/it] 58%|█████▊ | 2498/4286 [18:52:08<12:35:42, 25.36s/it] {'loss': 0.018, 'grad_norm': 26.92299368289829, 'learning_rate': 4.1717218852076524e-07, 'completion_length': 272.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6279762387275696, 'reward_std': 0.06769215501844883, 'kl': 0.4482421875, 'epoch': 0.58} 58%|█████▊ | 2498/4286 [18:52:08<12:35:42, 25.36s/it] 58%|█████▊ | 2499/4286 [18:52:32<12:31:15, 25.22s/it] {'loss': 0.0097, 'grad_norm': 4.453594745653083, 'learning_rate': 4.169388707419505e-07, 'completion_length': 288.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7889881432056427, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7711310386657715, 'reward_std': 0.11035366356372833, 'kl': 0.2431640625, 'epoch': 0.58} 58%|█████▊ | 2499/4286 [18:52:32<12:31:15, 25.22s/it] 58%|█████▊ | 2500/4286 [18:52:56<12:15:23, 24.71s/it] {'loss': 0.0061, 'grad_norm': 0.6755896328271003, 'learning_rate': 4.1670555296313574e-07, 'completion_length': 248.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.9136905074119568, 'rewards/format_reward': 1.0, 'reward': 1.9136905670166016, 'reward_std': 0.01785714365541935, 'kl': 0.15283203125, 'epoch': 0.58} 58%|█████▊ | 2500/4286 [18:52:56<12:15:23, 24.71s/it] 58%|█████▊ | 2501/4286 [18:56:14<38:02:49, 76.73s/it] {'loss': 0.0113, 'grad_norm': 6.499436886882608, 'learning_rate': 4.16472235184321e-07, 'completion_length': 296.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.674107164144516, 'rewards/format_reward': 1.0, 'reward': 1.6741071939468384, 'reward_std': 0.14020215719938278, 'kl': 0.28076171875, 'epoch': 0.58} 58%|█████▊ | 2501/4286 [18:56:14<38:02:49, 76.73s/it] 58%|█████▊ | 2502/4286 [18:56:39<30:22:11, 61.28s/it] {'loss': 0.0213, 'grad_norm': 2.023691007044002, 'learning_rate': 4.1623891740550624e-07, 'completion_length': 330.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6517857909202576, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.0357142873108387, 'kl': 0.533203125, 'epoch': 0.58} 58%|█████▊ | 2502/4286 [18:56:39<30:22:11, 61.28s/it] 58%|█████▊ | 2503/4286 [18:57:04<24:54:11, 50.28s/it] {'loss': 0.0146, 'grad_norm': 6.433443317473671, 'learning_rate': 4.160055996266915e-07, 'completion_length': 324.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7648810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.05505155771970749, 'kl': 0.3642578125, 'epoch': 0.58} 58%|█████▊ | 2503/4286 [18:57:04<24:54:11, 50.28s/it] 58%|█████▊ | 2504/4286 [18:57:30<21:18:06, 43.03s/it] {'loss': 0.0192, 'grad_norm': 6.0565157411303145, 'learning_rate': 4.157722818478768e-07, 'completion_length': 332.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.7616071999073029, 'rewards/format_reward': 1.0, 'reward': 1.76160728931427, 'reward_std': 0.07159644924104214, 'kl': 0.479736328125, 'epoch': 0.58} 58%|█████▊ | 2504/4286 [18:57:30<21:18:06, 43.03s/it] 58%|█████▊ | 2505/4286 [18:57:55<18:33:19, 37.51s/it] {'loss': 0.0214, 'grad_norm': 3.993241514445785, 'learning_rate': 4.15538964069062e-07, 'completion_length': 319.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.03419382870197296, 'kl': 0.5380859375, 'epoch': 0.58} 58%|█████▊ | 2505/4286 
[18:57:55<18:33:19, 37.51s/it] 58%|█████▊ | 2506/4286 [18:58:21<16:48:45, 34.00s/it] {'loss': 0.0063, 'grad_norm': 10.326375698314783, 'learning_rate': 4.153056462902473e-07, 'completion_length': 342.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.03283696994185448, 'kl': 0.15869140625, 'epoch': 0.58} 58%|█████▊ | 2506/4286 [18:58:21<16:48:45, 34.00s/it] 58%|█████▊ | 2507/4286 [18:58:45<15:23:28, 31.15s/it] {'loss': 0.0144, 'grad_norm': 1.8977952297583276, 'learning_rate': 4.150723285114325e-07, 'completion_length': 301.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7261905670166016, 'rewards/format_reward': 1.0, 'reward': 1.7261906266212463, 'reward_std': 0.011904759332537651, 'kl': 0.3603515625, 'epoch': 0.58} 58%|█████▊ | 2507/4286 [18:58:45<15:23:28, 31.15s/it] 59%|█████▊ | 2508/4286 [18:59:11<14:33:31, 29.48s/it] {'loss': 0.0195, 'grad_norm': 5.410745592948849, 'learning_rate': 4.148390107326178e-07, 'completion_length': 339.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7068452537059784, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.053357749711722136, 'kl': 0.486328125, 'epoch': 0.59} 59%|█████▊ | 2508/4286 [18:59:11<14:33:31, 29.48s/it] 59%|█████▊ | 2509/4286 [18:59:37<14:01:41, 28.42s/it] {'loss': 0.0105, 'grad_norm': 3.4767950300343595, 'learning_rate': 4.1460569295380306e-07, 'completion_length': 324.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.8318453133106232, 'rewards/format_reward': 1.0, 'reward': 1.8318454027175903, 'reward_std': 0.031143157742917538, 'kl': 0.26318359375, 'epoch': 0.59} 59%|█████▊ | 2509/4286 [18:59:37<14:01:41, 28.42s/it] 59%|█████▊ | 2510/4286 [19:00:02<13:35:27, 27.55s/it] {'loss': 0.0176, 'grad_norm': 5.744106228785309, 'learning_rate': 4.143723751749883e-07, 'completion_length': 316.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.05357143096625805, 'kl': 0.4404296875, 'epoch': 0.59} 59%|█████▊ | 2510/4286 [19:00:02<13:35:27, 27.55s/it] 59%|█████▊ | 2511/4286 [19:00:26<13:07:14, 26.61s/it] {'loss': 0.0086, 'grad_norm': 2.573572318813772, 'learning_rate': 4.1413905739617356e-07, 'completion_length': 327.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.08450091257691383, 'kl': 0.216064453125, 'epoch': 0.59} 59%|█████▊ | 2511/4286 [19:00:26<13:07:14, 26.61s/it] 59%|█████▊ | 2512/4286 [19:00:51<12:47:59, 25.97s/it] {'loss': 0.005, 'grad_norm': 42.24018577510892, 'learning_rate': 4.1390573961735883e-07, 'completion_length': 327.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.8303571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8303572535514832, 'reward_std': 0.01282560033723712, 'kl': 0.126220703125, 'epoch': 0.59} 59%|█████▊ | 2512/4286 [19:00:51<12:47:59, 25.97s/it] 59%|█████▊ | 2513/4286 [19:01:14<12:19:06, 25.01s/it] {'loss': 0.0072, 'grad_norm': 9.223071309615523, 'learning_rate': 4.1367242183854405e-07, 'completion_length': 253.96430206298828, 'rewards/only_full_func_accuracy_reward': 0.8586309850215912, 'rewards/format_reward': 1.0, 'reward': 1.8586310744285583, 'reward_std': 0.029860783368349075, 'kl': 0.17913818359375, 'epoch': 0.59} 59%|█████▊ | 2513/4286 [19:01:14<12:19:06, 25.01s/it] 59%|█████▊ | 2514/4286 [19:01:40<12:29:22, 25.37s/it] {'loss': 
0.0024, 'grad_norm': 0.9522104989679782, 'learning_rate': 4.1343910405972933e-07, 'completion_length': 328.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.723214328289032, 'reward_std': 0.04701773822307587, 'kl': 0.0611572265625, 'epoch': 0.59} 59%|█████▊ | 2514/4286 [19:01:40<12:29:22, 25.37s/it] 59%|█████▊ | 2515/4286 [19:02:05<12:28:26, 25.36s/it] {'loss': 0.0061, 'grad_norm': 12.549399785614517, 'learning_rate': 4.1320578628091455e-07, 'completion_length': 266.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.6880953013896942, 'rewards/format_reward': 1.0, 'reward': 1.6880953311920166, 'reward_std': 0.10724472533911467, 'kl': 0.15185546875, 'epoch': 0.59} 59%|█████▊ | 2515/4286 [19:02:05<12:28:26, 25.36s/it] 59%|█████▊ | 2516/4286 [19:02:31<12:26:57, 25.32s/it] {'loss': 0.0131, 'grad_norm': 6.384794910564915, 'learning_rate': 4.1297246850209983e-07, 'completion_length': 324.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.040071736089885235, 'kl': 0.32861328125, 'epoch': 0.59} 59%|█████▊ | 2516/4286 [19:02:31<12:26:57, 25.32s/it][2025-03-03 10:00:21,166] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 59%|█████▊ | 2517/4286 [19:02:58<12:47:53, 26.04s/it] {'loss': 0.0088, 'grad_norm': 5.038600616959886, 'learning_rate': 4.127391507232851e-07, 'completion_length': 319.14288330078125, 'rewards/only_full_func_accuracy_reward': 0.8339711427688599, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8161140084266663, 'reward_std': 0.05564830079674721, 'kl': 0.218994140625, 'epoch': 0.59} 59%|█████▊ | 2517/4286 [19:02:58<12:47:53, 26.04s/it] 59%|█████▊ | 2518/4286 [19:03:24<12:47:31, 26.05s/it] {'loss': 0.0023, 'grad_norm': 10.814786785915858, 'learning_rate': 4.125058329444703e-07, 'completion_length': 290.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7723215520381927, 'rewards/format_reward': 1.0, 'reward': 1.7723215222358704, 'reward_std': 0.0505952350795269, 'kl': 0.0577392578125, 'epoch': 0.59} 59%|█████▊ | 2518/4286 [19:03:24<12:47:31, 26.05s/it] 59%|█████▉ | 2519/4286 [19:03:50<12:47:11, 26.05s/it] {'loss': 0.0141, 'grad_norm': 19.872114199571975, 'learning_rate': 4.122725151656556e-07, 'completion_length': 323.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7023809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.08072352968156338, 'kl': 0.351806640625, 'epoch': 0.59} 59%|█████▉ | 2519/4286 [19:03:50<12:47:11, 26.05s/it][2025-03-03 10:01:39,300] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 59%|█████▉ | 2520/4286 [19:04:16<12:46:30, 26.04s/it] {'loss': 0.0154, 'grad_norm': 5.2288972573597094, 'learning_rate': 4.120391973868408e-07, 'completion_length': 311.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6369048953056335, 'reward_std': 0.03805338963866234, 'kl': 0.384521484375, 'epoch': 0.59} 59%|█████▉ | 2520/4286 [19:04:16<12:46:30, 26.04s/it][2025-03-03 10:02:04,589] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 59%|█████▉ | 2521/4286 [19:04:42<12:39:25, 25.82s/it] {'loss': 0.0023, 'grad_norm': 1.9677382566184352, 'learning_rate': 4.118058796080261e-07, 'completion_length': 316.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.80952388048172, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.01626221090555191, 'kl': 0.0584716796875, 'epoch': 0.59} 59%|█████▉ | 2521/4286 [19:04:42<12:39:25, 25.82s/it] 59%|█████▉ | 2522/4286 [19:05:06<12:30:07, 25.51s/it] {'loss': 0.008, 'grad_norm': 6.551620522765911, 'learning_rate': 4.1157256182921137e-07, 'completion_length': 304.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 1.0, 'reward': 1.7633929252624512, 'reward_std': 0.029548224061727524, 'kl': 0.19970703125, 'epoch': 0.59} 59%|█████▉ | 2522/4286 [19:05:06<12:30:07, 25.51s/it] 59%|█████▉ | 2523/4286 [19:05:30<12:12:04, 24.91s/it] {'loss': 0.0056, 'grad_norm': 3.8237706346505425, 'learning_rate': 4.113392440503966e-07, 'completion_length': 297.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6651786267757416, 'rewards/format_reward': 1.0, 'reward': 1.6651787161827087, 'reward_std': 0.044642859138548374, 'kl': 0.138671875, 'epoch': 0.59} 59%|█████▉ | 2523/4286 [19:05:30<12:12:04, 24.91s/it] 59%|█████▉ | 2524/4286 [19:05:54<12:00:02, 24.52s/it] {'loss': 0.0096, 'grad_norm': 4.588390666630144, 'learning_rate': 4.1110592627158187e-07, 'completion_length': 297.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7827381193637848, 'rewards/format_reward': 1.0, 'reward': 1.7827382683753967, 'reward_std': 0.029761902987957, 'kl': 0.240234375, 'epoch': 0.59} 59%|█████▉ | 2524/4286 [19:05:54<12:00:02, 24.52s/it] 59%|█████▉ | 2525/4286 [19:06:16<11:45:27, 24.04s/it] {'loss': 0.0046, 'grad_norm': 5.825296695473159, 'learning_rate': 4.108726084927671e-07, 'completion_length': 288.375, 'rewards/only_full_func_accuracy_reward': 0.8065476417541504, 'rewards/format_reward': 1.0, 'reward': 1.8065478205680847, 'reward_std': 0.050381558015942574, 'kl': 0.11474609375, 'epoch': 0.59} 59%|█████▉ | 2525/4286 [19:06:17<11:45:27, 24.04s/it] 59%|█████▉ | 2526/4286 [19:06:41<11:52:51, 24.30s/it] {'loss': 0.0071, 'grad_norm': 7.410302891525076, 'learning_rate': 4.1063929071395237e-07, 'completion_length': 283.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8467262387275696, 'rewards/format_reward': 1.0, 'reward': 1.8467262983322144, 'reward_std': 
0.040532149374485016, 'kl': 0.17626953125, 'epoch': 0.59} 59%|█████▉ | 2526/4286 [19:06:41<11:52:51, 24.30s/it] 59%|█████▉ | 2527/4286 [19:07:05<11:49:25, 24.20s/it] {'loss': 0.0073, 'grad_norm': 4.815244706131596, 'learning_rate': 4.1040597293513764e-07, 'completion_length': 295.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7217262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7217262387275696, 'reward_std': 0.008928571827709675, 'kl': 0.182373046875, 'epoch': 0.59} 59%|█████▉ | 2527/4286 [19:07:05<11:49:25, 24.20s/it] 59%|█████▉ | 2528/4286 [19:07:31<12:03:35, 24.70s/it] {'loss': 0.0172, 'grad_norm': 3.9647991712524964, 'learning_rate': 4.1017265515632286e-07, 'completion_length': 313.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7079081833362579, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6900510787963867, 'reward_std': 0.07227891404181719, 'kl': 0.43017578125, 'epoch': 0.59} 59%|█████▉ | 2528/4286 [19:07:31<12:03:35, 24.70s/it] 59%|█████▉ | 2529/4286 [19:07:56<12:06:47, 24.82s/it] {'loss': 0.039, 'grad_norm': 10.918627409316606, 'learning_rate': 4.0993933737750814e-07, 'completion_length': 286.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.59077388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5729168057441711, 'reward_std': 0.18886125087738037, 'kl': 0.9765625, 'epoch': 0.59} 59%|█████▉ | 2529/4286 [19:07:56<12:06:47, 24.82s/it] 59%|█████▉ | 2530/4286 [19:08:20<11:56:51, 24.49s/it] {'loss': 0.0172, 'grad_norm': 34.83111636191849, 'learning_rate': 4.0970601959869336e-07, 'completion_length': 282.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.5758928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5580357909202576, 'reward_std': 0.1329546645283699, 'kl': 0.429931640625, 'epoch': 0.59} 59%|█████▉ | 2530/4286 [19:08:20<11:56:51, 24.49s/it] 59%|█████▉ | 2531/4286 [19:08:45<11:57:10, 24.52s/it] {'loss': 0.0195, 'grad_norm': 3.1220618730773686, 'learning_rate': 4.0947270181987864e-07, 'completion_length': 323.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7529761791229248, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7351191639900208, 'reward_std': 0.10407912312075496, 'kl': 0.4892578125, 'epoch': 0.59} 59%|█████▉ | 2531/4286 [19:08:45<11:57:10, 24.52s/it] 59%|█████▉ | 2532/4286 [19:09:09<11:58:23, 24.57s/it] {'loss': 0.0116, 'grad_norm': 4.192123002177712, 'learning_rate': 4.092393840410639e-07, 'completion_length': 305.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.87202388048172, 'rewards/format_reward': 1.0, 'reward': 1.8720239400863647, 'reward_std': 0.029761902056634426, 'kl': 0.290771484375, 'epoch': 0.59} 59%|█████▉ | 2532/4286 [19:09:09<11:58:23, 24.57s/it] 59%|█████▉ | 2533/4286 [19:09:34<12:00:20, 24.66s/it] {'loss': 0.0073, 'grad_norm': 7.3002046739294295, 'learning_rate': 4.0900606626224913e-07, 'completion_length': 315.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7514881789684296, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.0533577473834157, 'kl': 0.181640625, 'epoch': 0.59} 59%|█████▉ | 2533/4286 [19:09:34<12:00:20, 24.66s/it] 59%|█████▉ | 2534/4286 [19:09:58<11:55:31, 24.50s/it] {'loss': 0.004, 'grad_norm': 0.9737380055625573, 'learning_rate': 4.087727484834344e-07, 'completion_length': 334.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.05633394047617912, 'kl': 0.099609375, 'epoch': 
0.59} 59%|█████▉ | 2534/4286 [19:09:58<11:55:31, 24.50s/it] 59%|█████▉ | 2535/4286 [19:10:23<11:55:46, 24.53s/it] {'loss': 0.0248, 'grad_norm': 2.348195344473263, 'learning_rate': 4.085394307046197e-07, 'completion_length': 268.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262983322144, 'reward_std': 0.08035714644938707, 'kl': 0.62060546875, 'epoch': 0.59} 59%|█████▉ | 2535/4286 [19:10:23<11:55:46, 24.53s/it] 59%|█████▉ | 2536/4286 [19:10:48<11:58:19, 24.63s/it] {'loss': 0.0063, 'grad_norm': 6.633131299117422, 'learning_rate': 4.083061129258049e-07, 'completion_length': 317.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7901785969734192, 'rewards/format_reward': 1.0, 'reward': 1.790178656578064, 'reward_std': 0.04053215403109789, 'kl': 0.15673828125, 'epoch': 0.59} 59%|█████▉ | 2536/4286 [19:10:48<11:58:19, 24.63s/it] 59%|█████▉ | 2537/4286 [19:11:14<12:08:30, 24.99s/it] {'loss': 0.003, 'grad_norm': 6.142119783919729, 'learning_rate': 4.080727951469902e-07, 'completion_length': 301.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.761408805847168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7435516715049744, 'reward_std': 0.09163206815719604, 'kl': 0.0745849609375, 'epoch': 0.59} 59%|█████▉ | 2537/4286 [19:11:14<12:08:30, 24.99s/it] 59%|█████▉ | 2538/4286 [19:11:38<12:02:32, 24.80s/it] {'loss': 0.0043, 'grad_norm': 41.88164202981279, 'learning_rate': 4.078394773681754e-07, 'completion_length': 296.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7068452537059784, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.0267857164144516, 'kl': 0.1064453125, 'epoch': 0.59} 59%|█████▉ | 2538/4286 [19:11:38<12:02:32, 24.80s/it] 59%|█████▉ | 2539/4286 [19:12:04<12:10:47, 25.10s/it] {'loss': 0.0077, 'grad_norm': 7.3101262746239195, 'learning_rate': 4.076061595893607e-07, 'completion_length': 327.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7752976417541504, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.020833334187045693, 'kl': 0.191650390625, 'epoch': 0.59} 59%|█████▉ | 2539/4286 [19:12:04<12:10:47, 25.10s/it] 59%|█████▉ | 2540/4286 [19:12:30<12:16:49, 25.32s/it] {'loss': 0.0096, 'grad_norm': 5.458642095334967, 'learning_rate': 4.0737284181054595e-07, 'completion_length': 331.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7931548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7931548953056335, 'reward_std': 0.059310123324394226, 'kl': 0.240234375, 'epoch': 0.59} 59%|█████▉ | 2540/4286 [19:12:30<12:16:49, 25.32s/it] 59%|█████▉ | 2541/4286 [19:12:56<12:24:37, 25.60s/it] {'loss': 0.0241, 'grad_norm': 16.96637964445609, 'learning_rate': 4.071395240317312e-07, 'completion_length': 327.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7422619462013245, 'rewards/format_reward': 1.0, 'reward': 1.7422620058059692, 'reward_std': 0.07626478746533394, 'kl': 0.6025390625, 'epoch': 0.59} 59%|█████▉ | 2541/4286 [19:12:56<12:24:37, 25.60s/it] 59%|█████▉ | 2542/4286 [19:13:21<12:18:31, 25.41s/it] {'loss': 0.0178, 'grad_norm': 9.997976540224307, 'learning_rate': 4.0690620625291645e-07, 'completion_length': 304.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.830357164144516, 'rewards/format_reward': 1.0, 'reward': 1.830357313156128, 'reward_std': 0.05222322791814804, 'kl': 0.44580078125, 'epoch': 0.59} 59%|█████▉ | 2542/4286 [19:13:21<12:18:31, 25.41s/it] 59%|█████▉ | 2543/4286 
[19:13:46<12:13:23, 25.25s/it] {'loss': 0.0228, 'grad_norm': 10.449718222209263, 'learning_rate': 4.066728884741017e-07, 'completion_length': 308.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7619047462940216, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.05449226498603821, 'kl': 0.5693359375, 'epoch': 0.59} 59%|█████▉ | 2543/4286 [19:13:46<12:13:23, 25.25s/it] 59%|█████▉ | 2544/4286 [19:14:10<12:08:44, 25.10s/it] {'loss': 0.0166, 'grad_norm': 2.080548303046536, 'learning_rate': 4.0643957069528695e-07, 'completion_length': 332.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244048357009888, 'reward_std': 0.03228826471604407, 'kl': 0.4150390625, 'epoch': 0.59} 59%|█████▉ | 2544/4286 [19:14:10<12:08:44, 25.10s/it] 59%|█████▉ | 2545/4286 [19:14:36<12:09:25, 25.14s/it] {'loss': 0.0126, 'grad_norm': 4.467868425634834, 'learning_rate': 4.062062529164722e-07, 'completion_length': 318.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.0508419806137681, 'kl': 0.313720703125, 'epoch': 0.59} 59%|█████▉ | 2545/4286 [19:14:36<12:09:25, 25.14s/it] 59%|█████▉ | 2546/4286 [19:15:00<11:57:39, 24.75s/it] {'loss': 0.0078, 'grad_norm': 2.686478229575443, 'learning_rate': 4.0597293513765745e-07, 'completion_length': 291.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.752976268529892, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.05357143096625805, 'kl': 0.194091796875, 'epoch': 0.59} 59%|█████▉ | 2546/4286 [19:15:00<11:57:39, 24.75s/it] 59%|█████▉ | 2547/4286 [19:15:25<12:00:46, 24.87s/it] {'loss': 0.0048, 'grad_norm': 16.26214915881296, 'learning_rate': 4.057396173588427e-07, 'completion_length': 333.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0595238171517849, 'kl': 0.1201171875, 'epoch': 0.59} 59%|█████▉ | 2547/4286 [19:15:25<12:00:46, 24.87s/it] 59%|█████▉ | 2548/4286 [19:15:50<12:08:21, 25.14s/it] {'loss': 0.0152, 'grad_norm': 15.016698930814588, 'learning_rate': 4.0550629958002794e-07, 'completion_length': 321.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6235119104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.58779776096344, 'reward_std': 0.13123614341020584, 'kl': 0.3798828125, 'epoch': 0.59} 59%|█████▉ | 2548/4286 [19:15:50<12:08:21, 25.14s/it] 59%|█████▉ | 2549/4286 [19:16:15<12:04:48, 25.04s/it] {'loss': 0.0169, 'grad_norm': 3.5940459445569974, 'learning_rate': 4.052729818012132e-07, 'completion_length': 311.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7410714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7232144474983215, 'reward_std': 0.0898780096322298, 'kl': 0.423828125, 'epoch': 0.59} 59%|█████▉ | 2549/4286 [19:16:15<12:04:48, 25.04s/it] 59%|█████▉ | 2550/4286 [19:16:40<12:02:02, 24.96s/it] {'loss': 0.0091, 'grad_norm': 4.121334314725828, 'learning_rate': 4.050396640223985e-07, 'completion_length': 310.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.06320716254413128, 'kl': 0.22705078125, 'epoch': 0.59} 59%|█████▉ | 2550/4286 [19:16:40<12:02:02, 24.96s/it] 60%|█████▉ | 2551/4286 [19:17:05<12:02:44, 24.99s/it] {'loss': 0.0185, 'grad_norm': 13.321208449583786, 
'learning_rate': 4.048063462435837e-07, 'completion_length': 328.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.06320716068148613, 'kl': 0.4638671875, 'epoch': 0.6} 60%|█████▉ | 2551/4286 [19:17:05<12:02:44, 24.99s/it] 60%|█████▉ | 2552/4286 [19:17:30<11:57:18, 24.82s/it] {'loss': 0.0258, 'grad_norm': 14.383542982881409, 'learning_rate': 4.04573028464769e-07, 'completion_length': 275.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7648810148239136, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.74702388048172, 'reward_std': 0.13371488451957703, 'kl': 0.64453125, 'epoch': 0.6} 60%|█████▉ | 2552/4286 [19:17:30<11:57:18, 24.82s/it] 60%|█████▉ | 2553/4286 [19:17:53<11:49:24, 24.56s/it] {'loss': 0.0181, 'grad_norm': 3.5776644191057914, 'learning_rate': 4.043397106859542e-07, 'completion_length': 291.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7812500894069672, 'rewards/format_reward': 1.0, 'reward': 1.7812501788139343, 'reward_std': 0.008928571827709675, 'kl': 0.4541015625, 'epoch': 0.6} 60%|█████▉ | 2553/4286 [19:17:53<11:49:24, 24.56s/it] 60%|█████▉ | 2554/4286 [19:18:21<12:13:35, 25.41s/it] {'loss': 0.0566, 'grad_norm': 13.965423668637099, 'learning_rate': 4.041063929071395e-07, 'completion_length': 309.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.693452537059784, 'reward_std': 0.1795474886894226, 'kl': 1.416015625, 'epoch': 0.6} 60%|█████▉ | 2554/4286 [19:18:21<12:13:35, 25.41s/it][2025-03-03 10:16:09,294] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 60%|█████▉ | 2555/4286 [19:18:46<12:13:52, 25.44s/it] {'loss': 0.0128, 'grad_norm': 6.502531454395359, 'learning_rate': 4.0387307512832476e-07, 'completion_length': 311.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.8199405074119568, 'rewards/format_reward': 1.0, 'reward': 1.8199405670166016, 'reward_std': 0.04900030232965946, 'kl': 0.32080078125, 'epoch': 0.6} 60%|█████▉ | 2555/4286 [19:18:46<12:13:52, 25.44s/it][2025-03-03 10:16:33,950] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 60%|█████▉ | 2556/4286 [19:19:11<12:06:41, 25.20s/it] {'loss': 0.0044, 'grad_norm': 6.288704380387442, 'learning_rate': 4.0363975734951e-07, 'completion_length': 313.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7366072237491608, 'rewards/format_reward': 1.0, 'reward': 1.7366072535514832, 'reward_std': 0.047405367717146873, 'kl': 0.109130859375, 'epoch': 0.6} 60%|█████▉ | 2556/4286 [19:19:11<12:06:41, 25.20s/it] 60%|█████▉ | 2557/4286 [19:19:36<12:01:20, 25.03s/it] {'loss': 0.0267, 'grad_norm': 4.703112457310241, 'learning_rate': 4.0340643957069526e-07, 'completion_length': 266.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.8318452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.813988208770752, 'reward_std': 0.06538858264684677, 'kl': 0.6700439453125, 'epoch': 0.6} 60%|█████▉ | 2557/4286 [19:19:36<12:01:20, 25.03s/it] 60%|█████▉ | 2558/4286 [19:20:00<11:51:58, 24.72s/it] {'loss': 0.0053, 'grad_norm': 0.9748920031454218, 'learning_rate': 4.0317312179188054e-07, 'completion_length': 269.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.011904759332537651, 'kl': 0.1318359375, 'epoch': 0.6} 60%|█████▉ | 2558/4286 [19:20:00<11:51:58, 24.72s/it] 60%|█████▉ | 2559/4286 [19:20:22<11:34:25, 24.13s/it] {'loss': 0.0087, 'grad_norm': 2.1239468811492292, 'learning_rate': 4.0293980401306576e-07, 'completion_length': 239.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.8377976715564728, 'rewards/format_reward': 1.0, 'reward': 1.8377977013587952, 'reward_std': 0.008928571827709675, 'kl': 0.21923828125, 'epoch': 0.6} 60%|█████▉ | 2559/4286 [19:20:22<11:34:25, 24.13s/it][2025-03-03 10:18:10,248] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 60%|█████▉ | 2560/4286 [19:20:47<11:40:58, 24.37s/it] {'loss': 0.0171, 'grad_norm': 5.88956486143676, 'learning_rate': 4.0270648623425103e-07, 'completion_length': 302.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7008928954601288, 'rewards/format_reward': 1.0, 'reward': 1.700892984867096, 'reward_std': 0.06249999441206455, 'kl': 0.4287109375, 'epoch': 0.6} 60%|█████▉ | 2560/4286 [19:20:47<11:40:58, 24.37s/it] 60%|█████▉ | 2561/4286 [19:21:11<11:37:28, 24.26s/it] {'loss': 0.0112, 'grad_norm': 1.2185419294590125, 'learning_rate': 4.0247316845543626e-07, 'completion_length': 295.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.02816697023808956, 'kl': 0.28076171875, 'epoch': 0.6} 60%|█████▉ | 2561/4286 [19:21:11<11:37:28, 24.26s/it] 60%|█████▉ | 2562/4286 [19:21:36<11:38:31, 24.31s/it] {'loss': 0.0074, 'grad_norm': 3.0780505365220554, 'learning_rate': 4.0223985067662153e-07, 'completion_length': 342.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.03709553927183151, 'kl': 0.185302734375, 'epoch': 0.6} 60%|█████▉ | 2562/4286 [19:21:36<11:38:31, 24.31s/it] 60%|█████▉ | 2563/4286 [19:22:00<11:35:00, 24.20s/it] {'loss': 0.0074, 'grad_norm': 4.52885045374572, 'learning_rate': 4.020065328978068e-07, 'completion_length': 307.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.03068273887038231, 'kl': 0.1865234375, 'epoch': 0.6} 60%|█████▉ | 2563/4286 [19:22:00<11:35:00, 24.20s/it] 60%|█████▉ | 2564/4286 [19:22:24<11:37:27, 24.30s/it] {'loss': 0.0102, 'grad_norm': 2.5239650127648257, 'learning_rate': 4.0177321511899203e-07, 'completion_length': 282.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8080357611179352, 'rewards/format_reward': 1.0, 'reward': 1.8080358505249023, 'reward_std': 0.0414529861882329, 'kl': 0.25341796875, 'epoch': 0.6} 60%|█████▉ | 2564/4286 [19:22:24<11:37:27, 24.30s/it] 60%|█████▉ | 2565/4286 [19:22:50<11:51:15, 24.80s/it] {'loss': 0.0049, 'grad_norm': 4.230453187729947, 'learning_rate': 4.015398973401773e-07, 'completion_length': 314.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.8050595819950104, 'rewards/format_reward': 1.0, 'reward': 1.8050596117973328, 'reward_std': 0.03869047574698925, 'kl': 0.12158203125, 'epoch': 0.6} 60%|█████▉ | 2565/4286 [19:22:50<11:51:15, 24.80s/it] 60%|█████▉ | 2566/4286 [19:23:16<11:57:43, 25.04s/it] {'loss': 0.0084, 'grad_norm': 1.5734746488021698, 'learning_rate': 4.013065795613625e-07, 'completion_length': 323.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.852678656578064, 'rewards/format_reward': 1.0, 'reward': 1.8526787161827087, 'reward_std': 0.034374501556158066, 'kl': 0.20989990234375, 'epoch': 0.6} 60%|█████▉ | 2566/4286 [19:23:16<11:57:43, 25.04s/it] 60%|█████▉ | 2567/4286 [19:23:40<11:52:34, 24.87s/it] {'loss': 0.0079, 'grad_norm': 10.178159288961059, 'learning_rate': 4.010732617825478e-07, 'completion_length': 269.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.8348214626312256, 'rewards/format_reward': 1.0, 'reward': 1.8348214626312256, 'reward_std': 0.03273810027167201, 'kl': 
0.19580078125, 'epoch': 0.6} 60%|█████▉ | 2567/4286 [19:23:40<11:52:34, 24.87s/it] 60%|█████▉ | 2568/4286 [19:24:04<11:44:39, 24.61s/it] {'loss': 0.0046, 'grad_norm': 31.105031718943327, 'learning_rate': 4.008399440037331e-07, 'completion_length': 291.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.84226194024086, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.005952378269284964, 'kl': 0.1160888671875, 'epoch': 0.6} 60%|█████▉ | 2568/4286 [19:24:04<11:44:39, 24.61s/it] 60%|█████▉ | 2569/4286 [19:24:30<11:54:18, 24.96s/it] {'loss': 0.0083, 'grad_norm': 7.766179379921829, 'learning_rate': 4.006066262249183e-07, 'completion_length': 333.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.6770834028720856, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.05059524439275265, 'kl': 0.207763671875, 'epoch': 0.6} 60%|█████▉ | 2569/4286 [19:24:30<11:54:18, 24.96s/it] 60%|█████▉ | 2570/4286 [19:24:56<11:58:36, 25.13s/it] {'loss': 0.0134, 'grad_norm': 0.9711665701177495, 'learning_rate': 4.0037330844610357e-07, 'completion_length': 341.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.06388125382363796, 'kl': 0.3359375, 'epoch': 0.6} 60%|█████▉ | 2570/4286 [19:24:56<11:58:36, 25.13s/it] 60%|█████▉ | 2571/4286 [19:25:20<11:52:35, 24.93s/it] {'loss': 0.0075, 'grad_norm': 1.9324722075009881, 'learning_rate': 4.001399906672888e-07, 'completion_length': 294.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7425595819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7247024774551392, 'reward_std': 0.0881511913612485, 'kl': 0.1884765625, 'epoch': 0.6} 60%|█████▉ | 2571/4286 [19:25:20<11:52:35, 24.93s/it] 60%|██████ | 2572/4286 [19:25:45<11:56:02, 25.07s/it] {'loss': 0.0101, 'grad_norm': 5.130604400732474, 'learning_rate': 3.9990667288847407e-07, 'completion_length': 304.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.727678656578064, 'rewards/format_reward': 1.0, 'reward': 1.7276787161827087, 'reward_std': 0.019394677132368088, 'kl': 0.25146484375, 'epoch': 0.6} 60%|██████ | 2572/4286 [19:25:45<11:56:02, 25.07s/it] 60%|██████ | 2573/4286 [19:26:10<11:52:52, 24.97s/it] {'loss': 0.0066, 'grad_norm': 0.8916704094494146, 'learning_rate': 3.9967335510965935e-07, 'completion_length': 298.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.8080357313156128, 'rewards/format_reward': 1.0, 'reward': 1.8080357909202576, 'reward_std': 0.014880956150591373, 'kl': 0.1663818359375, 'epoch': 0.6} 60%|██████ | 2573/4286 [19:26:10<11:52:52, 24.97s/it] 60%|██████ | 2574/4286 [19:26:36<12:00:18, 25.24s/it] {'loss': 0.0041, 'grad_norm': 2.714032253752518, 'learning_rate': 3.9944003733084457e-07, 'completion_length': 312.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.816964328289032, 'rewards/format_reward': 1.0, 'reward': 1.816964328289032, 'reward_std': 0.026111614890396595, 'kl': 0.10302734375, 'epoch': 0.6} 60%|██████ | 2574/4286 [19:26:36<12:00:18, 25.24s/it] 60%|██████ | 2575/4286 [19:27:02<12:01:37, 25.31s/it] {'loss': 0.0305, 'grad_norm': 3.7808771192134674, 'learning_rate': 3.9920671955202984e-07, 'completion_length': 308.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.05419245734810829, 'kl': 0.76220703125, 'epoch': 0.6} 60%|██████ | 2575/4286 [19:27:02<12:01:37, 25.31s/it] 60%|██████ | 
2576/4286 [19:27:26<11:51:41, 24.97s/it] {'loss': 0.0196, 'grad_norm': 27.4979023962826, 'learning_rate': 3.9897340177321507e-07, 'completion_length': 309.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5982144474983215, 'reward_std': 0.01785714365541935, 'kl': 0.48828125, 'epoch': 0.6} 60%|██████ | 2576/4286 [19:27:26<11:51:41, 24.97s/it] 60%|██████ | 2577/4286 [19:27:51<11:53:23, 25.05s/it] {'loss': 0.0091, 'grad_norm': 31.040235297629437, 'learning_rate': 3.9874008399440034e-07, 'completion_length': 315.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.822916716337204, 'rewards/format_reward': 1.0, 'reward': 1.8229168057441711, 'reward_std': 0.04556369222700596, 'kl': 0.2275390625, 'epoch': 0.6} 60%|██████ | 2577/4286 [19:27:51<11:53:23, 25.05s/it] 60%|██████ | 2578/4286 [19:28:17<11:57:50, 25.22s/it] {'loss': 0.003, 'grad_norm': 3.0298873062963554, 'learning_rate': 3.985067662155856e-07, 'completion_length': 344.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.8125000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.04166666232049465, 'kl': 0.074951171875, 'epoch': 0.6} 60%|██████ | 2578/4286 [19:28:17<11:57:50, 25.22s/it] 60%|██████ | 2579/4286 [19:28:41<11:53:36, 25.08s/it] {'loss': 0.0278, 'grad_norm': 5.084586469437433, 'learning_rate': 3.9827344843677084e-07, 'completion_length': 328.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.724702537059784, 'reward_std': 0.04219478741288185, 'kl': 0.6943359375, 'epoch': 0.6} 60%|██████ | 2579/4286 [19:28:41<11:53:36, 25.08s/it] 60%|██████ | 2580/4286 [19:29:06<11:49:29, 24.95s/it] {'loss': 0.005, 'grad_norm': 3.6014051112006373, 'learning_rate': 3.980401306579561e-07, 'completion_length': 300.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.006873216480016708, 'kl': 0.1248779296875, 'epoch': 0.6} 60%|██████ | 2580/4286 [19:29:06<11:49:29, 24.95s/it] 60%|██████ | 2581/4286 [19:29:30<11:40:10, 24.64s/it] {'loss': 0.0042, 'grad_norm': 13.217945407100006, 'learning_rate': 3.978068128791414e-07, 'completion_length': 292.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.0476190485060215, 'kl': 0.105224609375, 'epoch': 0.6} 60%|██████ | 2581/4286 [19:29:30<11:40:10, 24.64s/it] 60%|██████ | 2582/4286 [19:29:54<11:32:01, 24.37s/it] {'loss': 0.0148, 'grad_norm': 10.220316622011605, 'learning_rate': 3.975734951003266e-07, 'completion_length': 310.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.04761904664337635, 'kl': 0.369140625, 'epoch': 0.6} 60%|██████ | 2582/4286 [19:29:54<11:32:01, 24.37s/it][2025-03-03 10:27:42,763] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 60%|██████ | 2583/4286 [19:30:20<11:47:37, 24.93s/it] {'loss': 0.0148, 'grad_norm': 16.759407597490387, 'learning_rate': 3.973401773215119e-07, 'completion_length': 323.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7633928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7455357909202576, 'reward_std': 0.07280983030796051, 'kl': 0.3681640625, 'epoch': 0.6} 60%|██████ | 2583/4286 [19:30:20<11:47:37, 24.93s/it] 60%|██████ | 2584/4286 [19:30:44<11:43:12, 24.79s/it] {'loss': 0.0073, 'grad_norm': 14.63672819446413, 'learning_rate': 3.971068595426971e-07, 'completion_length': 311.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.04849822539836168, 'kl': 0.18408203125, 'epoch': 0.6} 60%|██████ | 2584/4286 [19:30:44<11:43:12, 24.79s/it] 60%|██████ | 2585/4286 [19:31:10<11:49:05, 25.01s/it] {'loss': 0.008, 'grad_norm': 10.899808582135586, 'learning_rate': 3.968735417638824e-07, 'completion_length': 343.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0357142798602581, 'kl': 0.200439453125, 'epoch': 0.6} 60%|██████ | 2585/4286 [19:31:10<11:49:05, 25.01s/it] 60%|██████ | 2586/4286 [19:31:35<11:48:05, 24.99s/it] {'loss': 0.0155, 'grad_norm': 2.2118597214044597, 'learning_rate': 3.9664022398506766e-07, 'completion_length': 312.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8258928954601288, 'rewards/format_reward': 1.0, 'reward': 1.8258929252624512, 'reward_std': 0.06700375117361546, 'kl': 0.3861083984375, 'epoch': 0.6} 60%|██████ | 2586/4286 [19:31:35<11:48:05, 24.99s/it] 60%|██████ | 2587/4286 [19:32:00<11:51:47, 25.14s/it] {'loss': 0.0115, 'grad_norm': 3.7190914209966985, 'learning_rate': 3.964069062062529e-07, 'completion_length': 336.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.6294643431901932, 'rewards/format_reward': 1.0, 'reward': 1.6294643878936768, 'reward_std': 0.06265270244330168, 'kl': 0.28857421875, 'epoch': 0.6} 60%|██████ | 2587/4286 [19:32:00<11:51:47, 25.14s/it] 60%|██████ | 2588/4286 [19:32:24<11:39:20, 24.71s/it] {'loss': 0.0071, 'grad_norm': 1.2061777988035791, 'learning_rate': 3.9617358842743816e-07, 'completion_length': 276.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.013746436685323715, 'kl': 0.178466796875, 'epoch': 0.6} 60%|██████ | 2588/4286 [19:32:24<11:39:20, 24.71s/it] 60%|██████ | 2589/4286 [19:32:50<11:49:36, 25.09s/it] {'loss': 0.0075, 'grad_norm': 1.5118796758652269, 'learning_rate': 3.959402706486234e-07, 'completion_length': 339.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.691964328289032, 'rewards/format_reward': 1.0, 'reward': 1.691964328289032, 'reward_std': 0.01939467526972294, 'kl': 0.187255859375, 'epoch': 0.6} 60%|██████ | 2589/4286 [19:32:50<11:49:36, 25.09s/it] 60%|██████ | 2590/4286 [19:33:15<11:48:13, 25.06s/it] {'loss': 0.0045, 'grad_norm': 15.548722258805997, 'learning_rate': 3.9570695286980865e-07, 'completion_length': 309.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7354166805744171, 'rewards/format_reward': 1.0, 'reward': 1.735416829586029, 'reward_std': 
0.033524114172905684, 'kl': 0.11328125, 'epoch': 0.6} 60%|██████ | 2590/4286 [19:33:15<11:48:13, 25.06s/it] 60%|██████ | 2591/4286 [19:33:40<11:45:38, 24.98s/it] {'loss': 0.0013, 'grad_norm': 1.928755898640534, 'learning_rate': 3.9547363509099393e-07, 'completion_length': 312.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.6979166865348816, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.020833331160247326, 'kl': 0.03289794921875, 'epoch': 0.6} 60%|██████ | 2591/4286 [19:33:40<11:45:38, 24.98s/it] 60%|██████ | 2592/4286 [19:34:05<11:51:26, 25.20s/it] {'loss': 0.0037, 'grad_norm': 10.063775463791094, 'learning_rate': 3.9524031731217915e-07, 'completion_length': 346.0714569091797, 'rewards/only_full_func_accuracy_reward': 0.7976190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.06815123558044434, 'kl': 0.0921630859375, 'epoch': 0.6} 60%|██████ | 2592/4286 [19:34:05<11:51:26, 25.20s/it] 60%|██████ | 2593/4286 [19:34:31<11:53:21, 25.28s/it] {'loss': 0.012, 'grad_norm': 6.245764861997512, 'learning_rate': 3.950069995333644e-07, 'completion_length': 305.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6562500596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.638392984867096, 'reward_std': 0.16131830215454102, 'kl': 0.30078125, 'epoch': 0.6} 60%|██████ | 2593/4286 [19:34:31<11:53:21, 25.28s/it] 61%|██████ | 2594/4286 [19:34:56<11:47:36, 25.09s/it] {'loss': 0.0032, 'grad_norm': 2.011588269273668, 'learning_rate': 3.9477368175454965e-07, 'completion_length': 321.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.8139881193637848, 'rewards/format_reward': 1.0, 'reward': 1.813988208770752, 'reward_std': 0.05059524066746235, 'kl': 0.079833984375, 'epoch': 0.61} 61%|██████ | 2594/4286 [19:34:56<11:47:36, 25.09s/it][2025-03-03 10:32:43,572] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 61%|██████ | 2595/4286 [19:35:21<11:47:12, 25.09s/it] {'loss': 0.0072, 'grad_norm': 22.77921186251587, 'learning_rate': 3.945403639757349e-07, 'completion_length': 280.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.06620895676314831, 'kl': 0.180419921875, 'epoch': 0.61} 61%|██████ | 2595/4286 [19:35:21<11:47:12, 25.09s/it][2025-03-03 10:33:10,192] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 61%|██████    | 2596/4286 [19:35:47<11:59:41, 25.55s/it] {'loss': 0.0054, 'grad_norm': 0.79939358899392, 'learning_rate': 3.943070461969202e-07, 'completion_length': 322.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.7857144474983215, 'reward_std': 0.023135432042181492, 'kl': 0.135498046875, 'epoch': 0.61}
 61%|██████    | 2597/4286 [19:36:12<11:54:58, 25.40s/it] {'loss': 0.0029, 'grad_norm': 1.044992279729691, 'learning_rate': 3.940737284181054e-07, 'completion_length': 310.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7916667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7916668057441711, 'reward_std': 0.020619653165340424, 'kl': 0.0716552734375, 'epoch': 0.61}
 61%|██████    | 2598/4286 [19:36:36<11:41:23, 24.93s/it] {'loss': 0.0013, 'grad_norm': 1.8349979818808682, 'learning_rate': 3.938404106392907e-07, 'completion_length': 272.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.022675009444355965, 'kl': 0.031494140625, 'epoch': 0.61}
 61%|██████    | 2599/4286 [19:37:02<11:46:22, 25.12s/it] {'loss': 0.0101, 'grad_norm': 1.2521140178806012, 'learning_rate': 3.936070928604759e-07, 'completion_length': 273.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7946429550647736, 'rewards/format_reward': 1.0, 'reward': 1.7946430444717407, 'reward_std': 0.05357143096625805, 'kl': 0.25244140625, 'epoch': 0.61}
 61%|██████    | 2600/4286 [19:37:27<11:44:23, 25.07s/it] {'loss': 0.012, 'grad_norm': 7.317277024987081, 'learning_rate': 3.933737750816612e-07, 'completion_length': 339.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6741071939468384, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.0474053667858243, 'kl': 0.298828125, 'epoch': 0.61}
 61%|██████    | 2601/4286 [19:40:52<36:59:52, 79.05s/it] {'loss': 0.0075, 'grad_norm': 3.9023535016509427, 'learning_rate': 3.9314045730284647e-07, 'completion_length': 340.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.7827382683753967, 'reward_std': 0.0416666641831398, 'kl': 0.186279296875, 'epoch': 0.61}
 61%|██████    | 2602/4286 [19:41:18<29:31:58, 63.13s/it] {'loss': 0.006, 'grad_norm': 0.981075167285085, 'learning_rate': 3.929071395240317e-07, 'completion_length': 309.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.0, 'kl': 0.149169921875, 'epoch': 0.61}
 61%|██████    | 2603/4286 [19:41:43<24:12:54, 51.80s/it] {'loss': 0.0025, 'grad_norm': 25.851382095317582, 'learning_rate': 3.9267382174521697e-07, 'completion_length': 286.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.05952381156384945, 'kl': 0.0625, 'epoch': 0.61}
 61%|██████    | 2604/4286 [19:42:08<20:26:10, 43.74s/it] {'loss': 0.0039, 'grad_norm': 3.763548520561468, 'learning_rate': 3.9244050396640224e-07, 'completion_length': 271.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 1.0, 'reward': 1.748512089252472, 'reward_std': 0.11447649076581001, 'kl': 0.09765625, 'epoch': 0.61}
 61%|██████    | 2605/4286 [19:42:32<17:43:04, 37.94s/it] {'loss': 0.0026, 'grad_norm': 2.776432926042338, 'learning_rate': 3.9220718618758746e-07, 'completion_length': 288.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6875001192092896, 'reward_std': 0.029761906247586012, 'kl': 0.065185546875, 'epoch': 0.61}
 61%|██████    | 2606/4286 [19:42:58<16:01:23, 34.34s/it] {'loss': 0.004, 'grad_norm': 3.184162084116277, 'learning_rate': 3.9197386840877274e-07, 'completion_length': 313.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6666668057441711, 'reward_std': 0.13708077743649483, 'kl': 0.100341796875, 'epoch': 0.61}
 61%|██████    | 2607/4286 [19:43:23<14:39:03, 31.41s/it] {'loss': 0.0031, 'grad_norm': 25.09685265497599, 'learning_rate': 3.9174055062995796e-07, 'completion_length': 317.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6294643133878708, 'rewards/format_reward': 1.0, 'reward': 1.6294643878936768, 'reward_std': 0.047405365854501724, 'kl': 0.0762939453125, 'epoch': 0.61}
 61%|██████    | 2608/4286 [19:43:48<13:48:55, 29.64s/it] {'loss': 0.003, 'grad_norm': 2.1304107633530935, 'learning_rate': 3.9150723285114324e-07, 'completion_length': 349.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.767857313156128, 'reward_std': 0.06823870167136192, 'kl': 0.07470703125, 'epoch': 0.61}
 61%|██████    | 2609/4286 [19:44:14<13:17:03, 28.52s/it] {'loss': 0.0013, 'grad_norm': 7.537045761670283, 'learning_rate': 3.912739150723285e-07, 'completion_length': 299.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7425596117973328, 'rewards/format_reward': 1.0, 'reward': 1.7425596117973328, 'reward_std': 0.03114316053688526, 'kl': 0.03289794921875, 'epoch': 0.61}
 61%|██████    | 2610/4286 [19:44:39<12:45:07, 27.39s/it] {'loss': 0.0076, 'grad_norm': 3.715029331166002, 'learning_rate': 3.9104059729351373e-07, 'completion_length': 317.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.654762089252472, 'reward_std': 0.07601707428693771, 'kl': 0.1893310546875, 'epoch': 0.61}
 61%|██████    | 2611/4286 [19:45:04<12:22:57, 26.61s/it] {'loss': 0.0044, 'grad_norm': 2.8416085594535994, 'learning_rate': 3.90807279514699e-07, 'completion_length': 319.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6130952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6130954027175903, 'reward_std': 0.020619653165340424, 'kl': 0.109619140625, 'epoch': 0.61}
 61%|██████
61%|██████ | 2612/4286 [19:45:29<12:12:22, 26.25s/it] {'loss': 0.0024, 'grad_norm': 2.9442064802644707, 'learning_rate': 3.9057396173588423e-07, 'completion_length': 313.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.941964328289032, 'rewards/format_reward': 1.0, 'reward': 1.9419643878936768, 'reward_std': 0.008928571827709675, 'kl': 0.0589599609375, 'epoch': 0.61}
61%|██████ | 2613/4286 [19:45:54<12:00:35, 25.84s/it] {'loss': 0.0023, 'grad_norm': 2.867604522956534, 'learning_rate': 3.903406439570695e-07, 'completion_length': 311.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.815476268529892, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.0357142835855484, 'kl': 0.0574951171875, 'epoch': 0.61}
61%|██████ | 2614/4286 [19:46:20<12:02:57, 25.94s/it] {'loss': 0.0019, 'grad_norm': 0.22104044600853107, 'learning_rate': 3.901073261782548e-07, 'completion_length': 337.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.008928571827709675, 'kl': 0.047119140625, 'epoch': 0.61}
61%|██████ | 2615/4286 [19:46:45<11:53:49, 25.63s/it] {'loss': 0.0026, 'grad_norm': 9.099017723054867, 'learning_rate': 3.8987400839944e-07, 'completion_length': 327.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.752976268529892, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.029761902987957, 'kl': 0.063720703125, 'epoch': 0.61}
61%|██████ | 2616/4286 [19:47:11<11:57:12, 25.77s/it] {'loss': 0.0021, 'grad_norm': 1.6887317334558833, 'learning_rate': 3.896406906206253e-07, 'completion_length': 327.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8131377696990967, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7952806949615479, 'reward_std': 0.07702628336846828, 'kl': 0.05206298828125, 'epoch': 0.61}
61%|██████ | 2617/4286 [19:47:35<11:40:53, 25.20s/it] {'loss': 0.0013, 'grad_norm': 0.9069138730863021, 'learning_rate': 3.894073728418105e-07, 'completion_length': 289.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.752976268529892, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.0416666716337204, 'kl': 0.0323486328125, 'epoch': 0.61}
61%|██████ | 2618/4286 [19:48:01<11:41:54, 25.25s/it] {'loss': 0.0151, 'grad_norm': 2.6627216999839938, 'learning_rate': 3.891740550629958e-07, 'completion_length': 308.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.724702388048172, 'rewards/format_reward': 1.0, 'reward': 1.724702537059784, 'reward_std': 0.055841732770204544, 'kl': 0.3779296875, 'epoch': 0.61}
61%|██████ | 2619/4286 [19:48:26<11:40:23, 25.21s/it] {'loss': 0.0166, 'grad_norm': 1.6589387180944244, 'learning_rate': 3.8894073728418105e-07, 'completion_length': 309.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8001701533794403, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7823129892349243, 'reward_std': 0.08258874714374542, 'kl': 0.41357421875, 'epoch': 0.61}
61%|██████ | 2620/4286 [19:48:51<11:43:02, 25.32s/it] {'loss': 0.0049, 'grad_norm': 1.1936172837792596, 'learning_rate': 3.8870741950536627e-07, 'completion_length': 278.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7544643878936768, 'reward_std': 0.0625000037252903, 'kl': 0.1217041015625, 'epoch': 0.61}
61%|██████ | 2621/4286 [19:49:17<11:45:22, 25.42s/it] {'loss': 0.0036, 'grad_norm': 4.959463621777903, 'learning_rate': 3.8847410172655155e-07, 'completion_length': 295.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.8139881789684296, 'rewards/format_reward': 1.0, 'reward': 1.813988208770752, 'reward_std': 0.029001673683524132, 'kl': 0.08984375, 'epoch': 0.61}
61%|██████ | 2622/4286 [19:49:42<11:38:50, 25.20s/it] {'loss': 0.003, 'grad_norm': 3.140033740473992, 'learning_rate': 3.8824078394773677e-07, 'completion_length': 306.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.68452388048172, 'rewards/format_reward': 1.0, 'reward': 1.68452388048172, 'reward_std': 0.03504018485546112, 'kl': 0.073974609375, 'epoch': 0.61}
61%|██████ | 2623/4286 [19:50:07<11:44:04, 25.40s/it] {'loss': 0.0054, 'grad_norm': 6.118556559589555, 'learning_rate': 3.8800746616892204e-07, 'completion_length': 340.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.7931548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7931548357009888, 'reward_std': 0.020833331160247326, 'kl': 0.13421630859375, 'epoch': 0.61}
61%|██████ | 2624/4286 [19:50:33<11:47:55, 25.56s/it] {'loss': 0.0081, 'grad_norm': 29.52245705090499, 'learning_rate': 3.877741483901073e-07, 'completion_length': 295.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 1.0, 'reward': 1.6934524774551392, 'reward_std': 0.0714285746216774, 'kl': 0.20062255859375, 'epoch': 0.61}
61%|██████ | 2625/4286 [19:50:58<11:38:55, 25.25s/it] {'loss': 0.0135, 'grad_norm': 2.0502456605095984, 'learning_rate': 3.8754083061129254e-07, 'completion_length': 311.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.0769535917788744, 'kl': 0.337890625, 'epoch': 0.61}
61%|██████▏ | 2626/4286 [19:51:20<11:15:22, 24.41s/it] {'loss': 0.0018, 'grad_norm': 1.1715987565518242, 'learning_rate': 3.873075128324778e-07, 'completion_length': 253.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.8110119700431824, 'rewards/format_reward': 1.0, 'reward': 1.8110119700431824, 'reward_std': 0.023049295879900455, 'kl': 0.0458984375, 'epoch': 0.61}
61%|██████▏ | 2627/4286 [19:51:47<11:29:53, 24.95s/it] {'loss': 0.0218, 'grad_norm': 1.3729275497431705, 'learning_rate': 3.870741950536631e-07, 'completion_length': 320.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8407739102840424, 'rewards/format_reward': 1.0, 'reward': 1.8407739400863647, 'reward_std': 0.05063671991229057, 'kl': 0.542724609375, 'epoch': 0.61}
61%|██████▏ | 2628/4286 [19:52:12<11:33:27, 25.09s/it] {'loss': 0.0016, 'grad_norm': 4.704741095438981, 'learning_rate': 3.868408772748483e-07, 'completion_length': 323.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.05197649262845516, 'kl': 0.0394287109375, 'epoch': 0.61}
61%|██████▏ | 2629/4286 [19:52:36<11:22:16, 24.71s/it] {'loss': 0.0083, 'grad_norm': 1.2818197688440225, 'learning_rate': 3.866075594960336e-07, 'completion_length': 297.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.01785714365541935, 'kl': 0.2081298828125, 'epoch': 0.61}
61%|██████▏ | 2630/4286 [19:53:01<11:30:01, 25.00s/it] {'loss': 0.0108, 'grad_norm': 8.87028731166306, 'learning_rate': 3.863742417172188e-07, 'completion_length': 271.07144927978516, 'rewards/only_full_func_accuracy_reward': 0.7645376324653625, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7466805577278137, 'reward_std': 0.08997253328561783, 'kl': 0.27044677734375, 'epoch': 0.61}
61%|██████▏ | 2631/4286 [19:53:26<11:27:32, 24.93s/it] {'loss': 0.0033, 'grad_norm': 3.4349325263088226, 'learning_rate': 3.861409239384041e-07, 'completion_length': 278.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.8145833909511566, 'rewards/format_reward': 1.0, 'reward': 1.814583420753479, 'reward_std': 0.01970252674072981, 'kl': 0.083740234375, 'epoch': 0.61}
61%|██████▏ | 2632/4286 [19:53:52<11:33:34, 25.16s/it] {'loss': 0.007, 'grad_norm': 1.9438918458041563, 'learning_rate': 3.8590760615958936e-07, 'completion_length': 334.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.816964328289032, 'rewards/format_reward': 1.0, 'reward': 1.8169644474983215, 'reward_std': 0.05367030389606953, 'kl': 0.173828125, 'epoch': 0.61}
[2025-03-03 10:51:40,692] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
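The stage3 warning above suggests a concrete fix: flush the PyTorch allocator cache at the same point on every rank. A minimal sketch of what that looks like in a generic DeepSpeed loop follows; `engine`, `train_loader`, and the flush interval are hypothetical placeholders, not taken from this run's actual script.

    # Sketch only: synchronized allocator-cache flush, per the stage3 warning above.
    # `engine` and `train_loader` are hypothetical stand-ins for this run's objects.
    from deepspeed.accelerator import get_accelerator

    for step, batch in enumerate(train_loader):
        loss = engine(batch)       # forward pass through the DeepSpeed engine
        engine.backward(loss)      # backward pass
        engine.step()              # ZeRO stage-3 optimizer step
        if step % 50 == 0:
            # Flush on every rank at the same step so implicit flushes under
            # memory pressure do not desynchronize (and stall) the ranks.
            get_accelerator().empty_cache()

Flushing too often costs throughput, so an interval tied to how frequently the warning fires is a reasonable starting point.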
61%|██████▏ | 2633/4286 [19:54:18<11:38:45, 25.36s/it] {'loss': 0.002, 'grad_norm': 0.556483504326993, 'learning_rate': 3.856742883807746e-07, 'completion_length': 328.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.669642984867096, 'reward_std': 0.024056263267993927, 'kl': 0.0489501953125, 'epoch': 0.61}
61%|██████▏ | 2634/4286 [19:54:43<11:40:53, 25.46s/it] {'loss': 0.0033, 'grad_norm': 4.776755759116893, 'learning_rate': 3.8544097060195986e-07, 'completion_length': 345.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7324405014514923, 'rewards/format_reward': 1.0, 'reward': 1.7324405908584595, 'reward_std': 0.043833937495946884, 'kl': 0.0823974609375, 'epoch': 0.61}
61%|██████▏ | 2635/4286 [19:55:10<11:48:13, 25.74s/it] {'loss': 0.0014, 'grad_norm': 0.16620599148173368, 'learning_rate': 3.852076528231451e-07, 'completion_length': 310.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.6919642984867096, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.008928571827709675, 'kl': 0.0360107421875, 'epoch': 0.61}
62%|██████▏ | 2636/4286 [19:55:36<11:47:56, 25.74s/it] {'loss': 0.0021, 'grad_norm': 0.3576636366086616, 'learning_rate': 3.8497433504433036e-07, 'completion_length': 303.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6755953133106232, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.026572031434625387, 'kl': 0.052734375, 'epoch': 0.62}
62%|██████▏ | 2637/4286 [19:56:02<11:51:04, 25.87s/it] {'loss': 0.0066, 'grad_norm': 1.505181957398903, 'learning_rate': 3.8474101726551563e-07, 'completion_length': 338.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7946429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7767858505249023, 'reward_std': 0.0535714328289032, 'kl': 0.16455078125, 'epoch': 0.62}
62%|██████▏ | 2638/4286 [19:56:30<12:10:36, 26.60s/it] {'loss': 0.0141, 'grad_norm': 2.2383854891747976, 'learning_rate': 3.8450769948670085e-07, 'completion_length': 322.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.10776388645172119, 'kl': 0.35302734375, 'epoch': 0.62}
62%|██████▏ | 2639/4286 [19:56:56<12:02:59, 26.34s/it] {'loss': 0.0041, 'grad_norm': 1.3057567394993306, 'learning_rate': 3.8427438170788613e-07, 'completion_length': 331.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.008928571827709675, 'kl': 0.102783203125, 'epoch': 0.62}
62%|██████▏ | 2640/4286 [19:57:21<11:50:47, 25.91s/it] {'loss': 0.0024, 'grad_norm': 1.4174280865825575, 'learning_rate': 3.8404106392907135e-07, 'completion_length': 305.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.815476268529892, 'rewards/format_reward': 1.0, 'reward': 1.8154763579368591, 'reward_std': 0.044741732999682426, 'kl': 0.06005859375, 'epoch': 0.62}
62%|██████▏ | 2641/4286 [19:57:45<11:36:51, 25.42s/it] {'loss': 0.0035, 'grad_norm': 2.606272376878598, 'learning_rate': 3.8380774615025663e-07, 'completion_length': 289.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.09523810539394617, 'kl': 0.0865478515625, 'epoch': 0.62}
62%|██████▏ | 2642/4286 [19:58:11<11:40:23, 25.56s/it] {'loss': 0.0016, 'grad_norm': 0.9583883299956136, 'learning_rate': 3.835744283714419e-07, 'completion_length': 338.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.886904776096344, 'rewards/format_reward': 1.0, 'reward': 1.8869048357009888, 'reward_std': 0.025651192292571068, 'kl': 0.0401611328125, 'epoch': 0.62}
62%|██████▏ | 2643/4286 [19:58:36<11:33:07, 25.31s/it] {'loss': 0.0025, 'grad_norm': 17.70654234357721, 'learning_rate': 3.833411105926271e-07, 'completion_length': 325.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.7678572237491608, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.03436608612537384, 'kl': 0.0635986328125, 'epoch': 0.62}
62%|██████▏ | 2644/4286 [19:59:02<11:38:54, 25.54s/it] {'loss': 0.0047, 'grad_norm': 9.752462000227927, 'learning_rate': 3.831077928138124e-07, 'completion_length': 338.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7991071939468384, 'rewards/format_reward': 1.0, 'reward': 1.7991072535514832, 'reward_std': 0.04464285168796778, 'kl': 0.11572265625, 'epoch': 0.62}
62%|██████▏ | 2645/4286 [19:59:28<11:46:24, 25.83s/it] {'loss': 0.0033, 'grad_norm': 1.7419078205960397, 'learning_rate': 3.828744750349976e-07, 'completion_length': 336.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.0357142873108387, 'kl': 0.08203125, 'epoch': 0.62}
[2025-03-03 10:57:16,671] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
62%|██████▏ | 2646/4286 [19:59:54<11:43:55, 25.75s/it] {'loss': 0.0027, 'grad_norm': 6.414071912932134, 'learning_rate': 3.826411572561829e-07, 'completion_length': 240.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.8422619998455048, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8244049549102783, 'reward_std': 0.10738959163427353, 'kl': 0.06793212890625, 'epoch': 0.62}
62%|██████▏ | 2647/4286 [20:00:18<11:33:54, 25.40s/it] {'loss': 0.0103, 'grad_norm': 2.5527193635321734, 'learning_rate': 3.8240783947736817e-07, 'completion_length': 296.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.8095238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.0833333358168602, 'kl': 0.25830078125, 'epoch': 0.62}
62%|██████▏ | 2648/4286 [20:00:43<11:29:13, 25.25s/it] {'loss': 0.0031, 'grad_norm': 0.4485956996861233, 'learning_rate': 3.821745216985534e-07, 'completion_length': 289.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.011904764920473099, 'kl': 0.0777587890625, 'epoch': 0.62}
62%|██████▏ | 2649/4286 [20:01:07<11:20:09, 24.93s/it] {'loss': 0.0056, 'grad_norm': 28.793118761955313, 'learning_rate': 3.8194120391973867e-07, 'completion_length': 286.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7514881789684296, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.10859859362244606, 'kl': 0.140380859375, 'epoch': 0.62}
62%|██████▏ | 2650/4286 [20:01:31<11:09:56, 24.57s/it] {'loss': 0.002, 'grad_norm': 4.715166409537034, 'learning_rate': 3.8170788614092394e-07, 'completion_length': 288.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7678572237491608, 'rewards/format_reward': 1.0, 'reward': 1.767857313156128, 'reward_std': 0.028740637004375458, 'kl': 0.048828125, 'epoch': 0.62}
62%|██████▏ | 2651/4286 [20:01:55<11:03:05, 24.33s/it] {'loss': 0.0014, 'grad_norm': 0.1050418398821944, 'learning_rate': 3.8147456836210917e-07, 'completion_length': 271.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.8214287161827087, 'reward_std': 0.0, 'kl': 0.0360107421875, 'epoch': 0.62}
62%|██████▏ | 2652/4286 [20:02:21<11:13:46, 24.74s/it] {'loss': 0.002, 'grad_norm': 1.9386869512120337, 'learning_rate': 3.8124125058329444e-07, 'completion_length': 336.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.699404776096344, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.08491658791899681, 'kl': 0.0496826171875, 'epoch': 0.62}
62%|██████▏ | 2653/4286 [20:02:46<11:16:46, 24.87s/it] {'loss': 0.0046, 'grad_norm': 1.4379615491672046, 'learning_rate': 3.8100793280447966e-07, 'completion_length': 323.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.7901786267757416, 'rewards/format_reward': 1.0, 'reward': 1.7901787161827087, 'reward_std': 0.041452985256910324, 'kl': 0.1146240234375, 'epoch': 0.62}
62%|██████▏ | 2654/4286 [20:03:11<11:19:40, 24.99s/it] {'loss': 0.0027, 'grad_norm': 21.61756078711807, 'learning_rate': 3.8077461502566494e-07, 'completion_length': 312.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.7261905670166016, 'rewards/format_reward': 1.0, 'reward': 1.7261906266212463, 'reward_std': 0.0595238134264946, 'kl': 0.0665283203125, 'epoch': 0.62}
62%|██████▏ | 2655/4286 [20:03:37<11:23:29, 25.14s/it] {'loss': 0.0014, 'grad_norm': 0.5419277590758336, 'learning_rate': 3.805412972468502e-07, 'completion_length': 320.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.8422619700431824, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.043508341535925865, 'kl': 0.0345458984375, 'epoch': 0.62}
62%|██████▏ | 2656/4286 [20:04:03<11:34:09, 25.55s/it] {'loss': 0.0019, 'grad_norm': 1.1662462902559292, 'learning_rate': 3.8030797946803544e-07, 'completion_length': 317.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.029761905781924725, 'kl': 0.0467529296875, 'epoch': 0.62}
62%|██████▏ | 2657/4286 [20:04:28<11:30:54, 25.45s/it] {'loss': 0.0039, 'grad_norm': 4.607683588611326, 'learning_rate': 3.800746616892207e-07, 'completion_length': 294.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8199405074119568, 'rewards/format_reward': 1.0, 'reward': 1.8199405670166016, 'reward_std': 0.04740536957979202, 'kl': 0.098388671875, 'epoch': 0.62}
62%|██████▏ | 2658/4286 [20:04:54<11:32:04, 25.51s/it] {'loss': 0.0148, 'grad_norm': 2.1264859736582555, 'learning_rate': 3.7984134391040593e-07, 'completion_length': 305.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6337160170078278, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6158589720726013, 'reward_std': 0.08269762247800827, 'kl': 0.3720703125, 'epoch': 0.62}
62%|██████▏ | 2659/4286 [20:05:22<11:49:18, 26.16s/it] {'loss': 0.0043, 'grad_norm': 2.1781879591045596, 'learning_rate': 3.796080261315912e-07, 'completion_length': 321.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7848214209079742, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7491072416305542, 'reward_std': 0.17280982434749603, 'kl': 0.1085205078125, 'epoch': 0.62}
62%|██████▏ | 2660/4286 [20:05:47<11:42:48, 25.93s/it] {'loss': 0.0029, 'grad_norm': 5.072491196280686, 'learning_rate': 3.793747083527765e-07, 'completion_length': 304.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7258929312229156, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7080358266830444, 'reward_std': 0.08988096565008163, 'kl': 0.0723876953125, 'epoch': 0.62}
62%|██████▏ | 2661/4286 [20:06:13<11:44:40, 26.02s/it] {'loss': 0.002, 'grad_norm': 35.13945435949533, 'learning_rate': 3.791413905739617e-07, 'completion_length': 318.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.721726268529892, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.07624643668532372, 'kl': 0.04931640625, 'epoch': 0.62}
62%|██████▏ | 2662/4286 [20:06:38<11:33:40, 25.63s/it] {'loss': 0.0016, 'grad_norm': 0.12750993525014737, 'learning_rate': 3.78908072795147e-07, 'completion_length': 295.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.0, 'kl': 0.040283203125, 'epoch': 0.62}
62%|██████▏ | 2663/4286 [20:07:03<11:27:15, 25.41s/it] {'loss': 0.0059, 'grad_norm': 3.295428541537298, 'learning_rate': 3.786747550163322e-07, 'completion_length': 301.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 1.0, 'reward': 1.662202537059784, 'reward_std': 0.044642859138548374, 'kl': 0.1474609375, 'epoch': 0.62}
62%|██████▏ | 2664/4286 [20:07:28<11:25:23, 25.35s/it] {'loss': 0.0054, 'grad_norm': 1.2272503400242294, 'learning_rate': 3.784414372375175e-07, 'completion_length': 301.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 1.0, 'reward': 1.6934524774551392, 'reward_std': 0.005952378269284964, 'kl': 0.135009765625, 'epoch': 0.62}
62%|██████▏ | 2665/4286 [20:07:53<11:24:29, 25.34s/it] {'loss': 0.0146, 'grad_norm': 9.88481004186853, 'learning_rate': 3.7820811945870275e-07, 'completion_length': 301.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.610119104385376, 'rewards/format_reward': 1.0, 'reward': 1.610119104385376, 'reward_std': 0.08641537372022867, 'kl': 0.36572265625, 'epoch': 0.62}
62%|██████▏ | 2666/4286 [20:08:17<11:12:41, 24.91s/it] {'loss': 0.0089, 'grad_norm': 35.58948457261526, 'learning_rate': 3.77974801679888e-07, 'completion_length': 291.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0476190485060215, 'kl': 0.2227783203125, 'epoch': 0.62}
62%|██████▏ | 2667/4286 [20:08:44<11:25:00, 25.39s/it] {'loss': 0.0188, 'grad_norm': 21.927197990777962, 'learning_rate': 3.7774148390107325e-07, 'completion_length': 328.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.816964328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7812500596046448, 'reward_std': 0.16853345185518265, 'kl': 0.470703125, 'epoch': 0.62}
62%|██████▏ | 2668/4286 [20:09:11<11:35:46, 25.80s/it] {'loss': 0.0091, 'grad_norm': 25.231645944507385, 'learning_rate': 3.775081661222585e-07, 'completion_length': 320.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7886905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7886905670166016, 'reward_std': 0.127803985029459, 'kl': 0.226806640625, 'epoch': 0.62}
62%|██████▏ | 2669/4286 [20:09:35<11:28:04, 25.53s/it] {'loss': 0.0017, 'grad_norm': 4.6831394407269125, 'learning_rate': 3.7727484834344375e-07, 'completion_length': 292.2143020629883, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.06526251137256622, 'kl': 0.041748046875, 'epoch': 0.62}
62%|██████▏ | 2670/4286 [20:10:00<11:16:16, 25.11s/it] {'loss': 0.0027, 'grad_norm': 0.17157165026491644, 'learning_rate': 3.77041530564629e-07, 'completion_length': 291.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6250000894069672, 'rewards/format_reward': 1.0, 'reward': 1.6250001788139343, 'reward_std': 0.0, 'kl': 0.06689453125, 'epoch': 0.62}
62%|██████▏ | 2671/4286 [20:10:24<11:13:32, 25.02s/it] {'loss': 0.006, 'grad_norm': 4.5119958184455955, 'learning_rate': 3.7680821278581425e-07, 'completion_length': 298.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7886905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7886906266212463, 'reward_std': 0.04350833594799042, 'kl': 0.1490478515625, 'epoch': 0.62}
62%|██████▏ | 2672/4286 [20:10:50<11:19:00, 25.24s/it] {'loss': 0.016, 'grad_norm': 5.906179300420734, 'learning_rate': 3.765748950069995e-07, 'completion_length': 316.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7084751129150391, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.654903769493103, 'reward_std': 0.07741131447255611, 'kl': 0.401611328125, 'epoch': 0.62}
62%|██████▏ | 2673/4286 [20:11:17<11:29:00, 25.63s/it] {'loss': 0.0155, 'grad_norm': 4.607612437690347, 'learning_rate': 3.763415772281848e-07, 'completion_length': 334.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.0654761902987957, 'kl': 0.388671875, 'epoch': 0.62}
62%|██████▏ | 2674/4286 [20:11:41<11:15:28, 25.14s/it] {'loss': 0.0151, 'grad_norm': 11.685942491498205, 'learning_rate': 3.7610825944937e-07, 'completion_length': 311.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.07301182672381401, 'kl': 0.376220703125, 'epoch': 0.62}
62%|██████▏ | 2675/4286 [20:12:06<11:14:49, 25.13s/it] {'loss': 0.0044, 'grad_norm': 3.687948300987963, 'learning_rate': 3.758749416705553e-07, 'completion_length': 289.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.8565476834774017, 'rewards/format_reward': 1.0, 'reward': 1.8565477132797241, 'reward_std': 0.06071428954601288, 'kl': 0.10986328125, 'epoch': 0.62}
62%|██████▏ | 2676/4286 [20:12:31<11:16:25, 25.21s/it] {'loss': 0.0095, 'grad_norm': 12.862466423948117, 'learning_rate': 3.756416238917405e-07, 'completion_length': 337.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7857143580913544, 'rewards/format_reward': 1.0, 'reward': 1.785714328289032, 'reward_std': 0.050381558015942574, 'kl': 0.2373046875, 'epoch': 0.62}
62%|██████▏ | 2677/4286 [20:12:56<11:11:34, 25.04s/it] {'loss': 0.007, 'grad_norm': 16.04286463491112, 'learning_rate': 3.754083061129258e-07, 'completion_length': 268.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.04123930633068085, 'kl': 0.174072265625, 'epoch': 0.62}
62%|██████▏ | 2678/4286 [20:13:20<11:02:30, 24.72s/it] {'loss': 0.0048, 'grad_norm': 0.8156424444857179, 'learning_rate': 3.7517498833411107e-07, 'completion_length': 301.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.7440478205680847, 'reward_std': 0.0, 'kl': 0.120361328125, 'epoch': 0.62}
63%|██████▎ | 2679/4286 [20:13:45<11:02:57, 24.75s/it] {'loss': 0.016, 'grad_norm': 7.788249976074075, 'learning_rate': 3.749416705552963e-07, 'completion_length': 320.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6614583730697632, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6257441639900208, 'reward_std': 0.1389174982905388, 'kl': 0.3994140625, 'epoch': 0.63}
63%|██████▎ | 2680/4286 [20:14:09<11:02:42, 24.76s/it] {'loss': 0.0122, 'grad_norm': 3.4260749530528627, 'learning_rate': 3.7470835277648156e-07, 'completion_length': 323.375, 'rewards/only_full_func_accuracy_reward': 0.6651786267757416, 'rewards/format_reward': 1.0, 'reward': 1.6651787161827087, 'reward_std': 0.0208333320915699, 'kl': 0.3056640625, 'epoch': 0.63}
63%|██████▎ | 2681/4286 [20:14:35<11:05:36, 24.88s/it] {'loss': 0.0067, 'grad_norm': 6.026817922199766, 'learning_rate': 3.744750349976668e-07, 'completion_length': 314.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.8095238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.02816697023808956, 'kl': 0.16796875, 'epoch': 0.63}
63%|██████▎ | 2682/4286 [20:15:01<11:17:08, 25.33s/it] {'loss': 0.0031, 'grad_norm': 0.7671554411601408, 'learning_rate': 3.7424171721885206e-07, 'completion_length': 345.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 1.0, 'reward': 1.662202537059784, 'reward_std': 0.022675009444355965, 'kl': 0.078125, 'epoch': 0.63}
63%|██████▎ | 2683/4286 [20:15:27<11:20:05, 25.46s/it] {'loss': 0.0156, 'grad_norm': 2.060548549257483, 'learning_rate': 3.7400839944003734e-07, 'completion_length': 283.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7574404776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.739583432674408, 'reward_std': 0.056547620333731174, 'kl': 0.39111328125, 'epoch': 0.63}
63%|██████▎ | 2684/4286 [20:15:52<11:19:15, 25.44s/it] {'loss': 0.035, 'grad_norm': 4.5505346346051825, 'learning_rate': 3.7377508166122256e-07, 'completion_length': 321.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.8080357015132904, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7901787161827087, 'reward_std': 0.18774673715233803, 'kl': 0.873046875, 'epoch': 0.63}
63%|██████▎ | 2685/4286 [20:16:18<11:20:22, 25.50s/it] {'loss': 0.0149, 'grad_norm': 2.644363853284341, 'learning_rate': 3.7354176388240783e-07, 'completion_length': 303.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 1.0, 'reward': 1.6860119700431824, 'reward_std': 0.04464286006987095, 'kl': 0.3720703125, 'epoch': 0.63}
63%|██████▎ | 2686/4286 [20:16:44<11:22:21, 25.59s/it] {'loss': 0.017, 'grad_norm': 25.772725533021056, 'learning_rate': 3.7330844610359306e-07, 'completion_length': 297.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.7916666865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7738096714019775, 'reward_std': 0.09800060465931892, 'kl': 0.4248046875, 'epoch': 0.63}
63%|██████▎ | 2687/4286 [20:17:10<11:30:30, 25.91s/it] {'loss': 0.0392, 'grad_norm': 4.368912115627666, 'learning_rate': 3.7307512832477833e-07, 'completion_length': 319.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.708333432674408, 'reward_std': 0.09523809142410755, 'kl': 0.98046875, 'epoch': 0.63}
63%|██████▎ | 2688/4286 [20:17:35<11:20:48, 25.56s/it] {'loss': 0.039, 'grad_norm': 11.298784875525662, 'learning_rate': 3.728418105459636e-07, 'completion_length': 303.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.74404776096344, 'reward_std': 0.1639816053211689, 'kl': 0.9765625, 'epoch': 0.63}
63%|██████▎ | 2689/4286 [20:18:00<11:15:49, 25.39s/it] {'loss': 0.0242, 'grad_norm': 21.79334922674393, 'learning_rate': 3.7260849276714883e-07, 'completion_length': 298.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.8452381193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8273810744285583, 'reward_std': 0.06968299299478531, 'kl': 0.60595703125, 'epoch': 0.63}
63%|██████▎ | 2690/4286 [20:18:25<11:12:01, 25.26s/it] {'loss': 0.0198, 'grad_norm': 70.10544927768497, 'learning_rate': 3.723751749883341e-07, 'completion_length': 323.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6924745440483093, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6746174097061157, 'reward_std': 0.11001947894692421, 'kl': 0.4951171875, 'epoch': 0.63}
63%|██████▎ | 2691/4286 [20:18:50<11:09:23, 25.18s/it] {'loss': 0.0077, 'grad_norm': 61.619137889240136, 'learning_rate': 3.721418572095193e-07, 'completion_length': 319.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.059310127049684525, 'kl': 0.193359375, 'epoch': 0.63}
63%|██████▎ | 2692/4286 [20:19:16<11:20:19, 25.61s/it] {'loss': 0.0166, 'grad_norm': 28.71344328291562, 'learning_rate': 3.719085394307046e-07, 'completion_length': 325.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7574405074119568, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7217263579368591, 'reward_std': 0.12847857363522053, 'kl': 0.41552734375, 'epoch': 0.63}
63%|██████▎ | 2693/4286 [20:19:43<11:25:28, 25.82s/it] {'loss': 0.0146, 'grad_norm': 3.606107207323235, 'learning_rate': 3.716752216518899e-07, 'completion_length': 318.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.8497024774551392, 'rewards/format_reward': 1.0, 'reward': 1.849702537059784, 'reward_std': 0.07029404491186142, 'kl': 0.3642578125, 'epoch': 0.63}
63%|██████▎ | 2694/4286 [20:20:07<11:13:17, 25.38s/it] {'loss': 0.0196, 'grad_norm': 6.2931909880717205, 'learning_rate': 3.714419038730751e-07, 'completion_length': 295.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7440477013587952, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.029761910438537598, 'kl': 0.4892578125, 'epoch': 0.63}
63%|██████▎ | 2695/4286 [20:20:33<11:16:03, 25.50s/it] {'loss': 0.0171, 'grad_norm': 13.525529673593235, 'learning_rate': 3.7120858609426037e-07, 'completion_length': 341.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.048447076231241226, 'kl': 0.4287109375, 'epoch': 0.63}
63%|██████▎ | 2696/4286 [20:20:57<11:07:09, 25.18s/it] {'loss': 0.0396, 'grad_norm': 4.2541520649919, 'learning_rate': 3.709752683154456e-07, 'completion_length': 277.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7306548357009888, 'reward_std': 0.15135836228728294, 'kl': 0.990234375, 'epoch': 0.63}
63%|██████▎ | 2697/4286 [20:21:22<10:58:46, 24.87s/it] {'loss': 0.0206, 'grad_norm': 6.046123613650561, 'learning_rate': 3.7074195053663087e-07, 'completion_length': 295.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6904762983322144, 'reward_std': 0.07468850631266832, 'kl': 0.515625, 'epoch': 0.63}
63%|██████▎ | 2698/4286 [20:21:47<11:02:24, 25.03s/it] {'loss': 0.0236, 'grad_norm': 10.859187662467543, 'learning_rate': 3.7050863275781615e-07, 'completion_length': 322.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7958334386348724, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7779762744903564, 'reward_std': 0.08690477348864079, 'kl': 0.58984375, 'epoch': 0.63}
63%|██████▎ | 2699/4286 [20:22:12<11:06:05, 25.18s/it] {'loss': 0.026, 'grad_norm': 4.123997322387185, 'learning_rate': 3.7027531497900137e-07, 'completion_length': 341.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.031603576615452766, 'kl': 0.6474609375, 'epoch': 0.63}
63%|██████▎ | 2700/4286 [20:22:36<10:55:01, 24.78s/it] {'loss': 0.0167, 'grad_norm': 4.886533660108049, 'learning_rate': 3.7004199720018664e-07, 'completion_length': 299.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.04388263076543808, 'kl': 0.41796875, 'epoch': 0.63}
63%|██████▎ | 2701/4286 [20:26:30<38:33:11, 87.57s/it] {'loss': 0.0597, 'grad_norm': 12.341012978786452, 'learning_rate': 3.698086794213719e-07, 'completion_length': 341.3214569091797, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6666668057441711, 'reward_std': 0.08333333395421505, 'kl': 1.4921875, 'epoch': 0.63}
63%|██████▎ | 2702/4286 [20:26:55<30:11:54, 68.63s/it] {'loss': 0.0566, 'grad_norm': 32.427531500461505, 'learning_rate': 3.6957536164255714e-07, 'completion_length': 331.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6502977013587952, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6145833730697632, 'reward_std': 0.18480360135436058, 'kl': 1.4140625, 'epoch': 0.63}
63%|██████▎ | 2703/4286 [20:27:20<24:23:33, 55.47s/it] {'loss': 0.0026, 'grad_norm': 1.1142817965832343, 'learning_rate': 3.693420438637424e-07, 'completion_length': 296.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7514881789684296, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.0208333358168602, 'kl': 0.06494140625, 'epoch': 0.63}
63%|██████▎ | 2704/4286 [20:27:44<20:18:42, 46.22s/it] {'loss': 0.0135, 'grad_norm': 72.18279357782122, 'learning_rate': 3.6910872608492764e-07, 'completion_length': 329.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7544643878936768, 'reward_std': 0.06845238525420427, 'kl': 0.33984375, 'epoch': 0.63}
63%|██████▎ | 2705/4286 [20:28:11<17:41:45, 40.29s/it] {'loss': 0.0399, 'grad_norm': 31.98003600439764, 'learning_rate': 3.688754083061129e-07, 'completion_length': 334.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.7656250298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.747767984867096, 'reward_std': 0.13339437916874886, 'kl': 0.998046875, 'epoch': 0.63}
63%|██████▎ | 2706/4286 [20:28:36<15:45:55, 35.92s/it] {'loss': 0.0145, 'grad_norm': 2.9702289228002248, 'learning_rate': 3.686420905272982e-07, 'completion_length': 297.19644927978516, 'rewards/only_full_func_accuracy_reward': 0.6173735558986664, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5816593170166016, 'reward_std': 0.12315160222351551, 'kl': 0.3623046875, 'epoch': 0.63}
63%|██████▎ | 2707/4286 [20:29:02<14:21:37, 32.74s/it] {'loss': 0.0045, 'grad_norm': 3.9270305095711535, 'learning_rate': 3.684087727484834e-07, 'completion_length': 301.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.0560455396771431, 'kl': 0.111572265625, 'epoch': 0.63}
63%|██████▎ | 2708/4286 [20:29:26<13:13:34, 30.17s/it] {'loss': 0.0193, 'grad_norm': 9.705873817836457, 'learning_rate': 3.681754549696687e-07, 'completion_length': 272.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8377976715564728, 'rewards/format_reward': 1.0, 'reward': 1.8377977013587952, 'reward_std': 0.059310127049684525, 'kl': 0.4833984375, 'epoch': 0.63}
63%|██████▎ | 2709/4286 [20:29:51<12:36:17, 28.77s/it] {'loss': 0.0123, 'grad_norm': 19.308369801512637, 'learning_rate': 3.679421371908539e-07, 'completion_length': 316.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.07511191070079803, 'kl': 0.3076171875, 'epoch': 0.63}
63%|██████▎ | 2710/4286 [20:30:17<12:12:40, 27.89s/it] {'loss': 0.0337, 'grad_norm': 2.3909432123075365, 'learning_rate': 3.677088194120392e-07, 'completion_length': 322.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6619048118591309, 'rewards/format_reward': 1.0, 'reward': 1.6619049906730652, 'reward_std': 0.04003841057419777, 'kl': 0.84228515625, 'epoch': 0.63}
63%|██████▎ | 2711/4286 [20:30:42<11:50:30, 27.07s/it] {'loss': 0.0054, 'grad_norm': 1.4132846908125072, 'learning_rate': 3.6747550163322446e-07, 'completion_length': 301.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.04719168459996581, 'kl': 0.1363525390625, 'epoch': 0.63}
63%|██████▎ | 2712/4286 [20:31:07<11:32:36, 26.40s/it] {'loss': 0.0076, 'grad_norm': 4.445078341762109, 'learning_rate': 3.672421838544097e-07, 'completion_length': 335.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.77976194024086, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.0357142835855484, 'kl': 0.190185546875, 'epoch': 0.63}
63%|██████▎ | 2713/4286 [20:31:32<11:22:58, 26.05s/it] {'loss': 0.0118, 'grad_norm': 21.083783986551435, 'learning_rate': 3.6700886607559496e-07, 'completion_length': 305.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7011905312538147, 'rewards/format_reward': 1.0, 'reward': 1.7011906504631042, 'reward_std': 0.020710106939077377, 'kl': 0.29541015625, 'epoch': 0.63}
63%|██████▎ | 2714/4286 [20:31:58<11:18:01, 25.88s/it] {'loss': 0.0539, 'grad_norm': 12.01569421609949, 'learning_rate': 3.667755482967802e-07, 'completion_length': 311.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6994048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6994048953056335, 'reward_std': 0.04761904664337635, 'kl': 1.34765625, 'epoch': 0.63}
63%|██████▎ | 2715/4286 [20:32:23<11:10:33, 25.61s/it] {'loss': 0.0047, 'grad_norm': 6.601200513093158, 'learning_rate': 3.6654223051796545e-07, 'completion_length': 323.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 1.0, 'reward': 1.7485119700431824, 'reward_std': 0.025347059592604637, 'kl': 0.117919921875, 'epoch': 0.63}
63%|██████▎ | 2716/4286 [20:32:49<11:10:43, 25.63s/it] {'loss': 0.0317, 'grad_norm': 17.201825017451704, 'learning_rate': 3.6630891273915073e-07, 'completion_length': 324.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.8020834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7842262983322144, 'reward_std': 0.13362727873027325, 'kl': 0.791259765625, 'epoch': 0.63}
63%|██████▎ | 2717/4286 [20:33:14<11:05:14, 25.44s/it] {'loss': 0.0022, 'grad_norm': 1.7854680238957266, 'learning_rate': 3.6607559496033595e-07, 'completion_length': 315.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.8154762089252472, 'rewards/format_reward': 1.0, 'reward': 1.8154763579368591, 'reward_std': 0.013746432960033417, 'kl': 0.05615234375, 'epoch': 0.63}
63%|██████▎ | 2718/4286 [20:33:38<10:54:38, 25.05s/it] {'loss': 0.0166, 'grad_norm': 13.971263858008294, 'learning_rate': 3.658422771815212e-07, 'completion_length': 297.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.06249999813735485, 'kl': 0.4161376953125, 'epoch': 0.63}
63%|██████▎ | 2719/4286 [20:34:03<10:58:56, 25.23s/it] {'loss': 0.036, 'grad_norm': 2.117008455855653, 'learning_rate': 3.6560895940270645e-07, 'completion_length': 278.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.6944535076618195, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6587393879890442, 'reward_std': 0.12299784272909164, 'kl': 0.900390625, 'epoch': 0.63}
63%|██████▎ | 2720/4286 [20:34:29<10:58:00, 25.21s/it] {'loss': 0.0262, 'grad_norm': 3.691182959142145, 'learning_rate': 3.653756416238917e-07, 'completion_length': 309.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6365221440792084, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6008078455924988, 'reward_std': 0.09526434913277626, 'kl': 0.6552734375, 'epoch': 0.63}
63%|██████▎ | 2721/4286 [20:34:53<10:48:47, 24.87s/it] {'loss': 0.0093, 'grad_norm': 6.227723275617872, 'learning_rate': 3.65142323845077e-07, 'completion_length': 282.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8645834028720856, 'rewards/format_reward': 1.0, 'reward': 1.864583432674408, 'reward_std': 0.008928571827709675, 'kl': 0.2320556640625, 'epoch': 0.63}
64%|██████▎ | 2722/4286 [20:35:17<10:46:56, 24.82s/it] {'loss': 0.0179, 'grad_norm': 5.431481362686553, 'learning_rate': 3.649090060662622e-07, 'completion_length': 315.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7232144474983215, 'reward_std': 0.08769077435135841, 'kl': 0.4482421875, 'epoch': 0.64}
64%|██████▎ | 2723/4286 [20:35:42<10:45:56, 24.80s/it] {'loss': 0.0033, 'grad_norm': 13.951932352576344, 'learning_rate': 3.646756882874475e-07, 'completion_length': 298.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.8407738506793976, 'rewards/format_reward': 1.0, 'reward': 1.8407739400863647, 'reward_std': 0.026785715483129025, 'kl': 0.083251953125, 'epoch': 0.64}
64%|██████▎ | 2724/4286 [20:36:07<10:48:21, 24.91s/it] {'loss': 0.0043, 'grad_norm': 5.538091316355675, 'learning_rate': 3.6444237050863277e-07, 'completion_length': 289.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875001788139343, 'reward_std': 0.06735934130847454, 'kl': 0.107177734375, 'epoch': 0.64}
64%|██████▎ | 2725/4286 [20:36:32<10:49:51, 24.98s/it] {'loss': 0.0026, 'grad_norm': 8.202498674030291, 'learning_rate': 3.64209052729818e-07, 'completion_length': 321.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7901786267757416, 'rewards/format_reward': 1.0, 'reward': 1.790178656578064, 'reward_std': 0.0565476194024086, 'kl': 0.066162109375, 'epoch': 0.64}
64%|██████▎ | 2726/4286 [20:36:58<10:51:35, 25.06s/it] {'loss': 0.005, 'grad_norm': 2.611653968394038, 'learning_rate': 3.6397573495100327e-07, 'completion_length': 335.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8630952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8630954027175903, 'reward_std': 0.04224460944533348, 'kl': 0.125244140625, 'epoch': 0.64}
64%|██████▎ | 2727/4286 [20:37:21<10:38:21, 24.57s/it] {'loss': 0.0025, 'grad_norm': 4.40348876140317, 'learning_rate': 3.637424171721885e-07, 'completion_length': 279.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7663690447807312, 'rewards/format_reward': 1.0, 'reward': 1.7663691639900208, 'reward_std': 0.034579768776893616, 'kl': 0.063232421875, 'epoch': 0.64}
64%|██████▎ | 2728/4286 [20:37:46<10:41:13, 24.69s/it] {'loss': 0.0085, 'grad_norm': 12.593616113363222, 'learning_rate': 3.6350909939337377e-07, 'completion_length': 326.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.7050595581531525, 'rewards/format_reward': 1.0, 'reward': 1.7050597071647644, 'reward_std': 0.06806900538504124, 'kl': 0.212158203125, 'epoch': 0.64}
64%|██████▎ | 2729/4286 [20:38:10<10:38:33, 24.61s/it] {'loss': 0.003, 'grad_norm': 1.5731452284484113, 'learning_rate': 3.6327578161455904e-07, 'completion_length': 301.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7726190686225891, 'rewards/format_reward': 1.0, 'reward': 1.7726191282272339, 'reward_std': 0.05108060035854578, 'kl': 0.0758056640625, 'epoch': 0.64}
64%|██████▎ | 2730/4286 [20:38:34<10:32:28, 24.39s/it] {'loss': 0.0171, 'grad_norm': 2.4115313866329418, 'learning_rate': 3.6304246383574426e-07, 'completion_length': 312.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6822916865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6644346714019775, 'reward_std': 0.07589286006987095, 'kl': 0.4267578125, 'epoch': 0.64}
64%|██████▎ | 2731/4286 [20:38:59<10:32:22, 24.40s/it] {'loss': 0.0081, 'grad_norm': 12.431313538323627, 'learning_rate': 3.6280914605692954e-07, 'completion_length': 291.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7607143223285675, 'rewards/format_reward': 1.0, 'reward': 1.76071435213089, 'reward_std': 0.08690830413252115, 'kl': 0.20172119140625, 'epoch': 0.64}
64%|██████▎ | 2732/4286 [20:39:25<10:46:26, 24.96s/it] {'loss': 0.0158, 'grad_norm': 13.398927842129567, 'learning_rate': 3.6257582827811476e-07, 'completion_length': 307.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7023810744285583, 'reward_std': 0.10714286286383867, 'kl': 0.3955078125, 'epoch': 0.64}
64%|██████▍ | 2733/4286 [20:39:50<10:45:01, 24.92s/it] {'loss': 0.0107, 'grad_norm': 6.510704978848437, 'learning_rate': 3.6234251049930004e-07, 'completion_length': 306.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6577381193637848, 'rewards/format_reward': 1.0, 'reward': 1.6577382683753967, 'reward_std': 0.06388125941157341, 'kl': 0.267578125, 'epoch': 0.64}
64%|██████▍ | 2734/4286 [20:40:14<10:42:20, 24.83s/it] {'loss': 0.0029, 'grad_norm': 12.070270306698758, 'learning_rate': 3.621091927204853e-07, 'completion_length': 344.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.9017857611179352, 'rewards/format_reward': 1.0, 'reward': 1.9017857909202576, 'reward_std': 0.0476190485060215, 'kl': 0.0732421875, 'epoch': 0.64}
[20:40:14<10:42:20, 24.83s/it] 64%|██████▍ | 2735/4286 [20:40:40<10:45:37, 24.98s/it] {'loss': 0.0082, 'grad_norm': 1.2048170496095638, 'learning_rate': 3.6187587494167053e-07, 'completion_length': 316.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.8214287161827087, 'reward_std': 0.04007172957062721, 'kl': 0.20654296875, 'epoch': 0.64} 64%|██████▍ | 2735/4286 [20:40:40<10:45:37, 24.98s/it] 64%|██████▍ | 2736/4286 [20:41:03<10:33:19, 24.52s/it] {'loss': 0.0106, 'grad_norm': 1.9353988023694257, 'learning_rate': 3.616425571628558e-07, 'completion_length': 306.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7916667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7916668057441711, 'reward_std': 0.06772436108440161, 'kl': 0.26678466796875, 'epoch': 0.64} 64%|██████▍ | 2736/4286 [20:41:03<10:33:19, 24.52s/it] 64%|██████▍ | 2737/4286 [20:41:28<10:37:36, 24.70s/it] {'loss': 0.0073, 'grad_norm': 1.8379034237041934, 'learning_rate': 3.6140923938404103e-07, 'completion_length': 326.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7827381193637848, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.010309826582670212, 'kl': 0.181640625, 'epoch': 0.64} 64%|██████▍ | 2737/4286 [20:41:28<10:37:36, 24.70s/it] 64%|██████▍ | 2738/4286 [20:41:53<10:33:13, 24.54s/it] {'loss': 0.0047, 'grad_norm': 0.8966787155662719, 'learning_rate': 3.611759216052263e-07, 'completion_length': 283.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6711310148239136, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.04304792732000351, 'kl': 0.1185302734375, 'epoch': 0.64} 64%|██████▍ | 2738/4286 [20:41:53<10:33:13, 24.54s/it] 64%|██████▍ | 2739/4286 [20:42:17<10:36:03, 24.67s/it] {'loss': 0.0049, 'grad_norm': 1.218558013538848, 'learning_rate': 3.609426038264116e-07, 'completion_length': 304.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6056548357009888, 'reward_std': 0.03114316053688526, 'kl': 0.1229248046875, 'epoch': 0.64} 64%|██████▍ | 2739/4286 [20:42:17<10:36:03, 24.67s/it] 64%|██████▍ | 2740/4286 [20:42:42<10:34:51, 24.64s/it] {'loss': 0.0125, 'grad_norm': 5.2332621426398696, 'learning_rate': 3.607092860475968e-07, 'completion_length': 302.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.05691400729119778, 'kl': 0.312255859375, 'epoch': 0.64} 64%|██████▍ | 2740/4286 [20:42:42<10:34:51, 24.64s/it] 64%|██████▍ | 2741/4286 [20:43:07<10:40:12, 24.86s/it] {'loss': 0.0068, 'grad_norm': 7.127090999410053, 'learning_rate': 3.604759682687821e-07, 'completion_length': 283.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6681548953056335, 'reward_std': 0.1041666641831398, 'kl': 0.169189453125, 'epoch': 0.64} 64%|██████▍ | 2741/4286 [20:43:07<10:40:12, 24.86s/it] 64%|██████▍ | 2742/4286 [20:43:33<10:43:26, 25.00s/it] {'loss': 0.0112, 'grad_norm': 13.508918216280538, 'learning_rate': 3.602426504899673e-07, 'completion_length': 325.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.8241071701049805, 'rewards/format_reward': 1.0, 'reward': 1.82410728931427, 'reward_std': 0.05073629692196846, 'kl': 0.27978515625, 'epoch': 0.64} 64%|██████▍ | 2742/4286 [20:43:33<10:43:26, 25.00s/it] 64%|██████▍ | 2743/4286 
[20:43:58<10:43:25, 25.02s/it] {'loss': 0.014, 'grad_norm': 2.4110566711464583, 'learning_rate': 3.600093327111526e-07, 'completion_length': 315.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7187500596046448, 'rewards/format_reward': 1.0, 'reward': 1.7187501192092896, 'reward_std': 0.015801788307726383, 'kl': 0.351806640625, 'epoch': 0.64} 64%|██████▍ | 2743/4286 [20:43:58<10:43:25, 25.02s/it] 64%|██████▍ | 2744/4286 [20:44:22<10:36:57, 24.78s/it] {'loss': 0.0042, 'grad_norm': 2.8712553715593856, 'learning_rate': 3.5977601493233785e-07, 'completion_length': 294.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6979167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.04464286006987095, 'kl': 0.1044921875, 'epoch': 0.64} 64%|██████▍ | 2744/4286 [20:44:22<10:36:57, 24.78s/it] 64%|██████▍ | 2745/4286 [20:44:47<10:35:03, 24.73s/it] {'loss': 0.0062, 'grad_norm': 6.63454296188468, 'learning_rate': 3.5954269715352307e-07, 'completion_length': 301.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 1.0, 'reward': 1.7529763579368591, 'reward_std': 0.10990536585450172, 'kl': 0.154541015625, 'epoch': 0.64} 64%|██████▍ | 2745/4286 [20:44:47<10:35:03, 24.73s/it] 64%|██████▍ | 2746/4286 [20:45:10<10:26:54, 24.42s/it] {'loss': 0.0103, 'grad_norm': 1.441643379000555, 'learning_rate': 3.5930937937470835e-07, 'completion_length': 295.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7976191341876984, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.011904759332537651, 'kl': 0.25830078125, 'epoch': 0.64} 64%|██████▍ | 2746/4286 [20:45:10<10:26:54, 24.42s/it] 64%|██████▍ | 2747/4286 [20:45:37<10:40:12, 24.96s/it] {'loss': 0.0113, 'grad_norm': 5.112721497256474, 'learning_rate': 3.590760615958936e-07, 'completion_length': 273.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 1.0, 'reward': 1.7991072535514832, 'reward_std': 0.0625000037252903, 'kl': 0.28271484375, 'epoch': 0.64} 64%|██████▍ | 2747/4286 [20:45:37<10:40:12, 24.96s/it] 64%|██████▍ | 2748/4286 [20:46:00<10:26:18, 24.43s/it] {'loss': 0.0047, 'grad_norm': 3.7542717615980985, 'learning_rate': 3.5884274381707884e-07, 'completion_length': 249.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.7336309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.014880950096994638, 'kl': 0.11865234375, 'epoch': 0.64} 64%|██████▍ | 2748/4286 [20:46:00<10:26:18, 24.43s/it] 64%|██████▍ | 2749/4286 [20:46:24<10:25:27, 24.42s/it] {'loss': 0.0094, 'grad_norm': 4.080963765716776, 'learning_rate': 3.586094260382641e-07, 'completion_length': 292.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7276785969734192, 'rewards/format_reward': 1.0, 'reward': 1.727678656578064, 'reward_std': 0.03188732638955116, 'kl': 0.23486328125, 'epoch': 0.64} 64%|██████▍ | 2749/4286 [20:46:24<10:25:27, 24.42s/it] 64%|██████▍ | 2750/4286 [20:46:48<10:18:08, 24.15s/it] {'loss': 0.0049, 'grad_norm': 14.72644908413187, 'learning_rate': 3.5837610825944934e-07, 'completion_length': 257.7678756713867, 'rewards/only_full_func_accuracy_reward': 0.702381044626236, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.04664513934403658, 'kl': 0.1224365234375, 'epoch': 0.64} 64%|██████▍ | 2750/4286 [20:46:48<10:18:08, 24.15s/it] 64%|██████▍ | 2751/4286 [20:47:12<10:19:16, 24.21s/it] {'loss': 0.0172, 'grad_norm': 1.1510481120780398, 
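Two regularities hold across every record in this stretch of the run; both are inferred from the numbers themselves, the log never states them. First, 'reward' is always the sum of 'rewards/only_full_func_accuracy_reward' and 'rewards/format_reward', up to float32 rounding. Second, 'rewards/format_reward' only takes values such as 1.0, 0.9821428656578064 (55/56 in float32) and 0.9642857313156128 (54/56), which suggests each logged value is a mean over 56 sampled completions per step. A minimal sketch that checks the first identity on one record, assuming each {...} blob in this log parses as a Python dict literal:

    # Parse one metrics dict from the log and verify that the two reward
    # components sum to the logged 'reward' (up to float32 rounding).
    import ast

    record = ("{'loss': 0.0025, 'grad_norm': 4.40348876140317, "
              "'learning_rate': 3.637424171721885e-07, "
              "'completion_length': 279.2321472167969, "
              "'rewards/only_full_func_accuracy_reward': 0.7663690447807312, "
              "'rewards/format_reward': 1.0, 'reward': 1.7663691639900208, "
              "'reward_std': 0.034579768776893616, 'kl': 0.063232421875, "
              "'epoch': 0.64}")

    m = ast.literal_eval(record)
    total = m['rewards/only_full_func_accuracy_reward'] + m['rewards/format_reward']
    assert abs(total - m['reward']) < 1e-4  # holds for every record in this log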
64%|██████▍ | 2751/4286 [20:47:12<10:19:16, 24.21s/it] {'loss': 0.0172, 'grad_norm': 1.1510481120780398, 'learning_rate': 3.581427904806346e-07, 'completion_length': 285.9643020629883, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7053572535514832, 'reward_std': 0.08083896804600954, 'kl': 0.43017578125, 'epoch': 0.64}
[2025-03-03 11:45:01,285] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
64%|██████▍ | 2752/4286 [20:47:38<10:35:10, 24.84s/it] {'loss': 0.0137, 'grad_norm': 1.4195513910397068, 'learning_rate': 3.579094727018199e-07, 'completion_length': 274.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7931548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7931548357009888, 'reward_std': 0.09410358220338821, 'kl': 0.34326171875, 'epoch': 0.64}
64%|██████▍ | 2753/4286 [20:48:01<10:18:18, 24.20s/it] {'loss': 0.0068, 'grad_norm': 4.020355914058721, 'learning_rate': 3.576761549230051e-07, 'completion_length': 252.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.8586310148239136, 'rewards/format_reward': 1.0, 'reward': 1.8586310744285583, 'reward_std': 0.008928571827709675, 'kl': 0.169189453125, 'epoch': 0.64}
64%|██████▍ | 2754/4286 [20:48:24<10:10:29, 23.91s/it] {'loss': 0.0105, 'grad_norm': 4.09608133252915, 'learning_rate': 3.574428371441904e-07, 'completion_length': 268.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7514880895614624, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.020833334187045693, 'kl': 0.26318359375, 'epoch': 0.64}
64%|██████▍ | 2755/4286 [20:48:48<10:12:00, 23.98s/it] {'loss': 0.0201, 'grad_norm': 1.5609292170259184, 'learning_rate': 3.572095193653756e-07, 'completion_length': 311.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.035714288242161274, 'kl': 0.500732421875, 'epoch': 0.64}
64%|██████▍ | 2756/4286 [20:49:12<10:06:45, 23.79s/it] {'loss': 0.0073, 'grad_norm': 3.9464645544525676, 'learning_rate': 3.569762015865609e-07, 'completion_length': 292.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.0476190522313118, 'kl': 0.1827392578125, 'epoch': 0.64}
64%|██████▍ | 2757/4286 [20:49:36<10:13:04, 24.06s/it] {'loss': 0.013, 'grad_norm': 1.8630754297763823, 'learning_rate': 3.5674288380774616e-07, 'completion_length': 313.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8616072237491608, 'rewards/format_reward': 1.0, 'reward': 1.8616071939468384, 'reward_std': 0.0029761905316263437, 'kl': 0.3267822265625, 'epoch': 0.64}
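The stage3.py warning above recurs through the rest of the run whenever memory pressure forces allocator cache flushes. Its own suggested remedy is to call get_accelerator().empty_cache() at the same point on all ranks. A minimal sketch of that remedy; the helper name maybe_flush and the flush period are illustrative assumptions, not anything this training script is known to use:

    # Synchronized allocator-cache flush, as the DeepSpeed warning suggests.
    from deepspeed.accelerator import get_accelerator

    def maybe_flush(step: int, period: int = 10) -> None:
        """Flush the CUDA allocator cache every `period` steps.

        Call this at the same point in the loop on every rank (e.g. right
        after engine.step()), so that all ranks flush together instead of
        stalling one another with unsynchronized flushes.
        """
        if step % period == 0:
            get_accelerator().empty_cache()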
64%|██████▍ | 2758/4286 [20:50:02<10:27:17, 24.63s/it] {'loss': 0.008, 'grad_norm': 2.403934974352761, 'learning_rate': 3.565095660289314e-07, 'completion_length': 319.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.680059552192688, 'rewards/format_reward': 1.0, 'reward': 1.6800596714019775, 'reward_std': 0.008928571827709675, 'kl': 0.20068359375, 'epoch': 0.64}
64%|██████▍ | 2759/4286 [20:50:28<10:30:49, 24.79s/it] {'loss': 0.0121, 'grad_norm': 3.9316524079184503, 'learning_rate': 3.5627624825011666e-07, 'completion_length': 326.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.04106704518198967, 'kl': 0.3017578125, 'epoch': 0.64}
64%|██████▍ | 2760/4286 [20:50:53<10:32:50, 24.88s/it] {'loss': 0.0036, 'grad_norm': 2.315003751783615, 'learning_rate': 3.560429304713019e-07, 'completion_length': 300.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7827381193637848, 'rewards/format_reward': 1.0, 'reward': 1.7827382683753967, 'reward_std': 0.0178571417927742, 'kl': 0.090087890625, 'epoch': 0.64}
64%|██████▍ | 2761/4286 [20:51:18<10:32:23, 24.88s/it] {'loss': 0.0086, 'grad_norm': 52.07372834190524, 'learning_rate': 3.5580961269248716e-07, 'completion_length': 322.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.06420914083719254, 'kl': 0.214111328125, 'epoch': 0.64}
64%|██████▍ | 2762/4286 [20:51:42<10:31:46, 24.87s/it] {'loss': 0.016, 'grad_norm': 4.230429152255019, 'learning_rate': 3.5557629491367243e-07, 'completion_length': 285.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.8526786267757416, 'rewards/format_reward': 1.0, 'reward': 1.852678656578064, 'reward_std': 0.0295482249930501, 'kl': 0.40185546875, 'epoch': 0.64}
64%|██████▍ | 2763/4286 [20:52:07<10:29:22, 24.80s/it] {'loss': 0.0066, 'grad_norm': 8.622846015476359, 'learning_rate': 3.5534297713485765e-07, 'completion_length': 320.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.8050595819950104, 'rewards/format_reward': 1.0, 'reward': 1.8050596714019775, 'reward_std': 0.028627384454011917, 'kl': 0.1654052734375, 'epoch': 0.64}
64%|██████▍ | 2764/4286 [20:52:32<10:30:49, 24.87s/it] {'loss': 0.0123, 'grad_norm': 5.462630211131574, 'learning_rate': 3.5510965935604293e-07, 'completion_length': 302.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.07142857648432255, 'kl': 0.30615234375, 'epoch': 0.64}
65%|██████▍ | 2765/4286 [20:52:58<10:35:44, 25.08s/it] {'loss': 0.0197, 'grad_norm': 8.230522922088065, 'learning_rate': 3.5487634157722815e-07, 'completion_length': 315.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7872024476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7693454027175903, 'reward_std': 0.04750441014766693, 'kl': 0.4912109375, 'epoch': 0.65}
65%|██████▍ | 2766/4286 [20:53:22<10:28:47, 24.82s/it] {'loss': 0.0087, 'grad_norm': 2.476137929107636, 'learning_rate': 3.5464302379841343e-07, 'completion_length': 303.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.8273809552192688, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.042587509378790855, 'kl': 0.21826171875, 'epoch': 0.65}
[2025-03-03 11:51:10,628] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
65%|██████▍ | 2767/4286 [20:53:48<10:36:04, 25.12s/it] {'loss': 0.0043, 'grad_norm': 9.341408411747151, 'learning_rate': 3.544097060195987e-07, 'completion_length': 316.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7544643878936768, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.0744047574698925, 'kl': 0.107421875, 'epoch': 0.65}
65%|██████▍ | 2768/4286 [20:54:12<10:26:13, 24.75s/it] {'loss': 0.0056, 'grad_norm': 1.360548461820981, 'learning_rate': 3.541763882407839e-07, 'completion_length': 300.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.761904776096344, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.0, 'kl': 0.139404296875, 'epoch': 0.65}
[2025-03-03 11:51:59,781] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
65%|██████▍ | 2769/4286 [20:54:37<10:29:45, 24.91s/it] {'loss': 0.0076, 'grad_norm': 3.168709548956697, 'learning_rate': 3.539430704619692e-07, 'completion_length': 305.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6592262089252472, 'rewards/format_reward': 1.0, 'reward': 1.6592262983322144, 'reward_std': 0.04329465702176094, 'kl': 0.18927001953125, 'epoch': 0.65}
65%|██████▍ | 2770/4286 [20:55:02<10:29:48, 24.93s/it] {'loss': 0.0219, 'grad_norm': 4.631933287341526, 'learning_rate': 3.537097526831545e-07, 'completion_length': 310.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7480867803096771, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7302297353744507, 'reward_std': 0.07905752398073673, 'kl': 0.5478515625, 'epoch': 0.65}
65%|██████▍ | 2771/4286 [20:55:26<10:26:28, 24.81s/it] {'loss': 0.0127, 'grad_norm': 6.342391409672019, 'learning_rate': 3.534764349043397e-07, 'completion_length': 302.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.8467262089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8288691639900208, 'reward_std': 0.09752450883388519, 'kl': 0.31640625, 'epoch': 0.65}
65%|██████▍ | 2772/4286 [20:55:51<10:27:33, 24.87s/it] {'loss': 0.0038, 'grad_norm': 2.57096407573688, 'learning_rate': 3.5324311712552497e-07, 'completion_length': 331.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.8184524774551392, 'reward_std': 0.017857138067483902, 'kl': 0.09619140625, 'epoch': 0.65}
65%|██████▍ | 2773/4286 [20:56:17<10:31:49, 25.06s/it] {'loss': 0.004, 'grad_norm': 5.960999716029111, 'learning_rate': 3.530097993467102e-07, 'completion_length': 303.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5449405312538147, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5270834565162659, 'reward_std': 0.06931354943662882, 'kl': 0.10107421875, 'epoch': 0.65}
65%|██████▍ | 2774/4286 [20:56:40<10:19:20, 24.58s/it] {'loss': 0.0041, 'grad_norm': 4.652187113249164, 'learning_rate': 3.5277648156789547e-07, 'completion_length': 282.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.0476190522313118, 'kl': 0.1025390625, 'epoch': 0.65}
65%|██████▍ | 2775/4286 [20:57:04<10:15:16, 24.43s/it] {'loss': 0.004, 'grad_norm': 7.396895806348429, 'learning_rate': 3.5254316378908074e-07, 'completion_length': 324.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.8080357313156128, 'rewards/format_reward': 1.0, 'reward': 1.8080357909202576, 'reward_std': 0.05243690870702267, 'kl': 0.099853515625, 'epoch': 0.65}
[2025-03-03 11:54:52,229] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
65%|██████▌ | 2776/4286 [20:57:29<10:18:17, 24.57s/it] {'loss': 0.0144, 'grad_norm': 6.573416850766125, 'learning_rate': 3.5230984601026597e-07, 'completion_length': 280.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.5074404925107956, 'rewards/format_reward': 1.0, 'reward': 1.5074405670166016, 'reward_std': 0.0386904738843441, 'kl': 0.3609619140625, 'epoch': 0.65}
65%|██████▌ | 2777/4286 [20:57:55<10:27:12, 24.94s/it] {'loss': 0.0174, 'grad_norm': 1.4730640188049875, 'learning_rate': 3.5207652823145124e-07, 'completion_length': 317.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6782280802726746, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6425138115882874, 'reward_std': 0.10782967507839203, 'kl': 0.4345703125, 'epoch': 0.65}
65%|██████▌ | 2778/4286 [20:58:20<10:28:30, 25.01s/it] {'loss': 0.0031, 'grad_norm': 2.896398046969215, 'learning_rate': 3.5184321045263646e-07, 'completion_length': 307.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7842262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7842262387275696, 'reward_std': 0.0208333320915699, 'kl': 0.078125, 'epoch': 0.65}
65%|██████▌ | 2779/4286 [20:58:43<10:14:30, 24.47s/it] {'loss': 0.0018, 'grad_norm': 5.117264420392026, 'learning_rate': 3.5160989267382174e-07, 'completion_length': 292.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.06731786392629147, 'kl': 0.0458984375, 'epoch': 0.65}
65%|██████▌ | 2780/4286 [20:59:07<10:05:15, 24.11s/it] {'loss': 0.0131, 'grad_norm': 9.91046313306326, 'learning_rate': 3.51376574895007e-07, 'completion_length': 244.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.6934524774551392, 'rewards/format_reward': 1.0, 'reward': 1.6934524774551392, 'reward_std': 0.03703813999891281, 'kl': 0.327880859375, 'epoch': 0.65}
65%|██████▌ | 2781/4286 [20:59:31<10:06:40, 24.19s/it] {'loss': 0.0054, 'grad_norm': 84.8258594175866, 'learning_rate': 3.5114325711619224e-07, 'completion_length': 289.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.7916666865348816, 'rewards/format_reward': 1.0, 'reward': 1.7916668057441711, 'reward_std': 0.06823870167136192, 'kl': 0.135498046875, 'epoch': 0.65}
65%|██████▌ | 2782/4286 [20:59:56<10:09:13, 24.30s/it] {'loss': 0.0015, 'grad_norm': 5.5892048730688915, 'learning_rate': 3.509099393373775e-07, 'completion_length': 309.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.7083335518836975, 'reward_std': 0.052597517147660255, 'kl': 0.0379638671875, 'epoch': 0.65}
65%|██████▌ | 2783/4286 [21:00:20<10:08:07, 24.28s/it] {'loss': 0.0078, 'grad_norm': 8.496804606869135, 'learning_rate': 3.5067662155856273e-07, 'completion_length': 305.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.05193009041249752, 'kl': 0.19580078125, 'epoch': 0.65}
65%|██████▌ | 2784/4286 [21:00:44<10:06:38, 24.23s/it] {'loss': 0.0081, 'grad_norm': 4.0178633937708375, 'learning_rate': 3.50443303779748e-07, 'completion_length': 314.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7724965512752533, 'rewards/format_reward': 1.0, 'reward': 1.7724965810775757, 'reward_std': 0.07467564567923546, 'kl': 0.201904296875, 'epoch': 0.65}
65%|██████▌ | 2785/4286 [21:01:08<10:03:11, 24.11s/it] {'loss': 0.0057, 'grad_norm': 3.2591082132689184, 'learning_rate': 3.502099860009333e-07, 'completion_length': 300.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8497024476528168, 'rewards/format_reward': 1.0, 'reward': 1.8497024774551392, 'reward_std': 0.03457976318895817, 'kl': 0.1416015625, 'epoch': 0.65}
65%|██████▌ | 2786/4286 [21:01:33<10:10:08, 24.41s/it] {'loss': 0.0057, 'grad_norm': 28.674839853738153, 'learning_rate': 3.499766682221185e-07, 'completion_length': 304.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949405670166016, 'reward_std': 0.14370840042829514, 'kl': 0.141845703125, 'epoch': 0.65}
65%|██████▌ | 2787/4286 [21:01:57<10:03:36, 24.16s/it] {'loss': 0.0022, 'grad_norm': 1.5700955868202977, 'learning_rate': 3.497433504433038e-07, 'completion_length': 308.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.022675009444355965, 'kl': 0.0550537109375, 'epoch': 0.65}
65%|██████▌ | 2788/4286 [21:02:22<10:11:07, 24.48s/it] {'loss': 0.0061, 'grad_norm': 2.4143431270259237, 'learning_rate': 3.49510032664489e-07, 'completion_length': 316.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.032524414360523224, 'kl': 0.1527099609375, 'epoch': 0.65}
65%|██████▌ | 2789/4286 [21:02:46<10:08:28, 24.39s/it] {'loss': 0.0134, 'grad_norm': 5.153208604130287, 'learning_rate': 3.492767148856743e-07, 'completion_length': 285.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7931548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7931548357009888, 'reward_std': 0.06274674274027348, 'kl': 0.3359375, 'epoch': 0.65}
65%|██████▌ | 2790/4286 [21:03:11<10:13:04, 24.59s/it] {'loss': 0.0039, 'grad_norm': 4.143894610201406, 'learning_rate': 3.4904339710685955e-07, 'completion_length': 318.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6071429252624512, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.06823869794607162, 'kl': 0.0975341796875, 'epoch': 0.65}
65%|██████▌ | 2791/4286 [21:03:35<10:10:31, 24.50s/it] {'loss': 0.0086, 'grad_norm': 7.559169323057194, 'learning_rate': 3.488100793280448e-07, 'completion_length': 284.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.025392779614776373, 'kl': 0.21533203125, 'epoch': 0.65}
[2025-03-03 12:01:24,409] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
65%|██████▌ | 2792/4286 [21:04:01<10:22:35, 25.00s/it] {'loss': 0.0056, 'grad_norm': 1.460253690501454, 'learning_rate': 3.4857676154923005e-07, 'completion_length': 275.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.8088189959526062, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7909618616104126, 'reward_std': 0.08313725143671036, 'kl': 0.1396484375, 'epoch': 0.65}
[2025-03-03 12:01:49,774] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
65%|██████▌ | 2793/4286 [21:04:27<10:24:52, 25.11s/it] {'loss': 0.0026, 'grad_norm': 1.6238199787466863, 'learning_rate': 3.4834344377041533e-07, 'completion_length': 301.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 1.0, 'reward': 1.794642984867096, 'reward_std': 0.005952378269284964, 'kl': 0.0643310546875, 'epoch': 0.65}
65%|██████▌ | 2794/4286 [21:04:51<10:15:53, 24.77s/it] {'loss': 0.0024, 'grad_norm': 1.3031834890319465, 'learning_rate': 3.4811012599160055e-07, 'completion_length': 284.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7827381789684296, 'rewards/format_reward': 1.0, 'reward': 1.7827382683753967, 'reward_std': 0.029761902987957, 'kl': 0.0595703125, 'epoch': 0.65}
65%|██████▌ | 2795/4286 [21:05:15<10:10:05, 24.55s/it] {'loss': 0.002, 'grad_norm': 0.2446367151795919, 'learning_rate': 3.478768082127858e-07, 'completion_length': 263.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.8333334028720856, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.0, 'kl': 0.049560546875, 'epoch': 0.65}
65%|██████▌ | 2796/4286 [21:05:38<10:02:35, 24.27s/it] {'loss': 0.0178, 'grad_norm': 3.478407008158062, 'learning_rate': 3.4764349043397105e-07, 'completion_length': 302.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.6205357909202576, 'rewards/format_reward': 1.0, 'reward': 1.6205358505249023, 'reward_std': 0.05059524066746235, 'kl': 0.443359375, 'epoch': 0.65}
65%|██████▌ | 2797/4286 [21:06:03<10:00:39, 24.20s/it] {'loss': 0.0044, 'grad_norm': 5.459485899507622, 'learning_rate': 3.474101726551563e-07, 'completion_length': 313.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.050081745721399784, 'kl': 0.109130859375, 'epoch': 0.65}
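The 'learning_rate' column is consistent with a linear decay to zero over the 4286 total steps shown in the progress bar, from an inferred peak of 1e-6; the log itself never prints the schedule, so both numbers are reconstructions. The displayed 'epoch' is likewise just step/4286 rounded to two decimals, i.e. the run is a single pass of 4286 optimizer steps. A quick check against the step 2797 record above:

    # Hedged reconstruction of the LR schedule behind the logged values.
    PEAK_LR, TOTAL_STEPS = 1e-6, 4286  # inferred from the numbers, not stated in the log

    def linear_lr(step: int) -> float:
        # Linear decay from PEAK_LR at step 0 to 0 at TOTAL_STEPS.
        return PEAK_LR * (1 - step / TOTAL_STEPS)

    # Step 2797 logs 'learning_rate': 3.474101726551563e-07 ...
    assert abs(linear_lr(2797) - 3.474101726551563e-07) < 1e-15
    # ... and 'epoch': 0.65, the step counter as a fraction of one pass.
    assert round(2797 / 4286, 2) == 0.65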
65%|██████▌ | 2798/4286 [21:06:27<10:02:36, 24.30s/it] {'loss': 0.0047, 'grad_norm': 3.425620644649697, 'learning_rate': 3.471768548763416e-07, 'completion_length': 311.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6770833432674408, 'rewards/format_reward': 1.0, 'reward': 1.6770834922790527, 'reward_std': 0.07673990493640304, 'kl': 0.11767578125, 'epoch': 0.65}
65%|██████▌ | 2799/4286 [21:06:53<10:12:22, 24.71s/it] {'loss': 0.0015, 'grad_norm': 4.721474066401416, 'learning_rate': 3.469435370975268e-07, 'completion_length': 320.875, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.029761902987957, 'kl': 0.037353515625, 'epoch': 0.65}
65%|██████▌ | 2800/4286 [21:07:17<10:10:42, 24.66s/it] {'loss': 0.0029, 'grad_norm': 0.9187150016343862, 'learning_rate': 3.467102193187121e-07, 'completion_length': 311.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.8779762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8779762983322144, 'reward_std': 0.01785714365541935, 'kl': 0.071533203125, 'epoch': 0.65}
65%|██████▌ | 2801/4286 [21:10:38<31:56:17, 77.43s/it] {'loss': 0.0109, 'grad_norm': 17.512406824560056, 'learning_rate': 3.464769015398973e-07, 'completion_length': 286.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7187500596046448, 'rewards/format_reward': 1.0, 'reward': 1.7187501192092896, 'reward_std': 0.040532153099775314, 'kl': 0.2734375, 'epoch': 0.65}
65%|██████▌ | 2802/4286 [21:11:05<25:45:55, 62.50s/it] {'loss': 0.016, 'grad_norm': 1.6464852539276111, 'learning_rate': 3.462435837610826e-07, 'completion_length': 276.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.848214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8303571939468384, 'reward_std': 0.07142857182770967, 'kl': 0.400390625, 'epoch': 0.65}
65%|██████▌ | 2803/4286 [21:11:31<21:08:06, 51.31s/it] {'loss': 0.0177, 'grad_norm': 5.968949951830518, 'learning_rate': 3.4601026598226787e-07, 'completion_length': 286.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.6919642984867096, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.091248270124197, 'kl': 0.44140625, 'epoch': 0.65}
[2025-03-03 12:09:20,350] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
65%|██████▌ | 2804/4286 [21:11:57<18:05:25, 43.94s/it] {'loss': 0.007, 'grad_norm': 2.6067994162819885, 'learning_rate': 3.457769482034531e-07, 'completion_length': 317.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.04685882292687893, 'kl': 0.1748046875, 'epoch': 0.65}
65%|██████▌ | 2805/4286 [21:12:21<15:36:54, 37.96s/it] {'loss': 0.0039, 'grad_norm': 1.6523540760505173, 'learning_rate': 3.4554363042463836e-07, 'completion_length': 268.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 1.0, 'reward': 1.7485119700431824, 'reward_std': 0.026785715483129025, 'kl': 0.0986328125, 'epoch': 0.65}
65%|██████▌ | 2806/4286 [21:12:45<13:48:59, 33.61s/it] {'loss': 0.0112, 'grad_norm': 3.6066660717060683, 'learning_rate': 3.453103126458236e-07, 'completion_length': 293.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.5997024476528168, 'rewards/format_reward': 1.0, 'reward': 1.5997024774551392, 'reward_std': 0.043204203713685274, 'kl': 0.27978515625, 'epoch': 0.65}
65%|██████▌ | 2807/4286 [21:13:09<12:34:53, 30.62s/it] {'loss': 0.0067, 'grad_norm': 2.553430436658667, 'learning_rate': 3.4507699486700886e-07, 'completion_length': 281.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.8065476417541504, 'rewards/format_reward': 1.0, 'reward': 1.8065478205680847, 'reward_std': 0.03160357475280762, 'kl': 0.167724609375, 'epoch': 0.65}
66%|██████▌ | 2808/4286 [21:13:33<11:49:41, 28.81s/it] {'loss': 0.0172, 'grad_norm': 1.2030953622793157, 'learning_rate': 3.4484367708819414e-07, 'completion_length': 306.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.0, 'kl': 0.4290771484375, 'epoch': 0.66}
66%|██████▌ | 2809/4286 [21:13:57<11:13:14, 27.35s/it] {'loss': 0.0076, 'grad_norm': 5.315967469247212, 'learning_rate': 3.4461035930937936e-07, 'completion_length': 287.625, 'rewards/only_full_func_accuracy_reward': 0.8273810148239136, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.05952380783855915, 'kl': 0.189453125, 'epoch': 0.66}
66%|██████▌ | 2810/4286 [21:14:22<10:55:50, 26.66s/it] {'loss': 0.0091, 'grad_norm': 3.934128617541346, 'learning_rate': 3.4437704153056463e-07, 'completion_length': 315.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548357009888, 'reward_std': 0.023508869111537933, 'kl': 0.228515625, 'epoch': 0.66}
66%|██████▌ | 2811/4286 [21:14:46<10:32:09, 25.71s/it] {'loss': 0.0053, 'grad_norm': 5.940039899889948, 'learning_rate': 3.4414372375174986e-07, 'completion_length': 279.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.754464328289032, 'reward_std': 0.07280982658267021, 'kl': 0.1318359375, 'epoch': 0.66}
66%|██████▌ | 2812/4286 [21:15:09<10:14:56, 25.03s/it] {'loss': 0.007, 'grad_norm': 3.823327920101851, 'learning_rate': 3.4391040597293513e-07, 'completion_length': 292.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.06374205276370049, 'kl': 0.175537109375, 'epoch': 0.66}
66%|██████▌ | 2813/4286 [21:15:32<10:02:35, 24.55s/it] {'loss': 0.0017, 'grad_norm': 5.105366096243016, 'learning_rate': 3.436770881941204e-07, 'completion_length': 278.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.018777981400489807, 'kl': 0.0421142578125, 'epoch': 0.66}
66%|██████▌ | 2814/4286 [21:15:57<10:01:46, 24.53s/it] {'loss': 0.0018, 'grad_norm': 0.5873387277516599, 'learning_rate': 3.4344377041530563e-07, 'completion_length': 271.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.8645834028720856, 'rewards/format_reward': 1.0, 'reward': 1.864583432674408, 'reward_std': 0.008928571827709675, 'kl': 0.0447998046875, 'epoch': 0.66}
66%|██████▌ | 2815/4286 [21:16:22<10:03:19, 24.61s/it] {'loss': 0.0067, 'grad_norm': 2.8046031040557517, 'learning_rate': 3.432104526364909e-07, 'completion_length': 329.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7499999701976776, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.03755596047267318, 'kl': 0.1668701171875, 'epoch': 0.66}
66%|██████▌ | 2816/4286 [21:16:47<10:07:30, 24.80s/it] {'loss': 0.0069, 'grad_norm': 8.465880443619232, 'learning_rate': 3.429771348576762e-07, 'completion_length': 299.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8407738506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8229167461395264, 'reward_std': 0.10650577582418919, 'kl': 0.17333984375, 'epoch': 0.66}
66%|██████▌ | 2817/4286 [21:17:10<9:50:32, 24.12s/it] {'loss': 0.0028, 'grad_norm': 1.1492119017597444, 'learning_rate': 3.427438170788614e-07, 'completion_length': 226.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6875001192092896, 'reward_std': 0.03411934990435839, 'kl': 0.0689697265625, 'epoch': 0.66}
66%|██████▌ | 2818/4286 [21:17:35<10:03:19, 24.66s/it] {'loss': 0.0041, 'grad_norm': 4.9731663262062, 'learning_rate': 3.425104993000467e-07, 'completion_length': 346.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7157739400863647, 'reward_std': 0.09623343823477626, 'kl': 0.102783203125, 'epoch': 0.66}
66%|██████▌ | 2819/4286 [21:18:00<10:04:49, 24.74s/it] {'loss': 0.0122, 'grad_norm': 4.279824219380968, 'learning_rate': 3.422771815212319e-07, 'completion_length': 297.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 1.0, 'reward': 1.6502977013587952, 'reward_std': 0.08907204121351242, 'kl': 0.3037109375, 'epoch': 0.66}
66%|██████▌ | 2820/4286 [21:18:26<10:08:45, 24.92s/it] {'loss': 0.0034, 'grad_norm': 4.381194163756801, 'learning_rate': 3.4204386374241717e-07, 'completion_length': 314.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.733631044626236, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.05427858792245388, 'kl': 0.0860595703125, 'epoch': 0.66}
66%|██████▌ | 2821/4286 [21:18:51<10:12:28, 25.08s/it] {'loss': 0.0035, 'grad_norm': 7.948156685074636, 'learning_rate': 3.4181054596360245e-07, 'completion_length': 324.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.13955247402191162, 'kl': 0.0882568359375, 'epoch': 0.66}
66%|██████▌ | 2822/4286 [21:19:15<10:03:49, 24.75s/it] {'loss': 0.0096, 'grad_norm': 3.193538989574847, 'learning_rate': 3.4157722818478767e-07, 'completion_length': 255.46430206298828, 'rewards/only_full_func_accuracy_reward': 0.9002977013587952, 'rewards/format_reward': 1.0, 'reward': 1.90029776096344, 'reward_std': 0.04136601183563471, 'kl': 0.24102783203125, 'epoch': 0.66}
66%|██████▌ | 2823/4286 [21:19:39<9:57:28, 24.50s/it] {'loss': 0.0031, 'grad_norm': 2.355626503784389, 'learning_rate': 3.4134391040597295e-07, 'completion_length': 306.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6994047462940216, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.06044464930891991, 'kl': 0.077392578125, 'epoch': 0.66}
66%|██████▌ | 2824/4286 [21:20:03<9:53:22, 24.35s/it] {'loss': 0.0027, 'grad_norm': 2.4825021388675554, 'learning_rate': 3.4111059262715817e-07, 'completion_length': 284.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.8660714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8660715222358704, 'reward_std': 0.07922262325882912, 'kl': 0.0670166015625, 'epoch': 0.66}
66%|██████▌ | 2825/4286 [21:20:26<9:45:49, 24.06s/it] {'loss': 0.0026, 'grad_norm': 0.9823388450298962, 'learning_rate': 3.4087727484834344e-07, 'completion_length': 288.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.060017285868525505, 'kl': 0.064208984375, 'epoch': 0.66}
66%|██████▌ | 2826/4286 [21:20:50<9:41:15, 23.89s/it] {'loss': 0.0015, 'grad_norm': 0.7075397075998577, 'learning_rate': 3.406439570695287e-07, 'completion_length': 235.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 1.0, 'reward': 1.8511906862258911, 'reward_std': 0.013746432960033417, 'kl': 0.038330078125, 'epoch': 0.66}
66%|██████▌ | 2827/4286 [21:21:14<9:42:36, 23.96s/it] {'loss': 0.0031, 'grad_norm': 2.694780140261478, 'learning_rate': 3.4041063929071394e-07, 'completion_length': 291.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8273809850215912, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.025651192292571068, 'kl': 0.0771484375, 'epoch': 0.66}
66%|██████▌ | 2828/4286 [21:21:38<9:39:59, 23.87s/it] {'loss': 0.0053, 'grad_norm': 0.7585700387192758, 'learning_rate': 3.401773215118992e-07, 'completion_length': 255.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187500596046448, 'reward_std': 0.06047770380973816, 'kl': 0.132080078125, 'epoch': 0.66}
66%|██████▌ | 2829/4286 [21:22:02<9:46:07, 24.14s/it] {'loss': 0.0027, 'grad_norm': 1.8442487789829196, 'learning_rate': 3.3994400373308444e-07, 'completion_length': 277.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.09002592600882053, 'kl': 0.067626953125, 'epoch': 0.66}
66%|██████▌ | 2830/4286 [21:22:26<9:41:03, 23.94s/it] {'loss': 0.0072, 'grad_norm': 7.54385014925457, 'learning_rate': 3.397106859542697e-07, 'completion_length': 273.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.752976268529892, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.08014346286654472, 'kl': 0.179931640625, 'epoch': 0.66}
66%|██████▌ | 2831/4286 [21:22:50<9:38:10, 23.84s/it] {'loss': 0.0069, 'grad_norm': 5.556532421999645, 'learning_rate': 3.39477368175455e-07, 'completion_length': 282.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.029761905781924725, 'kl': 0.172607421875, 'epoch': 0.66}
66%|██████▌ | 2832/4286 [21:23:14<9:45:10, 24.15s/it] {'loss': 0.0119, 'grad_norm': 1.9312169911032837, 'learning_rate': 3.392440503966402e-07, 'completion_length': 313.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.77976194024086, 'rewards/format_reward': 1.0, 'reward': 1.7797619700431824, 'reward_std': 0.011904762126505375, 'kl': 0.298828125, 'epoch': 0.66}
66%|██████▌ | 2833/4286 [21:23:38<9:38:30, 23.89s/it] {'loss': 0.0123, 'grad_norm': 9.447462328642247, 'learning_rate': 3.390107326178255e-07, 'completion_length': 251.57144927978516, 'rewards/only_full_func_accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.04228769429028034, 'kl': 0.30908203125, 'epoch': 0.66}
66%|██████▌ | 2834/4286 [21:24:02<9:43:43, 24.12s/it] {'loss': 0.0025, 'grad_norm': 0.5730813806429969, 'learning_rate': 3.387774148390107e-07, 'completion_length': 297.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.8333333432674408, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.03436608985066414, 'kl': 0.06298828125, 'epoch': 0.66}
66%|██████▌ | 2835/4286 [21:24:27<9:43:47, 24.14s/it] {'loss': 0.0141, 'grad_norm': 1.0671813254222606, 'learning_rate': 3.38544097060196e-07, 'completion_length': 317.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.626488134264946, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.058389291167259216, 'kl': 0.3515625, 'epoch': 0.66}
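One more regularity in these records: the reported 'loss' tracks 0.04 * 'kl' almost exactly (e.g. step 2835 above: 0.04 * 0.3515625 = 0.0140625 ≈ 0.0141). That is what a GRPO-style objective looks like when the advantage-weighted policy term averages out near zero at logging time, leaving only the KL penalty with coefficient beta ≈ 0.04; the coefficient is an inference from the numbers, not something the log states. A sketch of the check:

    # ('loss', 'kl') pairs copied from steps 2828-2835 above; the logged
    # loss matches beta * kl with beta = 0.04 to within rounding.
    samples = [
        (0.0053, 0.132080078125),
        (0.0027, 0.067626953125),
        (0.0119, 0.298828125),
        (0.0141, 0.3515625),
    ]
    for loss, kl in samples:
        assert abs(loss - 0.04 * kl) < 5e-4, (loss, kl)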
66%|██████▌ | 2836/4286 [21:24:51<9:47:13, 24.30s/it] {'loss': 0.0089, 'grad_norm': 4.816413903460464, 'learning_rate': 3.3831077928138126e-07, 'completion_length': 311.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7083333432674408, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.07738096080720425, 'kl': 0.22265625, 'epoch': 0.66}
66%|██████▌ | 2837/4286 [21:25:18<10:03:01, 24.97s/it] {'loss': 0.0102, 'grad_norm': 2.7041846966278373, 'learning_rate': 3.380774615025665e-07, 'completion_length': 346.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6413690447807312, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.605654776096344, 'reward_std': 0.1375329215079546, 'kl': 0.2548828125, 'epoch': 0.66}
66%|██████▌ | 2838/4286 [21:25:44<10:12:43, 25.39s/it] {'loss': 0.0091, 'grad_norm': 10.37425715815043, 'learning_rate': 3.3784414372375176e-07, 'completion_length': 300.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.748512089252472, 'reward_std': 0.008928571827709675, 'kl': 0.2275390625, 'epoch': 0.66}
66%|██████▌ | 2839/4286 [21:26:07<9:57:04, 24.76s/it] {'loss': 0.0065, 'grad_norm': 15.799262903493851, 'learning_rate': 3.3761082594493703e-07, 'completion_length': 273.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.791666716337204, 'rewards/format_reward': 1.0, 'reward': 1.7916667461395264, 'reward_std': 0.07419108599424362, 'kl': 0.1640625, 'epoch': 0.66}
66%|██████▋ | 2840/4286 [21:26:32<9:54:10, 24.65s/it] {'loss': 0.0144, 'grad_norm': 5.951044962672789, 'learning_rate': 3.3737750816612225e-07, 'completion_length': 307.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6815477013587952, 'reward_std': 0.11340620019473135, 'kl': 0.359130859375, 'epoch': 0.66}
66%|██████▋ | 2841/4286 [21:26:56<9:50:00, 24.50s/it] {'loss': 0.0168, 'grad_norm': 2.474990187933806, 'learning_rate': 3.3714419038730753e-07, 'completion_length': 272.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.7863095998764038, 'rewards/format_reward': 1.0, 'reward': 1.7863095998764038, 'reward_std': 0.09975380450487137, 'kl': 0.41796875, 'epoch': 0.66}
66%|██████▋ | 2842/4286 [21:27:20<9:49:31, 24.50s/it] {'loss': 0.005, 'grad_norm': 9.998216042386561, 'learning_rate': 3.3691087260849275e-07, 'completion_length': 325.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.8363096117973328, 'rewards/format_reward': 1.0, 'reward': 1.8363096117973328, 'reward_std': 0.08767743036150932, 'kl': 0.12451171875, 'epoch': 0.66}
66%|██████▋ | 2843/4286 [21:27:44<9:44:17, 24.29s/it] {'loss': 0.0106, 'grad_norm': 13.365536363106012, 'learning_rate': 3.36677554829678e-07, 'completion_length': 274.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.1000974290072918, 'kl': 0.263916015625, 'epoch': 0.66}
66%|██████▋ | 2844/4286 [21:28:09<9:47:06, 24.43s/it] {'loss': 0.0038, 'grad_norm': 2.6651182783970886, 'learning_rate': 3.364442370508633e-07, 'completion_length': 318.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125001788139343, 'reward_std': 0.04467591270804405, 'kl': 0.0947265625, 'epoch': 0.66}
66%|██████▋ | 2845/4286 [21:28:35<9:55:04, 24.78s/it] {'loss': 0.0149, 'grad_norm': 16.117851583137288, 'learning_rate': 3.362109192720485e-07, 'completion_length': 303.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5952380895614624, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.09640567377209663, 'kl': 0.37255859375, 'epoch': 0.66}
66%|██████▋ | 2846/4286 [21:29:01<10:05:36, 25.23s/it] {'loss': 0.0112, 'grad_norm': 1.981927939300775, 'learning_rate': 3.359776014932338e-07, 'completion_length': 316.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.0412393044680357, 'kl': 0.27880859375, 'epoch': 0.66}
66%|██████▋ | 2847/4286 [21:29:26<10:03:59, 25.18s/it] {'loss': 0.0055, 'grad_norm': 16.174302922343585, 'learning_rate': 3.35744283714419e-07, 'completion_length': 314.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7187500596046448, 'rewards/format_reward': 1.0, 'reward': 1.7187501192092896, 'reward_std': 0.10491083189845085, 'kl': 0.137451171875, 'epoch': 0.66}
66%|██████▋ | 2848/4286 [21:29:54<10:20:30, 25.89s/it] {'loss': 0.0056, 'grad_norm': 6.776435898330739, 'learning_rate': 3.355109659356043e-07, 'completion_length': 320.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.630952388048172, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.044429175555706024, 'kl': 0.141357421875, 'epoch': 0.66}
66%|██████▋ | 2849/4286 [21:30:20<10:22:00, 25.97s/it] {'loss': 0.0083, 'grad_norm': 7.694082847116715, 'learning_rate': 3.3527764815678957e-07, 'completion_length': 310.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.6357142925262451, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6000000834465027, 'reward_std': 0.16745267808437347, 'kl': 0.2071533203125, 'epoch': 0.66}
66%|██████▋ | 2850/4286 [21:30:43<10:04:08, 25.24s/it] {'loss': 0.0077, 'grad_norm': 2.717142709906916, 'learning_rate': 3.350443303779748e-07, 'completion_length': 270.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.04602411389350891, 'kl': 0.1923828125, 'epoch': 0.66}
67%|██████▋ | 2851/4286 [21:31:07<9:51:36, 24.74s/it] {'loss': 0.0076, 'grad_norm': 13.397496038590605, 'learning_rate': 3.3481101259916007e-07, 'completion_length': 270.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.6458333730697632, 'reward_std': 0.017857137601822615, 'kl': 0.19091796875, 'epoch': 0.67}
67%|██████▋ | 2852/4286 [21:31:31<9:45:33, 24.50s/it] {'loss': 0.0036, 'grad_norm': 7.857939023149517, 'learning_rate': 3.345776948203453e-07, 'completion_length': 270.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7663690149784088, 'rewards/format_reward': 1.0, 'reward': 1.7663691639900208, 'reward_std': 0.03893720870837569, 'kl': 0.089599609375, 'epoch': 0.67}
67%|██████▋ | 2853/4286 [21:31:54<9:36:30, 24.14s/it] {'loss': 0.0034, 'grad_norm': 13.493247653976178, 'learning_rate': 3.3434437704153057e-07, 'completion_length': 287.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.772321492433548, 'rewards/format_reward': 1.0, 'reward': 1.7723215818405151, 'reward_std': 0.0684523768723011, 'kl': 0.0845947265625, 'epoch': 0.67}
67%|██████▋ | 2854/4286 [21:32:17<9:27:54, 23.80s/it] {'loss': 0.0118, 'grad_norm': 26.373558976749713, 'learning_rate': 3.3411105926271584e-07, 'completion_length': 217.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.8898809850215912, 'rewards/format_reward': 1.0, 'reward': 1.8898810744285583, 'reward_std': 0.041666656732559204, 'kl': 0.29638671875, 'epoch': 0.67}
67%|██████▋ | 2855/4286 [21:32:40<9:21:15, 23.53s/it] {'loss': 0.004, 'grad_norm': 24.58502477359228, 'learning_rate': 3.3387774148390106e-07, 'completion_length': 257.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.9002976417541504, 'rewards/format_reward': 1.0, 'reward': 1.9002977013587952, 'reward_std': 0.030843347311019897, 'kl': 0.09912109375, 'epoch': 0.67}
67%|██████▋ | 2856/4286 [21:33:05<9:32:31, 24.02s/it] {'loss': 0.0183, 'grad_norm': 3.11275055973868, 'learning_rate': 3.3364442370508634e-07, 'completion_length': 296.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.735491156578064, 'rewards/format_reward': 1.0, 'reward': 1.7354912161827087, 'reward_std': 0.0409226194024086, 'kl': 0.4571533203125, 'epoch': 0.67}
67%|██████▋ | 2857/4286 [21:33:29<9:31:27, 23.99s/it] {'loss': 0.0132, 'grad_norm': 4.734322950486751, 'learning_rate': 3.3341110592627156e-07, 'completion_length': 296.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.0416666641831398, 'kl': 0.3291015625, 'epoch': 0.67}
67%|██████▋ | 2858/4286 [21:33:56<9:49:43, 24.78s/it] {'loss': 0.0088, 'grad_norm': 6.848987755831029, 'learning_rate': 3.3317778814745684e-07, 'completion_length': 285.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6852679252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6674107909202576, 'reward_std': 0.14485793933272362, 'kl': 0.220703125, 'epoch': 0.67}
67%|██████▋ | 2859/4286 [21:34:21<9:56:49, 25.09s/it] {'loss': 0.0109, 'grad_norm': 12.860406596607508, 'learning_rate': 3.329444703686421e-07, 'completion_length': 330.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6116072237491608, 'rewards/format_reward': 1.0, 'reward': 1.611607313156128, 'reward_std': 0.01488095335662365, 'kl': 0.272705078125, 'epoch': 0.67}
67%|██████▋ | 2860/4286 [21:34:46<9:53:38, 24.98s/it] {'loss': 0.0074, 'grad_norm': 16.980684400379726, 'learning_rate': 3.3271115258982733e-07, 'completion_length': 290.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.8348214328289032, 'rewards/format_reward': 1.0, 'reward': 1.8348215222358704, 'reward_std': 0.04464286006987095, 'kl': 0.1845703125, 'epoch': 0.67}
67%|██████▋ | 2861/4286 [21:35:10<9:47:56, 24.76s/it] {'loss': 0.013, 'grad_norm': 5.05571458299273, 'learning_rate': 3.324778348110126e-07, 'completion_length': 321.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.04186442866921425, 'kl': 0.3251953125, 'epoch': 0.67}
67%|██████▋ | 2862/4286 [21:35:36<9:51:09, 24.91s/it] {'loss': 0.0033, 'grad_norm': 20.590859965164725, 'learning_rate': 3.322445170321979e-07, 'completion_length': 287.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.059770552441477776, 'kl': 0.082763671875, 'epoch': 0.67}
67%|██████▋ | 2863/4286 [21:36:01<9:54:52, 25.08s/it] {'loss': 0.0065, 'grad_norm': 3.6324407607472575, 'learning_rate': 3.320111992533831e-07, 'completion_length': 315.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.7574405074119568, 'rewards/format_reward': 1.0, 'reward': 1.7574406266212463, 'reward_std': 0.032738097012043, 'kl': 0.1630859375, 'epoch': 0.67}
67%|██████▋ | 2864/4286 [21:36:26<9:52:31, 25.00s/it] {'loss': 0.0108, 'grad_norm': 3.4584871634471446, 'learning_rate': 3.317778814745684e-07, 'completion_length': 302.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6428572237491608, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.0476190485060215, 'kl': 0.2705078125, 'epoch': 0.67}
67%|██████▋ | 2865/4286 [21:36:50<9:45:49, 24.74s/it] {'loss': 0.0106, 'grad_norm': 2.9142797073516404, 'learning_rate': 3.315445636957536e-07, 'completion_length': 298.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.048786623403429985, 'kl': 0.2646484375, 'epoch': 0.67}
67%|██████▋ | 2866/4286 [21:37:14<9:38:24, 24.44s/it] {'loss': 0.017, 'grad_norm': 4.49828116648509, 'learning_rate': 3.313112459169389e-07, 'completion_length': 287.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.7767857015132904, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.0773809589445591, 'kl': 0.4251708984375, 'epoch': 0.67}
67%|██████▋ | 2867/4286 [21:37:39<9:40:15, 24.54s/it] {'loss': 0.0142, 'grad_norm': 3.0720475318264366, 'learning_rate': 3.3107792813812415e-07, 'completion_length': 317.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7857143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.0, 'kl': 0.35498046875, 'epoch': 0.67}
67%|██████▋ | 2868/4286 [21:38:03<9:38:27, 24.48s/it] {'loss': 0.005, 'grad_norm': 4.0061026294647775, 'learning_rate': 3.308446103593094e-07, 'completion_length': 325.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.9029762446880341, 'rewards/format_reward': 1.0, 'reward': 1.9029763340950012, 'reward_std': 0.00357142835855484, 'kl': 0.1246337890625, 'epoch': 0.67}
67%|██████▋ | 2869/4286 [21:38:27<9:37:49, 24.47s/it] {'loss': 0.0095, 'grad_norm': 13.24924738694049, 'learning_rate': 3.3061129258049465e-07, 'completion_length': 301.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.05029458273202181, 'kl': 0.2373046875, 'epoch': 0.67}
67%|██████▋ | 2870/4286 [21:38:51<9:34:23, 24.34s/it] {'loss': 0.007, 'grad_norm': 46.419086892375006, 'learning_rate': 3.3037797480167987e-07, 'completion_length': 246.46430206298828, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.03905685991048813, 'kl': 0.1748046875, 'epoch': 0.67}
67%|██████▋ | 2871/4286 [21:39:16<9:35:44, 24.41s/it] {'loss': 0.0054, 'grad_norm': 7.896892056422017, 'learning_rate': 3.3014465702286515e-07, 'completion_length': 309.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.8437500298023224, 'rewards/format_reward': 1.0, 'reward': 1.8437501788139343, 'reward_std': 0.05059523694217205, 'kl': 0.136474609375, 'epoch': 0.67}
67%|██████▋ | 2872/4286 [21:39:42<9:44:26, 24.80s/it] {'loss': 0.0053, 'grad_norm': 3.742117759161738, 'learning_rate': 3.299113392440504e-07, 'completion_length': 289.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.830357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8125000596046448, 'reward_std': 0.09481073170900345, 'kl': 0.132080078125, 'epoch': 0.67}
67%|██████▋ | 2873/4286 [21:40:08<9:51:43, 25.13s/it] {'loss': 0.0029, 'grad_norm': 2.6263446655137135, 'learning_rate': 3.2967802146523564e-07, 'completion_length': 303.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.6815476417541504, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.03160357475280762, 'kl': 0.0732421875, 'epoch': 0.67}
67%|██████▋ | 2874/4286 [21:40:32<9:47:57, 24.98s/it] {'loss': 0.0086, 'grad_norm': 0.6842944341037421, 'learning_rate': 3.294447036864209e-07, 'completion_length': 301.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.84226194024086, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.010309826582670212, 'kl': 0.21484375, 'epoch': 0.67}
67%|██████▋ | 2875/4286 [21:40:57<9:43:29, 24.81s/it] {'loss': 0.0035, 'grad_norm': 0.5951665016258708, 'learning_rate': 3.2921138590760614e-07, 'completion_length': 274.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.8169643580913544, 'rewards/format_reward': 1.0, 'reward': 1.8169644474983215, 'reward_std': 0.008928571827709675, 'kl': 0.0882568359375, 'epoch': 0.67}
67%|██████▋ | 2876/4286 [21:41:21<9:42:00, 24.77s/it] {'loss': 0.0088, 'grad_norm': 8.273883618024776, 'learning_rate': 3.289780681287914e-07, 'completion_length': 277.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.726190447807312, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.06940627843141556, 'kl': 0.21923828125, 'epoch': 0.67}
67%|██████▋ | 2877/4286 [21:41:46<9:43:10, 24.83s/it] {'loss': 0.0046, 'grad_norm': 1.747125967552047, 'learning_rate': 3.287447503499767e-07, 'completion_length':
284.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7380953133106232, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.05578738823533058, 'kl': 0.1136474609375, 'epoch': 0.67} 67%|██████▋ | 2877/4286 [21:41:46<9:43:10, 24.83s/it] 67%|██████▋ | 2878/4286 [21:42:11<9:41:11, 24.77s/it] {'loss': 0.013, 'grad_norm': 3.3950825795462474, 'learning_rate': 3.285114325711619e-07, 'completion_length': 283.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.06823869794607162, 'kl': 0.3271484375, 'epoch': 0.67} 67%|██████▋ | 2878/4286 [21:42:11<9:41:11, 24.77s/it] 67%|██████▋ | 2879/4286 [21:42:35<9:36:43, 24.59s/it] {'loss': 0.0126, 'grad_norm': 3.939854173318027, 'learning_rate': 3.282781147923472e-07, 'completion_length': 284.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7842262983322144, 'reward_std': 0.10778811387717724, 'kl': 0.3154296875, 'epoch': 0.67} 67%|██████▋ | 2879/4286 [21:42:35<9:36:43, 24.59s/it] 67%|██████▋ | 2880/4286 [21:42:59<9:29:30, 24.30s/it] {'loss': 0.0188, 'grad_norm': 49.71629476813301, 'learning_rate': 3.280447970135324e-07, 'completion_length': 311.375, 'rewards/only_full_func_accuracy_reward': 0.8869048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8869048953056335, 'reward_std': 0.056333938613533974, 'kl': 0.47119140625, 'epoch': 0.67} 67%|██████▋ | 2880/4286 [21:42:59<9:29:30, 24.30s/it] 67%|██████▋ | 2881/4286 [21:43:21<9:17:15, 23.80s/it] {'loss': 0.0035, 'grad_norm': 5.516272189263883, 'learning_rate': 3.278114792347177e-07, 'completion_length': 261.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.758928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.026884591206908226, 'kl': 0.0867919921875, 'epoch': 0.67} 67%|██████▋ | 2881/4286 [21:43:21<9:17:15, 23.80s/it] 67%|██████▋ | 2882/4286 [21:43:45<9:15:57, 23.76s/it] {'loss': 0.0068, 'grad_norm': 8.23226733133807, 'learning_rate': 3.2757816145590296e-07, 'completion_length': 253.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8541668057441711, 'rewards/format_reward': 1.0, 'reward': 1.8541667461395264, 'reward_std': 0.03411934711039066, 'kl': 0.16943359375, 'epoch': 0.67} 67%|██████▋ | 2882/4286 [21:43:45<9:15:57, 23.76s/it][2025-03-03 12:41:33,584] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
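If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time

The stage3 warning above is actionable: under memory pressure the PyTorch caching allocator starts flushing inside the optimizer step, and ranks that flush at different times stall one another. A minimal sketch of the suggested mitigation, assuming a hand-written DeepSpeed engine loop; `model_engine`, `train_loader`, and `FLUSH_EVERY` are illustrative names, not taken from this run (which appears to drive training through a trainer class, where the same call would go in a per-step hook):

from deepspeed.accelerator import get_accelerator

FLUSH_EVERY = 50  # illustrative interval; tune to how often the warning fires

for step, batch in enumerate(train_loader):
    loss = model_engine(batch)   # forward; assumes the wrapped model returns its loss
    model_engine.backward(loss)  # ZeRO stage-3 partitioned backward
    model_engine.step()          # optimizer step, where stage3.py emits the warning
    if step % FLUSH_EVERY == 0:
        # Every rank reaches this line at the same step, so the allocator
        # caches are emptied in lockstep, as the warning recommends.
        get_accelerator().empty_cache()
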
67%|██████▋ | 2883/4286 [21:44:11<9:28:42, 24.32s/it] {'loss': 0.0033, 'grad_norm': 9.851068120870869, 'learning_rate': 3.273448436770882e-07, 'completion_length': 316.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7169643342494965, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6991072297096252, 'reward_std': 0.11234227940440178, 'kl': 0.08203125, 'epoch': 0.67} 67%|██████▋ | 2883/4286 [21:44:11<9:28:42, 24.32s/it] 67%|██████▋ | 2884/4286 [21:44:34<9:21:12, 24.02s/it] {'loss': 0.0201, 'grad_norm': 12.724130093281397, 'learning_rate': 3.2711152589827346e-07, 'completion_length': 306.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.68452388048172, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.648809552192688, 'reward_std': 0.16850833594799042, 'kl': 0.50390625, 'epoch': 0.67} 67%|██████▋ | 2884/4286 [21:44:34<9:21:12, 24.02s/it] 67%|██████▋ | 2885/4286 [21:44:58<9:23:52, 24.15s/it] {'loss': 0.013, 'grad_norm': 6.173664166918899, 'learning_rate': 3.2687820811945873e-07, 'completion_length': 294.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 1.0, 'reward': 1.6443453431129456, 'reward_std': 0.03961131349205971, 'kl': 0.3251953125, 'epoch': 0.67} 67%|██████▋ | 2885/4286 [21:44:58<9:23:52, 24.15s/it] 67%|██████▋ | 2886/4286 [21:45:23<9:25:29, 24.24s/it] {'loss': 0.0167, 'grad_norm': 2.4886352792192676, 'learning_rate': 3.2664489034064396e-07, 'completion_length': 321.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.023809521924704313, 'kl': 0.4169921875, 'epoch': 0.67} 67%|██████▋ | 2886/4286 [21:45:23<9:25:29, 24.24s/it] 67%|██████▋ | 2887/4286 [21:45:47<9:26:00, 24.27s/it] {'loss': 0.0223, 'grad_norm': 5.014611563645236, 'learning_rate': 3.2641157256182923e-07, 'completion_length': 281.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7261904776096344, 'rewards/format_reward': 1.0, 'reward': 1.7261906862258911, 'reward_std': 0.08885835483670235, 'kl': 0.556640625, 'epoch': 0.67} 67%|██████▋ | 2887/4286 [21:45:47<9:26:00, 24.27s/it] 67%|██████▋ | 2888/4286 [21:46:11<9:22:34, 24.15s/it] {'loss': 0.007, 'grad_norm': 4.315520763727771, 'learning_rate': 3.2617825478301445e-07, 'completion_length': 281.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.9077381789684296, 'rewards/format_reward': 1.0, 'reward': 1.907738208770752, 'reward_std': 0.06547618750482798, 'kl': 0.1748046875, 'epoch': 0.67} 67%|██████▋ | 2888/4286 [21:46:11<9:22:34, 24.15s/it] 67%|██████▋ | 2889/4286 [21:46:34<9:15:29, 23.86s/it] {'loss': 0.0052, 'grad_norm': 5.20998881503347, 'learning_rate': 3.2594493700419973e-07, 'completion_length': 235.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6726192235946655, 'reward_std': 0.0357142835855484, 'kl': 0.131103515625, 'epoch': 0.67} 67%|██████▋ | 2889/4286 [21:46:34<9:15:29, 23.86s/it] 67%|██████▋ | 2890/4286 [21:46:58<9:17:07, 23.95s/it] {'loss': 0.0049, 'grad_norm': 0.3974960147402034, 'learning_rate': 3.25711619225385e-07, 'completion_length': 275.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.0, 'kl':
0.123779296875, 'epoch': 0.67} 67%|██████▋ | 2890/4286 [21:46:58<9:17:07, 23.95s/it] 67%|██████▋ | 2891/4286 [21:47:23<9:20:01, 24.09s/it] {'loss': 0.0132, 'grad_norm': 4.127872435854192, 'learning_rate': 3.2547830144657023e-07, 'completion_length': 288.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7842262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.027406740933656693, 'kl': 0.3291015625, 'epoch': 0.67} 67%|██████▋ | 2891/4286 [21:47:23<9:20:01, 24.09s/it] 67%|██████▋ | 2892/4286 [21:47:48<9:24:53, 24.31s/it] {'loss': 0.0174, 'grad_norm': 13.515207016416756, 'learning_rate': 3.252449836677555e-07, 'completion_length': 323.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6830357909202576, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.04013476520776749, 'kl': 0.433837890625, 'epoch': 0.67} 67%|██████▋ | 2892/4286 [21:47:48<9:24:53, 24.31s/it] 67%|██████▋ | 2893/4286 [21:48:12<9:23:38, 24.28s/it] {'loss': 0.02, 'grad_norm': 1.9093349443828587, 'learning_rate': 3.250116658889407e-07, 'completion_length': 327.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.0357142873108387, 'kl': 0.5, 'epoch': 0.67} 67%|██████▋ | 2893/4286 [21:48:12<9:23:38, 24.28s/it] 68%|██████▊ | 2894/4286 [21:48:36<9:22:31, 24.25s/it] {'loss': 0.0102, 'grad_norm': 4.291375004494897, 'learning_rate': 3.24778348110126e-07, 'completion_length': 292.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6845238208770752, 'rewards/format_reward': 1.0, 'reward': 1.68452388048172, 'reward_std': 0.03939763456583023, 'kl': 0.25537109375, 'epoch': 0.68} 68%|██████▊ | 2894/4286 [21:48:36<9:22:31, 24.25s/it] 68%|██████▊ | 2895/4286 [21:49:00<9:21:44, 24.23s/it] {'loss': 0.0222, 'grad_norm': 7.332203219245338, 'learning_rate': 3.245450303313113e-07, 'completion_length': 296.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.808531790971756, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7906746864318848, 'reward_std': 0.0734127014875412, 'kl': 0.5556640625, 'epoch': 0.68} 68%|██████▊ | 2895/4286 [21:49:00<9:21:44, 24.23s/it] 68%|██████▊ | 2896/4286 [21:49:24<9:17:05, 24.05s/it] {'loss': 0.0088, 'grad_norm': 12.242686161237241, 'learning_rate': 3.243117125524965e-07, 'completion_length': 281.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7708333730697632, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.03243744093924761, 'kl': 0.22021484375, 'epoch': 0.68} 68%|██████▊ | 2896/4286 [21:49:24<9:17:05, 24.05s/it] 68%|██████▊ | 2897/4286 [21:49:47<9:13:28, 23.91s/it] {'loss': 0.0054, 'grad_norm': 2.695537057513874, 'learning_rate': 3.2407839477368177e-07, 'completion_length': 266.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7172620296478271, 'reward_std': 0.03644705377519131, 'kl': 0.13427734375, 'epoch': 0.68} 68%|██████▊ | 2897/4286 [21:49:47<9:13:28, 23.91s/it] 68%|██████▊ | 2898/4286 [21:50:12<9:14:32, 23.97s/it] {'loss': 0.0125, 'grad_norm': 4.086420781508596, 'learning_rate': 3.23845076994867e-07, 'completion_length': 268.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607144474983215, 'reward_std': 0.040071725845336914, 'kl': 0.3115234375, 'epoch': 0.68} 68%|██████▊ | 2898/4286 [21:50:12<9:14:32, 23.97s/it] 68%|██████▊ | 2899/4286 
[21:50:36<9:16:50, 24.09s/it] {'loss': 0.0059, 'grad_norm': 8.506141060059479, 'learning_rate': 3.2361175921605227e-07, 'completion_length': 274.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7574404776096344, 'rewards/format_reward': 1.0, 'reward': 1.7574406266212463, 'reward_std': 0.020833336748182774, 'kl': 0.1474609375, 'epoch': 0.68} 68%|██████▊ | 2899/4286 [21:50:36<9:16:50, 24.09s/it] 68%|██████▊ | 2900/4286 [21:51:01<9:24:13, 24.43s/it] {'loss': 0.0174, 'grad_norm': 14.932118510607191, 'learning_rate': 3.2337844143723754e-07, 'completion_length': 341.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548357009888, 'reward_std': 0.08655626699328423, 'kl': 0.4365234375, 'epoch': 0.68} 68%|██████▊ | 2900/4286 [21:51:01<9:24:13, 24.43s/it] 68%|██████▊ | 2901/4286 [21:54:23<29:53:08, 77.68s/it] {'loss': 0.0074, 'grad_norm': 7.320057873592796, 'learning_rate': 3.2314512365842277e-07, 'completion_length': 294.375, 'rewards/only_full_func_accuracy_reward': 0.8809524178504944, 'rewards/format_reward': 1.0, 'reward': 1.880952537059784, 'reward_std': 0.0357142873108387, 'kl': 0.186279296875, 'epoch': 0.68} 68%|██████▊ | 2901/4286 [21:54:23<29:53:08, 77.68s/it] 68%|██████▊ | 2902/4286 [21:54:47<23:41:59, 61.65s/it] {'loss': 0.0071, 'grad_norm': 1.4335618899641054, 'learning_rate': 3.2291180587960804e-07, 'completion_length': 296.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.8720238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8720239400863647, 'reward_std': 0.01785714365541935, 'kl': 0.17626953125, 'epoch': 0.68} 68%|██████▊ | 2902/4286 [21:54:47<23:41:59, 61.65s/it] 68%|██████▊ | 2903/4286 [21:55:13<19:31:19, 50.82s/it] {'loss': 0.015, 'grad_norm': 13.921629272458759, 'learning_rate': 3.2267848810079326e-07, 'completion_length': 292.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6205357909202576, 'rewards/format_reward': 1.0, 'reward': 1.6205358505249023, 'reward_std': 0.07800615765154362, 'kl': 0.37646484375, 'epoch': 0.68} 68%|██████▊ | 2903/4286 [21:55:13<19:31:19, 50.82s/it] 68%|██████▊ | 2904/4286 [21:55:38<16:32:45, 43.10s/it] {'loss': 0.0274, 'grad_norm': 9.250617372091378, 'learning_rate': 3.2244517032197854e-07, 'completion_length': 329.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.03436608612537384, 'kl': 0.68310546875, 'epoch': 0.68} 68%|██████▊ | 2904/4286 [21:55:38<16:32:45, 43.10s/it] 68%|██████▊ | 2905/4286 [21:56:02<14:21:31, 37.43s/it] {'loss': 0.0315, 'grad_norm': 5.992629604485467, 'learning_rate': 3.222118525431638e-07, 'completion_length': 280.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562501192092896, 'reward_std': 0.0670952508226037, 'kl': 0.7880859375, 'epoch': 0.68} 68%|██████▊ | 2905/4286 [21:56:02<14:21:31, 37.43s/it] 68%|██████▊ | 2906/4286 [21:56:27<12:57:11, 33.79s/it] {'loss': 0.0134, 'grad_norm': 11.19822756661324, 'learning_rate': 3.2197853476434904e-07, 'completion_length': 322.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.6681548357009888, 'rewards/format_reward': 1.0, 'reward': 1.6681548953056335, 'reward_std': 0.0670172143727541, 'kl': 0.3359375, 'epoch': 0.68} 68%|██████▊ | 2906/4286 [21:56:27<12:57:11, 33.79s/it] 68%|██████▊ | 2907/4286 [21:56:53<11:59:37, 31.31s/it] {'loss': 0.0138, 'grad_norm': 1.5104039437149845, 'learning_rate': 
3.217452169855343e-07, 'completion_length': 321.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762387275696, 'reward_std': 0.03068274538964033, 'kl': 0.347900390625, 'epoch': 0.68} 68%|██████▊ | 2907/4286 [21:56:53<11:59:37, 31.31s/it] 68%|██████▊ | 2908/4286 [21:57:17<11:10:06, 29.18s/it] {'loss': 0.0228, 'grad_norm': 6.439658811318446, 'learning_rate': 3.215118992067196e-07, 'completion_length': 303.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5892857909202576, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.04007173515856266, 'kl': 0.5703125, 'epoch': 0.68} 68%|██████▊ | 2908/4286 [21:57:17<11:10:06, 29.18s/it] 68%|██████▊ | 2909/4286 [21:57:42<10:39:21, 27.86s/it] {'loss': 0.0074, 'grad_norm': 10.51130736234079, 'learning_rate': 3.212785814279048e-07, 'completion_length': 304.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.791666716337204, 'rewards/format_reward': 1.0, 'reward': 1.7916667461395264, 'reward_std': 0.050381554290652275, 'kl': 0.18408203125, 'epoch': 0.68} 68%|██████▊ | 2909/4286 [21:57:42<10:39:21, 27.86s/it] 68%|██████▊ | 2910/4286 [21:58:07<10:22:15, 27.13s/it] {'loss': 0.0218, 'grad_norm': 12.990008012741118, 'learning_rate': 3.210452636490901e-07, 'completion_length': 304.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.755952388048172, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.12574022263288498, 'kl': 0.546142578125, 'epoch': 0.68} 68%|██████▊ | 2910/4286 [21:58:07<10:22:15, 27.13s/it] 68%|██████▊ | 2911/4286 [21:58:32<10:03:19, 26.33s/it] {'loss': 0.013, 'grad_norm': 11.305368519625526, 'learning_rate': 3.208119458702753e-07, 'completion_length': 313.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.719494104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7016370296478271, 'reward_std': 0.10108363069593906, 'kl': 0.325927734375, 'epoch': 0.68} 68%|██████▊ | 2911/4286 [21:58:32<10:03:19, 26.33s/it] 68%|██████▊ | 2912/4286 [21:58:56<9:47:30, 25.66s/it] {'loss': 0.0065, 'grad_norm': 0.6333695327087608, 'learning_rate': 3.205786280914606e-07, 'completion_length': 297.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.8452381789684296, 'rewards/format_reward': 1.0, 'reward': 1.845238208770752, 'reward_std': 0.011904759332537651, 'kl': 0.1630859375, 'epoch': 0.68} 68%|██████▊ | 2912/4286 [21:58:56<9:47:30, 25.66s/it] 68%|██████▊ | 2913/4286 [21:59:21<9:39:51, 25.34s/it] {'loss': 0.0058, 'grad_norm': 7.3718836364540365, 'learning_rate': 3.2034531031264586e-07, 'completion_length': 316.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.6666667461395264, 'reward_std': 0.07100120931863785, 'kl': 0.144775390625, 'epoch': 0.68} 68%|██████▊ | 2913/4286 [21:59:21<9:39:51, 25.34s/it] 68%|██████▊ | 2914/4286 [21:59:46<9:38:14, 25.29s/it] {'loss': 0.0103, 'grad_norm': 6.842516810503312, 'learning_rate': 3.201119925338311e-07, 'completion_length': 315.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7425595819950104, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.06802501529455185, 'kl': 0.258544921875, 'epoch': 0.68} 68%|██████▊ | 2914/4286 [21:59:46<9:38:14, 25.29s/it] 68%|██████▊ | 2915/4286 [22:00:10<9:29:57, 24.94s/it] {'loss': 0.0144, 'grad_norm': 21.66227767131302, 'learning_rate': 3.1987867475501635e-07, 'completion_length': 255.5178680419922, 
'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.7232143878936768, 'reward_std': 0.05792887508869171, 'kl': 0.3603515625, 'epoch': 0.68} 68%|██████▊ | 2915/4286 [22:00:10<9:29:57, 24.94s/it] 68%|██████▊ | 2916/4286 [22:00:34<9:26:34, 24.81s/it] {'loss': 0.0117, 'grad_norm': 10.948247714786211, 'learning_rate': 3.196453569762016e-07, 'completion_length': 300.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 1.0, 'reward': 1.7991071939468384, 'reward_std': 0.047405367717146873, 'kl': 0.29296875, 'epoch': 0.68} 68%|██████▊ | 2916/4286 [22:00:34<9:26:34, 24.81s/it] 68%|██████▊ | 2917/4286 [22:00:59<9:27:20, 24.87s/it] {'loss': 0.0098, 'grad_norm': 24.307579989773757, 'learning_rate': 3.1941203919738685e-07, 'completion_length': 328.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7684524357318878, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7505953907966614, 'reward_std': 0.06364774238318205, 'kl': 0.2451171875, 'epoch': 0.68} 68%|██████▊ | 2917/4286 [22:00:59<9:27:20, 24.87s/it] 68%|██████▊ | 2918/4286 [22:01:23<9:17:46, 24.46s/it] {'loss': 0.0022, 'grad_norm': 0.10483275250353295, 'learning_rate': 3.1917872141857213e-07, 'completion_length': 279.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.0, 'kl': 0.0557861328125, 'epoch': 0.68} 68%|██████▊ | 2918/4286 [22:01:23<9:17:46, 24.46s/it] 68%|██████▊ | 2919/4286 [22:01:48<9:18:52, 24.53s/it] {'loss': 0.0079, 'grad_norm': 2.072648768574641, 'learning_rate': 3.1894540363975735e-07, 'completion_length': 324.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.80952388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7916667461395264, 'reward_std': 0.0826719831675291, 'kl': 0.1981201171875, 'epoch': 0.68} 68%|██████▊ | 2919/4286 [22:01:48<9:18:52, 24.53s/it] 68%|██████▊ | 2920/4286 [22:02:12<9:20:50, 24.63s/it] {'loss': 0.0046, 'grad_norm': 0.9596609459050676, 'learning_rate': 3.187120858609426e-07, 'completion_length': 291.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.0357142873108387, 'kl': 0.114501953125, 'epoch': 0.68} 68%|██████▊ | 2920/4286 [22:02:12<9:20:50, 24.63s/it] 68%|██████▊ | 2921/4286 [22:02:36<9:15:01, 24.40s/it] {'loss': 0.0054, 'grad_norm': 13.188453903139582, 'learning_rate': 3.1847876808212785e-07, 'completion_length': 292.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7455357313156128, 'rewards/format_reward': 1.0, 'reward': 1.7455359101295471, 'reward_std': 0.08588216453790665, 'kl': 0.1357421875, 'epoch': 0.68} 68%|██████▊ | 2921/4286 [22:02:36<9:15:01, 24.40s/it] 68%|██████▊ | 2922/4286 [22:03:01<9:13:55, 24.37s/it] {'loss': 0.018, 'grad_norm': 29.051230665793057, 'learning_rate': 3.182454503033131e-07, 'completion_length': 281.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.7202382683753967, 'reward_std': 0.0476190485060215, 'kl': 0.447265625, 'epoch': 0.68} 68%|██████▊ | 2922/4286 [22:03:01<9:13:55, 24.37s/it] 68%|██████▊ | 2923/4286 [22:03:25<9:10:59, 24.26s/it] {'loss': 0.008, 'grad_norm': 3.1748599959825152, 'learning_rate': 3.180121325244984e-07, 'completion_length': 301.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7857143580913544, 'rewards/format_reward': 1.0, 
'reward': 1.7857144474983215, 'reward_std': 0.044429171830415726, 'kl': 0.20068359375, 'epoch': 0.68} 68%|██████▊ | 2923/4286 [22:03:25<9:10:59, 24.26s/it] 68%|██████▊ | 2924/4286 [22:03:49<9:13:10, 24.37s/it] {'loss': 0.009, 'grad_norm': 1.9698786032103284, 'learning_rate': 3.177788147456836e-07, 'completion_length': 332.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.772321492433548, 'rewards/format_reward': 1.0, 'reward': 1.7723215818405151, 'reward_std': 0.06250000186264515, 'kl': 0.2255859375, 'epoch': 0.68} 68%|██████▊ | 2924/4286 [22:03:49<9:13:10, 24.37s/it] 68%|██████▊ | 2925/4286 [22:04:14<9:13:56, 24.42s/it] {'loss': 0.0315, 'grad_norm': 5.9829312823288525, 'learning_rate': 3.175454969668689e-07, 'completion_length': 308.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7127978205680847, 'reward_std': 0.0744047611951828, 'kl': 0.7861328125, 'epoch': 0.68} 68%|██████▊ | 2925/4286 [22:04:14<9:13:56, 24.42s/it] 68%|██████▊ | 2926/4286 [22:04:39<9:18:25, 24.64s/it] {'loss': 0.0122, 'grad_norm': 5.960417364209594, 'learning_rate': 3.173121791880541e-07, 'completion_length': 314.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7997024357318878, 'rewards/format_reward': 1.0, 'reward': 1.799702525138855, 'reward_std': 0.04262397438287735, 'kl': 0.30712890625, 'epoch': 0.68} 68%|██████▊ | 2926/4286 [22:04:39<9:18:25, 24.64s/it] 68%|██████▊ | 2927/4286 [22:05:02<9:07:52, 24.19s/it] {'loss': 0.0085, 'grad_norm': 3.638890299036364, 'learning_rate': 3.170788614092394e-07, 'completion_length': 299.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7470238506793976, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.036635126918554306, 'kl': 0.21337890625, 'epoch': 0.68} 68%|██████▊ | 2927/4286 [22:05:02<9:07:52, 24.19s/it] 68%|██████▊ | 2928/4286 [22:05:26<9:07:03, 24.17s/it] {'loss': 0.0231, 'grad_norm': 3.299288551460375, 'learning_rate': 3.1684554363042467e-07, 'completion_length': 286.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.098214291036129, 'kl': 0.57763671875, 'epoch': 0.68} 68%|██████▊ | 2928/4286 [22:05:26<9:07:03, 24.17s/it] 68%|██████▊ | 2929/4286 [22:05:52<9:20:04, 24.76s/it] {'loss': 0.0292, 'grad_norm': 11.027592608614823, 'learning_rate': 3.166122258516099e-07, 'completion_length': 314.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.754464328289032, 'reward_std': 0.09938186779618263, 'kl': 0.73193359375, 'epoch': 0.68} 68%|██████▊ | 2929/4286 [22:05:52<9:20:04, 24.76s/it] 68%|██████▊ | 2930/4286 [22:06:17<9:17:05, 24.65s/it] {'loss': 0.0228, 'grad_norm': 9.366881115300894, 'learning_rate': 3.1637890807279516e-07, 'completion_length': 312.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5699404776096344, 'rewards/format_reward': 1.0, 'reward': 1.5699405670166016, 'reward_std': 0.08260532096028328, 'kl': 0.568359375, 'epoch': 0.68} 68%|██████▊ | 2930/4286 [22:06:17<9:17:05, 24.65s/it] 68%|██████▊ | 2931/4286 [22:06:40<9:11:01, 24.40s/it] {'loss': 0.045, 'grad_norm': 3.0875356180845235, 'learning_rate': 3.1614559029398044e-07, 'completion_length': 311.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.779762089252472, 'reward_std': 0.02380952751263976, 'kl': 1.123046875, 
'epoch': 0.68} 68%|██████▊ | 2931/4286 [22:06:40<9:11:01, 24.40s/it] 68%|██████▊ | 2932/4286 [22:07:06<9:18:18, 24.74s/it] {'loss': 0.0184, 'grad_norm': 1.3502566608050146, 'learning_rate': 3.1591227251516566e-07, 'completion_length': 322.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.8024892508983612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.78463214635849, 'reward_std': 0.07359307631850243, 'kl': 0.462890625, 'epoch': 0.68} 68%|██████▊ | 2932/4286 [22:07:06<9:18:18, 24.74s/it] 68%|██████▊ | 2933/4286 [22:07:31<9:16:58, 24.70s/it] {'loss': 0.0122, 'grad_norm': 7.161001074592999, 'learning_rate': 3.1567895473635094e-07, 'completion_length': 314.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6577381491661072, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.04099256731569767, 'kl': 0.3056640625, 'epoch': 0.68} 68%|██████▊ | 2933/4286 [22:07:31<9:16:58, 24.70s/it] 68%|██████▊ | 2934/4286 [22:07:57<9:25:09, 25.08s/it] {'loss': 0.0311, 'grad_norm': 11.824074024106924, 'learning_rate': 3.1544563695753616e-07, 'completion_length': 317.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5625001192092896, 'reward_std': 0.0892857164144516, 'kl': 0.7783203125, 'epoch': 0.68} 68%|██████▊ | 2934/4286 [22:07:57<9:25:09, 25.08s/it] 68%|██████▊ | 2935/4286 [22:08:22<9:25:20, 25.11s/it] {'loss': 0.0023, 'grad_norm': 0.7772990552682681, 'learning_rate': 3.1521231917872143e-07, 'completion_length': 319.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.02380952169187367, 'kl': 0.05859375, 'epoch': 0.68} 68%|██████▊ | 2935/4286 [22:08:22<9:25:20, 25.11s/it] 69%|██████▊ | 2936/4286 [22:08:49<9:36:26, 25.62s/it] {'loss': 0.0097, 'grad_norm': 46.26970635821291, 'learning_rate': 3.149790013999067e-07, 'completion_length': 324.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7047619223594666, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.669047772884369, 'reward_std': 0.10113442875444889, 'kl': 0.24267578125, 'epoch': 0.69} 69%|██████▊ | 2936/4286 [22:08:49<9:36:26, 25.62s/it] 69%|██████▊ | 2937/4286 [22:09:14<9:31:49, 25.43s/it] {'loss': 0.0245, 'grad_norm': 10.140901542462633, 'learning_rate': 3.1474568362109193e-07, 'completion_length': 306.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6866071820259094, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6508928537368774, 'reward_std': 0.14856825396418571, 'kl': 0.611328125, 'epoch': 0.69} 69%|██████▊ | 2937/4286 [22:09:14<9:31:49, 25.43s/it] 69%|██████▊ | 2938/4286 [22:09:38<9:26:07, 25.20s/it] {'loss': 0.0277, 'grad_norm': 5.004982856389313, 'learning_rate': 3.145123658422772e-07, 'completion_length': 284.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.674107164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6562501192092896, 'reward_std': 0.12325883656740189, 'kl': 0.69140625, 'epoch': 0.69} 69%|██████▊ | 2938/4286 [22:09:38<9:26:07, 25.20s/it] 69%|██████▊ | 2939/4286 [22:10:02<9:18:00, 24.86s/it] {'loss': 0.0311, 'grad_norm': 8.829265859677438, 'learning_rate': 3.1427904806346243e-07, 'completion_length': 300.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.08754769712686539, 'kl': 0.77978515625, 'epoch': 0.69} 69%|██████▊ | 2939/4286 
[22:10:02<9:18:00, 24.86s/it] 69%|██████▊ | 2940/4286 [22:10:27<9:14:17, 24.71s/it] {'loss': 0.0314, 'grad_norm': 4.391219243275538, 'learning_rate': 3.140457302846477e-07, 'completion_length': 292.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7599206268787384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.742063581943512, 'reward_std': 0.15998706221580505, 'kl': 0.787109375, 'epoch': 0.69} 69%|██████▊ | 2940/4286 [22:10:27<9:14:17, 24.71s/it] 69%|██████▊ | 2941/4286 [22:10:51<9:11:54, 24.62s/it] {'loss': 0.0459, 'grad_norm': 10.101245997183216, 'learning_rate': 3.13812412505833e-07, 'completion_length': 308.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7345238924026489, 'rewards/format_reward': 1.0, 'reward': 1.7345239520072937, 'reward_std': 0.1548134796321392, 'kl': 1.1484375, 'epoch': 0.69} 69%|██████▊ | 2941/4286 [22:10:51<9:11:54, 24.62s/it] 69%|██████▊ | 2942/4286 [22:11:16<9:14:35, 24.76s/it] {'loss': 0.0094, 'grad_norm': 4.426235598220146, 'learning_rate': 3.135790947270182e-07, 'completion_length': 316.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.7571429014205933, 'rewards/format_reward': 1.0, 'reward': 1.757142961025238, 'reward_std': 0.02142857201397419, 'kl': 0.2344970703125, 'epoch': 0.69} 69%|██████▊ | 2942/4286 [22:11:16<9:14:35, 24.76s/it] 69%|██████▊ | 2943/4286 [22:11:41<9:15:08, 24.80s/it] {'loss': 0.004, 'grad_norm': 3.5376535448309108, 'learning_rate': 3.133457769482035e-07, 'completion_length': 309.5, 'rewards/only_full_func_accuracy_reward': 0.8288691341876984, 'rewards/format_reward': 1.0, 'reward': 1.8288691639900208, 'reward_std': 0.0386904738843441, 'kl': 0.099853515625, 'epoch': 0.69} 69%|██████▊ | 2943/4286 [22:11:41<9:15:08, 24.80s/it] 69%|██████▊ | 2944/4286 [22:12:07<9:24:17, 25.23s/it] {'loss': 0.0314, 'grad_norm': 12.508683226344107, 'learning_rate': 3.131124591693887e-07, 'completion_length': 341.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.529120922088623, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4934066534042358, 'reward_std': 0.10892947390675545, 'kl': 0.787109375, 'epoch': 0.69} 69%|██████▊ | 2944/4286 [22:12:07<9:24:17, 25.23s/it] 69%|██████▊ | 2945/4286 [22:12:32<9:18:54, 25.01s/it] {'loss': 0.0069, 'grad_norm': 2.8680815862567535, 'learning_rate': 3.1287914139057397e-07, 'completion_length': 331.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.07142856903374195, 'kl': 0.1727294921875, 'epoch': 0.69} 69%|██████▊ | 2945/4286 [22:12:32<9:18:54, 25.01s/it] 69%|██████▊ | 2946/4286 [22:12:57<9:23:00, 25.21s/it] {'loss': 0.0144, 'grad_norm': 7.18346411174664, 'learning_rate': 3.1264582361175925e-07, 'completion_length': 290.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.6979166865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.680059552192688, 'reward_std': 0.10281847417354584, 'kl': 0.35791015625, 'epoch': 0.69} 69%|██████▊ | 2946/4286 [22:12:57<9:23:00, 25.21s/it] 69%|██████▉ | 2947/4286 [22:13:21<9:12:04, 24.74s/it] {'loss': 0.0299, 'grad_norm': 10.916272316568138, 'learning_rate': 3.124125058329444e-07, 'completion_length': 287.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7681277394294739, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7324135303497314, 'reward_std': 0.09996392577886581, 'kl': 0.74755859375, 'epoch': 0.69} 69%|██████▉ | 2947/4286 [22:13:21<9:12:04, 24.74s/it] 69%|██████▉ | 2948/4286 
[22:13:45<9:09:09, 24.63s/it] {'loss': 0.0058, 'grad_norm': 8.769997715199498, 'learning_rate': 3.121791880541297e-07, 'completion_length': 281.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.8511905074119568, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.08517500944435596, 'kl': 0.143798828125, 'epoch': 0.69} 69%|██████▉ | 2948/4286 [22:13:45<9:09:09, 24.63s/it] 69%|██████▉ | 2949/4286 [22:14:09<9:04:22, 24.43s/it] {'loss': 0.0073, 'grad_norm': 26.06672371537706, 'learning_rate': 3.119458702753149e-07, 'completion_length': 277.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.828869104385376, 'rewards/format_reward': 1.0, 'reward': 1.8288691639900208, 'reward_std': 0.01580178737640381, 'kl': 0.18359375, 'epoch': 0.69} 69%|██████▉ | 2949/4286 [22:14:09<9:04:22, 24.43s/it][2025-03-03 13:11:57,682] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 69%|██████▉ | 2950/4286 [22:14:35<9:10:00, 24.70s/it] {'loss': 0.0254, 'grad_norm': 5.627186233713294, 'learning_rate': 3.117125524965002e-07, 'completion_length': 264.2678756713867, 'rewards/only_full_func_accuracy_reward': 0.6994048357009888, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6815477013587952, 'reward_std': 0.12245827168226242, 'kl': 0.634765625, 'epoch': 0.69} 69%|██████▉ | 2950/4286 [22:14:35<9:10:00, 24.70s/it] 69%|██████▉ | 2951/4286 [22:14:58<9:02:23, 24.38s/it] {'loss': 0.0159, 'grad_norm': 4.468936262439535, 'learning_rate': 3.1147923471768547e-07, 'completion_length': 311.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.627976268529892, 'rewards/format_reward': 1.0, 'reward': 1.6279762983322144, 'reward_std': 0.07848545629531145, 'kl': 0.39599609375, 'epoch': 0.69} 69%|██████▉ | 2951/4286 [22:14:58<9:02:23, 24.38s/it] 69%|██████▉ | 2952/4286 [22:15:23<9:05:12, 24.52s/it] {'loss': 0.0063, 'grad_norm': 3.541743977413966, 'learning_rate': 3.112459169388707e-07, 'completion_length': 313.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.84226194024086, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.01785714365541935, 'kl': 0.158203125, 'epoch': 0.69} 69%|██████▉ | 2952/4286 [22:15:23<9:05:12, 24.52s/it] 69%|██████▉ | 2953/4286 [22:15:48<9:06:43, 24.61s/it] {'loss': 0.0406, 'grad_norm': 16.98801916567528, 'learning_rate': 3.1101259916005596e-07, 'completion_length': 281.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.6673044264316559, 'rewards/format_reward': 1.0, 'reward': 1.6673044562339783, 'reward_std': 0.10231452528387308, 'kl': 1.01611328125, 'epoch': 0.69} 69%|██████▉ | 2953/4286 [22:15:48<9:06:43, 24.61s/it] 69%|██████▉ | 2954/4286 [22:16:13<9:05:22, 24.57s/it] {'loss': 0.0064, 'grad_norm': 0.740875505495867, 'learning_rate': 3.107792813812412e-07, 'completion_length': 280.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.011904759332537651, 'kl': 0.15966796875, 'epoch': 0.69} 69%|██████▉ | 2954/4286 [22:16:13<9:05:22, 24.57s/it]
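Two cache-flush warnings have now fired (after steps 2882 and 2949), both inside an optimizer step; the mitigation sketched after the first one applies unchanged. Separately, the per-step dicts are internally consistent: throughout the log, `reward` equals `rewards/only_full_func_accuracy_reward` plus `rewards/format_reward` up to float32 rounding, so the attainable ceiling is 2.0 for a completion that is both fully correct and well-formatted. A quick sanity check in plain Python, with the record literal copied from step 2949 above:

import math

record = {  # copied from the step-2949 log record above
    "rewards/only_full_func_accuracy_reward": 0.828869104385376,
    "rewards/format_reward": 1.0,
    "reward": 1.8288691639900208,
}

component_sum = sum(v for k, v in record.items() if k.startswith("rewards/"))
# The ~6e-8 gap is float32 accumulation noise from the trainer's averaging.
assert math.isclose(component_sum, record["reward"], abs_tol=1e-6)

69%|██████▉ | 2955/4286 [22:16:37<9:05:22, 24.58s/it] {'loss': 0.022, 'grad_norm':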
4.062178792315926, 'learning_rate': 3.1054596360242646e-07, 'completion_length': 325.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.5684524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5505953431129456, 'reward_std': 0.08928570710122585, 'kl': 0.5498046875, 'epoch': 0.69} 69%|██████▉ | 2955/4286 [22:16:37<9:05:22, 24.58s/it] 69%|██████▉ | 2956/4286 [22:17:02<9:05:32, 24.61s/it] {'loss': 0.0148, 'grad_norm': 14.27638827767518, 'learning_rate': 3.1031264582361174e-07, 'completion_length': 289.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.047619045712053776, 'kl': 0.3701171875, 'epoch': 0.69} 69%|██████▉ | 2956/4286 [22:17:02<9:05:32, 24.61s/it] 69%|██████▉ | 2957/4286 [22:17:27<9:05:58, 24.65s/it] {'loss': 0.0156, 'grad_norm': 11.927187986192562, 'learning_rate': 3.1007932804479696e-07, 'completion_length': 286.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.739583432674408, 'reward_std': 0.09303545951843262, 'kl': 0.3916015625, 'epoch': 0.69} 69%|██████▉ | 2957/4286 [22:17:27<9:05:58, 24.65s/it] 69%|██████▉ | 2958/4286 [22:17:51<9:06:01, 24.67s/it] {'loss': 0.0403, 'grad_norm': 8.109308641977318, 'learning_rate': 3.0984601026598223e-07, 'completion_length': 315.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6852679252624512, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6316965222358704, 'reward_std': 0.22305290400981903, 'kl': 1.0078125, 'epoch': 0.69} 69%|██████▉ | 2958/4286 [22:17:51<9:06:01, 24.67s/it] 69%|██████▉ | 2959/4286 [22:18:17<9:11:28, 24.94s/it] {'loss': 0.0335, 'grad_norm': 6.169754197687433, 'learning_rate': 3.096126924871675e-07, 'completion_length': 319.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8020834028720856, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.10005595907568932, 'kl': 0.8359375, 'epoch': 0.69} 69%|██████▉ | 2959/4286 [22:18:17<9:11:28, 24.94s/it] 69%|██████▉ | 2960/4286 [22:18:43<9:16:07, 25.16s/it] {'loss': 0.0162, 'grad_norm': 6.6094392408395235, 'learning_rate': 3.0937937470835273e-07, 'completion_length': 336.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 1.0, 'reward': 1.799107313156128, 'reward_std': 0.08451959118247032, 'kl': 0.404296875, 'epoch': 0.69} 69%|██████▉ | 2960/4286 [22:18:43<9:16:07, 25.16s/it] 69%|██████▉ | 2961/4286 [22:19:08<9:19:52, 25.35s/it] {'loss': 0.0134, 'grad_norm': 1.8827991533926003, 'learning_rate': 3.09146056929538e-07, 'completion_length': 294.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7123140096664429, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6765998601913452, 'reward_std': 0.10778423026204109, 'kl': 0.3349609375, 'epoch': 0.69} 69%|██████▉ | 2961/4286 [22:19:08<9:19:52, 25.35s/it] 69%|██████▉ | 2962/4286 [22:19:33<9:17:01, 25.24s/it] {'loss': 0.0265, 'grad_norm': 5.722926197622731, 'learning_rate': 3.0891273915072323e-07, 'completion_length': 292.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.08701667934656143, 'kl': 0.6611328125, 'epoch': 0.69} 69%|██████▉ | 2962/4286 [22:19:33<9:17:01, 25.24s/it] 69%|██████▉ | 2963/4286 [22:20:00<9:23:28, 25.55s/it] {'loss': 0.0077, 'grad_norm': 21.872541400218964, 'learning_rate': 3.086794213719085e-07, 
'completion_length': 334.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.031603576615452766, 'kl': 0.19140625, 'epoch': 0.69} 69%|██████▉ | 2963/4286 [22:20:00<9:23:28, 25.55s/it] 69%|██████▉ | 2964/4286 [22:20:24<9:15:27, 25.21s/it] {'loss': 0.0124, 'grad_norm': 1.3879825124721465, 'learning_rate': 3.084461035930938e-07, 'completion_length': 303.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 1.0, 'reward': 1.7633929252624512, 'reward_std': 0.008928571827709675, 'kl': 0.30859375, 'epoch': 0.69} 69%|██████▉ | 2964/4286 [22:20:24<9:15:27, 25.21s/it][2025-03-03 13:18:12,259] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 69%|██████▉ | 2965/4286 [22:20:49<9:15:54, 25.25s/it] {'loss': 0.0128, 'grad_norm': 3.4004275493722314, 'learning_rate': 3.08212785814279e-07, 'completion_length': 301.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8065476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8065477013587952, 'reward_std': 0.01785714365541935, 'kl': 0.320068359375, 'epoch': 0.69} 69%|██████▉ | 2965/4286 [22:20:49<9:15:54, 25.25s/it] 69%|██████▉ | 2966/4286 [22:21:13<9:06:34, 24.84s/it] {'loss': 0.0057, 'grad_norm': 3.144896140610251, 'learning_rate': 3.079794680354643e-07, 'completion_length': 329.64288330078125, 'rewards/only_full_func_accuracy_reward': 0.8080357611179352, 'rewards/format_reward': 1.0, 'reward': 1.8080357909202576, 'reward_std': 0.026785715483129025, 'kl': 0.1435546875, 'epoch': 0.69} 69%|██████▉ | 2966/4286 [22:21:13<9:06:34, 24.84s/it] 69%|██████▉ | 2967/4286 [22:21:38<9:07:13, 24.89s/it] {'loss': 0.0102, 'grad_norm': 2.2506119043539794, 'learning_rate': 3.077461502566495e-07, 'completion_length': 320.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.8705357313156128, 'rewards/format_reward': 1.0, 'reward': 1.8705357909202576, 'reward_std': 0.0327380932867527, 'kl': 0.2548828125, 'epoch': 0.69} 69%|██████▉ | 2967/4286 [22:21:38<9:07:13, 24.89s/it] 69%|██████▉ | 2968/4286 [22:22:03<9:09:02, 24.99s/it] {'loss': 0.0041, 'grad_norm': 11.909557072875467, 'learning_rate': 3.0751283247783477e-07, 'completion_length': 323.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8333334028720856, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.02380952052772045, 'kl': 0.1021728515625, 'epoch': 0.69} 69%|██████▉ | 2968/4286 [22:22:03<9:09:02, 24.99s/it] 69%|██████▉ | 2969/4286 [22:22:28<9:02:19, 24.71s/it] {'loss': 0.0072, 'grad_norm': 8.494255761626796, 'learning_rate': 3.0727951469902005e-07, 'completion_length': 297.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.03663512319326401, 'kl': 0.1806640625, 'epoch': 0.69} 69%|██████▉ | 2969/4286 [22:22:28<9:02:19, 24.71s/it] 69%|██████▉ | 2970/4286 [22:22:51<8:54:15, 24.36s/it] {'loss': 0.011, 'grad_norm': 2.6953243447173554, 'learning_rate': 3.0704619692020527e-07, 'completion_length': 289.875, 'rewards/only_full_func_accuracy_reward': 
0.8601190745830536, 'rewards/format_reward': 1.0, 'reward': 1.8601192235946655, 'reward_std': 0.055456096306443214, 'kl': 0.27392578125, 'epoch': 0.69} 69%|██████▉ | 2970/4286 [22:22:51<8:54:15, 24.36s/it] 69%|██████▉ | 2971/4286 [22:23:16<9:00:27, 24.66s/it] {'loss': 0.0467, 'grad_norm': 8.299964213455093, 'learning_rate': 3.0681287914139054e-07, 'completion_length': 345.2143096923828, 'rewards/only_full_func_accuracy_reward': 0.7857142686843872, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7500001788139343, 'reward_std': 0.11800234392285347, 'kl': 1.166015625, 'epoch': 0.69} 69%|██████▉ | 2971/4286 [22:23:16<9:00:27, 24.66s/it] 69%|██████▉ | 2972/4286 [22:23:41<9:02:15, 24.76s/it] {'loss': 0.0256, 'grad_norm': 11.210167810339623, 'learning_rate': 3.0657956136257577e-07, 'completion_length': 284.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.5866071879863739, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.568750023841858, 'reward_std': 0.10059523582458496, 'kl': 0.6416015625, 'epoch': 0.69} 69%|██████▉ | 2972/4286 [22:23:41<9:02:15, 24.76s/it] 69%|██████▉ | 2973/4286 [22:24:04<8:50:22, 24.24s/it] {'loss': 0.0032, 'grad_norm': 4.001415489635007, 'learning_rate': 3.0634624358376104e-07, 'completion_length': 232.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.056333938613533974, 'kl': 0.079345703125, 'epoch': 0.69} 69%|██████▉ | 2973/4286 [22:24:04<8:50:22, 24.24s/it] 69%|██████▉ | 2974/4286 [22:24:30<8:55:27, 24.49s/it] {'loss': 0.0188, 'grad_norm': 13.03986449147373, 'learning_rate': 3.061129258049463e-07, 'completion_length': 325.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.6949405074119568, 'rewards/format_reward': 1.0, 'reward': 1.6949406266212463, 'reward_std': 0.04900030232965946, 'kl': 0.46875, 'epoch': 0.69} 69%|██████▉ | 2974/4286 [22:24:30<8:55:27, 24.49s/it] 69%|██████▉ | 2975/4286 [22:24:54<8:54:01, 24.44s/it] {'loss': 0.011, 'grad_norm': 2.407968792343909, 'learning_rate': 3.0587960802613154e-07, 'completion_length': 302.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.74702388048172, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.03160357475280762, 'kl': 0.27490234375, 'epoch': 0.69} 69%|██████▉ | 2975/4286 [22:24:54<8:54:01, 24.44s/it] 69%|██████▉ | 2976/4286 [22:25:17<8:47:15, 24.15s/it] {'loss': 0.0029, 'grad_norm': 6.229091714543461, 'learning_rate': 3.056462902473168e-07, 'completion_length': 281.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.8288690745830536, 'rewards/format_reward': 1.0, 'reward': 1.828869104385376, 'reward_std': 0.044642859138548374, 'kl': 0.0731201171875, 'epoch': 0.69} 69%|██████▉ | 2976/4286 [22:25:17<8:47:15, 24.15s/it] 69%|██████▉ | 2977/4286 [22:25:42<8:50:23, 24.31s/it] {'loss': 0.0192, 'grad_norm': 4.040825653155666, 'learning_rate': 3.0541297246850204e-07, 'completion_length': 314.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.026785715483129025, 'kl': 0.48046875, 'epoch': 0.69} 69%|██████▉ | 2977/4286 [22:25:42<8:50:23, 24.31s/it] 69%|██████▉ | 2978/4286 [22:26:06<8:48:41, 24.25s/it] {'loss': 0.0108, 'grad_norm': 1.8605088065675355, 'learning_rate': 3.051796546896873e-07, 'completion_length': 300.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7738096117973328, 'rewards/format_reward': 1.0, 'reward': 1.7738096117973328, 
'reward_std': 0.046756881289184093, 'kl': 0.26953125, 'epoch': 0.69} 69%|██████▉ | 2978/4286 [22:26:06<8:48:41, 24.25s/it] 70%|██████▉ | 2979/4286 [22:26:31<8:51:00, 24.38s/it] {'loss': 0.0168, 'grad_norm': 15.785653927799668, 'learning_rate': 3.049463369108726e-07, 'completion_length': 306.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7299107909202576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7120537161827087, 'reward_std': 0.09148097038269043, 'kl': 0.42041015625, 'epoch': 0.7} 70%|██████▉ | 2979/4286 [22:26:31<8:51:00, 24.38s/it] 70%|██████▉ | 2980/4286 [22:26:56<8:58:29, 24.74s/it] {'loss': 0.0084, 'grad_norm': 8.208044477843293, 'learning_rate': 3.047130191320578e-07, 'completion_length': 351.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7306548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7306548357009888, 'reward_std': 0.0712148966267705, 'kl': 0.20947265625, 'epoch': 0.7} 70%|██████▉ | 2980/4286 [22:26:56<8:58:29, 24.74s/it] 70%|██████▉ | 2981/4286 [22:27:22<9:06:27, 25.12s/it] {'loss': 0.0075, 'grad_norm': 5.9473233446360645, 'learning_rate': 3.044797013532431e-07, 'completion_length': 329.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7056548297405243, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.687797725200653, 'reward_std': 0.07426993921399117, 'kl': 0.188720703125, 'epoch': 0.7} 70%|██████▉ | 2981/4286 [22:27:22<9:06:27, 25.12s/it] 70%|██████▉ | 2982/4286 [22:27:46<8:58:07, 24.76s/it] {'loss': 0.0118, 'grad_norm': 21.86114551186556, 'learning_rate': 3.0424638357442836e-07, 'completion_length': 254.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.7296769320964813, 'rewards/format_reward': 1.0, 'reward': 1.7296769618988037, 'reward_std': 0.08559125289320946, 'kl': 0.29541015625, 'epoch': 0.7} 70%|██████▉ | 2982/4286 [22:27:46<8:58:07, 24.76s/it] 70%|██████▉ | 2983/4286 [22:28:13<9:09:54, 25.32s/it] {'loss': 0.0472, 'grad_norm': 5.719064776764524, 'learning_rate': 3.040130657956136e-07, 'completion_length': 336.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6712391972541809, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6355249881744385, 'reward_std': 0.1333819329738617, 'kl': 1.177734375, 'epoch': 0.7} 70%|██████▉ | 2983/4286 [22:28:13<9:09:54, 25.32s/it] 70%|██████▉ | 2984/4286 [22:28:39<9:13:11, 25.49s/it] {'loss': 0.0521, 'grad_norm': 4.062995988023478, 'learning_rate': 3.0377974801679886e-07, 'completion_length': 351.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.732142984867096, 'reward_std': 0.17083177715539932, 'kl': 1.306640625, 'epoch': 0.7} 70%|██████▉ | 2984/4286 [22:28:39<9:13:11, 25.49s/it] 70%|██████▉ | 2985/4286 [22:29:05<9:16:06, 25.65s/it] {'loss': 0.0073, 'grad_norm': 23.795607017052703, 'learning_rate': 3.035464302379841e-07, 'completion_length': 326.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.846726268529892, 'rewards/format_reward': 1.0, 'reward': 1.8467262983322144, 'reward_std': 0.014880955684930086, 'kl': 0.18017578125, 'epoch': 0.7} 70%|██████▉ | 2985/4286 [22:29:05<9:16:06, 25.65s/it] 70%|██████▉ | 2986/4286 [22:29:30<9:14:47, 25.61s/it] {'loss': 0.0076, 'grad_norm': 1.5080798110321219, 'learning_rate': 3.0331311245916935e-07, 'completion_length': 310.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.8095239102840424, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.03847679682075977, 'kl': 0.190185546875, 
'epoch': 0.7} 70%|██████▉ | 2986/4286 [22:29:30<9:14:47, 25.61s/it] 70%|██████▉ | 2987/4286 [22:29:55<9:05:17, 25.19s/it] {'loss': 0.0219, 'grad_norm': 10.247316095925276, 'learning_rate': 3.0307979468035463e-07, 'completion_length': 282.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.05952381668612361, 'kl': 0.546875, 'epoch': 0.7} 70%|██████▉ | 2987/4286 [22:29:55<9:05:17, 25.19s/it] 70%|██████▉ | 2988/4286 [22:30:19<9:00:36, 24.99s/it] {'loss': 0.0134, 'grad_norm': 12.295062037306858, 'learning_rate': 3.0284647690153985e-07, 'completion_length': 322.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6875000596046448, 'reward_std': 0.10067614167928696, 'kl': 0.3349609375, 'epoch': 0.7} 70%|██████▉ | 2988/4286 [22:30:19<9:00:36, 24.99s/it] 70%|██████▉ | 2989/4286 [22:30:44<8:57:56, 24.89s/it] {'loss': 0.0225, 'grad_norm': 5.170449299792907, 'learning_rate': 3.0261315912272513e-07, 'completion_length': 321.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547619700431824, 'reward_std': 0.1109699197113514, 'kl': 0.564453125, 'epoch': 0.7} 70%|██████▉ | 2989/4286 [22:30:44<8:57:56, 24.89s/it] 70%|██████▉ | 2990/4286 [22:31:08<8:52:28, 24.65s/it] {'loss': 0.0496, 'grad_norm': 21.272991744752762, 'learning_rate': 3.0237984134391035e-07, 'completion_length': 297.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.7068453133106232, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.688988208770752, 'reward_std': 0.11768607050180435, 'kl': 1.240234375, 'epoch': 0.7} 70%|██████▉ | 2990/4286 [22:31:08<8:52:28, 24.65s/it] 70%|██████▉ | 2991/4286 [22:31:32<8:52:15, 24.66s/it] {'loss': 0.0215, 'grad_norm': 11.155751218181406, 'learning_rate': 3.021465235650956e-07, 'completion_length': 320.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.6654762327671051, 'rewards/format_reward': 1.0, 'reward': 1.6654763221740723, 'reward_std': 0.048886168748140335, 'kl': 0.5361328125, 'epoch': 0.7} 70%|██████▉ | 2991/4286 [22:31:33<8:52:15, 24.66s/it] 70%|██████▉ | 2992/4286 [22:31:57<8:52:19, 24.68s/it] {'loss': 0.0038, 'grad_norm': 15.274550644584984, 'learning_rate': 3.019132057862809e-07, 'completion_length': 302.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7425595223903656, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.03961131535470486, 'kl': 0.09619140625, 'epoch': 0.7} 70%|██████▉ | 2992/4286 [22:31:57<8:52:19, 24.68s/it] 70%|██████▉ | 2993/4286 [22:32:23<8:55:42, 24.86s/it] {'loss': 0.0171, 'grad_norm': 46.581914019365385, 'learning_rate': 3.016798880074661e-07, 'completion_length': 335.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8154761791229248, 'rewards/format_reward': 1.0, 'reward': 1.8154763579368591, 'reward_std': 0.044741734862327576, 'kl': 0.427734375, 'epoch': 0.7} 70%|██████▉ | 2993/4286 [22:32:23<8:55:42, 24.86s/it] 70%|██████▉ | 2994/4286 [22:32:47<8:51:23, 24.68s/it] {'loss': 0.0238, 'grad_norm': 2.934648970170061, 'learning_rate': 3.014465702286514e-07, 'completion_length': 304.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.9047619700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8869048357009888, 'reward_std': 0.0714285746216774, 'kl': 0.59521484375, 'epoch': 0.7} 70%|██████▉ | 2994/4286 [22:32:47<8:51:23, 
24.68s/it]
70%|██████▉ | 2995/4286 [22:33:12<8:56:02, 24.91s/it] {'loss': 0.0656, 'grad_norm': 10.03088788917707, 'learning_rate': 3.012132524498366e-07, 'completion_length': 291.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7536140084266663, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7000426054000854, 'reward_std': 0.09362965077161789, 'kl': 1.640625, 'epoch': 0.7}
70%|██████▉ | 2996/4286 [22:33:36<8:51:01, 24.70s/it] {'loss': 0.0047, 'grad_norm': 2.0161165550313416, 'learning_rate': 3.009799346710219e-07, 'completion_length': 290.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976192235946655, 'reward_std': 0.011904759332537651, 'kl': 0.11767578125, 'epoch': 0.7}
[2025-03-03 13:31:26,265] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
70%|██████▉ | 2997/4286 [22:34:03<9:04:58, 25.37s/it] {'loss': 0.013, 'grad_norm': 3.2841061688970647, 'learning_rate': 3.0074661689220717e-07, 'completion_length': 327.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.816468358039856, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7807540893554688, 'reward_std': 0.10295330174267292, 'kl': 0.322509765625, 'epoch': 0.7}
70%|██████▉ | 2998/4286 [22:34:29<9:04:33, 25.37s/it] {'loss': 0.0049, 'grad_norm': 1.7814263688745557, 'learning_rate': 3.005132991133924e-07, 'completion_length': 325.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6979167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.03273809980601072, 'kl': 0.121826171875, 'epoch': 0.7}
70%|██████▉ | 2999/4286 [22:34:54<9:05:36, 25.44s/it] {'loss': 0.0081, 'grad_norm': 5.701977428695484, 'learning_rate': 3.0027998133457767e-07, 'completion_length': 325.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.8556973338127136, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.83784019947052, 'reward_std': 0.07077508419752121, 'kl': 0.20263671875, 'epoch': 0.7}
70%|██████▉ | 3000/4286 [22:35:20<9:07:01, 25.52s/it] {'loss': 0.004, 'grad_norm': 18.674106875238746, 'learning_rate': 3.000466635557629e-07, 'completion_length': 326.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.730654776096344, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.026785717345774174, 'kl': 0.100830078125, 'epoch': 0.7}
70%|███████ | 3001/4286 [22:38:52<29:02:56, 81.38s/it] {'loss': 0.0074, 'grad_norm': 1.1327313778509298, 'learning_rate': 2.9981334577694816e-07, 'completion_length': 305.2678756713867, 'rewards/only_full_func_accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7946429252624512, 'reward_std': 0.02473035827279091, 'kl': 0.185302734375, 'epoch': 0.7}
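[editor's note] The stage3.py allocator-cache warning a few entries above recurs throughout this log. A minimal sketch of the remedy it suggests — flushing the allocator cache on all ranks at the same step — is below. Only get_accelerator().empty_cache() comes from the warning itself; model_engine, train_dataloader, and flush_every are hypothetical stand-ins, not taken from this run's actual training script.

```python
# Sketch only: model_engine and train_dataloader are hypothetical stand-ins
# for the real DeepSpeed engine and dataloader used in this run.
from deepspeed.accelerator import get_accelerator

flush_every = 50  # assumed interval; tune against the observed memory pressure

for step, batch in enumerate(train_dataloader):
    loss = model_engine(batch)    # forward pass under ZeRO stage 3
    model_engine.backward(loss)   # backward
    model_engine.step()           # optimizer step
    # Flush at the same step on every rank, so no rank stalls waiting on
    # peers that flushed at a different time (the warning's concern).
    if step % flush_every == 0:
        get_accelerator().empty_cache()
```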
3002/4286 [22:39:18<23:07:15, 64.83s/it] {'loss': 0.013, 'grad_norm': 7.816794946391093, 'learning_rate': 2.9958002799813344e-07, 'completion_length': 345.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.784226268529892, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.06593661196529865, 'kl': 0.3251953125, 'epoch': 0.7} 70%|███████ | 3002/4286 [22:39:18<23:07:15, 64.83s/it] 70%|███████ | 3003/4286 [22:39:43<18:53:15, 53.00s/it] {'loss': 0.0126, 'grad_norm': 8.204324803591561, 'learning_rate': 2.9934671021931866e-07, 'completion_length': 298.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 1.0, 'reward': 1.7574405670166016, 'reward_std': 0.14104853197932243, 'kl': 0.314453125, 'epoch': 0.7} 70%|███████ | 3003/4286 [22:39:43<18:53:15, 53.00s/it] 70%|███████ | 3004/4286 [22:40:08<15:50:37, 44.49s/it] {'loss': 0.0207, 'grad_norm': 2.384567541798017, 'learning_rate': 2.9911339244050394e-07, 'completion_length': 322.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.02380952425301075, 'kl': 0.5166015625, 'epoch': 0.7} 70%|███████ | 3004/4286 [22:40:08<15:50:37, 44.49s/it] 70%|███████ | 3005/4286 [22:40:34<13:50:13, 38.89s/it] {'loss': 0.0051, 'grad_norm': 3.69832693107406, 'learning_rate': 2.988800746616892e-07, 'completion_length': 348.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.758928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.04510328220203519, 'kl': 0.128173828125, 'epoch': 0.7} 70%|███████ | 3005/4286 [22:40:34<13:50:13, 38.89s/it] 70%|███████ | 3006/4286 [22:40:59<12:22:23, 34.80s/it] {'loss': 0.0019, 'grad_norm': 2.83274862254982, 'learning_rate': 2.9864675688287443e-07, 'completion_length': 276.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7648810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.05357143096625805, 'kl': 0.047607421875, 'epoch': 0.7} 70%|███████ | 3006/4286 [22:40:59<12:22:23, 34.80s/it] 70%|███████ | 3007/4286 [22:41:25<11:23:12, 32.05s/it] {'loss': 0.016, 'grad_norm': 4.833895319235399, 'learning_rate': 2.984134391040597e-07, 'completion_length': 310.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7030187547206879, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6851617097854614, 'reward_std': 0.059116171672940254, 'kl': 0.39892578125, 'epoch': 0.7} 70%|███████ | 3007/4286 [22:41:25<11:23:12, 32.05s/it] 70%|███████ | 3008/4286 [22:41:50<10:42:12, 30.15s/it] {'loss': 0.0128, 'grad_norm': 4.677448107428107, 'learning_rate': 2.9818012132524493e-07, 'completion_length': 325.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.9136905372142792, 'rewards/format_reward': 1.0, 'reward': 1.9136906266212463, 'reward_std': 0.01785714365541935, 'kl': 0.3192138671875, 'epoch': 0.7} 70%|███████ | 3008/4286 [22:41:50<10:42:12, 30.15s/it] 70%|███████ | 3009/4286 [22:42:16<10:14:45, 28.88s/it] {'loss': 0.0154, 'grad_norm': 7.272039708609499, 'learning_rate': 2.979468035464302e-07, 'completion_length': 325.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.700892835855484, 'rewards/format_reward': 1.0, 'reward': 1.700892984867096, 'reward_std': 0.008928571827709675, 'kl': 0.3857421875, 'epoch': 0.7} 70%|███████ | 3009/4286 [22:42:16<10:14:45, 28.88s/it] 70%|███████ | 3010/4286 [22:42:42<9:54:39, 27.96s/it] {'loss': 0.0032, 'grad_norm': 1.1318486764455888, 
'learning_rate': 2.977134857676155e-07, 'completion_length': 312.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.8690476417541504, 'rewards/format_reward': 1.0, 'reward': 1.8690477013587952, 'reward_std': 0.02976190112531185, 'kl': 0.07958984375, 'epoch': 0.7} 70%|███████ | 3010/4286 [22:42:42<9:54:39, 27.96s/it] 70%|███████ | 3011/4286 [22:43:07<9:37:28, 27.18s/it] {'loss': 0.0047, 'grad_norm': 4.8093817451031065, 'learning_rate': 2.974801679888007e-07, 'completion_length': 348.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.8125000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8125001788139343, 'reward_std': 0.04761904664337635, 'kl': 0.1171875, 'epoch': 0.7} 70%|███████ | 3011/4286 [22:43:07<9:37:28, 27.18s/it][2025-03-03 13:40:55,463] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 70%|███████ | 3012/4286 [22:43:33<9:23:28, 26.54s/it] {'loss': 0.0018, 'grad_norm': 2.970393410562773, 'learning_rate': 2.97246850209986e-07, 'completion_length': 263.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.8199405074119568, 'rewards/format_reward': 1.0, 'reward': 1.8199405670166016, 'reward_std': 0.008928571827709675, 'kl': 0.0460205078125, 'epoch': 0.7} 70%|███████ | 3012/4286 [22:43:33<9:23:28, 26.54s/it] 70%|███████ | 3013/4286 [22:43:56<9:03:52, 25.63s/it] {'loss': 0.004, 'grad_norm': 12.867013859616746, 'learning_rate': 2.970135324311712e-07, 'completion_length': 261.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.034579766914248466, 'kl': 0.1005859375, 'epoch': 0.7} 70%|███████ | 3013/4286 [22:43:56<9:03:52, 25.63s/it] 70%|███████ | 3014/4286 [22:44:21<9:01:02, 25.52s/it] {'loss': 0.0026, 'grad_norm': 3.1369707954688253, 'learning_rate': 2.967802146523565e-07, 'completion_length': 298.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 1.0, 'reward': 1.7098215222358704, 'reward_std': 0.05495268478989601, 'kl': 0.064453125, 'epoch': 0.7} 70%|███████ | 3014/4286 [22:44:21<9:01:02, 25.52s/it] 70%|███████ | 3015/4286 [22:44:47<8:58:51, 25.44s/it] {'loss': 0.018, 'grad_norm': 24.56400725359572, 'learning_rate': 2.9654689687354175e-07, 'completion_length': 326.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.07327023893594742, 'kl': 0.448486328125, 'epoch': 0.7} 70%|███████ | 3015/4286 [22:44:47<8:58:51, 25.44s/it] 70%|███████ | 3016/4286 [22:45:12<8:57:13, 25.38s/it] {'loss': 0.0101, 'grad_norm': 6.104397543171644, 'learning_rate': 2.96313579094727e-07, 'completion_length': 265.1428756713867, 'rewards/only_full_func_accuracy_reward': 0.8189935386180878, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.783279299736023, 'reward_std': 0.14772728085517883, 'kl': 0.2537841796875, 'epoch': 0.7} 70%|███████ | 3016/4286 [22:45:12<8:57:13, 25.38s/it] 70%|███████ | 3017/4286 [22:45:38<9:04:57, 25.77s/it] {'loss': 0.0165, 'grad_norm': 24.96416078600859, 'learning_rate': 2.9608026131591225e-07, 'completion_length': 
349.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7114178240299225, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6757035851478577, 'reward_std': 0.17208800464868546, 'kl': 0.41357421875, 'epoch': 0.7} 70%|███████ | 3017/4286 [22:45:38<9:04:57, 25.77s/it] 70%|███████ | 3018/4286 [22:46:04<9:03:41, 25.73s/it] {'loss': 0.0029, 'grad_norm': 3.068610722239429, 'learning_rate': 2.9584694353709747e-07, 'completion_length': 318.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.7232143878936768, 'reward_std': 0.06596966460347176, 'kl': 0.071533203125, 'epoch': 0.7} 70%|███████ | 3018/4286 [22:46:04<9:03:41, 25.73s/it] 70%|███████ | 3019/4286 [22:46:31<9:10:00, 26.05s/it] {'loss': 0.0097, 'grad_norm': 6.042093682917633, 'learning_rate': 2.9561362575828275e-07, 'completion_length': 332.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.8065476417541504, 'rewards/format_reward': 1.0, 'reward': 1.8065477013587952, 'reward_std': 0.026572031434625387, 'kl': 0.2423095703125, 'epoch': 0.7} 70%|███████ | 3019/4286 [22:46:31<9:10:00, 26.05s/it] 70%|███████ | 3020/4286 [22:46:57<9:08:47, 26.01s/it] {'loss': 0.0134, 'grad_norm': 2.4514135385830507, 'learning_rate': 2.95380307979468e-07, 'completion_length': 324.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6895833313465118, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6717263460159302, 'reward_std': 0.05276655498892069, 'kl': 0.33544921875, 'epoch': 0.7} 70%|███████ | 3020/4286 [22:46:57<9:08:47, 26.01s/it] 70%|███████ | 3021/4286 [22:47:23<9:09:23, 26.06s/it] {'loss': 0.0226, 'grad_norm': 3.0923833064773034, 'learning_rate': 2.9514699020065324e-07, 'completion_length': 312.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.5773810148239136, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5416668057441711, 'reward_std': 0.16138524282723665, 'kl': 0.564453125, 'epoch': 0.7} 70%|███████ | 3021/4286 [22:47:23<9:09:23, 26.06s/it] 71%|███████ | 3022/4286 [22:47:49<9:07:37, 25.99s/it] {'loss': 0.012, 'grad_norm': 31.275854873538645, 'learning_rate': 2.949136724218385e-07, 'completion_length': 337.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.038476791232824326, 'kl': 0.301025390625, 'epoch': 0.71} 71%|███████ | 3022/4286 [22:47:49<9:07:37, 25.99s/it] 71%|███████ | 3023/4286 [22:48:15<9:08:01, 26.03s/it] {'loss': 0.0106, 'grad_norm': 4.195063852669296, 'learning_rate': 2.9468035464302374e-07, 'completion_length': 282.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.035714288242161274, 'kl': 0.2650146484375, 'epoch': 0.71} 71%|███████ | 3023/4286 [22:48:15<9:08:01, 26.03s/it] 71%|███████ | 3024/4286 [22:48:40<9:03:12, 25.83s/it] {'loss': 0.0099, 'grad_norm': 10.11454046706391, 'learning_rate': 2.94447036864209e-07, 'completion_length': 330.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.5550595968961716, 'rewards/format_reward': 1.0, 'reward': 1.5550596714019775, 'reward_std': 0.022929031401872635, 'kl': 0.2464599609375, 'epoch': 0.71} 71%|███████ | 3024/4286 [22:48:40<9:03:12, 25.83s/it] 71%|███████ | 3025/4286 [22:49:07<9:07:42, 26.06s/it] {'loss': 0.0107, 'grad_norm': 14.81867632245645, 'learning_rate': 2.942137190853943e-07, 'completion_length': 344.2143096923828, 'rewards/only_full_func_accuracy_reward': 
0.7425595819950104, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.0458451765589416, 'kl': 0.267578125, 'epoch': 0.71} 71%|███████ | 3025/4286 [22:49:07<9:07:42, 26.06s/it] 71%|███████ | 3026/4286 [22:49:32<9:00:42, 25.75s/it] {'loss': 0.0147, 'grad_norm': 4.840396585771794, 'learning_rate': 2.939804013065795e-07, 'completion_length': 317.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.059523805975914, 'kl': 0.3682861328125, 'epoch': 0.71} 71%|███████ | 3026/4286 [22:49:32<9:00:42, 25.75s/it] 71%|███████ | 3027/4286 [22:49:56<8:50:59, 25.31s/it] {'loss': 0.0046, 'grad_norm': 2.5599920410474075, 'learning_rate': 2.937470835277648e-07, 'completion_length': 285.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7976190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.02816697023808956, 'kl': 0.11474609375, 'epoch': 0.71} 71%|███████ | 3027/4286 [22:49:56<8:50:59, 25.31s/it] 71%|███████ | 3028/4286 [22:50:20<8:42:07, 24.90s/it] {'loss': 0.0278, 'grad_norm': 312.31366965847207, 'learning_rate': 2.9351376574895e-07, 'completion_length': 258.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.7785714566707611, 'rewards/format_reward': 1.0, 'reward': 1.778571605682373, 'reward_std': 0.08571427688002586, 'kl': 0.6953125, 'epoch': 0.71} 71%|███████ | 3028/4286 [22:50:20<8:42:07, 24.90s/it] 71%|███████ | 3029/4286 [22:50:45<8:40:55, 24.86s/it] {'loss': 0.0157, 'grad_norm': 1.6620032110122211, 'learning_rate': 2.932804479701353e-07, 'completion_length': 315.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7636905312538147, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7458334565162659, 'reward_std': 0.08476953953504562, 'kl': 0.39306640625, 'epoch': 0.71} 71%|███████ | 3029/4286 [22:50:45<8:40:55, 24.86s/it] 71%|███████ | 3030/4286 [22:51:10<8:38:59, 24.79s/it] {'loss': 0.0044, 'grad_norm': 4.850393829651216, 'learning_rate': 2.9304713019132056e-07, 'completion_length': 270.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.8720238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8720239400863647, 'reward_std': 0.0297619067132473, 'kl': 0.109130859375, 'epoch': 0.71} 71%|███████ | 3030/4286 [22:51:10<8:38:59, 24.79s/it] 71%|███████ | 3031/4286 [22:51:36<8:47:27, 25.22s/it] {'loss': 0.027, 'grad_norm': 3.8708766872673603, 'learning_rate': 2.928138124125058e-07, 'completion_length': 345.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7770563662052155, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7413421869277954, 'reward_std': 0.12200507149100304, 'kl': 0.673828125, 'epoch': 0.71} 71%|███████ | 3031/4286 [22:51:36<8:47:27, 25.22s/it] 71%|███████ | 3032/4286 [22:52:02<8:51:22, 25.42s/it] {'loss': 0.0089, 'grad_norm': 2.5600830407592094, 'learning_rate': 2.9258049463369106e-07, 'completion_length': 332.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.8303571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8125001192092896, 'reward_std': 0.06664376333355904, 'kl': 0.2239990234375, 'epoch': 0.71} 71%|███████ | 3032/4286 [22:52:02<8:51:22, 25.42s/it] 71%|███████ | 3033/4286 [22:52:26<8:45:18, 25.15s/it] {'loss': 0.0028, 'grad_norm': 0.4054576578787018, 'learning_rate': 2.9234717685487633e-07, 'completion_length': 298.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.77976194024086, 'rewards/format_reward': 1.0, 'reward': 
1.7797619700431824, 'reward_std': 0.011904762126505375, 'kl': 0.0701904296875, 'epoch': 0.71} 71%|███████ | 3033/4286 [22:52:26<8:45:18, 25.15s/it] 71%|███████ | 3034/4286 [22:52:52<8:48:53, 25.35s/it] {'loss': 0.0022, 'grad_norm': 4.926037191291175, 'learning_rate': 2.9211385907606156e-07, 'completion_length': 310.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.0704943910241127, 'kl': 0.0556640625, 'epoch': 0.71} 71%|███████ | 3034/4286 [22:52:52<8:48:53, 25.35s/it] 71%|███████ | 3035/4286 [22:53:17<8:44:04, 25.14s/it] {'loss': 0.0086, 'grad_norm': 22.0647198612384, 'learning_rate': 2.9188054129724683e-07, 'completion_length': 324.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7056548297405243, 'rewards/format_reward': 1.0, 'reward': 1.7056548595428467, 'reward_std': 0.07343813125044107, 'kl': 0.214599609375, 'epoch': 0.71} 71%|███████ | 3035/4286 [22:53:17<8:44:04, 25.14s/it] 71%|███████ | 3036/4286 [22:53:42<8:42:30, 25.08s/it] {'loss': 0.0067, 'grad_norm': 6.672447765299394, 'learning_rate': 2.9164722351843205e-07, 'completion_length': 297.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7886905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7886906862258911, 'reward_std': 0.051055656746029854, 'kl': 0.1689453125, 'epoch': 0.71} 71%|███████ | 3036/4286 [22:53:42<8:42:30, 25.08s/it] 71%|███████ | 3037/4286 [22:54:07<8:45:51, 25.26s/it] {'loss': 0.0317, 'grad_norm': 7.647313748205215, 'learning_rate': 2.9141390573961733e-07, 'completion_length': 284.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.8020834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7842262983322144, 'reward_std': 0.0446428582072258, 'kl': 0.793701171875, 'epoch': 0.71} 71%|███████ | 3037/4286 [22:54:07<8:45:51, 25.26s/it] 71%|███████ | 3038/4286 [22:54:33<8:46:43, 25.32s/it] {'loss': 0.016, 'grad_norm': 46.3093097503734, 'learning_rate': 2.911805879608026e-07, 'completion_length': 277.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7708333134651184, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7529763579368591, 'reward_std': 0.15329349413514137, 'kl': 0.3974609375, 'epoch': 0.71} 71%|███████ | 3038/4286 [22:54:33<8:46:43, 25.32s/it] 71%|███████ | 3039/4286 [22:54:58<8:45:07, 25.27s/it] {'loss': 0.0085, 'grad_norm': 14.339627354456272, 'learning_rate': 2.909472701819878e-07, 'completion_length': 326.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.8586309850215912, 'rewards/format_reward': 1.0, 'reward': 1.8586311340332031, 'reward_std': 0.02417590469121933, 'kl': 0.212890625, 'epoch': 0.71} 71%|███████ | 3039/4286 [22:54:58<8:45:07, 25.27s/it] 71%|███████ | 3040/4286 [22:55:23<8:46:14, 25.34s/it] {'loss': 0.0108, 'grad_norm': 30.724057824991196, 'learning_rate': 2.907139524031731e-07, 'completion_length': 292.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.767857313156128, 'reward_std': 0.06388125941157341, 'kl': 0.26904296875, 'epoch': 0.71} 71%|███████ | 3040/4286 [22:55:23<8:46:14, 25.34s/it] 71%|███████ | 3041/4286 [22:55:49<8:47:54, 25.44s/it] {'loss': 0.0161, 'grad_norm': 3.7515628880749103, 'learning_rate': 2.904806346243583e-07, 'completion_length': 346.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.691964328289032, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.07833484560251236, 'kl': 
0.40087890625, 'epoch': 0.71} 71%|███████ | 3041/4286 [22:55:49<8:47:54, 25.44s/it] 71%|███████ | 3042/4286 [22:56:15<8:48:31, 25.49s/it] {'loss': 0.005, 'grad_norm': 0.5558182675751142, 'learning_rate': 2.902473168455436e-07, 'completion_length': 298.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.71726194024086, 'rewards/format_reward': 1.0, 'reward': 1.7172619700431824, 'reward_std': 0.010309826582670212, 'kl': 0.12548828125, 'epoch': 0.71} 71%|███████ | 3042/4286 [22:56:15<8:48:31, 25.49s/it] 71%|███████ | 3043/4286 [22:56:40<8:47:39, 25.47s/it] {'loss': 0.0043, 'grad_norm': 6.785289442761376, 'learning_rate': 2.900139990667289e-07, 'completion_length': 342.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7812500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7812501192092896, 'reward_std': 0.07029405236244202, 'kl': 0.1083984375, 'epoch': 0.71} 71%|███████ | 3043/4286 [22:56:40<8:47:39, 25.47s/it] 71%|███████ | 3044/4286 [22:57:05<8:44:07, 25.32s/it] {'loss': 0.0044, 'grad_norm': 0.9154932004075989, 'learning_rate': 2.897806812879141e-07, 'completion_length': 291.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.755952388048172, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.01785714365541935, 'kl': 0.109375, 'epoch': 0.71} 71%|███████ | 3044/4286 [22:57:05<8:44:07, 25.32s/it] 71%|███████ | 3045/4286 [22:57:30<8:41:43, 25.22s/it] {'loss': 0.0175, 'grad_norm': 10.487877546447939, 'learning_rate': 2.8954736350909937e-07, 'completion_length': 336.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7991072237491608, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7812500596046448, 'reward_std': 0.07029405329376459, 'kl': 0.4384765625, 'epoch': 0.71} 71%|███████ | 3045/4286 [22:57:30<8:41:43, 25.22s/it] 71%|███████ | 3046/4286 [22:57:56<8:42:39, 25.29s/it] {'loss': 0.0106, 'grad_norm': 28.132092914715646, 'learning_rate': 2.893140457302846e-07, 'completion_length': 329.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7591804265975952, 'rewards/format_reward': 1.0, 'reward': 1.75918048620224, 'reward_std': 0.06900755688548088, 'kl': 0.26416015625, 'epoch': 0.71} 71%|███████ | 3046/4286 [22:57:56<8:42:39, 25.29s/it][2025-03-03 13:55:45,011] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 71%|███████ | 3047/4286 [22:58:22<8:50:05, 25.67s/it] {'loss': 0.0171, 'grad_norm': 12.301121786688274, 'learning_rate': 2.8908072795146987e-07, 'completion_length': 303.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6625000834465027, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.644642949104309, 'reward_std': 0.08729754388332367, 'kl': 0.4287109375, 'epoch': 0.71} 71%|███████ | 3047/4286 [22:58:22<8:50:05, 25.67s/it] 71%|███████ | 3048/4286 [22:58:46<8:38:31, 25.13s/it] {'loss': 0.0158, 'grad_norm': 0.741177968374978, 'learning_rate': 2.8884741017265514e-07, 'completion_length': 250.08930206298828, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7113096714019775, 'reward_std': 0.05297181010246277, 'kl': 0.396240234375, 'epoch': 0.71} 71%|███████ | 3048/4286 [22:58:46<8:38:31, 25.13s/it] 71%|███████ | 3049/4286 [22:59:11<8:37:12, 25.09s/it] {'loss': 0.0211, 'grad_norm': 0.7512222343256407, 'learning_rate': 2.8861409239384037e-07, 'completion_length': 317.75, 'rewards/only_full_func_accuracy_reward': 0.8236607611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.805803656578064, 'reward_std': 0.08330097049474716, 'kl': 0.529541015625, 'epoch': 0.71} 71%|███████ | 3049/4286 [22:59:11<8:37:12, 25.09s/it] 71%|███████ | 3050/4286 [22:59:36<8:35:17, 25.01s/it] {'loss': 0.0066, 'grad_norm': 2.9930590062893843, 'learning_rate': 2.8838077461502564e-07, 'completion_length': 316.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7702381312847137, 'rewards/format_reward': 1.0, 'reward': 1.770238220691681, 'reward_std': 0.007142859045416117, 'kl': 0.164306640625, 'epoch': 0.71} 71%|███████ | 3050/4286 [22:59:36<8:35:17, 25.01s/it][2025-03-03 13:57:26,682] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 71%|███████ | 3051/4286 [23:00:04<8:53:08, 25.90s/it] {'loss': 0.0153, 'grad_norm': 14.254061073728234, 'learning_rate': 2.8814745683621086e-07, 'completion_length': 362.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7276785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7098215222358704, 'reward_std': 0.10696014948189259, 'kl': 0.38427734375, 'epoch': 0.71} 71%|███████ | 3051/4286 [23:00:04<8:53:08, 25.90s/it] 71%|███████ | 3052/4286 [23:00:29<8:48:15, 25.68s/it] {'loss': 0.0039, 'grad_norm': 0.8310184664623341, 'learning_rate': 2.8791413905739614e-07, 'completion_length': 299.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7297619581222534, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7119048833847046, 'reward_std': 0.06428571790456772, 'kl': 0.0965576171875, 'epoch': 0.71} 71%|███████ | 3052/4286 [23:00:29<8:48:15, 25.68s/it] 71%|███████ | 3053/4286 [23:00:54<8:41:35, 25.38s/it] {'loss': 0.0154, 'grad_norm': 8.43822582960956, 'learning_rate': 2.876808212785814e-07, 'completion_length': 314.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6761904954910278, 'rewards/format_reward': 1.0, 'reward': 1.6761905550956726, 'reward_std': 0.02706730365753174, 'kl': 0.384765625, 'epoch': 0.71} 71%|███████ | 3053/4286 [23:00:54<8:41:35, 25.38s/it] 71%|███████▏ | 3054/4286 [23:01:18<8:34:40, 25.07s/it] {'loss': 0.0216, 'grad_norm': 3.3570683318174055, 'learning_rate': 2.8744750349976664e-07, 'completion_length': 260.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6220238506793976, 'rewards/format_reward': 1.0, 'reward': 1.62202388048172, 'reward_std': 0.04350833594799042, 'kl': 0.5380859375, 'epoch': 0.71} 71%|███████▏ | 3054/4286 [23:01:18<8:34:40, 25.07s/it][2025-03-03 13:59:07,461] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 71%|███████▏ | 3055/4286 [23:01:45<8:43:41, 25.53s/it] {'loss': 0.0226, 'grad_norm': 3.4371915200018184, 'learning_rate': 2.872141857209519e-07, 'completion_length': 317.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.0476190485060215, 'kl': 0.56396484375, 'epoch': 0.71} 71%|███████▏ | 3055/4286 [23:01:45<8:43:41, 25.53s/it][2025-03-03 13:59:34,693] [WARNING] [stage3.py:2134:step] 3 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 71%|███████▏ | 3056/4286 [23:02:12<8:53:45, 26.04s/it] {'loss': 0.024, 'grad_norm': 4.118883421555415, 'learning_rate': 2.869808679421372e-07, 'completion_length': 293.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7370536029338837, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7191965579986572, 'reward_std': 0.1161056449636817, 'kl': 0.599609375, 'epoch': 0.71} 71%|███████▏ | 3056/4286 [23:02:12<8:53:45, 26.04s/it] 71%|███████▏ | 3057/4286 [23:02:37<8:51:18, 25.94s/it] {'loss': 0.0054, 'grad_norm': 2.066816043619621, 'learning_rate': 2.867475501633224e-07, 'completion_length': 292.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.586309552192688, 'rewards/format_reward': 1.0, 'reward': 1.5863096117973328, 'reward_std': 0.06781134381890297, 'kl': 0.1357421875, 'epoch': 0.71} 71%|███████▏ | 3057/4286 [23:02:37<8:51:18, 25.94s/it] 71%|███████▏ | 3058/4286 [23:03:02<8:42:46, 25.54s/it] {'loss': 0.0011, 'grad_norm': 4.722953156087487, 'learning_rate': 2.865142323845077e-07, 'completion_length': 300.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.0208333320915699, 'kl': 0.02862548828125, 'epoch': 0.71} 71%|███████▏ | 3058/4286 [23:03:02<8:42:46, 25.54s/it] 71%|███████▏ | 3059/4286 [23:03:27<8:35:26, 25.21s/it] {'loss': 0.0066, 'grad_norm': 2.6657441044275023, 'learning_rate': 2.862809146056929e-07, 'completion_length': 258.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.794642984867096, 'reward_std': 0.03411935269832611, 'kl': 0.16461181640625, 'epoch': 0.71} 71%|███████▏ | 3059/4286 [23:03:27<8:35:26, 25.21s/it] 71%|███████▏ | 3060/4286 [23:03:52<8:38:03, 25.35s/it] {'loss': 0.0104, 'grad_norm': 42.19162382246735, 'learning_rate': 2.860475968268782e-07, 'completion_length': 317.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.68452388048172, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.03571428172290325, 'kl': 0.25927734375, 'epoch': 0.71} 71%|███████▏ | 3060/4286 [23:03:52<8:38:03, 25.35s/it] 71%|███████▏ | 3061/4286 [23:04:17<8:34:46, 25.21s/it] {'loss': 0.0034, 'grad_norm': 8.247573075709154, 'learning_rate': 2.8581427904806346e-07, 'completion_length': 307.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.8348214626312256, 'rewards/format_reward': 1.0, 'reward': 1.8348215222358704, 'reward_std': 0.04304792359471321, 'kl': 0.08587646484375, 'epoch': 0.71} 71%|███████▏ | 3061/4286 [23:04:17<8:34:46, 25.21s/it] 71%|███████▏ | 3062/4286 [23:04:43<8:38:50, 25.43s/it] {'loss': 0.0021, 'grad_norm': 2.667301860724321, 'learning_rate': 2.855809612692487e-07, 'completion_length': 303.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.7333333790302277, 'rewards/format_reward': 1.0, 'reward': 1.73333340883255, 'reward_std': 0.00789672788232565, 'kl': 0.051513671875, 'epoch': 0.71} 71%|███████▏ | 3062/4286 [23:04:43<8:38:50, 25.43s/it] 71%|███████▏ | 3063/4286 [23:05:09<8:40:23, 25.53s/it] {'loss': 0.0261, 'grad_norm': 331.506413856183, 'learning_rate': 2.8534764349043395e-07, 'completion_length': 325.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.5853896886110306, 'rewards/format_reward': 0.9821428656578064, 'reward': 
1.5675325393676758, 'reward_std': 0.10270283836871386, 'kl': 0.64990234375, 'epoch': 0.71} 71%|███████▏ | 3063/4286 [23:05:09<8:40:23, 25.53s/it] 71%|███████▏ | 3064/4286 [23:05:35<8:41:34, 25.61s/it] {'loss': 0.0014, 'grad_norm': 4.5927404348512235, 'learning_rate': 2.851143257116192e-07, 'completion_length': 326.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7896826267242432, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7718254923820496, 'reward_std': 0.07975427620112896, 'kl': 0.0361328125, 'epoch': 0.71} 71%|███████▏ | 3064/4286 [23:05:35<8:41:34, 25.61s/it] 72%|███████▏ | 3065/4286 [23:06:00<8:39:36, 25.53s/it] {'loss': 0.0025, 'grad_norm': 0.17327731028788762, 'learning_rate': 2.8488100793280445e-07, 'completion_length': 274.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.8988096117973328, 'rewards/format_reward': 1.0, 'reward': 1.8988096117973328, 'reward_std': 0.0, 'kl': 0.063720703125, 'epoch': 0.72} 72%|███████▏ | 3065/4286 [23:06:00<8:39:36, 25.53s/it] 72%|███████▏ | 3066/4286 [23:06:26<8:40:34, 25.60s/it] {'loss': 0.0033, 'grad_norm': 83.86865542054456, 'learning_rate': 2.846476901539897e-07, 'completion_length': 325.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7380952537059784, 'rewards/format_reward': 1.0, 'reward': 1.7380954027175903, 'reward_std': 0.07811372727155685, 'kl': 0.082275390625, 'epoch': 0.72} 72%|███████▏ | 3066/4286 [23:06:26<8:40:34, 25.60s/it][2025-03-03 14:04:15,538] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 72%|███████▏ | 3067/4286 [23:06:53<8:48:02, 25.99s/it] {'loss': 0.007, 'grad_norm': 4.5707634508520885, 'learning_rate': 2.8441437237517495e-07, 'completion_length': 301.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.761904776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7440477013587952, 'reward_std': 0.13690475933253765, 'kl': 0.1748046875, 'epoch': 0.72} 72%|███████▏ | 3067/4286 [23:06:53<8:48:02, 25.99s/it][2025-03-03 14:04:43,587] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 72%|███████▏ | 3068/4286 [23:07:21<9:00:08, 26.61s/it] {'loss': 0.0123, 'grad_norm': 40.38326256725899, 'learning_rate': 2.841810545963602e-07, 'completion_length': 344.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7764137387275696, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7406995296478271, 'reward_std': 0.1197916679084301, 'kl': 0.306640625, 'epoch': 0.72} 72%|███████▏ | 3068/4286 [23:07:21<9:00:08, 26.61s/it] 72%|███████▏ | 3069/4286 [23:07:46<8:48:57, 26.08s/it] {'loss': 0.003, 'grad_norm': 4.658448559269792, 'learning_rate': 2.8394773681754545e-07, 'completion_length': 284.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.05197649821639061, 'kl': 0.074462890625, 'epoch': 0.72} 72%|███████▏ | 3069/4286 [23:07:46<8:48:57, 26.08s/it][2025-03-03 14:05:34,152] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 72%|███████▏ | 3070/4286 [23:08:11<8:46:21, 25.97s/it] {'loss': 0.0021, 'grad_norm': 6.108080233646815, 'learning_rate': 2.837144190387307e-07, 'completion_length': 304.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.8690476715564728, 'rewards/format_reward': 1.0, 'reward': 1.86904776096344, 'reward_std': 0.02816697023808956, 'kl': 0.05126953125, 'epoch': 0.72} 72%|███████▏ | 3070/4286 [23:08:11<8:46:21, 25.97s/it] 72%|███████▏ | 3071/4286 [23:08:37<8:46:42, 26.01s/it] {'loss': 0.0041, 'grad_norm': 1.6827156870382878, 'learning_rate': 2.83481101259916e-07, 'completion_length': 308.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7827381789684296, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.005952381528913975, 'kl': 0.103271484375, 'epoch': 0.72} 72%|███████▏ | 3071/4286 [23:08:37<8:46:42, 26.01s/it] 72%|███████▏ | 3072/4286 [23:09:03<8:43:43, 25.88s/it] {'loss': 0.0081, 'grad_norm': 9.313092977612394, 'learning_rate': 2.832477834811012e-07, 'completion_length': 243.08930206298828, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5922619700431824, 'reward_std': 0.0535714328289032, 'kl': 0.20263671875, 'epoch': 0.72} 72%|███████▏ | 3072/4286 [23:09:03<8:43:43, 25.88s/it] 72%|███████▏ | 3073/4286 [23:09:27<8:32:07, 25.33s/it] {'loss': 0.0094, 'grad_norm': 16.00953182073389, 'learning_rate': 2.830144657022865e-07, 'completion_length': 276.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6056548357009888, 'reward_std': 0.026785715483129025, 'kl': 0.233642578125, 'epoch': 0.72} 72%|███████▏ | 3073/4286 [23:09:27<8:32:07, 25.33s/it] 72%|███████▏ | 3074/4286 [23:09:53<8:34:07, 25.45s/it] {'loss': 0.0052, 'grad_norm': 2.044755525847137, 'learning_rate': 2.827811479234717e-07, 'completion_length': 306.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 
1.0, 'reward': 1.7321429252624512, 'reward_std': 0.011904764920473099, 'kl': 0.1287841796875, 'epoch': 0.72} 72%|███████▏ | 3074/4286 [23:09:53<8:34:07, 25.45s/it] 72%|███████▏ | 3075/4286 [23:10:18<8:35:25, 25.54s/it] {'loss': 0.0064, 'grad_norm': 0.8811369798458327, 'learning_rate': 2.82547830144657e-07, 'completion_length': 322.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8377977013587952, 'rewards/format_reward': 1.0, 'reward': 1.8377977013587952, 'reward_std': 0.044642859138548374, 'kl': 0.159423828125, 'epoch': 0.72} 72%|███████▏ | 3075/4286 [23:10:18<8:35:25, 25.54s/it] 72%|███████▏ | 3076/4286 [23:10:42<8:24:25, 25.01s/it] {'loss': 0.0324, 'grad_norm': 2.3690840650095106, 'learning_rate': 2.8231451236584227e-07, 'completion_length': 241.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.71726194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6994048953056335, 'reward_std': 0.0714285746216774, 'kl': 0.810546875, 'epoch': 0.72} 72%|███████▏ | 3076/4286 [23:10:42<8:24:25, 25.01s/it] 72%|███████▏ | 3077/4286 [23:11:07<8:22:09, 24.92s/it] {'loss': 0.0033, 'grad_norm': 0.6534699390046247, 'learning_rate': 2.820811945870275e-07, 'completion_length': 291.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.8824405074119568, 'rewards/format_reward': 1.0, 'reward': 1.8824405670166016, 'reward_std': 0.008928571827709675, 'kl': 0.08197021484375, 'epoch': 0.72} 72%|███████▏ | 3077/4286 [23:11:07<8:22:09, 24.92s/it][2025-03-03 14:08:54,283] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 72%|███████▏ | 3078/4286 [23:11:31<8:18:47, 24.77s/it] {'loss': 0.0013, 'grad_norm': 0.6173120131503577, 'learning_rate': 2.8184787680821276e-07, 'completion_length': 263.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7514881193637848, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.020833331160247326, 'kl': 0.033203125, 'epoch': 0.72} 72%|███████▏ | 3078/4286 [23:11:31<8:18:47, 24.77s/it] 72%|███████▏ | 3079/4286 [23:11:56<8:17:04, 24.71s/it] {'loss': 0.0074, 'grad_norm': 9.76243676053226, 'learning_rate': 2.8161455902939804e-07, 'completion_length': 289.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.766369104385376, 'rewards/format_reward': 1.0, 'reward': 1.7663691639900208, 'reward_std': 0.02463203202933073, 'kl': 0.183349609375, 'epoch': 0.72} 72%|███████▏ | 3079/4286 [23:11:56<8:17:04, 24.71s/it][2025-03-03 14:09:43,792] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 72%|███████▏ | 3080/4286 [23:12:21<8:18:06, 24.78s/it] {'loss': 0.0126, 'grad_norm': 5.559387616132551, 'learning_rate': 2.8138124125058326e-07, 'completion_length': 322.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7470238506793976, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.06228631176054478, 'kl': 0.315673828125, 'epoch': 0.72} 72%|███████▏ | 3080/4286 [23:12:21<8:18:06, 24.78s/it] 72%|███████▏ | 3081/4286 [23:12:47<8:28:34, 25.32s/it] {'loss': 0.0166, 'grad_norm': 6.812161934169021, 'learning_rate': 2.8114792347176854e-07, 'completion_length': 305.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.8645834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8467262983322144, 'reward_std': 0.1474165841937065, 'kl': 0.4150390625, 'epoch': 0.72} 72%|███████▏ | 3081/4286 [23:12:47<8:28:34, 25.32s/it] 72%|███████▏ | 3082/4286 [23:13:14<8:32:42, 25.55s/it] {'loss': 0.0027, 'grad_norm': 7.833631793107358, 'learning_rate': 2.8091460569295376e-07, 'completion_length': 351.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 1.0, 'reward': 1.7619049549102783, 'reward_std': 0.07695359364151955, 'kl': 0.067138671875, 'epoch': 0.72} 72%|███████▏ | 3082/4286 [23:13:14<8:32:42, 25.55s/it] 72%|███████▏ | 3083/4286 [23:13:39<8:30:14, 25.45s/it] {'loss': 0.0085, 'grad_norm': 3.0350014604848043, 'learning_rate': 2.8068128791413903e-07, 'completion_length': 311.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.7083334922790527, 'reward_std': 0.059523800387978554, 'kl': 0.2115478515625, 'epoch': 0.72} 72%|███████▏ | 3083/4286 [23:13:39<8:30:14, 25.45s/it] 72%|███████▏ | 3084/4286 [23:14:05<8:33:31, 25.63s/it] {'loss': 0.0035, 'grad_norm': 0.689333250417877, 'learning_rate': 2.804479701353243e-07, 'completion_length': 331.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7708333730697632, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.010309826582670212, 'kl': 0.0882568359375, 'epoch': 0.72} 72%|███████▏ | 3084/4286 [23:14:05<8:33:31, 25.63s/it] 72%|███████▏ | 3085/4286 [23:14:29<8:21:25, 25.05s/it] {'loss': 0.0264, 'grad_norm': 2.5535049511134984, 'learning_rate': 2.8021465235650953e-07, 'completion_length': 259.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8354166746139526, 'rewards/format_reward': 1.0, 'reward': 1.835416853427887, 'reward_std': 0.03205380588769913, 'kl': 0.6622314453125, 'epoch': 0.72} 72%|███████▏ | 3085/4286 [23:14:29<8:21:25, 25.05s/it] 72%|███████▏ | 3086/4286 [23:14:54<8:26:22, 25.32s/it] {'loss': 0.0032, 'grad_norm': 4.714278643042003, 'learning_rate': 2.799813345776948e-07, 'completion_length': 334.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.8511904776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8333333730697632, 'reward_std': 0.0714285746216774, 'kl': 0.080810546875, 'epoch': 0.72} 72%|███████▏ | 3086/4286 [23:14:54<8:26:22, 25.32s/it] 72%|███████▏ | 3087/4286 [23:15:19<8:21:05, 25.08s/it] {'loss': 0.0093, 'grad_norm': 5.443377080055217, 'learning_rate': 2.7974801679888003e-07, 'completion_length': 308.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.8214285671710968, 'rewards/format_reward': 1.0, 'reward': 
1.821428656578064, 'reward_std': 0.07586049847304821, 'kl': 0.232421875, 'epoch': 0.72} 72%|███████▏ | 3087/4286 [23:15:19<8:21:05, 25.08s/it][2025-03-03 14:13:07,644] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 72%|███████▏ | 3088/4286 [23:15:45<8:24:48, 25.28s/it] {'loss': 0.0026, 'grad_norm': 7.448904221540075, 'learning_rate': 2.795146990200653e-07, 'completion_length': 310.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7160714864730835, 'rewards/format_reward': 1.0, 'reward': 1.7160714864730835, 'reward_std': 0.04940120782703161, 'kl': 0.06494140625, 'epoch': 0.72} 72%|███████▏ | 3088/4286 [23:15:45<8:24:48, 25.28s/it] 72%|███████▏ | 3089/4286 [23:16:10<8:25:39, 25.35s/it] {'loss': 0.0033, 'grad_norm': 2.301479956856586, 'learning_rate': 2.792813812412506e-07, 'completion_length': 316.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.0173503290861845, 'kl': 0.0826416015625, 'epoch': 0.72} 72%|███████▏ | 3089/4286 [23:16:10<8:25:39, 25.35s/it] 72%|███████▏ | 3090/4286 [23:16:36<8:29:46, 25.57s/it] {'loss': 0.0039, 'grad_norm': 6.209785827883279, 'learning_rate': 2.790480634624358e-07, 'completion_length': 288.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7514881789684296, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.03709554113447666, 'kl': 0.09710693359375, 'epoch': 0.72} 72%|███████▏ | 3090/4286 [23:16:36<8:29:46, 25.57s/it] 72%|███████▏ | 3091/4286 [23:17:01<8:26:15, 25.42s/it] {'loss': 0.0032, 'grad_norm': 5.174719721827002, 'learning_rate': 2.788147456836211e-07, 'completion_length': 319.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.8526786863803864, 'rewards/format_reward': 1.0, 'reward': 1.8526787161827087, 'reward_std': 0.020833331160247326, 'kl': 0.079345703125, 'epoch': 0.72} 72%|███████▏ | 3091/4286 [23:17:01<8:26:15, 25.42s/it] 72%|███████▏ | 3092/4286 [23:17:27<8:25:57, 25.43s/it] {'loss': 0.0176, 'grad_norm': 5.875375469104045, 'learning_rate': 2.785814279048063e-07, 'completion_length': 311.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7098215222358704, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6919643878936768, 'reward_std': 0.11211701016873121, 'kl': 0.44140625, 'epoch': 0.72} 72%|███████▏ | 3092/4286 [23:17:27<8:25:57, 25.43s/it] 72%|███████▏ | 3093/4286 [23:17:52<8:25:19, 25.41s/it] {'loss': 0.0021, 'grad_norm': 4.152870958071904, 'learning_rate': 2.7834811012599157e-07, 'completion_length': 289.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.8630952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8630953431129456, 'reward_std': 0.020619653165340424, 'kl': 0.052734375, 'epoch': 0.72} 72%|███████▏ | 3093/4286 [23:17:52<8:25:19, 25.41s/it] 72%|███████▏ | 3094/4286 [23:18:18<8:25:58, 25.47s/it] {'loss': 0.0031, 'grad_norm': 15.10406523852738, 'learning_rate': 2.7811479234717685e-07, 'completion_length': 307.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7889881432056427, 'rewards/format_reward': 1.0, 'reward': 1.788988173007965, 'reward_std': 
0.025776272639632225, 'kl': 0.07763671875, 'epoch': 0.72} 72%|███████▏ | 3094/4286 [23:18:18<8:25:58, 25.47s/it] 72%|███████▏ | 3095/4286 [23:18:42<8:19:24, 25.16s/it] {'loss': 0.0015, 'grad_norm': 3.5122515398651677, 'learning_rate': 2.7788147456836207e-07, 'completion_length': 302.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.011904759332537651, 'kl': 0.0369873046875, 'epoch': 0.72} 72%|███████▏ | 3095/4286 [23:18:42<8:19:24, 25.16s/it] 72%|███████▏ | 3096/4286 [23:19:06<8:13:15, 24.87s/it] {'loss': 0.0037, 'grad_norm': 2.329630667354325, 'learning_rate': 2.7764815678954734e-07, 'completion_length': 296.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7696428894996643, 'rewards/format_reward': 1.0, 'reward': 1.7696428894996643, 'reward_std': 0.026605320163071156, 'kl': 0.09130859375, 'epoch': 0.72} 72%|███████▏ | 3096/4286 [23:19:06<8:13:15, 24.87s/it] 72%|███████▏ | 3097/4286 [23:19:33<8:22:52, 25.38s/it] {'loss': 0.0078, 'grad_norm': 2.493545606781159, 'learning_rate': 2.7741483901073257e-07, 'completion_length': 298.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.7901785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7723215818405151, 'reward_std': 0.09686608985066414, 'kl': 0.195556640625, 'epoch': 0.72} 72%|███████▏ | 3097/4286 [23:19:33<8:22:52, 25.38s/it] 72%|███████▏ | 3098/4286 [23:19:58<8:18:10, 25.16s/it] {'loss': 0.0031, 'grad_norm': 2.5119383423916495, 'learning_rate': 2.7718152123191784e-07, 'completion_length': 277.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.7940475940704346, 'rewards/format_reward': 1.0, 'reward': 1.7940477132797241, 'reward_std': 0.02830179687589407, 'kl': 0.076904296875, 'epoch': 0.72} 72%|███████▏ | 3098/4286 [23:19:58<8:18:10, 25.16s/it] 72%|███████▏ | 3099/4286 [23:20:23<8:16:08, 25.08s/it] {'loss': 0.0107, 'grad_norm': 15.242841891692665, 'learning_rate': 2.769482034531031e-07, 'completion_length': 312.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.769196480512619, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7513393759727478, 'reward_std': 0.11317916959524155, 'kl': 0.267578125, 'epoch': 0.72} 72%|███████▏ | 3099/4286 [23:20:23<8:16:08, 25.08s/it][2025-03-03 14:18:11,093] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 72%|███████▏ | 3100/4286 [23:20:48<8:19:00, 25.24s/it] {'loss': 0.0075, 'grad_norm': 2.901745580523923, 'learning_rate': 2.7671488567428834e-07, 'completion_length': 344.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7738096117973328, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.01785714365541935, 'kl': 0.1868896484375, 'epoch': 0.72} 72%|███████▏ | 3100/4286 [23:20:48<8:19:00, 25.24s/it] 72%|███████▏ | 3101/4286 [23:24:51<29:50:03, 90.64s/it] {'loss': 0.0079, 'grad_norm': 14.74603686237949, 'learning_rate': 2.764815678954736e-07, 'completion_length': 329.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7529762983322144, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.06547619216144085, 'kl': 0.1995849609375, 'epoch': 0.72} 72%|███████▏ | 3101/4286 [23:24:51<29:50:03, 90.64s/it] 72%|███████▏ | 3102/4286 [23:25:17<23:26:01, 71.25s/it] {'loss': 0.0033, 'grad_norm': 7.30533377637218, 'learning_rate': 2.762482501166589e-07, 'completion_length': 287.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.5985119640827179, 'rewards/format_reward': 1.0, 'reward': 1.5985119938850403, 'reward_std': 0.047023807652294636, 'kl': 0.08203125, 'epoch': 0.72} 72%|███████▏ | 3102/4286 [23:25:17<23:26:01, 71.25s/it] 72%|███████▏ | 3103/4286 [23:25:43<18:53:29, 57.49s/it] {'loss': 0.0035, 'grad_norm': 2.511731641854439, 'learning_rate': 2.760149323378441e-07, 'completion_length': 285.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715818405151, 'reward_std': 0.0892857164144516, 'kl': 0.0872802734375, 'epoch': 0.72} 72%|███████▏ | 3103/4286 [23:25:43<18:53:29, 57.49s/it] 72%|███████▏ | 3104/4286 [23:26:08<15:39:42, 47.70s/it] {'loss': 0.002, 'grad_norm': 5.572019596557597, 'learning_rate': 2.757816145590294e-07, 'completion_length': 309.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7547620236873627, 'rewards/format_reward': 1.0, 'reward': 1.7547619938850403, 'reward_std': 0.02890815958380699, 'kl': 0.0506591796875, 'epoch': 0.72} 72%|███████▏ | 3104/4286 [23:26:08<15:39:42, 47.70s/it] 72%|███████▏ | 3105/4286 [23:26:32<13:20:25, 40.67s/it] {'loss': 0.0131, 'grad_norm': 29.182997937003492, 'learning_rate': 2.755482967802146e-07, 'completion_length': 324.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.7291667461395264, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7113096117973328, 'reward_std': 0.04602411016821861, 'kl': 0.32861328125, 'epoch': 0.72} 72%|███████▏ | 3105/4286 [23:26:32<13:20:25, 40.67s/it] 72%|███████▏ | 3106/4286 [23:26:55<11:38:03, 35.49s/it] {'loss': 0.0041, 'grad_norm': 6.353663543337246, 'learning_rate': 2.753149790013999e-07, 'completion_length': 267.5893020629883, 'rewards/only_full_func_accuracy_reward': 0.8139881193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7961310744285583, 'reward_std': 0.04696349427103996, 'kl': 0.103515625, 'epoch': 0.72} 72%|███████▏ | 3106/4286 [23:26:55<11:38:03, 35.49s/it] 72%|███████▏ | 3107/4286 [23:27:20<10:36:29, 32.39s/it] {'loss': 0.0054, 'grad_norm': 1.0192709812142309, 'learning_rate': 2.7508166122258516e-07, 'completion_length': 303.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.68601194024086, 'rewards/format_reward': 1.0, 'reward': 
1.6860119700431824, 'reward_std': 0.0295482249930501, 'kl': 0.1361083984375, 'epoch': 0.72} 72%|███████▏ | 3107/4286 [23:27:20<10:36:29, 32.39s/it] 73%|███████▎ | 3108/4286 [23:27:46<9:56:18, 30.37s/it] {'loss': 0.0074, 'grad_norm': 7.647209516567349, 'learning_rate': 2.748483434437704e-07, 'completion_length': 322.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6803571879863739, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6625001430511475, 'reward_std': 0.12224226258695126, 'kl': 0.1845703125, 'epoch': 0.73} 73%|███████▎ | 3108/4286 [23:27:46<9:56:18, 30.37s/it] 73%|███████▎ | 3109/4286 [23:28:12<9:28:14, 28.97s/it] {'loss': 0.0178, 'grad_norm': 8.495818975602495, 'learning_rate': 2.7461502566495566e-07, 'completion_length': 319.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.74702388048172, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.07712087407708168, 'kl': 0.4459228515625, 'epoch': 0.73} 73%|███████▎ | 3109/4286 [23:28:12<9:28:14, 28.97s/it] 73%|███████▎ | 3110/4286 [23:28:35<8:56:03, 27.35s/it] {'loss': 0.0091, 'grad_norm': 1.475176279288762, 'learning_rate': 2.743817078861409e-07, 'completion_length': 254.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.8690476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8690477013587952, 'reward_std': 0.0357142873108387, 'kl': 0.22705078125, 'epoch': 0.73} 73%|███████▎ | 3110/4286 [23:28:35<8:56:03, 27.35s/it] 73%|███████▎ | 3111/4286 [23:29:02<8:50:19, 27.08s/it] {'loss': 0.0157, 'grad_norm': 34.260406468180385, 'learning_rate': 2.7414839010732615e-07, 'completion_length': 315.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.8482142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8303572535514832, 'reward_std': 0.06547619216144085, 'kl': 0.392578125, 'epoch': 0.73} 73%|███████▎ | 3111/4286 [23:29:02<8:50:19, 27.08s/it] 73%|███████▎ | 3112/4286 [23:29:26<8:34:15, 26.28s/it] {'loss': 0.0012, 'grad_norm': 0.5929119294925186, 'learning_rate': 2.7391507232851143e-07, 'completion_length': 263.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.800000011920929, 'rewards/format_reward': 1.0, 'reward': 1.8000000715255737, 'reward_std': 0.024743584915995598, 'kl': 0.02886962890625, 'epoch': 0.73} 73%|███████▎ | 3112/4286 [23:29:26<8:34:15, 26.28s/it] 73%|███████▎ | 3113/4286 [23:29:52<8:29:06, 26.04s/it] {'loss': 0.009, 'grad_norm': 4.725938884369758, 'learning_rate': 2.7368175454969665e-07, 'completion_length': 338.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.8050595223903656, 'rewards/format_reward': 1.0, 'reward': 1.8050596117973328, 'reward_std': 0.05059523694217205, 'kl': 0.2239990234375, 'epoch': 0.73} 73%|███████▎ | 3113/4286 [23:29:52<8:29:06, 26.04s/it] 73%|███████▎ | 3114/4286 [23:30:16<8:21:02, 25.65s/it] {'loss': 0.0076, 'grad_norm': 25.959441322627484, 'learning_rate': 2.7344843677088193e-07, 'completion_length': 274.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7886905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7886906266212463, 'reward_std': 0.047619045712053776, 'kl': 0.18963623046875, 'epoch': 0.73} 73%|███████▎ | 3114/4286 [23:30:16<8:21:02, 25.65s/it] 73%|███████▎ | 3115/4286 [23:30:42<8:20:44, 25.66s/it] {'loss': 0.0099, 'grad_norm': 20.940129218308705, 'learning_rate': 2.7321511899206715e-07, 'completion_length': 290.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 
0.08395436778664589, 'kl': 0.2484130859375, 'epoch': 0.73} 73%|███████▎ | 3115/4286 [23:30:42<8:20:44, 25.66s/it] 73%|███████▎ | 3116/4286 [23:31:08<8:18:37, 25.57s/it] {'loss': 0.0188, 'grad_norm': 0.9169101690429542, 'learning_rate': 2.729818012132524e-07, 'completion_length': 329.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6866071820259094, 'rewards/format_reward': 1.0, 'reward': 1.6866072416305542, 'reward_std': 0.04717185162007809, 'kl': 0.4713134765625, 'epoch': 0.73} 73%|███████▎ | 3116/4286 [23:31:08<8:18:37, 25.57s/it] 73%|███████▎ | 3117/4286 [23:31:32<8:12:30, 25.28s/it] {'loss': 0.0024, 'grad_norm': 32.1676560343801, 'learning_rate': 2.727484834344377e-07, 'completion_length': 321.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7648809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.03160357475280762, 'kl': 0.0601806640625, 'epoch': 0.73} 73%|███████▎ | 3117/4286 [23:31:32<8:12:30, 25.28s/it] 73%|███████▎ | 3118/4286 [23:31:56<8:06:09, 24.97s/it] {'loss': 0.0019, 'grad_norm': 5.994247871214278, 'learning_rate': 2.725151656556229e-07, 'completion_length': 301.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7880952656269073, 'rewards/format_reward': 1.0, 'reward': 1.7880953550338745, 'reward_std': 0.018321271985769272, 'kl': 0.048583984375, 'epoch': 0.73} 73%|███████▎ | 3118/4286 [23:31:56<8:06:09, 24.97s/it] 73%|███████▎ | 3119/4286 [23:32:20<7:59:17, 24.64s/it] {'loss': 0.0028, 'grad_norm': 2.5023927795055383, 'learning_rate': 2.722818478768082e-07, 'completion_length': 286.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.681547611951828, 'rewards/format_reward': 1.0, 'reward': 1.68154776096344, 'reward_std': 0.02816697023808956, 'kl': 0.06982421875, 'epoch': 0.73} 73%|███████▎ | 3119/4286 [23:32:20<7:59:17, 24.64s/it] 73%|███████▎ | 3120/4286 [23:32:46<8:05:32, 24.98s/it] {'loss': 0.0109, 'grad_norm': 14.240305629539558, 'learning_rate': 2.720485300979934e-07, 'completion_length': 338.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098215222358704, 'reward_std': 0.05084197223186493, 'kl': 0.2724609375, 'epoch': 0.73} 73%|███████▎ | 3120/4286 [23:32:46<8:05:32, 24.98s/it] 73%|███████▎ | 3121/4286 [23:33:11<8:04:45, 24.97s/it] {'loss': 0.019, 'grad_norm': 8.251261366763957, 'learning_rate': 2.718152123191787e-07, 'completion_length': 272.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5982143580913544, 'rewards/format_reward': 1.0, 'reward': 1.598214328289032, 'reward_std': 0.0892857126891613, 'kl': 0.474609375, 'epoch': 0.73} 73%|███████▎ | 3121/4286 [23:33:11<8:04:45, 24.97s/it] 73%|███████▎ | 3122/4286 [23:33:35<7:57:55, 24.64s/it] {'loss': 0.0204, 'grad_norm': 35.48357880514922, 'learning_rate': 2.7158189454036397e-07, 'completion_length': 298.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.60788694024086, 'rewards/format_reward': 1.0, 'reward': 1.6078869700431824, 'reward_std': 0.11095542274415493, 'kl': 0.511474609375, 'epoch': 0.73} 73%|███████▎ | 3122/4286 [23:33:35<7:57:55, 24.64s/it] 73%|███████▎ | 3123/4286 [23:33:59<7:54:59, 24.51s/it] {'loss': 0.002, 'grad_norm': 4.907695337999552, 'learning_rate': 2.713485767615492e-07, 'completion_length': 276.8928756713867, 'rewards/only_full_func_accuracy_reward': 0.7321428656578064, 'rewards/format_reward': 1.0, 'reward': 1.7321430444717407, 'reward_std': 0.011904762126505375, 'kl': 0.05023193359375, 'epoch': 0.73} 73%|███████▎ | 3123/4286 
[23:33:59<7:54:59, 24.51s/it] 73%|███████▎ | 3124/4286 [23:34:23<7:49:33, 24.25s/it] {'loss': 0.0166, 'grad_norm': 6.69668058221528, 'learning_rate': 2.7111525898273447e-07, 'completion_length': 277.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.08496974036097527, 'kl': 0.4140625, 'epoch': 0.73} 73%|███████▎ | 3124/4286 [23:34:23<7:49:33, 24.25s/it] 73%|███████▎ | 3125/4286 [23:34:47<7:49:18, 24.25s/it] {'loss': 0.0049, 'grad_norm': 2.4685889786478676, 'learning_rate': 2.7088194120391974e-07, 'completion_length': 264.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.8675595223903656, 'rewards/format_reward': 1.0, 'reward': 1.8675596117973328, 'reward_std': 0.047985438257455826, 'kl': 0.12298583984375, 'epoch': 0.73} 73%|███████▎ | 3125/4286 [23:34:47<7:49:18, 24.25s/it] 73%|███████▎ | 3126/4286 [23:35:13<7:58:19, 24.74s/it] {'loss': 0.0051, 'grad_norm': 2.391454211374094, 'learning_rate': 2.7064862342510496e-07, 'completion_length': 318.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 1.0, 'reward': 1.7574405670166016, 'reward_std': 0.032738092821091413, 'kl': 0.12646484375, 'epoch': 0.73} 73%|███████▎ | 3126/4286 [23:35:13<7:58:19, 24.74s/it] 73%|███████▎ | 3127/4286 [23:35:38<8:01:03, 24.90s/it] {'loss': 0.0013, 'grad_norm': 2.049794039885544, 'learning_rate': 2.7041530564629024e-07, 'completion_length': 322.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.8169643580913544, 'rewards/format_reward': 1.0, 'reward': 1.8169643878936768, 'reward_std': 0.019238397479057312, 'kl': 0.03173828125, 'epoch': 0.73} 73%|███████▎ | 3127/4286 [23:35:38<8:01:03, 24.90s/it] 73%|███████▎ | 3128/4286 [23:36:04<8:04:11, 25.09s/it] {'loss': 0.0146, 'grad_norm': 2.5085485939813883, 'learning_rate': 2.7018198786747546e-07, 'completion_length': 340.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.80952388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7916668057441711, 'reward_std': 0.08600887283682823, 'kl': 0.365234375, 'epoch': 0.73} 73%|███████▎ | 3128/4286 [23:36:04<8:04:11, 25.09s/it] 73%|███████▎ | 3129/4286 [23:36:28<7:57:19, 24.75s/it] {'loss': 0.0048, 'grad_norm': 19.7690121001405, 'learning_rate': 2.6994867008866074e-07, 'completion_length': 301.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.005952378269284964, 'kl': 0.1197509765625, 'epoch': 0.73} 73%|███████▎ | 3129/4286 [23:36:28<7:57:19, 24.75s/it] 73%|███████▎ | 3130/4286 [23:36:52<7:56:59, 24.76s/it] {'loss': 0.0082, 'grad_norm': 9.516695451800228, 'learning_rate': 2.69715352309846e-07, 'completion_length': 319.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.8699405491352081, 'rewards/format_reward': 1.0, 'reward': 1.8699405789375305, 'reward_std': 0.09489667788147926, 'kl': 0.20703125, 'epoch': 0.73} 73%|███████▎ | 3130/4286 [23:36:52<7:56:59, 24.76s/it] 73%|███████▎ | 3131/4286 [23:37:17<7:57:07, 24.79s/it] {'loss': 0.0045, 'grad_norm': 14.624516468555322, 'learning_rate': 2.6948203453103123e-07, 'completion_length': 297.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.09661935269832611, 'kl': 0.111328125, 'epoch': 0.73} 73%|███████▎ | 3131/4286 [23:37:17<7:57:07, 24.79s/it] 73%|███████▎ | 3132/4286 
[23:37:41<7:53:40, 24.63s/it] {'loss': 0.0024, 'grad_norm': 10.583633766741835, 'learning_rate': 2.692487167522165e-07, 'completion_length': 294.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.6577381491661072, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.010309826582670212, 'kl': 0.0589599609375, 'epoch': 0.73} 73%|███████▎ | 3132/4286 [23:37:41<7:53:40, 24.63s/it] 73%|███████▎ | 3133/4286 [23:38:07<7:58:46, 24.91s/it] {'loss': 0.0059, 'grad_norm': 1.8653487515277325, 'learning_rate': 2.6901539897340173e-07, 'completion_length': 329.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.7291666865348816, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.04602411109954119, 'kl': 0.1474609375, 'epoch': 0.73} 73%|███████▎ | 3133/4286 [23:38:07<7:58:46, 24.91s/it] 73%|███████▎ | 3134/4286 [23:38:30<7:45:42, 24.26s/it] {'loss': 0.0048, 'grad_norm': 4.040485282884007, 'learning_rate': 2.68782081194587e-07, 'completion_length': 251.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.008928571827709675, 'kl': 0.118896484375, 'epoch': 0.73} 73%|███████▎ | 3134/4286 [23:38:30<7:45:42, 24.26s/it] 73%|███████▎ | 3135/4286 [23:38:57<8:01:03, 25.08s/it] {'loss': 0.0123, 'grad_norm': 5.361835010722299, 'learning_rate': 2.685487634157723e-07, 'completion_length': 317.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7059524059295654, 'rewards/format_reward': 1.0, 'reward': 1.7059524059295654, 'reward_std': 0.062073132023215294, 'kl': 0.307861328125, 'epoch': 0.73} 73%|███████▎ | 3135/4286 [23:38:57<8:01:03, 25.08s/it] 73%|███████▎ | 3136/4286 [23:39:23<8:06:32, 25.38s/it] {'loss': 0.03, 'grad_norm': 9.166909333890185, 'learning_rate': 2.683154456369575e-07, 'completion_length': 325.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.7718254625797272, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7361111640930176, 'reward_std': 0.12535103410482407, 'kl': 0.75, 'epoch': 0.73} 73%|███████▎ | 3136/4286 [23:39:23<8:06:32, 25.38s/it] 73%|███████▎ | 3137/4286 [23:39:48<8:01:45, 25.16s/it] {'loss': 0.0015, 'grad_norm': 12.079865475638673, 'learning_rate': 2.680821278581428e-07, 'completion_length': 319.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.8401786088943481, 'rewards/format_reward': 1.0, 'reward': 1.840178668498993, 'reward_std': 0.08082127571105957, 'kl': 0.0382080078125, 'epoch': 0.73} 73%|███████▎ | 3137/4286 [23:39:48<8:01:45, 25.16s/it] 73%|███████▎ | 3138/4286 [23:40:12<7:59:52, 25.08s/it] {'loss': 0.0056, 'grad_norm': 21.463648386841918, 'learning_rate': 2.67848810079328e-07, 'completion_length': 320.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351192235946655, 'reward_std': 0.07297691702842712, 'kl': 0.1395263671875, 'epoch': 0.73} 73%|███████▎ | 3138/4286 [23:40:12<7:59:52, 25.08s/it] 73%|███████▎ | 3139/4286 [23:40:38<8:01:43, 25.20s/it] {'loss': 0.005, 'grad_norm': 2.0428879047951085, 'learning_rate': 2.676154923005133e-07, 'completion_length': 320.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6844387948513031, 'rewards/format_reward': 1.0, 'reward': 1.6844388246536255, 'reward_std': 0.029657145030796528, 'kl': 0.125732421875, 'epoch': 0.73} 73%|███████▎ | 3139/4286 [23:40:38<8:01:43, 25.20s/it] 73%|███████▎ | 3140/4286 [23:41:03<8:02:07, 25.24s/it] {'loss': 0.0319, 'grad_norm': 10.469716261810264, 
'learning_rate': 2.6738217452169855e-07, 'completion_length': 298.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.817956417798996, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8000993132591248, 'reward_std': 0.12866760324686766, 'kl': 0.795166015625, 'epoch': 0.73} 73%|███████▎ | 3140/4286 [23:41:03<8:02:07, 25.24s/it] 73%|███████▎ | 3141/4286 [23:41:29<8:06:32, 25.50s/it] {'loss': 0.0115, 'grad_norm': 3.227753385207937, 'learning_rate': 2.671488567428838e-07, 'completion_length': 302.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.02380952425301075, 'kl': 0.28857421875, 'epoch': 0.73} 73%|███████▎ | 3141/4286 [23:41:29<8:06:32, 25.50s/it] 73%|███████▎ | 3142/4286 [23:41:54<8:00:29, 25.20s/it] {'loss': 0.0018, 'grad_norm': 6.410968321752823, 'learning_rate': 2.6691553896406905e-07, 'completion_length': 317.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.05909644905477762, 'kl': 0.046142578125, 'epoch': 0.73} 73%|███████▎ | 3142/4286 [23:41:54<8:00:29, 25.20s/it] 73%|███████▎ | 3143/4286 [23:42:21<8:09:56, 25.72s/it] {'loss': 0.0161, 'grad_norm': 5.219137402645528, 'learning_rate': 2.6668222118525427e-07, 'completion_length': 278.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7979167401790619, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7800596356391907, 'reward_std': 0.14198075607419014, 'kl': 0.401123046875, 'epoch': 0.73} 73%|███████▎ | 3143/4286 [23:42:21<8:09:56, 25.72s/it] 73%|███████▎ | 3144/4286 [23:42:47<8:11:43, 25.84s/it] {'loss': 0.0047, 'grad_norm': 27.11462103579666, 'learning_rate': 2.6644890340643955e-07, 'completion_length': 300.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.04106535762548447, 'kl': 0.117431640625, 'epoch': 0.73} 73%|███████▎ | 3144/4286 [23:42:47<8:11:43, 25.84s/it] 73%|███████▎ | 3145/4286 [23:43:12<8:05:06, 25.51s/it] {'loss': 0.0019, 'grad_norm': 0.7458914601229585, 'learning_rate': 2.662155856276248e-07, 'completion_length': 313.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.8660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.8660715818405151, 'reward_std': 0.01785714365541935, 'kl': 0.04827880859375, 'epoch': 0.73} 73%|███████▎ | 3145/4286 [23:43:12<8:05:06, 25.51s/it] 73%|███████▎ | 3146/4286 [23:43:38<8:08:28, 25.71s/it] {'loss': 0.0049, 'grad_norm': 4.601542105140816, 'learning_rate': 2.6598226784881004e-07, 'completion_length': 314.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.8375000059604645, 'rewards/format_reward': 1.0, 'reward': 1.8375000357627869, 'reward_std': 0.029216589406132698, 'kl': 0.1229248046875, 'epoch': 0.73} 73%|███████▎ | 3146/4286 [23:43:38<8:08:28, 25.71s/it] 73%|███████▎ | 3147/4286 [23:44:01<7:53:01, 24.92s/it] {'loss': 0.0051, 'grad_norm': 2.2756430499132483, 'learning_rate': 2.657489500699953e-07, 'completion_length': 247.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.6369048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.07695359364151955, 'kl': 0.128662109375, 'epoch': 0.73} 73%|███████▎ | 3147/4286 [23:44:01<7:53:01, 24.92s/it] 73%|███████▎ | 3148/4286 [23:44:24<7:44:20, 24.48s/it] {'loss': 0.0049, 'grad_norm': 6.476748950774067, 'learning_rate': 2.655156322911806e-07, 
'completion_length': 297.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.8422619104385376, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.029761902987957, 'kl': 0.123291015625, 'epoch': 0.73} 73%|███████▎ | 3148/4286 [23:44:24<7:44:20, 24.48s/it] 73%|███████▎ | 3149/4286 [23:44:49<7:46:16, 24.61s/it] {'loss': 0.0091, 'grad_norm': 0.67124307933615, 'learning_rate': 2.652823145123658e-07, 'completion_length': 308.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547620296478271, 'reward_std': 0.0357142873108387, 'kl': 0.228759765625, 'epoch': 0.73} 73%|███████▎ | 3149/4286 [23:44:49<7:46:16, 24.61s/it]
[2025-03-03 14:42:37,237] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
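The stage3.py warning above is directly actionable: DeepSpeed exposes the device allocator through its accelerator abstraction, and the remedy it names is a get_accelerator().empty_cache() call placed at the same point in every rank's training loop. Below is a minimal sketch of that suggestion, not the trainer used in this run; engine, train_loader, and FLUSH_EVERY are illustrative names, and the loop body stands in for whatever forward/backward/step sequence the trainer actually executes.

    # Minimal sketch, assuming a standard DeepSpeed engine loop. Only
    # get_accelerator().empty_cache() comes from the warning itself;
    # engine, train_loader, and FLUSH_EVERY are hypothetical names.
    from deepspeed.accelerator import get_accelerator

    FLUSH_EVERY = 50  # hypothetical interval; tune to how often the warning fires

    for step, batch in enumerate(train_loader):
        loss = engine(batch)   # forward pass through the DeepSpeed engine
        engine.backward(loss)  # DeepSpeed-managed backward pass
        engine.step()          # optimizer step and ZeRO stage-3 bookkeeping
        if step % FLUSH_EVERY == 0:
            # Every rank reaches this line at the same step, so allocator
            # caches are flushed in lockstep rather than each rank flushing
            # mid-step under memory pressure.
            get_accelerator().empty_cache()

Keying the flush to the step counter is what keeps the ranks synchronized, which is exactly what the warning asks for; flushing more often than necessary costs throughput, so the interval is a trade-off rather than a fixed constant.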
73%|███████▎ | 3150/4286 [23:45:14<7:48:43, 24.76s/it] {'loss': 0.0051, 'grad_norm': 10.458946190752187, 'learning_rate': 2.650489967335511e-07, 'completion_length': 285.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8315477073192596, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8136905431747437, 'reward_std': 0.0745955454185605, 'kl': 0.127197265625, 'epoch': 0.73} 73%|███████▎ | 3150/4286 [23:45:14<7:48:43, 24.76s/it] 74%|███████▎ | 3151/4286 [23:45:39<7:46:00, 24.63s/it] {'loss': 0.0081, 'grad_norm': 38.31074790624521, 'learning_rate': 2.648156789547363e-07, 'completion_length': 281.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7842262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.05594630911946297, 'kl': 0.202392578125, 'epoch': 0.74} 74%|███████▎ | 3151/4286 [23:45:39<7:46:00, 24.63s/it] 74%|███████▎ | 3152/4286 [23:46:04<7:47:11, 24.72s/it] {'loss': 0.0059, 'grad_norm': 7.4834483966562475, 'learning_rate': 2.645823611759216e-07, 'completion_length': 311.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068454027175903, 'reward_std': 0.050324155017733574, 'kl': 0.1484375, 'epoch': 0.74} 74%|███████▎ | 3152/4286 [23:46:04<7:47:11, 24.72s/it] 74%|███████▎ | 3153/4286 [23:46:28<7:45:21, 24.64s/it] {'loss': 0.008, 'grad_norm': 1.0294589228430513, 'learning_rate': 2.6434904339710686e-07, 'completion_length': 284.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.005952383857220411, 'kl': 0.2008056640625, 'epoch': 0.74} 74%|███████▎ | 3153/4286 [23:46:28<7:45:21, 24.64s/it] 74%|███████▎ | 3154/4286 [23:46:54<7:54:27, 25.15s/it] {'loss': 0.0225, 'grad_norm': 10.730177849428694, 'learning_rate': 2.641157256182921e-07, 'completion_length': 290.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.8040674924850464, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.768353283405304, 'reward_std': 0.0937366671860218, 'kl': 0.564453125, 'epoch': 0.74} 74%|███████▎ | 3154/4286 [23:46:54<7:54:27, 25.15s/it] 74%|███████▎ | 3155/4286 [23:47:19<7:48:23, 24.85s/it] {'loss': 0.0016, 'grad_norm': 8.160598608917793, 'learning_rate': 2.6388240783947736e-07, 'completion_length': 296.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.827976256608963, 'rewards/format_reward': 1.0, 'reward': 1.8279762864112854, 'reward_std': 0.02680554986000061, 'kl': 0.03955078125, 'epoch': 0.74} 74%|███████▎ | 3155/4286 [23:47:19<7:48:23, 24.85s/it] 74%|███████▎ | 3156/4286 [23:47:44<7:49:21, 24.92s/it] {'loss': 0.0067, 'grad_norm': 5.573187864972316, 'learning_rate': 2.636490900606626e-07, 'completion_length': 341.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7508928775787354, 'rewards/format_reward': 1.0, 'reward': 1.750892996788025, 'reward_std': 0.030909646302461624, 'kl': 0.168701171875, 'epoch': 0.74} 74%|███████▎ | 3156/4286 [23:47:44<7:49:21, 24.92s/it] 74%|███████▎ | 3157/4286 [23:48:09<7:52:04, 25.09s/it] {'loss': 0.0061, 'grad_norm': 22.81362299402765, 'learning_rate': 2.6341577228184786e-07, 'completion_length': 289.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7594866454601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7416296005249023, 'reward_std': 0.07626487873494625, 'kl': 0.152099609375, 'epoch': 0.74} 74%|███████▎ | 3157/4286 [23:48:09<7:52:04, 25.09s/it] 74%|███████▎ | 3158/4286 [23:48:32<7:41:56, 24.57s/it] {'loss': 0.0053, 'grad_norm': 11.412637902510378, 'learning_rate': 2.6318245450303313e-07, 'completion_length': 295.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.011904762126505375, 'kl': 0.13330078125, 'epoch': 0.74} 74%|███████▎ | 3158/4286 [23:48:32<7:41:56, 24.57s/it] 74%|███████▎ | 3159/4286 [23:48:57<7:40:02, 24.49s/it] {'loss': 0.0026, 'grad_norm': 6.705339993253485, 'learning_rate': 2.6294913672421836e-07, 'completion_length': 250.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.7455357611179352, 'rewards/format_reward': 1.0, 'reward': 1.7455359101295471, 'reward_std': 0.05016787815839052, 'kl': 0.06396484375, 'epoch': 0.74} 74%|███████▎ | 3159/4286 [23:48:57<7:40:02, 24.49s/it] 74%|███████▎ | 3160/4286 [23:49:20<7:33:46, 24.18s/it] {'loss': 0.0165, 'grad_norm': 13.395276772640525, 'learning_rate': 2.6271581894540363e-07, 'completion_length': 291.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6904762983322144, 'reward_std': 0.069406284019351, 'kl': 0.412109375, 'epoch': 0.74} 74%|███████▎ | 3160/4286 [23:49:20<7:33:46, 24.18s/it] 74%|███████▍ | 3161/4286 [23:49:45<7:37:26, 24.40s/it] {'loss': 0.0095, 'grad_norm': 2.4241352372929565, 'learning_rate': 2.6248250116658885e-07, 'completion_length': 297.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.07142856903374195, 'kl': 0.2369384765625, 'epoch': 0.74} 74%|███████▍ | 3161/4286 [23:49:45<7:37:26, 24.40s/it] 74%|███████▍ | 3162/4286 [23:50:09<7:34:44, 24.27s/it] {'loss': 0.004, 'grad_norm': 14.29181418947628, 'learning_rate': 2.6224918338777413e-07, 'completion_length': 280.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.07238246500492096, 'kl': 0.100341796875, 'epoch': 0.74} 74%|███████▍ | 3162/4286 [23:50:09<7:34:44, 24.27s/it] 74%|███████▍ | 3163/4286 [23:50:34<7:37:42, 24.45s/it] {'loss': 0.0032, 'grad_norm': 0.736410647924292, 'learning_rate': 2.620158656089594e-07, 'completion_length': 295.6785888671875,
'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.01785714365541935, 'kl': 0.080322265625, 'epoch': 0.74} 74%|███████▍ | 3163/4286 [23:50:34<7:37:42, 24.45s/it] 74%|███████▍ | 3164/4286 [23:50:57<7:30:29, 24.09s/it] {'loss': 0.0146, 'grad_norm': 15.294419480835947, 'learning_rate': 2.617825478301446e-07, 'completion_length': 289.5, 'rewards/only_full_func_accuracy_reward': 0.787202388048172, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.0208333358168602, 'kl': 0.3641357421875, 'epoch': 0.74} 74%|███████▍ | 3164/4286 [23:50:57<7:30:29, 24.09s/it] 74%|███████▍ | 3165/4286 [23:51:21<7:28:43, 24.02s/it] {'loss': 0.0046, 'grad_norm': 12.596024219704876, 'learning_rate': 2.615492300513299e-07, 'completion_length': 302.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.06731786299496889, 'kl': 0.114501953125, 'epoch': 0.74} 74%|███████▍ | 3165/4286 [23:51:21<7:28:43, 24.02s/it] 74%|███████▍ | 3166/4286 [23:51:44<7:24:26, 23.81s/it] {'loss': 0.0237, 'grad_norm': 2.257961680699062, 'learning_rate': 2.613159122725151e-07, 'completion_length': 264.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6922619342803955, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6744048595428467, 'reward_std': 0.11354022473096848, 'kl': 0.5927734375, 'epoch': 0.74} 74%|███████▍ | 3166/4286 [23:51:44<7:24:26, 23.81s/it] 74%|███████▍ | 3167/4286 [23:52:08<7:24:38, 23.84s/it] {'loss': 0.006, 'grad_norm': 1.0219717986971812, 'learning_rate': 2.610825944937004e-07, 'completion_length': 295.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.7516233921051025, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7337663173675537, 'reward_std': 0.07253926433622837, 'kl': 0.147705078125, 'epoch': 0.74} 74%|███████▍ | 3167/4286 [23:52:08<7:24:38, 23.84s/it] 74%|███████▍ | 3168/4286 [23:52:33<7:26:40, 23.97s/it] {'loss': 0.0058, 'grad_norm': 11.700075956161095, 'learning_rate': 2.6084927671488567e-07, 'completion_length': 290.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.07450364157557487, 'kl': 0.1458740234375, 'epoch': 0.74} 74%|███████▍ | 3168/4286 [23:52:33<7:26:40, 23.97s/it] 74%|███████▍ | 3169/4286 [23:52:57<7:30:12, 24.18s/it] {'loss': 0.0093, 'grad_norm': 12.699541047860148, 'learning_rate': 2.606159589360709e-07, 'completion_length': 299.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6934524774551392, 'reward_std': 0.07822278887033463, 'kl': 0.2308349609375, 'epoch': 0.74} 74%|███████▍ | 3169/4286 [23:52:57<7:30:12, 24.18s/it] 74%|███████▍ | 3170/4286 [23:53:21<7:27:26, 24.06s/it] {'loss': 0.0015, 'grad_norm': 7.364459885744098, 'learning_rate': 2.6038264115725617e-07, 'completion_length': 296.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.866071492433548, 'rewards/format_reward': 1.0, 'reward': 1.8660715222358704, 'reward_std': 0.01718304236419499, 'kl': 0.038330078125, 'epoch': 0.74} 74%|███████▍ | 3170/4286 [23:53:21<7:27:26, 24.06s/it] 74%|███████▍ | 3171/4286 [23:53:45<7:27:57, 24.11s/it] {'loss': 0.0038, 'grad_norm': 2.3299905890877746, 'learning_rate': 2.6014932337844145e-07, 'completion_length': 315.3928680419922, 'rewards/only_full_func_accuracy_reward': 
0.6324405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.032231273129582405, 'kl': 0.0955810546875, 'epoch': 0.74} 74%|███████▍ | 3171/4286 [23:53:45<7:27:57, 24.11s/it] 74%|███████▍ | 3172/4286 [23:54:10<7:30:43, 24.28s/it] {'loss': 0.0034, 'grad_norm': 33.62451053419176, 'learning_rate': 2.5991600559962667e-07, 'completion_length': 303.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7574405074119568, 'rewards/format_reward': 1.0, 'reward': 1.7574406862258911, 'reward_std': 0.022675003856420517, 'kl': 0.084228515625, 'epoch': 0.74} 74%|███████▍ | 3172/4286 [23:54:10<7:30:43, 24.28s/it] 74%|███████▍ | 3173/4286 [23:54:34<7:30:47, 24.30s/it] {'loss': 0.0047, 'grad_norm': 2.244754003495549, 'learning_rate': 2.5968268782081194e-07, 'completion_length': 283.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.05059523694217205, 'kl': 0.1177978515625, 'epoch': 0.74} 74%|███████▍ | 3173/4286 [23:54:34<7:30:47, 24.30s/it] 74%|███████▍ | 3174/4286 [23:54:58<7:24:23, 23.98s/it] {'loss': 0.0071, 'grad_norm': 7.541566799117551, 'learning_rate': 2.5944937004199717e-07, 'completion_length': 298.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.7559524476528168, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.04324992187321186, 'kl': 0.17529296875, 'epoch': 0.74} 74%|███████▍ | 3174/4286 [23:54:58<7:24:23, 23.98s/it] 74%|███████▍ | 3175/4286 [23:55:22<7:24:15, 23.99s/it] {'loss': 0.0216, 'grad_norm': 5.770803112532039, 'learning_rate': 2.5921605226318244e-07, 'completion_length': 317.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.05197649076581001, 'kl': 0.5380859375, 'epoch': 0.74} 74%|███████▍ | 3175/4286 [23:55:22<7:24:15, 23.99s/it] 74%|███████▍ | 3176/4286 [23:55:46<7:28:14, 24.23s/it] {'loss': 0.003, 'grad_norm': 0.6279136644074462, 'learning_rate': 2.589827344843677e-07, 'completion_length': 299.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.074462890625, 'epoch': 0.74} 74%|███████▍ | 3176/4286 [23:55:46<7:28:14, 24.23s/it] 74%|███████▍ | 3177/4286 [23:56:10<7:24:36, 24.05s/it] {'loss': 0.0067, 'grad_norm': 3.9953157452845676, 'learning_rate': 2.5874941670555294e-07, 'completion_length': 313.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 1.0, 'reward': 1.7991072535514832, 'reward_std': 0.008928571827709675, 'kl': 0.1676025390625, 'epoch': 0.74} 74%|███████▍ | 3177/4286 [23:56:10<7:24:36, 24.05s/it] 74%|███████▍ | 3178/4286 [23:56:35<7:29:35, 24.35s/it] {'loss': 0.0159, 'grad_norm': 3.535017237263343, 'learning_rate': 2.585160989267382e-07, 'completion_length': 301.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.6779762208461761, 'rewards/format_reward': 1.0, 'reward': 1.6779762506484985, 'reward_std': 0.08275220543146133, 'kl': 0.399169921875, 'epoch': 0.74} 74%|███████▍ | 3178/4286 [23:56:35<7:29:35, 24.35s/it] 74%|███████▍ | 3179/4286 [23:57:00<7:31:28, 24.47s/it] {'loss': 0.01, 'grad_norm': 142.22018434176437, 'learning_rate': 2.5828278114792344e-07, 'completion_length': 324.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7991072237491608, 'rewards/format_reward': 1.0, 'reward': 1.799107313156128, 
'reward_std': 0.0327380932867527, 'kl': 0.2490234375, 'epoch': 0.74} 74%|███████▍ | 3179/4286 [23:57:00<7:31:28, 24.47s/it] 74%|███████▍ | 3180/4286 [23:57:24<7:31:38, 24.50s/it] {'loss': 0.0016, 'grad_norm': 0.6720605049061937, 'learning_rate': 2.580494633691087e-07, 'completion_length': 291.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.013746436685323715, 'kl': 0.039306640625, 'epoch': 0.74} 74%|███████▍ | 3180/4286 [23:57:24<7:31:38, 24.50s/it] 74%|███████▍ | 3181/4286 [23:57:48<7:25:26, 24.19s/it] {'loss': 0.0024, 'grad_norm': 2.6767966178961635, 'learning_rate': 2.57816145590294e-07, 'completion_length': 290.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.8273809552192688, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.011904759332537651, 'kl': 0.05902099609375, 'epoch': 0.74} 74%|███████▍ | 3181/4286 [23:57:48<7:25:26, 24.19s/it] 74%|███████▍ | 3182/4286 [23:58:11<7:21:01, 23.97s/it] {'loss': 0.0054, 'grad_norm': 0.5039663774388542, 'learning_rate': 2.575828278114792e-07, 'completion_length': 303.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.1339111328125, 'epoch': 0.74} 74%|███████▍ | 3182/4286 [23:58:11<7:21:01, 23.97s/it] 74%|███████▍ | 3183/4286 [23:58:35<7:20:29, 23.96s/it] {'loss': 0.0037, 'grad_norm': 9.961179004197318, 'learning_rate': 2.573495100326645e-07, 'completion_length': 290.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.06983363255858421, 'kl': 0.092041015625, 'epoch': 0.74} 74%|███████▍ | 3183/4286 [23:58:35<7:20:29, 23.96s/it] 74%|███████▍ | 3184/4286 [23:59:00<7:23:51, 24.17s/it] {'loss': 0.021, 'grad_norm': 6.496868941447083, 'learning_rate': 2.571161922538497e-07, 'completion_length': 286.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261906266212463, 'reward_std': 0.04306802712380886, 'kl': 0.52490234375, 'epoch': 0.74} 74%|███████▍ | 3184/4286 [23:59:00<7:23:51, 24.17s/it] 74%|███████▍ | 3185/4286 [23:59:24<7:24:33, 24.23s/it] {'loss': 0.0183, 'grad_norm': 11.336485564554, 'learning_rate': 2.56882874475035e-07, 'completion_length': 297.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.800595223903656, 'rewards/format_reward': 1.0, 'reward': 1.8005954027175903, 'reward_std': 0.056914002634584904, 'kl': 0.4599609375, 'epoch': 0.74} 74%|███████▍ | 3185/4286 [23:59:24<7:24:33, 24.23s/it] 74%|███████▍ | 3186/4286 [23:59:48<7:20:55, 24.05s/it] {'loss': 0.0109, 'grad_norm': 2.866226343525568, 'learning_rate': 2.5664955669622026e-07, 'completion_length': 306.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7227182686328888, 'rewards/format_reward': 1.0, 'reward': 1.7227182984352112, 'reward_std': 0.06416460126638412, 'kl': 0.271728515625, 'epoch': 0.74} 74%|███████▍ | 3186/4286 [23:59:48<7:20:55, 24.05s/it] 74%|███████▍ | 3187/4286 [24:00:12<7:21:23, 24.10s/it] {'loss': 0.0052, 'grad_norm': 4.7726910142603725, 'learning_rate': 2.564162389174055e-07, 'completion_length': 308.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.818452537059784, 'reward_std': 0.02816697023808956, 'kl': 0.13037109375, 'epoch': 0.74} 74%|███████▍ | 3187/4286 
[24:00:12<7:21:23, 24.10s/it] 74%|███████▍ | 3188/4286 [24:00:36<7:17:49, 23.92s/it] {'loss': 0.0059, 'grad_norm': 5.014125808321188, 'learning_rate': 2.5618292113859075e-07, 'completion_length': 283.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.06001728214323521, 'kl': 0.148193359375, 'epoch': 0.74} 74%|███████▍ | 3188/4286 [24:00:36<7:17:49, 23.92s/it] 74%|███████▍ | 3189/4286 [24:01:00<7:19:14, 24.02s/it] {'loss': 0.0057, 'grad_norm': 115.83406309347893, 'learning_rate': 2.55949603359776e-07, 'completion_length': 319.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7455357313156128, 'rewards/format_reward': 1.0, 'reward': 1.7455357909202576, 'reward_std': 0.008928571827709675, 'kl': 0.142578125, 'epoch': 0.74} 74%|███████▍ | 3189/4286 [24:01:00<7:19:14, 24.02s/it] 74%|███████▍ | 3190/4286 [24:01:26<7:29:37, 24.61s/it] {'loss': 0.051, 'grad_norm': 14.53257736370077, 'learning_rate': 2.5571628558096125e-07, 'completion_length': 315.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7193877696990967, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.6301020979881287, 'reward_std': 0.17364110052585602, 'kl': 1.27294921875, 'epoch': 0.74} 74%|███████▍ | 3190/4286 [24:01:26<7:29:37, 24.61s/it] 74%|███████▍ | 3191/4286 [24:01:52<7:36:02, 24.99s/it] {'loss': 0.0212, 'grad_norm': 1.8257304987382523, 'learning_rate': 2.554829678021465e-07, 'completion_length': 321.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8380953073501587, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.820238173007965, 'reward_std': 0.06668206304311752, 'kl': 0.5322265625, 'epoch': 0.74} 74%|███████▍ | 3191/4286 [24:01:52<7:36:02, 24.99s/it] 74%|███████▍ | 3192/4286 [24:02:16<7:34:24, 24.92s/it] {'loss': 4.2064, 'grad_norm': 459530.91256018775, 'learning_rate': 2.5524965002333175e-07, 'completion_length': 298.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.01785714365541935, 'kl': 105.052734375, 'epoch': 0.74} 74%|███████▍ | 3192/4286 [24:02:16<7:34:24, 24.92s/it] 74%|███████▍ | 3193/4286 [24:02:42<7:35:18, 24.99s/it] {'loss': 0.007, 'grad_norm': 9.770297950452457, 'learning_rate': 2.55016332244517e-07, 'completion_length': 319.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.026785715483129025, 'kl': 0.17529296875, 'epoch': 0.74} 74%|███████▍ | 3193/4286 [24:02:42<7:35:18, 24.99s/it] 75%|███████▍ | 3194/4286 [24:03:06<7:31:58, 24.83s/it] {'loss': 0.0023, 'grad_norm': 0.7608097229969845, 'learning_rate': 2.547830144657023e-07, 'completion_length': 329.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.008928571827709675, 'kl': 0.0565185546875, 'epoch': 0.75} 75%|███████▍ | 3194/4286 [24:03:06<7:31:58, 24.83s/it] 75%|███████▍ | 3195/4286 [24:03:31<7:30:58, 24.80s/it] {'loss': 0.0179, 'grad_norm': 2.0681741102268276, 'learning_rate': 2.545496966868875e-07, 'completion_length': 287.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6711310148239136, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.04705275967717171, 'kl': 0.4501953125, 'epoch': 0.75} 75%|███████▍ | 3195/4286 [24:03:31<7:30:58, 24.80s/it] 75%|███████▍ | 3196/4286 
[24:03:55<7:29:41, 24.75s/it] {'loss': 0.0013, 'grad_norm': 7.217049089346684, 'learning_rate': 2.543163789080728e-07, 'completion_length': 279.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.7797619998455048, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.07327023893594742, 'kl': 0.0328369140625, 'epoch': 0.75} 75%|███████▍ | 3196/4286 [24:03:55<7:29:41, 24.75s/it] 75%|███████▍ | 3197/4286 [24:04:19<7:25:06, 24.52s/it] {'loss': 0.0041, 'grad_norm': 4.1372481452206165, 'learning_rate': 2.54083061129258e-07, 'completion_length': 288.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.020619653165340424, 'kl': 0.10205078125, 'epoch': 0.75} 75%|███████▍ | 3197/4286 [24:04:19<7:25:06, 24.52s/it] 75%|███████▍ | 3198/4286 [24:04:45<7:31:18, 24.89s/it] {'loss': 0.0031, 'grad_norm': 6.315199447216592, 'learning_rate': 2.538497433504433e-07, 'completion_length': 316.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7681548297405243, 'rewards/format_reward': 1.0, 'reward': 1.768154799938202, 'reward_std': 0.02798657864332199, 'kl': 0.0777587890625, 'epoch': 0.75} 75%|███████▍ | 3198/4286 [24:04:45<7:31:18, 24.89s/it] 75%|███████▍ | 3199/4286 [24:05:11<7:36:03, 25.17s/it] {'loss': 0.0021, 'grad_norm': 1.639057089137541, 'learning_rate': 2.5361642557162857e-07, 'completion_length': 340.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7738095223903656, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.0535714328289032, 'kl': 0.05145263671875, 'epoch': 0.75} 75%|███████▍ | 3199/4286 [24:05:11<7:36:03, 25.17s/it] 75%|███████▍ | 3200/4286 [24:05:36<7:32:21, 24.99s/it] {'loss': 0.0114, 'grad_norm': 1.972614750817133, 'learning_rate': 2.533831077928138e-07, 'completion_length': 317.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7053572237491608, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.050381556153297424, 'kl': 0.284423828125, 'epoch': 0.75} 75%|███████▍ | 3200/4286 [24:05:36<7:32:21, 24.99s/it] 75%|███████▍ | 3201/4286 [24:08:53<23:07:00, 76.70s/it] {'loss': 0.0047, 'grad_norm': 0.676759260869092, 'learning_rate': 2.5314979001399907e-07, 'completion_length': 339.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.011904762126505375, 'kl': 0.1168212890625, 'epoch': 0.75} 75%|███████▍ | 3201/4286 [24:08:53<23:07:00, 76.70s/it] 75%|███████▍ | 3202/4286 [24:09:18<18:24:12, 61.12s/it] {'loss': 0.0022, 'grad_norm': 2.994800535640969, 'learning_rate': 2.529164722351843e-07, 'completion_length': 318.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6741072237491608, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.06685745343565941, 'kl': 0.0562744140625, 'epoch': 0.75} 75%|███████▍ | 3202/4286 [24:09:18<18:24:12, 61.12s/it] 75%|███████▍ | 3203/4286 [24:09:42<15:05:35, 50.17s/it] {'loss': 0.0054, 'grad_norm': 12.5023081828932, 'learning_rate': 2.5268315445636956e-07, 'completion_length': 308.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.8229167461395264, 'rewards/format_reward': 1.0, 'reward': 1.8229168057441711, 'reward_std': 0.07891849055886269, 'kl': 0.135986328125, 'epoch': 0.75} 75%|███████▍ | 3203/4286 [24:09:42<15:05:35, 50.17s/it] 75%|███████▍ | 3204/4286 [24:10:08<12:51:04, 42.76s/it] {'loss': 0.0053, 'grad_norm': 
8.133418020206616, 'learning_rate': 2.5244983667755484e-07, 'completion_length': 277.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8065476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8065477013587952, 'reward_std': 0.0297619067132473, 'kl': 0.1326904296875, 'epoch': 0.75} 75%|███████▍ | 3204/4286 [24:10:08<12:51:04, 42.76s/it] 75%|███████▍ | 3205/4286 [24:10:33<11:14:35, 37.44s/it] {'loss': 0.0382, 'grad_norm': 653.9444903582765, 'learning_rate': 2.5221651889874006e-07, 'completion_length': 328.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.8184524178504944, 'rewards/format_reward': 1.0, 'reward': 1.818452537059784, 'reward_std': 0.017857138067483902, 'kl': 0.94873046875, 'epoch': 0.75} 75%|███████▍ | 3205/4286 [24:10:33<11:14:35, 37.44s/it] 75%|███████▍ | 3206/4286 [24:10:59<10:13:49, 34.10s/it] {'loss': 0.0048, 'grad_norm': 2.307294895724958, 'learning_rate': 2.5198320111992534e-07, 'completion_length': 326.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.8497024476528168, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8139882683753967, 'reward_std': 0.127976194024086, 'kl': 0.120361328125, 'epoch': 0.75} 75%|███████▍ | 3206/4286 [24:10:59<10:13:49, 34.10s/it] 75%|███████▍ | 3207/4286 [24:11:24<9:20:54, 31.19s/it] {'loss': 0.0032, 'grad_norm': 3.922534512360466, 'learning_rate': 2.5174988334111056e-07, 'completion_length': 301.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.8133929371833801, 'rewards/format_reward': 1.0, 'reward': 1.813392996788025, 'reward_std': 0.07346101477742195, 'kl': 0.0789794921875, 'epoch': 0.75} 75%|███████▍ | 3207/4286 [24:11:24<9:20:54, 31.19s/it] 75%|███████▍ | 3208/4286 [24:11:48<8:44:54, 29.22s/it] {'loss': 0.0051, 'grad_norm': 4.763938648509527, 'learning_rate': 2.5151656556229583e-07, 'completion_length': 273.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.04392235726118088, 'kl': 0.12646484375, 'epoch': 0.75} 75%|███████▍ | 3208/4286 [24:11:48<8:44:54, 29.22s/it] 75%|███████▍ | 3209/4286 [24:12:14<8:25:50, 28.18s/it] {'loss': 0.0082, 'grad_norm': 4.106385010085946, 'learning_rate': 2.512832477834811e-07, 'completion_length': 279.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8452381193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8273810744285583, 'reward_std': 0.04761904664337635, 'kl': 0.205322265625, 'epoch': 0.75} 75%|███████▍ | 3209/4286 [24:12:14<8:25:50, 28.18s/it] 75%|███████▍ | 3210/4286 [24:12:38<8:05:57, 27.10s/it] {'loss': 0.0016, 'grad_norm': 4.882483630747136, 'learning_rate': 2.5104993000466633e-07, 'completion_length': 284.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.8199405372142792, 'rewards/format_reward': 1.0, 'reward': 1.8199405670166016, 'reward_std': 0.008928571827709675, 'kl': 0.0389404296875, 'epoch': 0.75} 75%|███████▍ | 3210/4286 [24:12:38<8:05:57, 27.10s/it] 75%|███████▍ | 3211/4286 [24:13:02<7:47:20, 26.08s/it] {'loss': 0.0105, 'grad_norm': 2.5471538726345946, 'learning_rate': 2.508166122258516e-07, 'completion_length': 274.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7723215818405151, 'reward_std': 0.06090506911277771, 'kl': 0.261962890625, 'epoch': 0.75} 75%|███████▍ | 3211/4286 [24:13:02<7:47:20, 26.08s/it] 75%|███████▍ | 3212/4286 [24:13:27<7:42:35, 25.84s/it] {'loss': 0.0071, 'grad_norm': 0.6215980163849066, 'learning_rate': 
2.5058329444703683e-07, 'completion_length': 300.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7857143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.0, 'kl': 0.1776123046875, 'epoch': 0.75} 75%|███████▍ | 3212/4286 [24:13:27<7:42:35, 25.84s/it] 75%|███████▍ | 3213/4286 [24:13:52<7:35:21, 25.46s/it] {'loss': 0.0065, 'grad_norm': 1.3909096556956273, 'learning_rate': 2.503499766682221e-07, 'completion_length': 274.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.020619653165340424, 'kl': 0.162109375, 'epoch': 0.75} 75%|███████▍ | 3213/4286 [24:13:52<7:35:21, 25.46s/it] 75%|███████▍ | 3214/4286 [24:14:16<7:28:37, 25.11s/it] {'loss': 0.0011, 'grad_norm': 0.21605613975209725, 'learning_rate': 2.501166588894074e-07, 'completion_length': 307.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.9270833730697632, 'rewards/format_reward': 1.0, 'reward': 1.9270834922790527, 'reward_std': 0.008928571827709675, 'kl': 0.02691650390625, 'epoch': 0.75} 75%|███████▍ | 3214/4286 [24:14:16<7:28:37, 25.11s/it] 75%|███████▌ | 3215/4286 [24:14:41<7:25:57, 24.98s/it] {'loss': 0.0024, 'grad_norm': 5.716521424295261, 'learning_rate': 2.498833411105926e-07, 'completion_length': 286.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.822916716337204, 'rewards/format_reward': 1.0, 'reward': 1.8229167461395264, 'reward_std': 0.07624643296003342, 'kl': 0.060791015625, 'epoch': 0.75} 75%|███████▌ | 3215/4286 [24:14:41<7:25:57, 24.98s/it] 75%|███████▌ | 3216/4286 [24:15:06<7:28:08, 25.13s/it] {'loss': 0.0043, 'grad_norm': 7.5485385012421675, 'learning_rate': 2.496500233317779e-07, 'completion_length': 322.0, 'rewards/only_full_func_accuracy_reward': 0.7979166805744171, 'rewards/format_reward': 1.0, 'reward': 1.7979168891906738, 'reward_std': 0.023214287124574184, 'kl': 0.10626220703125, 'epoch': 0.75} 75%|███████▌ | 3216/4286 [24:15:06<7:28:08, 25.13s/it] 75%|███████▌ | 3217/4286 [24:15:31<7:22:56, 24.86s/it] {'loss': 0.0109, 'grad_norm': 34.16341734389801, 'learning_rate': 2.4941670555296315e-07, 'completion_length': 274.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7976190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.1116600651293993, 'kl': 0.2738037109375, 'epoch': 0.75} 75%|███████▌ | 3217/4286 [24:15:31<7:22:56, 24.86s/it] 75%|███████▌ | 3218/4286 [24:15:55<7:20:30, 24.75s/it] {'loss': 0.0024, 'grad_norm': 1.393720652345135, 'learning_rate': 2.4918338777414837e-07, 'completion_length': 314.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.01785714365541935, 'kl': 0.060791015625, 'epoch': 0.75} 75%|███████▌ | 3218/4286 [24:15:55<7:20:30, 24.75s/it] 75%|███████▌ | 3219/4286 [24:16:22<7:29:43, 25.29s/it] {'loss': 0.0019, 'grad_norm': 8.70898111833496, 'learning_rate': 2.4895006999533365e-07, 'completion_length': 326.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.785714328289032, 'reward_std': 0.05381816253066063, 'kl': 0.04638671875, 'epoch': 0.75} 75%|███████▌ | 3219/4286 [24:16:22<7:29:43, 25.29s/it] 75%|███████▌ | 3220/4286 [24:16:46<7:24:54, 25.04s/it] {'loss': 0.0059, 'grad_norm': 48.24542569975757, 'learning_rate': 2.4871675221651887e-07, 'completion_length': 318.5714416503906, 'rewards/only_full_func_accuracy_reward': 
0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.7321430444717407, 'reward_std': 0.0357142798602581, 'kl': 0.1463623046875, 'epoch': 0.75} 75%|███████▌ | 3220/4286 [24:16:46<7:24:54, 25.04s/it] 75%|███████▌ | 3221/4286 [24:17:11<7:21:24, 24.87s/it] {'loss': 0.0018, 'grad_norm': 1.6293068855820139, 'learning_rate': 2.4848343443770414e-07, 'completion_length': 283.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.8497023582458496, 'rewards/format_reward': 1.0, 'reward': 1.849702537059784, 'reward_std': 0.031143158674240112, 'kl': 0.044921875, 'epoch': 0.75} 75%|███████▌ | 3221/4286 [24:17:11<7:21:24, 24.87s/it] 75%|███████▌ | 3222/4286 [24:17:37<7:26:40, 25.19s/it] {'loss': 0.0074, 'grad_norm': 4.0931768930199715, 'learning_rate': 2.482501166588894e-07, 'completion_length': 313.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.6577381193637848, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.07762769237160683, 'kl': 0.18359375, 'epoch': 0.75} 75%|███████▌ | 3222/4286 [24:17:37<7:26:40, 25.19s/it] 75%|███████▌ | 3223/4286 [24:18:01<7:21:09, 24.90s/it] {'loss': 0.0059, 'grad_norm': 6.9743483981895835, 'learning_rate': 2.4801679888007464e-07, 'completion_length': 287.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160714626312256, 'reward_std': 0.060256581753492355, 'kl': 0.1473388671875, 'epoch': 0.75} 75%|███████▌ | 3223/4286 [24:18:01<7:21:09, 24.90s/it] 75%|███████▌ | 3224/4286 [24:18:25<7:17:35, 24.72s/it] {'loss': 0.0017, 'grad_norm': 0.7931823249031079, 'learning_rate': 2.477834811012599e-07, 'completion_length': 300.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7723215222358704, 'reward_std': 0.008928571827709675, 'kl': 0.0413818359375, 'epoch': 0.75} 75%|███████▌ | 3224/4286 [24:18:25<7:17:35, 24.72s/it] 75%|███████▌ | 3225/4286 [24:18:50<7:17:40, 24.75s/it] {'loss': 0.0023, 'grad_norm': 3.854665762678958, 'learning_rate': 2.4755016332244514e-07, 'completion_length': 304.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.8244048357009888, 'rewards/format_reward': 1.0, 'reward': 1.8244048953056335, 'reward_std': 0.01785714365541935, 'kl': 0.057861328125, 'epoch': 0.75} 75%|███████▌ | 3225/4286 [24:18:50<7:17:40, 24.75s/it] 75%|███████▌ | 3226/4286 [24:19:14<7:12:15, 24.47s/it] {'loss': 0.0277, 'grad_norm': 9.089205297083945, 'learning_rate': 2.473168455436304e-07, 'completion_length': 310.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8325893580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8147322535514832, 'reward_std': 0.08703739009797573, 'kl': 0.693359375, 'epoch': 0.75} 75%|███████▌ | 3226/4286 [24:19:14<7:12:15, 24.47s/it] 75%|███████▌ | 3227/4286 [24:19:40<7:22:16, 25.06s/it] {'loss': 0.006, 'grad_norm': 10.11834813105441, 'learning_rate': 2.470835277648157e-07, 'completion_length': 305.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7872024774551392, 'rewards/format_reward': 1.0, 'reward': 1.787202537059784, 'reward_std': 0.04999392945319414, 'kl': 0.1513671875, 'epoch': 0.75} 75%|███████▌ | 3227/4286 [24:19:40<7:22:16, 25.06s/it] 75%|███████▌ | 3228/4286 [24:20:06<7:23:34, 25.16s/it] {'loss': 0.0103, 'grad_norm': 10.349516028855373, 'learning_rate': 2.468502099860009e-07, 'completion_length': 316.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.790178656578064, 'rewards/format_reward': 1.0, 'reward': 
1.7901787161827087, 'reward_std': 0.03197652846574783, 'kl': 0.2576904296875, 'epoch': 0.75} 75%|███████▌ | 3228/4286 [24:20:06<7:23:34, 25.16s/it] 75%|███████▌ | 3229/4286 [24:20:31<7:25:41, 25.30s/it] {'loss': 0.0058, 'grad_norm': 3.2656314616571773, 'learning_rate': 2.466168922071862e-07, 'completion_length': 312.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6581297218799591, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.640272617340088, 'reward_std': 0.07659774273633957, 'kl': 0.14404296875, 'epoch': 0.75} 75%|███████▌ | 3229/4286 [24:20:31<7:25:41, 25.30s/it] 75%|███████▌ | 3230/4286 [24:20:58<7:32:28, 25.71s/it] {'loss': 0.0016, 'grad_norm': 2.7272639861699353, 'learning_rate': 2.463835744283714e-07, 'completion_length': 326.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7455357313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7276787161827087, 'reward_std': 0.10650181770324707, 'kl': 0.04010009765625, 'epoch': 0.75} 75%|███████▌ | 3230/4286 [24:20:58<7:32:28, 25.71s/it] 75%|███████▌ | 3231/4286 [24:21:22<7:24:00, 25.25s/it] {'loss': 0.0015, 'grad_norm': 0.42688686088255545, 'learning_rate': 2.461502566495567e-07, 'completion_length': 274.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.8407738208770752, 'rewards/format_reward': 1.0, 'reward': 1.8407739400863647, 'reward_std': 0.02267500851303339, 'kl': 0.03814697265625, 'epoch': 0.75} 75%|███████▌ | 3231/4286 [24:21:22<7:24:00, 25.25s/it] 75%|███████▌ | 3232/4286 [24:21:48<7:25:27, 25.36s/it] {'loss': 0.0016, 'grad_norm': 13.309924335828406, 'learning_rate': 2.4591693887074196e-07, 'completion_length': 296.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.8476190865039825, 'rewards/format_reward': 1.0, 'reward': 1.847619116306305, 'reward_std': 0.007142859045416117, 'kl': 0.0406494140625, 'epoch': 0.75} 75%|███████▌ | 3232/4286 [24:21:48<7:25:27, 25.36s/it] 75%|███████▌ | 3233/4286 [24:22:11<7:14:47, 24.77s/it] {'loss': 0.0038, 'grad_norm': 3.587713048687119, 'learning_rate': 2.456836210919272e-07, 'completion_length': 258.2678756713867, 'rewards/only_full_func_accuracy_reward': 0.6116071939468384, 'rewards/format_reward': 1.0, 'reward': 1.611607313156128, 'reward_std': 0.026785715483129025, 'kl': 0.0936279296875, 'epoch': 0.75} 75%|███████▌ | 3233/4286 [24:22:11<7:14:47, 24.77s/it] 75%|███████▌ | 3234/4286 [24:22:36<7:16:22, 24.89s/it] {'loss': 0.0032, 'grad_norm': 1.6431493977939082, 'learning_rate': 2.4545030331311246e-07, 'completion_length': 315.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0, 'kl': 0.0794677734375, 'epoch': 0.75} 75%|███████▌ | 3234/4286 [24:22:36<7:16:22, 24.89s/it] 75%|███████▌ | 3235/4286 [24:23:02<7:18:39, 25.04s/it] {'loss': 0.0164, 'grad_norm': 1.898855204052525, 'learning_rate': 2.452169855342977e-07, 'completion_length': 330.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7532738447189331, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.735416829586029, 'reward_std': 0.0945573877543211, 'kl': 0.407470703125, 'epoch': 0.75} 75%|███████▌ | 3235/4286 [24:23:02<7:18:39, 25.04s/it] 76%|███████▌ | 3236/4286 [24:23:28<7:26:42, 25.53s/it] {'loss': 0.0077, 'grad_norm': 10.646234458480508, 'learning_rate': 2.4498366775548295e-07, 'completion_length': 323.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8303571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8303572535514832, 'reward_std': 
0.04946071840822697, 'kl': 0.1923828125, 'epoch': 0.76} 3236/4286 [24:23:28<7:26:42, 25.53s/it]
3237/4286 [24:23:52<7:17:09, 25.00s/it] {'loss': 0.0149, 'grad_norm': 6.235386220284996, 'learning_rate': 2.4475034997666823e-07, 'completion_length': 286.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.8824405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.864583432674408, 'reward_std': 0.0803571492433548, 'kl': 0.373291015625, 'epoch': 0.76}
3238/4286 [24:24:17<7:16:32, 24.99s/it] {'loss': 0.0028, 'grad_norm': 2.329946049253123, 'learning_rate': 2.4451703219785345e-07, 'completion_length': 321.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.827381044626236, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.049460720270872116, 'kl': 0.0689697265625, 'epoch': 0.76}
3239/4286 [24:24:42<7:13:23, 24.84s/it] {'loss': 0.0019, 'grad_norm': 0.3318146207425628, 'learning_rate': 2.4428371441903873e-07, 'completion_length': 306.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.0, 'kl': 0.0478515625, 'epoch': 0.76}
3240/4286 [24:25:05<7:05:56, 24.43s/it] {'loss': 0.0076, 'grad_norm': 1.0177303995896994, 'learning_rate': 2.44050396640224e-07, 'completion_length': 285.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.013746432960033417, 'kl': 0.190673828125, 'epoch': 0.76}
3241/4286 [24:25:30<7:08:59, 24.63s/it] {'loss': 0.002, 'grad_norm': 0.9328918856495457, 'learning_rate': 2.438170788614092e-07, 'completion_length': 303.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.01785714365541935, 'kl': 0.0511474609375, 'epoch': 0.76}
3242/4286 [24:25:55<7:11:24, 24.79s/it] {'loss': 0.0021, 'grad_norm': 1.161003080151415, 'learning_rate': 2.435837610825945e-07, 'completion_length': 305.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.8824405372142792, 'rewards/format_reward': 1.0, 'reward': 1.8824405670166016, 'reward_std': 0.008928571827709675, 'kl': 0.0531005859375, 'epoch': 0.76}
3243/4286 [24:26:21<7:14:45, 25.01s/it] {'loss': 0.0024, 'grad_norm': 2.8302288726286364, 'learning_rate': 2.433504433037797e-07, 'completion_length': 305.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.8958334028720856, 'rewards/format_reward': 1.0, 'reward': 1.895833432674408, 'reward_std': 0.01785714365541935, 'kl': 0.0604248046875, 'epoch': 0.76}
3244/4286 [24:26:45<7:10:24, 24.78s/it] {'loss': 0.004, 'grad_norm': 3.0448146466363073, 'learning_rate': 2.43117125524965e-07, 'completion_length': 302.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.6785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.0357142873108387, 'kl': 0.099609375, 'epoch': 0.76}
3245/4286 [24:27:09<7:07:17, 24.63s/it] {'loss': 0.0164, 'grad_norm': 0.6480738453394846, 'learning_rate': 2.4288380774615027e-07, 'completion_length': 318.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6815476417541504, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.01785714365541935, 'kl': 0.408203125, 'epoch': 0.76}
3246/4286 [24:27:36<7:16:25, 25.18s/it] {'loss': 0.026, 'grad_norm': 5.263295897057001, 'learning_rate': 2.426504899673355e-07, 'completion_length': 300.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.7247024178504944, 'rewards/format_reward': 1.0, 'reward': 1.724702537059784, 'reward_std': 0.04464286006987095, 'kl': 0.65234375, 'epoch': 0.76}
3247/4286 [24:28:00<7:09:21, 24.79s/it] {'loss': 0.0093, 'grad_norm': 3.7591690972640452, 'learning_rate': 2.4241717218852077e-07, 'completion_length': 318.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.648809552192688, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.0357142873108387, 'kl': 0.232666015625, 'epoch': 0.76}
3248/4286 [24:28:24<7:06:09, 24.63s/it] {'loss': 0.0014, 'grad_norm': 0.7414794497650258, 'learning_rate': 2.42183854409706e-07, 'completion_length': 282.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8550595641136169, 'rewards/format_reward': 1.0, 'reward': 1.8550596237182617, 'reward_std': 0.01607143087312579, 'kl': 0.03497314453125, 'epoch': 0.76}
3249/4286 [24:28:48<7:03:27, 24.50s/it] {'loss': 0.0064, 'grad_norm': 1.2100552069401296, 'learning_rate': 2.4195053663089127e-07, 'completion_length': 297.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8348214626312256, 'rewards/format_reward': 1.0, 'reward': 1.8348215818405151, 'reward_std': 0.056547620333731174, 'kl': 0.1617431640625, 'epoch': 0.76}
3250/4286 [24:29:13<7:04:29, 24.58s/it] {'loss': 0.0088, 'grad_norm': 4.715103411216021, 'learning_rate': 2.4171721885207654e-07, 'completion_length': 309.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7261904776096344, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.05498574301600456, 'kl': 0.2183837890625, 'epoch': 0.76}
3251/4286 [24:29:37<7:01:19, 24.43s/it] {'loss': 0.0041, 'grad_norm': 1.5218496367850665, 'learning_rate': 2.4148390107326176e-07, 'completion_length': 307.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7559524476528168, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.011904759332537651, 'kl': 0.101806640625, 'epoch': 0.76}
3252/4286 [24:30:01<6:57:37, 24.23s/it] {'loss': 0.0023, 'grad_norm': 0.44121002969057027, 'learning_rate': 2.4125058329444704e-07, 'completion_length': 309.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7604167461395264, 'rewards/format_reward': 1.0, 'reward': 1.7604167461395264, 'reward_std': 0.008928571827709675, 'kl': 0.058349609375, 'epoch': 0.76}
3253/4286 [24:30:27<7:08:11, 24.87s/it] {'loss': 0.0017, 'grad_norm': 3.4935163975737105, 'learning_rate': 2.4101726551563226e-07, 'completion_length': 324.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.01944039762020111, 'kl': 0.0430908203125, 'epoch': 0.76}
3254/4286 [24:30:51<7:05:12, 24.72s/it] {'loss': 0.0022, 'grad_norm': 0.5753278358984381, 'learning_rate': 2.4078394773681754e-07, 'completion_length': 254.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.7440477013587952, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.020619657821953297, 'kl': 0.0555419921875, 'epoch': 0.76}
3255/4286 [24:31:16<7:03:42, 24.66s/it] {'loss': 0.0254, 'grad_norm': 11.678150168381057, 'learning_rate': 2.405506299580028e-07, 'completion_length': 276.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.8011905252933502, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7833334803581238, 'reward_std': 0.0685654729604721, 'kl': 0.63525390625, 'epoch': 0.76}
3256/4286 [24:31:41<7:03:00, 24.64s/it] {'loss': 0.0014, 'grad_norm': 6.533437761160627, 'learning_rate': 2.4031731217918803e-07, 'completion_length': 287.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.6726190447807312, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.0357142873108387, 'kl': 0.0355224609375, 'epoch': 0.76}
3257/4286 [24:32:05<7:03:15, 24.68s/it] {'loss': 0.0059, 'grad_norm': 1.3976280524509779, 'learning_rate': 2.400839944003733e-07, 'completion_length': 315.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.6800595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.023908402770757675, 'kl': 0.14794921875, 'epoch': 0.76}
3258/4286 [24:32:31<7:07:02, 24.92s/it] {'loss': 0.007, 'grad_norm': 2.219003505010747, 'learning_rate': 2.3985067662155853e-07, 'completion_length': 307.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6577381789684296, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.0704073067754507, 'kl': 0.17431640625, 'epoch': 0.76}
3259/4286 [24:32:54<6:58:37, 24.46s/it] {'loss': 0.0018, 'grad_norm': 0.1956948689166244, 'learning_rate': 2.396173588427438e-07, 'completion_length': 268.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001788139343, 'reward_std': 0.0, 'kl': 0.0458984375, 'epoch': 0.76}
3260/4286 [24:33:20<7:03:18, 24.76s/it] {'loss': 0.0015, 'grad_norm': 7.2702431090787085, 'learning_rate': 2.393840410639291e-07, 'completion_length': 298.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.035714288242161274, 'kl': 0.0367431640625, 'epoch': 0.76}
3261/4286 [24:33:43<6:57:02, 24.41s/it] {'loss': 0.0067, 'grad_norm': 5.068952928753127, 'learning_rate': 2.391507232851143e-07, 'completion_length': 270.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7976190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7976192235946655, 'reward_std': 0.02380952052772045, 'kl': 0.16650390625, 'epoch': 0.76}
3262/4286 [24:34:07<6:52:14, 24.15s/it] {'loss': 0.0043, 'grad_norm': 1.9293496956151284, 'learning_rate': 2.389174055062996e-07, 'completion_length': 303.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.0744047649204731, 'kl': 0.10693359375, 'epoch': 0.76}
3263/4286 [24:34:30<6:47:25, 23.90s/it] {'loss': 0.0046, 'grad_norm': 0.9651756539519637, 'learning_rate': 2.3868408772748485e-07, 'completion_length': 307.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6250000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6250001788139343, 'reward_std': 0.025651196017861366, 'kl': 0.114501953125, 'epoch': 0.76}
3264/4286 [24:34:54<6:48:12, 23.97s/it] {'loss': 0.0018, 'grad_norm': 20.75067494163826, 'learning_rate': 2.384507699486701e-07, 'completion_length': 300.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.880952388048172, 'rewards/format_reward': 1.0, 'reward': 1.880952537059784, 'reward_std': 0.04007173329591751, 'kl': 0.045166015625, 'epoch': 0.76}
3265/4286 [24:35:18<6:45:28, 23.83s/it] {'loss': 0.004, 'grad_norm': 1.8399747104834803, 'learning_rate': 2.3821745216985533e-07, 'completion_length': 296.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7053572237491608, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.03596102260053158, 'kl': 0.0999755859375, 'epoch': 0.76}
3266/4286 [24:35:43<6:52:16, 24.25s/it] {'loss': 0.0074, 'grad_norm': 74.91686958685477, 'learning_rate': 2.379841343910406e-07, 'completion_length': 325.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875001788139343, 'reward_std': 0.06983363814651966, 'kl': 0.183837890625, 'epoch': 0.76}
3267/4286 [24:36:06<6:46:36, 23.94s/it] {'loss': 0.0014, 'grad_norm': 1.7353336699296027, 'learning_rate': 2.3775081661222585e-07, 'completion_length': 274.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.791666716337204, 'rewards/format_reward': 1.0, 'reward': 1.7916667461395264, 'reward_std': 0.02816697023808956, 'kl': 0.0350341796875, 'epoch': 0.76}
3268/4286 [24:36:29<6:40:05, 23.58s/it] {'loss': 0.0033, 'grad_norm': 35.54332344089181, 'learning_rate': 2.375174988334111e-07, 'completion_length': 239.64287567138672, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.03160357568413019, 'kl': 0.0816650390625, 'epoch': 0.76}
3269/4286 [24:36:54<6:46:58, 24.01s/it] {'loss': 0.0113, 'grad_norm': 4.872063726270625, 'learning_rate': 2.3728418105459635e-07, 'completion_length': 322.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.13073870167136192, 'kl': 0.2822265625, 'epoch': 0.76}
3270/4286 [24:37:21<7:01:02, 24.86s/it] {'loss': 0.0025, 'grad_norm': 3.3916265126337284, 'learning_rate': 2.3705086327578162e-07, 'completion_length': 270.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994049549102783, 'reward_std': 0.08106430247426033, 'kl': 0.0615234375, 'epoch': 0.76}
3271/4286 [24:37:45<6:57:16, 24.67s/it] {'loss': 0.0074, 'grad_norm': 0.9296048961222954, 'learning_rate': 2.3681754549696687e-07, 'completion_length': 289.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.699404776096344, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.010309826582670212, 'kl': 0.184326171875, 'epoch': 0.76}
3272/4286 [24:38:10<6:55:51, 24.61s/it] {'loss': 0.0053, 'grad_norm': 0.8954189943689167, 'learning_rate': 2.3658422771815212e-07, 'completion_length': 273.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7425595819950104, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.019238397479057312, 'kl': 0.1334228515625, 'epoch': 0.76}
3273/4286 [24:38:34<6:52:32, 24.43s/it] {'loss': 0.0052, 'grad_norm': 69.17041533013553, 'learning_rate': 2.3635090993933737e-07, 'completion_length': 305.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7767857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.049215120263397694, 'kl': 0.129150390625, 'epoch': 0.76}
3274/4286 [24:38:58<6:52:09, 24.44s/it] {'loss': 0.0115, 'grad_norm': 2.4565921719178103, 'learning_rate': 2.3611759216052262e-07, 'completion_length': 306.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7455357611179352, 'rewards/format_reward': 1.0, 'reward': 1.7455357909202576, 'reward_std': 0.047405367717146873, 'kl': 0.2861328125, 'epoch': 0.76}
3275/4286 [24:39:23<6:53:15, 24.53s/it] {'loss': 0.0061, 'grad_norm': 15.527278397609397, 'learning_rate': 2.358842743817079e-07, 'completion_length': 319.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.8556548655033112, 'rewards/format_reward': 1.0, 'reward': 1.8556548953056335, 'reward_std': 0.0505952350795269, 'kl': 0.15264892578125, 'epoch': 0.76}
3276/4286 [24:39:47<6:51:40, 24.46s/it] {'loss': 0.0079, 'grad_norm': 1.807807497999558, 'learning_rate': 2.3565095660289314e-07, 'completion_length': 308.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 1.0, 'reward': 1.8005954027175903, 'reward_std': 0.03160357475280762, 'kl': 0.1962890625, 'epoch': 0.76}
3277/4286 [24:40:13<6:58:13, 24.87s/it] {'loss': 0.0079, 'grad_norm': 2.408550579116291, 'learning_rate': 2.354176388240784e-07, 'completion_length': 334.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.8520834147930145, 'rewards/format_reward': 1.0, 'reward': 1.852083444595337, 'reward_std': 0.0391546031460166, 'kl': 0.197509765625, 'epoch': 0.76}
3278/4286 [24:40:37<6:53:11, 24.60s/it] {'loss': 0.0014, 'grad_norm': 4.68321814524391, 'learning_rate': 2.3518432104526364e-07, 'completion_length': 281.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.023809518665075302, 'kl': 0.035888671875, 'epoch': 0.76}
3279/4286 [24:41:00<6:45:32, 24.16s/it] {'loss': 0.0149, 'grad_norm': 7.428769115775163, 'learning_rate': 2.349510032664489e-07, 'completion_length': 278.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.7083334922790527, 'reward_std': 0.07364453189074993, 'kl': 0.37353515625, 'epoch': 0.77}
3280/4286 [24:41:26<6:55:30, 24.78s/it] {'loss': 0.0023, 'grad_norm': 4.424464134929659, 'learning_rate': 2.3471768548763416e-07, 'completion_length': 291.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.02816697023808956, 'kl': 0.0572509765625, 'epoch': 0.77}
3281/4286 [24:41:50<6:48:35, 24.39s/it] {'loss': 0.0081, 'grad_norm': 4.296623993172492, 'learning_rate': 2.344843677088194e-07, 'completion_length': 293.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6755952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6755954027175903, 'reward_std': 0.043467560317367315, 'kl': 0.201904296875, 'epoch': 0.77}
3282/4286 [24:42:15<6:53:27, 24.71s/it] {'loss': 0.0048, 'grad_norm': 1.938746304518529, 'learning_rate': 2.3425104993000466e-07, 'completion_length': 273.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.68452388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6666667461395264, 'reward_std': 0.0714285746216774, 'kl': 0.1197509765625, 'epoch': 0.77}
3283/4286 [24:42:39<6:48:32, 24.44s/it] {'loss': 0.0019, 'grad_norm': 0.21692451184160663, 'learning_rate': 2.340177321511899e-07, 'completion_length': 272.6428756713867, 'rewards/only_full_func_accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.0, 'kl': 0.0469970703125, 'epoch': 0.77}
3284/4286 [24:43:04<6:52:15, 24.69s/it] {'loss': 0.0094, 'grad_norm': 3.7998175870742665, 'learning_rate': 2.3378441437237518e-07, 'completion_length': 312.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.04685881733894348, 'kl': 0.23583984375, 'epoch': 0.77}
3285/4286 [24:43:28<6:45:36, 24.31s/it] {'loss': 0.0083, 'grad_norm': 4.801205098530146, 'learning_rate': 2.3355109659356043e-07, 'completion_length': 306.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6190476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.05197649262845516, 'kl': 0.206298828125, 'epoch': 0.77}
3286/4286 [24:43:52<6:45:29, 24.33s/it] {'loss': 0.0042, 'grad_norm': 9.55795647108604, 'learning_rate': 2.3331777881474568e-07, 'completion_length': 277.0178756713867, 'rewards/only_full_func_accuracy_reward': 0.7633929252624512, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.0565476231276989, 'kl': 0.105712890625, 'epoch': 0.77}
3287/4286 [24:44:17<6:47:09, 24.45s/it] {'loss': 0.0064, 'grad_norm': 23.7467455142387, 'learning_rate': 2.3308446103593093e-07, 'completion_length': 315.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.020619653165340424, 'kl': 0.1591796875, 'epoch': 0.77}
3288/4286 [24:44:41<6:45:19, 24.37s/it] {'loss': 0.0115, 'grad_norm': 1.2837379271356457, 'learning_rate': 2.3285114325711618e-07, 'completion_length': 279.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.834821492433548, 'rewards/format_reward': 1.0, 'reward': 1.8348215818405151, 'reward_std': 0.05552634969353676, 'kl': 0.28704833984375, 'epoch': 0.77}
3289/4286 [24:45:04<6:38:26, 23.98s/it] {'loss': 0.0117, 'grad_norm': 4.049478469249587, 'learning_rate': 2.3261782547830145e-07, 'completion_length': 287.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 1.0, 'reward': 1.794642984867096, 'reward_std': 0.017857140861451626, 'kl': 0.29248046875, 'epoch': 0.77}
3290/4286 [24:45:28<6:39:17, 24.05s/it] {'loss': 0.0172, 'grad_norm': 28.995071503019986, 'learning_rate': 2.323845076994867e-07, 'completion_length': 276.7143020629883, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.08303267508745193, 'kl': 0.4287109375, 'epoch': 0.77}
3291/4286 [24:45:54<6:45:18, 24.44s/it] {'loss': 0.0214, 'grad_norm': 3.281501484976535, 'learning_rate': 2.3215118992067195e-07, 'completion_length': 289.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7866072058677673, 'rewards/format_reward': 1.0, 'reward': 1.786607265472412, 'reward_std': 0.03392856940627098, 'kl': 0.5345458984375, 'epoch': 0.77}
3292/4286 [24:46:19<6:52:09, 24.88s/it] {'loss': 0.0197, 'grad_norm': 4.321598150170037, 'learning_rate': 2.319178721418572e-07, 'completion_length': 328.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.727678656578064, 'rewards/format_reward': 1.0, 'reward': 1.7276787161827087, 'reward_std': 0.08630952425301075, 'kl': 0.4931640625, 'epoch': 0.77}
3293/4286 [24:46:44<6:50:42, 24.82s/it] {'loss': 0.0096, 'grad_norm': 3.257868488933729, 'learning_rate': 2.3168455436304247e-07, 'completion_length': 311.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.04464286006987095, 'kl': 0.23974609375, 'epoch': 0.77}
3294/4286 [24:47:08<6:46:02, 24.56s/it] {'loss': 0.013, 'grad_norm': 7.366113567505197, 'learning_rate': 2.3145123658422772e-07, 'completion_length': 291.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7520833313465118, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7342262864112854, 'reward_std': 0.14267826825380325, 'kl': 0.32421875, 'epoch': 0.77}
3295/4286 [24:47:32<6:44:12, 24.47s/it] {'loss': 0.0159, 'grad_norm': 5.8756912847876706, 'learning_rate': 2.3121791880541297e-07, 'completion_length': 302.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.0508419768884778, 'kl': 0.3994140625, 'epoch': 0.77}
3296/4286 [24:47:55<6:36:24, 24.02s/it] {'loss': 0.011, 'grad_norm': 13.25784751602665, 'learning_rate': 2.3098460102659822e-07, 'completion_length': 259.6428756713867, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.019238398410379887, 'kl': 0.2744140625, 'epoch': 0.77}
3297/4286 [24:48:20<6:39:34, 24.24s/it] {'loss': 0.0109, 'grad_norm': 9.313645531340795, 'learning_rate': 2.3075128324778347e-07, 'completion_length': 307.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.8860119581222534, 'rewards/format_reward': 1.0, 'reward': 1.8860120177268982, 'reward_std': 0.05178571492433548, 'kl': 0.2724609375, 'epoch': 0.77}
3298/4286 [24:48:45<6:41:54, 24.41s/it] {'loss': 0.0099, 'grad_norm': 6.297276974681706, 'learning_rate': 2.3051796546896874e-07, 'completion_length': 301.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7991071939468384, 'rewards/format_reward': 1.0, 'reward': 1.7991072535514832, 'reward_std': 0.02267500478774309, 'kl': 0.2470703125, 'epoch': 0.77}
3299/4286 [24:49:10<6:44:42, 24.60s/it] {'loss': 0.0118, 'grad_norm': 5.920077964571652, 'learning_rate': 2.30284647690154e-07, 'completion_length': 321.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.062286325730383396, 'kl': 0.29541015625, 'epoch': 0.77}
3300/4286 [24:49:36<6:50:47, 25.00s/it] {'loss': 0.0203, 'grad_norm': 11.255518115639177, 'learning_rate': 2.3005132991133924e-07, 'completion_length': 334.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.7872024774551392, 'rewards/format_reward': 1.0, 'reward': 1.787202537059784, 'reward_std': 0.06584258284419775, 'kl': 0.50927734375, 'epoch': 0.77}
3301/4286 [24:52:57<21:19:37, 77.95s/it] {'loss': 0.0143, 'grad_norm': 1.442497971420682, 'learning_rate': 2.298180121325245e-07, 'completion_length': 315.875, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.03847679682075977, 'kl': 0.3583984375, 'epoch': 0.77}
3302/4286 [24:53:22<16:56:43, 62.00s/it] {'loss': 0.023, 'grad_norm': 5.525657692953296, 'learning_rate': 2.2958469435370976e-07, 'completion_length': 304.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.09686282649636269, 'kl': 0.573486328125, 'epoch': 0.77}
3303/4286 [24:53:46<13:46:59, 50.48s/it] {'loss': 0.0105, 'grad_norm': 6.017194819968118, 'learning_rate': 2.29351376574895e-07, 'completion_length': 255.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6755953431129456, 'reward_std': 0.04273159895092249, 'kl': 0.26123046875, 'epoch': 0.77}
3304/4286 [24:54:10<11:39:25, 42.73s/it] {'loss': 0.0098, 'grad_norm': 3.3864281661565694, 'learning_rate': 2.2911805879608026e-07, 'completion_length': 297.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.6889881193637848, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.031299442052841187, 'kl': 0.24609375, 'epoch': 0.77}
3305/4286 [24:54:36<10:16:41, 37.72s/it] {'loss': 0.0179, 'grad_norm': 8.478906892150418, 'learning_rate': 2.288847410172655e-07, 'completion_length': 308.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.5825487375259399, 'rewards/format_reward': 1.0, 'reward': 1.5825488567352295, 'reward_std': 0.07615431025624275, 'kl': 0.4453125, 'epoch': 0.77}
3306/4286 [24:55:01<9:11:03, 33.74s/it] {'loss': 0.0185, 'grad_norm': 4.512098776653363, 'learning_rate': 2.2865142323845076e-07, 'completion_length': 310.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8407738506793976, 'rewards/format_reward': 1.0, 'reward': 1.8407739400863647, 'reward_std': 0.029548224061727524, 'kl': 0.463134765625, 'epoch': 0.77}
3307/4286 [24:55:25<8:22:02, 30.77s/it] {'loss': 0.0295, 'grad_norm': 5.996453597964397, 'learning_rate': 2.2841810545963603e-07, 'completion_length': 292.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7857143878936768, 'reward_std': 0.04559176787734032, 'kl': 0.73779296875, 'epoch': 0.77}
3308/4286 [24:55:49<7:51:05, 28.90s/it] {'loss': 0.0039, 'grad_norm': 1.1535832793458662, 'learning_rate': 2.2818478768082128e-07, 'completion_length': 266.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.9508928954601288, 'rewards/format_reward': 1.0, 'reward': 1.950892984867096, 'reward_std': 0.0295482249930501, 'kl': 0.098388671875, 'epoch': 0.77}
3309/4286 [24:56:14<7:29:08, 27.58s/it] {'loss': 0.0139, 'grad_norm': 5.3091674287311035, 'learning_rate': 2.2795146990200653e-07, 'completion_length': 306.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.741071492433548, 'rewards/format_reward': 1.0, 'reward': 1.7410715818405151, 'reward_std': 0.03847680054605007, 'kl': 0.3486328125, 'epoch': 0.77}
3310/4286 [24:56:38<7:10:51, 26.49s/it] {'loss': 0.0048, 'grad_norm': 3.0271224427560774, 'learning_rate': 2.2771815212319178e-07, 'completion_length': 299.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.04007173050194979, 'kl': 0.1201171875, 'epoch': 0.77}
3311/4286 [24:57:02<6:59:21, 25.81s/it] {'loss': 0.0073, 'grad_norm': 3.6282096116741935, 'learning_rate': 2.2748483434437703e-07, 'completion_length': 303.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.8333333730697632, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.07695359457284212, 'kl': 0.18359375, 'epoch': 0.77}
3312/4286 [24:57:28<6:59:20, 25.83s/it] {'loss': 0.0054, 'grad_norm': 12.71184469180745, 'learning_rate': 2.272515165655623e-07, 'completion_length': 291.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.11531119793653488, 'kl': 0.134521484375, 'epoch': 0.77}
3313/4286 [24:57:50<6:43:39, 24.89s/it] {'loss': 0.0045, 'grad_norm': 11.123065265616516, 'learning_rate': 2.2701819878674755e-07, 'completion_length': 286.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.659226268529892, 'rewards/format_reward': 1.0, 'reward': 1.6592263579368591, 'reward_std': 0.061151799745857716, 'kl': 0.1121826171875, 'epoch': 0.77}
3314/4286 [24:58:15<6:40:28, 24.72s/it] {'loss': 0.0032, 'grad_norm': 2.208263085848075, 'learning_rate': 2.267848810079328e-07, 'completion_length': 315.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8794643580913544, 'rewards/format_reward': 1.0, 'reward': 1.8794644474983215, 'reward_std': 0.0454615019261837, 'kl': 0.078857421875, 'epoch': 0.77}
3315/4286 [24:58:38<6:31:36, 24.20s/it] {'loss': 0.0087, 'grad_norm': 1.9285827983096562, 'learning_rate': 2.2655156322911805e-07, 'completion_length': 284.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.0, 'kl': 0.21728515625, 'epoch': 0.77}
3316/4286 [24:59:01<6:24:46, 23.80s/it] {'loss': 0.0168, 'grad_norm': 3.565774690200104, 'learning_rate': 2.2631824545030333e-07, 'completion_length': 277.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.8467262983322144, 'rewards/format_reward': 1.0, 'reward': 1.8467263579368591, 'reward_std': 0.08478906378149986, 'kl': 0.419921875, 'epoch': 0.77}
3317/4286 [24:59:25<6:28:30, 24.06s/it] {'loss': 0.004, 'grad_norm': 12.70405371206608, 'learning_rate': 2.2608492767148857e-07, 'completion_length': 263.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.7346230447292328, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7167659997940063, 'reward_std': 0.07909649796783924, 'kl': 0.1007080078125, 'epoch': 0.77}
3318/4286 [24:59:50<6:31:01, 24.24s/it] {'loss': 0.0097, 'grad_norm': 1.6358850304800796, 'learning_rate': 2.2585160989267382e-07, 'completion_length': 324.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.041666665114462376, 'kl': 0.2431640625, 'epoch': 0.77}
3319/4286 [25:00:14<6:28:57, 24.13s/it] {'loss': 0.0026, 'grad_norm': 16.709595091407934, 'learning_rate': 2.2561829211385907e-07, 'completion_length': 261.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7961310148239136, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.022675009444355965, 'kl': 0.0640869140625, 'epoch': 0.77}
3320/4286 [25:00:40<6:35:58, 24.59s/it] {'loss': 0.0109, 'grad_norm': 4.062796254723654, 'learning_rate': 2.2538497433504432e-07, 'completion_length': 312.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.040071734227240086, 'kl': 0.2734375, 'epoch': 0.77}
3321/4286 [25:01:04<6:32:49, 24.42s/it] {'loss': 0.0024, 'grad_norm': 2.1989382433832207, 'learning_rate': 2.251516565562296e-07, 'completion_length': 288.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.6904762089252472, 'rewards/format_reward': 1.0, 'reward': 1.6904763579368591, 'reward_std': 0.0892857126891613, 'kl': 0.0604248046875, 'epoch': 0.77}
3322/4286 [25:01:28<6:31:52, 24.39s/it] {'loss': 0.0085, 'grad_norm': 3.8954359887126753, 'learning_rate': 2.2491833877741484e-07, 'completion_length': 318.89288330078125, 'rewards/only_full_func_accuracy_reward': 0.8303572237491608, 'rewards/format_reward': 1.0, 'reward': 1.8303571939468384, 'reward_std': 0.035714288242161274, 'kl': 0.21435546875, 'epoch': 0.78}
3323/4286 [25:01:52<6:28:08, 24.18s/it] {'loss': 0.0049, 'grad_norm': 3.957914410860818, 'learning_rate': 2.246850209986001e-07, 'completion_length': 270.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.5952381193637848, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.02816697023808956, 'kl': 0.122802734375, 'epoch': 0.78}
3324/4286 [25:02:15<6:24:53, 24.01s/it] {'loss': 0.0049, 'grad_norm': 0.933206712251164, 'learning_rate': 2.2445170321978534e-07, 'completion_length': 290.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.8169643580913544, 'rewards/format_reward': 1.0, 'reward': 1.8169644474983215, 'reward_std': 0.008928571827709675, 'kl': 0.1236572265625, 'epoch': 0.78}
3325/4286 [25:02:39<6:24:39, 24.02s/it] {'loss': 0.0016, 'grad_norm': 0.41299396913115277, 'learning_rate': 2.242183854409706e-07, 'completion_length': 273.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.867559552192688, 'rewards/format_reward': 1.0, 'reward': 1.8675596714019775, 'reward_std': 0.0267857164144516, 'kl': 0.0400390625, 'epoch': 0.78}
3326/4286 [25:03:04<6:26:50, 24.18s/it] {'loss': 0.003, 'grad_norm': 6.45002199733849, 'learning_rate': 2.2398506766215587e-07, 'completion_length': 306.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.8125000894069672, 'rewards/format_reward': 1.0, 'reward': 1.8125000596046448, 'reward_std': 0.060444651171565056, 'kl': 0.073974609375, 'epoch': 0.78}
3327/4286 [25:03:27<6:22:34, 23.94s/it] {'loss': 0.0083, 'grad_norm': 2.3938833209143593, 'learning_rate': 2.2375174988334111e-07, 'completion_length': 294.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.04602411016821861, 'kl': 0.2060546875, 'epoch': 0.78}
3328/4286 [25:03:51<6:23:19, 24.01s/it] {'loss': 0.0037, 'grad_norm': 7.029823115235172, 'learning_rate': 2.2351843210452636e-07, 'completion_length': 257.4643020629883, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.03709554299712181, 'kl': 0.09375, 'epoch': 0.78}
3329/4286 [25:04:14<6:17:19, 23.66s/it] {'loss': 0.0085, 'grad_norm': 4.310449803413027, 'learning_rate': 2.232851143257116e-07, 'completion_length': 242.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.65476194024086, 'rewards/format_reward': 1.0, 'reward': 1.654762089252472, 'reward_std': 0.05952381156384945, 'kl': 0.212890625, 'epoch': 0.78}
3330/4286 [25:04:38<6:18:19, 23.74s/it] {'loss': 0.0053, 'grad_norm': 0.9389667599145475, 'learning_rate': 2.2305179654689689e-07, 'completion_length': 312.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.8928572535514832, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.032524412497878075, 'kl': 0.1328125, 'epoch': 0.78}
3331/4286 [25:05:03<6:24:15, 24.14s/it] {'loss': 0.0044, 'grad_norm': 7.813389034614745, 'learning_rate': 2.2281847876808214e-07, 'completion_length': 279.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6781463027000427, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660289227962494, 'reward_std': 0.10161345079541206, 'kl': 0.10888671875, 'epoch': 0.78}
3332/4286 [25:05:27<6:20:18, 23.92s/it] {'loss': 0.008, 'grad_norm': 7.827722326499487, 'learning_rate': 2.2258516098926738e-07, 'completion_length': 289.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 1.0, 'reward': 1.602678656578064, 'reward_std': 0.05838929582387209, 'kl': 0.19970703125, 'epoch': 0.78}
3333/4286 [25:05:50<6:19:49, 23.91s/it] {'loss': 0.0068, 'grad_norm': 3.8078362944570023, 'learning_rate': 2.2235184321045263e-07, 'completion_length': 271.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8020833134651184, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.03273809514939785, 'kl': 0.170654296875, 'epoch': 0.78}
3334/4286 [25:06:14<6:18:13, 23.84s/it] {'loss': 0.0112, 'grad_norm': 40.65535724724406, 'learning_rate': 2.2211852543163788e-07, 'completion_length': 269.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.793154776096344, 'rewards/format_reward': 1.0, 'reward': 1.7931548953056335, 'reward_std': 0.03457976318895817, 'kl': 0.27880859375, 'epoch': 0.78}
3335/4286 [25:06:38<6:19:13, 23.93s/it] {'loss': 0.0037, 'grad_norm': 0.5364669803764124, 'learning_rate': 2.2188520765282316e-07, 'completion_length': 252.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.824404776096344, 'rewards/format_reward': 1.0, 'reward': 1.8244048953056335, 'reward_std': 0.010309826582670212, 'kl': 0.09130859375, 'epoch': 0.78}
3336/4286 [25:07:03<6:22:33, 24.16s/it] {'loss': 0.007, 'grad_norm': 5.356909614508622, 'learning_rate': 2.216518898740084e-07, 'completion_length': 289.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7125000059604645, 'rewards/format_reward': 1.0, 'reward': 1.7125000953674316, 'reward_std': 0.05966486781835556, 'kl': 0.175537109375, 'epoch': 0.78}
3337/4286 [25:07:27<6:19:49, 24.01s/it] {'loss': 0.0172, 'grad_norm': 1.3313344314039837, 'learning_rate': 2.2141857209519365e-07, 'completion_length': 289.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7708333730697632, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.050381556153297424, 'kl': 0.4306640625, 'epoch': 0.78}
3338/4286 [25:07:51<6:20:23, 24.08s/it] {'loss': 0.0104, 'grad_norm': 10.054469445877938, 'learning_rate': 2.211852543163789e-07, 'completion_length': 315.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7339286208152771, 'rewards/format_reward': 1.0, 'reward': 1.7339286804199219, 'reward_std': 0.04508154187351465, 'kl': 0.26025390625, 'epoch': 0.78}
3339/4286 [25:08:16<6:26:53, 24.51s/it] {'loss': 0.0144, 'grad_norm': 10.600175956964986, 'learning_rate': 2.2095193653756418e-07, 'completion_length': 293.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.7366072237491608, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7187501788139343, 'reward_std': 0.09856690838932991, 'kl': 0.361328125, 'epoch': 0.78}
3340/4286 [25:08:41<6:27:19, 24.57s/it] {'loss': 0.0023, 'grad_norm': 0.24337482024488735, 'learning_rate': 2.2071861875874943e-07, 'completion_length': 301.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8869048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8869048953056335, 'reward_std': 0.0, 'kl': 0.057373046875, 'epoch': 0.78}
3341/4286 [25:09:06<6:27:19, 24.59s/it] {'loss': 0.0165, 'grad_norm': 6.505184257868675, 'learning_rate': 2.2048530097993467e-07, 'completion_length': 273.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.647321492433548, 'rewards/format_reward': 1.0, 'reward': 1.6473215818405151, 'reward_std': 0.0799297858029604, 'kl': 0.41259765625, 'epoch': 0.78}
3342/4286 [25:09:31<6:31:42, 24.90s/it] {'loss': 0.0063, 'grad_norm': 1.2098272107349195, 'learning_rate': 2.2025198320111992e-07, 'completion_length': 312.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.8982143104076385, 'rewards/format_reward': 1.0, 'reward': 1.898214340209961, 'reward_std': 0.02635884564369917, 'kl': 0.15771484375, 'epoch': 0.78}
3343/4286 [25:09:54<6:22:35, 24.34s/it] {'loss': 0.0057, 'grad_norm': 1.4733375648659948, 'learning_rate': 2.2001866542230517e-07, 'completion_length': 258.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.05541310179978609, 'kl': 0.1439208984375, 'epoch': 0.78}
3344/4286 [25:10:20<6:27:59, 24.71s/it] {'loss': 0.0266, 'grad_norm': 6.577619733072118, 'learning_rate': 2.1978534764349045e-07, 'completion_length': 299.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.674107164144516, 'rewards/format_reward': 1.0, 'reward': 1.6741071939468384, 'reward_std': 0.04876268282532692, 'kl': 0.6640625, 'epoch': 0.78}
3345/4286 [25:10:44<6:25:54, 24.61s/it] {'loss': 0.0137, 'grad_norm': 6.737088754000205, 'learning_rate': 2.195520298646757e-07, 'completion_length': 323.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6383928656578064, 'rewards/format_reward': 1.0, 'reward': 1.6383930444717407, 'reward_std': 0.041765548288822174, 'kl': 0.341796875, 'epoch': 0.78}
3346/4286 [25:11:08<6:20:03, 24.26s/it] {'loss': 0.0024, 'grad_norm': 2.465584742125364, 'learning_rate': 2.1931871208586094e-07, 'completion_length': 277.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7369048297405243, 'rewards/format_reward': 1.0, 'reward': 1.7369048595428467, 'reward_std': 0.02065294049680233, 'kl': 0.0601806640625, 'epoch': 0.78}
3347/4286 [25:11:33<6:26:00, 24.67s/it] {'loss': 0.012, 'grad_norm': 8.833941665510864, 'learning_rate': 2.190853943070462e-07, 'completion_length': 307.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6684524118900299, 'rewards/format_reward': 1.0, 'reward': 1.6684524416923523, 'reward_std': 0.04404761781916022, 'kl': 0.30078125, 'epoch': 0.78}
3348/4286 [25:11:58<6:23:04, 24.50s/it] {'loss': 0.0041, 'grad_norm': 2.300115460790436, 'learning_rate': 2.1885207652823144e-07, 'completion_length': 282.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.07695358619093895, 'kl': 0.1025390625, 'epoch': 0.78}
3349/4286 [25:12:22<6:24:16, 24.61s/it] {'loss': 0.0137, 'grad_norm': 20.6210033507057, 'learning_rate': 2.186187587494167e-07, 'completion_length': 304.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127978205680847, 'reward_std': 0.04053214751183987, 'kl': 0.34228515625, 'epoch': 0.78}
3350/4286 [25:12:46<6:21:36, 24.46s/it] {'loss': 0.0126, 'grad_norm': 2.678741311974346, 'learning_rate': 2.1838544097060194e-07, 'completion_length': 287.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.7372024059295654, 'rewards/format_reward': 1.0, 'reward': 1.7372024059295654, 'reward_std': 0.015009318944066763, 'kl': 0.314208984375, 'epoch': 0.78}
3351/4286 [25:13:10<6:16:21, 24.15s/it] {'loss': 0.0015, 'grad_norm': 23.435575564189268, 'learning_rate': 2.181521231917872e-07, 'completion_length': 280.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.7767857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7767859101295471, 'reward_std': 0.03703813906759024, 'kl': 0.0379638671875, 'epoch': 0.78}
3352/4286 [25:13:35<6:18:35, 24.32s/it] {'loss': 0.0088, 'grad_norm': 26.385689708096265, 'learning_rate': 2.1791880541297244e-07, 'completion_length': 311.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.5684524476528168, 'rewards/format_reward': 1.0, 'reward': 1.568452537059784, 'reward_std': 0.0535714216530323, 'kl': 0.220947265625, 'epoch': 0.78}
3353/4286 [25:14:01<6:26:00, 24.82s/it] {'loss': 0.0054, 'grad_norm': 5.425328416587254, 'learning_rate': 2.176854876341577e-07, 'completion_length': 304.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.05029458552598953, 'kl': 0.134033203125, 'epoch': 0.78}
3354/4286 [25:14:25<6:23:44, 24.70s/it] {'loss': 0.0083, 'grad_norm': 5.911438237160597, 'learning_rate': 2.1745216985534296e-07, 'completion_length': 312.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7991072237491608, 'rewards/format_reward': 1.0, 'reward': 1.7991072535514832, 'reward_std': 0.061151799745857716, 'kl': 0.20849609375, 'epoch': 0.78}
3355/4286 [25:14:49<6:21:49, 24.61s/it] {'loss': 0.0185, 'grad_norm': 5.9675874355223275, 'learning_rate': 2.172188520765282e-07, 'completion_length': 313.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.74702388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7291668057441711, 'reward_std': 0.09686957485973835, 'kl': 0.4620361328125, 'epoch': 0.78}
3356/4286 [25:15:13<6:14:22, 24.15s/it] {'loss': 0.011, 'grad_norm': 1.8387740656741094, 'learning_rate': 2.1698553429771346e-07, 'completion_length': 236.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8221726417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8043155670166016, 'reward_std': 0.0580357164144516, 'kl': 0.273681640625, 'epoch': 0.78}
3357/4286 [25:15:38<6:18:55, 24.47s/it] {'loss': 0.0056, 'grad_norm': 2.419660809793807, 'learning_rate': 2.167522165188987e-07, 'completion_length': 297.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8139881193637848, 'rewards/format_reward': 1.0, 'reward': 1.8139882683753967, 'reward_std': 0.06526251137256622, 'kl': 0.1407470703125, 'epoch': 0.78}
3358/4286 [25:16:02<6:19:13, 24.52s/it] {'loss': 0.0202, 'grad_norm': 11.999267105721113, 'learning_rate': 2.1651889874008398e-07, 'completion_length': 300.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6595238447189331, 'rewards/format_reward': 1.0, 'reward': 1.6595239043235779, 'reward_std': 0.05354097671806812, 'kl': 0.5009765625, 'epoch': 0.78}
3359/4286 [25:16:26<6:15:13, 24.29s/it] {'loss': 0.0043, 'grad_norm': 7.013984651649385, 'learning_rate': 2.1628558096126923e-07, 'completion_length': 296.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.037095542065799236, 'kl': 0.1068115234375, 'epoch': 0.78}
3360/4286 [25:16:51<6:17:32, 24.46s/it] {'loss': 0.0122, 'grad_norm': 34.33406770039503, 'learning_rate': 2.1605226318245448e-07, 'completion_length': 311.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.860119104385376, 'rewards/format_reward': 1.0, 'reward': 1.8601191639900208, 'reward_std': 0.011397944763302803, 'kl': 0.30560302734375, 'epoch': 0.78}
3361/4286 [25:17:15<6:16:36, 24.43s/it] {'loss': 0.0104, 'grad_norm': 12.804604588274335, 'learning_rate': 2.1581894540363973e-07, 'completion_length': 294.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.5848214626312256, 'rewards/format_reward': 1.0, 'reward': 1.5848215222358704, 'reward_std': 0.041452993638813496, 'kl': 0.25927734375, 'epoch': 0.78}
3362/4286 [25:17:40<6:18:08, 24.55s/it] {'loss': 0.0157, 'grad_norm': 5.564352184191067, 'learning_rate': 2.1558562762482498e-07, 'completion_length': 303.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7455357909202576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7276787161827087, 'reward_std': 0.1016509011387825, 'kl': 0.39208984375, 'epoch': 0.78}
3363/4286 [25:18:04<6:15:50, 24.43s/it] {'loss': 0.0334, 'grad_norm': 10.59274558042067, 'learning_rate': 2.1535230984601025e-07, 'completion_length': 299.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7656250894069672, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.747767984867096, 'reward_std': 0.08779762126505375, 'kl': 0.83203125, 'epoch': 0.78}
3364/4286 [25:18:29<6:18:32, 24.63s/it] {'loss': 0.009, 'grad_norm': 41.64777610135646, 'learning_rate': 2.151189920671955e-07, 'completion_length': 285.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.09204821847379208, 'kl': 0.22509765625, 'epoch': 0.78}
3365/4286 [25:18:54<6:19:07, 24.70s/it] {'loss': 0.0231, 'grad_norm': 9.857316634083126, 'learning_rate': 2.1488567428838075e-07, 'completion_length': 309.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5937500596046448, 'reward_std': 0.11536476388573647, 'kl': 0.576171875, 'epoch': 0.79}
3366/4286 [25:19:18<6:16:30, 24.55s/it] {'loss': 0.0034, 'grad_norm': 0.7548685409339041, 'learning_rate': 2.14652356509566e-07, 'completion_length': 267.07144927978516, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.779762089252472, 'reward_std': 0.025651196017861366, 'kl': 0.084228515625, 'epoch': 0.79}
3367/4286 [25:19:43<6:17:35, 24.65s/it] {'loss': 0.0049, 'grad_norm': 21.90314463198669, 'learning_rate': 2.1441903873075127e-07, 'completion_length': 312.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.8529762029647827, 'rewards/format_reward': 1.0, 'reward': 1.8529762625694275, 'reward_std': 0.04928513988852501, 'kl': 0.1217041015625, 'epoch': 0.79}
3368/4286 [25:20:07<6:14:20, 24.47s/it] {'loss': 0.0023, 'grad_norm': 0.27098940334973526, 'learning_rate': 2.1418572095193652e-07, 'completion_length': 300.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.8333334028720856, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.011904764920473099, 'kl': 0.0562744140625, 'epoch': 0.79}
3369/4286 [25:20:32<6:15:46, 24.59s/it] {'loss': 0.0054, 'grad_norm': 1.8690182050805846, 'learning_rate': 2.1395240317312177e-07, 'completion_length': 300.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.036542316898703575, 'kl': 0.1358642578125, 'epoch': 0.79}
3370/4286 [25:20:55<6:07:29, 24.07s/it] {'loss': 0.0145, 'grad_norm': 6.622327241298602, 'learning_rate': 2.1371908539430702e-07, 'completion_length': 278.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7053572237491608, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.09196125343441963, 'kl': 0.36376953125, 'epoch': 0.79}
3371/4286 [25:21:21<6:15:26, 24.62s/it] {'loss': 0.0322, 'grad_norm': 21.16525234250736, 'learning_rate': 2.1348576761549227e-07, 'completion_length': 279.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.8482142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8303572535514832, 'reward_std': 0.06220148503780365, 'kl': 0.8084716796875, 'epoch': 0.79}
3372/4286 [25:21:45<6:13:17, 24.50s/it] {'loss': 0.0105, 'grad_norm': 11.90668939598064, 'learning_rate': 2.1325244983667754e-07, 'completion_length': 286.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.8184524774551392, 'reward_std': 0.034119345247745514, 'kl': 0.261474609375, 'epoch': 0.79}
3373/4286 [25:22:12<6:21:36, 25.08s/it] {'loss': 0.003, 'grad_norm': 0.6240451627706305, 'learning_rate': 2.130191320578628e-07, 'completion_length': 329.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.0, 'kl': 0.076171875, 'epoch': 0.79}
3374/4286 [25:22:36<6:18:24, 24.90s/it] {'loss': 0.0073, 'grad_norm': 7.744788312025476, 'learning_rate': 2.1278581427904804e-07, 'completion_length': 255.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.03847679682075977, 'kl': 0.18115234375, 'epoch': 0.79}
3375/4286 [25:23:00<6:13:16, 24.58s/it] {'loss': 0.0145, 'grad_norm': 12.000595452535126, 'learning_rate': 2.125524965002333e-07, 'completion_length': 295.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.1258230283856392, 'kl': 0.3623046875, 'epoch': 0.79}
3376/4286 [25:23:25<6:14:20, 24.68s/it] {'loss': 0.0082, 'grad_norm': 14.489069311919609, 'learning_rate': 2.1231917872141856e-07, 'completion_length': 324.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7264881730079651, 'rewards/format_reward': 1.0, 'reward': 1.7264882326126099, 'reward_std': 0.024404759518802166, 'kl': 0.203857421875, 'epoch': 0.79}
3377/4286 [25:23:50<6:13:45, 24.67s/it] {'loss': 0.01, 'grad_norm': 3.391130393781795, 'learning_rate': 2.120858609426038e-07, 'completion_length': 297.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.04751862213015556, 'kl': 0.24951171875, 'epoch': 0.79}
3378/4286 [25:24:14<6:09:58, 24.45s/it] {'loss': 0.0029, 'grad_norm': 2.7366254199844757, 'learning_rate': 2.1185254316378906e-07, 'completion_length': 267.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8452381491661072, 'rewards/format_reward': 1.0, 'reward': 1.845238208770752, 'reward_std': 0.0357142873108387, 'kl': 0.0728759765625, 'epoch': 0.79}
3379/4286 [25:24:38<6:11:58, 24.61s/it] {'loss': 0.0247, 'grad_norm': 1.6918259151716768, 'learning_rate': 2.116192253849743e-07, 'completion_length': 322.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6160715222358704, 'reward_std': 0.06547619216144085, 'kl': 0.615234375, 'epoch': 0.79}
3380/4286 [25:25:04<6:14:20, 24.79s/it] {'loss': 0.0213, 'grad_norm': 10.639310352691984, 'learning_rate': 2.1138590760615956e-07, 'completion_length': 296.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.4687500447034836, 'rewards/format_reward': 1.0, 'reward': 1.4687500596046448, 'reward_std': 0.1064148498699069, 'kl': 0.53173828125, 'epoch': 0.79}
3381/4286 [25:25:27<6:07:00, 24.33s/it] {'loss': 0.0015, 'grad_norm': 6.365398080453338, 'learning_rate': 2.1115258982734483e-07, 'completion_length': 292.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8333333730697632, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.05633394047617912, 'kl': 0.0377197265625, 'epoch': 0.79}
3382/4286 [25:25:51<6:05:15, 24.24s/it] {'loss': 0.0025, 'grad_norm': 0.42600488119695046, 'learning_rate': 2.1091927204853008e-07, 'completion_length': 243.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.8630952537059784, 'rewards/format_reward': 1.0, 'reward': 1.8630953431129456, 'reward_std': 0.0, 'kl': 0.06256103515625, 'epoch': 0.79}
3383/4286 [25:26:16<6:06:46, 24.37s/it] {'loss': 0.0139, 'grad_norm': 1.3416152705650972, 'learning_rate': 2.1068595426971533e-07, 'completion_length': 322.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7217262983322144, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.703869104385376, 'reward_std': 0.12343812733888626, 'kl': 0.345458984375, 'epoch': 0.79}
3384/4286 [25:26:40<6:05:41, 24.33s/it] {'loss': 0.0125, 'grad_norm': 2.75716895262654, 'learning_rate': 2.1045263649090058e-07, 'completion_length': 289.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7157739400863647, 'reward_std': 0.10852411016821861, 'kl': 0.3134765625, 'epoch': 0.79}
3385/4286 [25:27:06<6:15:25, 25.00s/it] {'loss': 0.0143, 'grad_norm': 7.225654178576317, 'learning_rate': 2.1021931871208583e-07, 'completion_length': 336.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6877976655960083, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6699405312538147, 'reward_std': 0.07519078068435192, 'kl': 0.35791015625, 'epoch': 0.79}
3386/4286 [25:27:30<6:09:59, 24.67s/it] {'loss': 0.0097, 'grad_norm': 6.991129507294315, 'learning_rate': 2.099860009332711e-07, 'completion_length': 282.3393020629883, 'rewards/only_full_func_accuracy_reward': 0.7544643580913544, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.03869047574698925, 'kl': 0.24072265625, 'epoch': 0.79}
3387/4286 [25:27:54<6:05:19, 24.38s/it] {'loss': 0.0017, 'grad_norm': 12.852868668853628, 'learning_rate': 2.0975268315445635e-07, 'completion_length': 302.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.7202382683753967, 'reward_std': 0.044429173693060875, 'kl': 0.043701171875, 'epoch': 0.79}
3388/4286 [25:28:20<6:11:08, 24.80s/it] {'loss': 0.0064, 'grad_norm': 5.485137711431561, 'learning_rate': 2.095193653756416e-07, 'completion_length': 356.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.8377976715564728, 'rewards/format_reward': 1.0, 'reward': 1.8377978205680847, 'reward_std': 0.03407294489443302, 'kl': 0.1611328125, 'epoch': 0.79}
3389/4286 [25:28:45<6:10:28, 24.78s/it] {'loss': 0.0044, 'grad_norm': 16.375881347202625, 'learning_rate': 2.0928604759682685e-07, 'completion_length': 308.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.68452388048172, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.04602411389350891, 'kl': 0.11083984375, 'epoch': 0.79}
3390/4286 [25:29:09<6:07:57, 24.64s/it] {'loss': 0.0025, 'grad_norm': 2.269827348556678, 'learning_rate': 2.0905272981801213e-07, 'completion_length': 295.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 1.0, 'reward': 1.799107313156128, 'reward_std': 0.02267500478774309, 'kl': 0.063232421875, 'epoch': 0.79}
3391/4286 [25:29:34<6:10:37, 24.85s/it] {'loss': 0.0019, 'grad_norm': 3.9792628645247308, 'learning_rate': 2.0881941203919737e-07, 'completion_length': 271.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380954027175903, 'reward_std': 0.0952380932867527, 'kl': 0.047607421875, 'epoch': 0.79}
3392/4286 [25:30:00<6:14:42, 25.15s/it] {'loss': 0.0169, 'grad_norm': 2.010374035222762, 'learning_rate': 2.0858609426038262e-07, 'completion_length': 251.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6741071939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6383929252624512, 'reward_std': 0.05016787815839052, 'kl': 0.423828125, 'epoch': 0.79}
3393/4286 [25:30:24<6:09:15, 24.81s/it] {'loss': 0.0137, 'grad_norm': 7.200467816697684, 'learning_rate': 2.0835277648156787e-07, 'completion_length': 305.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6979167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.10043024737387896, 'kl': 0.342041015625, 'epoch': 0.79}
3394/4286 [25:30:48<6:06:13, 24.63s/it] {'loss': 0.0046, 'grad_norm': 40.53887694211277, 'learning_rate': 2.0811945870275312e-07, 'completion_length': 296.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.6800595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6800596714019775, 'reward_std': 0.06395191513001919, 'kl': 0.116455078125, 'epoch': 0.79}
3395/4286 [25:31:14<6:12:33, 25.09s/it] {'loss': 0.0115, 'grad_norm': 1.2058298472056732, 'learning_rate': 2.078861409239384e-07, 'completion_length': 291.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.754464328289032, 'reward_std': 0.06856869719922543, 'kl': 0.2847900390625, 'epoch': 0.79}
3396/4286 [25:31:40<6:14:27, 25.24s/it] {'loss': 0.0113, 'grad_norm': 6.294712102032854, 'learning_rate': 2.0765282314512364e-07, 'completion_length': 288.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.7949405312538147, 'rewards/format_reward': 1.0, 'reward': 1.7949405908584595, 'reward_std': 0.05140128172934055, 'kl': 0.2822265625, 'epoch': 0.79}
3397/4286 [25:32:04<6:06:55, 24.76s/it] {'loss': 0.0133, 'grad_norm': 3.9916220202642023, 'learning_rate': 2.074195053663089e-07, 'completion_length': 264.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.09262829646468163, 'kl': 0.33251953125, 'epoch': 0.79}
3398/4286 [25:32:29<6:10:19, 25.02s/it] {'loss': 0.0192, 'grad_norm': 9.118823315068653, 'learning_rate': 2.0718618758749414e-07, 'completion_length': 307.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.6488095819950104, 'rewards/format_reward': 1.0, 'reward': 1.6488096714019775, 'reward_std': 0.044429169967770576, 'kl': 0.47802734375, 'epoch': 0.79}
3399/4286 [25:32:53<6:06:03, 24.76s/it] {'loss': 0.0024, 'grad_norm': 12.966811659540976, 'learning_rate': 2.0695286980867942e-07, 'completion_length': 294.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8080357611179352, 'rewards/format_reward': 1.0, 'reward': 1.8080359101295471, 'reward_std': 0.04648452252149582,
'kl': 0.061279296875, 'epoch': 0.79} 79%|███████▉ | 3399/4286 [25:32:53<6:06:03, 24.76s/it] 79%|███████▉ | 3400/4286 [25:33:19<6:09:27, 25.02s/it] {'loss': 0.0144, 'grad_norm': 3.0755708888056366, 'learning_rate': 2.0671955202986466e-07, 'completion_length': 311.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8385417461395264, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8206846714019775, 'reward_std': 0.07291667349636555, 'kl': 0.35888671875, 'epoch': 0.79} 79%|███████▉ | 3400/4286 [25:33:19<6:09:27, 25.02s/it] 79%|███████▉ | 3401/4286 [25:38:22<26:37:46, 108.32s/it] {'loss': 0.0164, 'grad_norm': 2.7547643715995167, 'learning_rate': 2.0648623425104991e-07, 'completion_length': 325.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6895833611488342, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6717262864112854, 'reward_std': 0.0658232532441616, 'kl': 0.4091796875, 'epoch': 0.79} 79%|███████▉ | 3401/4286 [25:38:22<26:37:46, 108.32s/it] 79%|███████▉ | 3402/4286 [25:38:48<20:31:25, 83.58s/it] {'loss': 0.0049, 'grad_norm': 3.215836896635711, 'learning_rate': 2.0625291647223516e-07, 'completion_length': 319.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.8809524178504944, 'rewards/format_reward': 1.0, 'reward': 1.8809524774551392, 'reward_std': 0.0, 'kl': 0.12255859375, 'epoch': 0.79} 79%|███████▉ | 3402/4286 [25:38:48<20:31:25, 83.58s/it] 79%|███████▉ | 3403/4286 [25:39:13<16:13:44, 66.17s/it] {'loss': 0.0084, 'grad_norm': 11.004594651219579, 'learning_rate': 2.060195986934204e-07, 'completion_length': 331.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7633929252624512, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.05654761753976345, 'kl': 0.2098388671875, 'epoch': 0.79} 79%|███████▉ | 3403/4286 [25:39:13<16:13:44, 66.17s/it] 79%|███████▉ | 3404/4286 [25:39:38<13:09:53, 53.73s/it] {'loss': 0.0104, 'grad_norm': 20.27116398571654, 'learning_rate': 2.0578628091460569e-07, 'completion_length': 317.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.11107293516397476, 'kl': 0.26025390625, 'epoch': 0.79} 79%|███████▉ | 3404/4286 [25:39:38<13:09:53, 53.73s/it] 79%|███████▉ | 3405/4286 [25:40:02<11:00:31, 44.98s/it] {'loss': 0.0049, 'grad_norm': 3.9493870558615267, 'learning_rate': 2.0555296313579093e-07, 'completion_length': 331.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7038690745830536, 'rewards/format_reward': 1.0, 'reward': 1.703869104385376, 'reward_std': 0.060946542769670486, 'kl': 0.12353515625, 'epoch': 0.79} 79%|███████▉ | 3405/4286 [25:40:02<11:00:31, 44.98s/it] 79%|███████▉ | 3406/4286 [25:40:28<9:33:29, 39.10s/it] {'loss': 0.0157, 'grad_norm': 32.28381594284659, 'learning_rate': 2.0531964535697618e-07, 'completion_length': 291.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.6166666746139526, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5988095998764038, 'reward_std': 0.07927362807095051, 'kl': 0.392578125, 'epoch': 0.79} 79%|███████▉ | 3406/4286 [25:40:28<9:33:29, 39.10s/it] 79%|███████▉ | 3407/4286 [25:40:53<8:31:41, 34.93s/it] {'loss': 0.0025, 'grad_norm': 15.543725060867208, 'learning_rate': 2.0508632757816143e-07, 'completion_length': 313.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.04027372598648071, 'kl': 0.0621337890625, 'epoch': 0.79} 
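Aside from two outliers, the step time holds at roughly 24-25 s/it; step 3401 above jumps to 108.32 s/it (and step 3501 later to 96.88 s/it), which plausibly reflects periodic checkpointing or evaluation at round-numbered steps, though the log itself does not say. A minimal, self-contained sketch for flagging such stalls, assuming this transcript is saved to a hypothetical train.log:

import re

# Progress fragments look like "3401/4286 [25:38:22<26:37:46, 108.32s/it]".
STEP_TIME = re.compile(r"(\d+)/\d+ \[[^\]]*, ([0-9.]+)s/it\]")

def slow_steps(path="train.log", threshold=40.0):  # "train.log" is a stand-in path
    with open(path, encoding="utf-8") as f:
        text = f.read()
    seen = {}
    for step, s_per_it in STEP_TIME.findall(text):
        seen[int(step)] = float(s_per_it)  # keep the last reading for each step
    return sorted((s, t) for s, t in seen.items() if t > threshold)

print(slow_steps())  # expect entries such as (3401, 108.32) and (3501, 96.88)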
80% | 3408/4286 [25:41:16<7:39:38, 31.41s/it] {'loss': 0.0062, 'grad_norm': 5.790989435514376, 'learning_rate': 2.0485300979934668e-07, 'completion_length': 276.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.767857313156128, 'reward_std': 0.0357142873108387, 'kl': 0.15625, 'epoch': 0.8}
80% | 3409/4286 [25:41:41<7:09:34, 29.39s/it] {'loss': 0.0089, 'grad_norm': 10.145411746970272, 'learning_rate': 2.0461969202053196e-07, 'completion_length': 334.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.04464286006987095, 'kl': 0.221923828125, 'epoch': 0.8}
80% | 3410/4286 [25:42:07<6:54:21, 28.38s/it] {'loss': 0.006, 'grad_norm': 38.35442485260553, 'learning_rate': 2.043863742417172e-07, 'completion_length': 318.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.6636905670166016, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.017857147380709648, 'kl': 0.1484375, 'epoch': 0.8}
80% | 3411/4286 [25:42:32<6:37:36, 27.26s/it] {'loss': 0.0024, 'grad_norm': 4.236407513689632, 'learning_rate': 2.0415305646290245e-07, 'completion_length': 274.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244048953056335, 'reward_std': 0.0535714328289032, 'kl': 0.059326171875, 'epoch': 0.8}
80% | 3412/4286 [25:42:58<6:32:05, 26.92s/it] {'loss': 0.0189, 'grad_norm': 2.482803322155078, 'learning_rate': 2.039197386840877e-07, 'completion_length': 327.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.04577308148145676, 'kl': 0.474853515625, 'epoch': 0.8}
80% | 3413/4286 [25:43:23<6:23:36, 26.37s/it] {'loss': 0.0072, 'grad_norm': 6.605910953108403, 'learning_rate': 2.0368642090527298e-07, 'completion_length': 327.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.8050596117973328, 'rewards/format_reward': 1.0, 'reward': 1.8050596117973328, 'reward_std': 0.026785715483129025, 'kl': 0.180419921875, 'epoch': 0.8}
80% | 3414/4286 [25:43:47<6:13:10, 25.68s/it] {'loss': 0.0128, 'grad_norm': 3.9461110952043525, 'learning_rate': 2.0345310312645823e-07, 'completion_length': 282.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.8434523642063141, 'rewards/format_reward': 1.0, 'reward': 1.8434525728225708, 'reward_std': 0.01785714365541935, 'kl': 0.3203125, 'epoch': 0.8}
80% | 3415/4286 [25:44:13<6:14:48, 25.82s/it] {'loss': 0.0084, 'grad_norm': 1.9110667250924436, 'learning_rate': 2.0321978534764347e-07, 'completion_length': 298.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.8363096117973328, 'rewards/format_reward': 1.0, 'reward': 1.8363096117973328, 'reward_std': 0.06547619216144085, 'kl': 0.20849609375, 'epoch': 0.8}
80% | 3416/4286 [25:44:38<6:10:16, 25.54s/it] {'loss': 0.0122, 'grad_norm': 10.064016243011674, 'learning_rate': 2.0298646756882872e-07, 'completion_length': 304.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.7574405670166016, 'rewards/format_reward': 1.0, 'reward': 1.7574405670166016, 'reward_std': 0.0267857164144516, 'kl': 0.305419921875, 'epoch': 0.8}
80% | 3417/4286 [25:45:04<6:13:32, 25.79s/it] {'loss': 0.0018, 'grad_norm': 16.58128099763041, 'learning_rate': 2.0275314979001397e-07, 'completion_length': 341.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.0590964462608099, 'kl': 0.046142578125, 'epoch': 0.8}
80% | 3418/4286 [25:45:28<6:05:09, 25.24s/it] {'loss': 0.004, 'grad_norm': 20.035899981314415, 'learning_rate': 2.0251983201119925e-07, 'completion_length': 313.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7812500596046448, 'rewards/format_reward': 1.0, 'reward': 1.7812500596046448, 'reward_std': 0.04053214192390442, 'kl': 0.099853515625, 'epoch': 0.8}
80% | 3419/4286 [25:45:53<6:04:43, 25.24s/it] {'loss': 0.0033, 'grad_norm': 0.4164562006903814, 'learning_rate': 2.022865142323845e-07, 'completion_length': 311.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.65476194024086, 'rewards/format_reward': 1.0, 'reward': 1.6547620296478271, 'reward_std': 0.0, 'kl': 0.0830078125, 'epoch': 0.8}
80% | 3420/4286 [25:46:19<6:07:34, 25.47s/it] {'loss': 0.006, 'grad_norm': 3.6723521788479863, 'learning_rate': 2.0205319645356974e-07, 'completion_length': 322.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.7633929252624512, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.07895872741937637, 'kl': 0.1492919921875, 'epoch': 0.8}
80% | 3421/4286 [25:46:44<6:01:22, 25.07s/it] {'loss': 0.0033, 'grad_norm': 16.451485363355612, 'learning_rate': 2.01819878674755e-07, 'completion_length': 307.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7366072237491608, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.02770654857158661, 'kl': 0.0833740234375, 'epoch': 0.8}
80% | 3422/4286 [25:47:08<5:59:27, 24.96s/it] {'loss': 0.0238, 'grad_norm': 3.717703045045775, 'learning_rate': 2.0158656089594027e-07, 'completion_length': 300.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.8705357611179352, 'rewards/format_reward': 1.0, 'reward': 1.8705357909202576, 'reward_std': 0.023508869111537933, 'kl': 0.595703125, 'epoch': 0.8}
80% | 3423/4286 [25:47:33<5:59:01, 24.96s/it] {'loss': 0.0032, 'grad_norm': 0.9153073934830591, 'learning_rate': 2.0135324311712552e-07, 'completion_length': 308.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7767857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.01785714365541935, 'kl': 0.08074951171875, 'epoch': 0.8}
80% | 3424/4286 [25:47:58<5:56:54, 24.84s/it] {'loss': 0.0024, 'grad_norm': 3.604796519997077, 'learning_rate': 2.0111992533831077e-07, 'completion_length': 316.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.04602411389350891, 'kl': 0.0606689453125, 'epoch': 0.8}
80% | 3425/4286 [25:48:24<6:00:53, 25.15s/it] {'loss': 0.0118, 'grad_norm': 11.336654504573422, 'learning_rate': 2.0088660755949601e-07, 'completion_length': 308.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8839285969734192, 'rewards/format_reward': 1.0, 'reward': 1.883928656578064, 'reward_std': 0.030929479748010635, 'kl': 0.29595947265625, 'epoch': 0.8}
80% | 3426/4286 [25:48:48<5:58:49, 25.03s/it] {'loss': 0.0118, 'grad_norm': 16.80021073769207, 'learning_rate': 2.0065328978068126e-07, 'completion_length': 324.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.08766331523656845, 'kl': 0.29638671875, 'epoch': 0.8}
80% | 3427/4286 [25:49:13<5:54:35, 24.77s/it] {'loss': 0.0038, 'grad_norm': 12.161421188398096, 'learning_rate': 2.0041997200186654e-07, 'completion_length': 293.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.041029577143490314, 'kl': 0.095947265625, 'epoch': 0.8}
80% | 3428/4286 [25:49:36<5:46:55, 24.26s/it] {'loss': 0.0022, 'grad_norm': 4.869985415153855, 'learning_rate': 2.0018665422305179e-07, 'completion_length': 277.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8050595819950104, 'rewards/format_reward': 1.0, 'reward': 1.8050596714019775, 'reward_std': 0.02267500851303339, 'kl': 0.0537109375, 'epoch': 0.8}
80% | 3429/4286 [25:50:01<5:50:17, 24.52s/it] {'loss': 0.0025, 'grad_norm': 9.229739533950786, 'learning_rate': 1.9995333644423704e-07, 'completion_length': 324.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7648809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.029761907644569874, 'kl': 0.063232421875, 'epoch': 0.8}
80% | 3430/4286 [25:50:26<5:52:35, 24.71s/it] {'loss': 0.0109, 'grad_norm': 7.30713926068766, 'learning_rate': 1.9972001866542228e-07, 'completion_length': 284.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.68601194024086, 'rewards/format_reward': 1.0, 'reward': 1.6860120296478271, 'reward_std': 0.07472597993910313, 'kl': 0.273193359375, 'epoch': 0.8}
80% | 3431/4286 [25:50:53<6:01:52, 25.40s/it] {'loss': 0.0239, 'grad_norm': 6.025558446721659, 'learning_rate': 1.9948670088660753e-07, 'completion_length': 335.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7062925398349762, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6884354948997498, 'reward_std': 0.0978810815140605, 'kl': 0.5986328125, 'epoch': 0.8}
80% | 3432/4286 [25:51:18<5:58:58, 25.22s/it] {'loss': 0.0093, 'grad_norm': 2.9461798774850734, 'learning_rate': 1.992533831077928e-07, 'completion_length': 295.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 1.0, 'reward': 1.7574405670166016, 'reward_std': 0.039858050644397736, 'kl': 0.23388671875, 'epoch': 0.8}
80% | 3433/4286 [25:51:44<6:01:21, 25.42s/it] {'loss': 0.0025, 'grad_norm': 5.545655649314308, 'learning_rate': 1.9902006532897806e-07, 'completion_length': 286.875, 'rewards/only_full_func_accuracy_reward': 0.7886905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7886905670166016, 'reward_std': 0.01785714365541935, 'kl': 0.0623779296875, 'epoch': 0.8}
80% | 3434/4286 [25:52:09<6:01:56, 25.49s/it] {'loss': 0.0082, 'grad_norm': 5.484729850182524, 'learning_rate': 1.987867475501633e-07, 'completion_length': 295.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.05449227150529623, 'kl': 0.205078125, 'epoch': 0.8}
80% | 3435/4286 [25:52:34<5:59:37, 25.36s/it] {'loss': 0.0184, 'grad_norm': 3.0555863283763576, 'learning_rate': 1.9855342977134855e-07, 'completion_length': 312.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7008928656578064, 'rewards/format_reward': 1.0, 'reward': 1.700892984867096, 'reward_std': 0.044642859138548374, 'kl': 0.4599609375, 'epoch': 0.8}
80% | 3436/4286 [25:52:59<5:55:36, 25.10s/it] {'loss': 0.0071, 'grad_norm': 0.7625633298623601, 'learning_rate': 1.9832011199253383e-07, 'completion_length': 299.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.8273809850215912, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.01877797581255436, 'kl': 0.17724609375, 'epoch': 0.8}
80% | 3437/4286 [25:53:23<5:52:47, 24.93s/it] {'loss': 0.0029, 'grad_norm': 1.6040569364321267, 'learning_rate': 1.9808679421371908e-07, 'completion_length': 301.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7282738387584686, 'rewards/format_reward': 1.0, 'reward': 1.7282739281654358, 'reward_std': 0.0355296591296792, 'kl': 0.07275390625, 'epoch': 0.8}
80% | 3438/4286 [25:53:50<5:59:22, 25.43s/it] {'loss': 0.0134, 'grad_norm': 2.437100525924962, 'learning_rate': 1.9785347643490433e-07, 'completion_length': 318.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803572535514832, 'reward_std': 0.08928571827709675, 'kl': 0.3349609375, 'epoch': 0.8}
80% | 3439/4286 [25:54:15<5:57:14, 25.31s/it] {'loss': 0.0093, 'grad_norm': 10.982645178788802, 'learning_rate': 1.9762015865608958e-07, 'completion_length': 294.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.7494047582149506, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7315477132797241, 'reward_std': 0.14284301549196243, 'kl': 0.231689453125, 'epoch': 0.8}
80% | 3440/4286 [25:54:40<5:55:43, 25.23s/it] {'loss': 0.0034, 'grad_norm': 1.6139079963074738, 'learning_rate': 1.9738684087727482e-07, 'completion_length': 307.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7366072237491608, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.03114316239953041, 'kl': 0.085693359375, 'epoch': 0.8}
80% | 3441/4286 [25:55:04<5:49:21, 24.81s/it] {'loss': 0.0127, 'grad_norm': 1.3442706926193642, 'learning_rate': 1.971535230984601e-07, 'completion_length': 275.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.7291666567325592, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.08173839747905731, 'kl': 0.319091796875, 'epoch': 0.8}
80% | 3442/4286 [25:55:28<5:46:59, 24.67s/it] {'loss': 0.0101, 'grad_norm': 4.291960712023844, 'learning_rate': 1.9692020531964535e-07, 'completion_length': 274.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.7470238506793976, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.06085866875946522, 'kl': 0.25244140625, 'epoch': 0.8}
80% | 3443/4286 [25:55:53<5:46:19, 24.65s/it] {'loss': 0.0066, 'grad_norm': 2.3294419855718784, 'learning_rate': 1.966868875408306e-07, 'completion_length': 331.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485119700431824, 'reward_std': 0.02862738911062479, 'kl': 0.164794921875, 'epoch': 0.8}
80% | 3444/4286 [25:56:18<5:48:03, 24.80s/it] {'loss': 0.0042, 'grad_norm': 5.000963471999347, 'learning_rate': 1.9645356976201585e-07, 'completion_length': 283.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7366071939468384, 'rewards/format_reward': 1.0, 'reward': 1.7366072535514832, 'reward_std': 0.05281120538711548, 'kl': 0.104736328125, 'epoch': 0.8}
80% | 3445/4286 [25:56:42<5:44:16, 24.56s/it] {'loss': 0.0101, 'grad_norm': 13.213522428791498, 'learning_rate': 1.9622025198320112e-07, 'completion_length': 272.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857144474983215, 'reward_std': 0.04627085104584694, 'kl': 0.251708984375, 'epoch': 0.8}
80% | 3446/4286 [25:57:07<5:45:19, 24.67s/it] {'loss': 0.0018, 'grad_norm': 0.7460026295088502, 'learning_rate': 1.9598693420438637e-07, 'completion_length': 287.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.9107143580913544, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.0, 'kl': 0.0457763671875, 'epoch': 0.8}
80% | 3447/4286 [25:57:32<5:47:03, 24.82s/it] {'loss': 0.0015, 'grad_norm': 1.7109407794393243, 'learning_rate': 1.9575361642557162e-07, 'completion_length': 274.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.8095238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.011904759332537651, 'kl': 0.0377197265625, 'epoch': 0.8}
80% | 3448/4286 [25:57:57<5:48:18, 24.94s/it] {'loss': 0.0061, 'grad_norm': 2.220024865209498, 'learning_rate': 1.9552029864675687e-07, 'completion_length': 274.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7857143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.03793024457991123, 'kl': 0.15283203125, 'epoch': 0.8}
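The learning_rate column decays by 1e-06/4286, roughly 2.3332e-10 per step, consistent with a linear schedule from a 1e-06 peak down to zero at step 4286; this is inferred from the logged values, not stated anywhere in the log. A self-contained two-point check in Python:

# Inferred schedule: lr(step) = 1e-06 * (4286 - step) / 4286
for step, lr in [(3376, 2.1231917872141856e-07), (3545, 1.7288847410172654e-07)]:
    assert abs(1e-06 * (4286 - step) / 4286 - lr) < 1e-12, (step, lr)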
80% | 3449/4286 [25:58:22<5:45:51, 24.79s/it] {'loss': 0.01, 'grad_norm': 3.4959951346501588, 'learning_rate': 1.9528698086794211e-07, 'completion_length': 285.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7425596117973328, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.0744047649204731, 'kl': 0.24951171875, 'epoch': 0.8}
80% | 3450/4286 [25:58:46<5:43:04, 24.62s/it] {'loss': 0.012, 'grad_norm': 2.775113220042848, 'learning_rate': 1.950536630891274e-07, 'completion_length': 272.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7470238506793976, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.0178571417927742, 'kl': 0.30029296875, 'epoch': 0.8}
81% | 3451/4286 [25:59:09<5:35:56, 24.14s/it] {'loss': 0.0155, 'grad_norm': 14.586013428942211, 'learning_rate': 1.9482034531031264e-07, 'completion_length': 269.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6651786267757416, 'rewards/format_reward': 1.0, 'reward': 1.6651787161827087, 'reward_std': 0.05108621157705784, 'kl': 0.385009765625, 'epoch': 0.81}
81% | 3452/4286 [25:59:34<5:39:48, 24.45s/it] {'loss': 0.0186, 'grad_norm': 5.5238943914565235, 'learning_rate': 1.945870275314979e-07, 'completion_length': 299.375, 'rewards/only_full_func_accuracy_reward': 0.821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8214287161827087, 'reward_std': 0.03361252695322037, 'kl': 0.46435546875, 'epoch': 0.81}
81% | 3453/4286 [25:59:59<5:42:36, 24.68s/it] {'loss': 0.0138, 'grad_norm': 6.1671038391285, 'learning_rate': 1.9435370975268314e-07, 'completion_length': 309.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.691964328289032, 'rewards/format_reward': 1.0, 'reward': 1.691964328289032, 'reward_std': 0.038690474815666676, 'kl': 0.3466796875, 'epoch': 0.81}
81% | 3454/4286 [26:00:23<5:36:15, 24.25s/it] {'loss': 0.0032, 'grad_norm': 0.5065287647353826, 'learning_rate': 1.9412039197386838e-07, 'completion_length': 272.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.760416716337204, 'rewards/format_reward': 1.0, 'reward': 1.7604167461395264, 'reward_std': 0.020833331160247326, 'kl': 0.079833984375, 'epoch': 0.81}
81% | 3455/4286 [26:00:48<5:38:48, 24.46s/it] {'loss': 0.0187, 'grad_norm': 4.155719285527756, 'learning_rate': 1.9388707419505366e-07, 'completion_length': 274.57144927978516, 'rewards/only_full_func_accuracy_reward': 0.7425596117973328, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.03426117356866598, 'kl': 0.466796875, 'epoch': 0.81}
81% | 3456/4286 [26:01:12<5:38:11, 24.45s/it] {'loss': 0.004, 'grad_norm': 2.219068194288346, 'learning_rate': 1.936537564162389e-07, 'completion_length': 304.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.8065476715564728, 'rewards/format_reward': 1.0, 'reward': 1.80654776096344, 'reward_std': 0.03160357475280762, 'kl': 0.1009521484375, 'epoch': 0.81}
81% | 3457/4286 [26:01:35<5:31:35, 24.00s/it] {'loss': 0.0034, 'grad_norm': 3.8069148640147774, 'learning_rate': 1.9342043863742416e-07, 'completion_length': 260.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7380953133106232, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.02380952052772045, 'kl': 0.0849609375, 'epoch': 0.81}
81% | 3458/4286 [26:02:01<5:41:06, 24.72s/it] {'loss': 0.0059, 'grad_norm': 3.1403025941771934, 'learning_rate': 1.931871208586094e-07, 'completion_length': 313.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7574405670166016, 'reward_std': 0.14252865687012672, 'kl': 0.148193359375, 'epoch': 0.81}
81% | 3459/4286 [26:02:28<5:48:01, 25.25s/it] {'loss': 0.0038, 'grad_norm': 11.76454380148235, 'learning_rate': 1.9295380307979468e-07, 'completion_length': 346.8214569091797, 'rewards/only_full_func_accuracy_reward': 0.8392857909202576, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.02380952425301075, 'kl': 0.0943603515625, 'epoch': 0.81}
81% | 3460/4286 [26:02:52<5:43:50, 24.98s/it] {'loss': 0.0021, 'grad_norm': 8.54386703491182, 'learning_rate': 1.9272048530097993e-07, 'completion_length': 284.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8437500298023224, 'rewards/format_reward': 1.0, 'reward': 1.8437501192092896, 'reward_std': 0.039858050644397736, 'kl': 0.0517578125, 'epoch': 0.81}
81% | 3461/4286 [26:03:16<5:36:48, 24.50s/it] {'loss': 0.0051, 'grad_norm': 4.569681055755423, 'learning_rate': 1.9248716752216518e-07, 'completion_length': 277.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6914682686328888, 'rewards/format_reward': 1.0, 'reward': 1.691468358039856, 'reward_std': 0.07380189001560211, 'kl': 0.1282958984375, 'epoch': 0.81}
81% | 3462/4286 [26:03:41<5:39:20, 24.71s/it] {'loss': 0.0024, 'grad_norm': 8.004502361149841, 'learning_rate': 1.9225384974335043e-07, 'completion_length': 300.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7901785969734192, 'rewards/format_reward': 1.0, 'reward': 1.790178656578064, 'reward_std': 0.03298483043909073, 'kl': 0.058837890625, 'epoch': 0.81}
81% | 3463/4286 [26:04:05<5:38:04, 24.65s/it] {'loss': 0.0035, 'grad_norm': 2.940972916557054, 'learning_rate': 1.9202053196453568e-07, 'completion_length': 304.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7187500596046448, 'rewards/format_reward': 1.0, 'reward': 1.7187501192092896, 'reward_std': 0.044642859138548374, 'kl': 0.0863037109375, 'epoch': 0.81}
81% | 3464/4286 [26:04:30<5:37:39, 24.65s/it] {'loss': 0.003, 'grad_norm': 0.4813850538402596, 'learning_rate': 1.9178721418572095e-07, 'completion_length': 304.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.8422619700431824, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.01785714365541935, 'kl': 0.07568359375, 'epoch': 0.81}
81% | 3465/4286 [26:04:55<5:38:21, 24.73s/it] {'loss': 0.0102, 'grad_norm': 11.304067382797733, 'learning_rate': 1.915538964069062e-07, 'completion_length': 305.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.07373067177832127, 'kl': 0.2548828125, 'epoch': 0.81}
81% | 3466/4286 [26:05:19<5:34:55, 24.51s/it] {'loss': 0.0107, 'grad_norm': 5.389176074320964, 'learning_rate': 1.9132057862809145e-07, 'completion_length': 271.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.693452537059784, 'reward_std': 0.06983364000916481, 'kl': 0.266845703125, 'epoch': 0.81}
81% | 3467/4286 [26:05:42<5:28:28, 24.06s/it] {'loss': 0.0047, 'grad_norm': 0.6784730520672371, 'learning_rate': 1.910872608492767e-07, 'completion_length': 261.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291667461395264, 'reward_std': 0.01785714365541935, 'kl': 0.117431640625, 'epoch': 0.81}
81% | 3468/4286 [26:06:07<5:31:16, 24.30s/it] {'loss': 0.0057, 'grad_norm': 1.5160404499156612, 'learning_rate': 1.9085394307046197e-07, 'completion_length': 288.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.8556548357009888, 'rewards/format_reward': 1.0, 'reward': 1.8556548953056335, 'reward_std': 0.029001673683524132, 'kl': 0.141357421875, 'epoch': 0.81}
81% | 3469/4286 [26:06:31<5:31:10, 24.32s/it] {'loss': 0.0071, 'grad_norm': 2.7263122053323268, 'learning_rate': 1.9062062529164722e-07, 'completion_length': 297.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7619048655033112, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.027492865920066833, 'kl': 0.1787109375, 'epoch': 0.81}
81% | 3470/4286 [26:06:56<5:31:49, 24.40s/it] {'loss': 0.0052, 'grad_norm': 2.8103940949373025, 'learning_rate': 1.9038730751283247e-07, 'completion_length': 306.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6934524774551392, 'reward_std': 0.038476794958114624, 'kl': 0.13037109375, 'epoch': 0.81}
81% | 3471/4286 [26:07:20<5:30:14, 24.31s/it] {'loss': 0.0028, 'grad_norm': 5.7236568801207035, 'learning_rate': 1.9015398973401772e-07, 'completion_length': 324.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8366072177886963, 'rewards/format_reward': 1.0, 'reward': 1.836607277393341, 'reward_std': 0.06561701186001301, 'kl': 0.0701904296875, 'epoch': 0.81}
81% | 3472/4286 [26:07:44<5:30:39, 24.37s/it] {'loss': 0.01, 'grad_norm': 2.606781590561984, 'learning_rate': 1.8992067195520297e-07, 'completion_length': 303.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7672619521617889, 'rewards/format_reward': 1.0, 'reward': 1.767262041568756, 'reward_std': 0.028571427799761295, 'kl': 0.25018310546875, 'epoch': 0.81}
81% | 3473/4286 [26:08:08<5:29:30, 24.32s/it] {'loss': 0.0069, 'grad_norm': 2.0435690379141054, 'learning_rate': 1.8968735417638824e-07, 'completion_length': 287.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.01785714365541935, 'kl': 0.17333984375, 'epoch': 0.81}
81% | 3474/4286 [26:08:33<5:28:00, 24.24s/it] {'loss': 0.0052, 'grad_norm': 4.13242865210171, 'learning_rate': 1.894540363975735e-07, 'completion_length': 251.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.05059523694217205, 'kl': 0.129638671875, 'epoch': 0.81}
81% | 3475/4286 [26:08:56<5:24:54, 24.04s/it] {'loss': 0.0071, 'grad_norm': 3.03574433073161, 'learning_rate': 1.8922071861875874e-07, 'completion_length': 292.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8095238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.10234555136412382, 'kl': 0.177978515625, 'epoch': 0.81}
81% | 3476/4286 [26:09:20<5:25:47, 24.13s/it] {'loss': 0.0065, 'grad_norm': 1.1665512386967454, 'learning_rate': 1.88987400839944e-07, 'completion_length': 308.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6383928954601288, 'rewards/format_reward': 1.0, 'reward': 1.6383929252624512, 'reward_std': 0.0208333358168602, 'kl': 0.161865234375, 'epoch': 0.81}
81% | 3477/4286 [26:09:47<5:33:42, 24.75s/it] {'loss': 0.0057, 'grad_norm': 2.327432654875757, 'learning_rate': 1.8875408306112924e-07, 'completion_length': 325.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.891369104385376, 'rewards/format_reward': 1.0, 'reward': 1.891369104385376, 'reward_std': 0.05016787815839052, 'kl': 0.140625, 'epoch': 0.81}
81% | 3478/4286 [26:10:13<5:40:52, 25.31s/it] {'loss': 0.0016, 'grad_norm': 2.313086626711201, 'learning_rate': 1.885207652823145e-07, 'completion_length': 322.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.05084197595715523, 'kl': 0.040283203125, 'epoch': 0.81}
81% | 3479/4286 [26:10:38<5:39:26, 25.24s/it] {'loss': 0.0031, 'grad_norm': 6.828506660226298, 'learning_rate': 1.8828744750349976e-07, 'completion_length': 323.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.78125, 'rewards/format_reward': 1.0, 'reward': 1.7812501788139343, 'reward_std': 0.05833622068166733, 'kl': 0.078857421875, 'epoch': 0.81}
81% | 3480/4286 [26:11:03<5:38:01, 25.16s/it] {'loss': 0.0077, 'grad_norm': 4.767765317967624, 'learning_rate': 1.88054129724685e-07, 'completion_length': 304.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.023049291223287582, 'kl': 0.19256591796875, 'epoch': 0.81}
81% | 3481/4286 [26:11:27<5:30:31, 24.64s/it] {'loss': 0.0048, 'grad_norm': 20.229043562220227, 'learning_rate': 1.8782081194587026e-07, 'completion_length': 304.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7654763162136078, 'rewards/format_reward': 1.0, 'reward': 1.7654762864112854, 'reward_std': 0.05422262125648558, 'kl': 0.11962890625, 'epoch': 0.81}
81% | 3482/4286 [26:11:50<5:26:19, 24.35s/it] {'loss': 0.0091, 'grad_norm': 4.484101186859583, 'learning_rate': 1.8758749416705553e-07, 'completion_length': 285.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.726190447807312, 'rewards/format_reward': 1.0, 'reward': 1.7261906862258911, 'reward_std': 0.03755595721304417, 'kl': 0.229248046875, 'epoch': 0.81}
81% | 3483/4286 [26:12:14<5:23:00, 24.14s/it] {'loss': 0.0019, 'grad_norm': 6.699564254624525, 'learning_rate': 1.8735417638824078e-07, 'completion_length': 271.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.011904759332537651, 'kl': 0.046630859375, 'epoch': 0.81}
81% | 3484/4286 [26:12:39<5:26:48, 24.45s/it] {'loss': 0.0047, 'grad_norm': 6.43577805802382, 'learning_rate': 1.8712085860942603e-07, 'completion_length': 305.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.8202381134033203, 'rewards/format_reward': 1.0, 'reward': 1.8202382326126099, 'reward_std': 0.017737697809934616, 'kl': 0.1182861328125, 'epoch': 0.81}
81% | 3485/4286 [26:13:04<5:26:00, 24.42s/it] {'loss': 0.0229, 'grad_norm': 6.776034434238246, 'learning_rate': 1.8688754083061128e-07, 'completion_length': 310.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.863839328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8459821939468384, 'reward_std': 0.08310605213046074, 'kl': 0.572265625, 'epoch': 0.81}
81% | 3486/4286 [26:13:29<5:29:41, 24.73s/it] {'loss': 0.0028, 'grad_norm': 4.812805024434321, 'learning_rate': 1.8665422305179653e-07, 'completion_length': 308.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8812500536441803, 'rewards/format_reward': 1.0, 'reward': 1.8812500834465027, 'reward_std': 0.053391036577522755, 'kl': 0.06884765625, 'epoch': 0.81}
81% | 3487/4286 [26:13:54<5:30:51, 24.84s/it] {'loss': 0.0046, 'grad_norm': 3.7141741331552485, 'learning_rate': 1.864209052729818e-07, 'completion_length': 314.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.854166716337204, 'rewards/format_reward': 1.0, 'reward': 1.8541668057441711, 'reward_std': 0.02908780612051487, 'kl': 0.115234375, 'epoch': 0.81}
81% | 3488/4286 [26:14:17<5:23:17, 24.31s/it] {'loss': 0.0101, 'grad_norm': 6.49746131815233, 'learning_rate': 1.8618758749416705e-07, 'completion_length': 249.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.008928571827709675, 'kl': 0.25146484375, 'epoch': 0.81}
81% | 3489/4286 [26:14:41<5:21:05, 24.17s/it] {'loss': 0.0142, 'grad_norm': 4.060561268122815, 'learning_rate': 1.859542697153523e-07, 'completion_length': 300.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 1.0, 'reward': 1.8020834922790527, 'reward_std': 0.07932150177657604, 'kl': 0.354248046875, 'epoch': 0.81}
81% | 3490/4286 [26:15:05<5:19:23, 24.07s/it] {'loss': 0.0081, 'grad_norm': 1.6770108553761434, 'learning_rate': 1.8572095193653755e-07, 'completion_length': 300.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6964287161827087, 'reward_std': 0.04123930260539055, 'kl': 0.20361328125, 'epoch': 0.81}
81% | 3491/4286 [26:15:29<5:20:44, 24.21s/it] {'loss': 0.0036, 'grad_norm': 3.4430906961803798, 'learning_rate': 1.854876341577228e-07, 'completion_length': 299.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7500000894069672, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.025651196017861366, 'kl': 0.090087890625, 'epoch': 0.81}
81% | 3492/4286 [26:15:55<5:26:11, 24.65s/it] {'loss': 0.0027, 'grad_norm': 2.3926364775079176, 'learning_rate': 1.8525431637890807e-07, 'completion_length': 292.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.05222322978079319, 'kl': 0.0687255859375, 'epoch': 0.81}
81% | 3493/4286 [26:16:19<5:24:27, 24.55s/it] {'loss': 0.0184, 'grad_norm': 2.619893391167527, 'learning_rate': 1.8502099860009332e-07, 'completion_length': 313.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.026785715483129025, 'kl': 0.4609375, 'epoch': 0.81}
82% | 3494/4286 [26:16:43<5:21:56, 24.39s/it] {'loss': 0.0043, 'grad_norm': 1.8023285346663813, 'learning_rate': 1.8478768082127857e-07, 'completion_length': 322.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6116071492433548, 'rewards/format_reward': 1.0, 'reward': 1.6116072535514832, 'reward_std': 0.07992978021502495, 'kl': 0.108154296875, 'epoch': 0.82}
82% | 3495/4286 [26:17:07<5:17:28, 24.08s/it] {'loss': 0.0061, 'grad_norm': 7.988569275158162, 'learning_rate': 1.8455436304246382e-07, 'completion_length': 249.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.8452381193637848, 'rewards/format_reward': 1.0, 'reward': 1.8452381491661072, 'reward_std': 0.0357142873108387, 'kl': 0.1514892578125, 'epoch': 0.82}
82% | 3496/4286 [26:17:32<5:23:12, 24.55s/it] {'loss': 0.0125, 'grad_norm': 0.8027879625598429, 'learning_rate': 1.843210452636491e-07, 'completion_length': 291.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.005952381528913975, 'kl': 0.3115234375, 'epoch': 0.82}
82% | 3497/4286 [26:17:56<5:20:27, 24.37s/it] {'loss': 0.0072, 'grad_norm': 2.7819313479482344, 'learning_rate': 1.8408772748483434e-07, 'completion_length': 244.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.7232144474983215, 'reward_std': 0.02405625954270363, 'kl': 0.1806640625, 'epoch': 0.82}
3497/4286 [26:17:56<5:20:27, 24.37s/it] 82%|████████▏ | 3498/4286 [26:18:20<5:15:45, 24.04s/it] {'loss': 0.0208, 'grad_norm': 5.8304358147261555, 'learning_rate': 1.838544097060196e-07, 'completion_length': 264.2143020629883, 'rewards/only_full_func_accuracy_reward': 0.8110119700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7931548357009888, 'reward_std': 0.06323832273483276, 'kl': 0.519287109375, 'epoch': 0.82} 82%|████████▏ | 3498/4286 [26:18:20<5:15:45, 24.04s/it] 82%|████████▏ | 3499/4286 [26:18:45<5:21:39, 24.52s/it] {'loss': 0.0047, 'grad_norm': 4.4870982904611845, 'learning_rate': 1.8362109192720484e-07, 'completion_length': 294.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.732142984867096, 'reward_std': 0.05197649449110031, 'kl': 0.1181640625, 'epoch': 0.82} 82%|████████▏ | 3499/4286 [26:18:45<5:21:39, 24.52s/it] 82%|████████▏ | 3500/4286 [26:19:09<5:17:53, 24.27s/it] {'loss': 0.0022, 'grad_norm': 1.085406643274091, 'learning_rate': 1.833877741483901e-07, 'completion_length': 284.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7633929252624512, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.020833331160247326, 'kl': 0.0548095703125, 'epoch': 0.82} 82%|████████▏ | 3500/4286 [26:19:09<5:17:53, 24.27s/it] 82%|████████▏ | 3501/4286 [26:23:35<21:07:27, 96.88s/it] {'loss': 0.0049, 'grad_norm': 5.897441734577479, 'learning_rate': 1.8315445636957536e-07, 'completion_length': 313.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321430444717407, 'reward_std': 0.041666665114462376, 'kl': 0.123046875, 'epoch': 0.82} 82%|████████▏ | 3501/4286 [26:23:35<21:07:27, 96.88s/it] 82%|████████▏ | 3502/4286 [26:24:00<16:24:09, 75.32s/it] {'loss': 0.0047, 'grad_norm': 2.499966230002022, 'learning_rate': 1.829211385907606e-07, 'completion_length': 288.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.11767578125, 'epoch': 0.82} 82%|████████▏ | 3502/4286 [26:24:00<16:24:09, 75.32s/it] 82%|████████▏ | 3503/4286 [26:24:25<13:03:23, 60.03s/it] {'loss': 0.011, 'grad_norm': 5.047606411962919, 'learning_rate': 1.8268782081194586e-07, 'completion_length': 301.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.04710555821657181, 'kl': 0.27490234375, 'epoch': 0.82} 82%|████████▏ | 3503/4286 [26:24:25<13:03:23, 60.03s/it] 82%|████████▏ | 3504/4286 [26:24:49<10:42:56, 49.33s/it] {'loss': 0.0146, 'grad_norm': 4.0413175068925735, 'learning_rate': 1.824545030331311e-07, 'completion_length': 301.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160714626312256, 'reward_std': 0.06923934817314148, 'kl': 0.365234375, 'epoch': 0.82} 82%|████████▏ | 3504/4286 [26:24:49<10:42:56, 49.33s/it] 82%|████████▏ | 3505/4286 [26:25:15<9:10:18, 42.28s/it] {'loss': 0.0078, 'grad_norm': 9.947623604589312, 'learning_rate': 1.8222118525431639e-07, 'completion_length': 318.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6889881491661072, 'reward_std': 0.08586649224162102, 'kl': 0.19580078125, 'epoch': 0.82} 82%|████████▏ | 3505/4286 
[26:25:15<9:10:18, 42.28s/it] 82%|████████▏ | 3506/4286 [26:25:40<8:01:55, 37.07s/it] {'loss': 0.0033, 'grad_norm': 7.257308676075804, 'learning_rate': 1.8198786747550163e-07, 'completion_length': 315.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.78125, 'rewards/format_reward': 1.0, 'reward': 1.7812500596046448, 'reward_std': 0.03273809468373656, 'kl': 0.0816650390625, 'epoch': 0.82} 82%|████████▏ | 3506/4286 [26:25:40<8:01:55, 37.07s/it] 82%|████████▏ | 3507/4286 [26:26:04<7:10:42, 33.17s/it] {'loss': 0.0082, 'grad_norm': 5.101645247703454, 'learning_rate': 1.8175454969668688e-07, 'completion_length': 296.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 1.0, 'reward': 1.6502977013587952, 'reward_std': 0.08239656500518322, 'kl': 0.2060546875, 'epoch': 0.82} 82%|████████▏ | 3507/4286 [26:26:04<7:10:42, 33.17s/it] 82%|████████▏ | 3508/4286 [26:26:29<6:40:03, 30.85s/it] {'loss': 0.0086, 'grad_norm': 27.191158171868782, 'learning_rate': 1.8152123191787213e-07, 'completion_length': 294.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6937500536441803, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6758930087089539, 'reward_std': 0.10059523954987526, 'kl': 0.2158203125, 'epoch': 0.82} 82%|████████▏ | 3508/4286 [26:26:29<6:40:03, 30.85s/it] 82%|████████▏ | 3509/4286 [26:26:55<6:17:44, 29.17s/it] {'loss': 0.0061, 'grad_norm': 9.014384179431074, 'learning_rate': 1.8128791413905738e-07, 'completion_length': 323.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.816964328289032, 'rewards/format_reward': 1.0, 'reward': 1.8169643878936768, 'reward_std': 0.03869048040360212, 'kl': 0.15185546875, 'epoch': 0.82} 82%|████████▏ | 3509/4286 [26:26:55<6:17:44, 29.17s/it] 82%|████████▏ | 3510/4286 [26:27:19<5:59:26, 27.79s/it] {'loss': 0.0049, 'grad_norm': 8.869221265576698, 'learning_rate': 1.8105459636024266e-07, 'completion_length': 294.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.8032738566398621, 'rewards/format_reward': 1.0, 'reward': 1.803273856639862, 'reward_std': 0.016616951674222946, 'kl': 0.1220703125, 'epoch': 0.82} 82%|████████▏ | 3510/4286 [26:27:19<5:59:26, 27.79s/it] 82%|████████▏ | 3511/4286 [26:27:43<5:44:30, 26.67s/it] {'loss': 0.0042, 'grad_norm': 4.017106220170027, 'learning_rate': 1.808212785814279e-07, 'completion_length': 307.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6116071939468384, 'rewards/format_reward': 1.0, 'reward': 1.6116071939468384, 'reward_std': 0.05838929209858179, 'kl': 0.10498046875, 'epoch': 0.82} 82%|████████▏ | 3511/4286 [26:27:43<5:44:30, 26.67s/it] 82%|████████▏ | 3512/4286 [26:28:08<5:37:22, 26.15s/it] {'loss': 0.0128, 'grad_norm': 5.475561733389731, 'learning_rate': 1.8058796080261315e-07, 'completion_length': 306.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762387275696, 'reward_std': 0.06823869794607162, 'kl': 0.322021484375, 'epoch': 0.82} 82%|████████▏ | 3512/4286 [26:28:08<5:37:22, 26.15s/it] 82%|████████▏ | 3513/4286 [26:28:32<5:29:36, 25.58s/it] {'loss': 0.0033, 'grad_norm': 0.34478096594929014, 'learning_rate': 1.803546430237984e-07, 'completion_length': 272.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.8988095819950104, 'rewards/format_reward': 1.0, 'reward': 1.8988096117973328, 'reward_std': 0.0, 'kl': 0.083740234375, 'epoch': 0.82} 82%|████████▏ | 3513/4286 [26:28:32<5:29:36, 25.58s/it] 82%|████████▏ | 3514/4286 [26:28:57<5:25:03, 
25.26s/it] {'loss': 0.0242, 'grad_norm': 2.3074611056978562, 'learning_rate': 1.8012132524498365e-07, 'completion_length': 278.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.8166667520999908, 'rewards/format_reward': 1.0, 'reward': 1.8166667819023132, 'reward_std': 0.0620815958827734, 'kl': 0.6064453125, 'epoch': 0.82}
82%|████████▏ | 3515/4286 [26:29:22<5:22:25, 25.09s/it] {'loss': 0.012, 'grad_norm': 2.28249770102034, 'learning_rate': 1.7988800746616892e-07, 'completion_length': 291.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.9002977013587952, 'rewards/format_reward': 1.0, 'reward': 1.90029776096344, 'reward_std': 0.03709553927183151, 'kl': 0.3009033203125, 'epoch': 0.82}
82%|████████▏ | 3516/4286 [26:29:47<5:24:40, 25.30s/it] {'loss': 0.004, 'grad_norm': 17.09381464242838, 'learning_rate': 1.7965468968735417e-07, 'completion_length': 300.64288330078125, 'rewards/only_full_func_accuracy_reward': 0.7991072237491608, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7812501788139343, 'reward_std': 0.08035714365541935, 'kl': 0.101318359375, 'epoch': 0.82}
82%|████████▏ | 3517/4286 [26:30:12<5:20:40, 25.02s/it] {'loss': 0.0357, 'grad_norm': 6.25237852171976, 'learning_rate': 1.7942137190853942e-07, 'completion_length': 295.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6553572118282318, 'rewards/format_reward': 1.0, 'reward': 1.655357301235199, 'reward_std': 0.01888262666761875, 'kl': 0.8966064453125, 'epoch': 0.82}
82%|████████▏ | 3518/4286 [26:30:36<5:15:54, 24.68s/it] {'loss': 0.0057, 'grad_norm': 0.7530025764017901, 'learning_rate': 1.7918805412972467e-07, 'completion_length': 223.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.80952388048172, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.011904759332537651, 'kl': 0.143310546875, 'epoch': 0.82}
82%|████████▏ | 3519/4286 [26:31:03<5:25:18, 25.45s/it] {'loss': 0.004, 'grad_norm': 5.743906018970945, 'learning_rate': 1.7895473635090995e-07, 'completion_length': 325.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.05578738823533058, 'kl': 0.0992431640625, 'epoch': 0.82}
82%|████████▏ | 3520/4286 [26:31:28<5:24:10, 25.39s/it] {'loss': 0.0025, 'grad_norm': 0.6171367102484925, 'learning_rate': 1.787214185720952e-07, 'completion_length': 263.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7633928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7633930444717407, 'reward_std': 0.02267500478774309, 'kl': 0.0616455078125, 'epoch': 0.82}
82%|████████▏ | 3521/4286 [26:31:53<5:22:14, 25.27s/it] {'loss': 0.0213, 'grad_norm': 16.745791678847393, 'learning_rate': 1.7848810079328044e-07, 'completion_length': 308.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.06409339234232903, 'kl': 0.533203125, 'epoch': 0.82}
82%|████████▏ | 3522/4286 [26:32:17<5:18:21, 25.00s/it] {'loss': 0.0066, 'grad_norm': 15.076128185696545, 'learning_rate': 1.782547830144657e-07, 'completion_length': 325.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.8764881491661072, 'rewards/format_reward': 1.0, 'reward': 1.876488208770752, 'reward_std': 0.06090506166219711, 'kl': 0.166015625, 'epoch': 0.82}
82%|████████▏ | 3523/4286 [26:32:42<5:16:53, 24.92s/it] {'loss': 0.0029, 'grad_norm': 0.512855238373255, 'learning_rate': 1.7802146523565094e-07, 'completion_length': 324.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7752976417541504, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.008928571827709675, 'kl': 0.0723876953125, 'epoch': 0.82}
82%|████████▏ | 3524/4286 [26:33:06<5:13:19, 24.67s/it] {'loss': 0.0062, 'grad_norm': 1.6468655765566576, 'learning_rate': 1.7778814745683622e-07, 'completion_length': 273.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.8660714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8660714626312256, 'reward_std': 0.04719169018790126, 'kl': 0.15478515625, 'epoch': 0.82}
82%|████████▏ | 3525/4286 [26:33:32<5:16:37, 24.96s/it] {'loss': 0.0019, 'grad_norm': 0.8382346317255563, 'learning_rate': 1.7755482967802146e-07, 'completion_length': 280.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.691964328289032, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.008928571827709675, 'kl': 0.047607421875, 'epoch': 0.82}
82%|████████▏ | 3526/4286 [26:33:56<5:12:42, 24.69s/it] {'loss': 0.0016, 'grad_norm': 8.420176891471954, 'learning_rate': 1.7732151189920671e-07, 'completion_length': 272.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.6755953133106232, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.05909644812345505, 'kl': 0.0389404296875, 'epoch': 0.82}
82%|████████▏ | 3527/4286 [26:34:19<5:06:01, 24.19s/it] {'loss': 0.0025, 'grad_norm': 8.780967143462348, 'learning_rate': 1.7708819412039196e-07, 'completion_length': 283.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.846726268529892, 'rewards/format_reward': 1.0, 'reward': 1.8467262983322144, 'reward_std': 0.037095533683896065, 'kl': 0.0615234375, 'epoch': 0.82}
82%|████████▏ | 3528/4286 [26:34:43<5:03:13, 24.00s/it] {'loss': 0.0098, 'grad_norm': 0.8665119670621627, 'learning_rate': 1.7685487634157724e-07, 'completion_length': 249.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7767857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.758928656578064, 'reward_std': 0.07419108599424362, 'kl': 0.24462890625, 'epoch': 0.82}
82%|████████▏ | 3529/4286 [26:35:09<5:10:34, 24.62s/it] {'loss': 0.0034, 'grad_norm': 0.37626468313744305, 'learning_rate': 1.7662155856276249e-07, 'completion_length': 329.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7678572237491608, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.0, 'kl': 0.0863037109375, 'epoch': 0.82}
82%|████████▏ | 3530/4286 [26:35:33<5:10:55, 24.68s/it] {'loss': 0.0077, 'grad_norm': 2.6204556974591235, 'learning_rate': 1.7638824078394773e-07, 'completion_length': 309.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.05952381435781717, 'kl': 0.193359375, 'epoch': 0.82}
82%|████████▏ | 3531/4286 [26:35:57<5:04:56, 24.23s/it] {'loss': 0.0034, 'grad_norm': 2.468397786440069, 'learning_rate': 1.7615492300513298e-07, 'completion_length': 289.375, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.07280982844531536, 'kl': 0.08447265625, 'epoch': 0.82}
82%|████████▏ | 3532/4286 [26:36:21<5:03:19, 24.14s/it] {'loss': 0.0023, 'grad_norm': 2.542211248644893, 'learning_rate': 1.7592160522631823e-07, 'completion_length': 269.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.8125000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.03847679682075977, 'kl': 0.056396484375, 'epoch': 0.82}
82%|████████▏ | 3533/4286 [26:36:45<5:02:26, 24.10s/it] {'loss': 0.0067, 'grad_norm': 1.0451235271728574, 'learning_rate': 1.756882874475035e-07, 'completion_length': 289.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.008928571827709675, 'kl': 0.167236328125, 'epoch': 0.82}
82%|████████▏ | 3534/4286 [26:37:10<5:05:33, 24.38s/it] {'loss': 0.0037, 'grad_norm': 8.553230971799724, 'learning_rate': 1.7545496966868876e-07, 'completion_length': 332.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7330357730388641, 'rewards/format_reward': 1.0, 'reward': 1.7330358028411865, 'reward_std': 0.0893416740000248, 'kl': 0.0927734375, 'epoch': 0.82}
82%|████████▏ | 3535/4286 [26:37:34<5:06:52, 24.52s/it] {'loss': 0.0031, 'grad_norm': 9.50720726631272, 'learning_rate': 1.75221651889874e-07, 'completion_length': 306.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.59077388048172, 'rewards/format_reward': 1.0, 'reward': 1.59077388048172, 'reward_std': 0.04387472942471504, 'kl': 0.07763671875, 'epoch': 0.82}
83%|████████▎ | 3536/4286 [26:37:58<5:03:50, 24.31s/it] {'loss': 0.0026, 'grad_norm': 0.7714649931154756, 'learning_rate': 1.7498833411105925e-07, 'completion_length': 252.28573608398438, 'rewards/only_full_func_accuracy_reward': 0.7976191341876984, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.02279588393867016, 'kl': 0.063720703125, 'epoch': 0.83}
83%|████████▎ | 3537/4286 [26:38:22<5:00:35, 24.08s/it] {'loss': 0.007, 'grad_norm': 3.802977282195602, 'learning_rate': 1.747550163322445e-07, 'completion_length': 287.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.0714285746216774, 'kl': 0.17529296875, 'epoch': 0.83}
83%|████████▎ | 3538/4286 [26:38:48<5:06:39, 24.60s/it] {'loss': 0.0056, 'grad_norm': 3.108446738641761, 'learning_rate': 1.7452169855342978e-07, 'completion_length': 310.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8392857015132904, 'rewards/format_reward': 1.0, 'reward': 1.8392859101295471, 'reward_std': 0.02380952052772045, 'kl': 0.1396484375, 'epoch': 0.83}
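Editor's note: every step above follows one fixed schema (a tqdm bar fragment followed by a Python-dict metrics blob), so entries can be recovered programmatically from the raw console dump. Below is a minimal parsing sketch under stated assumptions: it is not part of the original tooling, the file name train.log and the helper parse_log are hypothetical, and the 4286 step total is hard-coded from this run's progress bars. It also skips the duplicate bar redraws that tqdm emits around each dict.

```python
import ast
import re

# One bar fragment followed by a metrics dict, e.g.
# "3515/4286 [26:29:22<5:22:25, 25.09s/it] {'loss': 0.012, ...}".
# 4286 is this run's total step count, taken from the bars above.
ENTRY = re.compile(r"(\d+)/4286 \[[^\]]+\] (\{[^{}]+\})")

def parse_log(path):
    """Yield (step, metrics) pairs from a raw training log dump."""
    text = open(path, encoding="utf-8").read()
    seen = set()
    for match in ENTRY.finditer(text):
        step = int(match.group(1))
        if step in seen:  # tqdm redraws each bar; keep the first occurrence
            continue
        seen.add(step)
        yield step, ast.literal_eval(match.group(2))

if __name__ == "__main__":
    for step, m in parse_log("train.log"):
        print(step, m["loss"], m["reward"])
```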
83%|████████▎ | 3539/4286 [26:39:11<5:03:33, 24.38s/it] {'loss': 0.0021, 'grad_norm': 1.3121570787441947, 'learning_rate': 1.7428838077461503e-07, 'completion_length': 251.53573608398438, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.008928571827709675, 'kl': 0.053466796875, 'epoch': 0.83}
83%|████████▎ | 3540/4286 [26:39:36<5:02:47, 24.35s/it] {'loss': 0.0106, 'grad_norm': 2.7772723025633037, 'learning_rate': 1.7405506299580027e-07, 'completion_length': 293.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.712301641702652, 'rewards/format_reward': 1.0, 'reward': 1.7123016715049744, 'reward_std': 0.036981185898184776, 'kl': 0.2646484375, 'epoch': 0.83}
83%|████████▎ | 3541/4286 [26:39:59<4:59:55, 24.16s/it] {'loss': 0.002, 'grad_norm': 0.2620547976742538, 'learning_rate': 1.7382174521698552e-07, 'completion_length': 284.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 1.0, 'reward': 1.672619104385376, 'reward_std': 0.0, 'kl': 0.05078125, 'epoch': 0.83}
83%|████████▎ | 3542/4286 [26:40:25<5:03:09, 24.45s/it] {'loss': 0.0079, 'grad_norm': 9.518928155500053, 'learning_rate': 1.735884274381708e-07, 'completion_length': 312.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8735119700431824, 'rewards/format_reward': 1.0, 'reward': 1.8735120296478271, 'reward_std': 0.06250000186264515, 'kl': 0.197998046875, 'epoch': 0.83}
83%|████████▎ | 3543/4286 [26:40:47<4:56:26, 23.94s/it] {'loss': 0.0042, 'grad_norm': 12.92371560885577, 'learning_rate': 1.7335510965935605e-07, 'completion_length': 279.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.034579772502183914, 'kl': 0.104248046875, 'epoch': 0.83}
83%|████████▎ | 3544/4286 [26:41:12<4:58:26, 24.13s/it] {'loss': 0.0239, 'grad_norm': 6.400925211910829, 'learning_rate': 1.731217918805413e-07, 'completion_length': 301.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.10100985318422318, 'kl': 0.5986328125, 'epoch': 0.83}
83%|████████▎ | 3545/4286 [26:41:37<5:00:28, 24.33s/it] {'loss': 0.0095, 'grad_norm': 2.673977594404574, 'learning_rate': 1.7288847410172654e-07, 'completion_length': 280.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.8467262387275696, 'rewards/format_reward': 1.0, 'reward': 1.8467262983322144, 'reward_std': 0.07288430258631706, 'kl': 0.2366943359375, 'epoch': 0.83}
83%|████████▎ | 3546/4286 [26:42:02<5:04:06, 24.66s/it] {'loss': 0.0054, 'grad_norm': 6.406957218816205, 'learning_rate': 1.726551563229118e-07, 'completion_length': 308.875, 'rewards/only_full_func_accuracy_reward': 0.691964328289032, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.06090506911277771, 'kl': 0.1337890625, 'epoch': 0.83}
83%|████████▎ | 3547/4286 [26:42:27<5:06:20, 24.87s/it] {'loss': 0.0062, 'grad_norm': 4.007840223987074, 'learning_rate': 1.7242183854409707e-07, 'completion_length': 315.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7291666865348816, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.06815172825008631, 'kl': 0.1541748046875, 'epoch': 0.83}
83%|████████▎ | 3548/4286 [26:42:52<5:05:18, 24.82s/it] {'loss': 0.0039, 'grad_norm': 51.171364562267854, 'learning_rate': 1.7218852076528232e-07, 'completion_length': 301.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8464286029338837, 'rewards/format_reward': 1.0, 'reward': 1.8464286923408508, 'reward_std': 0.056379660964012146, 'kl': 0.09716796875, 'epoch': 0.83}
83%|████████▎ | 3549/4286 [26:43:17<5:03:56, 24.74s/it] {'loss': 0.0016, 'grad_norm': 1.8025931375581898, 'learning_rate': 1.7195520298646757e-07, 'completion_length': 266.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6994048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.026572037022560835, 'kl': 0.040283203125, 'epoch': 0.83}
83%|████████▎ | 3550/4286 [26:43:42<5:05:37, 24.91s/it] {'loss': 0.0065, 'grad_norm': 8.335600246697147, 'learning_rate': 1.7172188520765281e-07, 'completion_length': 316.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762387275696, 'reward_std': 0.0551132932305336, 'kl': 0.162841796875, 'epoch': 0.83}
83%|████████▎ | 3551/4286 [26:44:06<5:01:15, 24.59s/it] {'loss': 0.0047, 'grad_norm': 0.5734538008645436, 'learning_rate': 1.714885674288381e-07, 'completion_length': 264.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 1.0, 'reward': 1.7708334922790527, 'reward_std': 0.04467591270804405, 'kl': 0.1175537109375, 'epoch': 0.83}
83%|████████▎ | 3552/4286 [26:44:30<4:59:00, 24.44s/it] {'loss': 0.005, 'grad_norm': 2.9706239113963697, 'learning_rate': 1.7125524965002334e-07, 'completion_length': 314.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607144474983215, 'reward_std': 0.0444291727617383, 'kl': 0.12451171875, 'epoch': 0.83}
83%|████████▎ | 3553/4286 [26:44:56<5:03:46, 24.87s/it] {'loss': 0.0068, 'grad_norm': 5.788842126369841, 'learning_rate': 1.7102193187120859e-07, 'completion_length': 302.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7982568740844727, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7803997993469238, 'reward_std': 0.09396259486675262, 'kl': 0.170166015625, 'epoch': 0.83}
83%|████████▎ | 3554/4286 [26:45:21<5:03:10, 24.85s/it] {'loss': 0.0065, 'grad_norm': 6.6330824587277215, 'learning_rate': 1.7078861409239384e-07, 'completion_length': 273.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6041667312383652, 'rewards/format_reward': 1.0, 'reward': 1.6041667461395264, 'reward_std': 0.022214587777853012, 'kl': 0.1640625, 'epoch': 0.83}
83%|████████▎ | 3555/4286 [26:45:45<5:00:21, 24.65s/it] {'loss': 0.0076, 'grad_norm': 1.642502311761865, 'learning_rate': 1.7055529631357908e-07, 'completion_length': 305.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.0565476156771183, 'kl': 0.18994140625, 'epoch': 0.83}
83%|████████▎ | 3556/4286 [26:46:09<4:59:06, 24.58s/it] {'loss': 0.0018, 'grad_norm': 1.0244445348166407, 'learning_rate': 1.7032197853476436e-07, 'completion_length': 296.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.922619104385376, 'rewards/format_reward': 1.0, 'reward': 1.9226191639900208, 'reward_std': 0.04007172957062721, 'kl': 0.0458984375, 'epoch': 0.83}
83%|████████▎ | 3557/4286 [26:46:33<4:55:36, 24.33s/it] {'loss': 0.0161, 'grad_norm': 3.487289802273221, 'learning_rate': 1.700886607559496e-07, 'completion_length': 318.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7625000476837158, 'rewards/format_reward': 1.0, 'reward': 1.7625000476837158, 'reward_std': 0.02290012687444687, 'kl': 0.4013671875, 'epoch': 0.83}
83%|████████▎ | 3558/4286 [26:46:57<4:54:48, 24.30s/it] {'loss': 0.0059, 'grad_norm': 24.358778059778682, 'learning_rate': 1.6985534297713486e-07, 'completion_length': 282.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7604166567325592, 'rewards/format_reward': 1.0, 'reward': 1.7604168057441711, 'reward_std': 0.10155686363577843, 'kl': 0.146484375, 'epoch': 0.83}
83%|████████▎ | 3559/4286 [26:47:22<4:55:45, 24.41s/it] {'loss': 0.008, 'grad_norm': 9.011397516185207, 'learning_rate': 1.696220251983201e-07, 'completion_length': 313.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7172619700431824, 'reward_std': 0.01785714365541935, 'kl': 0.20068359375, 'epoch': 0.83}
83%|████████▎ | 3560/4286 [26:47:47<4:58:58, 24.71s/it] {'loss': 0.0015, 'grad_norm': 2.959347447839669, 'learning_rate': 1.6938870741950535e-07, 'completion_length': 298.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7604166865348816, 'rewards/format_reward': 1.0, 'reward': 1.7604167461395264, 'reward_std': 0.0414529861882329, 'kl': 0.038330078125, 'epoch': 0.83}
83%|████████▎ | 3561/4286 [26:48:10<4:51:29, 24.12s/it] {'loss': 0.0019, 'grad_norm': 0.1586979674211021, 'learning_rate': 1.6915538964069063e-07, 'completion_length': 224.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.6190476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.0, 'kl': 0.048583984375, 'epoch': 0.83}
83%|████████▎ | 3562/4286 [26:48:33<4:46:50, 23.77s/it] {'loss': 0.0094, 'grad_norm': 18.577556436761142, 'learning_rate': 1.6892207186187588e-07, 'completion_length': 275.8393020629883, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.797619104385376, 'reward_std': 0.023809521459043026, 'kl': 0.23583984375, 'epoch': 0.83}
83%|████████▎ | 3563/4286 [26:48:57<4:45:54, 23.73s/it] {'loss': 0.0024, 'grad_norm': 4.194742533465981, 'learning_rate': 1.6868875408306113e-07, 'completion_length': 273.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7782739102840424, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.07397739589214325, 'kl': 0.0596923828125, 'epoch': 0.83}
83%|████████▎ | 3564/4286 [26:49:24<4:57:49, 24.75s/it] {'loss': 0.009, 'grad_norm': 4.036002326436751, 'learning_rate': 1.6845543630424638e-07, 'completion_length': 333.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.0357142873108387, 'kl': 0.224609375, 'epoch': 0.83}
83%|████████▎ | 3565/4286 [26:49:49<4:58:45, 24.86s/it] {'loss': 0.0034, 'grad_norm': 94.90053805094068, 'learning_rate': 1.6822211852543165e-07, 'completion_length': 301.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.78125, 'rewards/format_reward': 1.0, 'reward': 1.7812501788139343, 'reward_std': 0.022675009444355965, 'kl': 0.083984375, 'epoch': 0.83}
83%|████████▎ | 3566/4286 [26:50:13<4:55:22, 24.61s/it] {'loss': 0.0096, 'grad_norm': 4.728974228739038, 'learning_rate': 1.679888007466169e-07, 'completion_length': 302.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8023810088634491, 'rewards/format_reward': 1.0, 'reward': 1.802381157875061, 'reward_std': 0.07659847289323807, 'kl': 0.240234375, 'epoch': 0.83}
83%|████████▎ | 3567/4286 [26:50:37<4:52:06, 24.38s/it] {'loss': 0.0041, 'grad_norm': 4.832655004195441, 'learning_rate': 1.6775548296780215e-07, 'completion_length': 293.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7306548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.015801788307726383, 'kl': 0.10296630859375, 'epoch': 0.83}
83%|████████▎ | 3568/4286 [26:51:02<4:55:11, 24.67s/it] {'loss': 0.0195, 'grad_norm': 7.036254624309291, 'learning_rate': 1.675221651889874e-07, 'completion_length': 288.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7105655074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6927084922790527, 'reward_std': 0.07589286379516125, 'kl': 0.48779296875, 'epoch': 0.83}
83%|████████▎ | 3569/4286 [26:51:26<4:53:18, 24.54s/it] {'loss': 0.0017, 'grad_norm': 4.239939444784404, 'learning_rate': 1.6728884741017264e-07, 'completion_length': 299.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7886905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7886905670166016, 'reward_std': 0.04602411389350891, 'kl': 0.041748046875, 'epoch': 0.83}
83%|████████▎ | 3570/4286 [26:51:52<4:56:26, 24.84s/it] {'loss': 0.0205, 'grad_norm': 2.869841119815651, 'learning_rate': 1.6705552963135792e-07, 'completion_length': 306.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7658731043338776, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7480159401893616, 'reward_std': 0.09158628946170211, 'kl': 0.513916015625, 'epoch': 0.83}
83%|████████▎ | 3571/4286 [26:52:17<4:56:07, 24.85s/it] {'loss': 0.0261, 'grad_norm': 1.7868192714637798, 'learning_rate': 1.6682221185254317e-07, 'completion_length': 327.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.6488095223903656, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.06913644261658192, 'kl': 0.650390625, 'epoch': 0.83}
83%|████████▎ | 3572/4286 [26:52:43<5:00:15, 25.23s/it] {'loss': 0.0113, 'grad_norm': 2.5858011634527784, 'learning_rate': 1.6658889407372842e-07, 'completion_length': 312.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.051302384585142136, 'kl': 0.2818603515625, 'epoch': 0.83}
83%|████████▎ | 3573/4286 [26:53:08<4:58:33, 25.12s/it] {'loss': 0.0196, 'grad_norm': 5.0522671840741555, 'learning_rate': 1.6635557629491367e-07, 'completion_length': 300.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803571939468384, 'reward_std': 0.08026953227818012, 'kl': 0.48974609375, 'epoch': 0.83}
83%|████████▎ | 3574/4286 [26:53:33<4:58:55, 25.19s/it] {'loss': 0.0138, 'grad_norm': 2.4513332129228727, 'learning_rate': 1.6612225851609894e-07, 'completion_length': 315.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.748512089252472, 'reward_std': 0.04823923297226429, 'kl': 0.345947265625, 'epoch': 0.83}
83%|████████▎ | 3575/4286 [26:53:58<4:56:09, 24.99s/it] {'loss': 0.0161, 'grad_norm': 8.162356684940965, 'learning_rate': 1.658889407372842e-07, 'completion_length': 312.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.05360400676727295, 'kl': 0.40234375, 'epoch': 0.83}
83%|████████▎ | 3576/4286 [26:54:22<4:52:53, 24.75s/it] {'loss': 0.0075, 'grad_norm': 1.2422871032724176, 'learning_rate': 1.6565562295846944e-07, 'completion_length': 301.875, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6875001788139343, 'reward_std': 0.0535714328289032, 'kl': 0.187255859375, 'epoch': 0.83}
83%|████████▎ | 3577/4286 [26:54:45<4:47:05, 24.30s/it] {'loss': 0.0091, 'grad_norm': 7.630878393115253, 'learning_rate': 1.654223051796547e-07, 'completion_length': 261.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.80952388048172, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.01785714365541935, 'kl': 0.22705078125, 'epoch': 0.83}
83%|████████▎ | 3578/4286 [26:55:10<4:49:04, 24.50s/it] {'loss': 0.0144, 'grad_norm': 7.319717544062222, 'learning_rate': 1.6518898740083994e-07, 'completion_length': 316.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7491496801376343, 'rewards/format_reward': 1.0, 'reward': 1.7491497993469238, 'reward_std': 0.0728286188095808, 'kl': 0.361328125, 'epoch': 0.83}
84%|████████▎ | 3579/4286 [26:55:35<4:48:30, 24.48s/it] {'loss': 0.007, 'grad_norm': 9.913514479308027, 'learning_rate': 1.649556696220252e-07, 'completion_length': 292.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.04350833781063557, 'kl': 0.17578125, 'epoch': 0.84}
84%|████████▎ | 3580/4286 [26:55:59<4:49:22, 24.59s/it] {'loss': 0.0017, 'grad_norm': 4.784520895373704, 'learning_rate': 1.6472235184321046e-07, 'completion_length': 240.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693454027175903, 'reward_std': 0.13190628215670586, 'kl': 0.04248046875, 'epoch': 0.84}
84%|████████▎ | 3581/4286 [26:56:22<4:43:15, 24.11s/it] {'loss': 0.0094, 'grad_norm': 3.855006221658005, 'learning_rate': 1.644890340643957e-07, 'completion_length': 293.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6852679252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6674108505249023, 'reward_std': 0.07958441227674484, 'kl': 0.23486328125, 'epoch': 0.84}
84%|████████▎ | 3582/4286 [26:56:46<4:40:48, 23.93s/it] {'loss': 0.0025, 'grad_norm': 0.8838895005238231, 'learning_rate': 1.6425571628558096e-07, 'completion_length': 266.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7812500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7812500596046448, 'reward_std': 0.05016787815839052, 'kl': 0.063720703125, 'epoch': 0.84}
84%|████████▎ | 3583/4286 [26:57:11<4:44:05, 24.25s/it] {'loss': 0.0136, 'grad_norm': 4.859919739674466, 'learning_rate': 1.640223985067662e-07, 'completion_length': 296.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.8824405670166016, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8645834922790527, 'reward_std': 0.06005192920565605, 'kl': 0.33984375, 'epoch': 0.84}
84%|████████▎ | 3584/4286 [26:57:36<4:45:41, 24.42s/it] {'loss': 0.0015, 'grad_norm': 0.1265399903770433, 'learning_rate': 1.6378908072795148e-07, 'completion_length': 306.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.9345238506793976, 'rewards/format_reward': 1.0, 'reward': 1.93452388048172, 'reward_std': 0.0, 'kl': 0.03851318359375, 'epoch': 0.84}
84%|████████▎ | 3585/4286 [26:58:00<4:46:37, 24.53s/it] {'loss': 0.0018, 'grad_norm': 0.4294523175638254, 'learning_rate': 1.6355576294913673e-07, 'completion_length': 289.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.04443359375, 'epoch': 0.84}
84%|████████▎ | 3586/4286 [26:58:25<4:47:09, 24.61s/it] {'loss': 0.0049, 'grad_norm': 4.085125257321929, 'learning_rate': 1.6332244517032198e-07, 'completion_length': 335.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.891369104385376, 'rewards/format_reward': 1.0, 'reward': 1.8913691639900208, 'reward_std': 0.039311498403549194, 'kl': 0.12255859375, 'epoch': 0.84}
84%|████████▎ | 3587/4286 [26:58:50<4:46:13, 24.57s/it] {'loss': 0.0202, 'grad_norm': 4.287409716227846, 'learning_rate': 1.6308912739150723e-07, 'completion_length': 304.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.10155809111893177, 'kl': 0.5048828125, 'epoch': 0.84}
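Editor's note: in every entry, 'reward' is the sum of the two logged components, which is why each dip of 'rewards/format_reward' from 1.0 to 0.9821… or 0.9642… pulls 'reward' down by exactly the same amount below the accuracy term. A quick sanity check over parsed entries, reusing the hypothetical parse_log sketched earlier; the tiny residuals are float accumulation noise in the logged batch means, hence the tolerance:

```python
# Check: total reward ≈ accuracy reward + format reward at every step.
for step, m in parse_log("train.log"):
    total = (m["rewards/only_full_func_accuracy_reward"]
             + m["rewards/format_reward"])
    assert abs(m["reward"] - total) < 1e-6, (step, m["reward"], total)
```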
84%|████████▎ | 3588/4286 [26:59:14<4:44:54, 24.49s/it] {'loss': 0.0093, 'grad_norm': 14.111969243036715, 'learning_rate': 1.628558096126925e-07, 'completion_length': 294.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8043154776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.786458432674408, 'reward_std': 0.10514794662594795, 'kl': 0.23095703125, 'epoch': 0.84}
84%|████████▎ | 3589/4286 [26:59:37<4:38:15, 23.95s/it] {'loss': 0.0131, 'grad_norm': 4.857527669769308, 'learning_rate': 1.6262249183387775e-07, 'completion_length': 282.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.61904776096344, 'reward_std': 0.09791364148259163, 'kl': 0.326171875, 'epoch': 0.84}
84%|████████▍ | 3590/4286 [27:00:01<4:38:45, 24.03s/it] {'loss': 0.002, 'grad_norm': 0.45937828362200395, 'learning_rate': 1.62389174055063e-07, 'completion_length': 285.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.850595235824585, 'rewards/format_reward': 1.0, 'reward': 1.8505953550338745, 'reward_std': 0.008501701056957245, 'kl': 0.0498046875, 'epoch': 0.84}
84%|████████▍ | 3591/4286 [27:00:25<4:37:19, 23.94s/it] {'loss': 0.0072, 'grad_norm': 4.782383932102254, 'learning_rate': 1.6215585627624825e-07, 'completion_length': 282.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.09523809142410755, 'kl': 0.180419921875, 'epoch': 0.84}
84%|████████▍ | 3592/4286 [27:00:48<4:35:53, 23.85s/it] {'loss': 0.0049, 'grad_norm': 1.811743145585117, 'learning_rate': 1.619225384974335e-07, 'completion_length': 273.0893020629883, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.07142856903374195, 'kl': 0.122314453125, 'epoch': 0.84}
84%|████████▍ | 3593/4286 [27:01:12<4:35:01, 23.81s/it] {'loss': 0.0097, 'grad_norm': 4.615857013055231, 'learning_rate': 1.6168922071861877e-07, 'completion_length': 270.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.060946544632315636, 'kl': 0.2421875, 'epoch': 0.84}
84%|████████▍ | 3594/4286 [27:01:37<4:39:15, 24.21s/it] {'loss': 0.0015, 'grad_norm': 2.3349300509339828, 'learning_rate': 1.6145590293980402e-07, 'completion_length': 294.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.736607164144516, 'rewards/format_reward': 1.0, 'reward': 1.7366071939468384, 'reward_std': 0.03869047574698925, 'kl': 0.0384521484375, 'epoch': 0.84}
84%|████████▍ | 3595/4286 [27:02:03<4:42:41, 24.55s/it] {'loss': 0.0122, 'grad_norm': 8.313852212005111, 'learning_rate': 1.6122258516098927e-07, 'completion_length': 302.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244048357009888, 'reward_std': 0.041666668839752674, 'kl': 0.303466796875, 'epoch': 0.84}
84%|████████▍ | 3596/4286 [27:02:27<4:42:31, 24.57s/it] {'loss': 0.0053, 'grad_norm': 1.6234432048428604, 'learning_rate': 1.6098926738217452e-07, 'completion_length': 311.89288330078125, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.0750257857143879, 'kl': 0.133544921875, 'epoch': 0.84}
84%|████████▍ | 3597/4286 [27:02:52<4:43:21, 24.68s/it] {'loss': 0.018, 'grad_norm': 2.3495349631800564, 'learning_rate': 1.607559496033598e-07, 'completion_length': 304.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.860119104385376, 'rewards/format_reward': 1.0, 'reward': 1.860119104385376, 'reward_std': 0.029761902987957, 'kl': 0.4501953125, 'epoch': 0.84}
84%|████████▍ | 3598/4286 [27:03:15<4:36:09, 24.08s/it] {'loss': 0.0052, 'grad_norm': 2.4626381585059374, 'learning_rate': 1.6052263182454504e-07, 'completion_length': 299.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.68154776096344, 'reward_std': 0.04602411016821861, 'kl': 0.129638671875, 'epoch': 0.84}
84%|████████▍ | 3599/4286 [27:03:40<4:39:04, 24.37s/it] {'loss': 0.0151, 'grad_norm': 6.232752283135154, 'learning_rate': 1.602893140457303e-07, 'completion_length': 334.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244048357009888, 'reward_std': 0.03411935409530997, 'kl': 0.3798828125, 'epoch': 0.84}
84%|████████▍ | 3600/4286 [27:04:06<4:43:35, 24.80s/it] {'loss': 0.0226, 'grad_norm': 7.191908071044604, 'learning_rate': 1.6005599626691554e-07, 'completion_length': 318.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.5833334028720856, 'rewards/format_reward': 1.0, 'reward': 1.5833334922790527, 'reward_std': 0.05541311018168926, 'kl': 0.5625, 'epoch': 0.84}
84%|████████▍ | 3601/4286 [27:08:49<19:29:55, 102.47s/it] {'loss': 0.0049, 'grad_norm': 1.776490034679293, 'learning_rate': 1.598226784881008e-07, 'completion_length': 310.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7872024476528168, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.0267857164144516, 'kl': 0.1220703125, 'epoch': 0.84}
84%|████████▍ | 3602/4286 [27:09:14<15:03:00, 79.21s/it] {'loss': 0.0037, 'grad_norm': 4.033016420182774, 'learning_rate': 1.5958936070928606e-07, 'completion_length': 313.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7306548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.06735091656446457, 'kl': 0.0931396484375, 'epoch': 0.84}
84%|████████▍ | 3603/4286 [27:09:37<11:47:52, 62.18s/it] {'loss': 0.0238, 'grad_norm': 4.402175437182197, 'learning_rate': 1.593560429304713e-07, 'completion_length': 275.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.805059552192688, 'rewards/format_reward': 1.0, 'reward': 1.8050596714019775, 'reward_std': 0.07833484746515751, 'kl': 0.59228515625, 'epoch': 0.84}
84%|████████▍ | 3604/4286 [27:10:00<9:32:55, 50.40s/it] {'loss': 0.0152, 'grad_norm': 13.198390033720557, 'learning_rate': 1.5912272515165656e-07, 'completion_length': 252.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.07738095335662365, 'kl': 0.38037109375, 'epoch': 0.84}
84%|████████▍ | 3605/4286 [27:10:27<8:12:21, 43.38s/it] {'loss': 0.003, 'grad_norm': 8.545119752310711, 'learning_rate': 1.588894073728418e-07, 'completion_length': 314.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.822916716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8050596714019775, 'reward_std': 0.07029405422508717, 'kl': 0.0760498046875, 'epoch': 0.84}
84%|████████▍ | 3606/4286 [27:10:52<7:10:08, 37.95s/it] {'loss': 0.0052, 'grad_norm': 7.774122464887563, 'learning_rate': 1.5865608959402706e-07, 'completion_length': 336.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.8431548178195953, 'rewards/format_reward': 1.0, 'reward': 1.8431548476219177, 'reward_std': 0.033202226273715496, 'kl': 0.1304931640625, 'epoch': 0.84}
84%|████████▍ | 3607/4286 [27:11:17<6:25:32, 34.07s/it] {'loss': 0.0211, 'grad_norm': 6.838884615797912, 'learning_rate': 1.5842277181521233e-07, 'completion_length': 320.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6562500596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6383929252624512, 'reward_std': 0.08774416334927082, 'kl': 0.52587890625, 'epoch': 0.84}
84%|████████▍ | 3608/4286 [27:11:42<5:53:11, 31.26s/it] {'loss': 0.0123, 'grad_norm': 3.183666927020215, 'learning_rate': 1.5818945403639758e-07, 'completion_length': 311.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7544643580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7366072535514832, 'reward_std': 0.05059524346143007, 'kl': 0.3056640625, 'epoch': 0.84}
84%|████████▍ | 3609/4286 [27:12:07<5:31:13, 29.35s/it] {'loss': 0.0071, 'grad_norm': 1.558539310690773, 'learning_rate': 1.5795613625758283e-07, 'completion_length': 306.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.02083333395421505, 'kl': 0.17822265625, 'epoch': 0.84}
84%|████████▍ | 3610/4286 [27:12:32<5:16:34, 28.10s/it] {'loss': 0.0086, 'grad_norm': 7.021069857597137, 'learning_rate': 1.5772281847876808e-07, 'completion_length': 315.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.912202388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8943453431129456, 'reward_std': 0.050595239736139774, 'kl': 0.215087890625, 'epoch': 0.84}
84%|████████▍ | 3611/4286 [27:12:55<5:00:47, 26.74s/it] {'loss': 0.0062, 'grad_norm': 13.536838270144862, 'learning_rate': 1.5748950069995335e-07, 'completion_length': 300.64288330078125, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.03869048412889242, 'kl': 0.1539306640625, 'epoch': 0.84}
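Editor's note: the jump from ~24.8 s/it at step 3600 to 102.47 s/it at step 3601 (elapsed time 27:04:06 to 27:08:49, i.e. one step took about 283 s), with the ETA ballooning from roughly 4:43 to 19:29 and then decaying back over the following steps, is the signature of a smoothed rate estimate rather than a sustained slowdown: the bar's seconds-per-iteration is an exponential moving average, so a single stall (a checkpoint save or evaluation pause, perhaps; the log does not say) dominates briefly and then washes out. A sketch of that smoothing; the 0.3 factor is illustrative, not read from the original progress-bar settings, though it happens to reproduce the displayed numbers (0.3 × 283 + 0.7 × 24.8 ≈ 102.3 s/it):

```python
# ETA from an exponential moving average of seconds/iteration.
def eta_hours(step_times, remaining, alpha=0.3):
    rate = step_times[0]
    for dt in step_times[1:]:
        rate = alpha * dt + (1 - alpha) * rate  # new sample weighted by alpha
    return remaining * rate / 3600

# One 283 s stall after a run of ~25 s steps, 685 steps still to go:
print(eta_hours([24.8, 283.0], remaining=685))  # ≈ 19.5 h, vs the 19:29:55 shown
```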
84%|████████▍ | 3612/4286 [27:13:21<4:58:42, 26.59s/it] {'loss': 0.0127, 'grad_norm': 1.9475507496933464, 'learning_rate': 1.572561829211386e-07, 'completion_length': 320.625, 'rewards/only_full_func_accuracy_reward': 0.73139888048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7135418057441711, 'reward_std': 0.10930684208869934, 'kl': 0.317626953125, 'epoch': 0.84}
84%|████████▍ | 3613/4286 [27:13:46<4:49:59, 25.85s/it] {'loss': 0.0169, 'grad_norm': 5.811812575444688, 'learning_rate': 1.5702286514232385e-07, 'completion_length': 292.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.7595238387584686, 'rewards/format_reward': 1.0, 'reward': 1.7595239281654358, 'reward_std': 0.06679731351323426, 'kl': 0.4228515625, 'epoch': 0.84}
84%|████████▍ | 3614/4286 [27:14:10<4:44:11, 25.37s/it] {'loss': 0.0116, 'grad_norm': 29.745621847981774, 'learning_rate': 1.567895473635091e-07, 'completion_length': 313.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.822023868560791, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8041667342185974, 'reward_std': 0.08548666164278984, 'kl': 0.288818359375, 'epoch': 0.84}
84%|████████▍ | 3615/4286 [27:14:34<4:39:19, 24.98s/it] {'loss': 0.0019, 'grad_norm': 4.979156137036373, 'learning_rate': 1.5655622958469435e-07, 'completion_length': 297.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.032524414360523224, 'kl': 0.0469970703125, 'epoch': 0.84}
84%|████████▍ | 3616/4286 [27:14:58<4:37:07, 24.82s/it] {'loss': 0.0271, 'grad_norm': 6.864552812033009, 'learning_rate': 1.5632291180587962e-07, 'completion_length': 307.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875001192092896, 'reward_std': 0.029761902987957, 'kl': 0.67877197265625, 'epoch': 0.84}
84%|████████▍ | 3617/4286 [27:15:22<4:31:11, 24.32s/it] {'loss': 0.0078, 'grad_norm': 4.1385175999881145, 'learning_rate': 1.5608959402706485e-07, 'completion_length': 280.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6815476417541504, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.038476794958114624, 'kl': 0.194580078125, 'epoch': 0.84}
84%|████████▍ | 3618/4286 [27:15:46<4:30:34, 24.30s/it] {'loss': 0.021, 'grad_norm': 6.72762636330001, 'learning_rate': 1.558562762482501e-07, 'completion_length': 318.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.7008928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7008929252624512, 'reward_std': 0.05207537300884724, 'kl': 0.5224609375, 'epoch': 0.84}
84%|████████▍ | 3619/4286 [27:16:11<4:32:22, 24.50s/it] {'loss': 0.0031, 'grad_norm': 2.851342688299251, 'learning_rate': 1.5562295846943534e-07, 'completion_length': 306.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8720238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8720239400863647, 'reward_std': 0.05909644812345505, 'kl': 0.077392578125, 'epoch': 0.84}
84%|████████▍ | 3620/4286 [27:16:36<4:35:39, 24.83s/it] {'loss': 0.0019, 'grad_norm': 0.5878030280077167, 'learning_rate': 1.553896406906206e-07, 'completion_length': 316.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 0.0295482249930501, 'kl': 0.04730224609375, 'epoch': 0.84}
84%|████████▍ | 3621/4286 [27:17:02<4:38:16, 25.11s/it] {'loss': 0.0098, 'grad_norm': 2.187742514573144, 'learning_rate': 1.5515632291180587e-07, 'completion_length': 300.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.741443544626236, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7235864400863647, 'reward_std': 0.07504197582602501, 'kl': 0.244384765625, 'epoch': 0.84}
85%|████████▍ | 3622/4286 [27:17:27<4:36:28, 24.98s/it] {'loss': 0.0036, 'grad_norm': 0.9538662523077286, 'learning_rate': 1.5492300513299112e-07, 'completion_length': 315.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244049549102783, 'reward_std': 0.01785714365541935, 'kl': 0.0897216796875, 'epoch': 0.85}
85%|████████▍ | 3623/4286 [27:17:51<4:32:22, 24.65s/it] {'loss': 0.0019, 'grad_norm': 0.10318589653321902, 'learning_rate': 1.5468968735417636e-07, 'completion_length': 293.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.761904776096344, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.0, 'kl': 0.0465087890625, 'epoch': 0.85}
85%|████████▍ | 3624/4286 [27:18:16<4:33:31, 24.79s/it] {'loss': 0.0044, 'grad_norm': 2.8752360547880675, 'learning_rate': 1.5445636957536161e-07, 'completion_length': 289.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.8154761791229248, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.0654761902987957, 'kl': 0.111083984375, 'epoch': 0.85}
85%|████████▍ | 3625/4286 [27:18:40<4:31:06, 24.61s/it] {'loss': 0.0048, 'grad_norm': 48.95169639154173, 'learning_rate': 1.542230517965469e-07, 'completion_length': 288.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.726934552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.709077537059784, 'reward_std': 0.07178215496242046, 'kl': 0.11962890625, 'epoch': 0.85}
85%|████████▍ | 3626/4286 [27:19:04<4:28:52, 24.44s/it] {'loss': 0.0055, 'grad_norm': 6.03625194674501, 'learning_rate': 1.5398973401773214e-07, 'completion_length': 298.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7872024178504944, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.02678571827709675, 'kl': 0.1378173828125, 'epoch': 0.85}
85%|████████▍ | 3627/4286 [27:19:27<4:23:26, 23.99s/it] {'loss': 0.003, 'grad_norm': 6.977633624271478, 'learning_rate': 1.5375641623891739e-07, 'completion_length': 241.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8154762089252472, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.0714285746216774, 'kl': 0.073974609375, 'epoch': 0.85}
85%|████████▍ | 3628/4286 [27:19:52<4:26:41, 24.32s/it] {'loss': 0.0231, 'grad_norm': 5.357833158031178, 'learning_rate': 1.5352309846010263e-07, 'completion_length': 310.125, 'rewards/only_full_func_accuracy_reward': 0.84077388048172, 'rewards/format_reward': 1.0, 'reward': 1.8407739400863647, 'reward_std': 0.02900167927145958, 'kl': 0.578125, 'epoch': 0.85}
85%|████████▍ | 3629/4286 [27:20:17<4:27:17, 24.41s/it] {'loss': 0.0015, 'grad_norm': 2.2202324096813104, 'learning_rate': 1.5328978068128788e-07, 'completion_length': 275.0178756713867, 'rewards/only_full_func_accuracy_reward': 0.6830357909202576, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.02083333395421505, 'kl': 0.03875732421875, 'epoch': 0.85}
85%|████████▍ | 3630/4286 [27:20:41<4:25:47, 24.31s/it] {'loss': 0.0123, 'grad_norm': 7.714897085507554, 'learning_rate': 1.5305646290247316e-07, 'completion_length': 286.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.03284517675638199, 'kl': 0.308349609375, 'epoch': 0.85}
85%|████████▍ | 3631/4286 [27:21:06<4:27:47, 24.53s/it] {'loss': 0.011, 'grad_norm': 3.9337247384919554, 'learning_rate': 1.528231451236584e-07, 'completion_length': 301.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7020833790302277, 'rewards/format_reward': 1.0, 'reward': 1.7020834684371948, 'reward_std': 0.08510673232376575, 'kl': 0.27392578125, 'epoch': 0.85}
85%|████████▍ | 3632/4286 [27:21:31<4:28:49, 24.66s/it] {'loss': 0.0073, 'grad_norm': 6.102424926442549, 'learning_rate': 1.5258982734484366e-07, 'completion_length': 311.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.7202382683753967, 'reward_std': 0.09959554299712181, 'kl': 0.18212890625, 'epoch': 0.85}
85%|████████▍ | 3633/4286 [27:21:56<4:30:02, 24.81s/it] {'loss': 0.0041, 'grad_norm': 16.934388561592236, 'learning_rate': 1.523565095660289e-07, 'completion_length': 263.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.8348214626312256, 'rewards/format_reward': 1.0, 'reward': 1.8348215818405151, 'reward_std': 0.015801792964339256, 'kl': 0.10205078125, 'epoch': 0.85}
85%|████████▍ | 3634/4286 [27:22:19<4:24:20, 24.33s/it] {'loss': 0.0033, 'grad_norm': 5.62269771123365, 'learning_rate': 1.5212319178721418e-07, 'completion_length': 250.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.02267500478774309, 'kl': 0.0836181640625, 'epoch': 0.85}
85%|████████▍ | 3635/4286 [27:22:43<4:22:43, 24.21s/it] {'loss': 0.0069, 'grad_norm': 1.812364784153505, 'learning_rate': 1.5188987400839943e-07, 'completion_length': 301.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7916666865348816, 'rewards/format_reward': 1.0, 'reward': 1.7916668057441711, 'reward_std': 0.07142857648432255, 'kl': 0.1728515625, 'epoch': 0.85}
85%|████████▍ | 3636/4286 [27:23:07<4:20:46, 24.07s/it] {'loss': 0.0038, 'grad_norm': 1.5793396661746846, 'learning_rate': 1.5165655622958468e-07, 'completion_length': 271.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.90476194024086, 'rewards/format_reward': 1.0, 'reward': 1.9047619700431824, 'reward_std': 0.059523810632526875, 'kl': 0.094970703125, 'epoch': 0.85}
85%|████████▍ | 3637/4286 [27:23:31<4:20:31, 24.09s/it] {'loss': 0.0017, 'grad_norm': 0.5206462679041741, 'learning_rate': 1.5142323845076993e-07, 'completion_length': 290.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.008928571827709675, 'kl': 0.0433349609375, 'epoch': 0.85}
85%|████████▍ | 3638/4286 [27:23:55<4:19:24, 24.02s/it] {'loss': 0.0094, 'grad_norm': 5.030715777050826, 'learning_rate': 1.5118992067195517e-07, 'completion_length': 303.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5625000596046448, 'rewards/format_reward': 1.0, 'reward': 1.5625001192092896, 'reward_std': 0.0535714328289032, 'kl': 0.234375, 'epoch': 0.85}
85%|████████▍ | 3639/4286 [27:24:20<4:24:17, 24.51s/it] {'loss': 0.0198, 'grad_norm': 3.5686954373550743, 'learning_rate': 1.5095660289314045e-07, 'completion_length': 331.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7985119223594666, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7627975940704346, 'reward_std': 0.16993390768766403, 'kl': 0.49609375, 'epoch': 0.85}
85%|████████▍ | 3640/4286 [27:24:45<4:24:45, 24.59s/it] {'loss': 0.0146, 'grad_norm': 3.4603242485065344, 'learning_rate': 1.507232851143257e-07, 'completion_length': 303.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.8020834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7842263579368591, 'reward_std': 0.0744047611951828, 'kl': 0.365234375, 'epoch': 0.85}
85%|████████▍ | 3641/4286 [27:25:11<4:27:15, 24.86s/it] {'loss': 0.0043, 'grad_norm': 1.2603303430283332, 'learning_rate': 1.5048996733551095e-07, 'completion_length': 294.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.8809524178504944, 'rewards/format_reward': 1.0, 'reward': 1.8809524774551392, 'reward_std': 0.011904759332537651, 'kl': 0.1085205078125, 'epoch': 0.85}
85%|████████▍ | 3642/4286 [27:25:34<4:23:18, 24.53s/it] {'loss': 0.0034, 'grad_norm': 2.44684083029887, 'learning_rate': 1.502566495566962e-07, 'completion_length': 299.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.026785715483129025, 'kl': 0.0848388671875, 'epoch': 0.85}
85%|████████▍ | 3643/4286 [27:25:58<4:19:32, 24.22s/it] {'loss': 0.0147, 'grad_norm': 1.7351113027200364, 'learning_rate': 1.5002333177788144e-07, 'completion_length': 296.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500001788139343, 'reward_std': 0.06388125568628311, 'kl': 0.3671875, 'epoch': 0.85}
85%|████████▌ | 3644/4286 [27:26:23<4:21:50, 24.47s/it] {'loss': 0.0041, 'grad_norm': 2.521077726802996, 'learning_rate': 1.4979001399906672e-07, 'completion_length': 269.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.9196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.9196429252624512, 'reward_std': 0.015672582667320967, 'kl': 0.1025390625, 'epoch': 0.85}
85%|████████▌ | 3645/4286 [27:26:48<4:22:53, 24.61s/it] {'loss': 0.0034, 'grad_norm': 1.6951209707281716, 'learning_rate': 1.4955669622025197e-07, 'completion_length': 303.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.796131044626236, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.019238397479057312, 'kl': 0.085205078125, 'epoch': 0.85}
85%|████████▌ | 3646/4286 [27:27:14<4:26:52, 25.02s/it] {'loss': 0.0073, 'grad_norm': 5.419733090768353, 'learning_rate': 1.4932337844143722e-07, 'completion_length': 272.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.8199405372142792, 'rewards/format_reward': 1.0, 'reward': 1.8199405670166016, 'reward_std': 0.04488959163427353, 'kl': 0.1826171875, 'epoch': 0.85}
85%|████████▌ | 3647/4286 [27:27:37<4:21:26, 24.55s/it] {'loss': 0.0097, 'grad_norm': 4.109153399230517, 'learning_rate': 1.4909006066262247e-07, 'completion_length': 278.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7023810744285583, 'reward_std': 0.1190476194024086, 'kl': 0.24169921875, 'epoch': 0.85}
85%|████████▌ | 3648/4286 [27:28:02<4:22:10, 24.66s/it] {'loss': 0.0172, 'grad_norm': 6.968887181879994, 'learning_rate': 1.4885674288380774e-07, 'completion_length': 299.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7559524476528168, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.09630303084850311, 'kl': 0.43017578125, 'epoch': 0.85}
85%|████████▌ | 3649/4286 [27:28:27<4:23:15, 24.80s/it] {'loss': 0.0191, 'grad_norm': 5.839610873541258, 'learning_rate': 1.48623425104993e-07, 'completion_length': 308.4643096923828, 'rewards/only_full_func_accuracy_reward': 0.7113094925880432, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.06664376519620419, 'kl': 0.47607421875, 'epoch': 0.85}
85%|████████▌ | 3650/4286 [27:28:53<4:24:21, 24.94s/it] {'loss': 0.002, 'grad_norm': 5.633628102131818, 'learning_rate': 1.4839010732617824e-07, 'completion_length': 307.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8110119104385376, 'rewards/format_reward': 1.0, 'reward': 1.8110120296478271, 'reward_std': 0.036421445198357105, 'kl': 0.048828125, 'epoch': 0.85}
85%|████████▌ | 3651/4286 [27:29:18<4:24:52, 25.03s/it] {'loss': 0.0016, 'grad_norm': 1.721554096500165, 'learning_rate': 1.481567895473635e-07, 'completion_length': 319.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8377976417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8199405670166016, 'reward_std': 0.0863095261156559, 'kl': 0.03924560546875, 'epoch': 0.85}
85%|████████▌ | 3652/4286 [27:29:42<4:22:01, 24.80s/it] {'loss': 0.0032, 'grad_norm': 2.9963086094709004, 'learning_rate': 1.4792347176854874e-07, 'completion_length': 285.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.020619653165340424, 'kl': 0.080078125, 'epoch': 0.85}
85%|████████▌ | 3653/4286 [27:30:08<4:25:21, 25.15s/it] {'loss': 0.012, 'grad_norm': 24.298013117238046, 'learning_rate': 1.47690153989734e-07, 'completion_length': 304.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7845982611179352, 'rewards/format_reward': 1.0, 'reward': 1.7845982909202576, 'reward_std': 0.03073224611580372, 'kl': 0.30078125, 'epoch': 0.85}
85%|████████▌ | 3654/4286 [27:30:34<4:28:37, 25.50s/it] {'loss': 0.0274, 'grad_norm': 4.971304893703855, 'learning_rate': 1.4745683621091926e-07, 'completion_length': 314.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7023810744285583, 'reward_std': 0.0476190522313118, 'kl': 0.68316650390625, 'epoch': 0.85}
85%|████████▌ | 3655/4286 [27:30:59<4:26:20, 25.33s/it] {'loss': 0.0042, 'grad_norm': 7.739016517891054, 'learning_rate': 1.472235184321045e-07, 'completion_length': 296.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.7217262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.019238397479057312, 'kl': 0.1041259765625, 'epoch': 0.85}
85%|████████▌ | 3656/4286 [27:31:25<4:27:41, 25.49s/it] {'loss': 0.027, 'grad_norm': 19.363319038200107, 'learning_rate': 1.4699020065328976e-07, 'completion_length': 310.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7625000476837158, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7446430325508118, 'reward_std': 0.10772959515452385, 'kl': 0.672607421875, 'epoch': 0.85}
85%|████████▌ | 3657/4286 [27:31:51<4:28:25, 25.60s/it] {'loss': 0.0109, 'grad_norm': 2.233561726597463, 'learning_rate': 1.46756882874475e-07, 'completion_length': 328.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.8080357611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7723215222358704, 'reward_std': 0.15890567004680634, 'kl': 0.271484375, 'epoch': 0.85}
85%|████████▌ | 3658/4286 [27:32:16<4:25:35, 25.37s/it] {'loss': 0.0096, 'grad_norm': 1.2709370435156209, 'learning_rate': 1.4652356509566028e-07, 'completion_length': 330.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 1.0, 'reward': 1.7574405670166016, 'reward_std': 0.04304792545735836, 'kl': 0.242431640625, 'epoch': 0.85}
85%|████████▌ | 3659/4286 [27:32:42<4:28:01, 25.65s/it] {'loss': 0.0053, 'grad_norm': 6.544679417353751, 'learning_rate': 1.4629024731684553e-07, 'completion_length': 317.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.08021794259548187, 'kl': 0.13232421875, 'epoch': 0.85}
85%|████████▌ | 3660/4286 [27:33:08<4:28:43, 25.76s/it] {'loss': 0.0263, 'grad_norm': 4.515211334207386, 'learning_rate': 1.4605692953803078e-07, 'completion_length': 328.5, 'rewards/only_full_func_accuracy_reward': 0.6175595819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5997024774551392, 'reward_std': 0.0838614497333765, 'kl': 0.65625, 'epoch': 0.85}
'completion_length': 328.5, 'rewards/only_full_func_accuracy_reward': 0.6175595819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5997024774551392, 'reward_std': 0.0838614497333765, 'kl': 0.65625, 'epoch': 0.85} 85%|████████▌ | 3660/4286 [27:33:08<4:28:43, 25.76s/it] 85%|████████▌ | 3661/4286 [27:33:32<4:20:57, 25.05s/it] {'loss': 0.0073, 'grad_norm': 1.3901475019957246, 'learning_rate': 1.4582361175921603e-07, 'completion_length': 277.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.029761907644569874, 'kl': 0.18212890625, 'epoch': 0.85} 85%|████████▌ | 3661/4286 [27:33:32<4:20:57, 25.05s/it] 85%|████████▌ | 3662/4286 [27:33:57<4:22:14, 25.22s/it] {'loss': 0.0112, 'grad_norm': 68.60500136306464, 'learning_rate': 1.455902939804013e-07, 'completion_length': 289.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6815476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6636905670166016, 'reward_std': 0.10303214006125927, 'kl': 0.27978515625, 'epoch': 0.85} 85%|████████▌ | 3662/4286 [27:33:57<4:22:14, 25.22s/it] 85%|████████▌ | 3663/4286 [27:34:22<4:21:41, 25.20s/it] {'loss': 0.0168, 'grad_norm': 7.439393729378099, 'learning_rate': 1.4535697620158655e-07, 'completion_length': 291.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7529762983322144, 'reward_std': 0.12728974223136902, 'kl': 0.4189453125, 'epoch': 0.85} 85%|████████▌ | 3663/4286 [27:34:22<4:21:41, 25.20s/it] 85%|████████▌ | 3664/4286 [27:34:48<4:21:55, 25.27s/it] {'loss': 0.0103, 'grad_norm': 7.737604362571849, 'learning_rate': 1.451236584227718e-07, 'completion_length': 283.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.820684552192688, 'rewards/format_reward': 1.0, 'reward': 1.8206846714019775, 'reward_std': 0.0774846076965332, 'kl': 0.259765625, 'epoch': 0.85} 85%|████████▌ | 3664/4286 [27:34:48<4:21:55, 25.27s/it] 86%|████████▌ | 3665/4286 [27:35:12<4:17:09, 24.85s/it] {'loss': 0.0117, 'grad_norm': 39.70671991897947, 'learning_rate': 1.4489034064395705e-07, 'completion_length': 286.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8541667461395264, 'rewards/format_reward': 1.0, 'reward': 1.8541668057441711, 'reward_std': 0.028166969306766987, 'kl': 0.290771484375, 'epoch': 0.86} 86%|████████▌ | 3665/4286 [27:35:12<4:17:09, 24.85s/it] 86%|████████▌ | 3666/4286 [27:35:37<4:17:42, 24.94s/it] {'loss': 0.0063, 'grad_norm': 3.151310018787787, 'learning_rate': 1.446570228651423e-07, 'completion_length': 284.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.6145833730697632, 'rewards/format_reward': 1.0, 'reward': 1.6145833730697632, 'reward_std': 0.07029405422508717, 'kl': 0.156494140625, 'epoch': 0.86} 86%|████████▌ | 3666/4286 [27:35:37<4:17:42, 24.94s/it] 86%|████████▌ | 3667/4286 [27:36:03<4:20:11, 25.22s/it] {'loss': 0.0223, 'grad_norm': 7.2495841280443445, 'learning_rate': 1.4442370508632757e-07, 'completion_length': 312.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7471591234207153, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7293021082878113, 'reward_std': 0.0514095164835453, 'kl': 0.556640625, 'epoch': 0.86} 86%|████████▌ | 3667/4286 [27:36:03<4:20:11, 25.22s/it] 86%|████████▌ | 3668/4286 [27:36:29<4:22:15, 25.46s/it] {'loss': 0.0051, 'grad_norm': 16.690915038934946, 'learning_rate': 1.4419038730751282e-07, 'completion_length': 
271.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.014880956150591373, 'kl': 0.12744140625, 'epoch': 0.86} 86%|████████▌ | 3668/4286 [27:36:29<4:22:15, 25.46s/it] 86%|████████▌ | 3669/4286 [27:36:54<4:21:44, 25.45s/it] {'loss': 0.0125, 'grad_norm': 6.75097921968314, 'learning_rate': 1.4395706952869807e-07, 'completion_length': 305.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7172620296478271, 'reward_std': 0.05154913291335106, 'kl': 0.313232421875, 'epoch': 0.86} 86%|████████▌ | 3669/4286 [27:36:54<4:21:44, 25.45s/it] 86%|████████▌ | 3670/4286 [27:37:19<4:19:56, 25.32s/it] {'loss': 0.0068, 'grad_norm': 10.333165887266624, 'learning_rate': 1.4372375174988332e-07, 'completion_length': 309.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.7544643580913544, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.014880950096994638, 'kl': 0.17041015625, 'epoch': 0.86} 86%|████████▌ | 3670/4286 [27:37:19<4:19:56, 25.32s/it] 86%|████████▌ | 3671/4286 [27:37:44<4:17:03, 25.08s/it] {'loss': 0.0229, 'grad_norm': 5.963438171190601, 'learning_rate': 1.434904339710686e-07, 'completion_length': 291.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.7151786088943481, 'rewards/format_reward': 1.0, 'reward': 1.7151786088943481, 'reward_std': 0.07688858546316624, 'kl': 0.578125, 'epoch': 0.86} 86%|████████▌ | 3671/4286 [27:37:44<4:17:03, 25.08s/it] 86%|████████▌ | 3672/4286 [27:38:09<4:17:45, 25.19s/it] {'loss': 0.0029, 'grad_norm': 16.680270053362594, 'learning_rate': 1.4325711619225384e-07, 'completion_length': 315.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7425596117973328, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.06163906678557396, 'kl': 0.07373046875, 'epoch': 0.86} 86%|████████▌ | 3672/4286 [27:38:09<4:17:45, 25.19s/it] 86%|████████▌ | 3673/4286 [27:38:34<4:16:00, 25.06s/it] {'loss': 0.0062, 'grad_norm': 8.288930104876911, 'learning_rate': 1.430237984134391e-07, 'completion_length': 328.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.07225660420954227, 'kl': 0.155029296875, 'epoch': 0.86} 86%|████████▌ | 3673/4286 [27:38:34<4:16:00, 25.06s/it] 86%|████████▌ | 3674/4286 [27:38:59<4:14:21, 24.94s/it] {'loss': 0.0213, 'grad_norm': 4.7029834417496525, 'learning_rate': 1.4279048063462434e-07, 'completion_length': 308.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7053572535514832, 'reward_std': 0.1293574422597885, 'kl': 0.5322265625, 'epoch': 0.86} 86%|████████▌ | 3674/4286 [27:38:59<4:14:21, 24.94s/it] 86%|████████▌ | 3675/4286 [27:39:25<4:17:33, 25.29s/it] {'loss': 0.0178, 'grad_norm': 12.92970363356404, 'learning_rate': 1.425571628558096e-07, 'completion_length': 316.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.7083334922790527, 'reward_std': 0.08868439868092537, 'kl': 0.44482421875, 'epoch': 0.86} 86%|████████▌ | 3675/4286 [27:39:25<4:17:33, 25.29s/it] 86%|████████▌ | 3676/4286 [27:39:50<4:15:55, 25.17s/it] {'loss': 0.0131, 'grad_norm': 8.62900758916578, 'learning_rate': 1.4232384507699486e-07, 'completion_length': 317.3214416503906, 'rewards/only_full_func_accuracy_reward': 
0.7750000655651093, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7392858266830444, 'reward_std': 0.11071429029107094, 'kl': 0.32861328125, 'epoch': 0.86} 86%|████████▌ | 3676/4286 [27:39:50<4:15:55, 25.17s/it] 86%|████████▌ | 3677/4286 [27:40:15<4:16:08, 25.24s/it] {'loss': 0.0282, 'grad_norm': 33.9956125389615, 'learning_rate': 1.420905272981801e-07, 'completion_length': 300.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6148809790611267, 'rewards/format_reward': 1.0, 'reward': 1.6148810386657715, 'reward_std': 0.05276251211762428, 'kl': 0.703125, 'epoch': 0.86} 86%|████████▌ | 3677/4286 [27:40:15<4:16:08, 25.24s/it] 86%|████████▌ | 3678/4286 [27:40:40<4:15:27, 25.21s/it] {'loss': 0.0064, 'grad_norm': 8.46851101238413, 'learning_rate': 1.4185720951936536e-07, 'completion_length': 294.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7604167461395264, 'rewards/format_reward': 1.0, 'reward': 1.7604168057441711, 'reward_std': 0.041452985256910324, 'kl': 0.159423828125, 'epoch': 0.86} 86%|████████▌ | 3678/4286 [27:40:40<4:15:27, 25.21s/it] 86%|████████▌ | 3679/4286 [27:41:05<4:12:44, 24.98s/it] {'loss': 0.0166, 'grad_norm': 5.465771183351884, 'learning_rate': 1.416238917405506e-07, 'completion_length': 307.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6592262983322144, 'reward_std': 0.08311965689063072, 'kl': 0.41259765625, 'epoch': 0.86} 86%|████████▌ | 3679/4286 [27:41:05<4:12:44, 24.98s/it] 86%|████████▌ | 3680/4286 [27:41:29<4:12:00, 24.95s/it] {'loss': 0.0098, 'grad_norm': 5.299590555544412, 'learning_rate': 1.4139057396173586e-07, 'completion_length': 273.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.78125, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.763392984867096, 'reward_std': 0.06845238152891397, 'kl': 0.24609375, 'epoch': 0.86} 86%|████████▌ | 3680/4286 [27:41:29<4:12:00, 24.95s/it] 86%|████████▌ | 3681/4286 [27:41:53<4:08:40, 24.66s/it] {'loss': 0.0256, 'grad_norm': 16.57629706529509, 'learning_rate': 1.4115725618292113e-07, 'completion_length': 271.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.7038690745830536, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.059223161078989506, 'kl': 0.64013671875, 'epoch': 0.86} 86%|████████▌ | 3681/4286 [27:41:53<4:08:40, 24.66s/it] 86%|████████▌ | 3682/4286 [27:42:18<4:09:15, 24.76s/it] {'loss': 0.0068, 'grad_norm': 8.238218379287156, 'learning_rate': 1.4092393840410638e-07, 'completion_length': 310.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8348215222358704, 'rewards/format_reward': 1.0, 'reward': 1.8348215818405151, 'reward_std': 0.03869047574698925, 'kl': 0.1689453125, 'epoch': 0.86} 86%|████████▌ | 3682/4286 [27:42:18<4:09:15, 24.76s/it] 86%|████████▌ | 3683/4286 [27:42:44<4:12:03, 25.08s/it] {'loss': 0.0081, 'grad_norm': 4.57916809979706, 'learning_rate': 1.4069062062529163e-07, 'completion_length': 323.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.8377976417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8199406862258911, 'reward_std': 0.0784623995423317, 'kl': 0.203125, 'epoch': 0.86} 86%|████████▌ | 3683/4286 [27:42:44<4:12:03, 25.08s/it] 86%|████████▌ | 3684/4286 [27:43:10<4:12:18, 25.15s/it] {'loss': 0.032, 'grad_norm': 7.352836074169359, 'learning_rate': 1.4045730284647688e-07, 'completion_length': 311.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7366071939468384, 
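Note on reading these records: in every step above, 'reward' is exactly the sum of the two logged components, e.g. 0.660714328289032 + 1.0 = 1.660714328289032 at step 3652. A minimal sketch of that bookkeeping, assuming a GRPO-style setup with two reward functions; the function bodies below are hypothetical stand-ins, not this run's actual 'only_full_func_accuracy_reward' and 'format_reward' implementations:

    # Sketch: how the summed 'reward' logged above can arise (assumptions flagged inline).
    def format_reward(completion: str) -> float:
        # hypothetical 0/1 check that the completion follows the required template
        return 1.0 if completion.startswith("<think>") else 0.0

    def accuracy_reward(completion: str, answer: str) -> float:
        # hypothetical check that the full expected function call appears verbatim
        return 1.0 if answer in completion else 0.0

    def step_metrics(completions: list[str], answers: list[str]) -> dict[str, float]:
        acc = sum(accuracy_reward(c, a) for c, a in zip(completions, answers)) / len(completions)
        fmt = sum(format_reward(c) for c in completions) / len(completions)
        # matches the log: 'reward' is the component-wise sum of the two means
        return {"rewards/only_full_func_accuracy_reward": acc,
                "rewards/format_reward": fmt,
                "reward": acc + fmt}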
86%|████████▌ | 3685/4286 [27:43:34<4:10:10, 24.98s/it] {'loss': 0.0085, 'grad_norm': 7.517307867665229, 'learning_rate': 1.4022398506766215e-07, 'completion_length': 298.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.0659366175532341, 'kl': 0.212158203125, 'epoch': 0.86}
86%|████████▌ | 3686/4286 [27:43:59<4:09:07, 24.91s/it] {'loss': 0.0068, 'grad_norm': 3.7142469697158296, 'learning_rate': 1.399906672888474e-07, 'completion_length': 308.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.736607313156128, 'reward_std': 0.06250000465661287, 'kl': 0.17138671875, 'epoch': 0.86}
86%|████████▌ | 3687/4286 [27:44:24<4:09:48, 25.02s/it] {'loss': 0.0047, 'grad_norm': 5.137747007447998, 'learning_rate': 1.3975734951003265e-07, 'completion_length': 290.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.7827382683753967, 'reward_std': 0.011904762359336019, 'kl': 0.118408203125, 'epoch': 0.86}
86%|████████▌ | 3688/4286 [27:44:50<4:12:05, 25.29s/it] {'loss': 0.0049, 'grad_norm': 4.059582819791549, 'learning_rate': 1.395240317312179e-07, 'completion_length': 313.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7982143461704254, 'rewards/format_reward': 1.0, 'reward': 1.7982143759727478, 'reward_std': 0.021981075406074524, 'kl': 0.121337890625, 'epoch': 0.86}
86%|████████▌ | 3689/4286 [27:45:15<4:09:02, 25.03s/it] {'loss': 0.0053, 'grad_norm': 4.579489099475209, 'learning_rate': 1.3929071395240315e-07, 'completion_length': 272.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.7827381789684296, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.03419382870197296, 'kl': 0.132568359375, 'epoch': 0.86}
86%|████████▌ | 3690/4286 [27:45:40<4:09:23, 25.11s/it] {'loss': 0.0352, 'grad_norm': 10.502620584927655, 'learning_rate': 1.3905739617358842e-07, 'completion_length': 324.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.726190447807312, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.708333432674408, 'reward_std': 0.04761904664337635, 'kl': 0.880859375, 'epoch': 0.86}
86%|████████▌ | 3691/4286 [27:46:05<4:08:23, 25.05s/it] {'loss': 0.0125, 'grad_norm': 5.697005448332252, 'learning_rate': 1.3882407839477367e-07, 'completion_length': 307.7143096923828, 'rewards/only_full_func_accuracy_reward': 0.7175595462322235, 'rewards/format_reward': 1.0, 'reward': 1.7175596356391907, 'reward_std': 0.03219882398843765, 'kl': 0.3115234375, 'epoch': 0.86}
86%|████████▌ | 3692/4286 [27:46:28<4:03:52, 24.63s/it] {'loss': 0.0194, 'grad_norm': 2.871672782067752, 'learning_rate': 1.3859076061595892e-07, 'completion_length': 263.375, 'rewards/only_full_func_accuracy_reward': 0.8104166984558105, 'rewards/format_reward': 1.0, 'reward': 1.8104167580604553, 'reward_std': 0.03392857313156128, 'kl': 0.482666015625, 'epoch': 0.86}
86%|████████▌ | 3693/4286 [27:46:53<4:03:01, 24.59s/it] {'loss': 0.0205, 'grad_norm': 0.7581262357776499, 'learning_rate': 1.3835744283714417e-07, 'completion_length': 304.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7520292401313782, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7163150310516357, 'reward_std': 0.10308441519737244, 'kl': 0.51318359375, 'epoch': 0.86}
86%|████████▌ | 3694/4286 [27:47:18<4:04:14, 24.75s/it] {'loss': 0.0107, 'grad_norm': 9.414562351056919, 'learning_rate': 1.3812412505832944e-07, 'completion_length': 291.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.0714285746216774, 'kl': 0.267578125, 'epoch': 0.86}
86%|████████▌ | 3695/4286 [27:47:44<4:07:03, 25.08s/it] {'loss': 0.0115, 'grad_norm': 3.0017035808422774, 'learning_rate': 1.378908072795147e-07, 'completion_length': 276.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.8910714685916901, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8732144236564636, 'reward_std': 0.11982244625687599, 'kl': 0.28515625, 'epoch': 0.86}
86%|████████▌ | 3696/4286 [27:48:08<4:03:17, 24.74s/it] {'loss': 0.0226, 'grad_norm': 7.447404953403377, 'learning_rate': 1.3765748950069994e-07, 'completion_length': 264.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.702976256608963, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6851191520690918, 'reward_std': 0.17149470001459122, 'kl': 0.564453125, 'epoch': 0.86}
86%|████████▋ | 3697/4286 [27:48:33<4:02:41, 24.72s/it] {'loss': 0.0168, 'grad_norm': 10.585885994253772, 'learning_rate': 1.374241717218852e-07, 'completion_length': 305.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6997768580913544, 'rewards/format_reward': 1.0, 'reward': 1.6997769474983215, 'reward_std': 0.09289911389350891, 'kl': 0.42041015625, 'epoch': 0.86}
86%|████████▋ | 3698/4286 [27:48:58<4:03:21, 24.83s/it] {'loss': 0.0088, 'grad_norm': 18.56774902514758, 'learning_rate': 1.3719085394307044e-07, 'completion_length': 294.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.05952381528913975, 'kl': 0.22021484375, 'epoch': 0.86}
86%|████████▋ | 3699/4286 [27:49:22<4:02:24, 24.78s/it] {'loss': 0.0068, 'grad_norm': 2.305761687730516, 'learning_rate': 1.3695753616425571e-07, 'completion_length': 284.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7470238506793976, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.01785714365541935, 'kl': 0.16943359375, 'epoch': 0.86}
86%|████████▋ | 3700/4286 [27:49:48<4:04:48, 25.07s/it] {'loss': 0.0226, 'grad_norm': 12.475313353893672, 'learning_rate': 1.3672421838544096e-07, 'completion_length': 315.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7723214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7544644474983215, 'reward_std': 0.15135835111141205, 'kl': 0.5634765625, 'epoch': 0.86}
86%|████████▋ | 3701/4286 [27:53:37<14:00:32, 86.21s/it] {'loss': 0.0298, 'grad_norm': 1.957565985612967, 'learning_rate': 1.364909006066262e-07, 'completion_length': 319.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8199404776096344, 'rewards/format_reward': 1.0, 'reward': 1.8199405074119568, 'reward_std': 0.07288431003689766, 'kl': 0.74609375, 'epoch': 0.86}
86%|████████▋ | 3702/4286 [27:54:01<10:57:15, 67.53s/it] {'loss': 0.0264, 'grad_norm': 22.892281439792548, 'learning_rate': 1.3625758282781146e-07, 'completion_length': 294.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.0416666679084301, 'kl': 0.66064453125, 'epoch': 0.86}
86%|████████▋ | 3703/4286 [27:54:23<8:45:08, 54.04s/it] {'loss': 0.0031, 'grad_norm': 0.6188670918814473, 'learning_rate': 1.360242650489967e-07, 'completion_length': 258.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.7619048655033112, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.032524414360523224, 'kl': 0.0772705078125, 'epoch': 0.86}
86%|████████▋ | 3704/4286 [27:54:50<7:24:33, 45.83s/it] {'loss': 0.0105, 'grad_norm': 5.694127846752877, 'learning_rate': 1.3579094727018198e-07, 'completion_length': 318.125, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.07959691621363163, 'kl': 0.2626953125, 'epoch': 0.86}
86%|████████▋ | 3705/4286 [27:55:15<6:23:09, 39.57s/it] {'loss': 0.0205, 'grad_norm': 2.577022304068207, 'learning_rate': 1.3555762949136723e-07, 'completion_length': 320.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6235119700431824, 'rewards/format_reward': 1.0, 'reward': 1.6235120296478271, 'reward_std': 0.026785719208419323, 'kl': 0.5126953125, 'epoch': 0.86}
86%|████████▋ | 3706/4286 [27:55:38<5:34:56, 34.65s/it] {'loss': 0.0045, 'grad_norm': 3.9309829777116723, 'learning_rate': 1.3532431171255248e-07, 'completion_length': 290.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.021453513763844967, 'kl': 0.111328125, 'epoch': 0.86}
86%|████████▋ | 3707/4286 [27:56:04<5:09:24, 32.06s/it] {'loss': 0.0044, 'grad_norm': 35.15923000564419, 'learning_rate': 1.3509099393373773e-07, 'completion_length': 319.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.023809521459043026, 'kl': 0.10955810546875, 'epoch': 0.86}
87%|████████▋ | 3708/4286 [27:56:28<4:45:50, 29.67s/it] {'loss': 0.0022, 'grad_norm': 1.489415635210265, 'learning_rate': 1.34857676154923e-07, 'completion_length': 285.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7827381193637848, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.04602411389350891, 'kl': 0.05615234375, 'epoch': 0.87}
87%|████████▋ | 3709/4286 [27:56:54<4:32:32, 28.34s/it] {'loss': 0.0081, 'grad_norm': 7.258076564477022, 'learning_rate': 1.3462435837610825e-07, 'completion_length': 290.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.8690476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8690477013587952, 'reward_std': 0.02380952797830105, 'kl': 0.2034912109375, 'epoch': 0.87}
87%|████████▋ | 3710/4286 [27:57:18<4:21:52, 27.28s/it] {'loss': 0.0196, 'grad_norm': 0.5809838505866549, 'learning_rate': 1.343910405972935e-07, 'completion_length': 286.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7946429252624512, 'reward_std': 0.01785714365541935, 'kl': 0.4874267578125, 'epoch': 0.87}
87%|████████▋ | 3711/4286 [27:57:43<4:15:07, 26.62s/it] {'loss': 0.0126, 'grad_norm': 6.1547945456846, 'learning_rate': 1.3415772281847875e-07, 'completion_length': 320.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.741071492433548, 'rewards/format_reward': 1.0, 'reward': 1.7410715818405151, 'reward_std': 0.07008036784827709, 'kl': 0.314697265625, 'epoch': 0.87}
87%|████████▋ | 3712/4286 [27:58:10<4:14:11, 26.57s/it] {'loss': 0.0206, 'grad_norm': 21.17459548874595, 'learning_rate': 1.33924405039664e-07, 'completion_length': 297.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.736607313156128, 'reward_std': 0.12754884362220764, 'kl': 0.5146484375, 'epoch': 0.87}
87%|████████▋ | 3713/4286 [27:58:35<4:10:00, 26.18s/it] {'loss': 0.0019, 'grad_norm': 6.950905065253304, 'learning_rate': 1.3369108726084928e-07, 'completion_length': 300.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7946429252624512, 'rewards/format_reward': 1.0, 'reward': 1.794642984867096, 'reward_std': 0.02976190857589245, 'kl': 0.04638671875, 'epoch': 0.87}
87%|████████▋ | 3714/4286 [27:59:00<4:05:47, 25.78s/it] {'loss': 0.0217, 'grad_norm': 4.057054118943322, 'learning_rate': 1.3345776948203452e-07, 'completion_length': 329.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6994048655033112, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6815478205680847, 'reward_std': 0.08877889439463615, 'kl': 0.541015625, 'epoch': 0.87}
87%|████████▋ | 3715/4286 [27:59:24<3:58:52, 25.10s/it] {'loss': 0.0161, 'grad_norm': 1.178575675435691, 'learning_rate': 1.3322445170321977e-07, 'completion_length': 295.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6547619700431824, 'rewards/format_reward': 1.0, 'reward': 1.6547620296478271, 'reward_std': 0.03735069930553436, 'kl': 0.40234375, 'epoch': 0.87}
87%|████████▋ | 3716/4286 [27:59:48<3:56:56, 24.94s/it] {'loss': 0.0135, 'grad_norm': 5.099911129524243, 'learning_rate': 1.3299113392440502e-07, 'completion_length': 305.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.772321492433548, 'rewards/format_reward': 1.0, 'reward': 1.7723215818405151, 'reward_std': 0.09502441436052322, 'kl': 0.3387451171875, 'epoch': 0.87}
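The jump from roughly 25 s/it to 86.21 s/it at step 3701 above, followed by a smooth decay back toward 25 s/it over steps 3702-3708 (tqdm's s/it figure is a moving average, so one slow step bleeds into the next few readings), lands exactly on a multiple of 100 steps; the same thing happens again at step 3801 further down. The log does not show the cause, but one plausible (unconfirmed) explanation is periodic checkpointing, which under ZeRO-3 gathers sharded parameters and can stall a step. A hypothetical config that would produce this cadence; this is an assumption, not the run's actual settings:

    # Hypothetical HF TrainingArguments reproducing a stall every 100 steps
    # (assumption: the stalls at steps 3701 and 3801 come from checkpoint saves).
    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="checkpoints",  # placeholder path
        save_strategy="steps",
        save_steps=100,            # a ZeRO-3 save is expensive, hence the slow step after 3700/3800
    )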
87%|████████▋ | 3717/4286 [28:00:13<3:56:41, 24.96s/it] {'loss': 0.0114, 'grad_norm': 28.15968989730813, 'learning_rate': 1.327578161455903e-07, 'completion_length': 296.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6919642984867096, 'rewards/format_reward': 1.0, 'reward': 1.6919643878936768, 'reward_std': 0.07397740706801414, 'kl': 0.28369140625, 'epoch': 0.87}
87%|████████▋ | 3718/4286 [28:00:36<3:51:25, 24.45s/it] {'loss': 0.0076, 'grad_norm': 1.47966931722743, 'learning_rate': 1.3252449836677555e-07, 'completion_length': 281.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.0357142873108387, 'kl': 0.189697265625, 'epoch': 0.87}
87%|████████▋ | 3719/4286 [28:01:01<3:52:58, 24.65s/it] {'loss': 0.0046, 'grad_norm': 3.1451946952607788, 'learning_rate': 1.322911805879608e-07, 'completion_length': 302.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.03961131162941456, 'kl': 0.11590576171875, 'epoch': 0.87}
87%|████████▋ | 3720/4286 [28:01:27<3:54:16, 24.83s/it] {'loss': 0.0157, 'grad_norm': 13.15022448065754, 'learning_rate': 1.3205786280914604e-07, 'completion_length': 318.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.584821492433548, 'rewards/format_reward': 1.0, 'reward': 1.5848215222358704, 'reward_std': 0.068452388048172, 'kl': 0.392578125, 'epoch': 0.87}
87%|████████▋ | 3721/4286 [28:01:52<3:55:02, 24.96s/it] {'loss': 0.0286, 'grad_norm': 8.025059843087211, 'learning_rate': 1.318245450303313e-07, 'completion_length': 295.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7648809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7470239400863647, 'reward_std': 0.05297180637717247, 'kl': 0.7147216796875, 'epoch': 0.87}
87%|████████▋ | 3722/4286 [28:02:18<3:58:16, 25.35s/it] {'loss': 0.0076, 'grad_norm': 15.528359008369605, 'learning_rate': 1.3159122725151657e-07, 'completion_length': 307.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 0.0744047649204731, 'kl': 0.19140625, 'epoch': 0.87}
87%|████████▋ | 3723/4286 [28:02:43<3:54:49, 25.03s/it] {'loss': 0.0231, 'grad_norm': 3.635547326694898, 'learning_rate': 1.3135790947270182e-07, 'completion_length': 290.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7023810744285583, 'reward_std': 0.07578602433204651, 'kl': 0.5787353515625, 'epoch': 0.87}
87%|████████▋ | 3724/4286 [28:03:06<3:49:22, 24.49s/it] {'loss': 0.0049, 'grad_norm': 5.286881206633461, 'learning_rate': 1.3112459169388706e-07, 'completion_length': 268.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.09707976691424847, 'kl': 0.12255859375, 'epoch': 0.87}
87%|████████▋ | 3725/4286 [28:03:30<3:49:18, 24.52s/it] {'loss': 0.0171, 'grad_norm': 0.9299761584517261, 'learning_rate': 1.308912739150723e-07, 'completion_length': 315.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.898809552192688, 'rewards/format_reward': 1.0, 'reward': 1.8988096117973328, 'reward_std': 0.019440393894910812, 'kl': 0.429443359375, 'epoch': 0.87}
87%|████████▋ | 3726/4286 [28:03:55<3:50:08, 24.66s/it] {'loss': 0.0104, 'grad_norm': 2.30393777835406, 'learning_rate': 1.3065795613625756e-07, 'completion_length': 290.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6324405670166016, 'reward_std': 0.0803571492433548, 'kl': 0.259765625, 'epoch': 0.87}
87%|████████▋ | 3727/4286 [28:04:20<3:49:06, 24.59s/it] {'loss': 0.0052, 'grad_norm': 3.0758454094054457, 'learning_rate': 1.3042463835744284e-07, 'completion_length': 257.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8377976417541504, 'rewards/format_reward': 1.0, 'reward': 1.8377977013587952, 'reward_std': 0.019238398410379887, 'kl': 0.130615234375, 'epoch': 0.87}
87%|████████▋ | 3728/4286 [28:04:46<3:53:00, 25.05s/it] {'loss': 0.0068, 'grad_norm': 3.7527204573240787, 'learning_rate': 1.3019132057862809e-07, 'completion_length': 311.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548357009888, 'reward_std': 0.057715194299817085, 'kl': 0.169921875, 'epoch': 0.87}
87%|████████▋ | 3729/4286 [28:05:09<3:47:58, 24.56s/it] {'loss': 0.0095, 'grad_norm': 3.245790254739674, 'learning_rate': 1.2995800279981333e-07, 'completion_length': 279.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.860119104385376, 'rewards/format_reward': 1.0, 'reward': 1.8601191639900208, 'reward_std': 0.01785714365541935, 'kl': 0.23553466796875, 'epoch': 0.87}
87%|████████▋ | 3730/4286 [28:05:33<3:45:49, 24.37s/it] {'loss': 0.0134, 'grad_norm': 5.06349827670731, 'learning_rate': 1.2972468502099858e-07, 'completion_length': 273.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.7366072237491608, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.08965209499001503, 'kl': 0.333984375, 'epoch': 0.87}
87%|████████▋ | 3731/4286 [28:05:58<3:47:18, 24.57s/it] {'loss': 0.01, 'grad_norm': 19.201125462683436, 'learning_rate': 1.2949136724218386e-07, 'completion_length': 286.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.8303571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8125001192092896, 'reward_std': 0.08567243814468384, 'kl': 0.25048828125, 'epoch': 0.87}
87%|████████▋ | 3732/4286 [28:06:24<3:50:53, 25.01s/it] {'loss': 0.0037, 'grad_norm': 2.4822969673593493, 'learning_rate': 1.292580494633691e-07, 'completion_length': 323.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.7598215043544769, 'rewards/format_reward': 1.0, 'reward': 1.759821593761444, 'reward_std': 0.11451326310634613, 'kl': 0.091796875, 'epoch': 0.87}
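The logged 'learning_rate' falls by a constant ~2.333e-10 per step throughout this section, which is exactly 1e-6 / 4286: a linear decay from a base of 1e-6 to zero at the final step 4286. The base LR and scheduler type are inferred from the logged values, not read from the run's config, but the fit is exact to float precision:

    # Inferred linear schedule: lr(t) = base_lr * (T - t) / T, with T = 4286, base_lr = 1e-6.
    def lr_at_step(t: int, base_lr: float = 1e-6, total_steps: int = 4286) -> float:
        return base_lr * (total_steps - t) / total_steps

    # Matches the records above, e.g. step 3725 logs 1.308912739150723e-07:
    assert abs(lr_at_step(3725) - 1.308912739150723e-07) < 1e-15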
87%|████████▋ | 3733/4286 [28:06:51<3:54:08, 25.40s/it] {'loss': 0.0169, 'grad_norm': 6.342459664890371, 'learning_rate': 1.2902473168455436e-07, 'completion_length': 320.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.668154776096344, 'rewards/format_reward': 1.0, 'reward': 1.6681548953056335, 'reward_std': 0.13936662301421165, 'kl': 0.42333984375, 'epoch': 0.87}
87%|████████▋ | 3734/4286 [28:07:15<3:50:36, 25.07s/it] {'loss': 0.0189, 'grad_norm': 5.767283069665791, 'learning_rate': 1.287914139057396e-07, 'completion_length': 282.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.8303571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8125000596046448, 'reward_std': 0.0773809477686882, 'kl': 0.473388671875, 'epoch': 0.87}
87%|████████▋ | 3735/4286 [28:07:39<3:46:41, 24.69s/it] {'loss': 0.0092, 'grad_norm': 3.877683381433614, 'learning_rate': 1.2855809612692485e-07, 'completion_length': 320.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.758928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.0297619067132473, 'kl': 0.22998046875, 'epoch': 0.87}
87%|████████▋ | 3736/4286 [28:08:03<3:46:25, 24.70s/it] {'loss': 0.0053, 'grad_norm': 1.5123495351058132, 'learning_rate': 1.2832477834811013e-07, 'completion_length': 252.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.5833333730697632, 'rewards/format_reward': 1.0, 'reward': 1.5833334922790527, 'reward_std': 0.011904764920473099, 'kl': 0.13232421875, 'epoch': 0.87}
87%|████████▋ | 3737/4286 [28:08:28<3:46:22, 24.74s/it] {'loss': 0.0033, 'grad_norm': 2.0151808700926095, 'learning_rate': 1.2809146056929538e-07, 'completion_length': 298.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.8452381491661072, 'rewards/format_reward': 1.0, 'reward': 1.845238208770752, 'reward_std': 0.05541309807449579, 'kl': 0.0828857421875, 'epoch': 0.87}
87%|████████▋ | 3738/4286 [28:08:52<3:43:36, 24.48s/it] {'loss': 0.0599, 'grad_norm': 10.911741695693298, 'learning_rate': 1.2785814279048063e-07, 'completion_length': 285.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6548972427845001, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.637040138244629, 'reward_std': 0.04833441041409969, 'kl': 1.502197265625, 'epoch': 0.87}
87%|████████▋ | 3739/4286 [28:09:17<3:43:45, 24.54s/it] {'loss': 0.0158, 'grad_norm': 9.475895160054673, 'learning_rate': 1.2762482501166587e-07, 'completion_length': 328.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.07397740334272385, 'kl': 0.39501953125, 'epoch': 0.87}
87%|████████▋ | 3740/4286 [28:09:42<3:43:44, 24.59s/it] {'loss': 0.0191, 'grad_norm': 6.03469617319467, 'learning_rate': 1.2739150723285115e-07, 'completion_length': 316.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7172620296478271, 'reward_std': 0.14540373161435127, 'kl': 0.47802734375, 'epoch': 0.87}
87%|████████▋ | 3741/4286 [28:10:08<3:48:41, 25.18s/it] {'loss': 0.0151, 'grad_norm': 10.041743846147009, 'learning_rate': 1.271581894540364e-07, 'completion_length': 328.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.8690476417541504, 'rewards/format_reward': 1.0, 'reward': 1.86904776096344, 'reward_std': 0.06572293490171432, 'kl': 0.3779296875, 'epoch': 0.87}
87%|████████▋ | 3742/4286 [28:10:33<3:47:36, 25.10s/it] {'loss': 0.0118, 'grad_norm': 5.307542928875722, 'learning_rate': 1.2692487167522165e-07, 'completion_length': 285.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8110119700431824, 'rewards/format_reward': 1.0, 'reward': 1.8110120296478271, 'reward_std': 0.040532153099775314, 'kl': 0.29541015625, 'epoch': 0.87}
87%|████████▋ | 3743/4286 [28:10:58<3:48:07, 25.21s/it] {'loss': 0.0361, 'grad_norm': 11.153297600212424, 'learning_rate': 1.266915538964069e-07, 'completion_length': 339.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6955357789993286, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6776787042617798, 'reward_std': 0.14035560190677643, 'kl': 0.90234375, 'epoch': 0.87}
87%|████████▋ | 3744/4286 [28:11:22<3:43:53, 24.79s/it] {'loss': 0.011, 'grad_norm': 5.36520328052432, 'learning_rate': 1.2645823611759214e-07, 'completion_length': 281.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.8184524476528168, 'rewards/format_reward': 1.0, 'reward': 1.818452537059784, 'reward_std': 0.03411935269832611, 'kl': 0.2744140625, 'epoch': 0.87}
87%|████████▋ | 3745/4286 [28:11:46<3:41:35, 24.58s/it] {'loss': 0.0138, 'grad_norm': 2.4296726057366724, 'learning_rate': 1.2622491833877742e-07, 'completion_length': 303.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.680059552192688, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.06593660824000835, 'kl': 0.34619140625, 'epoch': 0.87}
87%|████████▋ | 3746/4286 [28:12:11<3:40:22, 24.49s/it] {'loss': 0.0176, 'grad_norm': 2.076407488594857, 'learning_rate': 1.2599160055996267e-07, 'completion_length': 292.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.8958333432674408, 'rewards/format_reward': 1.0, 'reward': 1.8958334922790527, 'reward_std': 0.07419107854366302, 'kl': 0.4415283203125, 'epoch': 0.87}
87%|████████▋ | 3747/4286 [28:12:34<3:38:01, 24.27s/it] {'loss': 0.0111, 'grad_norm': 2.809433396745412, 'learning_rate': 1.2575828278114792e-07, 'completion_length': 282.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7232143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7232143878936768, 'reward_std': 0.09481073915958405, 'kl': 0.27783203125, 'epoch': 0.87}
87%|████████▋ | 3748/4286 [28:13:01<3:44:30, 25.04s/it] {'loss': 0.0145, 'grad_norm': 8.474438418753374, 'learning_rate': 1.2552496500233316e-07, 'completion_length': 313.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.828869104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7931548953056335, 'reward_std': 0.1742016226053238, 'kl': 0.361328125, 'epoch': 0.87}
87%|████████▋ | 3749/4286 [28:13:26<3:43:59, 25.03s/it] {'loss': 0.0029, 'grad_norm': 4.433763458165228, 'learning_rate': 1.2529164722351841e-07, 'completion_length': 310.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.7455357611179352, 'rewards/format_reward': 1.0, 'reward': 1.7455358505249023, 'reward_std': 0.03642144054174423, 'kl': 0.0726318359375, 'epoch': 0.87}
87%|████████▋ | 3750/4286 [28:13:52<3:46:18, 25.33s/it] {'loss': 0.0201, 'grad_norm': 2.822982965174608, 'learning_rate': 1.250583294447037e-07, 'completion_length': 327.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244048357009888, 'reward_std': 0.01718304678797722, 'kl': 0.50146484375, 'epoch': 0.87}
88%|████████▊ | 3751/4286 [28:14:19<3:48:54, 25.67s/it] {'loss': 0.0041, 'grad_norm': 12.931769375331886, 'learning_rate': 1.2482501166588894e-07, 'completion_length': 285.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7413690388202667, 'rewards/format_reward': 1.0, 'reward': 1.7413691878318787, 'reward_std': 0.04928032495081425, 'kl': 0.101806640625, 'epoch': 0.88}
88%|████████▊ | 3752/4286 [28:14:43<3:44:46, 25.26s/it] {'loss': 0.0019, 'grad_norm': 0.6984638735420954, 'learning_rate': 1.2459169388707419e-07, 'completion_length': 275.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.028326730243861675, 'kl': 0.047119140625, 'epoch': 0.88}
88%|████████▊ | 3753/4286 [28:15:08<3:43:28, 25.16s/it] {'loss': 0.0014, 'grad_norm': 2.958530901430583, 'learning_rate': 1.2435837610825943e-07, 'completion_length': 288.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.8214287161827087, 'reward_std': 0.0357142873108387, 'kl': 0.0341796875, 'epoch': 0.88}
88%|████████▊ | 3754/4286 [28:15:34<3:45:44, 25.46s/it] {'loss': 0.012, 'grad_norm': 0.9130159293470748, 'learning_rate': 1.241250583294447e-07, 'completion_length': 269.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.8601190745830536, 'rewards/format_reward': 1.0, 'reward': 1.8601191639900208, 'reward_std': 0.01785714365541935, 'kl': 0.29931640625, 'epoch': 0.88}
88%|████████▊ | 3755/4286 [28:15:59<3:44:34, 25.38s/it] {'loss': 0.0051, 'grad_norm': 3.388358543348269, 'learning_rate': 1.2389174055062996e-07, 'completion_length': 299.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7812500596046448, 'rewards/format_reward': 1.0, 'reward': 1.7812501192092896, 'reward_std': 0.06090506911277771, 'kl': 0.128173828125, 'epoch': 0.88}
88%|████████▊ | 3756/4286 [28:16:23<3:40:59, 25.02s/it] {'loss': 0.0183, 'grad_norm': 0.836021031216871, 'learning_rate': 1.236584227718152e-07, 'completion_length': 282.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7559524476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7380953431129456, 'reward_std': 0.0476190522313118, 'kl': 0.45703125, 'epoch': 0.88}
88%|████████▊ | 3757/4286 [28:16:49<3:41:00, 25.07s/it] {'loss': 0.0054, 'grad_norm': 4.465770389426559, 'learning_rate': 1.2342510499300046e-07, 'completion_length': 310.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.05038155568763614, 'kl': 0.135498046875, 'epoch': 0.88}
88%|████████▊ | 3758/4286 [28:17:14<3:42:32, 25.29s/it] {'loss': 0.0064, 'grad_norm': 0.7742864702919047, 'learning_rate': 1.231917872141857e-07, 'completion_length': 311.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.572916716337204, 'rewards/format_reward': 1.0, 'reward': 1.5729167461395264, 'reward_std': 0.04032688960433006, 'kl': 0.16162109375, 'epoch': 0.88}
88%|████████▊ | 3759/4286 [28:17:40<3:43:15, 25.42s/it] {'loss': 0.0018, 'grad_norm': 4.522393553376469, 'learning_rate': 1.2295846943537098e-07, 'completion_length': 302.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.767857313156128, 'reward_std': 0.032524414360523224, 'kl': 0.046142578125, 'epoch': 0.88}
88%|████████▊ | 3760/4286 [28:18:07<3:47:37, 25.97s/it] {'loss': 0.0227, 'grad_norm': 1.6022660206539325, 'learning_rate': 1.2272515165655623e-07, 'completion_length': 271.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.8080357611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.790178656578064, 'reward_std': 0.05220366269350052, 'kl': 0.565673828125, 'epoch': 0.88}
88%|████████▊ | 3761/4286 [28:18:33<3:47:19, 25.98s/it] {'loss': 0.0169, 'grad_norm': 2.250504578840923, 'learning_rate': 1.2249183387774148e-07, 'completion_length': 330.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7580357789993286, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7401787638664246, 'reward_std': 0.09107143431901932, 'kl': 0.420654296875, 'epoch': 0.88}
88%|████████▊ | 3762/4286 [28:18:58<3:43:30, 25.59s/it] {'loss': 0.0017, 'grad_norm': 2.7202487023325452, 'learning_rate': 1.2225851609892673e-07, 'completion_length': 334.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7217262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7217263579368591, 'reward_std': 0.03495405940338969, 'kl': 0.042236328125, 'epoch': 0.88}
88%|████████▊ | 3763/4286 [28:19:21<3:36:13, 24.81s/it] {'loss': 0.0028, 'grad_norm': 13.14049413258062, 'learning_rate': 1.22025198320112e-07, 'completion_length': 248.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.822916716337204, 'rewards/format_reward': 1.0, 'reward': 1.8229168057441711, 'reward_std': 0.03869047574698925, 'kl': 0.0706787109375, 'epoch': 0.88}
88%|████████▊ | 3764/4286 [28:19:47<3:38:02, 25.06s/it] {'loss': 0.0049, 'grad_norm': 9.788894057044379, 'learning_rate': 1.2179188054129725e-07, 'completion_length': 309.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7633929252624512, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.07296610623598099, 'kl': 0.122314453125, 'epoch': 0.88}
[2025-03-03 19:17:36,546] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
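This stage3.py warning recurs several more times below; its own suggestion is to flush the allocator cache at the same point on every rank. A minimal sketch of where such a call could sit, assuming a hand-written DeepSpeed loop (`engine`, `dataloader`, and the flush interval are placeholders; with an HF/TRL trainer the call would have to live in a callback instead):

    # Sketch following the warning's own suggestion: flush the CUDA caching
    # allocator on all ranks at the same step so no rank stalls alone.
    from deepspeed.accelerator import get_accelerator

    for step, batch in enumerate(dataloader):   # placeholder loop and objects
        outputs = engine(**batch)               # forward through the DeepSpeed engine
        engine.backward(outputs.loss)           # assumes an HF-style model exposing .loss
        engine.step()
        if step % 100 == 0:                     # interval is an assumption; tune to memory pressure
            get_accelerator().empty_cache()     # the call the warning recommends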
'rewards/only_full_func_accuracy_reward': 0.65327388048172, 'rewards/format_reward': 1.0, 'reward': 1.65327388048172, 'reward_std': 0.0706152692437172, 'kl': 0.1357421875, 'epoch': 0.88} 88%|████████▊ | 3772/4286 [28:23:05<3:30:32, 24.58s/it] 88%|████████▊ | 3773/4286 [28:23:30<3:31:15, 24.71s/it] {'loss': 0.002, 'grad_norm': 2.436610161120346, 'learning_rate': 1.1969202053196454e-07, 'completion_length': 305.5, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.005952381528913975, 'kl': 0.0504150390625, 'epoch': 0.88} 88%|████████▊ | 3773/4286 [28:23:30<3:31:15, 24.71s/it] 88%|████████▊ | 3774/4286 [28:23:55<3:32:10, 24.86s/it] {'loss': 0.0027, 'grad_norm': 0.3267573563520625, 'learning_rate': 1.194587027531498e-07, 'completion_length': 312.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.0357142873108387, 'kl': 0.068603515625, 'epoch': 0.88} 88%|████████▊ | 3774/4286 [28:23:55<3:32:10, 24.86s/it] 88%|████████▊ | 3775/4286 [28:24:20<3:33:16, 25.04s/it] {'loss': 0.0079, 'grad_norm': 2.6444032333976013, 'learning_rate': 1.1922538497433504e-07, 'completion_length': 323.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693454027175903, 'reward_std': 0.01709691435098648, 'kl': 0.196533203125, 'epoch': 0.88} 88%|████████▊ | 3775/4286 [28:24:20<3:33:16, 25.04s/it] 88%|████████▊ | 3776/4286 [28:24:45<3:32:20, 24.98s/it] {'loss': 0.0132, 'grad_norm': 6.342098132727889, 'learning_rate': 1.189920671955203e-07, 'completion_length': 305.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.07280982658267021, 'kl': 0.3310546875, 'epoch': 0.88} 88%|████████▊ | 3776/4286 [28:24:45<3:32:20, 24.98s/it] 88%|████████▊ | 3777/4286 [28:25:10<3:31:30, 24.93s/it] {'loss': 0.0158, 'grad_norm': 3.353109013191272, 'learning_rate': 1.1875874941670555e-07, 'completion_length': 302.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.020833331160247326, 'kl': 0.39453125, 'epoch': 0.88} 88%|████████▊ | 3777/4286 [28:25:10<3:31:30, 24.93s/it] 88%|████████▊ | 3778/4286 [28:25:34<3:28:35, 24.64s/it] {'loss': 0.0042, 'grad_norm': 4.242968948232567, 'learning_rate': 1.1852543163789081e-07, 'completion_length': 258.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7544643580913544, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.03869047947227955, 'kl': 0.1053466796875, 'epoch': 0.88} 88%|████████▊ | 3778/4286 [28:25:34<3:28:35, 24.64s/it] 88%|████████▊ | 3779/4286 [28:25:59<3:28:40, 24.69s/it] {'loss': 0.0043, 'grad_norm': 10.110044368030694, 'learning_rate': 1.1829211385907606e-07, 'completion_length': 326.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.042127080261707306, 'kl': 0.107421875, 'epoch': 0.88} 88%|████████▊ | 3779/4286 [28:25:59<3:28:40, 24.69s/it] 88%|████████▊ | 3780/4286 [28:26:23<3:26:25, 24.48s/it] {'loss': 0.0028, 'grad_norm': 19.409257364754144, 'learning_rate': 1.1805879608026131e-07, 'completion_length': 282.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.8511905074119568, 'rewards/format_reward': 
1.0, 'reward': 1.8511906266212463, 'reward_std': 0.06664376333355904, 'kl': 0.070068359375, 'epoch': 0.88} 88%|████████▊ | 3780/4286 [28:26:23<3:26:25, 24.48s/it] 88%|████████▊ | 3781/4286 [28:26:46<3:22:57, 24.11s/it] {'loss': 0.0109, 'grad_norm': 1.1587539454035458, 'learning_rate': 1.1782547830144657e-07, 'completion_length': 261.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.8020834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7842262983322144, 'reward_std': 0.08346372283995152, 'kl': 0.271240234375, 'epoch': 0.88} 88%|████████▊ | 3781/4286 [28:26:46<3:22:57, 24.11s/it] 88%|████████▊ | 3782/4286 [28:27:11<3:23:41, 24.25s/it] {'loss': 0.0048, 'grad_norm': 2.139818714955382, 'learning_rate': 1.1759216052263182e-07, 'completion_length': 294.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6488096117973328, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.0, 'kl': 0.119384765625, 'epoch': 0.88} 88%|████████▊ | 3782/4286 [28:27:11<3:23:41, 24.25s/it] 88%|████████▊ | 3783/4286 [28:27:35<3:23:57, 24.33s/it] {'loss': 0.0067, 'grad_norm': 4.010767057140723, 'learning_rate': 1.1735884274381708e-07, 'completion_length': 313.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.03838982526212931, 'kl': 0.16748046875, 'epoch': 0.88} 88%|████████▊ | 3783/4286 [28:27:35<3:23:57, 24.33s/it] 88%|████████▊ | 3784/4286 [28:27:59<3:23:35, 24.33s/it] {'loss': 0.0032, 'grad_norm': 21.795085549119968, 'learning_rate': 1.1712552496500233e-07, 'completion_length': 308.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.045056876726448536, 'kl': 0.0792236328125, 'epoch': 0.88} 88%|████████▊ | 3784/4286 [28:27:59<3:23:35, 24.33s/it] 88%|████████▊ | 3785/4286 [28:28:25<3:27:06, 24.80s/it] {'loss': 0.0168, 'grad_norm': 4.8225840943633065, 'learning_rate': 1.1689220718618759e-07, 'completion_length': 329.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.786309540271759, 'rewards/format_reward': 1.0, 'reward': 1.7863096594810486, 'reward_std': 0.048270237632095814, 'kl': 0.418701171875, 'epoch': 0.88} 88%|████████▊ | 3785/4286 [28:28:25<3:27:06, 24.80s/it] 88%|████████▊ | 3786/4286 [28:28:50<3:25:39, 24.68s/it] {'loss': 0.0016, 'grad_norm': 2.2417895285538765, 'learning_rate': 1.1665888940737284e-07, 'completion_length': 313.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.8333334028720856, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.0357142873108387, 'kl': 0.0399169921875, 'epoch': 0.88} 88%|████████▊ | 3786/4286 [28:28:50<3:25:39, 24.68s/it] 88%|████████▊ | 3787/4286 [28:29:13<3:23:06, 24.42s/it] {'loss': 0.0023, 'grad_norm': 1.17916083085078, 'learning_rate': 1.1642557162855809e-07, 'completion_length': 305.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8883928656578064, 'rewards/format_reward': 1.0, 'reward': 1.888392984867096, 'reward_std': 0.0295482249930501, 'kl': 0.05810546875, 'epoch': 0.88} 88%|████████▊ | 3787/4286 [28:29:13<3:23:06, 24.42s/it] 88%|████████▊ | 3788/4286 [28:29:39<3:25:35, 24.77s/it] {'loss': 0.0166, 'grad_norm': 12.610283534743449, 'learning_rate': 1.1619225384974335e-07, 'completion_length': 321.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8238095641136169, 'rewards/format_reward': 1.0, 'reward': 1.8238096833229065, 'reward_std': 
0.05714224465191364, 'kl': 0.4169921875, 'epoch': 0.88} 88%|████████▊ | 3788/4286 [28:29:39<3:25:35, 24.77s/it] 88%|████████▊ | 3789/4286 [28:30:05<3:28:00, 25.11s/it] {'loss': 0.0054, 'grad_norm': 4.9805825723623665, 'learning_rate': 1.159589360709286e-07, 'completion_length': 318.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.8199405372142792, 'rewards/format_reward': 1.0, 'reward': 1.8199405670166016, 'reward_std': 0.05612026248127222, 'kl': 0.135009765625, 'epoch': 0.88} 88%|████████▊ | 3789/4286 [28:30:05<3:28:00, 25.11s/it] 88%|████████▊ | 3790/4286 [28:30:29<3:25:13, 24.83s/it] {'loss': 0.008, 'grad_norm': 4.966089406548688, 'learning_rate': 1.1572561829211386e-07, 'completion_length': 308.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.0267857164144516, 'kl': 0.200439453125, 'epoch': 0.88} 88%|████████▊ | 3790/4286 [28:30:29<3:25:13, 24.83s/it][2025-03-03 19:28:17,103] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 88%|████████▊ | 3791/4286 [28:30:54<3:25:22, 24.89s/it] {'loss': 0.006, 'grad_norm': 7.690983584437699, 'learning_rate': 1.1549230051329911e-07, 'completion_length': 298.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 1.0, 'reward': 1.8005954027175903, 'reward_std': 0.029761902987957, 'kl': 0.149169921875, 'epoch': 0.88} 88%|████████▊ | 3791/4286 [28:30:54<3:25:22, 24.89s/it][2025-03-03 19:28:44,787] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 88%|████████▊ | 3792/4286 [28:31:22<3:31:51, 25.73s/it] {'loss': 0.0021, 'grad_norm': 2.9933382094016303, 'learning_rate': 1.1525898273448437e-07, 'completion_length': 322.2143096923828, 'rewards/only_full_func_accuracy_reward': 0.7619048655033112, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.039310661144554615, 'kl': 0.051513671875, 'epoch': 0.88} 88%|████████▊ | 3792/4286 [28:31:22<3:31:51, 25.73s/it] 88%|████████▊ | 3793/4286 [28:31:47<3:31:04, 25.69s/it] {'loss': 0.0101, 'grad_norm': 6.678008602021559, 'learning_rate': 1.1502566495566962e-07, 'completion_length': 303.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.77976194024086, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.08776525594294071, 'kl': 0.251953125, 'epoch': 0.88} 88%|████████▊ | 3793/4286 [28:31:47<3:31:04, 25.69s/it][2025-03-03 19:29:34,278] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 89%|████████▊ | 3794/4286 [28:32:11<3:26:15, 25.15s/it] {'loss': 0.0165, 'grad_norm': 5.773343878773127, 'learning_rate': 1.1479234717685488e-07, 'completion_length': 254.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.6339285671710968, 'rewards/format_reward': 1.0, 'reward': 1.633928656578064, 'reward_std': 0.05137518048286438, 'kl': 0.4111328125, 'epoch': 0.89} 89%|████████▊ | 3794/4286 [28:32:11<3:26:15, 25.15s/it] 89%|████████▊ | 3795/4286 [28:32:37<3:26:34, 25.24s/it] {'loss': 0.0079, 'grad_norm': 6.7700778224702605, 'learning_rate': 1.1455902939804013e-07, 'completion_length': 292.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7241072058677673, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7062500715255737, 'reward_std': 0.05761783570051193, 'kl': 0.19677734375, 'epoch': 0.89} 89%|████████▊ | 3795/4286 [28:32:37<3:26:34, 25.24s/it] 89%|████████▊ | 3796/4286 [28:33:02<3:25:28, 25.16s/it] {'loss': 0.0077, 'grad_norm': 22.24006136262512, 'learning_rate': 1.1432571161922538e-07, 'completion_length': 288.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0476190522313118, 'kl': 0.191650390625, 'epoch': 0.89} 89%|████████▊ | 3796/4286 [28:33:02<3:25:28, 25.16s/it] 89%|████████▊ | 3797/4286 [28:33:26<3:22:57, 24.90s/it] {'loss': 0.0222, 'grad_norm': 1.8259577690826359, 'learning_rate': 1.1409239384041064e-07, 'completion_length': 320.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7232143580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7053572535514832, 'reward_std': 0.06626290921121836, 'kl': 0.5546875, 'epoch': 0.89} 89%|████████▊ | 3797/4286 [28:33:26<3:22:57, 24.90s/it] 89%|████████▊ | 3798/4286 [28:33:50<3:19:37, 24.54s/it] {'loss': 0.0136, 'grad_norm': 6.382174323656396, 'learning_rate': 1.1385907606159589e-07, 'completion_length': 303.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.8199405074119568, 'rewards/format_reward': 1.0, 'reward': 1.8199406266212463, 'reward_std': 0.03457976598292589, 'kl': 0.3388671875, 'epoch': 0.89} 89%|████████▊ | 3798/4286 [28:33:50<3:19:37, 24.54s/it] 89%|████████▊ | 3799/4286 [28:34:15<3:19:40, 24.60s/it] {'loss': 0.0038, 'grad_norm': 5.775137530805372, 'learning_rate': 1.1362575828278115e-07, 'completion_length': 287.32144927978516, 'rewards/only_full_func_accuracy_reward': 0.8244048357009888, 'rewards/format_reward': 1.0, 'reward': 1.8244048357009888, 'reward_std': 0.05909644067287445, 'kl': 0.094970703125, 'epoch': 0.89} 89%|████████▊ | 3799/4286 [28:34:15<3:19:40, 24.60s/it] 89%|████████▊ | 3800/4286 [28:34:39<3:18:33, 24.51s/it] {'loss': 0.0144, 'grad_norm': 47.13218516772244, 'learning_rate': 1.133924405039664e-07, 'completion_length': 303.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.08219881728291512, 'kl': 0.35986328125, 'epoch': 0.89} 89%|████████▊ | 3800/4286 [28:34:39<3:18:33, 24.51s/it] 89%|████████▊ | 3801/4286 [28:39:51<14:56:28, 110.90s/it] {'loss': 0.0047, 'grad_norm': 5.017676862769158, 'learning_rate': 1.1315912272515166e-07, 'completion_length': 301.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7767857909202576, 'rewards/format_reward': 1.0, 'reward': 
89%|████████▊ | 3802/4286 [28:40:17<11:29:24, 85.46s/it] {'loss': 0.0189, 'grad_norm': 1.574455144223585, 'learning_rate': 1.1292580494633691e-07, 'completion_length': 316.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7152778506278992, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.679563581943512, 'reward_std': 0.1051587387919426, 'kl': 0.4718017578125, 'epoch': 0.89}
89%|████████▊ | 3803/4286 [28:40:43<9:04:18, 67.62s/it] {'loss': 0.0064, 'grad_norm': 11.756577687011548, 'learning_rate': 1.1269248716752216e-07, 'completion_length': 311.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.8199405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.802083432674408, 'reward_std': 0.08795271255075932, 'kl': 0.16015625, 'epoch': 0.89}
89%|████████▉ | 3804/4286 [28:41:08<7:20:02, 54.78s/it] {'loss': 0.0056, 'grad_norm': 4.629799087079634, 'learning_rate': 1.1245916938870742e-07, 'completion_length': 310.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.028166964650154114, 'kl': 0.1402587890625, 'epoch': 0.89}
89%|████████▉ | 3805/4286 [28:41:34<6:08:37, 45.98s/it] {'loss': 0.0062, 'grad_norm': 74.61585959811738, 'learning_rate': 1.1222585160989267e-07, 'completion_length': 321.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6369048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.0352074655238539, 'kl': 0.155029296875, 'epoch': 0.89}
89%|████████▉ | 3806/4286 [28:42:00<5:20:57, 40.12s/it] {'loss': 0.0114, 'grad_norm': 17.791281411007837, 'learning_rate': 1.1199253383107793e-07, 'completion_length': 313.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7693452537059784, 'rewards/format_reward': 1.0, 'reward': 1.7693454027175903, 'reward_std': 0.03869047947227955, 'kl': 0.283935546875, 'epoch': 0.89}
89%|████████▉ | 3807/4286 [28:42:24<4:42:07, 35.34s/it] {'loss': 0.0174, 'grad_norm': 1.8552630901960219, 'learning_rate': 1.1175921605226318e-07, 'completion_length': 281.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5431548207998276, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5252977013587952, 'reward_std': 0.08267778158187866, 'kl': 0.4365234375, 'epoch': 0.89}
89%|████████▉ | 3808/4286 [28:42:50<4:17:52, 32.37s/it] {'loss': 0.0057, 'grad_norm': 5.330474401433841, 'learning_rate': 1.1152589827344844e-07, 'completion_length': 320.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619049549102783, 'reward_std': 0.0585499033331871, 'kl': 0.1416015625, 'epoch': 0.89}
89%|████████▉ | 3809/4286 [28:43:14<3:58:54, 30.05s/it] {'loss': 0.0031, 'grad_norm': 8.257929339459103, 'learning_rate': 1.1129258049463369e-07, 'completion_length': 305.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7425596117973328, 'rewards/format_reward': 1.0, 'reward': 1.7425596117973328, 'reward_std': 0.06249999441206455, 'kl': 0.0780029296875, 'epoch': 0.89}
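[editor's note] Two regularities in these records are worth calling out. The logged 'reward' is, up to float rounding, the sum of the two component means, and 'rewards/format_reward' only ever takes values of the form k/56 here (1.0, 0.98214..., 0.96428...), which would correspond to 56 sampled completions scored per optimizer step; the 56 is an inference from that quantization, not a number the log prints. A quick check against step 3803 above:

    # Values copied from the step 3803 record above.
    acc = 0.8199405372142792    # rewards/only_full_func_accuracy_reward
    fmt = 0.9821428656578064    # rewards/format_reward
    total = 1.802083432674408   # reward

    assert abs((acc + fmt) - total) < 1e-6  # reward == sum of component means

    # format_reward quantization: 55/56 = 0.98214285..., 54/56 = 0.96428571...
    print(55 / 56, 54 / 56)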
89%|████████▉ | 3810/4286 [28:43:38<3:42:19, 28.02s/it] {'loss': 0.0059, 'grad_norm': 8.335879286971311, 'learning_rate': 1.1105926271581894e-07, 'completion_length': 286.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.758928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.03596102446317673, 'kl': 0.148681640625, 'epoch': 0.89}
89%|████████▉ | 3811/4286 [28:44:03<3:36:05, 27.30s/it] {'loss': 0.0133, 'grad_norm': 0.8784006235626762, 'learning_rate': 1.108259449370042e-07, 'completion_length': 283.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.04464286006987095, 'kl': 0.332275390625, 'epoch': 0.89}
89%|████████▉ | 3812/4286 [28:44:29<3:30:46, 26.68s/it] {'loss': 0.011, 'grad_norm': 6.565445497402368, 'learning_rate': 1.1059262715818945e-07, 'completion_length': 306.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726192235946655, 'reward_std': 0.0444291764870286, 'kl': 0.27587890625, 'epoch': 0.89}
89%|████████▉ | 3813/4286 [28:44:54<3:27:03, 26.27s/it] {'loss': 0.0047, 'grad_norm': 128.15918158164297, 'learning_rate': 1.1035930937937471e-07, 'completion_length': 276.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.06547619216144085, 'kl': 0.1173095703125, 'epoch': 0.89}
89%|████████▉ | 3814/4286 [28:45:20<3:26:12, 26.21s/it] {'loss': 0.013, 'grad_norm': 20.299745444565602, 'learning_rate': 1.1012599160055996e-07, 'completion_length': 296.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7336310744285583, 'reward_std': 0.0922619067132473, 'kl': 0.32568359375, 'epoch': 0.89}
89%|████████▉ | 3815/4286 [28:45:45<3:23:37, 25.94s/it] {'loss': 0.0085, 'grad_norm': 9.635449107557436, 'learning_rate': 1.0989267382174522e-07, 'completion_length': 312.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.877976268529892, 'rewards/format_reward': 1.0, 'reward': 1.8779762983322144, 'reward_std': 0.029761902987957, 'kl': 0.212158203125, 'epoch': 0.89}
89%|████████▉ | 3816/4286 [28:46:09<3:17:56, 25.27s/it] {'loss': 0.0102, 'grad_norm': 5.313335288467028, 'learning_rate': 1.0965935604293047e-07, 'completion_length': 315.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7113095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.05222323164343834, 'kl': 0.2554931640625, 'epoch': 0.89}
89%|████████▉ | 3817/4286 [28:46:33<3:14:55, 24.94s/it] {'loss': 0.0245, 'grad_norm': 5.665495217521279, 'learning_rate': 1.0942603826411572e-07, 'completion_length': 265.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7842262089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7663691639900208, 'reward_std': 0.1220238097012043, 'kl': 0.6103515625, 'epoch': 0.89}
[2025-03-03 19:44:20,767] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
89%|████████▉ | 3818/4286 [28:46:58<3:14:08, 24.89s/it] {'loss': 0.0142, 'grad_norm': 7.049071036613264, 'learning_rate': 1.0919272048530097e-07, 'completion_length': 275.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7425595819950104, 'rewards/format_reward': 1.0, 'reward': 1.7425596117973328, 'reward_std': 0.03709554113447666, 'kl': 0.35400390625, 'epoch': 0.89}
89%|████████▉ | 3819/4286 [28:47:23<3:14:08, 24.94s/it] {'loss': 0.0073, 'grad_norm': 4.189083426056217, 'learning_rate': 1.0895940270648622e-07, 'completion_length': 305.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6101191639900208, 'reward_std': 0.08508220314979553, 'kl': 0.1820068359375, 'epoch': 0.89}
89%|████████▉ | 3820/4286 [28:47:49<3:15:40, 25.19s/it] {'loss': 0.0046, 'grad_norm': 2.59937432483721, 'learning_rate': 1.0872608492767148e-07, 'completion_length': 346.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.8422619998455048, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.03160357475280762, 'kl': 0.1148681640625, 'epoch': 0.89}
89%|████████▉ | 3821/4286 [28:48:14<3:15:17, 25.20s/it] {'loss': 0.0021, 'grad_norm': 3.100771022200215, 'learning_rate': 1.0849276714885673e-07, 'completion_length': 325.0714569091797, 'rewards/only_full_func_accuracy_reward': 0.8630953133106232, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.845238208770752, 'reward_std': 0.06731786206364632, 'kl': 0.0537109375, 'epoch': 0.89}
89%|████████▉ | 3822/4286 [28:48:39<3:15:06, 25.23s/it] {'loss': 0.0034, 'grad_norm': 7.584319488428212, 'learning_rate': 1.0825944937004199e-07, 'completion_length': 256.2678756713867, 'rewards/only_full_func_accuracy_reward': 0.7449405193328857, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.727083444595337, 'reward_std': 0.09788430482149124, 'kl': 0.0858154296875, 'epoch': 0.89}
89%|████████▉ | 3823/4286 [28:49:07<3:20:51, 26.03s/it] {'loss': 0.0091, 'grad_norm': 7.164486658471348, 'learning_rate': 1.0802613159122724e-07, 'completion_length': 274.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.8440476655960083, 'rewards/format_reward': 1.0, 'reward': 1.8440477848052979, 'reward_std': 0.03333957027643919, 'kl': 0.22705078125, 'epoch': 0.89}
89%|████████▉ | 3824/4286 [28:49:34<3:21:24, 26.16s/it] {'loss': 0.0043, 'grad_norm': 11.63850997162143, 'learning_rate': 1.0779281381241249e-07, 'completion_length': 287.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.7812500894069672, 'rewards/format_reward': 1.0, 'reward': 1.7812501788139343, 'reward_std': 0.08030407316982746, 'kl': 0.108154296875, 'epoch': 0.89}
89%|████████▉ | 3825/4286 [28:49:59<3:19:34, 25.97s/it] {'loss': 0.0045, 'grad_norm': 4.894838730774079, 'learning_rate': 1.0755949603359775e-07, 'completion_length': 285.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.011904762126505375, 'kl': 0.111328125, 'epoch': 0.89}
[2025-03-03 19:47:47,649] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
89%|████████▉ | 3826/4286 [28:50:25<3:18:20, 25.87s/it] {'loss': 0.0041, 'grad_norm': 1.316186574914923, 'learning_rate': 1.07326178254783e-07, 'completion_length': 306.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.011904759332537651, 'kl': 0.102294921875, 'epoch': 0.89}
[2025-03-03 19:48:12,458] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
89%|████████▉ | 3827/4286 [28:50:50<3:15:28, 25.55s/it] {'loss': 0.0042, 'grad_norm': 2.2081444709237674, 'learning_rate': 1.0709286047596826e-07, 'completion_length': 294.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7872024476528168, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.00297618773765862, 'kl': 0.10595703125, 'epoch': 0.89}
89%|████████▉ | 3828/4286 [28:51:16<3:17:48, 25.91s/it] {'loss': 0.0016, 'grad_norm': 3.450130465313232, 'learning_rate': 1.0685954269715351e-07, 'completion_length': 330.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262983322144, 'reward_std': 0.0652625085785985, 'kl': 0.0406494140625, 'epoch': 0.89}
89%|████████▉ | 3829/4286 [28:51:40<3:12:06, 25.22s/it] {'loss': 0.0054, 'grad_norm': 5.902827726210368, 'learning_rate': 1.0662622491833877e-07, 'completion_length': 295.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.8273810148239136, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.0068732211366295815, 'kl': 0.13623046875, 'epoch': 0.89}
[2025-03-03 19:49:29,145] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
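[editor's note] The 'learning_rate' column falls by the same ~2.333e-10 at every step, i.e. a linear decay. Extrapolating that slope over the 4286 total steps in the progress bar gives a peak learning rate of about 1.0e-06 decaying linearly to zero; the peak is inferred from the logged values, not read from a config file. A check in Python:

    # Two consecutive logged learning rates (steps 3789 and 3790).
    lr_a, lr_b = 1.159589360709286e-07, 1.1572561829211386e-07

    slope = lr_a - lr_b              # ~2.333e-10 per step
    total_steps = 4286
    peak_lr = slope * total_steps    # ~1.0e-06, assuming linear decay to zero

    # Cross-check against an independent step: lr(t) = peak_lr * (1 - t / T)
    predicted_3825 = peak_lr * (total_steps - 3825) / total_steps
    print(peak_lr, predicted_3825)   # ~1e-06, ~1.0756e-07 (logged: 1.0755949603359775e-07)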
89%|████████▉ | 3830/4286 [28:52:06<3:14:11, 25.55s/it] {'loss': 0.0158, 'grad_norm': 2.596761388192562, 'learning_rate': 1.0639290713952402e-07, 'completion_length': 316.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.02976190857589245, 'kl': 0.3935546875, 'epoch': 0.89}
[2025-03-03 19:49:54,266] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
89%|████████▉ | 3831/4286 [28:52:31<3:12:47, 25.42s/it] {'loss': 0.0049, 'grad_norm': 9.548296572503546, 'learning_rate': 1.0615958936070928e-07, 'completion_length': 298.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.05197649076581001, 'kl': 0.123046875, 'epoch': 0.89}
89%|████████▉ | 3832/4286 [28:52:56<3:10:23, 25.16s/it] {'loss': 0.0028, 'grad_norm': 6.6342728098072214, 'learning_rate': 1.0592627158189453e-07, 'completion_length': 272.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.6994048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6994048953056335, 'reward_std': 0.01785714365541935, 'kl': 0.070068359375, 'epoch': 0.89}
89%|████████▉ | 3833/4286 [28:53:22<3:11:38, 25.38s/it] {'loss': 0.0033, 'grad_norm': 2.911673221659, 'learning_rate': 1.0569295380307978e-07, 'completion_length': 306.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7827380895614624, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7648810744285583, 'reward_std': 0.06731786206364632, 'kl': 0.081298828125, 'epoch': 0.89}
89%|████████▉ | 3834/4286 [28:53:47<3:10:19, 25.26s/it] {'loss': 0.0052, 'grad_norm': 0.604190362670962, 'learning_rate': 1.0545963602426504e-07, 'completion_length': 319.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.8544643521308899, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8366072177886963, 'reward_std': 0.04107143357396126, 'kl': 0.130859375, 'epoch': 0.89}
89%|████████▉ | 3835/4286 [28:54:11<3:08:00, 25.01s/it] {'loss': 0.0055, 'grad_norm': 1.393359957378347, 'learning_rate': 1.0522631824545029e-07, 'completion_length': 320.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.885416716337204, 'rewards/format_reward': 1.0, 'reward': 1.8854167461395264, 'reward_std': 0.01709691435098648, 'kl': 0.1383056640625, 'epoch': 0.89}
90%|████████▉ | 3836/4286 [28:54:35<3:05:24, 24.72s/it] {'loss': 0.0053, 'grad_norm': 5.485606000476634, 'learning_rate': 1.0499300046663555e-07, 'completion_length': 308.89288330078125, 'rewards/only_full_func_accuracy_reward': 0.7872024178504944, 'rewards/format_reward': 1.0, 'reward': 1.787202537059784, 'reward_std': 0.05063671991229057, 'kl': 0.1328125, 'epoch': 0.9}
90%|████████▉ | 3837/4286 [28:55:00<3:04:26, 24.65s/it] {'loss': 0.0238, 'grad_norm': 8.941400207821964, 'learning_rate': 1.047596826878208e-07, 'completion_length': 289.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.03788716671988368, 'kl': 0.595703125, 'epoch': 0.9}
90%|████████▉ | 3838/4286 [28:55:24<3:03:43, 24.61s/it] {'loss': 0.004, 'grad_norm': 1.9291330480728532, 'learning_rate': 1.0452636490900606e-07, 'completion_length': 320.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.8125000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.04602411203086376, 'kl': 0.099853515625, 'epoch': 0.9}
90%|████████▉ | 3839/4286 [28:55:49<3:03:14, 24.60s/it] {'loss': 0.0068, 'grad_norm': 4.060073000286481, 'learning_rate': 1.0429304713019131e-07, 'completion_length': 274.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5669642984867096, 'rewards/format_reward': 1.0, 'reward': 1.5669643878936768, 'reward_std': 0.0295482249930501, 'kl': 0.170654296875, 'epoch': 0.9}
90%|████████▉ | 3840/4286 [28:56:14<3:03:56, 24.74s/it] {'loss': 0.0144, 'grad_norm': 0.7622931466773488, 'learning_rate': 1.0405972935137656e-07, 'completion_length': 290.76788330078125, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.006873216480016708, 'kl': 0.35986328125, 'epoch': 0.9}
90%|████████▉ | 3841/4286 [28:56:39<3:04:04, 24.82s/it] {'loss': 0.0088, 'grad_norm': 3.3548569861711326, 'learning_rate': 1.0382641157256182e-07, 'completion_length': 302.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244048953056335, 'reward_std': 0.01785714365541935, 'kl': 0.22021484375, 'epoch': 0.9}
90%|████████▉ | 3842/4286 [28:57:03<3:02:58, 24.73s/it] {'loss': 0.0128, 'grad_norm': 2.9337057147670946, 'learning_rate': 1.0359309379374707e-07, 'completion_length': 292.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.0068732211366295815, 'kl': 0.3203125, 'epoch': 0.9}
90%|████████▉ | 3843/4286 [28:57:28<3:01:30, 24.58s/it] {'loss': 0.0024, 'grad_norm': 0.9159622307054103, 'learning_rate': 1.0335977601493233e-07, 'completion_length': 290.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.8869048357009888, 'rewards/format_reward': 1.0, 'reward': 1.8869048953056335, 'reward_std': 0.011904759332537651, 'kl': 0.06024169921875, 'epoch': 0.9}
90%|████████▉ | 3844/4286 [28:57:52<3:01:34, 24.65s/it] {'loss': 0.0059, 'grad_norm': 4.530172889120216, 'learning_rate': 1.0312645823611758e-07, 'completion_length': 305.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5985119640827179, 'rewards/format_reward': 1.0, 'reward': 1.5985119938850403, 'reward_std': 0.06652759667485952, 'kl': 0.146484375, 'epoch': 0.9}
90%|████████▉ | 3845/4286 [28:58:18<3:02:05, 24.77s/it] {'loss': 0.0046, 'grad_norm': 0.518851891356495, 'learning_rate': 1.0289314045730284e-07, 'completion_length': 279.8928756713867, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.0, 'kl': 0.1142578125, 'epoch': 0.9}
[2025-03-03 19:56:06,527] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
90%|████████▉ | 3846/4286 [28:58:44<3:04:33, 25.17s/it] {'loss': 0.0124, 'grad_norm': 3.7983669185118543, 'learning_rate': 1.0265982267848809e-07, 'completion_length': 296.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.8148809969425201, 'rewards/format_reward': 1.0, 'reward': 1.8148810267448425, 'reward_std': 0.046428573317825794, 'kl': 0.3095703125, 'epoch': 0.9}
90%|████████▉ | 3847/4286 [28:59:09<3:03:32, 25.09s/it] {'loss': 0.0052, 'grad_norm': 11.672215229440143, 'learning_rate': 1.0242650489967334e-07, 'completion_length': 325.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7113095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.050381554290652275, 'kl': 0.129150390625, 'epoch': 0.9}
90%|████████▉ | 3848/4286 [28:59:34<3:04:45, 25.31s/it] {'loss': 0.0018, 'grad_norm': 1.565813404120131, 'learning_rate': 1.021931871208586e-07, 'completion_length': 327.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7678572237491608, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.049366687424480915, 'kl': 0.04388427734375, 'epoch': 0.9}
90%|████████▉ | 3849/4286 [28:59:58<3:00:58, 24.85s/it] {'loss': 0.0153, 'grad_norm': 7.839656311748349, 'learning_rate': 1.0195986934204385e-07, 'completion_length': 309.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.0476190485060215, 'kl': 0.380859375, 'epoch': 0.9}
90%|████████▉ | 3850/4286 [29:00:24<3:02:38, 25.13s/it] {'loss': 0.0159, 'grad_norm': 1.5579826995301072, 'learning_rate': 1.0172655156322911e-07, 'completion_length': 321.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.8333334028720856, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.04878662619739771, 'kl': 0.39697265625, 'epoch': 0.9}
90%|████████▉ | 3851/4286 [29:00:48<3:00:08, 24.85s/it] {'loss': 0.0215, 'grad_norm': 14.009579313601298, 'learning_rate': 1.0149323378441436e-07, 'completion_length': 306.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6517857909202576, 'reward_std': 0.09266335889697075, 'kl': 0.5390625, 'epoch': 0.9}
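[editor's note] 'loss' also tracks 'kl' at a near-constant ratio of about 0.04 across steps (e.g. 0.0215 / 0.5390625 ≈ 0.0399 at step 3851 above). That is the shape a GRPO objective takes when the clipped policy-gradient term averages out near zero on-policy, leaving loss ≈ beta * KL; a KL coefficient of beta ≈ 0.04 also happens to be TRL's GRPOTrainer default. This is an inference from the numbers, not a setting the log prints:

    # (loss, kl) pairs copied from records above; the ratio estimates the
    # KL coefficient beta under the assumption loss ≈ beta * kl.
    records = [
        (0.0054, 0.135009765625),  # step 3789
        (0.0021, 0.051513671875),  # step 3792
        (0.0215, 0.5390625),       # step 3851
    ]
    for loss, kl in records:
        print(round(loss / kl, 4))  # 0.04, 0.0408, 0.0399 -- all near 0.04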
90%|████████▉ | 3852/4286 [29:01:14<3:01:25, 25.08s/it] {'loss': 0.0279, 'grad_norm': 3.3540787255254245, 'learning_rate': 1.0125991600559962e-07, 'completion_length': 291.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7206845879554749, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.702827513217926, 'reward_std': 0.07966537587344646, 'kl': 0.695068359375, 'epoch': 0.9}
90%|████████▉ | 3853/4286 [29:01:39<3:01:33, 25.16s/it] {'loss': 0.0059, 'grad_norm': 2.603855770742385, 'learning_rate': 1.0102659822678487e-07, 'completion_length': 310.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7306548953056335, 'reward_std': 0.1978098228573799, 'kl': 0.146728515625, 'epoch': 0.9}
90%|████████▉ | 3854/4286 [29:02:06<3:05:23, 25.75s/it] {'loss': 0.0248, 'grad_norm': 3.040249300976849, 'learning_rate': 1.0079328044797013e-07, 'completion_length': 256.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 1.0, 'reward': 1.7991071939468384, 'reward_std': 0.08324572443962097, 'kl': 0.6219482421875, 'epoch': 0.9}
90%|████████▉ | 3855/4286 [29:02:30<3:01:42, 25.30s/it] {'loss': 0.0066, 'grad_norm': 6.585125894209492, 'learning_rate': 1.0055996266915538e-07, 'completion_length': 280.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.760416716337204, 'rewards/format_reward': 1.0, 'reward': 1.7604168057441711, 'reward_std': 0.0625000037252903, 'kl': 0.164794921875, 'epoch': 0.9}
90%|████████▉ | 3856/4286 [29:02:54<2:58:13, 24.87s/it] {'loss': 0.007, 'grad_norm': 5.131050207478831, 'learning_rate': 1.0032664489034063e-07, 'completion_length': 279.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7708333432674408, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.017857143189758062, 'kl': 0.173828125, 'epoch': 0.9}
90%|████████▉ | 3857/4286 [29:03:20<2:58:58, 25.03s/it] {'loss': 0.0134, 'grad_norm': 0.8053849032602258, 'learning_rate': 1.0009332711152589e-07, 'completion_length': 288.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.9133928716182709, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8955357670783997, 'reward_std': 0.054816875606775284, 'kl': 0.3333740234375, 'epoch': 0.9}
90%|█████████ | 3858/4286 [29:03:46<3:01:09, 25.40s/it] {'loss': 0.0093, 'grad_norm': 6.560843354196632, 'learning_rate': 9.986000933271114e-08, 'completion_length': 310.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.6898809969425201, 'rewards/format_reward': 1.0, 'reward': 1.6898810267448425, 'reward_std': 0.06507788598537445, 'kl': 0.232421875, 'epoch': 0.9}
90%|█████████ | 3859/4286 [29:04:11<3:00:15, 25.33s/it] {'loss': 0.0097, 'grad_norm': 65.91639530246775, 'learning_rate': 9.96266915538964e-08, 'completion_length': 300.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.7693453133106232, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.0695329811424017, 'kl': 0.243408203125, 'epoch': 0.9}
90%|█████████ | 3860/4286 [29:04:36<2:58:22, 25.12s/it] {'loss': 0.0023, 'grad_norm': 8.280374359178213, 'learning_rate': 9.939337377508165e-08, 'completion_length': 291.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8023809790611267, 'rewards/format_reward': 1.0, 'reward': 1.8023810386657715, 'reward_std': 0.0280321529135108, 'kl': 0.0567626953125, 'epoch': 0.9}
90%|█████████ | 3861/4286 [29:04:59<2:54:19, 24.61s/it] {'loss': 0.0054, 'grad_norm': 6.357377083398813, 'learning_rate': 9.916005599626691e-08, 'completion_length': 261.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7514881193637848, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.039858054369688034, 'kl': 0.1357421875, 'epoch': 0.9}
90%|█████████ | 3862/4286 [29:05:23<2:53:00, 24.48s/it] {'loss': 0.0044, 'grad_norm': 13.884149597941558, 'learning_rate': 9.892673821745216e-08, 'completion_length': 285.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.796726256608963, 'rewards/format_reward': 1.0, 'reward': 1.7967263460159302, 'reward_std': 0.05840010568499565, 'kl': 0.11083984375, 'epoch': 0.9}
[2025-03-03 20:03:10,163] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
90%|█████████ | 3863/4286 [29:05:47<2:51:19, 24.30s/it] {'loss': 0.0109, 'grad_norm': 17.838313807241125, 'learning_rate': 9.869342043863741e-08, 'completion_length': 291.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7208333611488342, 'rewards/format_reward': 1.0, 'reward': 1.7208334803581238, 'reward_std': 0.04857343062758446, 'kl': 0.274169921875, 'epoch': 0.9}
90%|█████████ | 3864/4286 [29:06:12<2:51:26, 24.38s/it] {'loss': 0.0195, 'grad_norm': 6.616413526021014, 'learning_rate': 9.846010265982267e-08, 'completion_length': 308.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7961309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.09545149654150009, 'kl': 0.487548828125, 'epoch': 0.9}
90%|█████████ | 3865/4286 [29:06:38<2:54:37, 24.89s/it] {'loss': 0.0313, 'grad_norm': 3.8568930222207625, 'learning_rate': 9.822678488100792e-08, 'completion_length': 338.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7208333313465118, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7029763460159302, 'reward_std': 0.10089708492159843, 'kl': 0.7841796875, 'epoch': 0.9}
90%|█████████ | 3866/4286 [29:07:02<2:53:27, 24.78s/it] {'loss': 0.0018, 'grad_norm': 2.7255232321498504, 'learning_rate': 9.799346710219318e-08, 'completion_length': 328.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.693452388048172, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.07738094963133335, 'kl': 0.0445556640625, 'epoch': 0.9}
90%|█████████ | 3867/4286 [29:07:27<2:52:52, 24.76s/it] {'loss': 0.0036, 'grad_norm': 1.020648745735666, 'learning_rate': 9.776014932337843e-08, 'completion_length': 309.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.019440393894910812, 'kl': 0.089599609375, 'epoch': 0.9}
90%|█████████ | 3868/4286 [29:07:52<2:53:46, 24.94s/it] {'loss': 0.0091, 'grad_norm': 2.5633073300642035, 'learning_rate': 9.75268315445637e-08, 'completion_length': 274.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.07230739295482635, 'kl': 0.2265625, 'epoch': 0.9}
90%|█████████ | 3869/4286 [29:08:16<2:51:00, 24.60s/it] {'loss': 0.0233, 'grad_norm': 3.5407902830690254, 'learning_rate': 9.729351376574894e-08, 'completion_length': 301.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.029761902987957, 'kl': 0.5830078125, 'epoch': 0.9}
90%|█████████ | 3870/4286 [29:08:42<2:51:55, 24.80s/it] {'loss': 0.0084, 'grad_norm': 15.687772282666733, 'learning_rate': 9.706019598693419e-08, 'completion_length': 280.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.12684166990220547, 'kl': 0.21051025390625, 'epoch': 0.9}
90%|█████████ | 3871/4286 [29:09:07<2:51:56, 24.86s/it] {'loss': 0.0027, 'grad_norm': 2.259764279472864, 'learning_rate': 9.682687820811945e-08, 'completion_length': 314.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.02816697023808956, 'kl': 0.06787109375, 'epoch': 0.9}
90%|█████████ | 3872/4286 [29:09:32<2:52:51, 25.05s/it] {'loss': 0.0012, 'grad_norm': 0.6545346960923085, 'learning_rate': 9.65935604293047e-08, 'completion_length': 304.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8377977013587952, 'rewards/format_reward': 1.0, 'reward': 1.8377977013587952, 'reward_std': 0.019238398410379887, 'kl': 0.02996826171875, 'epoch': 0.9}
90%|█████████ | 3873/4286 [29:09:57<2:51:19, 24.89s/it] {'loss': 0.0038, 'grad_norm': 0.7862575418543446, 'learning_rate': 9.636024265048996e-08, 'completion_length': 285.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.01785714365541935, 'kl': 0.09521484375, 'epoch': 0.9}
90%|█████████ | 3874/4286 [29:10:22<2:52:55, 25.18s/it] {'loss': 0.0036, 'grad_norm': 6.40427325690882, 'learning_rate': 9.612692487167521e-08, 'completion_length': 282.375, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.02816697023808956, 'kl': 0.09033203125, 'epoch': 0.9}
90%|█████████ | 3875/4286 [29:10:47<2:51:38, 25.06s/it] {'loss': 0.0173, 'grad_norm': 7.41788613311631, 'learning_rate': 9.589360709286048e-08, 'completion_length': 300.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6941964626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.676339328289032, 'reward_std': 0.058849070221185684, 'kl': 0.43359375, 'epoch': 0.9}
90%|█████████ | 3876/4286 [29:11:11<2:49:36, 24.82s/it] {'loss': 0.0193, 'grad_norm': 30.84516319520787, 'learning_rate': 9.566028931404572e-08, 'completion_length': 299.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7529763579368591, 'reward_std': 0.0595238097012043, 'kl': 0.48193359375, 'epoch': 0.9}
90%|█████████ | 3877/4286 [29:11:36<2:49:34, 24.88s/it] {'loss': 0.0069, 'grad_norm': 1.1366026408631202, 'learning_rate': 9.542697153523099e-08, 'completion_length': 305.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7797619700431824, 'reward_std': 0.011904759332537651, 'kl': 0.172607421875, 'epoch': 0.9}
90%|█████████ | 3878/4286 [29:12:01<2:48:56, 24.84s/it] {'loss': 0.01, 'grad_norm': 1.7067902577215524, 'learning_rate': 9.519365375641623e-08, 'completion_length': 318.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.05578738637268543, 'kl': 0.2490234375, 'epoch': 0.9}
91%|█████████ | 3879/4286 [29:12:27<2:50:48, 25.18s/it] {'loss': 0.0102, 'grad_norm': 4.2861325921033115, 'learning_rate': 9.496033597760148e-08, 'completion_length': 329.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7738095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.026275082491338253, 'kl': 0.2548828125, 'epoch': 0.91}
91%|█████████ | 3880/4286 [29:12:52<2:50:05, 25.14s/it] {'loss': 0.0087, 'grad_norm': 2.604215164159346, 'learning_rate': 9.472701819878675e-08, 'completion_length': 325.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7008928656578064, 'rewards/format_reward': 1.0, 'reward': 1.700892984867096, 'reward_std': 0.05335774086415768, 'kl': 0.21630859375, 'epoch': 0.91}
91%|█████████ | 3881/4286 [29:13:18<2:50:24, 25.25s/it] {'loss': 0.0117, 'grad_norm': 3.232318320102309, 'learning_rate': 9.4493700419972e-08, 'completion_length': 307.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7261906266212463, 'reward_std': 0.10554793104529381, 'kl': 0.2919921875, 'epoch': 0.91}
91%|█████████ | 3882/4286 [29:13:42<2:48:08, 24.97s/it] {'loss': 0.0057, 'grad_norm': 19.277068049910742, 'learning_rate': 9.426038264115726e-08, 'completion_length': 312.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005954027175903, 'reward_std': 0.07100121024996042, 'kl': 0.14178466796875, 'epoch': 0.91}
91%|█████████ | 3883/4286 [29:14:06<2:46:13, 24.75s/it] {'loss': 0.008, 'grad_norm': 4.113403977020595, 'learning_rate': 9.40270648623425e-08, 'completion_length': 323.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7633929550647736, 'rewards/format_reward': 1.0, 'reward': 1.7633930444717407, 'reward_std': 0.0684523805975914, 'kl': 0.19873046875, 'epoch': 0.91}
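[editor's note] With roughly 150 records in this section alone (plus the occasional slow step: 3801 at 110.90s/it and 3901 at 119.36s/it land exactly 100 steps apart, consistent with periodic checkpointing, though the log never says so), it is easier to parse the metrics back out of the saved console text than to read them inline. A minimal sketch, where train.log is a placeholder for wherever this console output was captured:

    import ast
    import re

    # Each record is a flat Python dict literal ending in 'epoch': <float>}.
    record_re = re.compile(r"\{'loss':.*?'epoch': [0-9.]+\}")

    with open("train.log") as f:        # placeholder path
        text = f.read()

    records = [ast.literal_eval(m.group()) for m in record_re.finditer(text)]

    # Example: mean total reward over the parsed window.
    rewards = [r["reward"] for r in records]
    print(len(records), sum(rewards) / len(rewards))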
91%|█████████ | 3884/4286 [29:14:33<2:50:33, 25.46s/it] {'loss': 0.0039, 'grad_norm': 8.457646525013459, 'learning_rate': 9.379374708352777e-08, 'completion_length': 311.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5625001192092896, 'reward_std': 0.14759551733732224, 'kl': 0.0975341796875, 'epoch': 0.91}
91%|█████████ | 3885/4286 [29:14:57<2:45:55, 24.83s/it] {'loss': 0.0036, 'grad_norm': 14.65552135543325, 'learning_rate': 9.356042930471302e-08, 'completion_length': 292.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.044642855413258076, 'kl': 0.0908203125, 'epoch': 0.91}
91%|█████████ | 3886/4286 [29:15:22<2:45:57, 24.89s/it] {'loss': 0.0024, 'grad_norm': 0.4501849205430358, 'learning_rate': 9.332711152589826e-08, 'completion_length': 312.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.834821492433548, 'rewards/format_reward': 1.0, 'reward': 1.8348215222358704, 'reward_std': 0.00297618773765862, 'kl': 0.0606689453125, 'epoch': 0.91}
[2025-03-03 20:13:11,312] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
91%|█████████ | 3887/4286 [29:15:48<2:48:54, 25.40s/it] {'loss': 0.0177, 'grad_norm': 20.671394383036294, 'learning_rate': 9.309379374708353e-08, 'completion_length': 318.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.7358631491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7180060744285583, 'reward_std': 0.10431477054953575, 'kl': 0.443603515625, 'epoch': 0.91}
91%|█████████ | 3888/4286 [29:16:15<2:50:19, 25.68s/it] {'loss': 0.0131, 'grad_norm': 6.132215930607374, 'learning_rate': 9.286047596826877e-08, 'completion_length': 276.7143020629883, 'rewards/only_full_func_accuracy_reward': 0.8318452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.813988208770752, 'reward_std': 0.0834076926112175, 'kl': 0.3271484375, 'epoch': 0.91}
91%|█████████ | 3889/4286 [29:16:40<2:49:29, 25.61s/it] {'loss': 0.0047, 'grad_norm': 12.558301885506978, 'learning_rate': 9.262715818945404e-08, 'completion_length': 297.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7455357313156128, 'rewards/format_reward': 1.0, 'reward': 1.7455359101295471, 'reward_std': 0.026111618615686893, 'kl': 0.1171875, 'epoch': 0.91}
91%|█████████ | 3890/4286 [29:17:04<2:46:24, 25.21s/it] {'loss': 0.009, 'grad_norm': 3.1039537748588404, 'learning_rate': 9.239384041063929e-08, 'completion_length': 301.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562501192092896, 'reward_std': 0.05654761753976345, 'kl': 0.22509765625, 'epoch': 0.91}
91%|█████████ | 3891/4286 [29:17:30<2:47:00, 25.37s/it] {'loss': 0.0093, 'grad_norm': 6.338783237347599, 'learning_rate': 9.216052263182455e-08, 'completion_length': 300.0, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 0.026785715483129025, 'kl': 0.2333984375, 'epoch': 0.91}
91%|█████████ | 3892/4286 [29:17:55<2:45:33, 25.21s/it] {'loss': 0.0049, 'grad_norm': 4.768665704435204, 'learning_rate': 9.19272048530098e-08, 'completion_length': 309.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.8020834028720856, 'rewards/format_reward': 1.0, 'reward': 1.8020834922790527, 'reward_std': 0.06207263842225075, 'kl': 0.1212158203125, 'epoch': 0.91}
91%|█████████ | 3893/4286 [29:18:20<2:44:51, 25.17s/it] {'loss': 0.0149, 'grad_norm': 1.9298279224019959, 'learning_rate': 9.169388707419504e-08, 'completion_length': 334.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.7895833849906921, 'rewards/format_reward': 1.0, 'reward': 1.789583444595337, 'reward_std': 0.05178571864962578, 'kl': 0.373291015625, 'epoch': 0.91}
91%|█████████ | 3894/4286 [29:18:44<2:42:53, 24.93s/it] {'loss': 0.0065, 'grad_norm': 4.82064124015156, 'learning_rate': 9.14605692953803e-08, 'completion_length': 308.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.8288690745830536, 'rewards/format_reward': 1.0, 'reward': 1.8288691639900208, 'reward_std': 0.038690478540956974, 'kl': 0.162353515625, 'epoch': 0.91}
91%|█████████ | 3895/4286 [29:19:09<2:41:16, 24.75s/it] {'loss': 0.0118, 'grad_norm': 54.776643777116604, 'learning_rate': 9.122725151656556e-08, 'completion_length': 305.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.026884591206908226, 'kl': 0.29443359375, 'epoch': 0.91}
91%|█████████ | 3896/4286 [29:19:36<2:45:09, 25.41s/it] {'loss': 0.0167, 'grad_norm': 14.974029674924125, 'learning_rate': 9.099393373775082e-08, 'completion_length': 333.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7075893580913544, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6718751192092896, 'reward_std': 0.12489315867424011, 'kl': 0.41796875, 'epoch': 0.91}
91%|█████████ | 3897/4286 [29:20:00<2:41:46, 24.95s/it] {'loss': 0.0139, 'grad_norm': 137.11342969672552, 'learning_rate': 9.076061595893607e-08, 'completion_length': 299.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6651785969734192, 'rewards/format_reward': 1.0, 'reward': 1.6651787161827087, 'reward_std': 0.02770655509084463, 'kl': 0.34716796875, 'epoch': 0.91}
91%|█████████ | 3898/4286 [29:20:26<2:43:44, 25.32s/it] {'loss': 0.0111, 'grad_norm': 8.490757332750615, 'learning_rate': 9.052729818012133e-08, 'completion_length': 318.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6398809552192688, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.07364453002810478, 'kl': 0.2783203125, 'epoch': 0.91}
91%|█████████ | 3899/4286 [29:20:50<2:42:03, 25.12s/it] {'loss': 0.002, 'grad_norm': 0.4805550809361917, 'learning_rate': 9.029398040130658e-08, 'completion_length': 295.625, 'rewards/only_full_func_accuracy_reward': 0.8958333432674408, 'rewards/format_reward': 1.0, 'reward': 1.8958333730697632, 'reward_std': 0.01785714365541935, 'kl': 0.05035400390625, 'epoch': 0.91}
91%|█████████ | 3900/4286 [29:21:14<2:39:15, 24.76s/it] {'loss': 0.0157, 'grad_norm': 2.380628544246074, 'learning_rate': 9.006066262249182e-08, 'completion_length': 306.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8002977073192596, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7824405431747437, 'reward_std': 0.060119062196463346, 'kl': 0.39306640625, 'epoch': 0.91}
91%|█████████ | 3901/4286 [29:26:54<12:45:53, 119.36s/it] {'loss': 0.0233, 'grad_norm': 3.287686095000413, 'learning_rate': 8.982734484367709e-08, 'completion_length': 307.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678572535514832, 'reward_std': 0.04406621679663658, 'kl': 0.582763671875, 'epoch': 0.91}
91%|█████████ | 3902/4286 [29:27:16<9:36:53, 90.14s/it] {'loss': 0.0128, 'grad_norm': 4.474188380563502, 'learning_rate': 8.959402706486234e-08, 'completion_length': 249.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.8348214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.816964328289032, 'reward_std': 0.033803027123212814, 'kl': 0.3194580078125, 'epoch': 0.91}
91%|█████████ | 3903/4286 [29:27:40<7:28:26, 70.25s/it] {'loss': 0.0049, 'grad_norm': 4.816346721756161, 'learning_rate': 8.93607092860476e-08, 'completion_length': 320.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.7931548357009888, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.77529776096344, 'reward_std': 0.044642859138548374, 'kl': 0.12353515625, 'epoch': 0.91}
91%|█████████ | 3904/4286 [29:28:03<5:55:42, 55.87s/it] {'loss': 0.0075, 'grad_norm': 3.9852839662701043, 'learning_rate': 8.912739150723285e-08, 'completion_length': 308.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.03869047574698925, 'kl': 0.1865234375, 'epoch': 0.91}
91%|█████████ | 3905/4286 [29:28:24<4:48:37, 45.45s/it] {'loss': 0.0048, 'grad_norm': 7.718635872711194, 'learning_rate': 8.889407372841811e-08, 'completion_length': 250.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.8005952537059784, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.06320715695619583, 'kl': 0.119384765625, 'epoch': 0.91}
91%|█████████ | 3906/4286 [29:28:47<4:05:55, 38.83s/it] {'loss': 0.0225, 'grad_norm': 5.474222467850257, 'learning_rate': 8.866075594960336e-08, 'completion_length': 309.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.05059523694217205, 'kl': 0.560791015625, 'epoch': 0.91}
91%|█████████ | 3907/4286 [29:29:10<3:35:27, 34.11s/it] {'loss': 0.008, 'grad_norm': 2.3432070533539955, 'learning_rate': 8.842743817078862e-08, 'completion_length': 317.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.744047611951828, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.011904759332537651, 'kl': 0.199951171875, 'epoch': 0.91}
91%|█████████ | 3908/4286 [29:29:35<3:17:22, 31.33s/it] {'loss': 0.0095, 'grad_norm': 1.4470321251702758, 'learning_rate': 8.819412039197387e-08, 'completion_length': 325.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.8199405372142792, 'rewards/format_reward': 1.0, 'reward': 1.8199406266212463, 'reward_std': 0.04583755135536194, 'kl': 0.236572265625, 'epoch': 0.91}
91%|█████████ | 3909/4286 [29:30:01<3:07:08, 29.78s/it] {'loss': 0.0246, 'grad_norm': 12.76034711165663, 'learning_rate': 8.796080261315912e-08, 'completion_length': 331.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7023809552192688, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6666667461395264, 'reward_std': 0.20640838146209717, 'kl': 0.615234375, 'epoch': 0.91}
91%|█████████ | 3910/4286 [29:30:26<2:57:42, 28.36s/it] {'loss': 0.0134, 'grad_norm': 3.039097881021079, 'learning_rate': 8.772748483434438e-08, 'completion_length': 297.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.05197649076581001, 'kl': 0.3359375, 'epoch': 0.91}
91%|█████████▏| 3911/4286 [29:30:50<2:48:56, 27.03s/it] {'loss': 0.0172, 'grad_norm': 19.313905448033342, 'learning_rate': 8.749416705552963e-08, 'completion_length': 311.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8437500298023224, 'rewards/format_reward': 1.0, 'reward': 1.8437500596046448, 'reward_std': 0.020326517522335052, 'kl': 0.4296875, 'epoch': 0.91}
91%|█████████▏| 3912/4286 [29:31:15<2:44:40, 26.42s/it] {'loss': 0.0125, 'grad_norm': 4.210471362115721, 'learning_rate': 8.726084927671489e-08, 'completion_length': 311.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.6494048237800598, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.631547749042511, 'reward_std': 0.08214285969734192, 'kl': 0.312255859375, 'epoch': 0.91}
91%|█████████▏| 3913/4286 [29:31:38<2:37:43, 25.37s/it] {'loss': 0.0096, 'grad_norm': 11.298674903329921, 'learning_rate': 8.702753149790014e-08, 'completion_length': 236.46430206298828, 'rewards/only_full_func_accuracy_reward': 0.7008929252624512, 'rewards/format_reward': 1.0, 'reward': 1.700892984867096, 'reward_std': 0.08269625157117844, 'kl': 0.23974609375, 'epoch': 0.91}
91%|█████████▏| 3914/4286 [29:32:03<2:36:44, 25.28s/it] {'loss': 0.0106, 'grad_norm': 27.835427466916098, 'learning_rate': 8.67942137190854e-08, 'completion_length': 296.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.834821492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8169644474983215, 'reward_std': 0.0625, 'kl': 0.265625, 'epoch': 0.91}
[2025-03-03 20:29:49,905] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
91%|█████████▏| 3915/4286 [29:32:27<2:33:32, 24.83s/it] {'loss': 0.0021, 'grad_norm': 3.32397931353432, 'learning_rate': 8.656089594027065e-08, 'completion_length': 292.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7440477013587952, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.0264517730101943, 'kl': 0.0533447265625, 'epoch': 0.91}
[2025-03-03 20:30:16,271] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
91%|█████████▏| 3916/4286 [29:32:53<2:35:58, 25.29s/it] {'loss': 0.0045, 'grad_norm': 5.1308608268529206, 'learning_rate': 8.63275781614559e-08, 'completion_length': 322.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.8018708527088165, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7840136885643005, 'reward_std': 0.09109213948249817, 'kl': 0.1136474609375, 'epoch': 0.91}
91%|█████████▏| 3917/4286 [29:33:18<2:34:55, 25.19s/it] {'loss': 0.0168, 'grad_norm': 18.749383594062614, 'learning_rate': 8.609426038264116e-08, 'completion_length': 316.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.049460720270872116, 'kl': 0.4189453125, 'epoch': 0.91}
91%|█████████▏| 3918/4286 [29:33:44<2:35:00, 25.27s/it] {'loss': 0.0043, 'grad_norm': 4.6889615049305595, 'learning_rate': 8.586094260382641e-08, 'completion_length': 281.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7061012387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.688244104385376, 'reward_std': 0.061172551941126585, 'kl': 0.10693359375, 'epoch': 0.91}
91%|█████████▏| 3919/4286 [29:34:09<2:34:02, 25.18s/it] {'loss': 0.0085, 'grad_norm': 10.195751132801764, 'learning_rate': 8.562762482501167e-08, 'completion_length': 278.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7247024774551392, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.02417590841650963, 'kl': 0.2138671875, 'epoch': 0.91}
91%|█████████▏| 3920/4286 [29:34:34<2:33:28, 25.16s/it] {'loss': 0.0132, 'grad_norm': 17.107447863181754, 'learning_rate': 8.539430704619692e-08, 'completion_length': 286.3393020629883, 'rewards/only_full_func_accuracy_reward': 0.8232143521308899, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8053572177886963, 'reward_std': 0.06282560620456934, 'kl': 0.332275390625, 'epoch': 0.91}
91%|█████████▏| 3921/4286 [29:34:59<2:32:22, 25.05s/it] {'loss': 0.0076, 'grad_norm': 8.392793715423672, 'learning_rate': 8.516098926738218e-08, 'completion_length': 310.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6059524416923523, 'rewards/format_reward': 1.0, 'reward': 1.605952501296997, 'reward_std': 0.026564929634332657, 'kl': 0.19140625, 'epoch': 0.91}
92%|█████████▏| 3922/4286 [29:35:24<2:32:45, 25.18s/it] {'loss': 0.0203, 'grad_norm': 9.500388119718403, 'learning_rate': 8.492767148856743e-08, 'completion_length': 288.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8511905074119568, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.023809521459043026, 'kl': 0.50732421875, 'epoch': 0.92}
92%|█████████▏| 3923/4286 [29:35:48<2:29:40, 24.74s/it] {'loss': 0.0041, 'grad_norm': 0.9367756658346552, 'learning_rate': 8.469435370975268e-08, 'completion_length': 244.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.733631044626236, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.016722630010917783, 'kl': 0.10302734375, 'epoch': 0.92}
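[editor's note] At the steady-state ~25 s/it and ~280-token mean completions, generation-plus-update throughput can be put on a back-of-envelope basis. Using step 3919 above and the 56-completions-per-step inference from earlier (an assumption, not a logged value):

    completions_per_step = 56          # inferred from format_reward values of k/56
    mean_completion_tokens = 278.23    # completion_length at step 3919
    seconds_per_step = 25.18           # it/s stamp at step 3919

    tokens_per_sec = completions_per_step * mean_completion_tokens / seconds_per_step
    print(round(tokens_per_sec))       # ~619 generated tokens/s across the 7 ranks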
92%|█████████▏| 3924/4286 [29:36:11<2:26:16, 24.25s/it] {'loss': 0.0148, 'grad_norm': 1.4942698940839976, 'learning_rate': 8.446103593093794e-08, 'completion_length': 271.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7261905670166016, 'reward_std': 0.05633394047617912, 'kl': 0.369873046875, 'epoch': 0.92}
92%|█████████▏| 3925/4286 [29:36:36<2:28:11, 24.63s/it] {'loss': 0.0033, 'grad_norm': 0.8979245031110523, 'learning_rate': 8.422771815212319e-08, 'completion_length': 333.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.0, 'kl': 0.08251953125, 'epoch': 0.92}
92%|█████████▏| 3926/4286 [29:37:01<2:26:44, 24.46s/it] {'loss': 0.0117, 'grad_norm': 5.813736343143985, 'learning_rate': 8.399440037330845e-08, 'completion_length': 267.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7872024178504944, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.05495268478989601, 'kl': 0.29296875, 'epoch': 0.92}
92%|█████████▏| 3927/4286 [29:37:25<2:26:09, 24.43s/it] {'loss': 0.0065, 'grad_norm': 26.6359481648671, 'learning_rate': 8.37610825944937e-08, 'completion_length': 294.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.05357143096625805, 'kl': 0.161865234375, 'epoch': 0.92}
92%|█████████▏| 3928/4286 [29:37:52<2:29:48, 25.11s/it] {'loss': 0.0105, 'grad_norm': 5.2734584709555605, 'learning_rate': 8.352776481567896e-08, 'completion_length': 298.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.9042208194732666, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.8685065507888794, 'reward_std': 0.12012987583875656, 'kl': 0.2633056640625, 'epoch': 0.92}
92%|█████████▏| 3929/4286 [29:38:16<2:28:36, 24.98s/it] {'loss': 0.0199, 'grad_norm': 4.05450006996765, 'learning_rate': 8.329444703686421e-08, 'completion_length': 306.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.03160357568413019, 'kl': 0.49853515625, 'epoch': 0.92}
92%|█████████▏| 3930/4286 [29:38:42<2:28:43, 25.07s/it] {'loss': 0.0349, 'grad_norm': 14.25418402271459, 'learning_rate': 8.306112925804947e-08, 'completion_length': 308.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8136905431747437, 'rewards/format_reward': 1.0, 'reward': 1.8136906027793884, 'reward_std': 0.017317861318588257, 'kl': 0.869140625, 'epoch': 0.92}
92%|█████████▏| 3931/4286 [29:39:07<2:28:50, 25.16s/it] {'loss': 0.029, 'grad_norm': 151.32104517924188, 'learning_rate': 8.282781147923472e-08, 'completion_length': 340.0714569091797, 'rewards/only_full_func_accuracy_reward': 0.7004677057266235, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6826106905937195, 'reward_std': 0.14720339328050613, 'kl': 0.7255859375, 'epoch': 0.92}
92%|█████████▏| 3932/4286 [29:39:32<2:29:05, 25.27s/it] {'loss': 0.0025, 'grad_norm': 1.6275974755133622, 'learning_rate': 8.259449370041997e-08, 'completion_length': 265.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.721726268529892, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.044642859138548374, 'kl': 0.06201171875, 'epoch': 0.92}
92%|█████████▏| 3933/4286 [29:39:57<2:27:24, 25.06s/it] {'loss': 0.0034, 'grad_norm': 0.6228358093510957, 'learning_rate': 8.236117592160523e-08, 'completion_length': 310.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8928571939468384, 'reward_std': 0.016835875809192657, 'kl': 0.086181640625, 'epoch': 0.92}
92%|█████████▏| 3934/4286 [29:40:22<2:27:25, 25.13s/it] {'loss': 0.0205, 'grad_norm': 11.062051552542687, 'learning_rate': 8.212785814279048e-08, 'completion_length': 325.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.04869436239823699, 'kl': 0.51171875, 'epoch': 0.92}
92%|█████████▏| 3935/4286 [29:40:46<2:24:18, 24.67s/it] {'loss': 0.0041, 'grad_norm': 0.3944075504056275, 'learning_rate': 8.189454036397574e-08, 'completion_length': 240.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.7738096117973328, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.0, 'kl': 0.1026611328125, 'epoch': 0.92}
92%|█████████▏| 3936/4286 [29:41:10<2:22:17, 24.39s/it] {'loss': 0.0173, 'grad_norm': 9.39140280109632, 'learning_rate': 8.166122258516099e-08, 'completion_length': 268.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.07205146946944296, 'kl': 0.4326171875, 'epoch': 0.92}
92%|█████████▏| 3937/4286 [29:41:36<2:25:13, 24.97s/it] {'loss': 0.0041, 'grad_norm': 0.7316154923402847, 'learning_rate': 8.142790480634625e-08, 'completion_length': 316.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.7648810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.01785714365541935, 'kl': 0.101806640625, 'epoch': 0.92}
92%|█████████▏| 3938/4286 [29:42:01<2:24:51, 24.98s/it] {'loss': 0.0036, 'grad_norm': 8.316744226662072, 'learning_rate': 8.11945870275315e-08, 'completion_length': 269.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.05449226312339306, 'kl': 0.0897216796875, 'epoch': 0.92}
92%|█████████▏| 3939/4286 [29:42:26<2:24:20, 24.96s/it] {'loss': 0.0259, 'grad_norm': 21.127525214353675, 'learning_rate': 8.096126924871675e-08, 'completion_length': 288.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.06432079989463091, 'kl': 0.64453125, 'epoch': 0.92}
92%|█████████▏| 3940/4286 [29:42:50<2:23:09, 24.82s/it] {'loss': 0.0119, 'grad_norm': 4.6033154262125775, 'learning_rate': 8.072795146990201e-08, 'completion_length': 310.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7872024178504944, 'rewards/format_reward': 1.0, 'reward': 1.7872024774551392, 'reward_std': 0.05568545428104699, 'kl': 0.29833984375, 'epoch': 0.92}
92%|█████████▏| 3941/4286 [29:43:14<2:21:22, 24.59s/it] {'loss': 0.0098, 'grad_norm': 3.21771504851196, 'learning_rate': 8.049463369108726e-08, 'completion_length': 310.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6979166865348816, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.026785715483129025, 'kl': 0.2454833984375, 'epoch': 0.92}
92%|█████████▏| 3942/4286 [29:43:38<2:20:06, 24.44s/it] {'loss': 0.004, 'grad_norm': 4.95636692659111, 'learning_rate': 8.026131591227252e-08, 'completion_length': 289.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7738095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7738096117973328, 'reward_std': 0.059523805975914, 'kl': 0.09912109375, 'epoch': 0.92}
92%|█████████▏| 3943/4286 [29:44:01<2:17:11, 24.00s/it] {'loss': 0.0225, 'grad_norm': 32.46171064918997, 'learning_rate': 8.002799813345777e-08, 'completion_length': 299.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.9287067353725433, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.910849690437317, 'reward_std': 0.07115800864994526, 'kl': 0.5654296875, 'epoch': 0.92}
92%|█████████▏| 3944/4286 [29:44:25<2:16:34, 23.96s/it] {'loss': 0.0024, 'grad_norm': 6.135309278279011, 'learning_rate': 7.979468035464303e-08, 'completion_length': 296.125, 'rewards/only_full_func_accuracy_reward': 0.8854167461395264, 'rewards/format_reward': 1.0, 'reward': 1.8854168057441711, 'reward_std': 0.056547620333731174, 'kl': 0.0601806640625, 'epoch': 0.92}
92%|█████████▏| 3945/4286 [29:44:50<2:16:54, 24.09s/it] {'loss': 0.0124, 'grad_norm': 25.873631945595463, 'learning_rate': 7.956136257582828e-08, 'completion_length': 304.8393096923828, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.04464286006987095, 'kl': 0.308837890625, 'epoch': 0.92}
92%|█████████▏| 3946/4286 [29:45:13<2:15:12, 23.86s/it] {'loss': 0.0158, 'grad_norm': 14.707236084744927, 'learning_rate': 7.932804479701353e-08, 'completion_length': 272.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 1.0, 'reward': 1.71577388048172, 'reward_std': 0.08060387335717678, 'kl': 0.39501953125, 'epoch': 0.92}
92%|█████████▏| 3947/4286 [29:45:38<2:16:16, 24.12s/it] {'loss': 0.0128, 'grad_norm': 6.7039481922062265, 'learning_rate': 7.909472701819879e-08, 'completion_length': 323.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.8297619521617889, 'rewards/format_reward': 1.0, 'reward': 1.8297619819641113, 'reward_std': 0.03038606606423855, 'kl': 0.3212890625, 'epoch': 0.92}
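Each step in this log prints its metrics as a Python dict literal, so the run can be analyzed after the fact with only the standard library. A minimal parsing sketch, assuming the console output was captured to a file (the name train.log is a placeholder):

# Extract the per-step metric dicts from a captured log like this one.
# The dicts are valid Python literals, so ast.literal_eval parses them safely.
import ast, re

DICT_RE = re.compile(r"\{'loss':.*?'epoch': [0-9.]+\}")

with open("train.log") as f:
    steps = [ast.literal_eval(m.group(0)) for m in DICT_RE.finditer(f.read())]

losses = [s['loss'] for s in steps]
mean_reward = sum(s['reward'] for s in steps) / len(steps)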
92%|█████████▏| 3948/4286 [29:46:03<2:17:24, 24.39s/it] {'loss': 0.0124, 'grad_norm': 4.919386568886513, 'learning_rate': 7.886140923938404e-08, 'completion_length': 333.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.71577388048172, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.02083333395421505, 'kl': 0.31103515625, 'epoch': 0.92}
[2025-03-03 20:43:51,208] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
92%|█████████▏| 3949/4286 [29:46:28<2:18:54, 24.73s/it] {'loss': 0.0041, 'grad_norm': 5.639117427265862, 'learning_rate': 7.86280914605693e-08, 'completion_length': 273.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.8229167461395264, 'rewards/format_reward': 1.0, 'reward': 1.8229167461395264, 'reward_std': 0.014880955684930086, 'kl': 0.103271484375, 'epoch': 0.92}
92%|█████████▏| 3950/4286 [29:46:52<2:17:22, 24.53s/it] {'loss': 0.0069, 'grad_norm': 7.8700920495948, 'learning_rate': 7.839477368175455e-08, 'completion_length': 309.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.03068273700773716, 'kl': 0.17236328125, 'epoch': 0.92}
92%|█████████▏| 3951/4286 [29:47:18<2:19:16, 24.94s/it] {'loss': 0.0123, 'grad_norm': 4.323063815036927, 'learning_rate': 7.816145590293981e-08, 'completion_length': 324.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.7544643878936768, 'reward_std': 0.06951950863003731, 'kl': 0.306640625, 'epoch': 0.92}
92%|█████████▏| 3952/4286 [29:47:43<2:18:47, 24.93s/it] {'loss': 0.0037, 'grad_norm': 9.266014492227981, 'learning_rate': 7.792813812412505e-08, 'completion_length': 305.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.723214328289032, 'reward_std': 0.02976190485060215, 'kl': 0.0933837890625, 'epoch': 0.92}
92%|█████████▏| 3953/4286 [29:48:07<2:16:56, 24.67s/it] {'loss': 0.0028, 'grad_norm': 0.9380355247284046, 'learning_rate': 7.76948203453103e-08, 'completion_length': 306.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.0, 'kl': 0.0703125, 'epoch': 0.92}
[2025-03-03 20:45:53,079] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
92%|█████████▏| 3954/4286 [29:48:30<2:13:37, 24.15s/it] {'loss': 0.0065, 'grad_norm': 5.5491071076843905, 'learning_rate': 7.746150256649556e-08, 'completion_length': 265.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7931548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7931548953056335, 'reward_std': 0.0625000037252903, 'kl': 0.16259765625, 'epoch': 0.92}
92%|█████████▏| 3955/4286 [29:48:54<2:13:25, 24.19s/it] {'loss': 0.0049, 'grad_norm': 5.3242631593512995, 'learning_rate': 7.722818478768081e-08, 'completion_length': 314.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 1.0, 'reward': 1.763392984867096, 'reward_std': 0.02611161395907402, 'kl': 0.12255859375, 'epoch': 0.92}
92%|█████████▏| 3956/4286 [29:49:21<2:17:23, 24.98s/it] {'loss': 0.0049, 'grad_norm': 14.30291683853566, 'learning_rate': 7.699486700886607e-08, 'completion_length': 315.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7389881312847137, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6854167580604553, 'reward_std': 0.12603867519646883, 'kl': 0.12255859375, 'epoch': 0.92}
92%|█████████▏| 3957/4286 [29:49:47<2:18:25, 25.24s/it] {'loss': 0.0072, 'grad_norm': 1.1554418284739298, 'learning_rate': 7.676154923005132e-08, 'completion_length': 313.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.52976194024086, 'rewards/format_reward': 1.0, 'reward': 1.5297620296478271, 'reward_std': 0.042587507516145706, 'kl': 0.179931640625, 'epoch': 0.92}
92%|█████████▏| 3958/4286 [29:50:13<2:18:16, 25.29s/it] {'loss': 0.0073, 'grad_norm': 8.706118645656122, 'learning_rate': 7.652823145123658e-08, 'completion_length': 299.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.7747024595737457, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7568453550338745, 'reward_std': 0.12775479070842266, 'kl': 0.180908203125, 'epoch': 0.92}
92%|█████████▏| 3959/4286 [29:50:37<2:16:37, 25.07s/it] {'loss': 0.0029, 'grad_norm': 1.5537866551111224, 'learning_rate': 7.629491367242183e-08, 'completion_length': 334.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.026785715483129025, 'kl': 0.072998046875, 'epoch': 0.92}
92%|█████████▏| 3960/4286 [29:51:02<2:16:33, 25.13s/it] {'loss': 0.0173, 'grad_norm': 9.188681288614097, 'learning_rate': 7.606159589360709e-08, 'completion_length': 311.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7023810744285583, 'reward_std': 0.05633394047617912, 'kl': 0.432373046875, 'epoch': 0.92}
92%|█████████▏| 3961/4286 [29:51:26<2:14:26, 24.82s/it] {'loss': 0.0136, 'grad_norm': 3.5399054744428073, 'learning_rate': 7.582827811479234e-08, 'completion_length': 306.625, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.06252413056790829, 'kl': 0.3409423828125, 'epoch': 0.92}
92%|█████████▏| 3962/4286 [29:51:50<2:12:18, 24.50s/it] {'loss': 0.0122, 'grad_norm': 10.95555523213113, 'learning_rate': 7.559496033597759e-08, 'completion_length': 289.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.08861161768436432, 'kl': 0.3057861328125, 'epoch': 0.92}
92%|█████████▏| 3963/4286 [29:52:16<2:14:10, 24.92s/it] {'loss': 0.0029, 'grad_norm': 0.7515253881358064, 'learning_rate': 7.536164255716285e-08, 'completion_length': 307.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.7544643878936768, 'reward_std': 0.04740536957979202, 'kl': 0.072509765625, 'epoch': 0.92}
92%|█████████▏| 3964/4286 [29:52:41<2:13:59, 24.97s/it] {'loss': 0.0028, 'grad_norm': 4.528355331291823, 'learning_rate': 7.51283247783481e-08, 'completion_length': 286.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.832341343164444, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8144842386245728, 'reward_std': 0.10349582135677338, 'kl': 0.070556640625, 'epoch': 0.92}
93%|█████████▎| 3965/4286 [29:53:06<2:12:49, 24.83s/it] {'loss': 0.003, 'grad_norm': 6.948577234290059, 'learning_rate': 7.489500699953336e-08, 'completion_length': 307.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.047405365854501724, 'kl': 0.074462890625, 'epoch': 0.93}
93%|█████████▎| 3966/4286 [29:53:31<2:13:14, 24.98s/it] {'loss': 0.0274, 'grad_norm': 15.808157320198053, 'learning_rate': 7.466168922071861e-08, 'completion_length': 306.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7529763579368591, 'reward_std': 0.08480328135192394, 'kl': 0.68359375, 'epoch': 0.93}
93%|█████████▎| 3967/4286 [29:53:57<2:13:45, 25.16s/it] {'loss': 0.0112, 'grad_norm': 3.306390810814985, 'learning_rate': 7.442837144190387e-08, 'completion_length': 328.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7247024476528168, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.0356568843126297, 'kl': 0.279296875, 'epoch': 0.93}
93%|█████████▎| 3968/4286 [29:54:20<2:11:12, 24.76s/it] {'loss': 0.0169, 'grad_norm': 4.133615276780163, 'learning_rate': 7.419505366308912e-08, 'completion_length': 253.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.8318452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.813988208770752, 'reward_std': 0.0625, 'kl': 0.421875, 'epoch': 0.93}
93%|█████████▎| 3969/4286 [29:54:43<2:08:07, 24.25s/it] {'loss': 0.0073, 'grad_norm': 10.909613676699914, 'learning_rate': 7.396173588427437e-08, 'completion_length': 296.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.7113096117973328, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.044342201203107834, 'kl': 0.182373046875, 'epoch': 0.93}
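In every line here, 'reward' equals the sum of 'rewards/only_full_func_accuracy_reward' and 'rewards/format_reward' up to float32 rounding, which suggests the total reward is just the additive composition of the two reward functions. A quick check against step 3969 above, assuming that composition:

# 'reward' appears to be accuracy reward + format reward (float32 rounding aside).
acc, fmt, total = 0.7113096117973328, 1.0, 1.7113096714019775  # values from step 3969
assert abs((acc + fmt) - total) < 1e-6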
[2025-03-03 20:52:32,645] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
93%|█████████▎| 3970/4286 [29:55:10<2:10:51, 24.85s/it] {'loss': 0.0065, 'grad_norm': 11.173715184436746, 'learning_rate': 7.372841810545963e-08, 'completion_length': 308.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7916667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7916667461395264, 'reward_std': 0.06030111946165562, 'kl': 0.1619873046875, 'epoch': 0.93}
93%|█████████▎| 3971/4286 [29:55:35<2:10:29, 24.85s/it] {'loss': 0.0089, 'grad_norm': 1.3361867438638066, 'learning_rate': 7.349510032664488e-08, 'completion_length': 324.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.020619653165340424, 'kl': 0.22119140625, 'epoch': 0.93}
[2025-03-03 20:53:23,837] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
93%|█████████▎| 3972/4286 [29:56:01<2:12:22, 25.29s/it] {'loss': 0.055, 'grad_norm': 1132.1025743740709, 'learning_rate': 7.326178254783014e-08, 'completion_length': 300.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.8693452775478363, 'rewards/format_reward': 1.0, 'reward': 1.8693453669548035, 'reward_std': 0.011309522204101086, 'kl': 1.37353515625, 'epoch': 0.93}
93%|█████████▎| 3973/4286 [29:56:25<2:10:44, 25.06s/it] {'loss': 0.0043, 'grad_norm': 4.303305216884792, 'learning_rate': 7.302846476901539e-08, 'completion_length': 306.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529762387275696, 'reward_std': 0.06815172731876373, 'kl': 0.107177734375, 'epoch': 0.93}
93%|█████████▎| 3974/4286 [29:56:50<2:09:33, 24.91s/it] {'loss': 0.0093, 'grad_norm': 12.368042183895275, 'learning_rate': 7.279514699020065e-08, 'completion_length': 281.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.11107293516397476, 'kl': 0.23193359375, 'epoch': 0.93}
93%|█████████▎| 3975/4286 [29:57:14<2:08:17, 24.75s/it] {'loss': 0.011, 'grad_norm': 2.709288314361918, 'learning_rate': 7.25618292113859e-08, 'completion_length': 285.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8660715520381927, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8482144474983215, 'reward_std': 0.06136547774076462, 'kl': 0.2744140625, 'epoch': 0.93}
[2025-03-03 20:55:02,710] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
93%|█████████▎| 3976/4286 [29:57:40<2:08:54, 24.95s/it] {'loss': 0.0052, 'grad_norm': 3.9478958421419015, 'learning_rate': 7.232851143257115e-08, 'completion_length': 312.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.8437500596046448, 'rewards/format_reward': 1.0, 'reward': 1.8437501192092896, 'reward_std': 0.0267857164144516, 'kl': 0.12908935546875, 'epoch': 0.93}
93%|█████████▎| 3977/4286 [29:58:05<2:08:36, 24.97s/it] {'loss': 0.0064, 'grad_norm': 8.333065588373255, 'learning_rate': 7.209519365375641e-08, 'completion_length': 325.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.827381044626236, 'rewards/format_reward': 1.0, 'reward': 1.8273810744285583, 'reward_std': 0.04781556874513626, 'kl': 0.1611328125, 'epoch': 0.93}
93%|█████████▎| 3978/4286 [29:58:30<2:08:20, 25.00s/it] {'loss': 0.0077, 'grad_norm': 2.953427458623121, 'learning_rate': 7.186187587494166e-08, 'completion_length': 298.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038692235946655, 'reward_std': 0.0505952425301075, 'kl': 0.192626953125, 'epoch': 0.93}
93%|█████████▎| 3979/4286 [29:58:57<2:10:50, 25.57s/it] {'loss': 0.0126, 'grad_norm': 1.471128027681057, 'learning_rate': 7.162855809612692e-08, 'completion_length': 337.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7247024178504944, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.01709691435098648, 'kl': 0.3154296875, 'epoch': 0.93}
93%|█████████▎| 3980/4286 [29:59:22<2:09:30, 25.39s/it] {'loss': 0.0162, 'grad_norm': 3.4997963674038943, 'learning_rate': 7.139524031731217e-08, 'completion_length': 314.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6205357313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6026785969734192, 'reward_std': 0.1048240102827549, 'kl': 0.404296875, 'epoch': 0.93}
93%|█████████▎| 3981/4286 [29:59:47<2:08:08, 25.21s/it] {'loss': 0.0026, 'grad_norm': 5.427063714970545, 'learning_rate': 7.116192253849743e-08, 'completion_length': 301.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.9166666865348816, 'rewards/format_reward': 1.0, 'reward': 1.9166667461395264, 'reward_std': 0.025651197880506516, 'kl': 0.0648193359375, 'epoch': 0.93}
93%|█████████▎| 3982/4286 [30:00:10<2:05:17, 24.73s/it] {'loss': 0.0087, 'grad_norm': 5.959877775407235, 'learning_rate': 7.092860475968268e-08, 'completion_length': 312.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.021293753758072853, 'kl': 0.21826171875, 'epoch': 0.93}
93%|█████████▎| 3983/4286 [30:00:36<2:06:25, 25.04s/it] {'loss': 0.0157, 'grad_norm': 2.7491137601510203, 'learning_rate': 7.069528698086793e-08, 'completion_length': 320.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.7239583730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7061013579368591, 'reward_std': 0.09957962296903133, 'kl': 0.3935546875, 'epoch': 0.93}
93%|█████████▎| 3984/4286 [30:01:02<2:07:10, 25.27s/it] {'loss': 0.0153, 'grad_norm': 3.906796529849661, 'learning_rate': 7.046196920205319e-08, 'completion_length': 309.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.05038155196234584, 'kl': 0.3818359375, 'epoch': 0.93}
93%|█████████▎| 3985/4286 [30:01:25<2:03:39, 24.65s/it] {'loss': 0.0174, 'grad_norm': 2.120523441929747, 'learning_rate': 7.022865142323844e-08, 'completion_length': 268.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.8477892279624939, 'rewards/format_reward': 1.0, 'reward': 1.847789227962494, 'reward_std': 0.030612248927354813, 'kl': 0.435546875, 'epoch': 0.93}
93%|█████████▎| 3986/4286 [30:01:51<2:05:17, 25.06s/it] {'loss': 0.008, 'grad_norm': 1.963480382521085, 'learning_rate': 6.99953336444237e-08, 'completion_length': 323.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.7342262268066406, 'rewards/format_reward': 1.0, 'reward': 1.7342263460159302, 'reward_std': 0.0478842988377437, 'kl': 0.19970703125, 'epoch': 0.93}
93%|█████████▎| 3987/4286 [30:02:16<2:04:43, 25.03s/it] {'loss': 0.0097, 'grad_norm': 6.9766763879022, 'learning_rate': 6.976201586560895e-08, 'completion_length': 327.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6532738208770752, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.07876221276819706, 'kl': 0.24365234375, 'epoch': 0.93}
93%|█████████▎| 3988/4286 [30:02:40<2:03:30, 24.87s/it] {'loss': 0.0169, 'grad_norm': 15.644735954910615, 'learning_rate': 6.952869808679421e-08, 'completion_length': 287.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8943453133106232, 'rewards/format_reward': 1.0, 'reward': 1.8943454027175903, 'reward_std': 0.0208333320915699, 'kl': 0.421875, 'epoch': 0.93}
93%|█████████▎| 3989/4286 [30:03:04<2:01:05, 24.46s/it] {'loss': 0.0092, 'grad_norm': 49.28633381993717, 'learning_rate': 6.929538030797946e-08, 'completion_length': 279.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.07142857648432255, 'kl': 0.230712890625, 'epoch': 0.93}
93%|█████████▎| 3990/4286 [30:03:29<2:02:01, 24.74s/it] {'loss': 0.0035, 'grad_norm': 1.0568213893520373, 'learning_rate': 6.906206252916472e-08, 'completion_length': 323.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7979167699813843, 'rewards/format_reward': 1.0, 'reward': 1.797916829586029, 'reward_std': 0.017676749266684055, 'kl': 0.087890625, 'epoch': 0.93}
93%|█████████▎| 3991/4286 [30:03:55<2:03:17, 25.08s/it] {'loss': 0.0225, 'grad_norm': 1.7078091250585319, 'learning_rate': 6.882874475034997e-08, 'completion_length': 277.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7976191639900208, 'reward_std': 0.07867459766566753, 'kl': 0.559814453125, 'epoch': 0.93}
93%|█████████▎| 3992/4286 [30:04:23<2:06:47, 25.88s/it] {'loss': 0.0063, 'grad_norm': 9.457358772089655, 'learning_rate': 6.859542697153522e-08, 'completion_length': 334.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7257653772830963, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7079082131385803, 'reward_std': 0.10143951699137688, 'kl': 0.15673828125, 'epoch': 0.93}
93%|█████████▎| 3993/4286 [30:04:49<2:06:26, 25.89s/it] {'loss': 0.0173, 'grad_norm': 0.5839728566103678, 'learning_rate': 6.836210919272048e-08, 'completion_length': 290.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410714626312256, 'reward_std': 0.029761902987957, 'kl': 0.43359375, 'epoch': 0.93}
93%|█████████▎| 3994/4286 [30:05:13<2:03:32, 25.39s/it] {'loss': 0.01, 'grad_norm': 0.6714669270611584, 'learning_rate': 6.812879141390573e-08, 'completion_length': 299.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 1.0, 'reward': 1.7440477013587952, 'reward_std': 0.016593413427472115, 'kl': 0.25, 'epoch': 0.93}
93%|█████████▎| 3995/4286 [30:05:40<2:05:01, 25.78s/it] {'loss': 0.0154, 'grad_norm': 37.4971386242193, 'learning_rate': 6.789547363509099e-08, 'completion_length': 282.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6964285969734192, 'reward_std': 0.10908934846520424, 'kl': 0.3857421875, 'epoch': 0.93}
93%|█████████▎| 3996/4286 [30:06:05<2:04:01, 25.66s/it] {'loss': 0.0069, 'grad_norm': 1.3087347954060629, 'learning_rate': 6.766215585627624e-08, 'completion_length': 313.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 1.0, 'reward': 1.7425596117973328, 'reward_std': 0.019238398410379887, 'kl': 0.17236328125, 'epoch': 0.93}
93%|█████████▎| 3997/4286 [30:06:29<2:01:20, 25.19s/it] {'loss': 0.0063, 'grad_norm': 7.5387282054256, 'learning_rate': 6.74288380774615e-08, 'completion_length': 263.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.04464286006987095, 'kl': 0.1566162109375, 'epoch': 0.93}
93%|█████████▎| 3998/4286 [30:06:54<2:00:49, 25.17s/it] {'loss': 0.0024, 'grad_norm': 5.151932007754173, 'learning_rate': 6.719552029864675e-08, 'completion_length': 299.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.9002976715564728, 'rewards/format_reward': 1.0, 'reward': 1.9002977013587952, 'reward_std': 0.06249999813735485, 'kl': 0.0609130859375, 'epoch': 0.93}
93%|█████████▎| 3999/4286 [30:07:19<1:59:47, 25.04s/it] {'loss': 0.0084, 'grad_norm': 2.3287171645447464, 'learning_rate': 6.6962202519832e-08, 'completion_length': 288.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7470238506793976, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.04920229874551296, 'kl': 0.21044921875, 'epoch': 0.93}
93%|█████████▎| 4000/4286 [30:07:44<1:58:31, 24.87s/it] {'loss': 0.0323, 'grad_norm': 5.249512379572606, 'learning_rate': 6.672888474101726e-08, 'completion_length': 307.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5803572535514832, 'reward_std': 0.1122405119240284, 'kl': 0.8076171875, 'epoch': 0.93}
93%|█████████▎| 4001/4286 [30:13:31<9:38:21, 121.76s/it] {'loss': 0.0044, 'grad_norm': 1.8039683853568138, 'learning_rate': 6.649556696220251e-08, 'completion_length': 287.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.8869048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8869048953056335, 'reward_std': 0.011904759332537651, 'kl': 0.10986328125, 'epoch': 0.93}
93%|█████████▎| 4002/4286 [30:13:54<7:15:20, 91.97s/it] {'loss': 0.0021, 'grad_norm': 0.6174575581037512, 'learning_rate': 6.626224918338777e-08, 'completion_length': 308.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7693453431129456, 'rewards/format_reward': 1.0, 'reward': 1.7693454027175903, 'reward_std': 0.05495268478989601, 'kl': 0.05291748046875, 'epoch': 0.93}
93%|█████████▎| 4003/4286 [30:14:17<5:36:09, 71.27s/it] {'loss': 0.0052, 'grad_norm': 29.587351664762185, 'learning_rate': 6.602893140457302e-08, 'completion_length': 283.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7544643580913544, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.0386904738843441, 'kl': 0.128662109375, 'epoch': 0.93}
93%|█████████▎| 4004/4286 [30:14:41<4:28:48, 57.19s/it] {'loss': 0.0244, 'grad_norm': 3.508689451043311, 'learning_rate': 6.579561362575828e-08, 'completion_length': 314.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.846726268529892, 'rewards/format_reward': 1.0, 'reward': 1.8467262983322144, 'reward_std': 0.07752130459994078, 'kl': 0.6124267578125, 'epoch': 0.93}
93%|█████████▎| 4005/4286 [30:15:06<3:42:23, 47.49s/it] {'loss': 0.0042, 'grad_norm': 28.91965982948819, 'learning_rate': 6.556229584694353e-08, 'completion_length': 325.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7526785731315613, 'rewards/format_reward': 1.0, 'reward': 1.7526786923408508, 'reward_std': 0.039072034414857626, 'kl': 0.104248046875, 'epoch': 0.93}
93%|█████████▎| 4006/4286 [30:15:31<3:09:38, 40.64s/it] {'loss': 0.0087, 'grad_norm': 4.56142765033352, 'learning_rate': 6.532897806812878e-08, 'completion_length': 321.375, 'rewards/only_full_func_accuracy_reward': 0.8318452835083008, 'rewards/format_reward': 1.0, 'reward': 1.8318454027175903, 'reward_std': 0.026785715483129025, 'kl': 0.218505859375, 'epoch': 0.93}
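The logged learning rates fall on a straight line that reaches zero exactly at the final step, 4286, from a peak of 1e-06: for step 4001, 1e-06 * (4286 - 4001) / 4286 ≈ 6.6496e-08, matching the logged 6.649556696220251e-08. (The jump from roughly 25 s/it to 121.76 s/it at step 4001 would be consistent with a checkpoint save after step 4000, though the log itself does not say so.) A reconstruction of the schedule, assuming linear decay to zero:

# Linear-decay LR schedule inferred from the logged values (an assumption read
# off the numbers, not from any config): peak 1e-06, reaching 0 at step 4286.
def lr(step: int, total: int = 4286, peak: float = 1e-06) -> float:
    return peak * (total - step) / total

print(lr(4001))  # ~6.6496e-08, matching the log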
93%|█████████▎| 4007/4286 [30:15:54<2:45:15, 35.54s/it] {'loss': 0.0101, 'grad_norm': 8.418182558778891, 'learning_rate': 6.509566028931404e-08, 'completion_length': 303.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.752976268529892, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.06412800028920174, 'kl': 0.25244140625, 'epoch': 0.93}
[2025-03-03 21:13:43,159] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
94%|█████████▎| 4008/4286 [30:16:20<2:31:20, 32.66s/it] {'loss': 0.0116, 'grad_norm': 6.970000926657417, 'learning_rate': 6.486234251049929e-08, 'completion_length': 291.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.7222222983837128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7043652534484863, 'reward_std': 0.12379425764083862, 'kl': 0.28955078125, 'epoch': 0.94}
94%|█████████▎| 4009/4286 [30:16:45<2:19:53, 30.30s/it] {'loss': 0.0151, 'grad_norm': 2.668664725534749, 'learning_rate': 6.462902473168455e-08, 'completion_length': 317.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.02380952797830105, 'kl': 0.37890625, 'epoch': 0.94}
94%|█████████▎| 4010/4286 [30:17:10<2:12:24, 28.79s/it] {'loss': 0.0132, 'grad_norm': 10.762754611287885, 'learning_rate': 6.43957069528698e-08, 'completion_length': 330.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.789806604385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7719495296478271, 'reward_std': 0.10172084346413612, 'kl': 0.3310546875, 'epoch': 0.94}
94%|█████████▎| 4011/4286 [30:17:35<2:06:01, 27.50s/it] {'loss': 0.0109, 'grad_norm': 2.7209369427227927, 'learning_rate': 6.416238917405506e-08, 'completion_length': 300.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7462798058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7284227013587952, 'reward_std': 0.1026785783469677, 'kl': 0.2734375, 'epoch': 0.94}
94%|█████████▎| 4012/4286 [30:17:59<2:01:21, 26.57s/it] {'loss': 0.0156, 'grad_norm': 0.9743574624810147, 'learning_rate': 6.392907139524031e-08, 'completion_length': 280.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 1.0, 'reward': 1.7708334922790527, 'reward_std': 0.005952378269284964, 'kl': 0.390380859375, 'epoch': 0.94}
94%|█████████▎| 4013/4286 [30:18:25<1:59:37, 26.29s/it] {'loss': 0.0128, 'grad_norm': 1.2873110922838975, 'learning_rate': 6.369575361642557e-08, 'completion_length': 285.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.02816697023808956, 'kl': 0.3193359375, 'epoch': 0.94}
94%|█████████▎| 4014/4286 [30:18:49<1:56:05, 25.61s/it] {'loss': 0.0105, 'grad_norm': 9.117886874830793, 'learning_rate': 6.346243583761082e-08, 'completion_length': 260.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.7568452656269073, 'rewards/format_reward': 1.0, 'reward': 1.7568453550338745, 'reward_std': 0.03749999403953552, 'kl': 0.261962890625, 'epoch': 0.94}
94%|█████████▎| 4015/4286 [30:19:15<1:55:51, 25.65s/it] {'loss': 0.0277, 'grad_norm': 2.9237357045017127, 'learning_rate': 6.322911805879607e-08, 'completion_length': 316.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.8377977013587952, 'rewards/format_reward': 1.0, 'reward': 1.83779776096344, 'reward_std': 0.05281119979918003, 'kl': 0.694580078125, 'epoch': 0.94}
94%|█████████▎| 4016/4286 [30:19:39<1:53:44, 25.28s/it] {'loss': 0.0049, 'grad_norm': 3.2042065268576203, 'learning_rate': 6.299580027998133e-08, 'completion_length': 314.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7193453013896942, 'rewards/format_reward': 1.0, 'reward': 1.7193453907966614, 'reward_std': 0.041722627356648445, 'kl': 0.123046875, 'epoch': 0.94}
94%|█████████▎| 4017/4286 [30:20:02<1:50:52, 24.73s/it] {'loss': 0.0118, 'grad_norm': 7.85688167121702, 'learning_rate': 6.276248250116658e-08, 'completion_length': 295.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.04413222428411245, 'kl': 0.2958984375, 'epoch': 0.94}
94%|█████████▎| 4018/4286 [30:20:27<1:49:54, 24.61s/it] {'loss': 0.014, 'grad_norm': 1.7774613744412977, 'learning_rate': 6.252916472235184e-08, 'completion_length': 299.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.692956417798996, 'rewards/format_reward': 1.0, 'reward': 1.6929564476013184, 'reward_std': 0.0188492052257061, 'kl': 0.348876953125, 'epoch': 0.94}
94%|█████████▍| 4019/4286 [30:20:52<1:50:29, 24.83s/it] {'loss': 0.0056, 'grad_norm': 0.4314680180155785, 'learning_rate': 6.229584694353709e-08, 'completion_length': 318.0, 'rewards/only_full_func_accuracy_reward': 0.8519345819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8340774774551392, 'reward_std': 0.0580357164144516, 'kl': 0.138427734375, 'epoch': 0.94}
94%|█████████▍| 4020/4286 [30:21:16<1:48:57, 24.58s/it] {'loss': 0.0042, 'grad_norm': 2.229209883612968, 'learning_rate': 6.206252916472235e-08, 'completion_length': 323.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.7741072177886963, 'rewards/format_reward': 1.0, 'reward': 1.774107277393341, 'reward_std': 0.05138125829398632, 'kl': 0.10498046875, 'epoch': 0.94}
94%|█████████▍| 4021/4286 [30:21:41<1:49:37, 24.82s/it] {'loss': 0.0056, 'grad_norm': 2.625025388005478, 'learning_rate': 6.18292113859076e-08, 'completion_length': 327.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.8154761791229248, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 'reward_std': 0.035702604334801435, 'kl': 0.140380859375, 'epoch': 0.94}
94%|█████████▍| 4022/4286 [30:22:07<1:49:30, 24.89s/it] {'loss': 0.016, 'grad_norm': 5.52504760652245, 'learning_rate': 6.159589360709285e-08, 'completion_length': 302.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7842262089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7663691639900208, 'reward_std': 0.07933587580919266, 'kl': 0.3994140625, 'epoch': 0.94}
94%|█████████▍| 4023/4286 [30:22:31<1:48:21, 24.72s/it] {'loss': 0.0161, 'grad_norm': 29.877035183945075, 'learning_rate': 6.136257582827811e-08, 'completion_length': 316.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.062360797077417374, 'kl': 0.40283203125, 'epoch': 0.94}
94%|█████████▍| 4024/4286 [30:22:56<1:48:35, 24.87s/it] {'loss': 0.0066, 'grad_norm': 25.28612249806289, 'learning_rate': 6.112925804946336e-08, 'completion_length': 320.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.8125000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.023809525184333324, 'kl': 0.1640625, 'epoch': 0.94}
94%|█████████▍| 4025/4286 [30:23:20<1:46:37, 24.51s/it] {'loss': 0.0125, 'grad_norm': 4.399868129060217, 'learning_rate': 6.089594027064862e-08, 'completion_length': 276.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7321429550647736, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.07142856903374195, 'kl': 0.3125, 'epoch': 0.94}
94%|█████████▍| 4026/4286 [30:23:43<1:45:00, 24.23s/it] {'loss': 0.0049, 'grad_norm': 0.486526872828911, 'learning_rate': 6.066262249183387e-08, 'completion_length': 272.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.0, 'kl': 0.1220703125, 'epoch': 0.94}
94%|█████████▍| 4027/4286 [30:24:08<1:44:52, 24.30s/it] {'loss': 0.0162, 'grad_norm': 3.55680551656968, 'learning_rate': 6.042930471301914e-08, 'completion_length': 308.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7767857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.02724613156169653, 'kl': 0.405517578125, 'epoch': 0.94}
94%|█████████▍| 4028/4286 [30:24:32<1:44:06, 24.21s/it] {'loss': 0.0053, 'grad_norm': 0.6784328999969222, 'learning_rate': 6.019598693420438e-08, 'completion_length': 289.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.0, 'kl': 0.13330078125, 'epoch': 0.94}
94%|█████████▍| 4029/4286 [30:24:56<1:43:24, 24.14s/it] {'loss': 0.0073, 'grad_norm': 7.04758882202508, 'learning_rate': 5.996266915538963e-08, 'completion_length': 292.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187501788139343, 'reward_std': 0.0863095223903656, 'kl': 0.18212890625, 'epoch': 0.94}
94%|█████████▍| 4030/4286 [30:25:24<1:47:47, 25.26s/it] {'loss': 0.0046, 'grad_norm': 5.144961908906175, 'learning_rate': 5.97293513765749e-08, 'completion_length': 295.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.8549107611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.837053656578064, 'reward_std': 0.06600918737240136, 'kl': 0.1136474609375, 'epoch': 0.94}
94%|█████████▍| 4031/4286 [30:25:49<1:46:58, 25.17s/it] {'loss': 0.0307, 'grad_norm': 14.739005886915356, 'learning_rate': 5.949603359776015e-08, 'completion_length': 294.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.80952388048172, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.0773809514939785, 'kl': 0.772705078125, 'epoch': 0.94}
94%|█████████▍| 4032/4286 [30:26:14<1:46:49, 25.24s/it] {'loss': 0.0027, 'grad_norm': 8.615385974479898, 'learning_rate': 5.9262715818945405e-08, 'completion_length': 290.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529763579368591, 'reward_std': 0.04602411016821861, 'kl': 0.0684814453125, 'epoch': 0.94}
94%|█████████▍| 4033/4286 [30:26:38<1:44:54, 24.88s/it] {'loss': 0.0215, 'grad_norm': 6.282280695652164, 'learning_rate': 5.9029398040130654e-08, 'completion_length': 311.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6205357611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.602678656578064, 'reward_std': 0.069373220205307, 'kl': 0.5361328125, 'epoch': 0.94}
94%|█████████▍| 4034/4286 [30:27:03<1:44:14, 24.82s/it] {'loss': 0.0032, 'grad_norm': 8.096976570698809, 'learning_rate': 5.879608026131591e-08, 'completion_length': 311.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.8333333730697632, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.048786625266075134, 'kl': 0.080810546875, 'epoch': 0.94}
94%|█████████▍| 4035/4286 [30:27:29<1:45:08, 25.13s/it] {'loss': 0.024, 'grad_norm': 3.2705018031037656, 'learning_rate': 5.8562762482501165e-08, 'completion_length': 319.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.639881044626236, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.14593220874667168, 'kl': 0.599609375, 'epoch': 0.94}
94%|█████████▍| 4036/4286 [30:27:52<1:42:49, 24.68s/it] {'loss': 0.0105, 'grad_norm': 3.508562564527588, 'learning_rate': 5.832944470368642e-08, 'completion_length': 303.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.8199405670166016, 'rewards/format_reward': 1.0, 'reward': 1.8199405670166016, 'reward_std': 0.028627381660044193, 'kl': 0.26312255859375, 'epoch': 0.94}
94%|█████████▍| 4037/4286 [30:28:17<1:42:18, 24.65s/it] {'loss': 0.0046, 'grad_norm': 0.35280131289834527, 'learning_rate': 5.8096126924871675e-08, 'completion_length': 315.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.8690477013587952, 'rewards/format_reward': 1.0, 'reward': 1.8690477013587952, 'reward_std': 0.0, 'kl': 0.1142578125, 'epoch': 0.94}
94%|█████████▍| 4038/4286 [30:28:42<1:42:18, 24.75s/it] {'loss': 0.0275, 'grad_norm': 10.4142214873131, 'learning_rate': 5.786280914605693e-08, 'completion_length': 249.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.6770833730697632, 'reward_std': 0.06063448078930378, 'kl': 0.68994140625, 'epoch': 0.94}
94%|█████████▍| 4039/4286 [30:29:09<1:45:02, 25.52s/it] {'loss': 0.0079, 'grad_norm': 8.22213941241526, 'learning_rate': 5.7629491367242186e-08, 'completion_length': 351.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.766369104385376, 'rewards/format_reward': 1.0, 'reward': 1.7663691639900208, 'reward_std': 0.04900030232965946, 'kl': 0.1968994140625, 'epoch': 0.94}
94%|█████████▍| 4040/4286 [30:29:36<1:46:06, 25.88s/it] {'loss': 0.0133, 'grad_norm': 5.073816651911243, 'learning_rate': 5.739617358842744e-08, 'completion_length': 271.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7678572535514832, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.039397627115249634, 'kl': 0.33203125, 'epoch': 0.94}
94%|█████████▍| 4041/4286 [30:30:01<1:44:43, 25.65s/it] {'loss': 0.0016, 'grad_norm': 11.719229482861904, 'learning_rate': 5.716285580961269e-08, 'completion_length': 316.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8735119700431824, 'rewards/format_reward': 1.0, 'reward': 1.8735119700431824, 'reward_std': 0.008928571827709675, 'kl': 0.0389404296875, 'epoch': 0.94}
94%|█████████▍| 4042/4286 [30:30:25<1:42:26, 25.19s/it] {'loss': 0.0069, 'grad_norm': 2.269251796709604, 'learning_rate': 5.6929538030797945e-08, 'completion_length': 249.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.0714285746216774, 'kl': 0.171875, 'epoch': 0.94}
94%|█████████▍| 4043/4286 [30:30:50<1:41:31, 25.07s/it] {'loss': 0.0129, 'grad_norm': 12.726357439596054, 'learning_rate': 5.66962202519832e-08, 'completion_length': 314.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7425596117973328, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.08554795384407043, 'kl': 0.3232421875, 'epoch': 0.94}
94%|█████████▍| 4044/4286 [30:31:16<1:41:58, 25.28s/it] {'loss': 0.0066, 'grad_norm': 3.789605503421671, 'learning_rate': 5.6462902473168456e-08, 'completion_length': 308.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.721726268529892, 'rewards/format_reward': 1.0, 'reward': 1.7217263579368591, 'reward_std': 0.028627393301576376, 'kl': 0.16455078125, 'epoch': 0.94}
[2025-03-03 21:29:02,473] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
94%|█████████▍| 4045/4286 [30:31:40<1:39:56, 24.88s/it] {'loss': 0.007, 'grad_norm': 4.562973293493275, 'learning_rate': 5.622958469435371e-08, 'completion_length': 289.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7157739102840424, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.06845238618552685, 'kl': 0.1759033203125, 'epoch': 0.94}
94%|█████████▍| 4046/4286 [30:32:05<1:40:44, 25.19s/it] {'loss': 0.0117, 'grad_norm': 4.819635026255739, 'learning_rate': 5.5996266915538966e-08, 'completion_length': 306.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7184523940086365, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7005953192710876, 'reward_std': 0.06065106950700283, 'kl': 0.292236328125, 'epoch': 0.94}
94%|█████████▍| 4047/4286 [30:32:29<1:38:55, 24.83s/it] {'loss': 0.0117, 'grad_norm': 1.582046324462651, 'learning_rate': 5.576294913672422e-08, 'completion_length': 274.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.7857142686843872, 'rewards/format_reward': 1.0, 'reward': 1.7857144474983215, 'reward_std': 0.05952380783855915, 'kl': 0.2919921875, 'epoch': 0.94}
94%|█████████▍| 4048/4286 [30:32:55<1:39:41, 25.13s/it] {'loss': 0.0266, 'grad_norm': 14.839397304621517, 'learning_rate': 5.552963135790947e-08, 'completion_length': 313.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.8279762268066406, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8101191520690918, 'reward_std': 0.10531468223780394, 'kl': 0.666015625, 'epoch': 0.94}
94%|█████████▍| 4049/4286 [30:33:21<1:39:50, 25.27s/it] {'loss': 0.0158, 'grad_norm': 5.133337520852388, 'learning_rate': 5.5296313579094726e-08, 'completion_length': 325.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.07426555640995502, 'kl': 0.3935546875, 'epoch': 0.94}
94%|█████████▍| 4050/4286 [30:33:45<1:38:05, 24.94s/it] {'loss': 0.0034, 'grad_norm': 11.189826454455904, 'learning_rate': 5.506299580027998e-08, 'completion_length': 305.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.030682736076414585, 'kl': 0.085693359375, 'epoch': 0.94}
95%|█████████▍| 4051/4286 [30:34:10<1:37:51, 24.99s/it] {'loss': 0.0083, 'grad_norm': 5.70475357120723, 'learning_rate': 5.4829678021465236e-08, 'completion_length': 301.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.834821492433548, 'rewards/format_reward': 1.0, 'reward': 1.8348215818405151, 'reward_std': 0.04090644000098109, 'kl': 0.2081298828125, 'epoch': 0.95}
95%|█████████▍| 4052/4286 [30:34:35<1:37:37, 25.03s/it] {'loss': 0.0059, 'grad_norm': 3.0615840384343125, 'learning_rate': 5.4596360242650485e-08, 'completion_length': 315.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.8258928954601288, 'rewards/format_reward': 1.0, 'reward': 1.825892984867096, 'reward_std': 0.019238398410379887, 'kl': 0.1480712890625, 'epoch': 0.95}
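One more regularity worth noting: at every step in this section the reported loss is almost exactly 0.04 * kl (for step 4045, 0.04 * 0.1759033203125 = 0.00704 against a logged loss of 0.007). That ratio would follow from a KL-penalized objective whose coefficient is 0.04 and whose policy-gradient term averages to roughly zero within each batch; the 0.04 is inferred from the ratio, not read from any config. A quick check against steps 4045, 4048, and 4051 above:

# loss ≈ 0.04 * kl at every step shown; beta = 0.04 is inferred, not configured here.
pairs = [(0.007, 0.1759033203125), (0.0266, 0.666015625), (0.0083, 0.2081298828125)]
for loss, kl in pairs:
    assert abs(loss - 0.04 * kl) < 2e-4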
1.825892984867096, 'reward_std': 0.019238398410379887, 'kl': 0.1480712890625, 'epoch': 0.95} 95%|█████████▍| 4052/4286 [30:34:35<1:37:37, 25.03s/it] 95%|█████████▍| 4053/4286 [30:34:59<1:35:17, 24.54s/it] {'loss': 0.0126, 'grad_norm': 7.928987827498538, 'learning_rate': 5.436304246383574e-08, 'completion_length': 297.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.854166716337204, 'rewards/format_reward': 1.0, 'reward': 1.8541668057441711, 'reward_std': 0.05222322791814804, 'kl': 0.314453125, 'epoch': 0.95} 95%|█████████▍| 4053/4286 [30:34:59<1:35:17, 24.54s/it] 95%|█████████▍| 4054/4286 [30:35:24<1:35:52, 24.80s/it] {'loss': 0.0084, 'grad_norm': 3.053635818817438, 'learning_rate': 5.4129724685020995e-08, 'completion_length': 323.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.9080357253551483, 'rewards/format_reward': 1.0, 'reward': 1.9080357551574707, 'reward_std': 0.03511904692277312, 'kl': 0.208984375, 'epoch': 0.95} 95%|█████████▍| 4054/4286 [30:35:24<1:35:52, 24.80s/it] 95%|█████████▍| 4055/4286 [30:35:48<1:34:34, 24.57s/it] {'loss': 0.0424, 'grad_norm': 17.24714674562569, 'learning_rate': 5.3896406906206244e-08, 'completion_length': 306.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.607142984867096, 'reward_std': 0.1102976854890585, 'kl': 1.052978515625, 'epoch': 0.95} 95%|█████████▍| 4055/4286 [30:35:48<1:34:34, 24.57s/it] 95%|█████████▍| 4056/4286 [30:36:12<1:33:34, 24.41s/it] {'loss': 0.01, 'grad_norm': 7.008743829650688, 'learning_rate': 5.36630891273915e-08, 'completion_length': 259.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 0.08311965316534042, 'kl': 0.2509765625, 'epoch': 0.95} 95%|█████████▍| 4056/4286 [30:36:12<1:33:34, 24.41s/it] 95%|█████████▍| 4057/4286 [30:36:36<1:32:36, 24.26s/it] {'loss': 0.0138, 'grad_norm': 9.435573554267823, 'learning_rate': 5.3429771348576755e-08, 'completion_length': 264.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.828869104385376, 'rewards/format_reward': 1.0, 'reward': 1.8288691639900208, 'reward_std': 0.06685744598507881, 'kl': 0.3466796875, 'epoch': 0.95} 95%|█████████▍| 4057/4286 [30:36:36<1:32:36, 24.26s/it] 95%|█████████▍| 4058/4286 [30:37:01<1:32:59, 24.47s/it] {'loss': 0.0052, 'grad_norm': 0.7332881803014499, 'learning_rate': 5.319645356976201e-08, 'completion_length': 318.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.027492869645357132, 'kl': 0.12939453125, 'epoch': 0.95} 95%|█████████▍| 4058/4286 [30:37:01<1:32:59, 24.47s/it] 95%|█████████▍| 4059/4286 [30:37:27<1:34:38, 25.01s/it] {'loss': 0.0141, 'grad_norm': 11.693873146819627, 'learning_rate': 5.2963135790947265e-08, 'completion_length': 331.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.07844539359211922, 'kl': 0.35205078125, 'epoch': 0.95} 95%|█████████▍| 4059/4286 [30:37:27<1:34:38, 25.01s/it] 95%|█████████▍| 4060/4286 [30:37:52<1:34:24, 25.06s/it] {'loss': 0.0239, 'grad_norm': 4.067321891311157, 'learning_rate': 5.272981801213252e-08, 'completion_length': 303.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6979166865348816, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 
0.11463277786970139, 'kl': 0.5986328125, 'epoch': 0.95} 95%|█████████▍| 4060/4286 [30:37:52<1:34:24, 25.06s/it] 95%|█████████▍| 4061/4286 [30:38:18<1:34:12, 25.12s/it] {'loss': 0.0035, 'grad_norm': 2.69404785790736, 'learning_rate': 5.2496500233317776e-08, 'completion_length': 346.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.039858050644397736, 'kl': 0.0865478515625, 'epoch': 0.95} 95%|█████████▍| 4061/4286 [30:38:18<1:34:12, 25.12s/it] 95%|█████████▍| 4062/4286 [30:38:41<1:32:00, 24.64s/it] {'loss': 0.0198, 'grad_norm': 18.310772432246925, 'learning_rate': 5.226318245450303e-08, 'completion_length': 295.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6556122899055481, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6198980808258057, 'reward_std': 0.14115648483857512, 'kl': 0.4951171875, 'epoch': 0.95} 95%|█████████▍| 4062/4286 [30:38:41<1:32:00, 24.64s/it] 95%|█████████▍| 4063/4286 [30:39:06<1:31:36, 24.65s/it] {'loss': 0.0076, 'grad_norm': 3.138185544054426, 'learning_rate': 5.202986467568828e-08, 'completion_length': 302.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.0, 'kl': 0.18994140625, 'epoch': 0.95} 95%|█████████▍| 4063/4286 [30:39:06<1:31:36, 24.65s/it] 95%|█████████▍| 4064/4286 [30:39:31<1:31:27, 24.72s/it] {'loss': 0.0083, 'grad_norm': 3.4547697453184876, 'learning_rate': 5.1796546896873535e-08, 'completion_length': 260.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7961309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.06802502274513245, 'kl': 0.2069091796875, 'epoch': 0.95} 95%|█████████▍| 4064/4286 [30:39:31<1:31:27, 24.72s/it] 95%|█████████▍| 4065/4286 [30:39:56<1:32:04, 25.00s/it] {'loss': 0.0035, 'grad_norm': 4.2712031493321545, 'learning_rate': 5.156322911805879e-08, 'completion_length': 314.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.0416666716337204, 'kl': 0.0870361328125, 'epoch': 0.95} 95%|█████████▍| 4065/4286 [30:39:56<1:32:04, 25.00s/it] 95%|█████████▍| 4066/4286 [30:40:19<1:29:22, 24.37s/it] {'loss': 0.0117, 'grad_norm': 42.52930821242789, 'learning_rate': 5.1329911339244046e-08, 'completion_length': 239.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.8139881193637848, 'rewards/format_reward': 1.0, 'reward': 1.813988208770752, 'reward_std': 0.03114316426217556, 'kl': 0.293701171875, 'epoch': 0.95} 95%|█████████▍| 4066/4286 [30:40:19<1:29:22, 24.37s/it] 95%|█████████▍| 4067/4286 [30:40:44<1:29:00, 24.39s/it] {'loss': 0.0156, 'grad_norm': 11.06005569533271, 'learning_rate': 5.10965935604293e-08, 'completion_length': 312.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.729166716337204, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.06388125568628311, 'kl': 0.38916015625, 'epoch': 0.95} 95%|█████████▍| 4067/4286 [30:40:44<1:29:00, 24.39s/it] 95%|█████████▍| 4068/4286 [30:41:08<1:28:08, 24.26s/it] {'loss': 0.0117, 'grad_norm': 3.0765977177552704, 'learning_rate': 5.0863275781614556e-08, 'completion_length': 277.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7348214387893677, 'rewards/format_reward': 1.0, 'reward': 1.7348214983940125, 'reward_std': 0.014760689809918404, 'kl': 0.2919921875, 'epoch': 0.95} 95%|█████████▍| 
4068/4286 [30:41:08<1:28:08, 24.26s/it] 95%|█████████▍| 4069/4286 [30:41:33<1:28:59, 24.60s/it] {'loss': 0.0247, 'grad_norm': 6.640869868668072, 'learning_rate': 5.062995800279981e-08, 'completion_length': 319.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.053571438416838646, 'kl': 0.6171875, 'epoch': 0.95} 95%|█████████▍| 4069/4286 [30:41:33<1:28:59, 24.60s/it] 95%|█████████▍| 4070/4286 [30:41:58<1:28:17, 24.52s/it] {'loss': 0.0142, 'grad_norm': 3.0410857005152274, 'learning_rate': 5.039664022398507e-08, 'completion_length': 317.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.671131044626236, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.044589780271053314, 'kl': 0.35546875, 'epoch': 0.95} 95%|█████████▍| 4070/4286 [30:41:58<1:28:17, 24.52s/it] 95%|█████████▍| 4071/4286 [30:42:23<1:28:34, 24.72s/it] {'loss': 0.0027, 'grad_norm': 6.9390095075335525, 'learning_rate': 5.0163322445170316e-08, 'completion_length': 290.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8839286267757416, 'rewards/format_reward': 1.0, 'reward': 1.8839287161827087, 'reward_std': 0.022214585915207863, 'kl': 0.0677490234375, 'epoch': 0.95} 95%|█████████▍| 4071/4286 [30:42:23<1:28:34, 24.72s/it] 95%|█████████▌| 4072/4286 [30:42:47<1:27:48, 24.62s/it] {'loss': 0.0101, 'grad_norm': 6.757304426178196, 'learning_rate': 4.993000466635557e-08, 'completion_length': 301.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762387275696, 'reward_std': 0.07718737795948982, 'kl': 0.2529296875, 'epoch': 0.95} 95%|█████████▌| 4072/4286 [30:42:47<1:27:48, 24.62s/it] 95%|█████████▌| 4073/4286 [30:43:13<1:28:37, 24.96s/it] {'loss': 0.0116, 'grad_norm': 5.672381434182414, 'learning_rate': 4.9696686887540826e-08, 'completion_length': 317.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.8133928775787354, 'rewards/format_reward': 1.0, 'reward': 1.813392996788025, 'reward_std': 0.06448604725301266, 'kl': 0.28857421875, 'epoch': 0.95} 95%|█████████▌| 4073/4286 [30:43:13<1:28:37, 24.96s/it] 95%|█████████▌| 4074/4286 [30:43:39<1:29:12, 25.25s/it] {'loss': 0.0048, 'grad_norm': 2.8666324286956204, 'learning_rate': 4.946336910872608e-08, 'completion_length': 310.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.836309552192688, 'rewards/format_reward': 1.0, 'reward': 1.8363096714019775, 'reward_std': 0.04191340133547783, 'kl': 0.120361328125, 'epoch': 0.95} 95%|█████████▌| 4074/4286 [30:43:39<1:29:12, 25.25s/it] 95%|█████████▌| 4075/4286 [30:44:04<1:28:25, 25.14s/it] {'loss': 0.0114, 'grad_norm': 1.075495285627269, 'learning_rate': 4.923005132991134e-08, 'completion_length': 290.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7370130121707916, 'rewards/format_reward': 1.0, 'reward': 1.7370131015777588, 'reward_std': 0.04590256232768297, 'kl': 0.28564453125, 'epoch': 0.95} 95%|█████████▌| 4075/4286 [30:44:04<1:28:25, 25.14s/it] 95%|█████████▌| 4076/4286 [30:44:30<1:29:01, 25.44s/it] {'loss': 0.0187, 'grad_norm': 2.371382669219736, 'learning_rate': 4.899673355109659e-08, 'completion_length': 287.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.8288690745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8110120296478271, 'reward_std': 0.07280982751399279, 'kl': 0.466796875, 'epoch': 0.95} 95%|█████████▌| 4076/4286 [30:44:30<1:29:01, 25.44s/it] 95%|█████████▌| 4077/4286 
[30:44:55<1:27:59, 25.26s/it] {'loss': 0.0117, 'grad_norm': 4.30668390257031, 'learning_rate': 4.876341577228185e-08, 'completion_length': 305.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.8690476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8690477013587952, 'reward_std': 0.040645405650138855, 'kl': 0.29150390625, 'epoch': 0.95} 95%|█████████▌| 4077/4286 [30:44:55<1:27:59, 25.26s/it] 95%|█████████▌| 4078/4286 [30:45:18<1:25:46, 24.74s/it] {'loss': 0.0033, 'grad_norm': 9.052007273217148, 'learning_rate': 4.8530097993467096e-08, 'completion_length': 272.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.9151786267757416, 'rewards/format_reward': 1.0, 'reward': 1.9151787757873535, 'reward_std': 0.056454822421073914, 'kl': 0.08343505859375, 'epoch': 0.95} 95%|█████████▌| 4078/4286 [30:45:18<1:25:46, 24.74s/it] 95%|█████████▌| 4079/4286 [30:45:42<1:24:40, 24.54s/it] {'loss': 0.0043, 'grad_norm': 0.8633219494064607, 'learning_rate': 4.829678021465235e-08, 'completion_length': 299.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.7348214685916901, 'rewards/format_reward': 1.0, 'reward': 1.7348215579986572, 'reward_std': 0.010664566420018673, 'kl': 0.107421875, 'epoch': 0.95} 95%|█████████▌| 4079/4286 [30:45:42<1:24:40, 24.54s/it] 95%|█████████▌| 4080/4286 [30:46:07<1:24:55, 24.74s/it] {'loss': 0.0088, 'grad_norm': 5.671984213423826, 'learning_rate': 4.806346243583761e-08, 'completion_length': 313.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.8690476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8511906266212463, 'reward_std': 0.07578601501882076, 'kl': 0.218505859375, 'epoch': 0.95} 95%|█████████▌| 4080/4286 [30:46:07<1:24:55, 24.74s/it] 95%|█████████▌| 4081/4286 [30:46:33<1:25:51, 25.13s/it] {'loss': 0.0168, 'grad_norm': 7.684223495067205, 'learning_rate': 4.783014465702286e-08, 'completion_length': 317.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8630952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8630953431129456, 'reward_std': 0.02816697023808956, 'kl': 0.419921875, 'epoch': 0.95} 95%|█████████▌| 4081/4286 [30:46:33<1:25:51, 25.13s/it] 95%|█████████▌| 4082/4286 [30:46:59<1:25:50, 25.25s/it] {'loss': 0.0013, 'grad_norm': 0.1470648759699998, 'learning_rate': 4.759682687820812e-08, 'completion_length': 309.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.9523809552192688, 'rewards/format_reward': 1.0, 'reward': 1.9523811340332031, 'reward_std': 0.0, 'kl': 0.0321044921875, 'epoch': 0.95} 95%|█████████▌| 4082/4286 [30:46:59<1:25:50, 25.25s/it] 95%|█████████▌| 4083/4286 [30:47:25<1:25:46, 25.35s/it] {'loss': 0.0135, 'grad_norm': 26.97105799282015, 'learning_rate': 4.736350909939337e-08, 'completion_length': 302.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.7738095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.06915953941643238, 'kl': 0.337890625, 'epoch': 0.95} 95%|█████████▌| 4083/4286 [30:47:25<1:25:46, 25.35s/it] 95%|█████████▌| 4084/4286 [30:47:50<1:24:56, 25.23s/it] {'loss': 0.004, 'grad_norm': 0.9785259722712957, 'learning_rate': 4.713019132057863e-08, 'completion_length': 292.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.9047619700431824, 'rewards/format_reward': 1.0, 'reward': 1.9047620296478271, 'reward_std': 0.011904764920473099, 'kl': 0.100341796875, 'epoch': 0.95} 95%|█████████▌| 4084/4286 [30:47:50<1:24:56, 25.23s/it] 95%|█████████▌| 4085/4286 [30:48:13<1:22:31, 24.63s/it] {'loss': 0.0072, 'grad_norm': 
7.057050953518881, 'learning_rate': 4.689687354176388e-08, 'completion_length': 256.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.736607164144516, 'rewards/format_reward': 1.0, 'reward': 1.7366072535514832, 'reward_std': 0.014880956150591373, 'kl': 0.18017578125, 'epoch': 0.95} 95%|█████████▌| 4085/4286 [30:48:13<1:22:31, 24.63s/it] 95%|█████████▌| 4086/4286 [30:48:36<1:20:59, 24.30s/it] {'loss': 0.0287, 'grad_norm': 21.338417390478167, 'learning_rate': 4.666355576294913e-08, 'completion_length': 270.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.5803571939468384, 'rewards/format_reward': 1.0, 'reward': 1.5803572535514832, 'reward_std': 0.04007173259742558, 'kl': 0.720703125, 'epoch': 0.95} 95%|█████████▌| 4086/4286 [30:48:36<1:20:59, 24.30s/it] 95%|█████████▌| 4087/4286 [30:49:02<1:21:53, 24.69s/it] {'loss': 0.0035, 'grad_norm': 2.8286375935092973, 'learning_rate': 4.643023798413439e-08, 'completion_length': 309.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7514881193637848, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.025190776214003563, 'kl': 0.088134765625, 'epoch': 0.95} 95%|█████████▌| 4087/4286 [30:49:02<1:21:53, 24.69s/it] 95%|█████████▌| 4088/4286 [30:49:27<1:21:28, 24.69s/it] {'loss': 0.0092, 'grad_norm': 3.9003915041517, 'learning_rate': 4.619692020531964e-08, 'completion_length': 287.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.830357164144516, 'rewards/format_reward': 1.0, 'reward': 1.830357313156128, 'reward_std': 0.03818839509040117, 'kl': 0.23095703125, 'epoch': 0.95} 95%|█████████▌| 4088/4286 [30:49:27<1:21:28, 24.69s/it] 95%|█████████▌| 4089/4286 [30:49:51<1:20:46, 24.60s/it] {'loss': 0.027, 'grad_norm': 5.794214869718333, 'learning_rate': 4.59636024265049e-08, 'completion_length': 280.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.7267857491970062, 'rewards/format_reward': 1.0, 'reward': 1.7267858386039734, 'reward_std': 0.08626504708081484, 'kl': 0.67626953125, 'epoch': 0.95} 95%|█████████▌| 4089/4286 [30:49:51<1:20:46, 24.60s/it] 95%|█████████▌| 4090/4286 [30:50:16<1:20:21, 24.60s/it] {'loss': 0.0175, 'grad_norm': 10.549631938870636, 'learning_rate': 4.573028464769015e-08, 'completion_length': 308.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.836309552192688, 'rewards/format_reward': 1.0, 'reward': 1.8363096117973328, 'reward_std': 0.05792887508869171, 'kl': 0.435546875, 'epoch': 0.95} 95%|█████████▌| 4090/4286 [30:50:16<1:20:21, 24.60s/it] 95%|█████████▌| 4091/4286 [30:50:40<1:19:55, 24.59s/it] {'loss': 0.003, 'grad_norm': 15.083402695497119, 'learning_rate': 4.549696686887541e-08, 'completion_length': 294.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.8154221177101135, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.79756498336792, 'reward_std': 0.10725108534097672, 'kl': 0.0753173828125, 'epoch': 0.95} 95%|█████████▌| 4091/4286 [30:50:40<1:19:55, 24.59s/it] 95%|█████████▌| 4092/4286 [30:51:05<1:20:02, 24.75s/it] {'loss': 0.0121, 'grad_norm': 3.1370449930846855, 'learning_rate': 4.5263649090060664e-08, 'completion_length': 290.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.05633394047617912, 'kl': 0.302734375, 'epoch': 0.95} 95%|█████████▌| 4092/4286 [30:51:05<1:20:02, 24.75s/it][2025-03-03 21:48:53,892] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. 
this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 95%|█████████▌| 4093/4286 [30:51:31<1:20:32, 25.04s/it] {'loss': 0.0035, 'grad_norm': 0.7054593059320067, 'learning_rate': 4.503033131124591e-08, 'completion_length': 252.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.8690476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8690478205680847, 'reward_std': 0.023809518665075302, 'kl': 0.086669921875, 'epoch': 0.95} 95%|█████████▌| 4093/4286 [30:51:31<1:20:32, 25.04s/it] 96%|█████████▌| 4094/4286 [30:51:55<1:19:14, 24.76s/it] {'loss': 0.0146, 'grad_norm': 4.420674683123165, 'learning_rate': 4.479701353243117e-08, 'completion_length': 298.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8139881491661072, 'rewards/format_reward': 1.0, 'reward': 1.813988208770752, 'reward_std': 0.08630953077226877, 'kl': 0.36328125, 'epoch': 0.96} 96%|█████████▌| 4094/4286 [30:51:55<1:19:14, 24.76s/it][2025-03-03 21:49:44,001] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 96%|█████████▌| 4095/4286 [30:52:21<1:19:59, 25.13s/it] {'loss': 0.0249, 'grad_norm': 2.5614335869870875, 'learning_rate': 4.456369575361642e-08, 'completion_length': 316.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 1.0, 'reward': 1.7098215818405151, 'reward_std': 0.04464286006987095, 'kl': 0.62109375, 'epoch': 0.96} 96%|█████████▌| 4095/4286 [30:52:21<1:19:59, 25.13s/it] 96%|█████████▌| 4096/4286 [30:52:45<1:18:24, 24.76s/it] {'loss': 0.0027, 'grad_norm': 1.6281634527399549, 'learning_rate': 4.433037797480168e-08, 'completion_length': 310.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.9002977013587952, 'rewards/format_reward': 1.0, 'reward': 1.9002977013587952, 'reward_std': 0.03457976505160332, 'kl': 0.068359375, 'epoch': 0.96} 96%|█████████▌| 4096/4286 [30:52:45<1:18:24, 24.76s/it] 96%|█████████▌| 4097/4286 [30:53:10<1:17:51, 24.72s/it] {'loss': 0.0042, 'grad_norm': 6.4922082697782555, 'learning_rate': 4.4097060195986934e-08, 'completion_length': 320.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.0650488268584013, 'kl': 0.106201171875, 'epoch': 0.96} 96%|█████████▌| 4097/4286 [30:53:10<1:17:51, 24.72s/it] 96%|█████████▌| 4098/4286 [30:53:35<1:17:51, 24.85s/it] {'loss': 0.0107, 'grad_norm': 6.818488085585145, 'learning_rate': 4.386374241717219e-08, 'completion_length': 287.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.07871581427752972, 'kl': 0.26611328125, 'epoch': 0.96} 96%|█████████▌| 4098/4286 [30:53:35<1:17:51, 24.85s/it] 96%|█████████▌| 4099/4286 [30:53:59<1:16:43, 24.62s/it] {'loss': 0.0093, 'grad_norm': 1.7609765203272303, 
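The warning above recurs whenever ZeRO stage-3 hits allocator pressure inside step(). Below is a minimal sketch of the mitigation it suggests, not code from this run: a periodic, synchronized get_accelerator().empty_cache() in the training loop. model_engine, train_loader, and FLUSH_EVERY are hypothetical names standing in for whatever this script actually uses.

    from deepspeed.accelerator import get_accelerator

    FLUSH_EVERY = 50  # hypothetical interval; tune to how often the warning fires

    for step, batch in enumerate(train_loader):
        loss = model_engine(**batch)   # forward pass returning the loss
        model_engine.backward(loss)    # DeepSpeed engine handles scaling/accumulation
        model_engine.step()            # where stage3.py emits the cache-flush warning
        # Flush the caching allocator on a fixed schedule so every rank flushes at
        # the same step, as the warning recommends, rather than ad hoc under pressure.
        if step % FLUSH_EVERY == 0:
            get_accelerator().empty_cache()

Flushing on a fixed schedule trades a small, predictable synchronization cost for avoiding the unsynchronized flushes that stall individual ranks mid-step.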
95%|█████████▌| 4093/4286 [30:51:31<1:20:32, 25.04s/it] {'loss': 0.0035, 'grad_norm': 0.7054593059320067, 'learning_rate': 4.503033131124591e-08, 'completion_length': 252.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.8690476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8690478205680847, 'reward_std': 0.023809518665075302, 'kl': 0.086669921875, 'epoch': 0.95}
96%|█████████▌| 4094/4286 [30:51:55<1:19:14, 24.76s/it] {'loss': 0.0146, 'grad_norm': 4.420674683123165, 'learning_rate': 4.479701353243117e-08, 'completion_length': 298.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.8139881491661072, 'rewards/format_reward': 1.0, 'reward': 1.813988208770752, 'reward_std': 0.08630953077226877, 'kl': 0.36328125, 'epoch': 0.96}
[2025-03-03 21:49:44,001] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4095/4286 [30:52:21<1:19:59, 25.13s/it] {'loss': 0.0249, 'grad_norm': 2.5614335869870875, 'learning_rate': 4.456369575361642e-08, 'completion_length': 316.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 1.0, 'reward': 1.7098215818405151, 'reward_std': 0.04464286006987095, 'kl': 0.62109375, 'epoch': 0.96}
96%|█████████▌| 4096/4286 [30:52:45<1:18:24, 24.76s/it] {'loss': 0.0027, 'grad_norm': 1.6281634527399549, 'learning_rate': 4.433037797480168e-08, 'completion_length': 310.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.9002977013587952, 'rewards/format_reward': 1.0, 'reward': 1.9002977013587952, 'reward_std': 0.03457976505160332, 'kl': 0.068359375, 'epoch': 0.96}
96%|█████████▌| 4097/4286 [30:53:10<1:17:51, 24.72s/it] {'loss': 0.0042, 'grad_norm': 6.4922082697782555, 'learning_rate': 4.4097060195986934e-08, 'completion_length': 320.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.0650488268584013, 'kl': 0.106201171875, 'epoch': 0.96}
96%|█████████▌| 4098/4286 [30:53:35<1:17:51, 24.85s/it] {'loss': 0.0107, 'grad_norm': 6.818488085585145, 'learning_rate': 4.386374241717219e-08, 'completion_length': 287.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.07871581427752972, 'kl': 0.26611328125, 'epoch': 0.96}
96%|█████████▌| 4099/4286 [30:53:59<1:16:43, 24.62s/it] {'loss': 0.0093, 'grad_norm': 1.7609765203272303, 'learning_rate': 4.3630424638357444e-08, 'completion_length': 301.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.01785714365541935, 'kl': 0.2314453125, 'epoch': 0.96}
96%|█████████▌| 4100/4286 [30:54:24<1:16:58, 24.83s/it] {'loss': 0.012, 'grad_norm': 4.049435397777018, 'learning_rate': 4.33971068595427e-08, 'completion_length': 280.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7063492238521576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6884921789169312, 'reward_std': 0.07822974026203156, 'kl': 0.298828125, 'epoch': 0.96}
96%|█████████▌| 4101/4286 [31:03:14<9:03:21, 176.22s/it] {'loss': 0.0092, 'grad_norm': 18.31217654729218, 'learning_rate': 4.316378908072795e-08, 'completion_length': 307.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7291667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7291667461395264, 'reward_std': 0.0471916887909174, 'kl': 0.2294921875, 'epoch': 0.96}
96%|█████████▌| 4102/4286 [31:03:38<6:41:03, 130.78s/it] {'loss': 0.0071, 'grad_norm': 9.967931812460558, 'learning_rate': 4.2930471301913204e-08, 'completion_length': 324.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7351192235946655, 'reward_std': 0.08517500944435596, 'kl': 0.17724609375, 'epoch': 0.96}
96%|█████████▌| 4103/4286 [31:04:02<5:00:54, 98.66s/it] {'loss': 0.0051, 'grad_norm': 3.6454826290026094, 'learning_rate': 4.269715352309846e-08, 'completion_length': 310.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.01785714365541935, 'kl': 0.128173828125, 'epoch': 0.96}
96%|█████████▌| 4104/4286 [31:04:27<3:52:04, 76.51s/it] {'loss': 0.028, 'grad_norm': 5.042529432742838, 'learning_rate': 4.2463835744283714e-08, 'completion_length': 311.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187500596046448, 'reward_std': 0.044642857275903225, 'kl': 0.703125, 'epoch': 0.96}
96%|█████████▌| 4105/4286 [31:04:53<3:04:46, 61.25s/it] {'loss': 0.0164, 'grad_norm': 5.603055291224697, 'learning_rate': 4.223051796546897e-08, 'completion_length': 299.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8214287161827087, 'reward_std': 0.07234941516071558, 'kl': 0.4111328125, 'epoch': 0.96}
96%|█████████▌| 4106/4286 [31:05:16<2:29:40, 49.89s/it] {'loss': 0.0022, 'grad_norm': 1.7726300714474523, 'learning_rate': 4.1997200186654225e-08, 'completion_length': 281.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.01785714365541935, 'kl': 0.055419921875, 'epoch': 0.96}
96%|█████████▌| 4107/4286 [31:05:43<2:08:13, 42.98s/it] {'loss': 0.0189, 'grad_norm': 4.807253651402085, 'learning_rate': 4.176388240783948e-08, 'completion_length': 290.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.7216804623603821, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7038233876228333, 'reward_std': 0.050037406384944916, 'kl': 0.47314453125, 'epoch': 0.96}
[2025-03-03 22:03:32,857] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4108/4286 [31:06:10<1:53:23, 38.22s/it] {'loss': 0.0143, 'grad_norm': 8.708342160388806, 'learning_rate': 4.1530564629024735e-08, 'completion_length': 311.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.782738208770752, 'reward_std': 0.13698772341012955, 'kl': 0.3583984375, 'epoch': 0.96}
[2025-03-03 22:03:59,498] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4109/4286 [31:06:37<1:42:30, 34.75s/it] {'loss': 0.0095, 'grad_norm': 18.041702915571914, 'learning_rate': 4.1297246850209984e-08, 'completion_length': 321.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.8050595819950104, 'rewards/format_reward': 1.0, 'reward': 1.8050596117973328, 'reward_std': 0.07193046808242798, 'kl': 0.23876953125, 'epoch': 0.96}
96%|█████████▌| 4110/4286 [31:07:01<1:32:59, 31.70s/it] {'loss': 0.019, 'grad_norm': 1.8368171601260053, 'learning_rate': 4.106392907139524e-08, 'completion_length': 316.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.8223214447498322, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8044643998146057, 'reward_std': 0.05834851786494255, 'kl': 0.4742431640625, 'epoch': 0.96}
[2025-03-03 22:04:49,450] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4111/4286 [31:07:27<1:26:54, 29.80s/it] {'loss': 0.0027, 'grad_norm': 0.6000879306936052, 'learning_rate': 4.0830611292580495e-08, 'completion_length': 297.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726192235946655, 'reward_std': 0.0, 'kl': 0.06640625, 'epoch': 0.96}
96%|█████████▌| 4112/4286 [31:07:52<1:22:19, 28.39s/it] {'loss': 0.011, 'grad_norm': 3.2985913587182667, 'learning_rate': 4.059729351376575e-08, 'completion_length': 297.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7366071939468384, 'reward_std': 0.08397664874792099, 'kl': 0.27392578125, 'epoch': 0.96}
96%|█████████▌| 4113/4286 [31:08:16<1:18:07, 27.10s/it] {'loss': 0.0032, 'grad_norm': 1.9535393419468379, 'learning_rate': 4.0363975734951005e-08, 'completion_length': 275.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.8958333134651184, 'rewards/format_reward': 1.0, 'reward': 1.8958335518836975, 'reward_std': 0.05222322978079319, 'kl': 0.07861328125, 'epoch': 0.96}
96%|█████████▌| 4114/4286 [31:08:39<1:14:46, 26.08s/it] {'loss': 0.0108, 'grad_norm': 1.8505939945513927, 'learning_rate': 4.013065795613626e-08, 'completion_length': 267.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.8035714328289032, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.011904759332537651, 'kl': 0.27001953125, 'epoch': 0.96}
96%|█████████▌| 4115/4286 [31:09:04<1:12:46, 25.53s/it] {'loss': 0.008, 'grad_norm': 3.066105340583208, 'learning_rate': 3.9897340177321516e-08, 'completion_length': 327.26788330078125, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.12255034409463406, 'kl': 0.19775390625, 'epoch': 0.96}
96%|█████████▌| 4116/4286 [31:09:29<1:11:59, 25.41s/it] {'loss': 0.0337, 'grad_norm': 8.074605503407422, 'learning_rate': 3.9664022398506764e-08, 'completion_length': 286.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7455357909202576, 'reward_std': 0.13466878980398178, 'kl': 0.845458984375, 'epoch': 0.96}
96%|█████████▌| 4117/4286 [31:09:53<1:10:32, 25.05s/it] {'loss': 0.0059, 'grad_norm': 3.2462189295513726, 'learning_rate': 3.943070461969202e-08, 'completion_length': 299.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.08928571827709675, 'kl': 0.14892578125, 'epoch': 0.96}
96%|█████████▌| 4118/4286 [31:10:16<1:08:48, 24.57s/it] {'loss': 0.0167, 'grad_norm': 1.290548341157853, 'learning_rate': 3.9197386840877275e-08, 'completion_length': 274.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6651785969734192, 'rewards/format_reward': 1.0, 'reward': 1.665178656578064, 'reward_std': 0.1141766756772995, 'kl': 0.4169921875, 'epoch': 0.96}
96%|█████████▌| 4119/4286 [31:10:40<1:07:41, 24.32s/it] {'loss': 0.0044, 'grad_norm': 6.476396384137337, 'learning_rate': 3.8964069062062524e-08, 'completion_length': 251.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.008928571827709675, 'kl': 0.10888671875, 'epoch': 0.96}
96%|█████████▌| 4120/4286 [31:11:05<1:07:35, 24.43s/it] {'loss': 0.0037, 'grad_norm': 4.235844649568862, 'learning_rate': 3.873075128324778e-08, 'completion_length': 252.71430206298828, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 1.0, 'reward': 1.7529762387275696, 'reward_std': 0.053559744730591774, 'kl': 0.09326171875, 'epoch': 0.96}
96%|█████████▌| 4121/4286 [31:11:30<1:07:38, 24.60s/it] {'loss': 0.0033, 'grad_norm': 3.7442293199747505, 'learning_rate': 3.8497433504433034e-08, 'completion_length': 296.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.6785715818405151, 'reward_std': 0.03252441808581352, 'kl': 0.082763671875, 'epoch': 0.96}
96%|█████████▌| 4122/4286 [31:11:55<1:07:18, 24.63s/it] {'loss': 0.0095, 'grad_norm': 54.18905085393506, 'learning_rate': 3.826411572561829e-08, 'completion_length': 313.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6636905670166016, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.09045328944921494, 'kl': 0.23828125, 'epoch': 0.96}
96%|█████████▌| 4123/4286 [31:12:20<1:07:09, 24.72s/it] {'loss': 0.0129, 'grad_norm': 6.439576123684817, 'learning_rate': 3.8030797946803545e-08, 'completion_length': 319.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6738095581531525, 'rewards/format_reward': 1.0, 'reward': 1.6738097071647644, 'reward_std': 0.06277625262737274, 'kl': 0.32275390625, 'epoch': 0.96}
96%|█████████▌| 4124/4286 [31:12:44<1:06:44, 24.72s/it] {'loss': 0.027, 'grad_norm': 28.58043681169413, 'learning_rate': 3.7797480167988794e-08, 'completion_length': 298.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.6200397610664368, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5843255519866943, 'reward_std': 0.13080992549657822, 'kl': 0.67578125, 'epoch': 0.96}
96%|█████████▌| 4125/4286 [31:13:08<1:05:48, 24.53s/it] {'loss': 0.0225, 'grad_norm': 3.866433460722377, 'learning_rate': 3.756416238917405e-08, 'completion_length': 305.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.784226268529892, 'rewards/format_reward': 1.0, 'reward': 1.7842262387275696, 'reward_std': 0.07876220718026161, 'kl': 0.5625, 'epoch': 0.96}
96%|█████████▋| 4126/4286 [31:13:34<1:06:11, 24.82s/it] {'loss': 0.0058, 'grad_norm': 2.1930215987775155, 'learning_rate': 3.7330844610359304e-08, 'completion_length': 288.1071472167969, 'rewards/only_full_func_accuracy_reward': 0.6488095819950104, 'rewards/format_reward': 1.0, 'reward': 1.6488096714019775, 'reward_std': 0.05952381156384945, 'kl': 0.145263671875, 'epoch': 0.96}
96%|█████████▋| 4127/4286 [31:13:58<1:04:55, 24.50s/it] {'loss': 0.0218, 'grad_norm': 4.787646164353715, 'learning_rate': 3.709752683154456e-08, 'completion_length': 318.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.765476256608963, 'rewards/format_reward': 1.0, 'reward': 1.7654762864112854, 'reward_std': 0.05888326093554497, 'kl': 0.544189453125, 'epoch': 0.96}
96%|█████████▋| 4128/4286 [31:14:22<1:04:46, 24.60s/it] {'loss': 0.0067, 'grad_norm': 3.8834264155721394, 'learning_rate': 3.6864209052729815e-08, 'completion_length': 303.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.8363095819950104, 'rewards/format_reward': 1.0, 'reward': 1.8363096714019775, 'reward_std': 0.026025486178696156, 'kl': 0.167236328125, 'epoch': 0.96}
96%|█████████▋| 4129/4286 [31:14:47<1:04:44, 24.74s/it] {'loss': 0.0161, 'grad_norm': 13.676842076072184, 'learning_rate': 3.663089127391507e-08, 'completion_length': 312.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.8906250298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.872767984867096, 'reward_std': 0.08071072027087212, 'kl': 0.40380859375, 'epoch': 0.96}
96%|█████████▋| 4130/4286 [31:15:12<1:04:24, 24.77s/it] {'loss': 0.0168, 'grad_norm': 1.6321957547581414, 'learning_rate': 3.6397573495100325e-08, 'completion_length': 331.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.8318452537059784, 'rewards/format_reward': 1.0, 'reward': 1.8318453431129456, 'reward_std': 0.008928571827709675, 'kl': 0.417724609375, 'epoch': 0.96}
96%|█████████▋| 4131/4286 [31:15:36<1:02:56, 24.37s/it] {'loss': 0.0044, 'grad_norm': 1.8896340200375334, 'learning_rate': 3.6164255716285574e-08, 'completion_length': 266.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.8556548058986664, 'rewards/format_reward': 1.0, 'reward': 1.8556548953056335, 'reward_std': 0.026785715483129025, 'kl': 0.10943603515625, 'epoch': 0.96}
96%|█████████▋| 4132/4286 [31:16:00<1:02:07, 24.20s/it] {'loss': 0.0083, 'grad_norm': 15.49650982160562, 'learning_rate': 3.593093793747083e-08, 'completion_length': 308.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.8005952537059784, 'rewards/format_reward': 1.0, 'reward': 1.8005954027175903, 'reward_std': 0.08784347400069237, 'kl': 0.208984375, 'epoch': 0.96}
96%|█████████▋| 4133/4286 [31:16:24<1:01:39, 24.18s/it] {'loss': 0.0076, 'grad_norm': 2.9623727381189373, 'learning_rate': 3.5697620158656085e-08, 'completion_length': 324.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.8711310029029846, 'rewards/format_reward': 1.0, 'reward': 1.8711311221122742, 'reward_std': 0.03552966006100178, 'kl': 0.190185546875, 'epoch': 0.96}
96%|█████████▋| 4134/4286 [31:16:48<1:01:08, 24.13s/it] {'loss': 0.0125, 'grad_norm': 3.647885292159031, 'learning_rate': 3.546430237984134e-08, 'completion_length': 294.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7440477013587952, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.031603576615452766, 'kl': 0.31494140625, 'epoch': 0.96}
96%|█████████▋| 4135/4286 [31:17:12<1:00:39, 24.10s/it] {'loss': 0.0046, 'grad_norm': 6.6146760591109794, 'learning_rate': 3.5230984601026595e-08, 'completion_length': 302.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6749999821186066, 'rewards/format_reward': 1.0, 'reward': 1.6750001311302185, 'reward_std': 0.07397789135575294, 'kl': 0.11474609375, 'epoch': 0.96}
97%|█████████▋| 4136/4286 [31:17:38<1:02:13, 24.89s/it] {'loss': 0.031, 'grad_norm': 7.711433495323065, 'learning_rate': 3.499766682221185e-08, 'completion_length': 307.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6187500655651093, 'rewards/format_reward': 1.0, 'reward': 1.6187500953674316, 'reward_std': 0.11964286491274834, 'kl': 0.7763671875, 'epoch': 0.97}
97%|█████████▋| 4137/4286 [31:18:03<1:01:37, 24.81s/it] {'loss': 0.0341, 'grad_norm': 3.328027909684026, 'learning_rate': 3.4764349043397106e-08, 'completion_length': 304.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7997768223285675, 'rewards/format_reward': 1.0, 'reward': 1.7997769117355347, 'reward_std': 0.033045271411538124, 'kl': 0.8515625, 'epoch': 0.97}
97%|█████████▋| 4138/4286 [31:18:28<1:01:03, 24.75s/it] {'loss': 0.022, 'grad_norm': 15.942264967806866, 'learning_rate': 3.453103126458236e-08, 'completion_length': 296.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7366072237491608, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.1220238134264946, 'kl': 0.55126953125, 'epoch': 0.97}
97%|█████████▋| 4139/4286 [31:18:52<1:00:29, 24.69s/it] {'loss': 0.0022, 'grad_norm': 1.233558029666119, 'learning_rate': 3.429771348576761e-08, 'completion_length': 311.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.9315476715564728, 'rewards/format_reward': 1.0, 'reward': 1.9315477013587952, 'reward_std': 0.02119971625506878, 'kl': 0.055419921875, 'epoch': 0.97}
97%|█████████▋| 4140/4286 [31:19:17<1:00:18, 24.78s/it] {'loss': 0.0464, 'grad_norm': 29.32730954080036, 'learning_rate': 3.4064395706952865e-08, 'completion_length': 297.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6413690745830536, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.06526251137256622, 'kl': 1.1572265625, 'epoch': 0.97}
97%|█████████▋| 4141/4286 [31:19:42<59:35, 24.66s/it] {'loss': 0.0133, 'grad_norm': 1.2238308710083843, 'learning_rate': 3.383107792813812e-08, 'completion_length': 299.625, 'rewards/only_full_func_accuracy_reward': 0.791666716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7738096714019775, 'reward_std': 0.08333333395421505, 'kl': 0.331298828125, 'epoch': 0.97}
97%|█████████▋| 4142/4286 [31:20:06<59:05, 24.62s/it] {'loss': 0.0248, 'grad_norm': 3.3428453784953835, 'learning_rate': 3.3597760149323376e-08, 'completion_length': 319.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6994048357009888, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6815477013587952, 'reward_std': 0.10786283388733864, 'kl': 0.6171875, 'epoch': 0.97}
97%|█████████▋| 4143/4286 [31:20:29<57:41, 24.21s/it] {'loss': 0.0055, 'grad_norm': 5.005709363631264, 'learning_rate': 3.336444237050863e-08, 'completion_length': 259.4643020629883, 'rewards/only_full_func_accuracy_reward': 0.627976268529892, 'rewards/format_reward': 1.0, 'reward': 1.6279762387275696, 'reward_std': 0.08265923336148262, 'kl': 0.13818359375, 'epoch': 0.97}
97%|█████████▋| 4144/4286 [31:20:53<56:50, 24.02s/it] {'loss': 0.005, 'grad_norm': 132.75231737377453, 'learning_rate': 3.3131124591693886e-08, 'completion_length': 308.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.8869048655033112, 'rewards/format_reward': 1.0, 'reward': 1.8869048953056335, 'reward_std': 0.03755595162510872, 'kl': 0.1240234375, 'epoch': 0.97}
97%|█████████▋| 4145/4286 [31:21:18<57:01, 24.27s/it] {'loss': 0.0149, 'grad_norm': 2.165537662508522, 'learning_rate': 3.289780681287914e-08, 'completion_length': 265.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.854166716337204, 'rewards/format_reward': 1.0, 'reward': 1.8541668057441711, 'reward_std': 0.038476791232824326, 'kl': 0.37255859375, 'epoch': 0.97}
97%|█████████▋| 4146/4286 [31:21:42<56:32, 24.23s/it] {'loss': 0.0461, 'grad_norm': 4.161311740408848, 'learning_rate': 3.266448903406439e-08, 'completion_length': 295.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6383929252624512, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6026787161827087, 'reward_std': 0.11113344877958298, 'kl': 1.14794921875, 'epoch': 0.97}
97%|█████████▋| 4147/4286 [31:22:06<56:05, 24.21s/it] {'loss': 0.0153, 'grad_norm': 7.557164319422949, 'learning_rate': 3.2431171255249646e-08, 'completion_length': 316.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7079082429409027, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6900510787963867, 'reward_std': 0.08854112215340137, 'kl': 0.384765625, 'epoch': 0.97}
97%|█████████▋| 4148/4286 [31:22:31<56:15, 24.46s/it] {'loss': 0.0036, 'grad_norm': 2.656792201742968, 'learning_rate': 3.21978534764349e-08, 'completion_length': 314.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8330357670783997, 'rewards/format_reward': 1.0, 'reward': 1.8330358266830444, 'reward_std': 0.040963370352983475, 'kl': 0.09130859375, 'epoch': 0.97}
97%|█████████▋| 4149/4286 [31:22:55<55:13, 24.19s/it] {'loss': 0.0054, 'grad_norm': 1.004079550991653, 'learning_rate': 3.1964535697620156e-08, 'completion_length': 287.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 1.0, 'reward': 1.799107313156128, 'reward_std': 0.0208333320915699, 'kl': 0.134033203125, 'epoch': 0.97}
97%|█████████▋| 4150/4286 [31:23:20<55:13, 24.37s/it] {'loss': 0.0312, 'grad_norm': 5.704369577252904, 'learning_rate': 3.173121791880541e-08, 'completion_length': 277.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5967262089252472, 'rewards/format_reward': 1.0, 'reward': 1.5967262983322144, 'reward_std': 0.056712403893470764, 'kl': 0.78125, 'epoch': 0.97}
97%|█████████▋| 4151/4286 [31:23:43<54:27, 24.20s/it] {'loss': 0.0048, 'grad_norm': 0.3577941312300584, 'learning_rate': 3.149790013999067e-08, 'completion_length': 306.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.8333333730697632, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.0, 'kl': 0.12109375, 'epoch': 0.97}
'learning_rate': 3.149790013999067e-08, 'completion_length': 306.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.8333333730697632, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.0, 'kl': 0.12109375, 'epoch': 0.97} 97%|█████████▋| 4151/4286 [31:23:43<54:27, 24.20s/it] 97%|█████████▋| 4152/4286 [31:24:08<54:25, 24.37s/it] {'loss': 0.0139, 'grad_norm': 8.019707052096873, 'learning_rate': 3.126458236117592e-08, 'completion_length': 277.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6553571820259094, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6375001072883606, 'reward_std': 0.05833333171904087, 'kl': 0.34912109375, 'epoch': 0.97} 97%|█████████▋| 4152/4286 [31:24:08<54:25, 24.37s/it] 97%|█████████▋| 4153/4286 [31:24:33<54:03, 24.39s/it] {'loss': 0.0104, 'grad_norm': 2.1345479076810125, 'learning_rate': 3.103126458236118e-08, 'completion_length': 280.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.6205357611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.602678656578064, 'reward_std': 0.11883394420146942, 'kl': 0.258544921875, 'epoch': 0.97} 97%|█████████▋| 4153/4286 [31:24:33<54:03, 24.39s/it] 97%|█████████▋| 4154/4286 [31:24:58<54:41, 24.86s/it] {'loss': 0.015, 'grad_norm': 2.864996960615427, 'learning_rate': 3.0797946803546426e-08, 'completion_length': 331.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7544643878936768, 'reward_std': 0.1018976429477334, 'kl': 0.375, 'epoch': 0.97} 97%|█████████▋| 4154/4286 [31:24:58<54:41, 24.86s/it] 97%|█████████▋| 4155/4286 [31:25:24<54:29, 24.96s/it] {'loss': 0.0064, 'grad_norm': 4.943222419092121, 'learning_rate': 3.056462902473168e-08, 'completion_length': 297.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.0357142873108387, 'kl': 0.15869140625, 'epoch': 0.97} 97%|█████████▋| 4155/4286 [31:25:24<54:29, 24.96s/it] 97%|█████████▋| 4156/4286 [31:25:49<54:05, 24.97s/it] {'loss': 0.005, 'grad_norm': 6.457385341159555, 'learning_rate': 3.033131124591694e-08, 'completion_length': 263.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.01785714365541935, 'kl': 0.1240234375, 'epoch': 0.97} 97%|█████████▋| 4156/4286 [31:25:49<54:05, 24.97s/it] 97%|█████████▋| 4157/4286 [31:26:14<53:41, 24.98s/it] {'loss': 0.0213, 'grad_norm': 3.305463662707634, 'learning_rate': 3.009799346710219e-08, 'completion_length': 307.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7008928954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6651785969734192, 'reward_std': 0.12308874353766441, 'kl': 0.53173828125, 'epoch': 0.97} 97%|█████████▋| 4157/4286 [31:26:14<53:41, 24.98s/it] 97%|█████████▋| 4158/4286 [31:26:38<52:50, 24.77s/it] {'loss': 0.0036, 'grad_norm': 3.4446871613622703, 'learning_rate': 2.986467568828745e-08, 'completion_length': 312.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7976190447807312, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.04627084732055664, 'kl': 0.089111328125, 'epoch': 0.97} 97%|█████████▋| 4158/4286 [31:26:38<52:50, 24.77s/it] 97%|█████████▋| 4159/4286 [31:27:04<53:16, 25.17s/it] {'loss': 0.0039, 'grad_norm': 4.09013729218345, 'learning_rate': 2.9631357909472703e-08, 'completion_length': 296.6785888671875, 
'rewards/only_full_func_accuracy_reward': 0.7738095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7738096117973328, 'reward_std': 0.05205097235739231, 'kl': 0.09716796875, 'epoch': 0.97} 97%|█████████▋| 4159/4286 [31:27:04<53:16, 25.17s/it] 97%|█████████▋| 4160/4286 [31:27:29<52:29, 24.99s/it] {'loss': 0.0244, 'grad_norm': 4.387834256904681, 'learning_rate': 2.9398040130657955e-08, 'completion_length': 296.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7042354345321655, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6863783597946167, 'reward_std': 0.09807439893484116, 'kl': 0.612548828125, 'epoch': 0.97} 97%|█████████▋| 4160/4286 [31:27:29<52:29, 24.99s/it] 97%|█████████▋| 4161/4286 [31:27:53<51:26, 24.69s/it] {'loss': 0.0092, 'grad_norm': 6.563405397760752, 'learning_rate': 2.916472235184321e-08, 'completion_length': 288.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6875001192092896, 'reward_std': 0.03411934711039066, 'kl': 0.22998046875, 'epoch': 0.97} 97%|█████████▋| 4161/4286 [31:27:53<51:26, 24.69s/it] 97%|█████████▋| 4162/4286 [31:28:18<51:09, 24.75s/it] {'loss': 0.0124, 'grad_norm': 4.3504062595694615, 'learning_rate': 2.8931404573028465e-08, 'completion_length': 305.125, 'rewards/only_full_func_accuracy_reward': 0.6145834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5967262983322144, 'reward_std': 0.14815396443009377, 'kl': 0.3115234375, 'epoch': 0.97} 97%|█████████▋| 4162/4286 [31:28:18<51:09, 24.75s/it] 97%|█████████▋| 4163/4286 [31:28:44<51:32, 25.14s/it] {'loss': 0.008, 'grad_norm': 6.544650164334339, 'learning_rate': 2.869808679421372e-08, 'completion_length': 301.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.8794643580913544, 'rewards/format_reward': 1.0, 'reward': 1.8794644474983215, 'reward_std': 0.020833331160247326, 'kl': 0.200927734375, 'epoch': 0.97} 97%|█████████▋| 4163/4286 [31:28:44<51:32, 25.14s/it] 97%|█████████▋| 4164/4286 [31:29:07<49:52, 24.53s/it] {'loss': 0.0143, 'grad_norm': 6.254063139716561, 'learning_rate': 2.8464769015398973e-08, 'completion_length': 299.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.74702388048172, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.029761902987957, 'kl': 0.35546875, 'epoch': 0.97} 97%|█████████▋| 4164/4286 [31:29:07<49:52, 24.53s/it] 97%|█████████▋| 4165/4286 [31:29:32<50:10, 24.88s/it] {'loss': 0.0087, 'grad_norm': 3.5685835023085635, 'learning_rate': 2.8231451236584228e-08, 'completion_length': 291.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.8660715222358704, 'rewards/format_reward': 1.0, 'reward': 1.8660715818405151, 'reward_std': 0.05289733596146107, 'kl': 0.21826171875, 'epoch': 0.97} 97%|█████████▋| 4165/4286 [31:29:32<50:10, 24.88s/it] 97%|█████████▋| 4166/4286 [31:29:56<49:01, 24.51s/it] {'loss': 0.0096, 'grad_norm': 0.5858658493339319, 'learning_rate': 2.7998133457769483e-08, 'completion_length': 240.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.8333333134651184, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.011904762126505375, 'kl': 0.2398681640625, 'epoch': 0.97} 97%|█████████▋| 4166/4286 [31:29:56<49:01, 24.51s/it] 97%|█████████▋| 4167/4286 [31:30:22<49:29, 24.96s/it] {'loss': 0.0173, 'grad_norm': 21.46031326354608, 'learning_rate': 2.7764815678954735e-08, 'completion_length': 326.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7827380895614624, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.7648810744285583, 'reward_std': 0.13325035944581032, 'kl': 0.4345703125, 'epoch': 0.97} 97%|█████████▋| 4167/4286 [31:30:22<49:29, 24.96s/it] 97%|█████████▋| 4168/4286 [31:30:48<49:57, 25.40s/it] {'loss': 0.0162, 'grad_norm': 10.352486421812126, 'learning_rate': 2.753149790013999e-08, 'completion_length': 322.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.08358007669448853, 'kl': 0.4033203125, 'epoch': 0.97} 97%|█████████▋| 4168/4286 [31:30:48<49:57, 25.40s/it] 97%|█████████▋| 4169/4286 [31:31:13<49:03, 25.16s/it] {'loss': 0.0052, 'grad_norm': 0.6518954350173964, 'learning_rate': 2.7298180121325242e-08, 'completion_length': 314.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306549549102783, 'reward_std': 0.019238398410379887, 'kl': 0.129150390625, 'epoch': 0.97} 97%|█████████▋| 4169/4286 [31:31:13<49:03, 25.16s/it] 97%|█████████▋| 4170/4286 [31:31:37<47:59, 24.83s/it] {'loss': 0.0345, 'grad_norm': 15.239151475300927, 'learning_rate': 2.7064862342510498e-08, 'completion_length': 300.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 1.0, 'reward': 1.5937501192092896, 'reward_std': 0.07238247245550156, 'kl': 0.8583984375, 'epoch': 0.97} 97%|█████████▋| 4170/4286 [31:31:37<47:59, 24.83s/it] 97%|█████████▋| 4171/4286 [31:32:03<48:11, 25.15s/it] {'loss': 0.0234, 'grad_norm': 2.381399174280318, 'learning_rate': 2.683154456369575e-08, 'completion_length': 338.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.722321480512619, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7044643759727478, 'reward_std': 0.08654303848743439, 'kl': 0.581298828125, 'epoch': 0.97} 97%|█████████▋| 4171/4286 [31:32:03<48:11, 25.15s/it] 97%|█████████▋| 4172/4286 [31:32:27<47:19, 24.91s/it] {'loss': 0.0121, 'grad_norm': 9.186441197711185, 'learning_rate': 2.6598226784881005e-08, 'completion_length': 311.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7767857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7767859101295471, 'reward_std': 0.0773809514939785, 'kl': 0.3037109375, 'epoch': 0.97} 97%|█████████▋| 4172/4286 [31:32:27<47:19, 24.91s/it] 97%|█████████▋| 4173/4286 [31:32:51<46:14, 24.55s/it] {'loss': 0.0176, 'grad_norm': 16.090424245662135, 'learning_rate': 2.636490900606626e-08, 'completion_length': 269.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6532739400863647, 'reward_std': 0.11404913291335106, 'kl': 0.439453125, 'epoch': 0.97} 97%|█████████▋| 4173/4286 [31:32:51<46:14, 24.55s/it] 97%|█████████▋| 4174/4286 [31:33:16<46:00, 24.65s/it] {'loss': 0.0026, 'grad_norm': 1.0130629186220759, 'learning_rate': 2.6131591227251516e-08, 'completion_length': 287.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.0357142873108387, 'kl': 0.0640869140625, 'epoch': 0.97} 97%|█████████▋| 4174/4286 [31:33:16<46:00, 24.65s/it] 97%|█████████▋| 4175/4286 [31:33:40<45:28, 24.58s/it] {'loss': 0.0178, 'grad_norm': 1.9099901259833194, 'learning_rate': 2.5898273448436768e-08, 'completion_length': 304.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.8199405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.802083432674408, 
'reward_std': 0.06417815387248993, 'kl': 0.44287109375, 'epoch': 0.97} 97%|█████████▋| 4175/4286 [31:33:40<45:28, 24.58s/it] 97%|█████████▋| 4176/4286 [31:34:05<45:03, 24.58s/it] {'loss': 0.0316, 'grad_norm': 2.2496654855334692, 'learning_rate': 2.5664955669622023e-08, 'completion_length': 315.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.032836973667144775, 'kl': 0.7890625, 'epoch': 0.97} 97%|█████████▋| 4176/4286 [31:34:05<45:03, 24.58s/it] 97%|█████████▋| 4177/4286 [31:34:31<45:29, 25.05s/it] {'loss': 0.0088, 'grad_norm': 3.641885275713109, 'learning_rate': 2.5431637890807278e-08, 'completion_length': 313.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.770833432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7529762983322144, 'reward_std': 0.0773809589445591, 'kl': 0.22021484375, 'epoch': 0.97} 97%|█████████▋| 4177/4286 [31:34:31<45:29, 25.05s/it] 97%|█████████▋| 4178/4286 [31:34:55<44:20, 24.64s/it] {'loss': 0.0082, 'grad_norm': 5.506266391290662, 'learning_rate': 2.5198320111992534e-08, 'completion_length': 259.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.7589287757873535, 'reward_std': 0.08290597423911095, 'kl': 0.205078125, 'epoch': 0.97} 97%|█████████▋| 4178/4286 [31:34:55<44:20, 24.64s/it] 98%|█████████▊| 4179/4286 [31:35:19<43:54, 24.62s/it] {'loss': 0.0062, 'grad_norm': 12.113640851437173, 'learning_rate': 2.4965002333177786e-08, 'completion_length': 287.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.7601191401481628, 'rewards/format_reward': 1.0, 'reward': 1.7601191401481628, 'reward_std': 0.034626553766429424, 'kl': 0.15625, 'epoch': 0.98} 98%|█████████▊| 4179/4286 [31:35:19<43:54, 24.62s/it] 98%|█████████▊| 4180/4286 [31:35:43<43:02, 24.37s/it] {'loss': 0.0133, 'grad_norm': 33.34866485441476, 'learning_rate': 2.473168455436304e-08, 'completion_length': 307.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.05151607468724251, 'kl': 0.33251953125, 'epoch': 0.98} 98%|█████████▊| 4180/4286 [31:35:43<43:02, 24.37s/it] 98%|█████████▊| 4181/4286 [31:36:08<43:04, 24.61s/it] {'loss': 0.0029, 'grad_norm': 3.2070039050751054, 'learning_rate': 2.4498366775548296e-08, 'completion_length': 307.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.681547611951828, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6636905670166016, 'reward_std': 0.05972513556480408, 'kl': 0.07177734375, 'epoch': 0.98} 98%|█████████▊| 4181/4286 [31:36:08<43:04, 24.61s/it] 98%|█████████▊| 4182/4286 [31:36:32<42:23, 24.46s/it] {'loss': 0.0064, 'grad_norm': 1.3901819989294886, 'learning_rate': 2.4265048996733548e-08, 'completion_length': 311.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.013746432960033417, 'kl': 0.1607666015625, 'epoch': 0.98} 98%|█████████▊| 4182/4286 [31:36:32<42:23, 24.46s/it] 98%|█████████▊| 4183/4286 [31:36:57<42:13, 24.59s/it] {'loss': 0.0389, 'grad_norm': 69.5856074063482, 'learning_rate': 2.4031731217918803e-08, 'completion_length': 311.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7038691639900208, 'reward_std': 0.14443206042051315, 'kl': 0.97265625, 'epoch': 0.98} 
98%|█████████▊| 4184/4286 [31:37:22<41:57, 24.69s/it] {'loss': 0.0078, 'grad_norm': 4.187428733366834, 'learning_rate': 2.379841343910406e-08, 'completion_length': 293.39288330078125, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 'rewards/format_reward': 1.0, 'reward': 1.77827388048172, 'reward_std': 0.059310127049684525, 'kl': 0.195068359375, 'epoch': 0.98}
98%|█████████▊| 4185/4286 [31:37:48<42:08, 25.04s/it] {'loss': 0.0265, 'grad_norm': 15.880944531754915, 'learning_rate': 2.3565095660289314e-08, 'completion_length': 287.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.736607164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7008929252624512, 'reward_std': 0.18382930010557175, 'kl': 0.6640625, 'epoch': 0.98}
98%|█████████▊| 4186/4286 [31:38:12<41:01, 24.61s/it] {'loss': 0.0066, 'grad_norm': 53.12347810551891, 'learning_rate': 2.3331777881474566e-08, 'completion_length': 280.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.4494047909975052, 'rewards/format_reward': 1.0, 'reward': 1.4494048357009888, 'reward_std': 0.047619045712053776, 'kl': 0.16552734375, 'epoch': 0.98}
98%|█████████▊| 4187/4286 [31:38:36<40:31, 24.56s/it] {'loss': 0.0259, 'grad_norm': 5.033490498448552, 'learning_rate': 2.309846010265982e-08, 'completion_length': 313.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.8511905372142792, 'rewards/format_reward': 1.0, 'reward': 1.8511905670166016, 'reward_std': 0.04761904664337635, 'kl': 0.6455078125, 'epoch': 0.98}
98%|█████████▊| 4188/4286 [31:39:00<40:01, 24.50s/it] {'loss': 0.0174, 'grad_norm': 6.132855029627525, 'learning_rate': 2.2865142323845077e-08, 'completion_length': 313.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949405670166016, 'reward_std': 0.0814377423375845, 'kl': 0.434814453125, 'epoch': 0.98}
98%|█████████▊| 4189/4286 [31:39:27<40:32, 25.08s/it] {'loss': 0.0082, 'grad_norm': 1.878311132264954, 'learning_rate': 2.2631824545030332e-08, 'completion_length': 318.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.6562500894069672, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6383930444717407, 'reward_std': 0.0922619104385376, 'kl': 0.2041015625, 'epoch': 0.98}
98%|█████████▊| 4190/4286 [31:39:52<40:17, 25.18s/it] {'loss': 0.011, 'grad_norm': 15.25383464765466, 'learning_rate': 2.2398506766215584e-08, 'completion_length': 303.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.011904759332537651, 'kl': 0.2752685546875, 'epoch': 0.98}
98%|█████████▊| 4191/4286 [31:40:17<39:46, 25.12s/it] {'loss': 0.0242, 'grad_norm': 12.94128563710391, 'learning_rate': 2.216518898740084e-08, 'completion_length': 297.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.5877976566553116, 'rewards/format_reward': 1.0, 'reward': 1.5877977013587952, 'reward_std': 0.12201213091611862, 'kl': 0.60546875, 'epoch': 0.98}
98%|█████████▊| 4192/4286 [31:40:41<38:36, 24.64s/it] {'loss': 0.0118, 'grad_norm': 2.8157961419054764, 'learning_rate': 2.1931871208586094e-08, 'completion_length': 292.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.07962912321090698, 'kl': 0.2958984375, 'epoch': 0.98}
98%|█████████▊| 4193/4286 [31:41:06<38:24, 24.77s/it] {'loss': 0.0136, 'grad_norm': 2.762600856466551, 'learning_rate': 2.169855342977135e-08, 'completion_length': 314.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 1.0, 'reward': 1.7529763579368591, 'reward_std': 0.010309826582670212, 'kl': 0.3388671875, 'epoch': 0.98}
98%|█████████▊| 4194/4286 [31:41:31<37:56, 24.75s/it] {'loss': 0.003, 'grad_norm': 3.5621474137933267, 'learning_rate': 2.1465235650956602e-08, 'completion_length': 269.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.8422619700431824, 'rewards/format_reward': 1.0, 'reward': 1.8422620296478271, 'reward_std': 0.017857140861451626, 'kl': 0.076171875, 'epoch': 0.98}
98%|█████████▊| 4195/4286 [31:41:55<37:13, 24.54s/it] {'loss': 0.0032, 'grad_norm': 2.7290071112465255, 'learning_rate': 2.1231917872141857e-08, 'completion_length': 289.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7946429252624512, 'reward_std': 0.022214585915207863, 'kl': 0.080322265625, 'epoch': 0.98}
98%|█████████▊| 4196/4286 [31:42:20<37:01, 24.68s/it] {'loss': 0.0179, 'grad_norm': 7.995374528972509, 'learning_rate': 2.0998600093327112e-08, 'completion_length': 293.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.04258750379085541, 'kl': 0.4462890625, 'epoch': 0.98}
98%|█████████▊| 4197/4286 [31:42:43<36:05, 24.33s/it] {'loss': 0.0096, 'grad_norm': 1.5080988928864556, 'learning_rate': 2.0765282314512368e-08, 'completion_length': 279.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244048953056335, 'reward_std': 0.029761902987957, 'kl': 0.240234375, 'epoch': 0.98}
98%|█████████▊| 4198/4286 [31:43:08<35:58, 24.53s/it] {'loss': 0.0067, 'grad_norm': 8.943531114955894, 'learning_rate': 2.053196453569762e-08, 'completion_length': 303.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7767858505249023, 'reward_std': 0.02908780239522457, 'kl': 0.16845703125, 'epoch': 0.98}
98%|█████████▊| 4199/4286 [31:43:33<35:47, 24.68s/it] {'loss': 0.0188, 'grad_norm': 3.4731732454496576, 'learning_rate': 2.0298646756882875e-08, 'completion_length': 275.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.822916716337204, 'rewards/format_reward': 1.0, 'reward': 1.8229168057441711, 'reward_std': 0.041510168462991714, 'kl': 0.47119140625, 'epoch': 0.98}
98%|█████████▊| 4200/4286 [31:43:59<36:01, 25.13s/it] {'loss': 0.0144, 'grad_norm': 8.32076218852742, 'learning_rate': 2.006532897806813e-08, 'completion_length': 316.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.7934524118900299, 'rewards/format_reward': 1.0, 'reward': 1.793452501296997, 'reward_std': 0.08690476883202791, 'kl': 0.36181640625, 'epoch': 0.98}
98%|█████████▊| 4201/4286 [31:52:26<4:00:01, 169.43s/it] {'loss': 0.0194, 'grad_norm': 9.205522837200705, 'learning_rate': 1.9832011199253382e-08, 'completion_length': 287.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6101191639900208, 'reward_std': 0.13107239827513695, 'kl': 0.486328125, 'epoch': 0.98}
98%|█████████▊| 4202/4286 [31:52:51<2:56:51, 126.33s/it] {'loss': 0.0138, 'grad_norm': 7.769405514200148, 'learning_rate': 1.9598693420438638e-08, 'completion_length': 343.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.0357142873108387, 'kl': 0.3466796875, 'epoch': 0.98}
98%|█████████▊| 4203/4286 [31:53:15<2:12:17, 95.64s/it] {'loss': 0.0104, 'grad_norm': 6.030732210044635, 'learning_rate': 1.936537564162389e-08, 'completion_length': 245.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.867559552192688, 'rewards/format_reward': 1.0, 'reward': 1.8675596714019775, 'reward_std': 0.06250000186264515, 'kl': 0.26025390625, 'epoch': 0.98}
98%|█████████▊| 4204/4286 [31:53:41<1:41:53, 74.56s/it] {'loss': 0.0036, 'grad_norm': 6.089637849817839, 'learning_rate': 1.9132057862809145e-08, 'completion_length': 310.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.8050595819950104, 'rewards/format_reward': 1.0, 'reward': 1.8050596117973328, 'reward_std': 0.060946544632315636, 'kl': 0.09002685546875, 'epoch': 0.98}
98%|█████████▊| 4205/4286 [31:54:06<1:20:47, 59.84s/it] {'loss': 0.0052, 'grad_norm': 14.910269216770363, 'learning_rate': 1.8898740083994397e-08, 'completion_length': 319.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7366072535514832, 'reward_std': 0.07708827033638954, 'kl': 0.130615234375, 'epoch': 0.98}
98%|█████████▊| 4206/4286 [31:54:32<1:06:06, 49.59s/it] {'loss': 0.0052, 'grad_norm': 2.425075327932775, 'learning_rate': 1.8665422305179652e-08, 'completion_length': 308.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.7232143878936768, 'reward_std': 0.01785714365541935, 'kl': 0.130859375, 'epoch': 0.98}
98%|█████████▊| 4207/4286 [31:54:57<55:38, 42.26s/it] {'loss': 0.0139, 'grad_norm': 7.946533647840585, 'learning_rate': 1.8432104526364907e-08, 'completion_length': 274.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.821428656578064, 'reward_std': 0.09204822778701782, 'kl': 0.3480224609375, 'epoch': 0.98}
98%|█████████▊| 4208/4286 [31:55:20<47:32, 36.57s/it] {'loss': 0.0026, 'grad_norm': 3.995410896326546, 'learning_rate': 1.8198786747550163e-08, 'completion_length': 285.75, 'rewards/only_full_func_accuracy_reward': 0.766369104385376, 'rewards/format_reward': 1.0, 'reward': 1.7663691639900208, 'reward_std': 0.03273809049278498, 'kl': 0.066162109375, 'epoch': 0.98}
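The step time spikes from ~25 s/it to 169.43 s/it at step 4201 and then decays as the tqdm moving average recovers; a pause of that size is consistent with a periodic checkpoint save, though the log does not say so. To inspect such anomalies, the records can be parsed back into Python dicts; a minimal sketch, assuming the log is saved as a plain-text file (the path is hypothetical):

```python
# Recover (step, metrics) pairs from this training log for plotting/analysis.
import ast
import re

with open("train.log", encoding="utf-8") as f:  # hypothetical path
    text = f.read()

records = []
# Each step prints "<step>/4286 [elapsed<eta, rate]" followed by a Python-literal dict.
for m in re.finditer(r"(\d+)/4286 \[[^\]]+\]\s*(\{'loss'.*?\})", text, flags=re.DOTALL):
    records.append((int(m.group(1)), ast.literal_eval(m.group(2))))

kl = [p['kl'] for _, p in records]
print(f"{len(records)} records, mean KL {sum(kl) / len(kl):.3f}")
```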
98%|█████████▊| 4209/4286 [31:55:46<42:48, 33.36s/it] {'loss': 0.0137, 'grad_norm': 3.5567978997678615, 'learning_rate': 1.7965468968735415e-08, 'completion_length': 320.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7276786267757416, 'rewards/format_reward': 1.0, 'reward': 1.7276787161827087, 'reward_std': 0.06455380283296108, 'kl': 0.34033203125, 'epoch': 0.98}
98%|█████████▊| 4210/4286 [31:56:10<38:43, 30.58s/it] {'loss': 0.0118, 'grad_norm': 12.683358140863033, 'learning_rate': 1.773215118992067e-08, 'completion_length': 279.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7113095223903656, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.09379709139466286, 'kl': 0.2939453125, 'epoch': 0.98}
98%|█████████▊| 4211/4286 [31:56:35<36:04, 28.86s/it] {'loss': 0.005, 'grad_norm': 1.9300492058073282, 'learning_rate': 1.7498833411105925e-08, 'completion_length': 295.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.8467262387275696, 'rewards/format_reward': 1.0, 'reward': 1.8467262983322144, 'reward_std': 0.041452985256910324, 'kl': 0.125244140625, 'epoch': 0.98}
98%|█████████▊| 4212/4286 [31:57:00<33:57, 27.53s/it] {'loss': 0.0164, 'grad_norm': 0.874909003162468, 'learning_rate': 1.726551563229118e-08, 'completion_length': 299.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.04627084545791149, 'kl': 0.40966796875, 'epoch': 0.98}
98%|█████████▊| 4213/4286 [31:57:23<32:02, 26.34s/it] {'loss': 0.0099, 'grad_norm': 8.659043415162229, 'learning_rate': 1.7032197853476433e-08, 'completion_length': 271.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7723215818405151, 'reward_std': 0.05876357667148113, 'kl': 0.24755859375, 'epoch': 0.98}
98%|█████████▊| 4214/4286 [31:57:47<30:53, 25.74s/it] {'loss': 0.0065, 'grad_norm': 1.0537829157590362, 'learning_rate': 1.6798880074661688e-08, 'completion_length': 279.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6696429252624512, 'rewards/format_reward': 1.0, 'reward': 1.669642984867096, 'reward_std': 0.01785714365541935, 'kl': 0.161865234375, 'epoch': 0.98}
98%|█████████▊| 4215/4286 [31:58:13<30:14, 25.55s/it] {'loss': 0.0039, 'grad_norm': 2.4470712749310124, 'learning_rate': 1.6565562295846943e-08, 'completion_length': 328.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7693453133106232, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.05059524206444621, 'kl': 0.0982666015625, 'epoch': 0.98}
98%|█████████▊| 4216/4286 [31:58:36<29:00, 24.87s/it] {'loss': 0.0258, 'grad_norm': 4.60873149887756, 'learning_rate': 1.6332244517032195e-08, 'completion_length': 299.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6919642984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6562500596046448, 'reward_std': 0.16703305207192898, 'kl': 0.64501953125, 'epoch': 0.98}
98%|█████████▊| 4217/4286 [31:59:00<28:18, 24.62s/it] {'loss': 0.0031, 'grad_norm': 6.513512937181745, 'learning_rate': 1.609892673821745e-08, 'completion_length': 278.2143020629883, 'rewards/only_full_func_accuracy_reward': 0.8035714328289032, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.03571429057046771, 'kl': 0.0780029296875, 'epoch': 0.98}
98%|█████████▊| 4218/4286 [31:59:24<27:45, 24.49s/it] {'loss': 0.0049, 'grad_norm': 0.7230167325234027, 'learning_rate': 1.5865608959402706e-08, 'completion_length': 272.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.8660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.8660715818405151, 'reward_std': 0.01785714365541935, 'kl': 0.1220703125, 'epoch': 0.98}
98%|█████████▊| 4219/4286 [31:59:51<28:01, 25.10s/it] {'loss': 0.0112, 'grad_norm': 0.8723659371054437, 'learning_rate': 1.563229118058796e-08, 'completion_length': 288.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.0416666679084301, 'kl': 0.2802734375, 'epoch': 0.98}
98%|█████████▊| 4220/4286 [32:00:14<27:08, 24.68s/it] {'loss': 0.0046, 'grad_norm': 11.999771931903716, 'learning_rate': 1.5398973401773213e-08, 'completion_length': 311.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7247024178504944, 'rewards/format_reward': 1.0, 'reward': 1.724702537059784, 'reward_std': 0.07280982658267021, 'kl': 0.1142578125, 'epoch': 0.98}
98%|█████████▊| 4221/4286 [32:00:39<26:53, 24.82s/it] {'loss': 0.0067, 'grad_norm': 2.291700778878797, 'learning_rate': 1.516565562295847e-08, 'completion_length': 314.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7285714745521545, 'rewards/format_reward': 1.0, 'reward': 1.7285714745521545, 'reward_std': 0.01904762117192149, 'kl': 0.16748046875, 'epoch': 0.98}
99%|█████████▊| 4222/4286 [32:01:03<25:59, 24.37s/it] {'loss': 0.0099, 'grad_norm': 26.22747993321617, 'learning_rate': 1.4932337844143724e-08, 'completion_length': 293.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7752977013587952, 'reward_std': 0.04464286006987095, 'kl': 0.24609375, 'epoch': 0.99}
99%|█████████▊| 4223/4286 [32:01:28<25:44, 24.52s/it] {'loss': 0.0036, 'grad_norm': 7.252760079122813, 'learning_rate': 1.4699020065328977e-08, 'completion_length': 296.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7767857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7767857909202576, 'reward_std': 0.011904762359336019, 'kl': 0.09033203125, 'epoch': 0.99}
99%|█████████▊| 4224/4286 [32:01:52<25:26, 24.62s/it] {'loss': 0.0045, 'grad_norm': 3.502724406349283, 'learning_rate': 1.4465702286514233e-08, 'completion_length': 305.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.8452381193637848, 'rewards/format_reward': 1.0, 'reward': 1.845238208770752, 'reward_std': 0.032524412497878075, 'kl': 0.11181640625, 'epoch': 0.99}
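Across these records the reported loss is almost exactly 0.04 times the logged KL (0.0031/0.0780029296875 ≈ 0.0397 at step 4217, 0.0049/0.1220703125 ≈ 0.0401 at step 4218, 0.0112/0.2802734375 ≈ 0.0400 at step 4219). This is what GRPO looks like when the group-normalized advantage term averages out near zero at the logging point and the loss is dominated by the KL penalty; it suggests a KL coefficient β of 0.04, an inference from the ratios rather than a setting shown in this log.

```python
# Ratio check using values copied from the records above: loss ≈ beta * KL.
pairs = [(0.0031, 0.0780029296875),  # step 4217
         (0.0049, 0.1220703125),     # step 4218
         (0.0112, 0.2802734375)]     # step 4219
for loss, kl in pairs:
    print(f"loss/kl = {loss / kl:.4f}")  # each prints close to 0.04
```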
99%|█████████▊| 4225/4286 [32:02:17<24:55, 24.51s/it] {'loss': 0.0202, 'grad_norm': 1.795214986551, 'learning_rate': 1.4232384507699486e-08, 'completion_length': 297.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7157738208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6979167461395264, 'reward_std': 0.09410358220338821, 'kl': 0.505859375, 'epoch': 0.99}
99%|█████████▊| 4226/4286 [32:02:41<24:21, 24.36s/it] {'loss': 0.0111, 'grad_norm': 6.531974433201092, 'learning_rate': 1.3999066728884742e-08, 'completion_length': 290.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.8619048297405243, 'rewards/format_reward': 1.0, 'reward': 1.8619048595428467, 'reward_std': 0.02660532481968403, 'kl': 0.2763671875, 'epoch': 0.99}
99%|█████████▊| 4227/4286 [32:03:05<24:01, 24.44s/it] {'loss': 0.0122, 'grad_norm': 35.23371927581423, 'learning_rate': 1.3765748950069995e-08, 'completion_length': 323.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7294643819332123, 'rewards/format_reward': 1.0, 'reward': 1.7294644117355347, 'reward_std': 0.038525485433638096, 'kl': 0.3037109375, 'epoch': 0.99}
99%|█████████▊| 4228/4286 [32:03:30<23:47, 24.62s/it] {'loss': 0.0074, 'grad_norm': 7.542223192479075, 'learning_rate': 1.3532431171255249e-08, 'completion_length': 302.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7273809909820557, 'rewards/format_reward': 1.0, 'reward': 1.7273810505867004, 'reward_std': 0.06904762610793114, 'kl': 0.185302734375, 'epoch': 0.99}
99%|█████████▊| 4229/4286 [32:03:55<23:26, 24.68s/it] {'loss': 0.0175, 'grad_norm': 0.7253860706950005, 'learning_rate': 1.3299113392440503e-08, 'completion_length': 255.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7172620296478271, 'reward_std': 0.07738095708191395, 'kl': 0.43896484375, 'epoch': 0.99}
99%|█████████▊| 4230/4286 [32:04:20<23:11, 24.85s/it] {'loss': 0.0073, 'grad_norm': 3.3251215705108663, 'learning_rate': 1.3065795613625758e-08, 'completion_length': 320.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7752977311611176, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.03273809980601072, 'kl': 0.1824951171875, 'epoch': 0.99}
99%|█████████▊| 4231/4286 [32:04:44<22:31, 24.57s/it] {'loss': 0.005, 'grad_norm': 2.573123951188579, 'learning_rate': 1.2832477834811011e-08, 'completion_length': 277.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.7351191341876984, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.02976190857589245, 'kl': 0.124755859375, 'epoch': 0.99}
99%|█████████▊| 4232/4286 [32:05:10<22:16, 24.75s/it] {'loss': 0.0062, 'grad_norm': 2.998573285447371, 'learning_rate': 1.2599160055996267e-08, 'completion_length': 343.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.8125000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.04761904664337635, 'kl': 0.15478515625, 'epoch': 0.99}
99%|█████████▉| 4233/4286 [32:05:34<21:39, 24.52s/it] {'loss': 0.01, 'grad_norm': 7.166609963254222, 'learning_rate': 1.236584227718152e-08, 'completion_length': 274.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7872024774551392, 'rewards/format_reward': 1.0, 'reward': 1.787202537059784, 'reward_std': 0.0565476156771183, 'kl': 0.25048828125, 'epoch': 0.99}
99%|█████████▉| 4234/4286 [32:05:58<21:17, 24.57s/it] {'loss': 0.009, 'grad_norm': 3.539531426091377, 'learning_rate': 1.2132524498366774e-08, 'completion_length': 307.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7160714566707611, 'rewards/format_reward': 1.0, 'reward': 1.7160715460777283, 'reward_std': 0.03906228579580784, 'kl': 0.224609375, 'epoch': 0.99}
99%|█████████▉| 4235/4286 [32:06:25<21:22, 25.16s/it] {'loss': 0.0046, 'grad_norm': 3.780010107938373, 'learning_rate': 1.189920671955203e-08, 'completion_length': 313.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6488095819950104, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.07578602060675621, 'kl': 0.115234375, 'epoch': 0.99}
99%|█████████▉| 4236/4286 [32:06:49<20:38, 24.77s/it] {'loss': 0.006, 'grad_norm': 27.427302058936803, 'learning_rate': 1.1665888940737283e-08, 'completion_length': 287.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 1.0, 'reward': 1.758928656578064, 'reward_std': 0.09112739562988281, 'kl': 0.14892578125, 'epoch': 0.99}
99%|█████████▉| 4237/4286 [32:07:13<20:06, 24.63s/it] {'loss': 0.0081, 'grad_norm': 63.286782968580596, 'learning_rate': 1.1432571161922538e-08, 'completion_length': 322.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.8250000476837158, 'rewards/format_reward': 1.0, 'reward': 1.8250000476837158, 'reward_std': 0.06585775315761566, 'kl': 0.201904296875, 'epoch': 0.99}
99%|█████████▉| 4238/4286 [32:07:37<19:31, 24.40s/it] {'loss': 0.0102, 'grad_norm': 1.6050761517780467, 'learning_rate': 1.1199253383107792e-08, 'completion_length': 324.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.8125000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8125000596046448, 'reward_std': 0.01785714365541935, 'kl': 0.255859375, 'epoch': 0.99}
99%|█████████▉| 4239/4286 [32:08:02<19:15, 24.58s/it] {'loss': 0.0074, 'grad_norm': 0.9039976395661072, 'learning_rate': 1.0965935604293047e-08, 'completion_length': 310.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.011904759332537651, 'kl': 0.1865234375, 'epoch': 0.99}
99%|█████████▉| 4240/4286 [32:08:26<18:48, 24.53s/it] {'loss': 0.0104, 'grad_norm': 1.2550585721534873, 'learning_rate': 1.0732617825478301e-08, 'completion_length': 285.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875001192092896, 'reward_std': 0.01785714365541935, 'kl': 0.26123046875, 'epoch': 0.99}
99%|█████████▉| 4241/4286 [32:08:50<18:20, 24.46s/it] {'loss': 0.0037, 'grad_norm': 2.625600521440827, 'learning_rate': 1.0499300046663556e-08, 'completion_length': 284.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.05016787722706795, 'kl': 0.092041015625, 'epoch': 0.99}
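The 'learning_rate' column in this tail falls by a fixed ~2.3332e-10 per step (compare steps 4233 and 4234 above) and reaches exactly 0.0 at step 4286, matching a linear decay schedule; multiplying the per-step decrement by the 4286 total steps gives a peak rate of about 1e-06, an inferred value that does not appear in this excerpt. A sketch that reproduces the logged values under that assumption:

```python
# Linear decay schedule consistent with the 'learning_rate' column.
TOTAL_STEPS = 4286
PEAK_LR = 1.0e-06  # inferred: per-step decrement (~2.333e-10) * TOTAL_STEPS

def lr_at(step: int) -> float:
    # Decays linearly to zero at the final optimizer step.
    return PEAK_LR * (TOTAL_STEPS - step) / TOTAL_STEPS

print(lr_at(4233))  # ~1.2366e-08, matching the step-4233 record
print(lr_at(4286))  # 0.0, matching the final record
```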
99%|█████████▉| 4242/4286 [32:09:16<18:09, 24.75s/it] {'loss': 0.0247, 'grad_norm': 5.112828310851577, 'learning_rate': 1.026598226784881e-08, 'completion_length': 308.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125000596046448, 'reward_std': 0.06815122999250889, 'kl': 0.619140625, 'epoch': 0.99}
99%|█████████▉| 4243/4286 [32:09:41<17:50, 24.90s/it] {'loss': 0.0086, 'grad_norm': 5.123079408727983, 'learning_rate': 1.0032664489034065e-08, 'completion_length': 324.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.7449405193328857, 'rewards/format_reward': 1.0, 'reward': 1.7449405789375305, 'reward_std': 0.13337143883109093, 'kl': 0.2158203125, 'epoch': 0.99}
99%|█████████▉| 4244/4286 [32:10:07<17:34, 25.12s/it] {'loss': 0.0076, 'grad_norm': 7.887007607716984, 'learning_rate': 9.799346710219319e-09, 'completion_length': 311.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6757937073707581, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6400794982910156, 'reward_std': 0.10079365409910679, 'kl': 0.190673828125, 'epoch': 0.99}
99%|█████████▉| 4245/4286 [32:10:31<17:03, 24.97s/it] {'loss': 0.0199, 'grad_norm': 7.549527615645712, 'learning_rate': 9.566028931404572e-09, 'completion_length': 321.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7172618806362152, 'rewards/format_reward': 1.0, 'reward': 1.717262089252472, 'reward_std': 0.059523806907236576, 'kl': 0.498291015625, 'epoch': 0.99}
99%|█████████▉| 4246/4286 [32:10:55<16:25, 24.63s/it] {'loss': 0.0076, 'grad_norm': 1.7673107863256063, 'learning_rate': 9.332711152589826e-09, 'completion_length': 292.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.767857313156128, 'reward_std': 0.053818171843886375, 'kl': 0.18896484375, 'epoch': 0.99}
99%|█████████▉| 4247/4286 [32:11:19<15:48, 24.31s/it] {'loss': 0.0038, 'grad_norm': 3.146999001252828, 'learning_rate': 9.099393373775081e-09, 'completion_length': 307.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.84077388048172, 'rewards/format_reward': 1.0, 'reward': 1.8407739400863647, 'reward_std': 0.008928571827709675, 'kl': 0.0941162109375, 'epoch': 0.99}
99%|█████████▉| 4248/4286 [32:11:43<15:18, 24.18s/it] {'loss': 0.0041, 'grad_norm': 4.437114644840485, 'learning_rate': 8.866075594960335e-09, 'completion_length': 298.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.8869048357009888, 'rewards/format_reward': 1.0, 'reward': 1.8869048953056335, 'reward_std': 0.07142857648432255, 'kl': 0.1015625, 'epoch': 0.99}
99%|█████████▉| 4249/4286 [32:12:07<15:00, 24.33s/it] {'loss': 0.0091, 'grad_norm': 1.0951518103748958, 'learning_rate': 8.63275781614559e-09, 'completion_length': 284.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.7455357313156128, 'rewards/format_reward': 1.0, 'reward': 1.7455357909202576, 'reward_std': 0.032738094218075275, 'kl': 0.2265625, 'epoch': 0.99}
99%|█████████▉| 4250/4286 [32:12:33<14:51, 24.76s/it] {'loss': 0.0238, 'grad_norm': 35.37831124883009, 'learning_rate': 8.399440037330844e-09, 'completion_length': 308.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.7907738983631134, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7729167938232422, 'reward_std': 0.09702381258830428, 'kl': 0.5966796875, 'epoch': 0.99}
99%|█████████▉| 4251/4286 [32:12:57<14:16, 24.46s/it] {'loss': 0.0061, 'grad_norm': 5.062738534270809, 'learning_rate': 8.166122258516098e-09, 'completion_length': 284.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.756845235824585, 'rewards/format_reward': 1.0, 'reward': 1.7568452954292297, 'reward_std': 0.07156267296522856, 'kl': 0.152099609375, 'epoch': 0.99}
99%|█████████▉| 4252/4286 [32:13:21<13:44, 24.24s/it] {'loss': 0.0102, 'grad_norm': 26.893131224082524, 'learning_rate': 7.932804479701353e-09, 'completion_length': 308.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7693453133106232, 'rewards/format_reward': 1.0, 'reward': 1.7693454027175903, 'reward_std': 0.0446428582072258, 'kl': 0.25341796875, 'epoch': 0.99}
99%|█████████▉| 4253/4286 [32:13:45<13:18, 24.20s/it] {'loss': 0.0049, 'grad_norm': 0.639492322806074, 'learning_rate': 7.699486700886607e-09, 'completion_length': 308.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6383929252624512, 'rewards/format_reward': 1.0, 'reward': 1.638392984867096, 'reward_std': 0.008928571827709675, 'kl': 0.1220703125, 'epoch': 0.99}
99%|█████████▉| 4254/4286 [32:14:10<13:04, 24.52s/it] {'loss': 0.0071, 'grad_norm': 4.185909443050064, 'learning_rate': 7.466168922071862e-09, 'completion_length': 288.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.8556548058986664, 'rewards/format_reward': 1.0, 'reward': 1.8556548953056335, 'reward_std': 0.056547620333731174, 'kl': 0.1787109375, 'epoch': 0.99}
99%|█████████▉| 4255/4286 [32:14:35<12:43, 24.64s/it] {'loss': 0.0034, 'grad_norm': 5.433452419815629, 'learning_rate': 7.232851143257116e-09, 'completion_length': 330.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7455357909202576, 'rewards/format_reward': 1.0, 'reward': 1.7455358505249023, 'reward_std': 0.07112791668623686, 'kl': 0.083740234375, 'epoch': 0.99}
99%|█████████▉| 4256/4286 [32:14:58<12:07, 24.26s/it] {'loss': 0.0197, 'grad_norm': 1.806354874705167, 'learning_rate': 6.999533364442371e-09, 'completion_length': 262.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.7872024476528168, 'rewards/format_reward': 1.0, 'reward': 1.787202537059784, 'reward_std': 0.068452388048172, 'kl': 0.494140625, 'epoch': 0.99}
99%|█████████▉| 4257/4286 [32:15:22<11:39, 24.13s/it] {'loss': 0.0039, 'grad_norm': 4.386748771672115, 'learning_rate': 6.7662155856276244e-09, 'completion_length': 289.375, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6830357909202576, 'reward_std': 0.0295482249930501, 'kl': 0.0966796875, 'epoch': 0.99}
99%|█████████▉| 4258/4286 [32:15:46<11:17, 24.19s/it] {'loss': 0.004, 'grad_norm': 2.6295879380903755, 'learning_rate': 6.532897806812879e-09, 'completion_length': 314.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7961310148239136, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.039858050644397736, 'kl': 0.10009765625, 'epoch': 0.99}
99%|█████████▉| 4259/4286 [32:16:11<10:55, 24.28s/it] {'loss': 0.0078, 'grad_norm': 12.545893610995615, 'learning_rate': 6.299580027998133e-09, 'completion_length': 276.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.7113095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.029761909740045667, 'kl': 0.19580078125, 'epoch': 0.99}
99%|█████████▉| 4260/4286 [32:16:35<10:32, 24.33s/it] {'loss': 0.0184, 'grad_norm': 6.435699941796102, 'learning_rate': 6.066262249183387e-09, 'completion_length': 288.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7410715222358704, 'reward_std': 0.11543039605021477, 'kl': 0.4599609375, 'epoch': 0.99}
99%|█████████▉| 4261/4286 [32:17:00<10:12, 24.50s/it] {'loss': 0.0031, 'grad_norm': 3.281301560295882, 'learning_rate': 5.8329444703686415e-09, 'completion_length': 299.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6961309909820557, 'rewards/format_reward': 1.0, 'reward': 1.6961310505867004, 'reward_std': 0.10547530651092529, 'kl': 0.07763671875, 'epoch': 0.99}
99%|█████████▉| 4262/4286 [32:17:25<09:52, 24.68s/it] {'loss': 0.0169, 'grad_norm': 4.067742995854065, 'learning_rate': 5.599626691553896e-09, 'completion_length': 313.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.761904776096344, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7261905670166016, 'reward_std': 0.13922540098428726, 'kl': 0.422119140625, 'epoch': 0.99}
99%|█████████▉| 4263/4286 [32:17:49<09:21, 24.42s/it] {'loss': 0.0079, 'grad_norm': 3.6549615897220527, 'learning_rate': 5.3663089127391504e-09, 'completion_length': 276.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160714626312256, 'reward_std': 0.06990811601281166, 'kl': 0.1982421875, 'epoch': 0.99}
99%|█████████▉| 4264/4286 [32:18:13<08:52, 24.20s/it] {'loss': 0.0274, 'grad_norm': 12.885053853223186, 'learning_rate': 5.132991133924405e-09, 'completion_length': 276.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.04983501136302948, 'kl': 0.685546875, 'epoch': 0.99}
100%|█████████▉| 4265/4286 [32:18:37<08:28, 24.22s/it] {'loss': 0.006, 'grad_norm': 16.162517854436345, 'learning_rate': 4.899673355109659e-09, 'completion_length': 316.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.05222323350608349, 'kl': 0.1494140625, 'epoch': 1.0}
100%|█████████▉| 4266/4286 [32:19:02<08:09, 24.47s/it] {'loss': 0.0206, 'grad_norm': 2.366947701050969, 'learning_rate': 4.666355576294913e-09, 'completion_length': 304.25, 'rewards/only_full_func_accuracy_reward': 0.8095238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.0357142873108387, 'kl': 0.5150146484375, 'epoch': 1.0}
100%|█████████▉| 4267/4286 [32:19:26<07:40, 24.23s/it] {'loss': 0.0188, 'grad_norm': 4.345195258717794, 'learning_rate': 4.4330377974801675e-09, 'completion_length': 286.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7961310148239136, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.04915658384561539, 'kl': 0.46923828125, 'epoch': 1.0}
100%|█████████▉| 4268/4286 [32:19:50<07:13, 24.07s/it] {'loss': 0.015, 'grad_norm': 3.664214429110046, 'learning_rate': 4.199720018665422e-09, 'completion_length': 286.19644927978516, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7336310744285583, 'reward_std': 0.058389291167259216, 'kl': 0.37353515625, 'epoch': 1.0}
100%|█████████▉| 4269/4286 [32:20:14<06:53, 24.30s/it] {'loss': 0.0084, 'grad_norm': 3.2011387059798984, 'learning_rate': 3.9664022398506764e-09, 'completion_length': 286.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.029761902987957, 'kl': 0.2095947265625, 'epoch': 1.0}
100%|█████████▉| 4270/4286 [32:20:39<06:30, 24.38s/it] {'loss': 0.0077, 'grad_norm': 2.590775876136862, 'learning_rate': 3.733084461035931e-09, 'completion_length': 314.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.6619898229837418, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6441327333450317, 'reward_std': 0.11726503446698189, 'kl': 0.19140625, 'epoch': 1.0}
100%|█████████▉| 4271/4286 [32:21:02<06:01, 24.11s/it] {'loss': 0.0111, 'grad_norm': 4.9916507602579525, 'learning_rate': 3.4997666822211854e-09, 'completion_length': 278.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.7827381193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7648810744285583, 'reward_std': 0.07151959463953972, 'kl': 0.277099609375, 'epoch': 1.0}
100%|█████████▉| 4272/4286 [32:21:27<05:38, 24.21s/it] {'loss': 0.0246, 'grad_norm': 1.9987696276274993, 'learning_rate': 3.2664489034064395e-09, 'completion_length': 301.2143096923828, 'rewards/only_full_func_accuracy_reward': 0.7008929252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6830358505249023, 'reward_std': 0.08035714644938707, 'kl': 0.6162109375, 'epoch': 1.0}
100%|█████████▉| 4273/4286 [32:21:52<05:19, 24.55s/it] {'loss': 0.0158, 'grad_norm': 15.099659559727115, 'learning_rate': 3.0331311245916935e-09, 'completion_length': 304.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.736607164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7187501788139343, 'reward_std': 0.10005596280097961, 'kl': 0.39697265625, 'epoch': 1.0}
100%|█████████▉| 4274/4286 [32:22:16<04:53, 24.45s/it] {'loss': 0.0194, 'grad_norm': 6.04484699567235, 'learning_rate': 2.799813345776948e-09, 'completion_length': 285.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.8467262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8288691639900208, 'reward_std': 0.0803571455180645, 'kl': 0.484375, 'epoch': 1.0}
100%|█████████▉| 4275/4286 [32:22:41<04:28, 24.45s/it] {'loss': 0.0129, 'grad_norm': 12.41852379355998, 'learning_rate': 2.5664955669622025e-09, 'completion_length': 296.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.782738208770752, 'reward_std': 0.11784838140010834, 'kl': 0.3212890625, 'epoch': 1.0}
100%|█████████▉| 4276/4286 [32:23:05<04:03, 24.36s/it] {'loss': 0.0181, 'grad_norm': 35.34194762429983, 'learning_rate': 2.3331777881474565e-09, 'completion_length': 308.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.7983631193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7805060744285583, 'reward_std': 0.07075383514165878, 'kl': 0.451171875, 'epoch': 1.0}
100%|█████████▉| 4277/4286 [32:23:30<03:42, 24.68s/it] {'loss': 0.0111, 'grad_norm': 2.880296495248495, 'learning_rate': 2.099860009332711e-09, 'completion_length': 333.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.703869104385376, 'reward_std': 0.06489954888820648, 'kl': 0.27685546875, 'epoch': 1.0}
100%|█████████▉| 4278/4286 [32:23:55<03:17, 24.74s/it] {'loss': 0.0143, 'grad_norm': 5.338033010671455, 'learning_rate': 1.8665422305179655e-09, 'completion_length': 321.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.6324405074119568, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.0386904813349247, 'kl': 0.357421875, 'epoch': 1.0}
100%|█████████▉| 4279/4286 [32:24:22<02:57, 25.41s/it] {'loss': 0.0118, 'grad_norm': 3.0501595450982966, 'learning_rate': 1.6332244517032197e-09, 'completion_length': 292.8393020629883, 'rewards/only_full_func_accuracy_reward': 0.6659226715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6480655670166016, 'reward_std': 0.0848214328289032, 'kl': 0.2958984375, 'epoch': 1.0}
100%|█████████▉| 4280/4286 [32:24:47<02:30, 25.13s/it] {'loss': 0.0079, 'grad_norm': 7.894712634286592, 'learning_rate': 1.399906672888474e-09, 'completion_length': 271.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7693453133106232, 'rewards/format_reward': 1.0, 'reward': 1.7693454027175903, 'reward_std': 0.0734705775976181, 'kl': 0.1981201171875, 'epoch': 1.0}
100%|█████████▉| 4281/4286 [32:25:11<02:03, 24.72s/it] {'loss': 0.0037, 'grad_norm': 0.7439580612221488, 'learning_rate': 1.1665888940737283e-09, 'completion_length': 283.3571472167969, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.0, 'kl': 0.09326171875, 'epoch': 1.0}
100%|█████████▉| 4282/4286 [32:25:34<01:37, 24.43s/it] {'loss': 0.0176, 'grad_norm': 9.061682966467785, 'learning_rate': 9.332711152589827e-10, 'completion_length': 322.875, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6949405670166016, 'reward_std': 0.07557233795523643, 'kl': 0.44140625, 'epoch': 1.0}
100%|█████████▉| 4283/4286 [32:25:59<01:13, 24.55s/it] {'loss': 0.0057, 'grad_norm': 19.15401547753017, 'learning_rate': 6.99953336444237e-10, 'completion_length': 286.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.772321492433548, 'rewards/format_reward': 1.0, 'reward': 1.7723215818405151, 'reward_std': 0.0684523843228817, 'kl': 0.1416015625, 'epoch': 1.0}
100%|█████████▉| 4284/4286 [32:26:23<00:48, 24.27s/it] {'loss': 0.0218, 'grad_norm': 5.736567259161082, 'learning_rate': 4.666355576294914e-10, 'completion_length': 260.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187501192092896, 'reward_std': 0.08898505941033363, 'kl': 0.545654296875, 'epoch': 1.0}
100%|█████████▉| 4285/4286 [32:26:47<00:24, 24.37s/it] {'loss': 0.0159, 'grad_norm': 12.92546628800682, 'learning_rate': 2.333177788147457e-10, 'completion_length': 269.4107360839844, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187500596046448, 'reward_std': 0.06250000186264515, 'kl': 0.397705078125, 'epoch': 1.0}
100%|██████████| 4286/4286 [32:27:12<00:00, 24.52s/it] {'loss': 0.0106, 'grad_norm': 12.194950370618566, 'learning_rate': 0.0, 'completion_length': 354.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.4861111342906952, 'rewards/format_reward': 1.0, 'reward': 1.4861112236976624, 'reward_std': 0.0208333320915699, 'kl': 0.08416748046875, 'epoch': 1.0}
{'train_runtime': 117230.0249, 'train_samples_per_second': 0.512, 'train_steps_per_second': 0.037, 'train_loss': 7.590031228707087, 'epoch': 1.0}
100%|██████████| 4286/4286 [32:33:41<00:00, 27.35s/it]
wandb:
wandb: 🚀 View run ONLY-FULL-SHUFFLE-BEST-HIGH-POINT-R1-RESUME-COT-VLLM-Correct-Qwen2-VL-7B-GRPO-TRANCE-60k-2025-03-02-14-54-34 at: https://wandb.ai/tanhuajie264-peking-university/vison-open-r1/runs/ax087bcz
wandb: Find logs at: wandb/run-20250302_145719-ax087bcz/logs
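The summary line is internally consistent on throughput: 0.512 samples/s over 117230 s of training is roughly 60,000 samples, matching the "60k" in the run name, and about 14 samples per optimizer step (how that splits across devices is not shown in this excerpt). Note that the summary 'train_loss' (7.59) is on a different scale from the ~0.01 per-step losses above, so the check below sticks to throughput arithmetic.

```python
# Sanity check of the final summary (values copied from the log).
train_runtime = 117230.0249  # seconds, ~32.5 hours
samples_per_sec = 0.512
steps = 4286

total_samples = samples_per_sec * train_runtime
print(round(total_samples))          # ~60022 -> one pass over the "60k" dataset
print(round(total_samples / steps))  # ~14 samples per optimizer step
```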