[2025-03-02 05:05:17,369] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 03-02 05:05:23 __init__.py:190] Automatically detected platform cuda.
[2025-03-02 05:05:32,977] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-03-02 05:05:32,979] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-03-02 05:05:34,173] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
p-phy-ctyun-gz-a800-node-prod-200-117:884360:884360 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884360:884360 [0] NCCL INFO Bootstrap : Using bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884360:884360 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
p-phy-ctyun-gz-a800-node-prod-200-117:884365:884365 [2] NCCL INFO cudaDriverVersion 12040
p-phy-ctyun-gz-a800-node-prod-200-117:884365:884365 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884368:884368 [4] NCCL INFO cudaDriverVersion 12040
p-phy-ctyun-gz-a800-node-prod-200-117:884368:884368 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884366:884366 [3] NCCL INFO cudaDriverVersion 12040
p-phy-ctyun-gz-a800-node-prod-200-117:884371:884371 [6] NCCL INFO cudaDriverVersion 12040
p-phy-ctyun-gz-a800-node-prod-200-117:884366:884366 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884371:884371 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884365:884365 [2] NCCL INFO Bootstrap : Using bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884368:884368 [4] NCCL INFO Bootstrap : Using bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884366:884366 [3] NCCL INFO Bootstrap : Using bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884371:884371 [6] NCCL INFO Bootstrap : Using bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO P2P plugin IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO P2P plugin IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO P2P plugin IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO P2P plugin IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO P2P plugin IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Using non-device net plugin version 0
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Using network IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Using non-device net plugin version 0
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Using non-device net plugin version 0
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Using network IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Using network IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Using non-device net plugin version 0
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Using network IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Using non-device net plugin version 0
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Using network IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884369:884369 [5] NCCL INFO cudaDriverVersion 12040
p-phy-ctyun-gz-a800-node-prod-200-117:884369:884369 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884369:884369 [5] NCCL INFO Bootstrap : Using bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO P2P plugin IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Using non-device net plugin version 0
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Using network IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884364:884364 [1] NCCL INFO cudaDriverVersion 12040
p-phy-ctyun-gz-a800-node-prod-200-117:884364:884364 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884364:884364 [1] NCCL INFO Bootstrap : Using bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO P2P plugin IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.117<0>
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Using non-device net plugin version 0
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Using network IBext_v8
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO ncclCommInitRank comm 0x558fe404a600 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 commId 0x51f39b87a5018f9c - Init START
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO ncclCommInitRank comm 0x563725e9aa60 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 commId 0x51f39b87a5018f9c - Init START
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO ncclCommInitRank comm 0x55f55541f000 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 commId 0x51f39b87a5018f9c - Init START
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO ncclCommInitRank comm 0x557862c87940 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 commId 0x51f39b87a5018f9c - Init START
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO ncclCommInitRank comm 0x55cf6e893ae0 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 commId 0x51f39b87a5018f9c - Init START
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO ncclCommInitRank comm 0x55caeb55e290 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 commId 0x51f39b87a5018f9c - Init START
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO ncclCommInitRank comm 0x55bb0c836010 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 54000 commId 0x51f39b87a5018f9c - Init START
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO NVLS multicast support is not available on dev 4
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO NVLS multicast support is not available on dev 2
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO NVLS multicast support is not available on dev 6
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO NVLS multicast support is not available on dev 0
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO NVLS multicast support is not available on dev 3
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO NVLS multicast support is not available on dev 5
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO NVLS multicast support is not available on dev 1
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO comm 0x55cf6e893ae0 rank 6 nRanks 7 nNodes 1 localRanks 7 localRank 6 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO comm 0x55f55541f000 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO comm 0x563725e9aa60 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO comm 0x55bb0c836010 rank 2 nRanks 7 nNodes 1 localRanks 7 localRank 2 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO P2P Chunksize set to 524288
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO P2P Chunksize set to 524288
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO comm 0x557862c87940 rank 3 nRanks 7 nNodes 1 localRanks 7 localRank 3 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO comm 0x558fe404a600 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO P2P Chunksize set to 524288
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO comm 0x55caeb55e290 rank 4 nRanks 7 nNodes 1 localRanks 7 localRank 4 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO P2P Chunksize set to 524288
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO P2P Chunksize set to 524288
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO P2P Chunksize set to 524288
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO P2P Chunksize set to 524288
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO 
Channel 05/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO 
Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO 
Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO 
Channel 09/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO 
Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO 
Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] 
NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-117:884369:885893 [5] NCCL INFO ncclCommInitRank comm 0x55f55541f000 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 commId 0x51f39b87a5018f9c - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-117:884368:885809 [4] NCCL INFO ncclCommInitRank comm 0x55caeb55e290 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 commId 0x51f39b87a5018f9c - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-117:884366:885810 [3] NCCL INFO ncclCommInitRank comm 0x557862c87940 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 commId 0x51f39b87a5018f9c - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. 
p-phy-ctyun-gz-a800-node-prod-200-117:884364:886021 [1] NCCL INFO ncclCommInitRank comm 0x558fe404a600 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 commId 0x51f39b87a5018f9c - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-117:884365:885811 [2] NCCL INFO ncclCommInitRank comm 0x55bb0c836010 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 54000 commId 0x51f39b87a5018f9c - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-117:884360:885804 [0] NCCL INFO ncclCommInitRank comm 0x563725e9aa60 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 commId 0x51f39b87a5018f9c - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-117:884371:885812 [6] NCCL INFO ncclCommInitRank comm 0x55cf6e893ae0 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 commId 0x51f39b87a5018f9c - Init COMPLETE [2025-03-02 05:05:41,075] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 730, num_elems = 8.29B Loading checkpoint shards: 0%| | 0/5 [00:00 [2025-03-02 05:06:06,438] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-03-02 05:06:06,438] [INFO] [config.py:1003:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
curriculum_enabled_legacy .... False
curriculum_params_legacy ..... False
data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
data_efficiency_enabled ...... False
dataloader_drop_last ......... False
disable_allgather ............ False
dump_state ................... False
dynamic_loss_scale_args ...... None
eigenvalue_enabled ........... False
eigenvalue_gas_boundary_resolution 1
eigenvalue_layer_name ........ bert.encoder.layer
eigenvalue_layer_num ......... 0
eigenvalue_max_iter .......... 100
eigenvalue_stability ......... 1e-06
eigenvalue_tol ............... 0.01
eigenvalue_verbose ........... False
elasticity_enabled ........... False
flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
fp16_auto_cast ............... None
fp16_enabled ................. False
fp16_master_weights_and_gradients False
global_rank .................. 0
grad_accum_dtype ............. None
gradient_accumulation_steps .. 2
gradient_clipping ............ 1.0
gradient_predivide_factor .... 1.0
graph_harvesting ............. False
hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
initial_dynamic_scale ........ 1
load_universal_checkpoint .... False
loss_scale ................... 1.0
memory_breakdown ............. False
mics_hierarchial_params_gather False
mics_shard_size .............. -1
monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
optimizer_legacy_fusion ...... False
optimizer_name ............... None
optimizer_params ............. None
pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
pld_enabled .................. False
pld_params ................... False
prescale_gradients ........... False
scheduler_name ............... None
scheduler_params ............. None
seq_parallel_communication_data_type torch.float32
sparse_attention ............. None
sparse_gradients_enabled ..... False
steps_per_print .............. inf
timers_config ................ enabled=True synchronized=True
train_batch_size ............. 14
train_micro_batch_size_per_gpu 1
use_data_before_expert_parallel_ False
use_node_local_storage ....... False
wall_clock_breakdown ......... False
weight_quantization_config ... None
world_size ................... 7
zero_allow_untested_optimizer False
zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
zero_enabled ................. True
zero_force_ds_cpu_optimizer .. True
zero_optimization_stage ...... 3
[2025-03-02 05:06:06,442] [INFO] [config.py:989:print_user_config] json =
{
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "none", "pin_memory": true },
    "offload_param": { "device": "none", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1.000000e+09,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1.000000e+09,
    "stage3_max_reuse_distance": 1.000000e+09,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 2,
  "gradient_clipping": 1.0,
  "steps_per_print": inf,
  "train_batch_size": 14,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false,
  "zero_optimization.reduce_bucket_size": 1.284506e+07,
  "zero_optimization.stage3_param_persistence_threshold": 3.584000e+04,
  "zero_optimization.stage3_prefetch_bucket_size": 1.156055e+07
}
INFO 03-02 05:06:21 config.py:542] This model supports multiple tasks: {'score', 'classify', 'embed', 'reward', 'generate'}. Defaulting to 'generate'.
WARNING 03-02 05:06:21 arg_utils.py:1079] --enable-prefix-caching is currently not supported for multimodal models in v0 and has been disabled.
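The batch-size figures in the DeepSpeed config above are internally consistent: DeepSpeed requires that train_batch_size equal the per-GPU micro-batch size times the gradient-accumulation steps times the data-parallel world size (here 7 ranks; the eighth GPU, cuda:7, is apparently reserved for the vLLM engine started below). A quick sanity check in plain Python, using only values taken from the log:

```python
# Values copied from the DeepSpeed config printed above.
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 2
world_size = 7  # data-parallel ranks (7 of the node's GPUs)

# DeepSpeed enforces this identity when the engine is initialized.
train_batch_size = (train_micro_batch_size_per_gpu
                    * gradient_accumulation_steps
                    * world_size)
print(train_batch_size)  # 14, matching "train_batch_size ............. 14"
```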
INFO 03-02 05:06:21 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/home/vlm/pretrain_model/Qwen2-VL-7B-Instruct', speculative_config=None, tokenizer='/home/vlm/pretrain_model/Qwen2-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:7, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/vlm/pretrain_model/Qwen2-VL-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-02 05:06:22 cuda.py:230] Using Flash Attention backend.
INFO 03-02 05:06:23 model_runner.py:1110] Starting to load model /home/vlm/pretrain_model/Qwen2-VL-7B-Instruct...
INFO 03-02 05:06:23 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00…]
… (> 32768). Running this sequence through the model will result in indexing errors
WARNING 03-02 05:06:36 profiling.py:187] The context length (32768) of the model is too short to hold the multi-modal embeddings in the worst case (49152 tokens in total, out of which {'image': 32768, 'video': 16384} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
INFO 03-02 05:06:39 worker.py:267] Memory profiling takes 9.41 seconds
INFO 03-02 05:06:39 worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.70) = 55.53GiB
INFO 03-02 05:06:39 worker.py:267] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 55.53GiB.
INFO 03-02 05:06:39 executor_base.py:110] # CUDA blocks: 64982, # CPU blocks: 4681
INFO 03-02 05:06:39 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 31.73x
INFO 03-02 05:06:42 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode.
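The memory-profiling and concurrency figures above follow from simple arithmetic. A sketch reproducing them (the block size of 16 tokens is the vLLM default and an assumption here, since this log does not print it):

```python
# Reproduce the KV-cache budget and max-concurrency arithmetic from the log.
total_gpu_memory_gib = 79.32
gpu_memory_utilization = 0.70
kv_cache_budget_gib = total_gpu_memory_gib * gpu_memory_utilization
# ~55.52 GiB with these rounded inputs; the log reports 55.53 GiB because
# it computes in exact bytes before rounding. Model weights and activations
# are reported as 0.00 GiB here, so the whole budget goes to the KV cache.

num_gpu_blocks = 64982     # "# CUDA blocks" from executor_base.py
block_size = 16            # assumed vLLM default; not printed in this log
max_model_len = 32768
max_concurrency = num_gpu_blocks * block_size / max_model_len
print(round(kv_cache_budget_gib, 2), round(max_concurrency, 2))  # 55.52 31.73
```

With these numbers, 64982 blocks × 16 tokens/block ≈ 1.04M cacheable tokens, which is how the engine arrives at "Maximum concurrency for 32768 tokens per request: 31.73x".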
You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 0%| | 0/35 [00:00…]
p-phy-ctyun-gz-a800-node-prod-200-117:884371:889832 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 … [15] -1/-1/-1->6->5
p-phy-ctyun-gz-a800-node-prod-200-117:884366:889835 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 … [15] 4/-1/-1->3->2
p-phy-ctyun-gz-a800-node-prod-200-117:884365:889831 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 … [15] 3/-1/-1->2->1
p-phy-ctyun-gz-a800-node-prod-200-117:884368:889837 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 … [15] 5/-1/-1->4->3
p-phy-ctyun-gz-a800-node-prod-200-117:884369:889836 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 … [15] 6/-1/-1->5->4
p-phy-ctyun-gz-a800-node-prod-200-117:884364:889833 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 … [15] 2/-1/-1->1->0
p-phy-ctyun-gz-a800-node-prod-200-117:884360:889834 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 … [15] 1/-1/-1->0->-1
p-phy-ctyun-gz-a800-node-prod-200-117:884360:889834 [0] NCCL INFO comm 0x7f62bc06eb10 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-117:884364:889833 [1] NCCL INFO comm 0x7fc10006ede0 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0
p-phy-ctyun-gz-a800-node-prod-200-117:884371:889832 [6] NCCL INFO P2P Chunksize set to 524288
… ("P2P Chunksize set to 524288" repeats for the remaining ranks)
p-phy-ctyun-gz-a800-node-prod-200-117:884360:889834 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6
… (Channels 01/16 through 15/16 use the same 0 1 2 3 4 5 6 ring order)
p-phy-ctyun-gz-a800-node-prod-200-117:884368:889837 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read
… (Channels 00/0 through 15/0 are connected this way between every adjacent rank pair, including 6[6] -> 0[0], all via P2P/IPC/read)
p-phy-ctyun-gz-a800-node-prod-200-117:884365:889831 [2] NCCL INFO Connected all rings
… ("Connected all rings" repeats for the remaining ranks)
p-phy-ctyun-gz-a800-node-prod-200-117:884371:889832 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read
… (the reverse-direction channels, e.g. 2[2] -> 1[1], 1[1] -> 0[0], 5[5] -> 4[4], 4[4] -> 3[3], 3[3] -> 2[2], are then connected on Channels 00/0 through 15/0, all via P2P/IPC/read)
p-phy-ctyun-gz-a800-node-prod-200-117:884369:889836 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:889837 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:889836 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:889835 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884369:889836 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884366:889835 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:889837 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884368:889837 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-117:884360:889834 [0] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884364:889833 [1] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884360:889834 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884360:889834 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884364:889833 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884364:889833 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884371:889832 [6] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884365:889831 [2] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884371:889832 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884371:889832 [6] NCCL INFO 16 coll channels, 16 collnet 
channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884365:889831 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884365:889831 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884366:889835 [3] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884366:889835 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884366:889835 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884369:889836 [5] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884368:889837 [4] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-117:884369:889836 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884369:889836 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884368:889837 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-117:884368:889837 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-117:884366:889835 [3] NCCL INFO ncclCommSplit comm 0x7f22d006ff20 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 parent 0x557862c87940 color -1326228412 key 3 commId 0x9733e5c0865cf202 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884369:889836 [5] NCCL INFO ncclCommSplit comm 0x7fa4e4071050 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 parent 0x55f55541f000 color -1326228412 key 5 commId 0x9733e5c0865cf202 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884364:889833 [1] NCCL INFO ncclCommSplit comm 
0x7fc10006ede0 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 parent 0x558fe404a600 color -1326228412 key 1 commId 0x9733e5c0865cf202 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884371:889832 [6] NCCL INFO ncclCommSplit comm 0x7fa89c06f000 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 parent 0x55cf6e893ae0 color -1326228412 key 6 commId 0x9733e5c0865cf202 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884365:889831 [2] NCCL INFO ncclCommSplit comm 0x7f333806ee90 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 54000 parent 0x55bb0c836010 color -1326228412 key 2 commId 0x9733e5c0865cf202 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884360:889834 [0] NCCL INFO ncclCommSplit comm 0x7f62bc06eb10 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 parent 0x563725e9aa60 color -1326228412 key 0 commId 0x9733e5c0865cf202 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-117:884368:889837 [4] NCCL INFO ncclCommSplit comm 0x7f9088071340 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 parent 0x55caeb55e290 color -1326228412 key 4 commId 0x9733e5c0865cf202 - Init COMPLETE [2025-03-02 05:07:46,693] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
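The stage3.py warnings in this log recommend adding `get_accelerator().empty_cache()` calls to the training loop so that every rank flushes its allocator cache at the same point. A minimal sketch of that pattern, with DeepSpeed's accelerator call stubbed out so it runs without a GPU (`FLUSH_EVERY` and `train_loop` are illustrative names, not from the log):

```python
# Sketch: flush the allocator cache on every rank at the same step, as the
# DeepSpeed stage3 warning suggests. In real code the flush would be
# deepspeed.accelerator.get_accelerator().empty_cache(); a stub stands in
# here so the pattern is runnable without a GPU.
FLUSH_EVERY = 50  # illustrative interval, not from the log

flushed_at = []  # records the steps where the (stubbed) flush ran

def empty_cache_stub(step):
    flushed_at.append(step)

def train_loop(num_steps):
    for step in range(1, num_steps + 1):
        # ... forward / backward / optimizer step would go here ...
        if step % FLUSH_EVERY == 0:
            # Keyed on the step counter, so all ranks flush together.
            empty_cache_stub(step)

train_loop(120)
```

Because the flush is keyed on the shared step counter rather than on per-rank memory pressure, no rank stalls waiting for another rank's unsynchronized flush.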
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
0%| | 1/4286 [00:31<37:16:12, 31.31s/it]
{'loss': -0.0, 'grad_norm': 1.558195881729965, 'learning_rate': 9.997666822211853e-07, 'completion_length': 195.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.016220238525420427, 'rewards/format_reward': 0.9107142984867096, 'reward': 0.926934540271759, 'reward_std': 0.173439247533679, 'kl': 0.0, 'epoch': 0.0}
[2025-03-02 05:08:13,065] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
0%| | 2/4286 [00:57<33:48:12, 28.41s/it]
{'loss': 0.0, 'grad_norm': 3.181489723233353, 'learning_rate': 9.995333644423704e-07, 'completion_length': 196.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.09415455907583237, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.0227260291576385, 'reward_std': 0.23865488916635513, 'kl': 0.0001811981201171875, 'epoch': 0.0}
[2025-03-02 05:08:38,549] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
0%| | 3/4286 [01:23<32:12:30, 27.07s/it]
{'loss': 0.0, 'grad_norm': 1.292902552980921, 'learning_rate': 9.993000466635557e-07, 'completion_length': 207.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.04446248430758715, 'rewards/format_reward': 0.910714328289032, 'reward': 0.9551768004894257, 'reward_std': 0.2052948847413063, 'kl': 0.0009918212890625, 'epoch': 0.0}
0%| | 4/4286 [01:48<31:30:51, 26.49s/it]
{'loss': 0.0001, 'grad_norm': 1.6362801516330312, 'learning_rate': 9.99066728884741e-07, 'completion_length': 195.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.021031747106462717, 'rewards/format_reward': 0.9285714626312256, 'reward': 0.9496031701564789, 'reward_std': 0.17523184418678284, 'kl': 0.00209808349609375, 'epoch': 0.0}
[2025-03-02 05:09:29,761] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
0%| | 5/4286 [02:14<31:07:25, 26.17s/it]
{'loss': 0.0001, 'grad_norm': 1.7962144176530135, 'learning_rate': 9.988334111059262e-07, 'completion_length': 217.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.0446428582072258, 'rewards/format_reward': 0.910714328289032, 'reward': 0.9553571939468384, 'reward_std': 0.22782530635595322, 'kl': 0.00360107421875, 'epoch': 0.0}
0%| | 6/4286 [02:40<31:04:30, 26.14s/it]
{'loss': 0.0004, 'grad_norm': 2.255065993898884, 'learning_rate': 9.986000933271115e-07, 'completion_length': 178.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.08869048021733761, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.035119116306305, 'reward_std': 0.17714375257492065, 'kl': 0.009002685546875, 'epoch': 0.0}
[2025-03-02 05:10:17,137] [WARNING] [stage3.py:2134:step] 3 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
0%| | 7/4286 [03:01<29:11:24, 24.56s/it]
{'loss': 0.0007, 'grad_norm': 1.5631534744834727, 'learning_rate': 9.983667755482968e-07, 'completion_length': 146.14286422729492, 'rewards/only_full_func_accuracy_reward': 0.03273809747770429, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.0148809850215912, 'reward_std': 0.06601450406014919, 'kl': 0.018157958984375, 'epoch': 0.0}
0%| | 8/4286 [03:19<26:44:38, 22.51s/it]
{'loss': 0.0008, 'grad_norm': 1.5100038007564698, 'learning_rate': 9.98133457769482e-07, 'completion_length': 126.73215103149414, 'rewards/only_full_func_accuracy_reward': 0.10480442363768816, 'rewards/format_reward': 1.0, 'reward': 1.1048044562339783, 'reward_std': 0.05064220353960991, 'kl': 0.01922607421875, 'epoch': 0.0}
[2025-03-02 05:11:00,114] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
0%| | 9/4286 [03:44<27:36:51, 23.24s/it]
{'loss': 0.0008, 'grad_norm': 1.2500156043827844, 'learning_rate': 9.979001399906673e-07, 'completion_length': 173.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.0163690485060215, 'rewards/format_reward': 0.9464285969734192, 'reward': 0.9627976715564728, 'reward_std': 0.13665135204792023, 'kl': 0.0203857421875, 'epoch': 0.0}
[2025-03-02 05:11:22,478] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
0%| | 10/4286 [04:07<27:17:09, 22.97s/it]
{'loss': 0.001, 'grad_norm': 0.7446437471109675, 'learning_rate': 9.976668222118526e-07, 'completion_length': 149.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.0029761907644569874, 'rewards/format_reward': 0.9821428656578064, 'reward': 0.9851190447807312, 'reward_std': 0.04166666558012366, 'kl': 0.02374267578125, 'epoch': 0.0}
0%| | 11/4286 [04:27<26:24:05, 22.23s/it]
{'loss': 0.0011, 'grad_norm': 1.481254308933855, 'learning_rate': 9.974335044330377e-07, 'completion_length': 141.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.14696712791919708, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.1112529039382935, 'reward_std': 0.10676521621644497, 'kl': 0.02850341796875, 'epoch': 0.0}
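The `learning_rate` values above fall by a constant 2.333e-10 per step, which is consistent with a linear decay from a 1e-6 peak down to zero over the 4286 total steps shown in the progress bar. The actual scheduler configuration is not in this log, so this is a reconstruction from the logged numbers:

```python
# Reconstruct the apparent linear LR decay: lr(step) = peak * (1 - step/total).
# peak = 1e-6 and total = 4286 are inferred from the logged values, not read
# from any visible config.
PEAK_LR = 1e-6
TOTAL_STEPS = 4286

def lr_at(step):
    return PEAK_LR * (1 - step / TOTAL_STEPS)

# Agrees with the logged values at steps 1 and 5 to well below one part in 1e9:
assert abs(lr_at(1) - 9.997666822211853e-07) < 1e-18
assert abs(lr_at(5) - 9.988334111059262e-07) < 1e-18
```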
0%| | 12/4286 [04:49<26:05:15, 21.97s/it]
{'loss': 0.0012, 'grad_norm': 2.0872817656900455, 'learning_rate': 9.97200186654223e-07, 'completion_length': 128.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.08599065616726875, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.068133533000946, 'reward_std': 0.12721307203173637, 'kl': 0.030517578125, 'epoch': 0.0}
[2025-03-02 05:12:26,038] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
0%| | 13/4286 [05:10<25:57:17, 21.87s/it]
{'loss': 0.0011, 'grad_norm': 4.076059740538245, 'learning_rate': 9.969668688754082e-07, 'completion_length': 144.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.049135489389300346, 'rewards/format_reward': 1.0, 'reward': 1.0491355657577515, 'reward_std': 0.086846137419343, 'kl': 0.02728271484375, 'epoch': 0.0}
0%| | 14/4286 [05:28<24:34:59, 20.72s/it]
{'loss': 0.0016, 'grad_norm': 1.8899818755237645, 'learning_rate': 9.967335510965935e-07, 'completion_length': 114.8035774230957, 'rewards/only_full_func_accuracy_reward': 0.04255952686071396, 'rewards/format_reward': 1.0, 'reward': 1.042559564113617, 'reward_std': 0.04802257567644119, 'kl': 0.0390625, 'epoch': 0.0}
0%| | 15/4286 [05:43<22:26:54, 18.92s/it]
{'loss': 0.0016, 'grad_norm': 1.9414026800104551, 'learning_rate': 9.965002333177788e-07, 'completion_length': 100.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.08839286491274834, 'rewards/format_reward': 1.0, 'reward': 1.0883929133415222, 'reward_std': 0.06209554523229599, 'kl': 0.03955078125, 'epoch': 0.0}
0%| | 16/4286 [05:58<20:55:26, 17.64s/it]
{'loss': 0.002, 'grad_norm': 3.0198209012186785, 'learning_rate': 9.96266915538964e-07, 'completion_length': 94.16071701049805, 'rewards/only_full_func_accuracy_reward': 0.12934982776641846, 'rewards/format_reward': 1.0, 'reward': 1.1293498873710632, 'reward_std': 0.09903641417622566, 'kl': 0.05029296875, 'epoch': 0.0}
0%| | 17/4286 [06:14<20:20:44, 17.16s/it]
{'loss': 0.0023, 'grad_norm': 2.0912263164239597, 'learning_rate': 9.960335977601493e-07, 'completion_length': 93.91072082519531, 'rewards/only_full_func_accuracy_reward': 0.04732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.0473214983940125, 'reward_std': 0.056775402277708054, 'kl': 0.056640625, 'epoch': 0.0}
0%| | 18/4286 [06:27<18:58:44, 16.01s/it]
{'loss': 0.0028, 'grad_norm': 2.0491546463756802, 'learning_rate': 9.958002799813346e-07, 'completion_length': 77.69643020629883, 'rewards/only_full_func_accuracy_reward': 0.09434524178504944, 'rewards/format_reward': 1.0, 'reward': 1.0943453311920166, 'reward_std': 0.1021672785282135, 'kl': 0.06982421875, 'epoch': 0.0}
0%| | 19/4286 [06:43<18:51:32, 15.91s/it]
{'loss': 0.0033, 'grad_norm': 3.2696453710360216, 'learning_rate': 9.955669622025197e-07, 'completion_length': 82.78571701049805, 'rewards/only_full_func_accuracy_reward': 0.09345238283276558, 'rewards/format_reward': 1.0, 'reward': 1.0934524536132812, 'reward_std': 0.09138468280434608, 'kl': 0.082763671875, 'epoch': 0.0}
0%| | 20/4286 [06:59<18:53:00, 15.94s/it]
{'loss': 0.0029, 'grad_norm': 5.216537018585601, 'learning_rate': 9.95333644423705e-07, 'completion_length': 87.83929061889648, 'rewards/only_full_func_accuracy_reward': 0.08139881491661072, 'rewards/format_reward': 1.0, 'reward': 1.0813989043235779, 'reward_std': 0.08382174372673035, 'kl': 0.072021484375, 'epoch': 0.0}
0%| | 21/4286 [07:11<17:43:52, 14.97s/it]
{'loss': 0.0033, 'grad_norm': 3.457902358078089, 'learning_rate': 9.951003266448904e-07, 'completion_length': 73.21428871154785, 'rewards/only_full_func_accuracy_reward': 0.12113095819950104, 'rewards/format_reward': 1.0, 'reward': 1.1211310625076294, 'reward_std': 0.09805656969547272, 'kl': 0.0830078125, 'epoch': 0.0}
1%| | 22/4286 [07:28<18:20:37, 15.49s/it]
{'loss': 0.0031, 'grad_norm': 2.188252465574705, 'learning_rate': 9.948670088660755e-07, 'completion_length': 72.32143211364746, 'rewards/only_full_func_accuracy_reward': 0.09816207969561219, 'rewards/format_reward': 1.0, 'reward': 1.0981621742248535, 'reward_std': 0.06409974116832018, 'kl': 0.07861328125, 'epoch': 0.01}
1%| | 23/4286 [07:43<18:13:16, 15.39s/it]
{'loss': 0.0032, 'grad_norm': 2.0035542791209635, 'learning_rate': 9.946336910872608e-07, 'completion_length': 81.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.0773809589445591, 'rewards/format_reward': 1.0, 'reward': 1.0773810148239136, 'reward_std': 0.08901502378284931, 'kl': 0.080322265625, 'epoch': 0.01}
1%| | 24/4286 [08:04<20:09:06, 17.02s/it]
{'loss': 0.0027, 'grad_norm': 2.5555074302839746, 'learning_rate': 9.944003733084461e-07, 'completion_length': 91.37500381469727, 'rewards/only_full_func_accuracy_reward': 0.060756808146834373, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.04289972782135, 'reward_std': 0.08097045123577118, 'kl': 0.068603515625, 'epoch': 0.01}
[2025-03-02 05:15:39,016] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
1%| | 25/4286 [08:23<20:52:04, 17.63s/it]
{'loss': 0.0031, 'grad_norm': 2.496792402520885, 'learning_rate': 9.941670555296313e-07, 'completion_length': 90.25000381469727, 'rewards/only_full_func_accuracy_reward': 0.09468985348939896, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.0768327713012695, 'reward_std': 0.15914807468652725, 'kl': 0.0771484375, 'epoch': 0.01}
1%| | 26/4286 [08:39<20:22:55, 17.22s/it]
{'loss': 0.0033, 'grad_norm': 2.9420868980889203, 'learning_rate': 9.939337377508166e-07, 'completion_length': 76.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.18363097310066223, 'rewards/format_reward': 1.0, 'reward': 1.1836310029029846, 'reward_std': 0.11079930886626244, 'kl': 0.08203125, 'epoch': 0.01}
1%| | 27/4286 [08:57<20:30:39, 17.34s/it]
{'loss': 0.0032, 'grad_norm': 1.3441502518934185, 'learning_rate': 9.93700419972002e-07, 'completion_length': 85.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.045386909041553736, 'rewards/format_reward': 1.0, 'reward': 1.0453869700431824, 'reward_std': 0.0405263202264905, 'kl': 0.078857421875, 'epoch': 0.01}
1%| | 28/4286 [09:09<18:45:52, 15.86s/it]
{'loss': 0.0036, 'grad_norm': 2.4852634980715083, 'learning_rate': 9.93467102193187e-07, 'completion_length': 67.32143020629883, 'rewards/only_full_func_accuracy_reward': 0.07291667349636555, 'rewards/format_reward': 1.0, 'reward': 1.0729167461395264, 'reward_std': 0.07992978394031525, 'kl': 0.0888671875, 'epoch': 0.01}
1%| | 29/4286 [09:24<18:15:48, 15.44s/it]
{'loss': 0.0038, 'grad_norm': 3.4141873536000653, 'learning_rate': 9.932337844143724e-07, 'completion_length': 70.46429061889648, 'rewards/only_full_func_accuracy_reward': 0.12910289131104946, 'rewards/format_reward': 1.0, 'reward': 1.1291029453277588, 'reward_std': 0.10416102409362793, 'kl': 0.094970703125, 'epoch': 0.01}
1%| | 30/4286 [09:38<17:53:41, 15.14s/it]
{'loss': 0.0037, 'grad_norm': 1.9409739897536005, 'learning_rate': 9.930004666355577e-07, 'completion_length': 72.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.06011905148625374, 'rewards/format_reward': 1.0, 'reward': 1.060119092464447, 'reward_std': 0.07801728136837482, 'kl': 0.093017578125, 'epoch': 0.01}
1%| | 31/4286 [09:52<17:30:23, 14.81s/it]
{'loss': 0.0037, 'grad_norm': 3.1106654275409835, 'learning_rate': 9.927671488567428e-07, 'completion_length': 73.12500381469727, 'rewards/only_full_func_accuracy_reward': 0.07638889737427235, 'rewards/format_reward': 1.0, 'reward': 1.076388955116272, 'reward_std': 0.0813492126762867, 'kl': 0.091552734375, 'epoch': 0.01}
1%| | 32/4286 [10:06<17:07:15, 14.49s/it]
{'loss': 0.0038, 'grad_norm': 2.8271245974897266, 'learning_rate': 9.925338310779281e-07, 'completion_length': 72.375, 'rewards/only_full_func_accuracy_reward': 0.0833333395421505, 'rewards/format_reward': 1.0, 'reward': 1.083333432674408, 'reward_std': 0.10301259905099869, 'kl': 0.095458984375, 'epoch': 0.01}
1%| | 33/4286 [10:19<16:33:19, 14.01s/it]
{'loss': 0.0034, 'grad_norm': 3.9481035279300003, 'learning_rate': 9.923005132991135e-07, 'completion_length': 78.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.07861394807696342, 'rewards/format_reward': 1.0, 'reward': 1.0786139965057373, 'reward_std': 0.07356336526572704, 'kl': 0.084228515625, 'epoch': 0.01}
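The `epoch` field above flips from 0.0 to 0.01 between steps 21 and 22, and later to 0.02 at step 65, which matches step/4286 rounded to two decimals. In other words, the 4286 steps in the progress bar appear to be a single epoch. A quick check of that inference (the epoch accounting itself is not shown in the log):

```python
# The logged 'epoch' field is consistent with epoch = step / 4286 rounded to
# two decimals, i.e. one epoch of 4286 optimizer steps. Inferred from the
# logged values, not from a visible config.
TOTAL_STEPS = 4286

def epoch_at(step):
    return round(step / TOTAL_STEPS, 2)

assert epoch_at(21) == 0.0   # last step logged with 'epoch': 0.0
assert epoch_at(22) == 0.01  # first step logged with 'epoch': 0.01
assert epoch_at(65) == 0.02  # first step logged with 'epoch': 0.02
```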
1%| | 34/4286 [10:36<17:35:20, 14.89s/it]
{'loss': 0.0029, 'grad_norm': 2.5592134713291723, 'learning_rate': 9.920671955202986e-07, 'completion_length': 91.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.18214288353919983, 'rewards/format_reward': 1.0, 'reward': 1.1821429133415222, 'reward_std': 0.12434184923768044, 'kl': 0.0732421875, 'epoch': 0.01}
1%| | 35/4286 [10:53<18:18:54, 15.51s/it]
{'loss': 0.003, 'grad_norm': 3.7674791754735835, 'learning_rate': 9.91833877741484e-07, 'completion_length': 91.71429061889648, 'rewards/only_full_func_accuracy_reward': 0.09345946833491325, 'rewards/format_reward': 1.0, 'reward': 1.0934595465660095, 'reward_std': 0.12336406856775284, 'kl': 0.074951171875, 'epoch': 0.01}
1%| | 36/4286 [11:16<21:05:18, 17.86s/it]
{'loss': 0.0025, 'grad_norm': 2.537950733476228, 'learning_rate': 9.91600559962669e-07, 'completion_length': 117.69643783569336, 'rewards/only_full_func_accuracy_reward': 0.11612269841134548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.0804084539413452, 'reward_std': 0.151675783097744, 'kl': 0.0628662109375, 'epoch': 0.01}
1%| | 37/4286 [11:33<20:35:25, 17.45s/it]
{'loss': 0.0025, 'grad_norm': 2.485147356968777, 'learning_rate': 9.913672421838543e-07, 'completion_length': 100.57143020629883, 'rewards/only_full_func_accuracy_reward': 0.145904203876853, 'rewards/format_reward': 1.0, 'reward': 1.1459041833877563, 'reward_std': 0.1447715237736702, 'kl': 0.0633544921875, 'epoch': 0.01}
1%| | 38/4286 [11:49<20:04:08, 17.01s/it]
{'loss': 0.0027, 'grad_norm': 3.02627471492563, 'learning_rate': 9.911339244050397e-07, 'completion_length': 97.33929061889648, 'rewards/only_full_func_accuracy_reward': 0.13834326714277267, 'rewards/format_reward': 1.0, 'reward': 1.1383433938026428, 'reward_std': 0.13428788632154465, 'kl': 0.068115234375, 'epoch': 0.01}
1%| | 39/4286 [12:10<21:33:37, 18.28s/it]
{'loss': 0.0023, 'grad_norm': 1.9416716533841576, 'learning_rate': 9.909006066262248e-07, 'completion_length': 124.89286422729492, 'rewards/only_full_func_accuracy_reward': 0.11250532791018486, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.09464830160141, 'reward_std': 0.1197703368961811, 'kl': 0.0567626953125, 'epoch': 0.01}
1%| | 40/4286 [12:28<21:22:22, 18.12s/it]
{'loss': 0.0024, 'grad_norm': 2.295545820392664, 'learning_rate': 9.906672888474101e-07, 'completion_length': 114.25000381469727, 'rewards/only_full_func_accuracy_reward': 0.0706845298409462, 'rewards/format_reward': 1.0, 'reward': 1.070684552192688, 'reward_std': 0.08781252056360245, 'kl': 0.060791015625, 'epoch': 0.01}
1%| | 41/4286 [12:47<21:38:07, 18.35s/it]
{'loss': 0.0024, 'grad_norm': 2.1986931060049146, 'learning_rate': 9.904339710685954e-07, 'completion_length': 107.69643020629883, 'rewards/only_full_func_accuracy_reward': 0.1944161057472229, 'rewards/format_reward': 1.0, 'reward': 1.1944161653518677, 'reward_std': 0.1315060295164585, 'kl': 0.0604248046875, 'epoch': 0.01}
1%| | 42/4286 [13:03<21:01:39, 17.84s/it]
{'loss': 0.0022, 'grad_norm': 2.176969449925145, 'learning_rate': 9.902006532897806e-07, 'completion_length': 116.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.1741541549563408, 'rewards/format_reward': 1.0, 'reward': 1.174154281616211, 'reward_std': 0.09372555837035179, 'kl': 0.0560302734375, 'epoch': 0.01}
1%| | 43/4286 [13:20<20:32:07, 17.42s/it]
{'loss': 0.0025, 'grad_norm': 1.9325240260800278, 'learning_rate': 9.899673355109659e-07, 'completion_length': 108.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.06906888633966446, 'rewards/format_reward': 1.0, 'reward': 1.0690689086914062, 'reward_std': 0.0776865966618061, 'kl': 0.0621337890625, 'epoch': 0.01}
1%| | 44/4286 [13:40<21:30:24, 18.25s/it]
{'loss': 0.0026, 'grad_norm': 3.128651979368358, 'learning_rate': 9.897340177321512e-07, 'completion_length': 110.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.08065477013587952, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.062797725200653, 'reward_std': 0.17727193236351013, 'kl': 0.0655517578125, 'epoch': 0.01}
1%| | 45/4286 [13:57<21:12:33, 18.00s/it]
{'loss': 0.003, 'grad_norm': 1.6174500176908195, 'learning_rate': 9.895006999533363e-07, 'completion_length': 98.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.100663922727108, 'rewards/format_reward': 1.0, 'reward': 1.100663959980011, 'reward_std': 0.08939557895064354, 'kl': 0.075439453125, 'epoch': 0.01}
1%| | 46/4286 [14:16<21:17:23, 18.08s/it]
{'loss': 0.0027, 'grad_norm': 2.0190669592487587, 'learning_rate': 9.892673821745217e-07, 'completion_length': 102.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.08142007142305374, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.0635629892349243, 'reward_std': 0.09956488013267517, 'kl': 0.068603515625, 'epoch': 0.01}
1%| | 47/4286 [14:32<20:50:37, 17.70s/it]
{'loss': 0.003, 'grad_norm': 1.675329766337237, 'learning_rate': 9.89034064395707e-07, 'completion_length': 95.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.0917658805847168, 'rewards/format_reward': 1.0, 'reward': 1.0917659401893616, 'reward_std': 0.0936390794813633, 'kl': 0.07421875, 'epoch': 0.01}
1%| | 48/4286 [14:47<19:43:25, 16.75s/it]
{'loss': 0.0035, 'grad_norm': 2.5664415091757165, 'learning_rate': 9.88800746616892e-07, 'completion_length': 85.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.1346726268529892, 'rewards/format_reward': 1.0, 'reward': 1.1346727013587952, 'reward_std': 0.12417486310005188, 'kl': 0.086669921875, 'epoch': 0.01}
1%| | 49/4286 [15:01<18:50:18, 16.01s/it]
{'loss': 0.0038, 'grad_norm': 3.3105604619720084, 'learning_rate': 9.885674288380774e-07, 'completion_length': 81.19643020629883, 'rewards/only_full_func_accuracy_reward': 0.1845238283276558, 'rewards/format_reward': 1.0, 'reward': 1.18452388048172, 'reward_std': 0.13070277869701385, 'kl': 0.094970703125, 'epoch': 0.01}
1%| | 50/4286 [15:16<18:30:29, 15.73s/it]
{'loss': 0.0034, 'grad_norm': 4.556337570369576, 'learning_rate': 9.883341110592628e-07, 'completion_length': 88.83928680419922, 'rewards/only_full_func_accuracy_reward': 0.19486607611179352, 'rewards/format_reward': 1.0, 'reward': 1.194866120815277, 'reward_std': 0.11110621690750122, 'kl': 0.086181640625, 'epoch': 0.01}
1%| | 51/4286 [15:29<17:31:46, 14.90s/it]
{'loss': 0.0032, 'grad_norm': 3.5715240793993615, 'learning_rate': 9.881007932804479e-07, 'completion_length': 82.62500381469727, 'rewards/only_full_func_accuracy_reward': 0.1205357164144516, 'rewards/format_reward': 1.0, 'reward': 1.1205357313156128, 'reward_std': 0.05587352253496647, 'kl': 0.079833984375, 'epoch': 0.01}
1%| | 52/4286 [15:46<18:20:38, 15.60s/it]
{'loss': 0.0036, 'grad_norm': 2.94857199867225, 'learning_rate': 9.878674755016332e-07, 'completion_length': 82.78571701049805, 'rewards/only_full_func_accuracy_reward': 0.1939331591129303, 'rewards/format_reward': 1.0, 'reward': 1.1939332485198975, 'reward_std': 0.16679451800882816, 'kl': 0.09033203125, 'epoch': 0.01}
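Each step's metrics above are printed as a Python dict literal, so they can be recovered from the raw log text with `ast.literal_eval`. A small sketch using one dict copied from this log (the `parse_metrics` helper is illustrative):

```python
import ast
import re

# Pull the per-step metric dicts out of raw log text. The trainer prints them
# as Python literals, so ast.literal_eval parses them without eval()'s risks.
sample = ("{'loss': 0.0034, 'grad_norm': 4.556337570369576, "
          "'learning_rate': 9.883341110592628e-07, "
          "'completion_length': 88.83928680419922, "
          "'rewards/only_full_func_accuracy_reward': 0.19486607611179352, "
          "'rewards/format_reward': 1.0, 'reward': 1.194866120815277, "
          "'reward_std': 0.11110621690750122, 'kl': 0.086181640625, "
          "'epoch': 0.01}")

def parse_metrics(text):
    # The dicts contain no nested braces, so a flat {...} match suffices.
    return [ast.literal_eval(m) for m in re.findall(r"\{[^{}]*\}", text)]

records = parse_metrics(sample)
```

From here the records can be loaded into a dataframe or plotted to track `reward`, `kl`, and `completion_length` over training.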
1%| | 53/4286 [16:00<17:42:09, 15.06s/it]
{'loss': 0.0033, 'grad_norm': 2.1402592509510523, 'learning_rate': 9.876341577228185e-07, 'completion_length': 78.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.14791667833924294, 'rewards/format_reward': 1.0, 'reward': 1.1479167342185974, 'reward_std': 0.06764823198318481, 'kl': 0.081787109375, 'epoch': 0.01}
1%|▏ | 54/4286 [16:15<17:45:48, 15.11s/it]
{'loss': 0.0032, 'grad_norm': 2.5417165636052403, 'learning_rate': 9.874008399440036e-07, 'completion_length': 94.33929061889648, 'rewards/only_full_func_accuracy_reward': 0.1714285872876644, 'rewards/format_reward': 1.0, 'reward': 1.1714286804199219, 'reward_std': 0.11906134337186813, 'kl': 0.079833984375, 'epoch': 0.01}
1%|▏ | 55/4286 [16:31<17:58:40, 15.30s/it]
{'loss': 0.0036, 'grad_norm': 2.909925392140563, 'learning_rate': 9.87167522165189e-07, 'completion_length': 92.69643020629883, 'rewards/only_full_func_accuracy_reward': 0.1443452462553978, 'rewards/format_reward': 1.0, 'reward': 1.1443453431129456, 'reward_std': 0.08965210989117622, 'kl': 0.091064453125, 'epoch': 0.01}
1%|▏ | 56/4286 [16:46<17:41:18, 15.05s/it]
{'loss': 0.0029, 'grad_norm': 1.2587419902053727, 'learning_rate': 9.869342043863743e-07, 'completion_length': 91.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.10401786863803864, 'rewards/format_reward': 1.0, 'reward': 1.104017972946167, 'reward_std': 0.02809318248182535, 'kl': 0.0732421875, 'epoch': 0.01}
1%|▏ | 57/4286 [17:02<18:14:39, 15.53s/it]
{'loss': 0.0032, 'grad_norm': 2.3999249962384566, 'learning_rate': 9.867008866075594e-07, 'completion_length': 103.07143020629883, 'rewards/only_full_func_accuracy_reward': 0.21437785029411316, 'rewards/format_reward': 1.0, 'reward': 1.2143778800964355, 'reward_std': 0.0911005400121212, 'kl': 0.078857421875, 'epoch': 0.01}
1%|▏ | 58/4286 [17:19<18:29:42, 15.75s/it]
{'loss': 0.0033, 'grad_norm': 2.4791325398540445, 'learning_rate': 9.864675688287447e-07, 'completion_length': 98.16072082519531, 'rewards/only_full_func_accuracy_reward': 0.12160364910960197, 'rewards/format_reward': 1.0, 'reward': 1.1216036677360535, 'reward_std': 0.0888749323785305, 'kl': 0.081787109375, 'epoch': 0.01}
1%|▏ | 59/4286 [17:36<19:00:17, 16.19s/it]
{'loss': 0.0032, 'grad_norm': 2.407632795426049, 'learning_rate': 9.862342510499299e-07, 'completion_length': 95.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.31041667610406876, 'rewards/format_reward': 1.0, 'reward': 1.3104167580604553, 'reward_std': 0.11559626832604408, 'kl': 0.0791015625, 'epoch': 0.01}
1%|▏ | 60/4286 [17:52<18:54:18, 16.10s/it]
{'loss': 0.0033, 'grad_norm': 2.5845316930766384, 'learning_rate': 9.860009332711152e-07, 'completion_length': 93.83929061889648, 'rewards/only_full_func_accuracy_reward': 0.21393851190805435, 'rewards/format_reward': 1.0, 'reward': 1.213938593864441, 'reward_std': 0.0910521112382412, 'kl': 0.083251953125, 'epoch': 0.01}
1%|▏ | 61/4286 [18:06<18:09:00, 15.47s/it]
{'loss': 0.0039, 'grad_norm': 2.2939673936873732, 'learning_rate': 9.857676154923005e-07, 'completion_length': 81.10714340209961, 'rewards/only_full_func_accuracy_reward': 0.17232144623994827, 'rewards/format_reward': 1.0, 'reward': 1.1723214983940125, 'reward_std': 0.09613125026226044, 'kl': 0.0966796875, 'epoch': 0.01}
1%|▏ | 62/4286 [18:25<19:21:45, 16.50s/it]
{'loss': 0.0038, 'grad_norm': 2.731491340836224, 'learning_rate': 9.855342977134856e-07, 'completion_length': 100.53571701049805, 'rewards/only_full_func_accuracy_reward': 0.1309523917734623, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.1130953431129456, 'reward_std': 0.1568618267774582, 'kl': 0.0947265625, 'epoch': 0.01}
1%|▏ | 63/4286 [18:37<18:00:33, 15.35s/it]
{'loss': 0.0046, 'grad_norm': 4.062827077726191, 'learning_rate': 9.85300979934671e-07, 'completion_length': 71.46428871154785, 'rewards/only_full_func_accuracy_reward': 0.25863099098205566, 'rewards/format_reward': 1.0, 'reward': 1.2586310505867004, 'reward_std': 0.19423923641443253, 'kl': 0.114501953125, 'epoch': 0.01}
1%|▏ | 64/4286 [18:50<16:56:51, 14.45s/it]
{'loss': 0.0044, 'grad_norm': 3.7433866951259436, 'learning_rate': 9.850676621558563e-07, 'completion_length': 66.42857551574707, 'rewards/only_full_func_accuracy_reward': 0.1498724613338709, 'rewards/format_reward': 1.0, 'reward': 1.1498725414276123, 'reward_std': 0.13766339793801308, 'kl': 0.11083984375, 'epoch': 0.01}
2%|▏ | 65/4286 [19:03<16:37:21, 14.18s/it]
{'loss': 0.0043, 'grad_norm': 2.3698879276888456, 'learning_rate': 9.848343443770414e-07, 'completion_length': 76.44643020629883, 'rewards/only_full_func_accuracy_reward': 0.18112247437238693, 'rewards/format_reward': 1.0, 'reward': 1.181122601032257, 'reward_std': 0.08019610866904259, 'kl': 0.1083984375, 'epoch': 0.02}
2%|▏ | 66/4286 [19:16<16:02:37, 13.69s/it]
{'loss': 0.0045, 'grad_norm': 3.7201541526150246, 'learning_rate': 9.846010265982267e-07, 'completion_length': 68.96429061889648, 'rewards/only_full_func_accuracy_reward': 0.2738095372915268, 'rewards/format_reward': 1.0, 'reward': 1.273809552192688, 'reward_std': 0.16948115080595016, 'kl': 0.1123046875, 'epoch': 0.02}
2%|▏ | 67/4286 [19:29<15:44:56, 13.44s/it]
{'loss': 0.0049, 'grad_norm': 3.882333756537762, 'learning_rate': 9.84367708819412e-07, 'completion_length': 74.10714340209961, 'rewards/only_full_func_accuracy_reward': 0.2440476343035698, 'rewards/format_reward': 1.0, 'reward': 1.2440477013587952, 'reward_std': 0.12800925970077515, 'kl': 0.1220703125, 'epoch': 0.02}
2%|▏ | 68/4286 [19:41<15:27:57, 13.20s/it]
{'loss': 0.0051, 'grad_norm': 2.4120629047060422, 'learning_rate': 9.841343910405972e-07, 'completion_length': 75.32143020629883, 'rewards/only_full_func_accuracy_reward': 0.07023810222744942, 'rewards/format_reward': 1.0, 'reward': 1.0702382326126099, 'reward_std': 0.05427668523043394, 'kl': 0.126220703125, 'epoch': 0.02}
2%|▏ | 69/4286 [19:55<15:39:33, 13.37s/it]
{'loss': 0.0052, 'grad_norm': 2.655132389247822, 'learning_rate': 9.839010732617825e-07, 'completion_length': 71.25000381469727, 'rewards/only_full_func_accuracy_reward': 0.16726191714406013, 'rewards/format_reward': 1.0, 'reward': 1.1672619581222534, 'reward_std': 0.05476190336048603, 'kl': 0.1298828125, 'epoch': 0.02}
2%|▏ | 70/4286 [20:10<16:13:13, 13.85s/it]
{'loss': 0.0052, 'grad_norm': 4.816008449424343, 'learning_rate': 9.836677554829678e-07, 'completion_length': 80.50000381469727, 'rewards/only_full_func_accuracy_reward': 0.17504961788654327, 'rewards/format_reward': 1.0, 'reward': 1.1750496625900269, 'reward_std': 0.15032102167606354, 'kl': 0.1298828125, 'epoch': 0.02}
2%|▏ | 71/4286 [20:24<16:22:00, 13.98s/it]
{'loss': 0.0058, 'grad_norm': 4.873605691981219, 'learning_rate': 9.83434437704153e-07, 'completion_length': 76.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.33861255645751953, 'rewards/format_reward': 1.0, 'reward': 1.3386126160621643, 'reward_std': 0.13038211688399315, 'kl': 0.1455078125, 'epoch': 0.02}
2%|▏ | 72/4286 [20:36<15:31:22, 13.26s/it]
{'loss': 0.0061, 'grad_norm': 3.0319655210707914, 'learning_rate': 9.832011199253383e-07, 'completion_length': 67.125, 'rewards/only_full_func_accuracy_reward': 0.20238097012043, 'rewards/format_reward': 1.0, 'reward': 1.2023810148239136, 'reward_std': 0.11789705231785774, 'kl': 0.1533203125, 'epoch': 0.02}
2%|▏ | 73/4286
[20:48<15:08:58, 12.95s/it] {'loss': 0.0064, 'grad_norm': 3.3027724019504308, 'learning_rate': 9.829678021465236e-07, 'completion_length': 70.71429061889648, 'rewards/only_full_func_accuracy_reward': 0.1056547686457634, 'rewards/format_reward': 1.0, 'reward': 1.1056548357009888, 'reward_std': 0.08060388267040253, 'kl': 0.1611328125, 'epoch': 0.02} 2%|▏ | 73/4286 [20:48<15:08:58, 12.95s/it] 2%|▏ | 74/4286 [21:02<15:20:10, 13.11s/it] {'loss': 0.0061, 'grad_norm': 3.4098900108948107, 'learning_rate': 9.827344843677087e-07, 'completion_length': 67.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.19272959977388382, 'rewards/format_reward': 1.0, 'reward': 1.1927297115325928, 'reward_std': 0.08152185752987862, 'kl': 0.1533203125, 'epoch': 0.02} 2%|▏ | 74/4286 [21:02<15:20:10, 13.11s/it] 2%|▏ | 75/4286 [21:17<16:08:26, 13.80s/it] {'loss': 0.0063, 'grad_norm': 10.587102093364276, 'learning_rate': 9.82501166588894e-07, 'completion_length': 88.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.08223498240113258, 'rewards/format_reward': 1.0, 'reward': 1.082235038280487, 'reward_std': 0.08648813515901566, 'kl': 0.15869140625, 'epoch': 0.02} 2%|▏ | 75/4286 [21:17<16:08:26, 13.80s/it] 2%|▏ | 76/4286 [21:31<16:12:49, 13.86s/it] {'loss': 0.0062, 'grad_norm': 2.5580589492884154, 'learning_rate': 9.822678488100794e-07, 'completion_length': 77.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.13869047537446022, 'rewards/format_reward': 1.0, 'reward': 1.1386905312538147, 'reward_std': 0.06309951469302177, 'kl': 0.15380859375, 'epoch': 0.02} 2%|▏ | 76/4286 [21:31<16:12:49, 13.86s/it] 2%|▏ | 77/4286 [21:46<16:39:00, 14.24s/it] {'loss': 0.0069, 'grad_norm': 3.633269649208637, 'learning_rate': 9.820345310312645e-07, 'completion_length': 73.78572082519531, 'rewards/only_full_func_accuracy_reward': 0.21653912961483002, 'rewards/format_reward': 1.0, 'reward': 1.2165391445159912, 'reward_std': 0.10026155784726143, 'kl': 0.171875, 'epoch': 0.02} 2%|▏ | 77/4286 
[21:46<16:39:00, 14.24s/it] 2%|▏ | 78/4286 [21:59<16:08:41, 13.81s/it] {'loss': 0.0067, 'grad_norm': 3.4265650307451407, 'learning_rate': 9.818012132524498e-07, 'completion_length': 70.66071891784668, 'rewards/only_full_func_accuracy_reward': 0.22663339227437973, 'rewards/format_reward': 1.0, 'reward': 1.2266334891319275, 'reward_std': 0.11424322426319122, 'kl': 0.16650390625, 'epoch': 0.02} 2%|▏ | 78/4286 [21:59<16:08:41, 13.81s/it] 2%|▏ | 79/4286 [22:13<16:19:46, 13.97s/it] {'loss': 0.006, 'grad_norm': 2.050480086049301, 'learning_rate': 9.815678954736352e-07, 'completion_length': 82.91072082519531, 'rewards/only_full_func_accuracy_reward': 0.10104167647659779, 'rewards/format_reward': 1.0, 'reward': 1.1010417938232422, 'reward_std': 0.07011731714010239, 'kl': 0.150390625, 'epoch': 0.02} 2%|▏ | 79/4286 [22:13<16:19:46, 13.97s/it] 2%|▏ | 80/4286 [22:31<17:33:59, 15.04s/it] {'loss': 0.0064, 'grad_norm': 3.4761767694163206, 'learning_rate': 9.813345776948203e-07, 'completion_length': 79.91071701049805, 'rewards/only_full_func_accuracy_reward': 0.2321428880095482, 'rewards/format_reward': 1.0, 'reward': 1.232142984867096, 'reward_std': 0.08542174845933914, 'kl': 0.15966796875, 'epoch': 0.02} 2%|▏ | 80/4286 [22:31<17:33:59, 15.04s/it] 2%|▏ | 81/4286 [22:44<16:55:51, 14.49s/it] {'loss': 0.0072, 'grad_norm': 3.0582936990538934, 'learning_rate': 9.811012599160056e-07, 'completion_length': 74.83929061889648, 'rewards/only_full_func_accuracy_reward': 0.17529762536287308, 'rewards/format_reward': 1.0, 'reward': 1.1752976775169373, 'reward_std': 0.09505880624055862, 'kl': 0.1796875, 'epoch': 0.02} 2%|▏ | 81/4286 [22:44<16:55:51, 14.49s/it] 2%|▏ | 82/4286 [22:57<16:34:06, 14.19s/it] {'loss': 0.0072, 'grad_norm': 5.312713259624826, 'learning_rate': 9.808679421371907e-07, 'completion_length': 73.8035774230957, 'rewards/only_full_func_accuracy_reward': 0.18918652832508087, 'rewards/format_reward': 1.0, 'reward': 1.1891865730285645, 'reward_std': 0.06403510505333543, 'kl': 
0.1806640625, 'epoch': 0.02} 2%|▏ | 82/4286 [22:57<16:34:06, 14.19s/it] 2%|▏ | 83/4286 [23:10<16:05:27, 13.78s/it] {'loss': 0.0068, 'grad_norm': 2.176710298318536, 'learning_rate': 9.80634624358376e-07, 'completion_length': 76.94643020629883, 'rewards/only_full_func_accuracy_reward': 0.11683674529194832, 'rewards/format_reward': 1.0, 'reward': 1.1168367862701416, 'reward_std': 0.06928969919681549, 'kl': 0.16943359375, 'epoch': 0.02} 2%|▏ | 83/4286 [23:10<16:05:27, 13.78s/it][2025-03-02 05:30:42,238] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 2%|▏ | 84/4286 [23:26<16:53:19, 14.47s/it] {'loss': 0.0071, 'grad_norm': 3.772283356517953, 'learning_rate': 9.804013065795614e-07, 'completion_length': 80.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.18735120445489883, 'rewards/format_reward': 1.0, 'reward': 1.1873512864112854, 'reward_std': 0.10691110417246819, 'kl': 0.17724609375, 'epoch': 0.02} 2%|▏ | 84/4286 [23:26<16:53:19, 14.47s/it] 2%|▏ | 85/4286 [23:40<16:36:57, 14.24s/it] {'loss': 0.0066, 'grad_norm': 3.028223523717492, 'learning_rate': 9.801679888007465e-07, 'completion_length': 83.66071701049805, 'rewards/only_full_func_accuracy_reward': 0.19451532512903214, 'rewards/format_reward': 1.0, 'reward': 1.194515347480774, 'reward_std': 0.13937978446483612, 'kl': 0.16552734375, 'epoch': 0.02} 2%|▏ | 85/4286 [23:40<16:36:57, 14.24s/it] 2%|▏ | 86/4286 [23:56<17:05:46, 14.65s/it] {'loss': 0.0068, 'grad_norm': 3.1507825990831115, 'learning_rate': 9.799346710219318e-07, 'completion_length': 85.96428680419922, 'rewards/only_full_func_accuracy_reward': 
0.22105657309293747, 'rewards/format_reward': 1.0, 'reward': 1.221056580543518, 'reward_std': 0.11522791534662247, 'kl': 0.16943359375, 'epoch': 0.02} 2%|▏ | 86/4286 [23:56<17:05:46, 14.65s/it] 2%|▏ | 87/4286 [24:12<17:35:11, 15.08s/it] {'loss': 0.0064, 'grad_norm': 5.652058043897978, 'learning_rate': 9.797013532431171e-07, 'completion_length': 93.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.22931548953056335, 'rewards/format_reward': 1.0, 'reward': 1.2293155193328857, 'reward_std': 0.11839301884174347, 'kl': 0.16015625, 'epoch': 0.02} 2%|▏ | 87/4286 [24:12<17:35:11, 15.08s/it] 2%|▏ | 88/4286 [24:27<17:28:14, 14.98s/it] {'loss': 0.0069, 'grad_norm': 2.5405031306359587, 'learning_rate': 9.794680354643023e-07, 'completion_length': 83.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.1859835609793663, 'rewards/format_reward': 1.0, 'reward': 1.185983657836914, 'reward_std': 0.09823360294103622, 'kl': 0.171875, 'epoch': 0.02} 2%|▏ | 88/4286 [24:27<17:28:14, 14.98s/it] 2%|▏ | 89/4286 [24:39<16:45:32, 14.38s/it] {'loss': 0.0066, 'grad_norm': 2.407798969082897, 'learning_rate': 9.792347176854876e-07, 'completion_length': 78.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.16934524476528168, 'rewards/format_reward': 1.0, 'reward': 1.1693453192710876, 'reward_std': 0.054962148889899254, 'kl': 0.1640625, 'epoch': 0.02} 2%|▏ | 89/4286 [24:39<16:45:32, 14.38s/it] 2%|▏ | 90/4286 [24:54<16:46:08, 14.39s/it] {'loss': 0.0069, 'grad_norm': 3.866871480342157, 'learning_rate': 9.79001399906673e-07, 'completion_length': 68.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.33839287608861923, 'rewards/format_reward': 1.0, 'reward': 1.338392972946167, 'reward_std': 0.13049817830324173, 'kl': 0.17138671875, 'epoch': 0.02} 2%|▏ | 90/4286 [24:54<16:46:08, 14.39s/it] 2%|▏ | 91/4286 [25:06<15:58:19, 13.71s/it] {'loss': 0.0079, 'grad_norm': 4.034871086486291, 'learning_rate': 9.78768082127858e-07, 'completion_length': 65.64286041259766, 
'rewards/only_full_func_accuracy_reward': 0.2414434775710106, 'rewards/format_reward': 1.0, 'reward': 1.2414435148239136, 'reward_std': 0.08407738991081715, 'kl': 0.1982421875, 'epoch': 0.02} 2%|▏ | 91/4286 [25:06<15:58:19, 13.71s/it] 2%|▏ | 92/4286 [25:20<16:10:40, 13.89s/it] {'loss': 0.0068, 'grad_norm': 5.436846342036476, 'learning_rate': 9.785347643490434e-07, 'completion_length': 83.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.21488095819950104, 'rewards/format_reward': 1.0, 'reward': 1.2148810625076294, 'reward_std': 0.13112711161375046, 'kl': 0.1708984375, 'epoch': 0.02} 2%|▏ | 92/4286 [25:20<16:10:40, 13.89s/it] 2%|▏ | 93/4286 [25:34<16:08:37, 13.86s/it] {'loss': 0.0067, 'grad_norm': 2.9285153495661556, 'learning_rate': 9.783014465702287e-07, 'completion_length': 72.83928680419922, 'rewards/only_full_func_accuracy_reward': 0.2663690745830536, 'rewards/format_reward': 1.0, 'reward': 1.266369104385376, 'reward_std': 0.0634902361780405, 'kl': 0.16845703125, 'epoch': 0.02} 2%|▏ | 93/4286 [25:34<16:08:37, 13.86s/it] 2%|▏ | 94/4286 [25:47<15:49:11, 13.59s/it] {'loss': 0.0072, 'grad_norm': 29.555074991078598, 'learning_rate': 9.780681287914138e-07, 'completion_length': 75.58929061889648, 'rewards/only_full_func_accuracy_reward': 0.2485119327902794, 'rewards/format_reward': 1.0, 'reward': 1.2485120296478271, 'reward_std': 0.09933928027749062, 'kl': 0.1796875, 'epoch': 0.02} 2%|▏ | 94/4286 [25:47<15:49:11, 13.59s/it] 2%|▏ | 95/4286 [26:05<17:15:32, 14.83s/it] {'loss': 0.0074, 'grad_norm': 5.024838952182508, 'learning_rate': 9.778348110125991e-07, 'completion_length': 84.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.2079809159040451, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.190123736858368, 'reward_std': 0.13880963250994682, 'kl': 0.185546875, 'epoch': 0.02} 2%|▏ | 95/4286 [26:05<17:15:32, 14.83s/it] 2%|▏ | 96/4286 [26:19<16:59:29, 14.60s/it] {'loss': 0.0063, 'grad_norm': 2.7293277976865644, 'learning_rate': 
9.776014932337845e-07, 'completion_length': 79.75000381469727, 'rewards/only_full_func_accuracy_reward': 0.18214286863803864, 'rewards/format_reward': 1.0, 'reward': 1.1821430325508118, 'reward_std': 0.07674040459096432, 'kl': 0.15771484375, 'epoch': 0.02} 2%|▏ | 96/4286 [26:19<16:59:29, 14.60s/it] 2%|▏ | 97/4286 [26:33<16:49:44, 14.46s/it] {'loss': 0.0062, 'grad_norm': 2.288648723996582, 'learning_rate': 9.773681754549696e-07, 'completion_length': 74.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.2556547671556473, 'rewards/format_reward': 1.0, 'reward': 1.2556548714637756, 'reward_std': 0.06839270517230034, 'kl': 0.15380859375, 'epoch': 0.02} 2%|▏ | 97/4286 [26:33<16:49:44, 14.46s/it] 2%|▏ | 98/4286 [26:47<16:42:56, 14.37s/it] {'loss': 0.0064, 'grad_norm': 2.811692137485108, 'learning_rate': 9.77134857676155e-07, 'completion_length': 75.12500381469727, 'rewards/only_full_func_accuracy_reward': 0.2693452462553978, 'rewards/format_reward': 1.0, 'reward': 1.2693453431129456, 'reward_std': 0.0523043810389936, 'kl': 0.15966796875, 'epoch': 0.02} 2%|▏ | 98/4286 [26:47<16:42:56, 14.37s/it] 2%|▏ | 99/4286 [26:59<15:52:59, 13.66s/it] {'loss': 0.0067, 'grad_norm': 2.8285622359488234, 'learning_rate': 9.769015398973402e-07, 'completion_length': 75.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.1860119178891182, 'rewards/format_reward': 1.0, 'reward': 1.186012089252472, 'reward_std': 0.1232350580394268, 'kl': 0.1669921875, 'epoch': 0.02} 2%|▏ | 99/4286 [26:59<15:52:59, 13.66s/it] 2%|▏ | 100/4286 [27:13<15:48:55, 13.60s/it] {'loss': 0.0053, 'grad_norm': 3.032351901292137, 'learning_rate': 9.766682221185254e-07, 'completion_length': 89.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.16675879806280136, 'rewards/format_reward': 1.0, 'reward': 1.1667588353157043, 'reward_std': 0.1356324702501297, 'kl': 0.13330078125, 'epoch': 0.02} 2%|▏ | 100/4286 [27:13<15:48:55, 13.60s/it] 2%|▏ | 101/4286 [33:32<143:25:57, 123.38s/it] {'loss': 0.0057, 
'grad_norm': 2.468899026660015, 'learning_rate': 9.764349043397107e-07, 'completion_length': 89.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.13660715520381927, 'rewards/format_reward': 1.0, 'reward': 1.1366072297096252, 'reward_std': 0.06961158663034439, 'kl': 0.1416015625, 'epoch': 0.02} 2%|▏ | 101/4286 [33:32<143:25:57, 123.38s/it] 2%|▏ | 102/4286 [33:48<105:59:56, 91.20s/it] {'loss': 0.0055, 'grad_norm': 2.8349820569644666, 'learning_rate': 9.76201586560896e-07, 'completion_length': 94.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.11799243092536926, 'rewards/format_reward': 1.0, 'reward': 1.1179924607276917, 'reward_std': 0.11901802197098732, 'kl': 0.1376953125, 'epoch': 0.02} 2%|▏ | 102/4286 [33:48<105:59:56, 91.20s/it] 2%|▏ | 103/4286 [34:03<79:26:03, 68.36s/it] {'loss': 0.0054, 'grad_norm': 2.5223850551719393, 'learning_rate': 9.759682687820811e-07, 'completion_length': 92.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.15803572162985802, 'rewards/format_reward': 1.0, 'reward': 1.158035695552826, 'reward_std': 0.12088993191719055, 'kl': 0.1337890625, 'epoch': 0.02} 2%|▏ | 103/4286 [34:03<79:26:03, 68.36s/it] 2%|▏ | 104/4286 [34:20<61:17:31, 52.76s/it] {'loss': 0.0053, 'grad_norm': 2.126879291809878, 'learning_rate': 9.757349510032665e-07, 'completion_length': 97.75000381469727, 'rewards/only_full_func_accuracy_reward': 0.2023809626698494, 'rewards/format_reward': 1.0, 'reward': 1.2023810744285583, 'reward_std': 0.10947179794311523, 'kl': 0.13232421875, 'epoch': 0.02} 2%|▏ | 104/4286 [34:20<61:17:31, 52.76s/it] 2%|▏ | 105/4286 [34:36<48:38:01, 41.88s/it] {'loss': 0.0044, 'grad_norm': 3.0575671672163476, 'learning_rate': 9.755016332244516e-07, 'completion_length': 94.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.26921770721673965, 'rewards/format_reward': 1.0, 'reward': 1.269217848777771, 'reward_std': 0.1032898761332035, 'kl': 0.109375, 'epoch': 0.02} 2%|▏ | 105/4286 [34:36<48:38:01, 41.88s/it] 2%|▏ | 
106/4286 [34:53<39:53:59, 34.36s/it] {'loss': 0.0054, 'grad_norm': 8.86475998879188, 'learning_rate': 9.75268315445637e-07, 'completion_length': 96.75000381469727, 'rewards/only_full_func_accuracy_reward': 0.18225626274943352, 'rewards/format_reward': 1.0, 'reward': 1.182256281375885, 'reward_std': 0.1379730999469757, 'kl': 0.135498046875, 'epoch': 0.02} 2%|▏ | 106/4286 [34:53<39:53:59, 34.36s/it] 2%|▏ | 107/4286 [35:09<33:36:58, 28.96s/it] {'loss': 0.0048, 'grad_norm': 6.756058136879214, 'learning_rate': 9.750349976668222e-07, 'completion_length': 93.37500381469727, 'rewards/only_full_func_accuracy_reward': 0.2529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.2529763579368591, 'reward_std': 0.10062329471111298, 'kl': 0.1201171875, 'epoch': 0.02} 2%|▏ | 107/4286 [35:09<33:36:58, 28.96s/it] 3%|▎ | 108/4286 [35:27<29:34:17, 25.48s/it] {'loss': 0.0045, 'grad_norm': 3.310388773100552, 'learning_rate': 9.748016798880073e-07, 'completion_length': 103.21428680419922, 'rewards/only_full_func_accuracy_reward': 0.22653063386678696, 'rewards/format_reward': 1.0, 'reward': 1.22653067111969, 'reward_std': 0.1515505276620388, 'kl': 0.11328125, 'epoch': 0.03} 3%|▎ | 108/4286 [35:27<29:34:17, 25.48s/it] 3%|▎ | 109/4286 [35:46<27:15:19, 23.49s/it] {'loss': 0.0048, 'grad_norm': 2.751837703656562, 'learning_rate': 9.745683621091927e-07, 'completion_length': 89.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.18214287236332893, 'rewards/format_reward': 1.0, 'reward': 1.1821429133415222, 'reward_std': 0.11778583377599716, 'kl': 0.12060546875, 'epoch': 0.03} 3%|▎ | 109/4286 [35:46<27:15:19, 23.49s/it] 3%|▎ | 110/4286 [36:04<25:28:38, 21.96s/it] {'loss': 0.0042, 'grad_norm': 3.089568305305323, 'learning_rate': 9.74335044330378e-07, 'completion_length': 97.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.22817462682724, 'rewards/format_reward': 1.0, 'reward': 1.2281746864318848, 'reward_std': 0.11346907168626785, 'kl': 0.105224609375, 'epoch': 0.03} 3%|▎ | 
110/4286 [36:04<25:28:38, 21.96s/it] 3%|▎ | 111/4286 [36:22<24:09:44, 20.83s/it] {'loss': 0.0042, 'grad_norm': 4.091765437579233, 'learning_rate': 9.741017265515631e-07, 'completion_length': 110.71429061889648, 'rewards/only_full_func_accuracy_reward': 0.17264030873775482, 'rewards/format_reward': 1.0, 'reward': 1.1726403832435608, 'reward_std': 0.12301061674952507, 'kl': 0.105224609375, 'epoch': 0.03} 3%|▎ | 111/4286 [36:22<24:09:44, 20.83s/it] 3%|▎ | 112/4286 [36:40<23:14:11, 20.04s/it] {'loss': 0.004, 'grad_norm': 1.9777374187396781, 'learning_rate': 9.738684087727484e-07, 'completion_length': 111.33929061889648, 'rewards/only_full_func_accuracy_reward': 0.09791667014360428, 'rewards/format_reward': 1.0, 'reward': 1.0979167222976685, 'reward_std': 0.08858516253530979, 'kl': 0.09912109375, 'epoch': 0.03} 3%|▎ | 112/4286 [36:40<23:14:11, 20.04s/it] 3%|▎ | 113/4286 [36:57<22:11:59, 19.15s/it] {'loss': 0.0039, 'grad_norm': 2.697086423236356, 'learning_rate': 9.736350909939338e-07, 'completion_length': 126.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.22472365945577621, 'rewards/format_reward': 1.0, 'reward': 1.2247236967086792, 'reward_std': 0.13727953284978867, 'kl': 0.09716796875, 'epoch': 0.03} 3%|▎ | 113/4286 [36:57<22:11:59, 19.15s/it] 3%|▎ | 114/4286 [37:16<21:59:14, 18.97s/it] {'loss': 0.0039, 'grad_norm': 2.437895613403582, 'learning_rate': 9.734017732151189e-07, 'completion_length': 123.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.18605442345142365, 'rewards/format_reward': 1.0, 'reward': 1.186054527759552, 'reward_std': 0.16416718810796738, 'kl': 0.097412109375, 'epoch': 0.03} 3%|▎ | 114/4286 [37:16<21:59:14, 18.97s/it] 3%|▎ | 115/4286 [37:33<21:28:14, 18.53s/it] {'loss': 0.0044, 'grad_norm': 3.1727421566693943, 'learning_rate': 9.731684554363042e-07, 'completion_length': 100.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.2495536059141159, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2316965460777283, 
'reward_std': 0.15177354216575623, 'kl': 0.110107421875, 'epoch': 0.03} 3%|▎ | 115/4286 [37:33<21:28:14, 18.53s/it] 3%|▎ | 116/4286 [37:52<21:30:05, 18.56s/it] {'loss': 0.0038, 'grad_norm': 2.66264736888057, 'learning_rate': 9.729351376574895e-07, 'completion_length': 102.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.17678572237491608, 'rewards/format_reward': 1.0, 'reward': 1.1767857670783997, 'reward_std': 0.08590801060199738, 'kl': 0.094970703125, 'epoch': 0.03} 3%|▎ | 116/4286 [37:52<21:30:05, 18.56s/it] 3%|▎ | 117/4286 [38:08<20:41:24, 17.87s/it] {'loss': 0.0047, 'grad_norm': 3.6903735562551585, 'learning_rate': 9.727018198786747e-07, 'completion_length': 103.50000381469727, 'rewards/only_full_func_accuracy_reward': 0.24707941710948944, 'rewards/format_reward': 1.0, 'reward': 1.2470794320106506, 'reward_std': 0.10819780826568604, 'kl': 0.11669921875, 'epoch': 0.03} 3%|▎ | 117/4286 [38:08<20:41:24, 17.87s/it] 3%|▎ | 118/4286 [38:25<20:10:05, 17.42s/it] {'loss': 0.0045, 'grad_norm': 2.466349472226151, 'learning_rate': 9.7246850209986e-07, 'completion_length': 105.53571701049805, 'rewards/only_full_func_accuracy_reward': 0.24226191639900208, 'rewards/format_reward': 1.0, 'reward': 1.2422620058059692, 'reward_std': 0.17460735887289047, 'kl': 0.111328125, 'epoch': 0.03} 3%|▎ | 118/4286 [38:25<20:10:05, 17.42s/it] 3%|▎ | 119/4286 [38:40<19:20:15, 16.71s/it] {'loss': 0.0045, 'grad_norm': 4.44365077309537, 'learning_rate': 9.722351843210453e-07, 'completion_length': 96.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.2046131119132042, 'rewards/format_reward': 1.0, 'reward': 1.204613208770752, 'reward_std': 0.09559819847345352, 'kl': 0.113037109375, 'epoch': 0.03} 3%|▎ | 119/4286 [38:40<19:20:15, 16.71s/it] 3%|▎ | 120/4286 [38:57<19:37:53, 16.96s/it] {'loss': 0.004, 'grad_norm': 5.634674273356894, 'learning_rate': 9.720018665422304e-07, 'completion_length': 101.87500381469727, 'rewards/only_full_func_accuracy_reward': 0.23392857611179352, 
'rewards/format_reward': 1.0, 'reward': 1.2339286804199219, 'reward_std': 0.12948886305093765, 'kl': 0.099365234375, 'epoch': 0.03} 3%|▎ | 120/4286 [38:57<19:37:53, 16.96s/it] 3%|▎ | 121/4286 [39:13<19:11:06, 16.58s/it] {'loss': 0.0049, 'grad_norm': 2.082070728016547, 'learning_rate': 9.717685487634158e-07, 'completion_length': 96.16072082519531, 'rewards/only_full_func_accuracy_reward': 0.20729168690741062, 'rewards/format_reward': 1.0, 'reward': 1.2072917819023132, 'reward_std': 0.0802498497068882, 'kl': 0.121337890625, 'epoch': 0.03} 3%|▎ | 121/4286 [39:13<19:11:06, 16.58s/it] 3%|▎ | 122/4286 [39:31<19:40:18, 17.01s/it] {'loss': 0.0042, 'grad_norm': 2.9326081859963606, 'learning_rate': 9.71535230984601e-07, 'completion_length': 104.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.2172619253396988, 'rewards/format_reward': 1.0, 'reward': 1.2172619104385376, 'reward_std': 0.129481740295887, 'kl': 0.104736328125, 'epoch': 0.03} 3%|▎ | 122/4286 [39:31<19:40:18, 17.01s/it] 3%|▎ | 123/4286 [39:48<19:48:42, 17.13s/it] {'loss': 0.0045, 'grad_norm': 2.654905366461235, 'learning_rate': 9.713019132057862e-07, 'completion_length': 106.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.12857143953442574, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.1107143759727478, 'reward_std': 0.09643922187387943, 'kl': 0.113037109375, 'epoch': 0.03} 3%|▎ | 123/4286 [39:48<19:48:42, 17.13s/it] 3%|▎ | 124/4286 [40:07<20:27:27, 17.70s/it] {'loss': 0.004, 'grad_norm': 2.453509190202524, 'learning_rate': 9.710685954269715e-07, 'completion_length': 122.28571701049805, 'rewards/only_full_func_accuracy_reward': 0.1912202537059784, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.1733631491661072, 'reward_std': 0.1265498511493206, 'kl': 0.10107421875, 'epoch': 0.03} 3%|▎ | 124/4286 [40:07<20:27:27, 17.70s/it] 3%|▎ | 125/4286 [40:23<19:50:03, 17.16s/it] {'loss': 0.0043, 'grad_norm': 1.9123584821746413, 'learning_rate': 9.708352776481569e-07, 
'completion_length': 103.26786422729492, 'rewards/only_full_func_accuracy_reward': 0.1860119178891182, 'rewards/format_reward': 1.0, 'reward': 1.1860119700431824, 'reward_std': 0.04648453835397959, 'kl': 0.106689453125, 'epoch': 0.03} 3%|▎ | 125/4286 [40:23<19:50:03, 17.16s/it] 3%|▎ | 126/4286 [40:40<19:34:16, 16.94s/it] {'loss': 0.0042, 'grad_norm': 1.843144335047526, 'learning_rate': 9.70601959869342e-07, 'completion_length': 107.91072082519531, 'rewards/only_full_func_accuracy_reward': 0.3125000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3125001788139343, 'reward_std': 0.13704578578472137, 'kl': 0.1044921875, 'epoch': 0.03} 3%|▎ | 126/4286 [40:40<19:34:16, 16.94s/it] 3%|▎ | 127/4286 [40:55<18:53:11, 16.35s/it] {'loss': 0.0042, 'grad_norm': 1.6899632113811136, 'learning_rate': 9.703686420905273e-07, 'completion_length': 93.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.29821430146694183, 'rewards/format_reward': 1.0, 'reward': 1.2982143759727478, 'reward_std': 0.10256345011293888, 'kl': 0.10400390625, 'epoch': 0.03} 3%|▎ | 127/4286 [40:55<18:53:11, 16.35s/it] 3%|▎ | 128/4286 [41:12<19:10:35, 16.60s/it] {'loss': 0.0046, 'grad_norm': 2.618119273709178, 'learning_rate': 9.701353243117124e-07, 'completion_length': 101.0535774230957, 'rewards/only_full_func_accuracy_reward': 0.19032739847898483, 'rewards/format_reward': 1.0, 'reward': 1.1903274655342102, 'reward_std': 0.10070829093456268, 'kl': 0.11572265625, 'epoch': 0.03} 3%|▎ | 128/4286 [41:12<19:10:35, 16.60s/it] 3%|▎ | 129/4286 [41:30<19:37:00, 16.99s/it] {'loss': 0.0044, 'grad_norm': 2.7356116931021544, 'learning_rate': 9.699020065328977e-07, 'completion_length': 102.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.2782738283276558, 'rewards/format_reward': 1.0, 'reward': 1.27827388048172, 'reward_std': 0.09935421496629715, 'kl': 0.111083984375, 'epoch': 0.03} 3%|▎ | 129/4286 [41:30<19:37:00, 16.99s/it] 3%|▎ | 130/4286 [41:47<19:46:33, 17.13s/it] {'loss': 0.0046, 'grad_norm': 
2.077740977190694, 'learning_rate': 9.69668688754083e-07, 'completion_length': 100.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.22232143580913544, 'rewards/format_reward': 1.0, 'reward': 1.2223215103149414, 'reward_std': 0.10213418677449226, 'kl': 0.11474609375, 'epoch': 0.03} 3%|▎ | 130/4286 [41:47<19:46:33, 17.13s/it] 3%|▎ | 131/4286 [42:04<19:31:29, 16.92s/it] {'loss': 0.0048, 'grad_norm': 2.562910448389568, 'learning_rate': 9.694353709752682e-07, 'completion_length': 103.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.22083333879709244, 'rewards/format_reward': 1.0, 'reward': 1.220833420753479, 'reward_std': 0.12786884233355522, 'kl': 0.118896484375, 'epoch': 0.03} 3%|▎ | 131/4286 [42:04<19:31:29, 16.92s/it] 3%|▎ | 132/4286 [42:22<19:58:53, 17.32s/it] {'loss': 0.0052, 'grad_norm': 5.703011924256221, 'learning_rate': 9.692020531964535e-07, 'completion_length': 97.46429061889648, 'rewards/only_full_func_accuracy_reward': 0.33142007887363434, 'rewards/format_reward': 1.0, 'reward': 1.331420123577118, 'reward_std': 0.13546306267380714, 'kl': 0.129150390625, 'epoch': 0.03} 3%|▎ | 132/4286 [42:22<19:58:53, 17.32s/it] 3%|▎ | 133/4286 [42:37<19:05:10, 16.54s/it] {'loss': 0.0051, 'grad_norm': 2.4349772789993476, 'learning_rate': 9.689687354176389e-07, 'completion_length': 85.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.2395833432674408, 'rewards/format_reward': 1.0, 'reward': 1.2395833730697632, 'reward_std': 0.17121277004480362, 'kl': 0.1279296875, 'epoch': 0.03} 3%|▎ | 133/4286 [42:37<19:05:10, 16.54s/it] 3%|▎ | 134/4286 [42:52<18:44:41, 16.25s/it] {'loss': 0.0046, 'grad_norm': 2.330163508587448, 'learning_rate': 9.68735417638824e-07, 'completion_length': 87.00000381469727, 'rewards/only_full_func_accuracy_reward': 0.1666666716337204, 'rewards/format_reward': 1.0, 'reward': 1.1666667461395264, 'reward_std': 0.049460720270872116, 'kl': 0.1142578125, 'epoch': 0.03} 3%|▎ | 134/4286 [42:52<18:44:41, 16.25s/it] 3%|▎ | 135/4286 
GRPO training log, steps 135–235 of 4286 (one row per optimizer step):

step | elapsed | s/it | loss | grad_norm | lr (1e-7) | compl_len | acc_rew | fmt_rew | reward | rew_std | kl | epoch
135 | 43:08 | 16.00 | 0.0055 | 2.9192 | 9.6850 | 87.84 | 0.2616 | 1.0000 | 1.2616 | 0.1520 | 0.1372 | 0.03
136 | 43:22 | 15.39 | 0.0056 | 3.1586 | 9.6827 | 76.61 | 0.3363 | 1.0000 | 1.3363 | 0.1571 | 0.1406 | 0.03
137 | 43:38 | 15.54 | 0.0059 | 2.6674 | 9.6804 | 87.88 | 0.2190 | 1.0000 | 1.2190 | 0.1544 | 0.1484 | 0.03
138 | 43:53 | 15.61 | 0.0055 | 2.3698 | 9.6780 | 99.18 | 0.2710 | 1.0000 | 1.2710 | 0.1360 | 0.1372 | 0.03
139 | 44:14 | 17.02 | 0.0051 | 3.2419 | 9.6757 | 111.89 | 0.3303 | 1.0000 | 1.3303 | 0.1116 | 0.1279 | 0.03
140 | 44:32 | 17.55 | 0.0048 | 2.3376 | 9.6734 | 105.34 | 0.2402 | 1.0000 | 1.2402 | 0.0970 | 0.1189 | 0.03
141 | 44:52 | 18.06 | 0.0056 | 1.8323 | 9.6710 | 109.54 | 0.2161 | 1.0000 | 1.2161 | 0.1018 | 0.1396 | 0.03
142 | 45:13 | 18.93 | 0.0043 | 4.7837 | 9.6687 | 127.46 | 0.2231 | 0.9821 | 1.2052 | 0.1311 | 0.1079 | 0.03
143 | 45:32 | 19.19 | 0.0048 | 2.2667 | 9.6664 | 128.43 | 0.1543 | 0.9821 | 1.1365 | 0.1831 | 0.1204 | 0.03
144 | 45:50 | 18.70 | 0.0050 | 2.5693 | 9.6640 | 113.70 | 0.2982 | 0.9821 | 1.2804 | 0.1668 | 0.1240 | 0.03
145 | 46:09 | 18.91 | 0.0040 | 1.7094 | 9.6617 | 140.34 | 0.1554 | 0.9464 | 1.1018 | 0.1645 | 0.1008 | 0.03
146 | 46:27 | 18.65 | 0.0041 | 1.8939 | 9.6594 | 152.00 | 0.1017 | 1.0000 | 1.1017 | 0.0746 | 0.1023 | 0.03
147 | 46:47 | 19.06 | 0.0042 | 2.6557 | 9.6570 | 143.25 | 0.1634 | 0.9821 | 1.1455 | 0.1715 | 0.1050 | 0.03
148 | 47:12 | 20.66 | 0.0040 | 2.0685 | 9.6547 | 153.09 | 0.2259 | 0.9821 | 1.2080 | 0.1471 | 0.1008 | 0.03
149 | 47:34 | 21.25 | 0.0043 | 2.1930 | 9.6524 | 133.59 | 0.3174 | 0.9821 | 1.2996 | 0.1880 | 0.1084 | 0.03
150 | 47:53 | 20.51 | 0.0041 | 2.3787 | 9.6500 | 144.07 | 0.2563 | 1.0000 | 1.2563 | 0.1364 | 0.1023 | 0.03
151 | 48:15 | 20.84 | 0.0042 | 1.7682 | 9.6477 | 148.34 | 0.2567 | 1.0000 | 1.2567 | 0.0952 | 0.1042 | 0.04
152 | 48:36 | 20.99 | 0.0047 | 4.2139 | 9.6454 | 122.52 | 0.1862 | 1.0000 | 1.1862 | 0.1207 | 0.1165 | 0.04
153 | 48:56 | 20.73 | 0.0037 | 1.8562 | 9.6430 | 144.71 | 0.1923 | 1.0000 | 1.1923 | 0.0721 | 0.0923 | 0.04
154 | 49:17 | 20.60 | 0.0042 | 3.4368 | 9.6407 | 145.55 | 0.2109 | 1.0000 | 1.2109 | 0.1545 | 0.1055 | 0.04
155 | 49:33 | 19.31 | 0.0044 | 2.3329 | 9.6384 | 119.48 | 0.2338 | 1.0000 | 1.2338 | 0.1433 | 0.1111 | 0.04
156 | 49:53 | 19.65 | 0.0042 | 2.7506 | 9.6360 | 138.63 | 0.1892 | 1.0000 | 1.1892 | 0.1107 | 0.1050 | 0.04
157 | 50:12 | 19.28 | 0.0043 | 3.7965 | 9.6337 | 132.70 | 0.2435 | 0.9821 | 1.2257 | 0.1786 | 0.1077 | 0.04
158 | 50:31 | 19.13 | 0.0045 | 2.3242 | 9.6314 | 116.68 | 0.2669 | 1.0000 | 1.2669 | 0.0777 | 0.1113 | 0.04
159 | 50:47 | 18.44 | 0.0040 | 3.1116 | 9.6290 | 113.21 | 0.2085 | 1.0000 | 1.2085 | 0.1342 | 0.1001 | 0.04
160 | 51:09 | 19.43 | 0.0041 | 2.0881 | 9.6267 | 127.00 | 0.2548 | 0.9821 | 1.2370 | 0.1699 | 0.1016 | 0.04
161 | 51:30 | 19.78 | 0.0040 | 1.9130 | 9.6244 | 127.57 | 0.2527 | 1.0000 | 1.2527 | 0.0929 | 0.1001 | 0.04
162 | 51:49 | 19.53 | 0.0041 | 2.6041 | 9.6220 | 125.34 | 0.2408 | 1.0000 | 1.2408 | 0.1141 | 0.1025 | 0.04
163 | 52:09 | 19.71 | 0.0045 | 2.2755 | 9.6197 | 118.13 | 0.3573 | 0.9821 | 1.3395 | 0.2374 | 0.1135 | 0.04
164 | 52:28 | 19.58 | 0.0048 | 2.8012 | 9.6174 | 114.50 | 0.2493 | 1.0000 | 1.2493 | 0.1831 | 0.1211 | 0.04
165 | 52:45 | 18.67 | 0.0049 | 3.6743 | 9.6150 | 111.36 | 0.2687 | 1.0000 | 1.2687 | 0.1275 | 0.1218 | 0.04
166 | 53:01 | 18.08 | 0.0051 | 2.4344 | 9.6127 | 101.80 | 0.2370 | 1.0000 | 1.2370 | 0.1313 | 0.1282 | 0.04
167 | 53:17 | 17.28 | 0.0057 | 2.1048 | 9.6104 | 89.68 | 0.3442 | 1.0000 | 1.3442 | 0.0836 | 0.1436 | 0.04
168 | 53:35 | 17.46 | 0.0047 | 1.7662 | 9.6080 | 111.88 | 0.2100 | 1.0000 | 1.2100 | 0.0978 | 0.1172 | 0.04
169 | 53:53 | 17.78 | 0.0050 | 2.1298 | 9.6057 | 125.84 | 0.1729 | 1.0000 | 1.1729 | 0.1194 | 0.1245 | 0.04
170 | 54:10 | 17.61 | 0.0056 | 2.6282 | 9.6034 | 102.07 | 0.3286 | 1.0000 | 1.3286 | 0.1485 | 0.1406 | 0.04
171 | 54:29 | 17.92 | 0.0047 | 3.8479 | 9.6010 | 123.71 | 0.2198 | 1.0000 | 1.2198 | 0.1438 | 0.1169 | 0.04
172 | 54:46 | 17.68 | 0.0046 | 1.9852 | 9.5987 | 120.46 | 0.1825 | 1.0000 | 1.1825 | 0.1010 | 0.1152 | 0.04
173 | 55:06 | 18.24 | 0.0043 | 1.9797 | 9.5964 | 143.02 | 0.2219 | 1.0000 | 1.2219 | 0.1026 | 0.1084 | 0.04
174 | 55:29 | 19.66 | 0.0043 | 2.2145 | 9.5940 | 160.54 | 0.1250 | 0.9821 | 1.1072 | 0.1255 | 0.1082 | 0.04
175 | 55:51 | 20.38 | 0.0039 | 4.0110 | 9.5917 | 161.00 | 0.1222 | 1.0000 | 1.1222 | 0.0865 | 0.0972 | 0.04
176 | 56:14 | 21.19 | 0.0046 | 2.8703 | 9.5894 | 145.34 | 0.3249 | 0.9821 | 1.3070 | 0.1456 | 0.1143 | 0.04
177 | 56:35 | 21.17 | 0.0049 | 2.5233 | 9.5870 | 133.71 | 0.2578 | 1.0000 | 1.2578 | 0.1514 | 0.1218 | 0.04
178 | 56:58 | 21.60 | 0.0047 | 2.2238 | 9.5847 | 149.93 | 0.2733 | 1.0000 | 1.2733 | 0.1994 | 0.1174 | 0.04
179 | 57:20 | 21.89 | 0.0054 | 3.1443 | 9.5824 | 135.59 | 0.3539 | 1.0000 | 1.3539 | 0.2051 | 0.1362 | 0.04
180 | 57:39 | 21.08 | 0.0055 | 2.3878 | 9.5800 | 123.98 | 0.2064 | 1.0000 | 1.2064 | 0.0787 | 0.1377 | 0.04
181 | 57:56 | 19.85 | 0.0058 | 2.0225 | 9.5777 | 104.34 | 0.3786 | 1.0000 | 1.3786 | 0.1463 | 0.1445 | 0.04
182 | 58:13 | 18.93 | 0.0066 | 1.6431 | 9.5754 | 91.95 | 0.4423 | 1.0000 | 1.4423 | 0.0897 | 0.1650 | 0.04
183 | 58:34 | 19.45 | 0.0071 | 1.9462 | 9.5730 | 115.02 | 0.3073 | 1.0000 | 1.3073 | 0.1052 | 0.1772 | 0.04
184 | 58:50 | 18.41 | 0.0073 | 2.8639 | 9.5707 | 87.82 | 0.2696 | 1.0000 | 1.2696 | 0.1820 | 0.1821 | 0.04
185 | 59:07 | 18.03 | 0.0091 | 3.2096 | 9.5684 | 87.20 | 0.3299 | 1.0000 | 1.3299 | 0.0814 | 0.2266 | 0.04
186 | 59:23 | 17.40 | 0.0094 | 2.6381 | 9.5660 | 78.50 | 0.3597 | 1.0000 | 1.3597 | 0.1303 | 0.2363 | 0.04
187 | 59:37 | 16.48 | 0.0101 | 3.7481 | 9.5637 | 82.95 | 0.2280 | 1.0000 | 1.2280 | 0.1239 | 0.2539 | 0.04
188 | 59:51 | 15.63 | 0.0133 | 3.3913 | 9.5614 | 60.96 | 0.4137 | 1.0000 | 1.4137 | 0.1098 | 0.3320 | 0.04
189 | 1:00:07 | 15.68 | 0.0125 | 3.8824 | 9.5590 | 76.48 | 0.3467 | 1.0000 | 1.3467 | 0.1228 | 0.3125 | 0.04
190 | 1:00:25 | 16.48 | 0.0135 | 3.0569 | 9.5567 | 77.23 | 0.2699 | 1.0000 | 1.2699 | 0.0769 | 0.3369 | 0.04
191 | 1:00:40 | 16.17 | 0.0116 | 2.6009 | 9.5544 | 83.55 | 0.2738 | 1.0000 | 1.2738 | 0.1160 | 0.2891 | 0.04
192 | 1:00:56 | 16.16 | 0.0109 | 4.3894 | 9.5520 | 76.98 | 0.4188 | 1.0000 | 1.4188 | 0.1541 | 0.2725 | 0.04
193 | 1:01:13 | 16.27 | 0.0114 | 3.5763 | 9.5497 | 79.29 | 0.3356 | 1.0000 | 1.3356 | 0.1909 | 0.2852 | 0.05
194 | 1:01:28 | 15.88 | 0.0083 | 4.0287 | 9.5474 | 88.46 | 0.2753 | 1.0000 | 1.2753 | 0.1144 | 0.2061 | 0.05
195 | 1:01:48 | 17.16 | 0.0073 | 6.1927 | 9.5450 | 108.13 | 0.2042 | 0.9821 | 1.1863 | 0.1797 | 0.1812 | 0.05
196 | 1:02:07 | 17.54 | 0.0083 | 2.7250 | 9.5427 | 91.30 | 0.2467 | 1.0000 | 1.2467 | 0.0928 | 0.2080 | 0.05
197 | 1:02:25 | 17.74 | 0.0083 | 6.8509 | 9.5404 | 95.61 | 0.3122 | 1.0000 | 1.3122 | 0.1656 | 0.2065 | 0.05
198 | 1:02:43 | 17.88 | 0.0080 | 2.5207 | 9.5380 | 90.64 | 0.2833 | 1.0000 | 1.2833 | 0.1270 | 0.2002 | 0.05
199 | 1:02:58 | 17.08 | 0.0073 | 5.4620 | 9.5357 | 95.80 | 0.3688 | 1.0000 | 1.3688 | 0.1711 | 0.1836 | 0.05
200 | 1:03:14 | 16.64 | 0.0085 | 3.1480 | 9.5334 | 90.66 | 0.3259 | 1.0000 | 1.3259 | 0.1250 | 0.2139 | 0.05
201 | 1:09:51 | 130.92 | 0.0084 | 2.1782 | 9.5310 | 73.21 | 0.3482 | 1.0000 | 1.3482 | 0.1349 | 0.2100 | 0.05
202 | 1:10:07 | 96.33 | 0.0080 | 2.1914 | 9.5287 | 95.45 | 0.3153 | 1.0000 | 1.3153 | 0.0950 | 0.1997 | 0.05
203 | 1:10:27 | 73.37 | 0.0079 | 3.4486 | 9.5264 | 92.84 | 0.3380 | 1.0000 | 1.3380 | 0.1455 | 0.1982 | 0.05
204 | 1:10:46 | 57.20 | 0.0088 | 2.8483 | 9.5240 | 96.75 | 0.3027 | 0.9821 | 1.2848 | 0.1662 | 0.2197 | 0.05
205 | 1:11:03 | 44.97 | 0.0086 | 3.0558 | 9.5217 | 89.80 | 0.3486 | 1.0000 | 1.3486 | 0.1611 | 0.2139 | 0.05
206 | 1:11:18 | 36.03 | 0.0086 | 2.7459 | 9.5194 | 83.30 | 0.3688 | 1.0000 | 1.3688 | 0.1064 | 0.2163 | 0.05
207 | 1:11:34 | 29.94 | 0.0093 | 4.5620 | 9.5170 | 90.46 | 0.4315 | 1.0000 | 1.4315 | 0.1482 | 0.2329 | 0.05
208 | 1:11:50 | 25.80 | 0.0087 | 2.4453 | 9.5147 | 94.61 | 0.3682 | 1.0000 | 1.3682 | 0.1429 | 0.2168 | 0.05
209 | 1:12:07 | 23.28 | 0.0109 | 3.3498 | 9.5124 | 82.18 | 0.3414 | 1.0000 | 1.3414 | 0.1038 | 0.2725 | 0.05
210 | 1:12:24 | 21.49 | 0.0093 | 2.5005 | 9.5100 | 93.46 | 0.4033 | 1.0000 | 1.4033 | 0.0371 | 0.2329 | 0.05
211 | 1:12:42 | 20.28 | 0.0083 | 4.9086 | 9.5077 | 100.70 | 0.2036 | 1.0000 | 1.2036 | 0.0602 | 0.2080 | 0.05
212 | 1:12:58 | 18.96 | 0.0089 | 2.7645 | 9.5054 | 82.48 | 0.2336 | 1.0000 | 1.2336 | 0.1136 | 0.2227 | 0.05
213 | 1:13:16 | 18.78 | 0.0085 | 5.0544 | 9.5030 | 95.98 | 0.2884 | 1.0000 | 1.2884 | 0.1328 | 0.2134 | 0.05
214 | 1:13:32 | 17.99 | 0.0086 | 3.7873 | 9.5007 | 96.66 | 0.4107 | 1.0000 | 1.4107 | 0.1666 | 0.2148 | 0.05
215 | 1:13:47 | 16.93 | 0.0092 | 1.8973 | 9.4984 | 86.29 | 0.3746 | 1.0000 | 1.3746 | 0.0639 | 0.2290 | 0.05
216 | 1:14:06 | 17.56 | 0.0081 | 2.0201 | 9.4960 | 108.43 | 0.2938 | 1.0000 | 1.2938 | 0.1156 | 0.2026 | 0.05
217 | 1:14:23 | 17.53 | 0.0093 | 2.2623 | 9.4937 | 107.82 | 0.2358 | 1.0000 | 1.2358 | 0.0945 | 0.2314 | 0.05
218 | 1:14:40 | 17.18 | 0.0078 | 2.8324 | 9.4914 | 108.36 | 0.2976 | 1.0000 | 1.2976 | 0.1345 | 0.1943 | 0.05
219 | 1:14:56 | 16.98 | 0.0083 | 1.9258 | 9.4890 | 101.21 | 0.2446 | 1.0000 | 1.2446 | 0.0729 | 0.2085 | 0.05
220 | 1:15:16 | 17.98 | 0.0064 | 2.7036 | 9.4867 | 134.02 | 0.3649 | 1.0000 | 1.3649 | 0.1594 | 0.1611 | 0.05
221 | 1:15:34 | 17.85 | 0.0072 | 2.6916 | 9.4844 | 117.95 | 0.3323 | 1.0000 | 1.3323 | 0.1563 | 0.1797 | 0.05
222 | 1:15:52 | 17.95 | 0.0066 | 2.4135 | 9.4820 | 128.91 | 0.2986 | 1.0000 | 1.2986 | 0.1314 | 0.1655 | 0.05
223 | 1:16:15 | 19.53 | 0.0067 | 5.4385 | 9.4797 | 139.70 | 0.3399 | 1.0000 | 1.3399 | 0.1332 | 0.1685 | 0.05
224 | 1:16:37 | 20.28 | 0.0071 | 2.0075 | 9.4774 | 146.52 | 0.1777 | 1.0000 | 1.1777 | 0.1374 | 0.1782 | 0.05
225 | 1:16:56 | 19.68 | 0.0076 | 2.1499 | 9.4750 | 126.86 | 0.2875 | 1.0000 | 1.2875 | 0.1381 | 0.1895 | 0.05
226 | 1:17:17 | 20.08 | 0.0070 | 2.6031 | 9.4727 | 141.48 | 0.2179 | 1.0000 | 1.2179 | 0.1688 | 0.1748 | 0.05
227 | 1:17:38 | 20.45 | 0.0068 | 2.3572 | 9.4704 | 150.16 | 0.4028 | 0.9643 | 1.3671 | 0.1535 | 0.1694 | 0.05
228 | 1:18:00 | 20.89 | 0.0069 | 2.3699 | 9.4680 | 159.50 | 0.3143 | 0.9821 | 1.2964 | 0.1939 | 0.1733 | 0.05
229 | 1:18:23 | 21.54 | 0.0074 | 3.4868 | 9.4657 | 142.57 | 0.3112 | 0.9821 | 1.2933 | 0.1892 | 0.1855 | 0.05
230 | 1:18:44 | 21.49 | 0.0073 | 2.5005 | 9.4634 | 120.23 | 0.3734 | 0.9821 | 1.3555 | 0.1501 | 0.1812 | 0.05
231 | 1:19:06 | 21.65 | 0.0065 | 6.8753 | 9.4610 | 133.30 | 0.3012 | 0.9821 | 1.2833 | 0.1548 | 0.1631 | 0.05
232 | 1:19:24 | 20.34 | 0.0071 | 3.2444 | 9.4587 | 118.59 | 0.2999 | 1.0000 | 1.2999 | 0.1496 | 0.1768 | 0.05
233 | 1:19:40 | 19.15 | 0.0084 | 3.2272 | 9.4564 | 87.57 | 0.2446 | 1.0000 | 1.2446 | 0.1250 | 0.2114 | 0.05
234 | 1:19:56 | 18.11 | 0.0074 | 2.8068 | 9.4540 | 106.57 | 0.2824 | 1.0000 | 1.2824 | 0.1540 | 0.1841 | 0.05
235 | 1:20:11 | 17.13 | 0.0079 | 5.1118 | 9.4517 | 86.61 | 0.3735 | 1.0000 | 1.3735 | 0.1000 | 0.1978 | 0.05

Legend: lr = learning_rate scaled by 1e7; compl_len = completion_length; acc_rew = rewards/only_full_func_accuracy_reward; fmt_rew = rewards/format_reward; reward = acc_rew + fmt_rew; values rounded to four decimals. The elapsed/s-per-it jump at step 201 (16.64 s/it at step 200 to 130.92 s/it) appears in the raw progress bars as recorded.
17.13s/it] 6%|▌ | 236/4286 [1:20:26<18:50:07, 16.74s/it] {'loss': 0.0089, 'grad_norm': 2.1327736362319714, 'learning_rate': 9.4493700419972e-07, 'completion_length': 91.53571701049805, 'rewards/only_full_func_accuracy_reward': 0.208333358168602, 'rewards/format_reward': 1.0, 'reward': 1.2083334922790527, 'reward_std': 0.11545134335756302, 'kl': 0.22265625, 'epoch': 0.06} 6%|▌ | 236/4286 [1:20:26<18:50:07, 16.74s/it] 6%|▌ | 237/4286 [1:20:41<18:01:03, 16.02s/it] {'loss': 0.0079, 'grad_norm': 2.332720202593393, 'learning_rate': 9.447036864209052e-07, 'completion_length': 94.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.2792659103870392, 'rewards/format_reward': 1.0, 'reward': 1.2792659997940063, 'reward_std': 0.09082674235105515, 'kl': 0.19873046875, 'epoch': 0.06} 6%|▌ | 237/4286 [1:20:41<18:01:03, 16.02s/it] 6%|▌ | 238/4286 [1:20:56<17:44:23, 15.78s/it] {'loss': 0.0076, 'grad_norm': 3.443671024618125, 'learning_rate': 9.444703686420905e-07, 'completion_length': 92.3035774230957, 'rewards/only_full_func_accuracy_reward': 0.21041668206453323, 'rewards/format_reward': 1.0, 'reward': 1.2104167342185974, 'reward_std': 0.10481787100434303, 'kl': 0.18994140625, 'epoch': 0.06} 6%|▌ | 238/4286 [1:20:56<17:44:23, 15.78s/it] 6%|▌ | 239/4286 [1:21:11<17:36:31, 15.66s/it] {'loss': 0.0077, 'grad_norm': 3.6787899778095623, 'learning_rate': 9.442370508632758e-07, 'completion_length': 95.08929061889648, 'rewards/only_full_func_accuracy_reward': 0.330357164144516, 'rewards/format_reward': 1.0, 'reward': 1.3303572535514832, 'reward_std': 0.10776083171367645, 'kl': 0.193359375, 'epoch': 0.06} 6%|▌ | 239/4286 [1:21:11<17:36:31, 15.66s/it] 6%|▌ | 240/4286 [1:21:26<17:10:28, 15.28s/it] {'loss': 0.0079, 'grad_norm': 3.8997112395200326, 'learning_rate': 9.44003733084461e-07, 'completion_length': 86.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.3169642984867096, 'rewards/format_reward': 1.0, 'reward': 1.316964328289032, 'reward_std': 0.13281292095780373, 'kl': 
0.1962890625, 'epoch': 0.06} 6%|▌ | 240/4286 [1:21:26<17:10:28, 15.28s/it] 6%|▌ | 241/4286 [1:21:41<17:05:28, 15.21s/it] {'loss': 0.0089, 'grad_norm': 11.47971430352461, 'learning_rate': 9.437704153056462e-07, 'completion_length': 90.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.2782738208770752, 'rewards/format_reward': 1.0, 'reward': 1.2782739400863647, 'reward_std': 0.11466243118047714, 'kl': 0.2216796875, 'epoch': 0.06} 6%|▌ | 241/4286 [1:21:41<17:05:28, 15.21s/it] 6%|▌ | 242/4286 [1:21:56<17:06:31, 15.23s/it] {'loss': 0.0082, 'grad_norm': 3.1940140733443196, 'learning_rate': 9.435370975268316e-07, 'completion_length': 86.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.4895833879709244, 'rewards/format_reward': 1.0, 'reward': 1.489583432674408, 'reward_std': 0.15678390860557556, 'kl': 0.2060546875, 'epoch': 0.06} 6%|▌ | 242/4286 [1:21:56<17:06:31, 15.23s/it] 6%|▌ | 243/4286 [1:22:10<16:41:56, 14.87s/it] {'loss': 0.0084, 'grad_norm': 1.925935921682809, 'learning_rate': 9.433037797480168e-07, 'completion_length': 79.58929061889648, 'rewards/only_full_func_accuracy_reward': 0.367559552192688, 'rewards/format_reward': 1.0, 'reward': 1.3675596714019775, 'reward_std': 0.08389320224523544, 'kl': 0.20947265625, 'epoch': 0.06} 6%|▌ | 243/4286 [1:22:10<16:41:56, 14.87s/it] 6%|▌ | 244/4286 [1:22:25<16:39:58, 14.84s/it] {'loss': 0.0084, 'grad_norm': 1.7556542987824424, 'learning_rate': 9.43070461969202e-07, 'completion_length': 82.58929061889648, 'rewards/only_full_func_accuracy_reward': 0.37113097310066223, 'rewards/format_reward': 1.0, 'reward': 1.3711310625076294, 'reward_std': 0.07407478801906109, 'kl': 0.2099609375, 'epoch': 0.06} 6%|▌ | 244/4286 [1:22:25<16:39:58, 14.84s/it] 6%|▌ | 245/4286 [1:22:41<17:00:05, 15.15s/it] {'loss': 0.009, 'grad_norm': 3.2130379984306527, 'learning_rate': 9.428371441903873e-07, 'completion_length': 83.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 1.0, 
'reward': 1.5000001788139343, 'reward_std': 0.164804145693779, 'kl': 0.22607421875, 'epoch': 0.06} 6%|▌ | 245/4286 [1:22:41<17:00:05, 15.15s/it] 6%|▌ | 246/4286 [1:22:55<16:36:15, 14.80s/it] {'loss': 0.0082, 'grad_norm': 2.3648644627158357, 'learning_rate': 9.426038264115726e-07, 'completion_length': 82.50000381469727, 'rewards/only_full_func_accuracy_reward': 0.3586309850215912, 'rewards/format_reward': 1.0, 'reward': 1.3586310744285583, 'reward_std': 0.12676307931542397, 'kl': 0.2060546875, 'epoch': 0.06} 6%|▌ | 246/4286 [1:22:55<16:36:15, 14.80s/it] 6%|▌ | 247/4286 [1:23:09<16:26:47, 14.66s/it] {'loss': 0.0091, 'grad_norm': 2.6749539653865364, 'learning_rate': 9.423705086327578e-07, 'completion_length': 83.6785774230957, 'rewards/only_full_func_accuracy_reward': 0.3809524178504944, 'rewards/format_reward': 1.0, 'reward': 1.3809524178504944, 'reward_std': 0.1460261568427086, 'kl': 0.2265625, 'epoch': 0.06} 6%|▌ | 247/4286 [1:23:09<16:26:47, 14.66s/it] 6%|▌ | 248/4286 [1:23:25<16:47:49, 14.98s/it] {'loss': 0.008, 'grad_norm': 3.120001594524782, 'learning_rate': 9.42137190853943e-07, 'completion_length': 85.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464287161827087, 'reward_std': 0.0911499634385109, 'kl': 0.201171875, 'epoch': 0.06} 6%|▌ | 248/4286 [1:23:25<16:47:49, 14.98s/it] 6%|▌ | 249/4286 [1:23:40<16:44:54, 14.94s/it] {'loss': 0.0088, 'grad_norm': 3.149794420084624, 'learning_rate': 9.419038730751283e-07, 'completion_length': 85.01786422729492, 'rewards/only_full_func_accuracy_reward': 0.23469389230012894, 'rewards/format_reward': 1.0, 'reward': 1.234694004058838, 'reward_std': 0.0776070486754179, 'kl': 0.21923828125, 'epoch': 0.06} 6%|▌ | 249/4286 [1:23:40<16:44:54, 14.94s/it] 6%|▌ | 250/4286 [1:23:54<16:35:37, 14.80s/it] {'loss': 0.0096, 'grad_norm': 2.0061408833714425, 'learning_rate': 9.416705552963136e-07, 'completion_length': 80.57143020629883, 
'rewards/only_full_func_accuracy_reward': 0.351190522313118, 'rewards/format_reward': 1.0, 'reward': 1.3511906266212463, 'reward_std': 0.08071713522076607, 'kl': 0.23876953125, 'epoch': 0.06} 6%|▌ | 250/4286 [1:23:54<16:35:37, 14.80s/it] 6%|▌ | 251/4286 [1:24:08<16:27:22, 14.68s/it] {'loss': 0.0089, 'grad_norm': 4.028747557603103, 'learning_rate': 9.414372375174988e-07, 'completion_length': 90.00000381469727, 'rewards/only_full_func_accuracy_reward': 0.3792092055082321, 'rewards/format_reward': 1.0, 'reward': 1.3792092204093933, 'reward_std': 0.10132484510540962, 'kl': 0.22119140625, 'epoch': 0.06} 6%|▌ | 251/4286 [1:24:08<16:27:22, 14.68s/it] 6%|▌ | 252/4286 [1:24:26<17:23:32, 15.52s/it] {'loss': 0.0081, 'grad_norm': 5.476224087777597, 'learning_rate': 9.412039197386841e-07, 'completion_length': 100.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.3660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3660714626312256, 'reward_std': 0.13956067711114883, 'kl': 0.201171875, 'epoch': 0.06} 6%|▌ | 252/4286 [1:24:26<17:23:32, 15.52s/it] 6%|▌ | 253/4286 [1:24:42<17:25:57, 15.56s/it] {'loss': 0.0081, 'grad_norm': 3.446245849247654, 'learning_rate': 9.409706019598693e-07, 'completion_length': 102.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.2619047835469246, 'rewards/format_reward': 1.0, 'reward': 1.2619048953056335, 'reward_std': 0.1345040686428547, 'kl': 0.20166015625, 'epoch': 0.06} 6%|▌ | 253/4286 [1:24:42<17:25:57, 15.56s/it] 6%|▌ | 254/4286 [1:24:56<17:01:12, 15.20s/it] {'loss': 0.0078, 'grad_norm': 38.202750818909486, 'learning_rate': 9.407372841810545e-07, 'completion_length': 81.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.4568452462553978, 'rewards/format_reward': 1.0, 'reward': 1.4568453431129456, 'reward_std': 0.055773086845874786, 'kl': 0.19580078125, 'epoch': 0.06} 6%|▌ | 254/4286 [1:24:56<17:01:12, 15.20s/it] 6%|▌ | 255/4286 [1:25:11<16:57:55, 15.15s/it] {'loss': 0.0083, 'grad_norm': 2.8258538429833826, 
'learning_rate': 9.405039664022399e-07, 'completion_length': 94.91071701049805, 'rewards/only_full_func_accuracy_reward': 0.3657738268375397, 'rewards/format_reward': 1.0, 'reward': 1.3657739758491516, 'reward_std': 0.16037605702877045, 'kl': 0.2080078125, 'epoch': 0.06} 6%|▌ | 255/4286 [1:25:11<16:57:55, 15.15s/it] 6%|▌ | 256/4286 [1:25:27<17:16:24, 15.43s/it] {'loss': 0.0076, 'grad_norm': 2.9168696422549587, 'learning_rate': 9.402706486234251e-07, 'completion_length': 94.32143020629883, 'rewards/only_full_func_accuracy_reward': 0.5220238566398621, 'rewards/format_reward': 1.0, 'reward': 1.5220237970352173, 'reward_std': 0.11334584280848503, 'kl': 0.19091796875, 'epoch': 0.06} 6%|▌ | 256/4286 [1:25:27<17:16:24, 15.43s/it] 6%|▌ | 257/4286 [1:25:43<17:24:11, 15.55s/it] {'loss': 0.0075, 'grad_norm': 3.2193610103545813, 'learning_rate': 9.400373308446103e-07, 'completion_length': 101.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.35654762387275696, 'rewards/format_reward': 1.0, 'reward': 1.3565476536750793, 'reward_std': 0.14554591476917267, 'kl': 0.18798828125, 'epoch': 0.06} 6%|▌ | 257/4286 [1:25:43<17:24:11, 15.55s/it] 6%|▌ | 258/4286 [1:25:58<17:23:55, 15.55s/it] {'loss': 0.0074, 'grad_norm': 7.431835213421084, 'learning_rate': 9.398040130657957e-07, 'completion_length': 107.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.27738097310066223, 'rewards/format_reward': 1.0, 'reward': 1.2773810625076294, 'reward_std': 0.1260652206838131, 'kl': 0.185546875, 'epoch': 0.06} 6%|▌ | 258/4286 [1:25:58<17:23:55, 15.55s/it] 6%|▌ | 259/4286 [1:26:14<17:25:47, 15.58s/it] {'loss': 0.0078, 'grad_norm': 3.225926760203541, 'learning_rate': 9.395706952869809e-07, 'completion_length': 97.41072082519531, 'rewards/only_full_func_accuracy_reward': 0.19107144325971603, 'rewards/format_reward': 1.0, 'reward': 1.1910715103149414, 'reward_std': 0.11437977850437164, 'kl': 0.19580078125, 'epoch': 0.06} 6%|▌ | 259/4286 [1:26:14<17:25:47, 15.58s/it] 6%|▌ | 260/4286 
[1:26:31<17:46:04, 15.89s/it] {'loss': 0.0077, 'grad_norm': 2.336071202253219, 'learning_rate': 9.393373775081661e-07, 'completion_length': 102.58929061889648, 'rewards/only_full_func_accuracy_reward': 0.3288690745830536, 'rewards/format_reward': 1.0, 'reward': 1.3288690447807312, 'reward_std': 0.07019362598657608, 'kl': 0.19140625, 'epoch': 0.06} 6%|▌ | 260/4286 [1:26:31<17:46:04, 15.89s/it] 6%|▌ | 261/4286 [1:26:46<17:33:58, 15.71s/it] {'loss': 0.0079, 'grad_norm': 2.5876847020180316, 'learning_rate': 9.391040597293513e-07, 'completion_length': 104.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.35863097012043, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.34077388048172, 'reward_std': 0.13365232571959496, 'kl': 0.19677734375, 'epoch': 0.06} 6%|▌ | 261/4286 [1:26:46<17:33:58, 15.71s/it] 6%|▌ | 262/4286 [1:27:03<17:51:17, 15.97s/it] {'loss': 0.0073, 'grad_norm': 5.284109241830491, 'learning_rate': 9.388707419505366e-07, 'completion_length': 108.75000381469727, 'rewards/only_full_func_accuracy_reward': 0.40565477311611176, 'rewards/format_reward': 1.0, 'reward': 1.4056548476219177, 'reward_std': 0.13311771675944328, 'kl': 0.1826171875, 'epoch': 0.06} 6%|▌ | 262/4286 [1:27:03<17:51:17, 15.97s/it] 6%|▌ | 263/4286 [1:27:18<17:49:12, 15.95s/it] {'loss': 0.0072, 'grad_norm': 1.909410395961766, 'learning_rate': 9.386374241717219e-07, 'completion_length': 113.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.4062500298023224, 'rewards/format_reward': 1.0, 'reward': 1.4062501192092896, 'reward_std': 0.097799401730299, 'kl': 0.1806640625, 'epoch': 0.06} 6%|▌ | 263/4286 [1:27:18<17:49:12, 15.95s/it] 6%|▌ | 264/4286 [1:27:34<17:39:11, 15.80s/it] {'loss': 0.0071, 'grad_norm': 5.005471595091796, 'learning_rate': 9.384041063929071e-07, 'completion_length': 108.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.3288690745830536, 'rewards/format_reward': 1.0, 'reward': 1.328869104385376, 'reward_std': 0.11823263019323349, 'kl': 
0.17822265625, 'epoch': 0.06} 6%|▌ | 264/4286 [1:27:34<17:39:11, 15.80s/it] 6%|▌ | 265/4286 [1:27:51<17:54:53, 16.04s/it] {'loss': 0.0078, 'grad_norm': 2.198739481604858, 'learning_rate': 9.381707886140924e-07, 'completion_length': 104.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.3764881268143654, 'rewards/format_reward': 1.0, 'reward': 1.376488208770752, 'reward_std': 0.06869912147521973, 'kl': 0.19384765625, 'epoch': 0.06} 6%|▌ | 265/4286 [1:27:51<17:54:53, 16.04s/it] 6%|▌ | 266/4286 [1:28:12<19:49:15, 17.75s/it] {'loss': 0.0069, 'grad_norm': 2.247942349185811, 'learning_rate': 9.379374708352776e-07, 'completion_length': 128.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.22976192086935043, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2119048237800598, 'reward_std': 0.14041472226381302, 'kl': 0.17236328125, 'epoch': 0.06} 6%|▌ | 266/4286 [1:28:12<19:49:15, 17.75s/it] 6%|▌ | 267/4286 [1:28:29<19:27:34, 17.43s/it] {'loss': 0.0068, 'grad_norm': 2.241081864871931, 'learning_rate': 9.377041530564629e-07, 'completion_length': 119.0535774230957, 'rewards/only_full_func_accuracy_reward': 0.26994048804044724, 'rewards/format_reward': 1.0, 'reward': 1.2699406147003174, 'reward_std': 0.15107601881027222, 'kl': 0.1708984375, 'epoch': 0.06} 6%|▌ | 267/4286 [1:28:29<19:27:34, 17.43s/it] 6%|▋ | 268/4286 [1:28:46<19:11:30, 17.20s/it] {'loss': 0.0075, 'grad_norm': 2.658092236683964, 'learning_rate': 9.374708352776482e-07, 'completion_length': 113.76786422729492, 'rewards/only_full_func_accuracy_reward': 0.32648811489343643, 'rewards/format_reward': 1.0, 'reward': 1.326488196849823, 'reward_std': 0.11215385794639587, 'kl': 0.18603515625, 'epoch': 0.06} 6%|▋ | 268/4286 [1:28:46<19:11:30, 17.20s/it] 6%|▋ | 269/4286 [1:29:01<18:40:49, 16.74s/it] {'loss': 0.0075, 'grad_norm': 3.9835899971596382, 'learning_rate': 9.372375174988334e-07, 'completion_length': 103.12500381469727, 'rewards/only_full_func_accuracy_reward': 0.35982145369052887, 
'rewards/format_reward': 1.0, 'reward': 1.3598214983940125, 'reward_std': 0.12197801470756531, 'kl': 0.18701171875, 'epoch': 0.06} 6%|▋ | 269/4286 [1:29:01<18:40:49, 16.74s/it] 6%|▋ | 270/4286 [1:29:17<18:20:36, 16.44s/it] {'loss': 0.0085, 'grad_norm': 1.938053254156045, 'learning_rate': 9.370041997200186e-07, 'completion_length': 101.37500381469727, 'rewards/only_full_func_accuracy_reward': 0.333333358168602, 'rewards/format_reward': 1.0, 'reward': 1.3333334922790527, 'reward_std': 0.1121213324368, 'kl': 0.21240234375, 'epoch': 0.06} 6%|▋ | 270/4286 [1:29:17<18:20:36, 16.44s/it] 6%|▋ | 271/4286 [1:29:34<18:24:39, 16.51s/it] {'loss': 0.008, 'grad_norm': 2.6421635730798583, 'learning_rate': 9.367708819412039e-07, 'completion_length': 113.66072082519531, 'rewards/only_full_func_accuracy_reward': 0.26488097012043, 'rewards/format_reward': 1.0, 'reward': 1.2648810744285583, 'reward_std': 0.16052072495222092, 'kl': 0.20068359375, 'epoch': 0.06} 6%|▋ | 271/4286 [1:29:34<18:24:39, 16.51s/it] 6%|▋ | 272/4286 [1:29:51<18:49:36, 16.89s/it] {'loss': 0.008, 'grad_norm': 3.242908248580037, 'learning_rate': 9.365375641623892e-07, 'completion_length': 113.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4419643133878708, 'rewards/format_reward': 1.0, 'reward': 1.4419643878936768, 'reward_std': 0.1502898819744587, 'kl': 0.201171875, 'epoch': 0.06} 6%|▋ | 272/4286 [1:29:51<18:49:36, 16.89s/it] 6%|▋ | 273/4286 [1:30:08<18:37:25, 16.71s/it] {'loss': 0.0068, 'grad_norm': 3.5013455729202456, 'learning_rate': 9.363042463835744e-07, 'completion_length': 112.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.31309526413679123, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.295238196849823, 'reward_std': 0.22820492088794708, 'kl': 0.1708984375, 'epoch': 0.06} 6%|▋ | 273/4286 [1:30:08<18:37:25, 16.71s/it] 6%|▋ | 274/4286 [1:30:27<19:33:45, 17.55s/it] {'loss': 0.0075, 'grad_norm': 3.6748382772273622, 'learning_rate': 9.360709286047596e-07, 'completion_length': 
124.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.3295068144798279, 'rewards/format_reward': 1.0, 'reward': 1.3295069336891174, 'reward_std': 0.16229599714279175, 'kl': 0.1884765625, 'epoch': 0.06} 6%|▋ | 274/4286 [1:30:27<19:33:45, 17.55s/it] 6%|▋ | 275/4286 [1:30:44<19:08:07, 17.17s/it] {'loss': 0.0074, 'grad_norm': 1.9884516546646027, 'learning_rate': 9.35837610825945e-07, 'completion_length': 120.32143020629883, 'rewards/only_full_func_accuracy_reward': 0.303571455180645, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2857143878936768, 'reward_std': 0.2024153657257557, 'kl': 0.1845703125, 'epoch': 0.06} 6%|▋ | 275/4286 [1:30:44<19:08:07, 17.17s/it] 6%|▋ | 276/4286 [1:31:00<18:43:15, 16.81s/it] {'loss': 0.0071, 'grad_norm': 1.9306957703603775, 'learning_rate': 9.356042930471302e-07, 'completion_length': 107.66072082519531, 'rewards/only_full_func_accuracy_reward': 0.3241071552038193, 'rewards/format_reward': 1.0, 'reward': 1.3241072297096252, 'reward_std': 0.15375984460115433, 'kl': 0.17724609375, 'epoch': 0.06} 6%|▋ | 276/4286 [1:31:00<18:43:15, 16.81s/it] 6%|▋ | 277/4286 [1:31:19<19:40:20, 17.67s/it] {'loss': 0.0074, 'grad_norm': 2.4798850657613856, 'learning_rate': 9.353709752683154e-07, 'completion_length': 125.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.376488134264946, 'rewards/format_reward': 1.0, 'reward': 1.3764881491661072, 'reward_std': 0.15507282316684723, 'kl': 0.18505859375, 'epoch': 0.06} 6%|▋ | 277/4286 [1:31:19<19:40:20, 17.67s/it] 6%|▋ | 278/4286 [1:31:39<20:30:11, 18.42s/it] {'loss': 0.007, 'grad_norm': 2.1875837161997294, 'learning_rate': 9.351376574895007e-07, 'completion_length': 125.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.313839316368103, 'rewards/format_reward': 1.0, 'reward': 1.3138393759727478, 'reward_std': 0.04052562918514013, 'kl': 0.173828125, 'epoch': 0.06} 6%|▋ | 278/4286 [1:31:39<20:30:11, 18.42s/it] 7%|▋ | 279/4286 [1:31:57<20:09:37, 18.11s/it] {'loss': 0.0069, 
'grad_norm': 1.9722038911204536, 'learning_rate': 9.34904339710686e-07, 'completion_length': 118.0535774230957, 'rewards/only_full_func_accuracy_reward': 0.25182826071977615, 'rewards/format_reward': 1.0, 'reward': 1.251828372478485, 'reward_std': 0.14276038110256195, 'kl': 0.17333984375, 'epoch': 0.07} 7%|▋ | 279/4286 [1:31:57<20:09:37, 18.11s/it] 7%|▋ | 280/4286 [1:32:15<20:06:04, 18.06s/it] {'loss': 0.0069, 'grad_norm': 4.417727942824823, 'learning_rate': 9.346710219318712e-07, 'completion_length': 141.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.2723214477300644, 'rewards/format_reward': 1.0, 'reward': 1.2723215222358704, 'reward_std': 0.1316809505224228, 'kl': 0.17236328125, 'epoch': 0.07} 7%|▋ | 280/4286 [1:32:15<20:06:04, 18.06s/it] 7%|▋ | 281/4286 [1:32:34<20:22:03, 18.31s/it] {'loss': 0.0071, 'grad_norm': 1.5355502607226408, 'learning_rate': 9.344377041530565e-07, 'completion_length': 117.0535774230957, 'rewards/only_full_func_accuracy_reward': 0.2767857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2589287161827087, 'reward_std': 0.06127267889678478, 'kl': 0.177734375, 'epoch': 0.07} 7%|▋ | 281/4286 [1:32:34<20:22:03, 18.31s/it] 7%|▋ | 282/4286 [1:32:51<20:01:26, 18.00s/it] {'loss': 0.0071, 'grad_norm': 3.4271795113904706, 'learning_rate': 9.342043863742417e-07, 'completion_length': 119.78571701049805, 'rewards/only_full_func_accuracy_reward': 0.39523813128471375, 'rewards/format_reward': 1.0, 'reward': 1.3952381610870361, 'reward_std': 0.15379221737384796, 'kl': 0.17822265625, 'epoch': 0.07} 7%|▋ | 282/4286 [1:32:51<20:01:26, 18.00s/it] 7%|▋ | 283/4286 [1:33:10<20:25:56, 18.38s/it] {'loss': 0.0068, 'grad_norm': 2.228046587171446, 'learning_rate': 9.339710685954269e-07, 'completion_length': 134.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.3385416716337204, 'rewards/format_reward': 1.0, 'reward': 1.3385417461395264, 'reward_std': 0.14171681553125381, 'kl': 0.17041015625, 'epoch': 0.07} 7%|▋ | 283/4286 
[1:33:10<20:25:56, 18.38s/it] 7%|▋ | 284/4286 [1:33:28<20:25:39, 18.38s/it] {'loss': 0.007, 'grad_norm': 2.11581559755742, 'learning_rate': 9.337377508166122e-07, 'completion_length': 142.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.3261905014514923, 'rewards/format_reward': 1.0, 'reward': 1.3261905908584595, 'reward_std': 0.13542190939188004, 'kl': 0.1748046875, 'epoch': 0.07} 7%|▋ | 284/4286 [1:33:29<20:25:39, 18.38s/it] 7%|▋ | 285/4286 [1:33:47<20:23:52, 18.35s/it] {'loss': 0.0068, 'grad_norm': 1.9504968610193623, 'learning_rate': 9.335044330377975e-07, 'completion_length': 145.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.22797620296478271, 'rewards/format_reward': 1.0, 'reward': 1.2279762625694275, 'reward_std': 0.16823288053274155, 'kl': 0.16943359375, 'epoch': 0.07} 7%|▋ | 285/4286 [1:33:47<20:23:52, 18.35s/it] 7%|▋ | 286/4286 [1:34:04<19:57:14, 17.96s/it] {'loss': 0.0065, 'grad_norm': 2.3598259235821546, 'learning_rate': 9.332711152589827e-07, 'completion_length': 127.53572082519531, 'rewards/only_full_func_accuracy_reward': 0.3523809462785721, 'rewards/format_reward': 1.0, 'reward': 1.3523810505867004, 'reward_std': 0.09633363783359528, 'kl': 0.16357421875, 'epoch': 0.07} 7%|▋ | 286/4286 [1:34:04<19:57:14, 17.96s/it] 7%|▋ | 287/4286 [1:34:20<19:24:21, 17.47s/it] {'loss': 0.0069, 'grad_norm': 2.5815571441062692, 'learning_rate': 9.330377974801679e-07, 'completion_length': 120.9285774230957, 'rewards/only_full_func_accuracy_reward': 0.41428573429584503, 'rewards/format_reward': 1.0, 'reward': 1.4142857789993286, 'reward_std': 0.20064660161733627, 'kl': 0.17236328125, 'epoch': 0.07} 7%|▋ | 287/4286 [1:34:20<19:24:21, 17.47s/it] 7%|▋ | 288/4286 [1:34:42<20:51:24, 18.78s/it] {'loss': 0.0065, 'grad_norm': 5.726938497065, 'learning_rate': 9.328044797013533e-07, 'completion_length': 133.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.3288690820336342, 'rewards/format_reward': 1.0, 'reward': 1.328869104385376, 'reward_std': 
0.0875529982149601, 'kl': 0.16357421875, 'epoch': 0.07} 7%|▋ | 288/4286 [1:34:42<20:51:24, 18.78s/it] 7%|▋ | 289/4286 [1:35:01<21:03:34, 18.97s/it] {'loss': 0.007, 'grad_norm': 1.8642752488523047, 'learning_rate': 9.325711619225385e-07, 'completion_length': 141.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.2395833507180214, 'rewards/format_reward': 1.0, 'reward': 1.239583432674408, 'reward_std': 0.09345660731196404, 'kl': 0.17529296875, 'epoch': 0.07} 7%|▋ | 289/4286 [1:35:01<21:03:34, 18.97s/it] 7%|▋ | 290/4286 [1:35:20<21:04:19, 18.98s/it] {'loss': 0.0069, 'grad_norm': 1.8871480972385273, 'learning_rate': 9.323378441437237e-07, 'completion_length': 134.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.3232143074274063, 'rewards/format_reward': 1.0, 'reward': 1.3232144117355347, 'reward_std': 0.09449491649866104, 'kl': 0.17138671875, 'epoch': 0.07} 7%|▋ | 290/4286 [1:35:20<21:04:19, 18.98s/it] 7%|▋ | 291/4286 [1:35:37<20:24:22, 18.39s/it] {'loss': 0.0072, 'grad_norm': 1.9769976557625442, 'learning_rate': 9.32104526364909e-07, 'completion_length': 126.12500381469727, 'rewards/only_full_func_accuracy_reward': 0.34062500298023224, 'rewards/format_reward': 1.0, 'reward': 1.3406250476837158, 'reward_std': 0.11485203728079796, 'kl': 0.1787109375, 'epoch': 0.07} 7%|▋ | 291/4286 [1:35:37<20:24:22, 18.39s/it] 7%|▋ | 292/4286 [1:35:58<21:15:03, 19.15s/it] {'loss': 0.007, 'grad_norm': 3.393927138223175, 'learning_rate': 9.318712085860943e-07, 'completion_length': 135.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.341145858168602, 'rewards/format_reward': 1.0, 'reward': 1.341145932674408, 'reward_std': 0.21730441972613335, 'kl': 0.173828125, 'epoch': 0.07} 7%|▋ | 292/4286 [1:35:58<21:15:03, 19.15s/it] 7%|▋ | 293/4286 [1:36:18<21:18:12, 19.21s/it] {'loss': 0.0068, 'grad_norm': 1.5722336857922945, 'learning_rate': 9.316378908072795e-07, 'completion_length': 140.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.26398809999227524, 
'rewards/format_reward': 1.0, 'reward': 1.2639881372451782, 'reward_std': 0.06721000652760267, 'kl': 0.17041015625, 'epoch': 0.07} 7%|▋ | 293/4286 [1:36:18<21:18:12, 19.21s/it] 7%|▋ | 294/4286 [1:36:38<21:47:37, 19.65s/it] {'loss': 0.007, 'grad_norm': 2.0234040191804845, 'learning_rate': 9.314045730284647e-07, 'completion_length': 150.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.32397960126399994, 'rewards/format_reward': 1.0, 'reward': 1.323979675769806, 'reward_std': 0.14981158077716827, 'kl': 0.173828125, 'epoch': 0.07} 7%|▋ | 294/4286 [1:36:38<21:47:37, 19.65s/it] 7%|▋ | 295/4286 [1:36:56<21:02:11, 18.98s/it] {'loss': 0.0068, 'grad_norm': 1.5553781247030225, 'learning_rate': 9.3117125524965e-07, 'completion_length': 123.46429061889648, 'rewards/only_full_func_accuracy_reward': 0.2738095372915268, 'rewards/format_reward': 1.0, 'reward': 1.2738096117973328, 'reward_std': 0.1382250376045704, 'kl': 0.17138671875, 'epoch': 0.07} 7%|▋ | 295/4286 [1:36:56<21:02:11, 18.98s/it] 7%|▋ | 296/4286 [1:37:14<20:45:33, 18.73s/it] {'loss': 0.0072, 'grad_norm': 1.8776447787825576, 'learning_rate': 9.309379374708353e-07, 'completion_length': 115.60714340209961, 'rewards/only_full_func_accuracy_reward': 0.4598214626312256, 'rewards/format_reward': 1.0, 'reward': 1.4598214626312256, 'reward_std': 0.11136110872030258, 'kl': 0.1787109375, 'epoch': 0.07} 7%|▋ | 296/4286 [1:37:14<20:45:33, 18.73s/it] 7%|▋ | 297/4286 [1:37:33<21:01:13, 18.97s/it] {'loss': 0.007, 'grad_norm': 1.803288159135701, 'learning_rate': 9.307046196920205e-07, 'completion_length': 130.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.34583334624767303, 'rewards/format_reward': 1.0, 'reward': 1.345833420753479, 'reward_std': 0.1508803740143776, 'kl': 0.17578125, 'epoch': 0.07} 7%|▋ | 297/4286 [1:37:33<21:01:13, 18.97s/it] 7%|▋ | 298/4286 [1:37:51<20:30:57, 18.52s/it] {'loss': 0.0069, 'grad_norm': 1.7555496672503694, 'learning_rate': 9.304713019132058e-07, 'completion_length': 
117.96429061889648, 'rewards/only_full_func_accuracy_reward': 0.435119092464447, 'rewards/format_reward': 1.0, 'reward': 1.4351191520690918, 'reward_std': 0.10976788774132729, 'kl': 0.1728515625, 'epoch': 0.07}  7%|▋ | 298/4286 [1:37:51<20:30:57, 18.52s/it]
 7%|▋ | 299/4286 [1:38:11<20:54:10, 18.87s/it] {'loss': 0.0071, 'grad_norm': 1.450793733979375, 'learning_rate': 9.30237984134391e-07, 'completion_length': 126.71429061889648, 'rewards/only_full_func_accuracy_reward': 0.33519347012043, 'rewards/format_reward': 1.0, 'reward': 1.3351935744285583, 'reward_std': 0.06396410800516605, 'kl': 0.1787109375, 'epoch': 0.07}
 7%|▋ | 300/4286 [1:38:26<19:51:01, 17.93s/it] {'loss': 0.007, 'grad_norm': 2.540106941157381, 'learning_rate': 9.300046663555763e-07, 'completion_length': 112.83929061889648, 'rewards/only_full_func_accuracy_reward': 0.3318452537059784, 'rewards/format_reward': 1.0, 'reward': 1.3318453431129456, 'reward_std': 0.11387687176465988, 'kl': 0.17529296875, 'epoch': 0.07}
 7%|▋ | 301/4286 [1:42:00<84:45:23, 76.57s/it] {'loss': 0.0075, 'grad_norm': 2.2023367763739135, 'learning_rate': 9.297713485767616e-07, 'completion_length': 115.08929061889648, 'rewards/only_full_func_accuracy_reward': 0.3586309850215912, 'rewards/format_reward': 1.0, 'reward': 1.3586310744285583, 'reward_std': 0.10097679868340492, 'kl': 0.1884765625, 'epoch': 0.07}
 7%|▋ | 302/4286 [1:42:17<64:55:39, 58.67s/it] {'loss': 0.0076, 'grad_norm': 2.0364751154939746, 'learning_rate': 9.295380307979468e-07, 'completion_length': 113.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.3187500238418579, 'rewards/format_reward': 1.0, 'reward': 1.3187500834465027, 'reward_std': 0.11451170593500137, 'kl': 0.18896484375, 'epoch': 0.07}
 7%|▋ | 303/4286 [1:42:37<52:07:16, 47.11s/it] {'loss': 0.0075, 'grad_norm': 7.330398393343079, 'learning_rate': 9.29304713019132e-07, 'completion_length': 114.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.3556547909975052, 'rewards/format_reward': 1.0, 'reward': 1.3556548357009888, 'reward_std': 0.1735587641596794, 'kl': 0.18701171875, 'epoch': 0.07}
 7%|▋ | 304/4286 [1:42:54<42:17:48, 38.24s/it] {'loss': 0.0078, 'grad_norm': 2.9457678689710005, 'learning_rate': 9.290713952403174e-07, 'completion_length': 119.26786422729492, 'rewards/only_full_func_accuracy_reward': 0.35089288651943207, 'rewards/format_reward': 1.0, 'reward': 1.350892961025238, 'reward_std': 0.15269774943590164, 'kl': 0.19580078125, 'epoch': 0.07}
 7%|▋ | 305/4286 [1:43:13<35:42:39, 32.29s/it] {'loss': 0.0083, 'grad_norm': 1.8018039416270535, 'learning_rate': 9.288380774615026e-07, 'completion_length': 117.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 1.0, 'reward': 1.3482144474983215, 'reward_std': 0.10025273263454437, 'kl': 0.20703125, 'epoch': 0.07}
 7%|▋ | 306/4286 [1:43:32<31:23:38, 28.40s/it] {'loss': 0.008, 'grad_norm': 2.105875108379266, 'learning_rate': 9.286047596826878e-07, 'completion_length': 139.4285774230957, 'rewards/only_full_func_accuracy_reward': 0.2517857477068901, 'rewards/format_reward': 1.0, 'reward': 1.2517858147621155, 'reward_std': 0.12404271960258484, 'kl': 0.2001953125, 'epoch': 0.07}
 7%|▋ | 307/4286 [1:43:49<27:42:12, 25.06s/it] {'loss': 0.0077, 'grad_norm': 3.0922679232418333, 'learning_rate': 9.28371441903873e-07, 'completion_length': 124.9285774230957, 'rewards/only_full_func_accuracy_reward': 0.24315477907657623, 'rewards/format_reward': 1.0, 'reward': 1.2431548833847046, 'reward_std': 0.13376276940107346, 'kl': 0.1923828125, 'epoch': 0.07}
 7%|▋ | 308/4286 [1:44:09<25:50:45, 23.39s/it] {'loss': 0.0078, 'grad_norm': 2.0271140648350774, 'learning_rate': 9.281381241250583e-07, 'completion_length': 134.21428680419922, 'rewards/only_full_func_accuracy_reward': 0.3556547686457634, 'rewards/format_reward': 1.0, 'reward': 1.3556548953056335, 'reward_std': 0.18015910312533379, 'kl': 0.19384765625, 'epoch': 0.07}
 7%|▋ | 309/4286 [1:44:28<24:26:20, 22.12s/it] {'loss': 0.0076, 'grad_norm': 2.4040126826864077, 'learning_rate': 9.279048063462436e-07, 'completion_length': 134.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.3708333373069763, 'rewards/format_reward': 1.0, 'reward': 1.370833396911621, 'reward_std': 0.14571115374565125, 'kl': 0.18896484375, 'epoch': 0.07}
 7%|▋ | 310/4286 [1:44:50<24:16:01, 21.97s/it] {'loss': 0.0084, 'grad_norm': 4.527973063196093, 'learning_rate': 9.276714885674288e-07, 'completion_length': 145.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.341145858168602, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3232887983322144, 'reward_std': 0.165640726685524, 'kl': 0.208984375, 'epoch': 0.07}
 7%|▋ | 311/4286 [1:45:06<22:32:10, 20.41s/it] {'loss': 0.0074, 'grad_norm': 2.5551153077616333, 'learning_rate': 9.274381707886141e-07, 'completion_length': 116.03572082519531, 'rewards/only_full_func_accuracy_reward': 0.40476194024086, 'rewards/format_reward': 1.0, 'reward': 1.4047620296478271, 'reward_std': 0.16575054824352264, 'kl': 0.185546875, 'epoch': 0.07}
 7%|▋ | 312/4286 [1:45:25<21:49:24, 19.77s/it] {'loss': 0.0077, 'grad_norm': 3.0965819657173683, 'learning_rate': 9.272048530097993e-07, 'completion_length': 120.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.3913690745830536, 'rewards/format_reward': 1.0, 'reward': 1.3913691639900208, 'reward_std': 0.10969169065356255, 'kl': 0.19189453125, 'epoch': 0.07}
 7%|▋ | 313/4286 [1:45:46<22:26:37, 20.34s/it] {'loss': 0.0082, 'grad_norm': 2.0049027140105977, 'learning_rate': 9.269715352309846e-07, 'completion_length': 149.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.27266159653663635, 'rewards/format_reward': 1.0, 'reward': 1.2726616263389587, 'reward_std': 0.0727043803781271, 'kl': 0.2060546875, 'epoch': 0.07}
 7%|▋ | 314/4286 [1:46:04<21:30:36, 19.50s/it] {'loss': 0.0084, 'grad_norm': 1.8812311926627048, 'learning_rate': 9.267382174521699e-07, 'completion_length': 134.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.33052724599838257, 'rewards/format_reward': 1.0, 'reward': 1.3305273652076721, 'reward_std': 0.11562953889369965, 'kl': 0.20947265625, 'epoch': 0.07}
 7%|▋ | 315/4286 [1:46:21<20:41:13, 18.75s/it] {'loss': 0.0078, 'grad_norm': 1.8506151462017475, 'learning_rate': 9.265048996733551e-07, 'completion_length': 126.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.322916716337204, 'rewards/format_reward': 1.0, 'reward': 1.3229167461395264, 'reward_std': 0.09702535718679428, 'kl': 0.19384765625, 'epoch': 0.07}
 7%|▋ | 316/4286 [1:46:44<22:01:38, 19.97s/it] {'loss': 0.0082, 'grad_norm': 2.431657874526749, 'learning_rate': 9.262715818945403e-07, 'completion_length': 153.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.28630954772233963, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2684524059295654, 'reward_std': 0.15505098551511765, 'kl': 0.2041015625, 'epoch': 0.07}
 7%|▋ | 317/4286 [1:47:02<21:32:23, 19.54s/it] {'loss': 0.0079, 'grad_norm': 4.421444221329356, 'learning_rate': 9.260382641157256e-07, 'completion_length': 140.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.29285719245672226, 'rewards/format_reward': 1.0, 'reward': 1.2928572297096252, 'reward_std': 0.17575281858444214, 'kl': 0.19873046875, 'epoch': 0.07}
 7%|▋ | 318/4286 [1:47:21<21:14:56, 19.28s/it] {'loss': 0.0081, 'grad_norm': 2.0240057253983528, 'learning_rate': 9.258049463369109e-07, 'completion_length': 141.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.3125000223517418, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.13947414979338646, 'kl': 0.203125, 'epoch': 0.07}
 7%|▋ | 319/4286 [1:47:42<21:55:40, 19.90s/it] {'loss': 0.0082, 'grad_norm': 3.1965512133929486, 'learning_rate': 9.255716285580961e-07, 'completion_length': 163.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.22784866392612457, 'rewards/format_reward': 1.0, 'reward': 1.2278487086296082, 'reward_std': 0.13174151256680489, 'kl': 0.205078125, 'epoch': 0.07}
 7%|▋ | 320/4286 [1:48:00<21:14:33, 19.28s/it] {'loss': 0.0077, 'grad_norm': 2.0192588624374053, 'learning_rate': 9.253383107792813e-07, 'completion_length': 126.08929061889648, 'rewards/only_full_func_accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.16732743382453918, 'kl': 0.19287109375, 'epoch': 0.07}
 7%|▋ | 321/4286 [1:48:17<20:20:47, 18.47s/it] {'loss': 0.0076, 'grad_norm': 2.037833306429027, 'learning_rate': 9.251049930004667e-07, 'completion_length': 125.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.3961309939622879, 'rewards/format_reward': 1.0, 'reward': 1.3961310386657715, 'reward_std': 0.13810036703944206, 'kl': 0.19091796875, 'epoch': 0.07}
 8%|▊ | 322/4286 [1:48:34<19:58:51, 18.15s/it] {'loss': 0.0076, 'grad_norm': 2.189422580183614, 'learning_rate': 9.248716752216519e-07, 'completion_length': 128.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.3794643133878708, 'rewards/format_reward': 1.0, 'reward': 1.379464328289032, 'reward_std': 0.12997262552380562, 'kl': 0.18994140625, 'epoch': 0.08}
 8%|▊ | 323/4286 [1:48:50<19:13:24, 17.46s/it] {'loss': 0.0077, 'grad_norm': 2.195232494608868, 'learning_rate': 9.246383574428371e-07, 'completion_length': 108.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5065476894378662, 'rewards/format_reward': 1.0, 'reward': 1.5065476894378662, 'reward_std': 0.06090877763926983, 'kl': 0.1923828125, 'epoch': 0.08}
 8%|▊ | 324/4286 [1:49:08<19:27:33, 17.68s/it] {'loss': 0.0083, 'grad_norm': 1.9902109904391352, 'learning_rate': 9.244050396640224e-07, 'completion_length': 139.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.2187500149011612, 'rewards/format_reward': 1.0, 'reward': 1.2187501192092896, 'reward_std': 0.09342947974801064, 'kl': 0.2080078125, 'epoch': 0.08}
 8%|▊ | 325/4286 [1:49:27<19:46:14, 17.97s/it] {'loss': 0.0085, 'grad_norm': 2.141047021044218, 'learning_rate': 9.241717218852077e-07, 'completion_length': 130.23215103149414, 'rewards/only_full_func_accuracy_reward': 0.3437500298023224, 'rewards/format_reward': 1.0, 'reward': 1.3437500596046448, 'reward_std': 0.15781201422214508, 'kl': 0.2109375, 'epoch': 0.08}
 8%|▊ | 326/4286 [1:49:44<19:22:54, 17.62s/it] {'loss': 0.0086, 'grad_norm': 4.094060159712281, 'learning_rate': 9.239384041063929e-07, 'completion_length': 120.01786422729492, 'rewards/only_full_func_accuracy_reward': 0.3511904925107956, 'rewards/format_reward': 1.0, 'reward': 1.3511906266212463, 'reward_std': 0.19513485580682755, 'kl': 0.21533203125, 'epoch': 0.08}
 8%|▊ | 327/4286 [1:50:00<19:01:52, 17.31s/it] {'loss': 0.0085, 'grad_norm': 2.2255850487961064, 'learning_rate': 9.237050863275782e-07, 'completion_length': 118.26786422729492, 'rewards/only_full_func_accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 1.0, 'reward': 1.348214328289032, 'reward_std': 0.17452684044837952, 'kl': 0.212890625, 'epoch': 0.08}
 8%|▊ | 328/4286 [1:50:17<18:53:32, 17.18s/it] {'loss': 0.0082, 'grad_norm': 2.449038711216012, 'learning_rate': 9.234717685487634e-07, 'completion_length': 117.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.3116071820259094, 'rewards/format_reward': 1.0, 'reward': 1.3116072416305542, 'reward_std': 0.09107143431901932, 'kl': 0.20458984375, 'epoch': 0.08}
 8%|▊ | 329/4286 [1:50:33<18:36:27, 16.93s/it] {'loss': 0.0081, 'grad_norm': 3.182613532651514, 'learning_rate': 9.232384507699487e-07, 'completion_length': 108.08929061889648, 'rewards/only_full_func_accuracy_reward': 0.3839286118745804, 'rewards/format_reward': 1.0, 'reward': 1.383928656578064, 'reward_std': 0.13215277343988419, 'kl': 0.2021484375, 'epoch': 0.08}
 8%|▊ | 330/4286 [1:50:49<18:04:19, 16.45s/it] {'loss': 0.0091, 'grad_norm': 2.174678798572489, 'learning_rate': 9.230051329911339e-07, 'completion_length': 97.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.13270125538110733, 'kl': 0.2265625, 'epoch': 0.08}
 8%|▊ | 331/4286 [1:51:04<17:33:29, 15.98s/it] {'loss': 0.0082, 'grad_norm': 2.4178291466389323, 'learning_rate': 9.227718152123192e-07, 'completion_length': 99.16071701049805, 'rewards/only_full_func_accuracy_reward': 0.4791667312383652, 'rewards/format_reward': 1.0, 'reward': 1.4791668057441711, 'reward_std': 0.09800061210989952, 'kl': 0.2041015625, 'epoch': 0.08}
 8%|▊ | 332/4286 [1:51:19<17:17:46, 15.75s/it] {'loss': 0.01, 'grad_norm': 4.170285644870247, 'learning_rate': 9.225384974335044e-07, 'completion_length': 97.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.3229166865348816, 'rewards/format_reward': 1.0, 'reward': 1.3229167461395264, 'reward_std': 0.09493745118379593, 'kl': 0.25146484375, 'epoch': 0.08}
 8%|▊ | 333/4286 [1:51:39<18:48:10, 17.12s/it] {'loss': 0.0094, 'grad_norm': 20.38507277253876, 'learning_rate': 9.223051796546896e-07, 'completion_length': 112.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.2395833507180214, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2217262983322144, 'reward_std': 0.17084968835115433, 'kl': 0.2333984375, 'epoch': 0.08}
 8%|▊ | 334/4286 [1:52:03<20:58:31, 19.11s/it] {'loss': 0.0089, 'grad_norm': 2.2032336022566676, 'learning_rate': 9.22071861875875e-07, 'completion_length': 112.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.3943452537059784, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3586310744285583, 'reward_std': 0.20800486207008362, 'kl': 0.22314453125, 'epoch': 0.08}
 8%|▊ | 335/4286 [1:52:24<21:30:28, 19.60s/it] {'loss': 0.0103, 'grad_norm': 2.2972593424562318, 'learning_rate': 9.218385440970602e-07, 'completion_length': 105.91072082519531, 'rewards/only_full_func_accuracy_reward': 0.3413690775632858, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3056548237800598, 'reward_std': 0.16484110057353973, 'kl': 0.2587890625, 'epoch': 0.08}
 8%|▊ | 336/4286 [1:52:43<21:32:50, 19.64s/it] {'loss': 0.0102, 'grad_norm': 2.8681067518744454, 'learning_rate': 9.216052263182454e-07, 'completion_length': 100.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3928571939468384, 'reward_std': 0.23780931532382965, 'kl': 0.2548828125, 'epoch': 0.08}
 8%|▊ | 337/4286 [1:53:02<21:17:38, 19.41s/it] {'loss': 0.0113, 'grad_norm': 2.9115052333640605, 'learning_rate': 9.213719085394307e-07, 'completion_length': 100.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.3110119253396988, 'rewards/format_reward': 1.0, 'reward': 1.3110119700431824, 'reward_std': 0.17051062732934952, 'kl': 0.283203125, 'epoch': 0.08}
 8%|▊ | 338/4286 [1:53:18<20:11:57, 18.42s/it] {'loss': 0.0107, 'grad_norm': 9.553058974130352, 'learning_rate': 9.21138590760616e-07, 'completion_length': 93.25000381469727, 'rewards/only_full_func_accuracy_reward': 0.3095238208770752, 'rewards/format_reward': 1.0, 'reward': 1.3095239400863647, 'reward_std': 0.14869336783885956, 'kl': 0.267578125, 'epoch': 0.08}
 8%|▊ | 339/4286 [1:53:44<22:35:06, 20.60s/it] {'loss': 0.0095, 'grad_norm': 2.2029952776792747, 'learning_rate': 9.209052729818012e-07, 'completion_length': 105.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3839287161827087, 'reward_std': 0.16439763456583023, 'kl': 0.23828125, 'epoch': 0.08}
 8%|▊ | 340/4286 [1:54:05<22:41:16, 20.70s/it] {'loss': 0.0131, 'grad_norm': 4.8200640910315435, 'learning_rate': 9.206719552029864e-07, 'completion_length': 95.89286422729492, 'rewards/only_full_func_accuracy_reward': 0.398809514939785, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3630953431129456, 'reward_std': 0.1945338323712349, 'kl': 0.328125, 'epoch': 0.08}
 8%|▊ | 341/4286 [1:54:24<22:06:46, 20.18s/it] {'loss': 0.0118, 'grad_norm': 1.6779205314671026, 'learning_rate': 9.204386374241717e-07, 'completion_length': 85.94643020629883, 'rewards/only_full_func_accuracy_reward': 0.290178582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2723215222358704, 'reward_std': 0.13626372814178467, 'kl': 0.2939453125, 'epoch': 0.08}
 8%|▊ | 342/4286 [1:54:43<21:48:14, 19.90s/it] {'loss': 0.0129, 'grad_norm': 3.1643037708056037, 'learning_rate': 9.20205319645357e-07, 'completion_length': 86.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.3630952537059784, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3273810744285583, 'reward_std': 0.16244813054800034, 'kl': 0.322265625, 'epoch': 0.08}
[2025-03-02 07:02:13,525] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 8%|▊ | 343/4286 [1:54:58<20:01:16, 18.28s/it] {'loss': 0.0124, 'grad_norm': 1.5918575981750225, 'learning_rate': 9.199720018665422e-07, 'completion_length': 75.82143020629883, 'rewards/only_full_func_accuracy_reward': 0.3541667014360428, 'rewards/format_reward': 1.0, 'reward': 1.3541667461395264, 'reward_std': 0.02221459336578846, 'kl': 0.3115234375, 'epoch': 0.08}
 8%|▊ | 344/4286 [1:55:20<21:21:25, 19.50s/it] {'loss': 0.0143, 'grad_norm': 4.066455539634117, 'learning_rate': 9.197386840877275e-07, 'completion_length': 106.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.2366071566939354, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2008929252624512, 'reward_std': 0.20811966061592102, 'kl': 0.3564453125, 'epoch': 0.08}
 8%|▊ | 345/4286 [1:55:45<23:07:48, 21.13s/it] {'loss': 0.0166, 'grad_norm': 3.6921753215260416, 'learning_rate': 9.195053663089127e-07, 'completion_length': 110.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.3794643059372902, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3258929252624512, 'reward_std': 0.234774649143219, 'kl': 0.4150390625, 'epoch': 0.08}
 8%|▊ | 346/4286 [1:56:01<21:21:57, 19.52s/it] {'loss': 0.0176, 'grad_norm': 2.498422060479817, 'learning_rate': 9.19272048530098e-07, 'completion_length': 91.28571701049805, 'rewards/only_full_func_accuracy_reward': 0.2172619178891182, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.1815477013587952, 'reward_std': 0.15433455631136894, 'kl': 0.439453125, 'epoch': 0.08}
 8%|▊ | 347/4286 [1:56:16<20:06:16, 18.37s/it] {'loss': 0.0154, 'grad_norm': 4.914479525690038, 'learning_rate': 9.190387307512833e-07, 'completion_length': 82.41071701049805, 'rewards/only_full_func_accuracy_reward': 0.300595261156559, 'rewards/format_reward': 1.0, 'reward': 1.3005953431129456, 'reward_std': 0.1011904776096344, 'kl': 0.3837890625, 'epoch': 0.08}
 8%|▊ | 348/4286 [1:56:36<20:28:31, 18.72s/it] {'loss': 0.0174, 'grad_norm': 4.2426413307437505, 'learning_rate': 9.188054129724685e-07, 'completion_length': 103.66072082519531, 'rewards/only_full_func_accuracy_reward': 0.3407738357782364, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3229167461395264, 'reward_std': 0.12879088521003723, 'kl': 0.4345703125, 'epoch': 0.08}
 8%|▊ | 349/4286 [1:56:58<21:25:47, 19.60s/it] {'loss': 0.0185, 'grad_norm': 3.505751594409609, 'learning_rate': 9.185720951936537e-07, 'completion_length': 104.53571701049805, 'rewards/only_full_func_accuracy_reward': 0.346726194024086, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2931548357009888, 'reward_std': 0.23257070779800415, 'kl': 0.4619140625, 'epoch': 0.08}
 8%|▊ | 350/4286 [1:57:23<23:17:28, 21.30s/it] {'loss': 0.0199, 'grad_norm': 5.999657995277678, 'learning_rate': 9.183387774148391e-07, 'completion_length': 133.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.3883928954601288, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.2812501192092896, 'reward_std': 0.4131114184856415, 'kl': 0.498046875, 'epoch': 0.08}
 8%|▊ | 351/4286 [1:57:44<23:05:53, 21.13s/it] {'loss': 0.0211, 'grad_norm': 3.686204740509673, 'learning_rate': 9.181054596360243e-07, 'completion_length': 136.39286422729492, 'rewards/only_full_func_accuracy_reward': 0.3348214626312256, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.1741071939468384, 'reward_std': 0.38771361857652664, 'kl': 0.5283203125, 'epoch': 0.08}
 8%|▊ | 352/4286 [1:58:09<24:21:22, 22.29s/it] {'loss': 0.0184, 'grad_norm': 3.2256221278627955, 'learning_rate': 9.178721418572095e-07, 'completion_length': 128.89286422729492, 'rewards/only_full_func_accuracy_reward': 0.2946428805589676, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2053572535514832, 'reward_std': 0.2886257767677307, 'kl': 0.4599609375, 'epoch': 0.08}
 8%|▊ | 353/4286 [1:58:34<25:24:48, 23.26s/it] {'loss': 0.0181, 'grad_norm': 8.346558446047528, 'learning_rate': 9.176388240783947e-07, 'completion_length': 141.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.438988134264946, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3318453431129456, 'reward_std': 0.24032709747552872, 'kl': 0.451171875, 'epoch': 0.08}
 8%|▊ | 354/4286 [1:58:53<24:06:39, 22.08s/it] {'loss': 0.0147, 'grad_norm': 1.323767226114588, 'learning_rate': 9.1740550629958e-07, 'completion_length': 101.01786422729492, 'rewards/only_full_func_accuracy_reward': 0.5282738357782364, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5104167461395264, 'reward_std': 0.0446428582072258, 'kl': 0.3662109375, 'epoch': 0.08}
 8%|▊ | 355/4286 [1:59:14<23:41:01, 21.69s/it] {'loss': 0.0167, 'grad_norm': 5.919706723827099, 'learning_rate': 9.171721885207653e-07, 'completion_length': 141.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.4241071790456772, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.2991071939468384, 'reward_std': 0.18118683993816376, 'kl': 0.41796875, 'epoch': 0.08}
 8%|▊ | 356/4286 [1:59:33<22:47:45, 20.88s/it] {'loss': 0.0117, 'grad_norm': 2.0217326549665273, 'learning_rate': 9.169388707419505e-07, 'completion_length': 107.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.5119048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4940478205680847, 'reward_std': 0.0833333358168602, 'kl': 0.2919921875, 'epoch': 0.08}
 8%|▊ | 357/4286 [1:59:47<20:34:34, 18.85s/it] {'loss': 0.0148, 'grad_norm': 3.975423864720138, 'learning_rate': 9.167055529631358e-07, 'completion_length': 98.32143020629883, 'rewards/only_full_func_accuracy_reward': 0.3437500149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3258928656578064, 'reward_std': 0.10257172957062721, 'kl': 0.369140625, 'epoch': 0.08}
 8%|▊ | 358/4286 [2:00:13<22:41:44, 20.80s/it] {'loss': 0.0168, 'grad_norm': 2.9717752841900844, 'learning_rate': 9.16472235184321e-07, 'completion_length': 137.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.45089291036129, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3794644474983215, 'reward_std': 0.23639480769634247, 'kl': 0.4189453125, 'epoch': 0.08}
 8%|▊ | 359/4286 [2:00:40<24:43:27, 22.67s/it] {'loss': 0.0205, 'grad_norm': 4.106600760171872, 'learning_rate': 9.162389174055063e-07, 'completion_length': 168.37500381469727, 'rewards/only_full_func_accuracy_reward': 0.4345238357782364, 'rewards/format_reward': 0.8214285969734192, 'reward': 1.2559524774551392, 'reward_std': 0.3270365782082081, 'kl': 0.5126953125, 'epoch': 0.08}
[2025-03-02 07:08:21,133] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
 8%|▊ | 360/4286 [2:01:05<25:40:05, 23.54s/it] {'loss': 0.017, 'grad_norm': 4.691095929878654, 'learning_rate': 9.160055996266916e-07, 'completion_length': 161.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.3809524327516556, 'rewards/format_reward': 0.8571429252624512, 'reward': 1.2380953431129456, 'reward_std': 0.37522049248218536, 'kl': 0.4267578125, 'epoch': 0.08}
 8%|▊ | 361/4286 [2:01:32<26:51:42, 24.64s/it] {'loss': 0.0252, 'grad_norm': 6.985670241588428, 'learning_rate': 9.157722818478768e-07, 'completion_length': 200.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.2797619104385376, 'rewards/format_reward': 0.7142857313156128, 'reward': 0.9940477013587952, 'reward_std': 0.4560449868440628, 'kl': 0.630859375, 'epoch': 0.08}
 8%|▊ | 362/4286 [2:02:02<28:23:39, 26.05s/it] {'loss': 0.02, 'grad_norm': 10.269170576055442, 'learning_rate': 9.15538964069062e-07, 'completion_length': 213.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.1979166716337204, 'rewards/format_reward': 0.7142857611179352, 'reward': 0.9122024178504944, 'reward_std': 0.37853580713272095, 'kl': 0.5009765625, 'epoch': 0.08}
 8%|▊ | 363/4286 [2:02:28<28:27:16, 26.11s/it] {'loss': 0.0187, 'grad_norm': 4.989827372864943, 'learning_rate': 9.153056462902473e-07, 'completion_length': 163.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.290178582072258, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.129464328289032, 'reward_std': 0.29413771629333496, 'kl': 0.466796875, 'epoch': 0.08}
 8%|▊ | 364/4286 [2:02:55<28:40:28, 26.32s/it] {'loss': 0.021, 'grad_norm': 6.2307078176044834, 'learning_rate': 9.150723285114326e-07, 'completion_length': 184.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.3422619253396988, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.145833432674408, 'reward_std': 0.30909349024295807, 'kl': 0.525390625, 'epoch': 0.08}
 9%|▊ | 365/4286 [2:03:21<28:38:19, 26.29s/it] {'loss': 0.0274, 'grad_norm': 5.53701386944519, 'learning_rate': 9.148390107326178e-07, 'completion_length': 186.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.413690522313118, 'rewards/format_reward': 0.785714328289032, 'reward': 1.1994048953056335, 'reward_std': 0.3479972928762436, 'kl': 0.685546875, 'epoch': 0.09}
[2025-03-02 07:11:04,832] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
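The stage3 warnings in this log recommend adding `get_accelerator().empty_cache()` calls to the training loop so that all ranks flush their allocator caches at the same time. A minimal sketch of that idea follows; the loop structure, function names, and flush interval are hypothetical (not from the training script that produced this log), and the DeepSpeed import is made optional so the sketch stands alone:

```python
# Sketch only: flush the allocator cache on a fixed step schedule, as the
# stage3 warning suggests, so every rank flushes at the same step instead
# of staggering flushes under memory pressure.
try:
    from deepspeed.accelerator import get_accelerator

    def flush_cache():
        # Real DeepSpeed call named in the warning above.
        get_accelerator().empty_cache()
except ImportError:
    def flush_cache():
        # No-op fallback so the sketch runs without DeepSpeed installed.
        pass


def run_steps(num_steps, flush_every=50):
    """Drive a dummy loop, flushing caches every `flush_every` steps;
    returns the steps at which a flush happened (for illustration)."""
    flushed_at = []
    for step in range(1, num_steps + 1):
        # ... forward/backward/optimizer step would go here ...
        if step % flush_every == 0:
            flush_cache()
            flushed_at.append(step)
    return flushed_at
```

Because the schedule depends only on the step counter, every rank reaches `flush_cache()` at the same step, which is the synchronization the warning asks for.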
 9%|▊ | 366/4286 [2:03:49<29:08:25, 26.76s/it] {'loss': 0.0317, 'grad_norm': 5.484913570998536, 'learning_rate': 9.14605692953803e-07, 'completion_length': 168.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.2500000223517418, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.0892857909202576, 'reward_std': 0.26840340346097946, 'kl': 0.79296875, 'epoch': 0.09}
[2025-03-02 07:11:31,460] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
 9%|▊ | 367/4286 [2:04:16<29:05:20, 26.72s/it] {'loss': 0.0357, 'grad_norm': 4.872352178608753, 'learning_rate': 9.143723751749884e-07, 'completion_length': 188.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.4151785969734192, 'rewards/format_reward': 0.785714328289032, 'reward': 1.2008929252624512, 'reward_std': 0.24676980078220367, 'kl': 0.89453125, 'epoch': 0.09}
[2025-03-02 07:11:58,624] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
 9%|▊ | 368/4286 [2:04:43<29:13:35, 26.85s/it] {'loss': 0.0553, 'grad_norm': 4.417216432372594, 'learning_rate': 9.141390573961736e-07, 'completion_length': 201.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.165178582072258, 'rewards/format_reward': 0.7500000298023224, 'reward': 0.9151786267757416, 'reward_std': 0.2832159101963043, 'kl': 1.380859375, 'epoch': 0.09}
 9%|▊ | 369/4286 [2:05:04<27:15:22, 25.05s/it] {'loss': 0.0342, 'grad_norm': 6.636251467491661, 'learning_rate': 9.139057396173588e-07, 'completion_length': 142.66072463989258, 'rewards/only_full_func_accuracy_reward': 0.3660714626312256, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.2767858505249023, 'reward_std': 0.13672413863241673, 'kl': 0.853515625, 'epoch': 0.09}
 9%|▊ | 370/4286 [2:05:31<28:03:58, 25.80s/it] {'loss': 0.0763, 'grad_norm': 4.075166402102012, 'learning_rate': 9.136724218385441e-07, 'completion_length': 191.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.4479166716337204, 'rewards/format_reward': 0.7678571939468384, 'reward': 1.21577388048172, 'reward_std': 0.3035091385245323, 'kl': 1.91015625, 'epoch': 0.09}
[2025-03-02 07:13:12,440] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
 9%|▊ | 371/4286 [2:05:57<27:56:04, 25.69s/it] {'loss': 0.0477, 'grad_norm': 3.945392725149473, 'learning_rate': 9.134391040597294e-07, 'completion_length': 139.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.892857164144516, 'reward': 1.2142857909202576, 'reward_std': 0.2313547283411026, 'kl': 1.193359375, 'epoch': 0.09}
 9%|▊ | 372/4286 [2:06:22<27:50:28, 25.61s/it] {'loss': 0.0623, 'grad_norm': 4.550020780186362, 'learning_rate': 9.132057862809146e-07, 'completion_length': 142.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.4568452537059784, 'rewards/format_reward': 0.892857164144516, 'reward': 1.3497024774551392, 'reward_std': 0.20131254941225052, 'kl': 1.5546875, 'epoch': 0.09}
 9%|▊ | 373/4286 [2:06:48<27:54:52, 25.68s/it] {'loss': 0.031, 'grad_norm': 7.282744874269742, 'learning_rate': 9.129724685020999e-07, 'completion_length': 117.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4761905074119568, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4404762983322144, 'reward_std': 0.17335926741361618, 'kl': 0.775390625, 'epoch': 0.09}
 9%|▊ | 374/4286 [2:07:13<27:35:16, 25.39s/it] {'loss': 0.0293, 'grad_norm': 3.7534373392005973, 'learning_rate': 9.127391507232851e-07, 'completion_length': 113.76786422729492, 'rewards/only_full_func_accuracy_reward': 0.349702388048172, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.313988208770752, 'reward_std': 0.1160714365541935, 'kl': 0.732421875, 'epoch': 0.09}
 9%|▊ | 375/4286 [2:07:28<24:13:15, 22.29s/it] {'loss': 0.0137, 'grad_norm': 22.342691857333836, 'learning_rate': 9.125058329444704e-07, 'completion_length': 104.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.3898809850215912, 'rewards/format_reward': 1.0, 'reward': 1.3898810744285583, 'reward_std': 0.11752716451883316, 'kl': 0.3427734375, 'epoch': 0.09}
 9%|▉ | 376/4286 [2:07:47<23:18:12, 21.46s/it] {'loss': 0.0222, 'grad_norm': 6.732047348110063, 'learning_rate': 9.122725151656556e-07, 'completion_length': 112.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.4583333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4404762983322144, 'reward_std': 0.14024856686592102, 'kl': 0.5556640625, 'epoch': 0.09}
 9%|▉ | 377/4286 [2:08:01<20:59:00, 19.32s/it] {'loss': 0.0127, 'grad_norm': 0.31996826199103157, 'learning_rate': 9.120391973868409e-07, 'completion_length': 98.16072082519531, 'rewards/only_full_func_accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.0, 'kl': 0.3193359375, 'epoch': 0.09}
 9%|▉ | 378/4286 [2:08:16<19:25:32, 17.89s/it] {'loss': 0.0121, 'grad_norm': 6.760992876730913, 'learning_rate': 9.118058796080261e-07, 'completion_length': 106.44643020629883, 'rewards/only_full_func_accuracy_reward': 0.4211309850215912, 'rewards/format_reward': 1.0, 'reward': 1.4211310744285583, 'reward_std': 0.026785715483129025, 'kl': 0.3017578125, 'epoch': 0.09}
 9%|▉ | 379/4286 [2:08:30<18:17:35, 16.86s/it] {'loss': 0.0128, 'grad_norm': 2.0128034864923636, 'learning_rate': 9.115725618292113e-07, 'completion_length': 104.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642858505249023, 'reward_std': 0.020619653165340424, 'kl': 0.3193359375, 'epoch': 0.09}
 9%|▉ | 380/4286 [2:08:45<17:26:12, 16.07s/it] {'loss': 0.0122, 'grad_norm': 12.557761788408019, 'learning_rate': 9.113392440503967e-07, 'completion_length': 102.96428680419922, 'rewards/only_full_func_accuracy_reward': 0.2916666865348816, 'rewards/format_reward': 1.0, 'reward': 1.2916667461395264, 'reward_std': 0.0444291764870286, 'kl': 0.3046875, 'epoch': 0.09}
 9%|▉ | 381/4286 [2:08:59<16:44:15, 15.43s/it] {'loss': 0.0131, 'grad_norm': 1.8967380005542864, 'learning_rate': 9.111059262715819e-07, 'completion_length': 99.69643020629883, 'rewards/only_full_func_accuracy_reward': 0.2127976343035698, 'rewards/format_reward': 1.0, 'reward': 1.21279776096344, 'reward_std': 0.05838929861783981, 'kl': 0.326171875, 'epoch': 0.09}
 9%|▉ | 382/4286 [2:09:18<18:05:59, 16.69s/it] {'loss': 0.0239, 'grad_norm': 2.3433405101418043, 'learning_rate': 9.108726084927671e-07, 'completion_length': 115.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.2916666716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2738096117973328, 'reward_std': 0.08885835111141205, 'kl': 0.5986328125, 'epoch': 0.09}
 9%|▉ | 383/4286 [2:09:33<17:18:35, 15.97s/it] {'loss': 0.013, 'grad_norm': 1.6983933804842937, 'learning_rate': 9.106392907139524e-07, 'completion_length': 107.78572082519531, 'rewards/only_full_func_accuracy_reward': 0.3943452835083008, 'rewards/format_reward': 1.0, 'reward': 1.3943453431129456, 'reward_std': 0.0208333358168602, 'kl': 0.3251953125, 'epoch': 0.09}
 9%|▉ | 384/4286 [2:09:53<18:38:29, 17.20s/it] {'loss': 0.0324, 'grad_norm': 3.1711819325357378, 'learning_rate': 9.104059729351377e-07, 'completion_length': 117.51786422729492, 'rewards/only_full_func_accuracy_reward': 0.299107164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2812501192092896, 'reward_std': 0.09686608985066414, 'kl': 0.8076171875, 'epoch': 0.09}
 9%|▉ | 385/4286 [2:10:09<18:14:41, 16.84s/it] {'loss': 0.0128, 'grad_norm': 2.7971491441068745, 'learning_rate': 9.101726551563229e-07, 'completion_length': 118.44643783569336, 'rewards/only_full_func_accuracy_reward': 0.29226192831993103, 'rewards/format_reward': 1.0, 'reward': 1.2922620177268982, 'reward_std': 0.03690476343035698, 'kl': 0.3193359375, 'epoch': 0.09}
 9%|▉ | 386/4286 [2:10:25<18:09:27, 16.76s/it] {'loss': 0.0124, 'grad_norm': 2.371834904500304, 'learning_rate': 9.099393373775081e-07, 'completion_length': 113.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.2857143208384514, 'rewards/format_reward': 1.0, 'reward': 1.2857143878936768, 'reward_std': 0.026485062204301357, 'kl': 0.3095703125, 'epoch': 0.09}
 9%|▉ | 387/4286 [2:10:40<17:29:44, 16.15s/it] {'loss': 0.0123, 'grad_norm': 1.3596101892325112, 'learning_rate': 9.097060195986934e-07, 'completion_length': 111.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.3214287161827087, 'reward_std': 0.02816697023808956, 'kl': 0.30859375, 'epoch': 0.09}
 9%|▉ | 388/4286 [2:10:57<17:46:43, 16.42s/it] {'loss': 0.0114, 'grad_norm': 12.56982819727497, 'learning_rate': 9.094727018198787e-07, 'completion_length': 134.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.29672620445489883, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2788691520690918, 'reward_std': 0.11000501550734043, 'kl': 0.28515625, 'epoch': 0.09}
 9%|▉ | 389/4286 [2:11:17<18:50:16, 17.40s/it] {'loss': 0.0285, 'grad_norm': 1.7834391599656756, 'learning_rate': 9.092393840410639e-07, 'completion_length': 126.82143020629883, 'rewards/only_full_func_accuracy_reward': 0.3943452835083008,
'rewards/format_reward': 0.9821428656578064, 'reward': 1.376488208770752, 'reward_std': 0.06584258377552032, 'kl': 0.7158203125, 'epoch': 0.09} 9%|▉ | 389/4286 [2:11:17<18:50:16, 17.40s/it] 9%|▉ | 390/4286 [2:11:32<18:18:15, 16.91s/it] {'loss': 0.0106, 'grad_norm': 4.699551111025253, 'learning_rate': 9.090060662622492e-07, 'completion_length': 123.64286422729492, 'rewards/only_full_func_accuracy_reward': 0.3005952686071396, 'rewards/format_reward': 1.0, 'reward': 1.3005953431129456, 'reward_std': 0.05116046126931906, 'kl': 0.2646484375, 'epoch': 0.09} 9%|▉ | 390/4286 [2:11:32<18:18:15, 16.91s/it] 9%|▉ | 391/4286 [2:11:48<17:46:03, 16.42s/it] {'loss': 0.0118, 'grad_norm': 3.1282657247111594, 'learning_rate': 9.087727484834344e-07, 'completion_length': 108.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.4851190894842148, 'rewards/format_reward': 1.0, 'reward': 1.4851191639900208, 'reward_std': 0.07167531177401543, 'kl': 0.2958984375, 'epoch': 0.09} 9%|▉ | 391/4286 [2:11:48<17:46:03, 16.42s/it] 9%|▉ | 392/4286 [2:12:09<19:18:37, 17.85s/it] {'loss': 0.0258, 'grad_norm': 3.327732437719911, 'learning_rate': 9.085394307046197e-07, 'completion_length': 149.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.2056547850370407, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.1877976655960083, 'reward_std': 0.08139113895595074, 'kl': 0.6455078125, 'epoch': 0.09} 9%|▉ | 392/4286 [2:12:09<19:18:37, 17.85s/it] 9%|▉ | 393/4286 [2:12:30<20:17:22, 18.76s/it] {'loss': 0.0277, 'grad_norm': 3.2320852080334403, 'learning_rate': 9.08306112925805e-07, 'completion_length': 150.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.25208336114883423, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2342262864112854, 'reward_std': 0.10436927108094096, 'kl': 0.693359375, 'epoch': 0.09} 9%|▉ | 393/4286 [2:12:30<20:17:22, 18.76s/it] 9%|▉ | 394/4286 [2:12:45<19:07:46, 17.69s/it] {'loss': 0.0113, 'grad_norm': 3.090678099369862, 'learning_rate': 
9.080727951469902e-07, 'completion_length': 123.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.3389881104230881, 'rewards/format_reward': 1.0, 'reward': 1.338988184928894, 'reward_std': 0.047872394323349, 'kl': 0.2822265625, 'epoch': 0.09} 9%|▉ | 394/4286 [2:12:45<19:07:46, 17.69s/it] 9%|▉ | 395/4286 [2:13:06<20:04:48, 18.58s/it] {'loss': 0.4614, 'grad_norm': 20144.948999610275, 'learning_rate': 9.078394773681754e-07, 'completion_length': 138.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.34384922683238983, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3259921073913574, 'reward_std': 0.07592364028096199, 'kl': 11.5791015625, 'epoch': 0.09} 9%|▉ | 395/4286 [2:13:06<20:04:48, 18.58s/it] 9%|▉ | 396/4286 [2:13:23<19:39:14, 18.19s/it] {'loss': 0.0113, 'grad_norm': 3.9458709857626295, 'learning_rate': 9.076061595893607e-07, 'completion_length': 153.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.27619049698114395, 'rewards/format_reward': 1.0, 'reward': 1.2761905193328857, 'reward_std': 0.09465227648615837, 'kl': 0.28125, 'epoch': 0.09} 9%|▉ | 396/4286 [2:13:23<19:39:14, 18.19s/it] 9%|▉ | 397/4286 [2:13:39<18:56:03, 17.53s/it] {'loss': 0.0125, 'grad_norm': 4.457198820324316, 'learning_rate': 9.07372841810546e-07, 'completion_length': 116.41072082519531, 'rewards/only_full_func_accuracy_reward': 0.3125000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.02565119881182909, 'kl': 0.3125, 'epoch': 0.09} 9%|▉ | 397/4286 [2:13:39<18:56:03, 17.53s/it] 9%|▉ | 398/4286 [2:13:55<18:35:06, 17.21s/it] {'loss': 0.0127, 'grad_norm': 5.768304340057073, 'learning_rate': 9.071395240317312e-07, 'completion_length': 134.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.3633928745985031, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3276787400245667, 'reward_std': 0.17045167088508606, 'kl': 0.3173828125, 'epoch': 0.09} 9%|▉ | 398/4286 [2:13:55<18:35:06, 17.21s/it] 9%|▉ | 399/4286 
[2:14:12<18:21:04, 17.00s/it] {'loss': 0.0125, 'grad_norm': 2.467541680780666, 'learning_rate': 9.069062062529164e-07, 'completion_length': 130.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.22916670143604279, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.1577381491661072, 'reward_std': 0.14094287902116776, 'kl': 0.3115234375, 'epoch': 0.09} 9%|▉ | 399/4286 [2:14:12<18:21:04, 17.00s/it] 9%|▉ | 400/4286 [2:14:27<17:43:25, 16.42s/it] {'loss': 0.0129, 'grad_norm': 1.9573459259853614, 'learning_rate': 9.066728884741018e-07, 'completion_length': 117.83929061889648, 'rewards/only_full_func_accuracy_reward': 0.2485119178891182, 'rewards/format_reward': 1.0, 'reward': 1.2485119700431824, 'reward_std': 0.021754169836640358, 'kl': 0.322265625, 'epoch': 0.09} 9%|▉ | 400/4286 [2:14:27<17:43:25, 16.42s/it] 9%|▉ | 401/4286 [2:19:28<109:45:52, 101.71s/it] {'loss': 0.013, 'grad_norm': 2.921599647299378, 'learning_rate': 9.06439570695287e-07, 'completion_length': 122.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.2782738283276558, 'rewards/format_reward': 1.0, 'reward': 1.2782739400863647, 'reward_std': 0.050841979682445526, 'kl': 0.3251953125, 'epoch': 0.09} 9%|▉ | 401/4286 [2:19:28<109:45:52, 101.71s/it] 9%|▉ | 402/4286 [2:19:44<81:57:50, 75.97s/it] {'loss': 0.0129, 'grad_norm': 3.4141993779101023, 'learning_rate': 9.062062529164722e-07, 'completion_length': 124.07143783569336, 'rewards/only_full_func_accuracy_reward': 0.2633928805589676, 'rewards/format_reward': 1.0, 'reward': 1.263392984867096, 'reward_std': 0.07229499332606792, 'kl': 0.3232421875, 'epoch': 0.09} 9%|▉ | 402/4286 [2:19:44<81:57:50, 75.97s/it] 9%|▉ | 403/4286 [2:19:59<62:24:24, 57.86s/it] {'loss': 0.0136, 'grad_norm': 3.298473389781079, 'learning_rate': 9.059729351376575e-07, 'completion_length': 117.6785774230957, 'rewards/only_full_func_accuracy_reward': 0.3125000223517418, 'rewards/format_reward': 1.0, 'reward': 1.3125000596046448, 'reward_std': 0.06664376892149448, 'kl': 
0.3388671875, 'epoch': 0.09} 9%|▉ | 403/4286 [2:19:59<62:24:24, 57.86s/it] 9%|▉ | 404/4286 [2:20:14<48:36:13, 45.07s/it] {'loss': 0.0141, 'grad_norm': 2.110235421640654, 'learning_rate': 9.057396173588428e-07, 'completion_length': 114.33929061889648, 'rewards/only_full_func_accuracy_reward': 0.2946428805589676, 'rewards/format_reward': 1.0, 'reward': 1.294642984867096, 'reward_std': 0.0416666641831398, 'kl': 0.3515625, 'epoch': 0.09} 9%|▉ | 404/4286 [2:20:14<48:36:13, 45.07s/it] 9%|▉ | 405/4286 [2:20:30<39:05:09, 36.26s/it] {'loss': 0.0135, 'grad_norm': 2.5175964783355194, 'learning_rate': 9.05506299580028e-07, 'completion_length': 119.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.293154776096344, 'rewards/format_reward': 1.0, 'reward': 1.2931548357009888, 'reward_std': 0.06391431391239166, 'kl': 0.337890625, 'epoch': 0.09} 9%|▉ | 405/4286 [2:20:30<39:05:09, 36.26s/it] 9%|▉ | 406/4286 [2:20:45<32:13:24, 29.90s/it] {'loss': 0.0164, 'grad_norm': 4.715089870009922, 'learning_rate': 9.052729818012133e-07, 'completion_length': 116.5535774230957, 'rewards/only_full_func_accuracy_reward': 0.3943452537059784, 'rewards/format_reward': 1.0, 'reward': 1.3943453431129456, 'reward_std': 0.03709554020315409, 'kl': 0.4091796875, 'epoch': 0.09} 9%|▉ | 406/4286 [2:20:45<32:13:24, 29.90s/it] 9%|▉ | 407/4286 [2:21:00<27:21:28, 25.39s/it] {'loss': 0.0171, 'grad_norm': 2.8034090881698255, 'learning_rate': 9.050396640223985e-07, 'completion_length': 113.28571701049805, 'rewards/only_full_func_accuracy_reward': 0.4375000447034836, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.04853988066315651, 'kl': 0.4267578125, 'epoch': 0.09} 9%|▉ | 407/4286 [2:21:00<27:21:28, 25.39s/it] 10%|▉ | 408/4286 [2:21:15<24:07:17, 22.39s/it] {'loss': 0.0163, 'grad_norm': 12.854528965384945, 'learning_rate': 9.048063462435837e-07, 'completion_length': 121.64286422729492, 'rewards/only_full_func_accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 
'reward': 1.3750001192092896, 'reward_std': 0.04543448705226183, 'kl': 0.408203125, 'epoch': 0.1} 10%|▉ | 408/4286 [2:21:15<24:07:17, 22.39s/it] 10%|▉ | 409/4286 [2:21:31<21:46:51, 20.22s/it] {'loss': 0.0173, 'grad_norm': 2.739671372528119, 'learning_rate': 9.04573028464769e-07, 'completion_length': 115.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.2693452537059784, 'rewards/format_reward': 1.0, 'reward': 1.2693453431129456, 'reward_std': 0.0386904738843441, 'kl': 0.431640625, 'epoch': 0.1} 10%|▉ | 409/4286 [2:21:31<21:46:51, 20.22s/it] 10%|▉ | 410/4286 [2:21:46<20:08:24, 18.71s/it] {'loss': 0.0176, 'grad_norm': 4.332801489726913, 'learning_rate': 9.043397106859543e-07, 'completion_length': 116.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.3601190596818924, 'rewards/format_reward': 1.0, 'reward': 1.3601191639900208, 'reward_std': 0.029087798669934273, 'kl': 0.439453125, 'epoch': 0.1} 10%|▉ | 410/4286 [2:21:46<20:08:24, 18.71s/it] 10%|▉ | 411/4286 [2:22:00<18:49:19, 17.49s/it] {'loss': 0.0172, 'grad_norm': 7.849000472366841, 'learning_rate': 9.041063929071395e-07, 'completion_length': 112.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.3467262163758278, 'rewards/format_reward': 1.0, 'reward': 1.3467262983322144, 'reward_std': 0.050241902470588684, 'kl': 0.431640625, 'epoch': 0.1} 10%|▉ | 411/4286 [2:22:00<18:49:19, 17.49s/it] 10%|▉ | 412/4286 [2:22:14<17:43:18, 16.47s/it] {'loss': 0.0176, 'grad_norm': 2.2571580111588294, 'learning_rate': 9.038730751283247e-07, 'completion_length': 106.12500381469727, 'rewards/only_full_func_accuracy_reward': 0.3392857536673546, 'rewards/format_reward': 1.0, 'reward': 1.3392858505249023, 'reward_std': 0.03160357568413019, 'kl': 0.44140625, 'epoch': 0.1} 10%|▉ | 412/4286 [2:22:14<17:43:18, 16.47s/it] 10%|▉ | 413/4286 [2:22:29<17:04:23, 15.87s/it] {'loss': 0.018, 'grad_norm': 3.016919502017841, 'learning_rate': 9.036397573495101e-07, 'completion_length': 107.87500381469727, 
'rewards/only_full_func_accuracy_reward': 0.2619047835469246, 'rewards/format_reward': 1.0, 'reward': 1.2619048357009888, 'reward_std': 0.03436608985066414, 'kl': 0.451171875, 'epoch': 0.1} 10%|▉ | 413/4286 [2:22:29<17:04:23, 15.87s/it] 10%|▉ | 414/4286 [2:22:43<16:28:13, 15.31s/it] {'loss': 0.0179, 'grad_norm': 5.4489277464890336, 'learning_rate': 9.034064395706953e-07, 'completion_length': 104.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.4538690894842148, 'rewards/format_reward': 1.0, 'reward': 1.453869104385376, 'reward_std': 0.0267857164144516, 'kl': 0.4462890625, 'epoch': 0.1} 10%|▉ | 414/4286 [2:22:43<16:28:13, 15.31s/it] 10%|▉ | 415/4286 [2:22:57<16:03:34, 14.94s/it] {'loss': 0.0188, 'grad_norm': 164.6669155695162, 'learning_rate': 9.031731217918805e-07, 'completion_length': 106.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.2559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.2559524774551392, 'reward_std': 0.041913408786058426, 'kl': 0.4697265625, 'epoch': 0.1} 10%|▉ | 415/4286 [2:22:57<16:03:34, 14.94s/it] 10%|▉ | 416/4286 [2:23:11<15:49:16, 14.72s/it] {'loss': 0.0184, 'grad_norm': 4.3300852257860205, 'learning_rate': 9.029398040130658e-07, 'completion_length': 107.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.1636904925107956, 'rewards/format_reward': 1.0, 'reward': 1.1636905670166016, 'reward_std': 0.04602411389350891, 'kl': 0.4609375, 'epoch': 0.1} 10%|▉ | 416/4286 [2:23:11<15:49:16, 14.72s/it] 10%|▉ | 417/4286 [2:23:25<15:32:23, 14.46s/it] {'loss': 0.0184, 'grad_norm': 3.77862844594138, 'learning_rate': 9.027064862342511e-07, 'completion_length': 100.96429061889648, 'rewards/only_full_func_accuracy_reward': 0.4315476417541504, 'rewards/format_reward': 1.0, 'reward': 1.4315477013587952, 'reward_std': 0.04983500763773918, 'kl': 0.4599609375, 'epoch': 0.1} 10%|▉ | 417/4286 [2:23:25<15:32:23, 14.46s/it] 10%|▉ | 418/4286 [2:23:39<15:18:39, 14.25s/it] {'loss': 0.0191, 'grad_norm': 0.17952910701438793, 
'learning_rate': 9.024731684554363e-07, 'completion_length': 94.69643020629883, 'rewards/only_full_func_accuracy_reward': 0.45238097012043, 'rewards/format_reward': 1.0, 'reward': 1.4523810744285583, 'reward_std': 0.0, 'kl': 0.478515625, 'epoch': 0.1} 10%|▉ | 418/4286 [2:23:39<15:18:39, 14.25s/it][2025-03-02 07:31:08,181] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 10%|▉ | 419/4286 [2:23:52<15:02:55, 14.01s/it] {'loss': 0.0186, 'grad_norm': 3.755953779870657, 'learning_rate': 9.022398506766215e-07, 'completion_length': 93.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035715818405151, 'reward_std': 0.013746436685323715, 'kl': 0.4658203125, 'epoch': 0.1} 10%|▉ | 419/4286 [2:23:52<15:02:55, 14.01s/it] 10%|▉ | 420/4286 [2:24:06<14:56:08, 13.91s/it] {'loss': 0.0188, 'grad_norm': 2.2564390619282046, 'learning_rate': 9.020065328978068e-07, 'completion_length': 97.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.4389881193637848, 'rewards/format_reward': 1.0, 'reward': 1.438988208770752, 'reward_std': 0.044642859138548374, 'kl': 0.470703125, 'epoch': 0.1} 10%|▉ | 420/4286 [2:24:06<14:56:08, 13.91s/it] 10%|▉ | 421/4286 [2:24:20<14:57:29, 13.93s/it] {'loss': 0.0189, 'grad_norm': 1.5620101310509902, 'learning_rate': 9.017732151189921e-07, 'completion_length': 96.78572082519531, 'rewards/only_full_func_accuracy_reward': 0.2857143133878708, 'rewards/format_reward': 1.0, 'reward': 1.2857144474983215, 'reward_std': 0.023809529840946198, 'kl': 0.47265625, 'epoch': 0.1} 10%|▉ | 421/4286 [2:24:20<14:57:29, 
13.93s/it] 10%|▉ | 422/4286 [2:24:40<16:52:45, 15.73s/it] {'loss': 0.0325, 'grad_norm': 6.136506051170304, 'learning_rate': 9.015398973401773e-07, 'completion_length': 104.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.348214328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3125001192092896, 'reward_std': 0.07822278887033463, 'kl': 0.810546875, 'epoch': 0.1} 10%|▉ | 422/4286 [2:24:40<16:52:45, 15.73s/it] 10%|▉ | 423/4286 [2:24:54<16:12:01, 15.10s/it] {'loss': 0.0188, 'grad_norm': 6.377291395135638, 'learning_rate': 9.013065795613626e-07, 'completion_length': 95.33929061889648, 'rewards/only_full_func_accuracy_reward': 0.4181547909975052, 'rewards/format_reward': 1.0, 'reward': 1.4181548357009888, 'reward_std': 0.09248863533139229, 'kl': 0.470703125, 'epoch': 0.1} 10%|▉ | 423/4286 [2:24:54<16:12:01, 15.10s/it] 10%|▉ | 424/4286 [2:25:07<15:44:20, 14.67s/it] {'loss': 0.0187, 'grad_norm': 2.489498722614688, 'learning_rate': 9.010732617825478e-07, 'completion_length': 92.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.2991071566939354, 'rewards/format_reward': 1.0, 'reward': 1.2991071939468384, 'reward_std': 0.03709554113447666, 'kl': 0.4677734375, 'epoch': 0.1} 10%|▉ | 424/4286 [2:25:07<15:44:20, 14.67s/it] 10%|▉ | 425/4286 [2:25:22<15:49:16, 14.75s/it] {'loss': 0.0172, 'grad_norm': 1.401858825291691, 'learning_rate': 9.008399440037331e-07, 'completion_length': 101.44643020629883, 'rewards/only_full_func_accuracy_reward': 0.4970238506793976, 'rewards/format_reward': 1.0, 'reward': 1.4970239400863647, 'reward_std': 0.024890122935175896, 'kl': 0.4287109375, 'epoch': 0.1} 10%|▉ | 425/4286 [2:25:22<15:49:16, 14.75s/it] 10%|▉ | 426/4286 [2:25:42<17:32:59, 16.37s/it] {'loss': 0.0288, 'grad_norm': 2.4827390989115696, 'learning_rate': 9.006066262249184e-07, 'completion_length': 99.4285774230957, 'rewards/only_full_func_accuracy_reward': 0.2172619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.1994048953056335, 
'reward_std': 0.08793752267956734, 'kl': 0.71875, 'epoch': 0.1} 10%|▉ | 426/4286 [2:25:42<17:32:59, 16.37s/it] 10%|▉ | 427/4286 [2:26:03<18:52:57, 17.62s/it] {'loss': 0.0248, 'grad_norm': 1.346310484571028, 'learning_rate': 9.003733084461036e-07, 'completion_length': 104.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.357142984867096, 'reward_std': 0.0595238134264946, 'kl': 0.6171875, 'epoch': 0.1} 10%|▉ | 427/4286 [2:26:03<18:52:57, 17.62s/it] 10%|▉ | 428/4286 [2:26:22<19:26:16, 18.14s/it] {'loss': 0.0346, 'grad_norm': 14.230000025578809, 'learning_rate': 9.001399906672888e-07, 'completion_length': 104.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.4985119253396988, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4806548953056335, 'reward_std': 0.08257310651242733, 'kl': 0.8662109375, 'epoch': 0.1} 10%|▉ | 428/4286 [2:26:22<19:26:16, 18.14s/it] 10%|█ | 429/4286 [2:26:42<19:56:28, 18.61s/it] {'loss': 0.0222, 'grad_norm': 0.8480064909644085, 'learning_rate': 8.999066728884742e-07, 'completion_length': 98.58928680419922, 'rewards/only_full_func_accuracy_reward': 0.2559524029493332, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2380952835083008, 'reward_std': 0.0357142873108387, 'kl': 0.5546875, 'epoch': 0.1} 10%|█ | 429/4286 [2:26:42<19:56:28, 18.61s/it] 10%|█ | 430/4286 [2:27:01<20:09:32, 18.82s/it] {'loss': 0.0365, 'grad_norm': 2.264505327576709, 'learning_rate': 8.996733551096594e-07, 'completion_length': 103.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.5119047909975052, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4940477013587952, 'reward_std': 0.0714285746216774, 'kl': 0.9130859375, 'epoch': 0.1} 10%|█ | 430/4286 [2:27:01<20:09:32, 18.82s/it] 10%|█ | 431/4286 [2:27:21<20:23:11, 19.04s/it] {'loss': 0.0544, 'grad_norm': 4.927557858514773, 'learning_rate': 8.994400373308446e-07, 'completion_length': 115.37500381469727, 
'rewards/only_full_func_accuracy_reward': 0.4151786118745804, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3794644474983215, 'reward_std': 0.17700956761837006, 'kl': 1.359375, 'epoch': 0.1} 10%|█ | 431/4286 [2:27:21<20:23:11, 19.04s/it] 10%|█ | 432/4286 [2:27:41<20:37:39, 19.27s/it] {'loss': 0.1679, 'grad_norm': 13.095097687557466, 'learning_rate': 8.992067195520298e-07, 'completion_length': 150.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.3333333730697632, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.208333432674408, 'reward_std': 0.22257325984537601, 'kl': 4.1826171875, 'epoch': 0.1} 10%|█ | 432/4286 [2:27:41<20:37:39, 19.27s/it] 10%|█ | 433/4286 [2:28:07<23:01:41, 21.52s/it] {'loss': 0.153, 'grad_norm': 12.088784025649403, 'learning_rate': 8.989734017732151e-07, 'completion_length': 157.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.3169643133878708, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.1919643878936768, 'reward_std': 0.32110435515642166, 'kl': 3.828125, 'epoch': 0.1} 10%|█ | 433/4286 [2:28:07<23:01:41, 21.52s/it] 10%|█ | 434/4286 [2:28:34<24:40:55, 23.07s/it] {'loss': 0.1375, 'grad_norm': 9.625294209926665, 'learning_rate': 8.987400839944004e-07, 'completion_length': 142.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.3824404925107956, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.2752977013587952, 'reward_std': 0.26124921441078186, 'kl': 3.4375, 'epoch': 0.1} 10%|█ | 434/4286 [2:28:34<24:40:55, 23.07s/it] 10%|█ | 435/4286 [2:28:59<25:24:21, 23.75s/it] {'loss': 0.0687, 'grad_norm': 2.527919148457446, 'learning_rate': 8.985067662155856e-07, 'completion_length': 116.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.330357164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2767858505249023, 'reward_std': 0.1726190522313118, 'kl': 1.7109375, 'epoch': 0.1} 10%|█ | 435/4286 [2:28:59<25:24:21, 23.75s/it] 10%|█ | 436/4286 [2:29:19<23:56:32, 22.39s/it] 
{'loss': 0.0225, 'grad_norm': 2.636316689987954, 'learning_rate': 8.982734484367709e-07, 'completion_length': 106.87500381469727, 'rewards/only_full_func_accuracy_reward': 0.4836309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4657739400863647, 'reward_std': 0.098214291036129, 'kl': 0.5615234375, 'epoch': 0.1} 10%|█ | 436/4286 [2:29:19<23:56:32, 22.39s/it] 10%|█ | 437/4286 [2:29:43<24:35:37, 23.00s/it] {'loss': 0.0298, 'grad_norm': 7.807999291074625, 'learning_rate': 8.980401306579561e-07, 'completion_length': 119.26786422729492, 'rewards/only_full_func_accuracy_reward': 0.5163690447807312, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4806548357009888, 'reward_std': 0.1840103194117546, 'kl': 0.744140625, 'epoch': 0.1} 10%|█ | 437/4286 [2:29:43<24:35:37, 23.00s/it] 10%|█ | 438/4286 [2:30:03<23:35:02, 22.06s/it] {'loss': 0.0219, 'grad_norm': 4.065153270455995, 'learning_rate': 8.978068128791414e-07, 'completion_length': 113.3035774230957, 'rewards/only_full_func_accuracy_reward': 0.3645833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3467263579368591, 'reward_std': 0.12696255464106798, 'kl': 0.548828125, 'epoch': 0.1} 10%|█ | 438/4286 [2:30:03<23:35:02, 22.06s/it] 10%|█ | 439/4286 [2:30:22<22:39:43, 21.21s/it] {'loss': 0.0223, 'grad_norm': 3.6596963620428142, 'learning_rate': 8.975734951003267e-07, 'completion_length': 112.53572082519531, 'rewards/only_full_func_accuracy_reward': 0.270833358168602, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2529762983322144, 'reward_std': 0.14443638548254967, 'kl': 0.5576171875, 'epoch': 0.1} 10%|█ | 439/4286 [2:30:22<22:39:43, 21.21s/it] 10%|█ | 440/4286 [2:30:42<22:11:23, 20.77s/it] {'loss': 0.0323, 'grad_norm': 2.8474681520854284, 'learning_rate': 8.973401773215119e-07, 'completion_length': 105.89286422729492, 'rewards/only_full_func_accuracy_reward': 0.4315476566553116, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4136905670166016, 'reward_std': 
0.13488247245550156, 'kl': 0.810546875, 'epoch': 0.1} 10%|█ | 440/4286 [2:30:42<22:11:23, 20.77s/it] 10%|█ | 441/4286 [2:31:01<21:48:09, 20.41s/it] {'loss': 0.0429, 'grad_norm': 2.045064270353468, 'learning_rate': 8.971068595426971e-07, 'completion_length': 121.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.3392857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.285714328289032, 'reward_std': 0.1360500492155552, 'kl': 1.0751953125, 'epoch': 0.1} 10%|█ | 441/4286 [2:31:01<21:48:09, 20.41s/it] 10%|█ | 442/4286 [2:31:16<19:51:10, 18.59s/it] {'loss': 0.0124, 'grad_norm': 10.088741288524128, 'learning_rate': 8.968735417638824e-07, 'completion_length': 107.75000381469727, 'rewards/only_full_func_accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.12616758421063423, 'kl': 0.3095703125, 'epoch': 0.1} 10%|█ | 442/4286 [2:31:16<19:51:10, 18.59s/it] 10%|█ | 443/4286 [2:31:36<20:15:01, 18.97s/it] {'loss': 0.0336, 'grad_norm': 2.1325793481701107, 'learning_rate': 8.966402239850677e-07, 'completion_length': 120.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.3630952537059784, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.30952388048172, 'reward_std': 0.05952381528913975, 'kl': 0.83984375, 'epoch': 0.1} 10%|█ | 443/4286 [2:31:36<20:15:01, 18.97s/it] 10%|█ | 444/4286 [2:31:50<18:40:38, 17.50s/it] {'loss': 0.0119, 'grad_norm': 3.454704723849013, 'learning_rate': 8.964069062062529e-07, 'completion_length': 101.91071701049805, 'rewards/only_full_func_accuracy_reward': 0.3943452537059784, 'rewards/format_reward': 1.0, 'reward': 1.3943453431129456, 'reward_std': 0.038690478540956974, 'kl': 0.2958984375, 'epoch': 0.1} 10%|█ | 444/4286 [2:31:50<18:40:38, 17.50s/it] 10%|█ | 445/4286 [2:32:04<17:37:25, 16.52s/it] {'loss': 0.0123, 'grad_norm': 0.2286849019213758, 'learning_rate': 8.961735884274381e-07, 'completion_length': 101.33929061889648, 
'rewards/only_full_func_accuracy_reward': 0.4523809850215912, 'rewards/format_reward': 1.0, 'reward': 1.4523810744285583, 'reward_std': 0.0, 'kl': 0.306640625, 'epoch': 0.1} 10%|█ | 445/4286 [2:32:04<17:37:25, 16.52s/it] 10%|█ | 446/4286 [2:32:18<16:50:56, 15.80s/it] {'loss': 0.0127, 'grad_norm': 21.675354882401642, 'learning_rate': 8.959402706486235e-07, 'completion_length': 114.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.049460725858807564, 'kl': 0.31640625, 'epoch': 0.1} 10%|█ | 446/4286 [2:32:18<16:50:56, 15.80s/it] 10%|█ | 447/4286 [2:32:32<16:22:37, 15.36s/it] {'loss': 0.0119, 'grad_norm': 2.793518992764929, 'learning_rate': 8.957069528698087e-07, 'completion_length': 110.01786422729492, 'rewards/only_full_func_accuracy_reward': 0.352678619325161, 'rewards/format_reward': 1.0, 'reward': 1.3526787161827087, 'reward_std': 0.043294661678373814, 'kl': 0.2958984375, 'epoch': 0.1} 10%|█ | 447/4286 [2:32:32<16:22:37, 15.36s/it] 10%|█ | 448/4286 [2:32:47<16:01:57, 15.04s/it] {'loss': 0.0134, 'grad_norm': 2.0766047328235335, 'learning_rate': 8.954736350909939e-07, 'completion_length': 117.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.3556548058986664, 'rewards/format_reward': 1.0, 'reward': 1.3556548953056335, 'reward_std': 0.029548224061727524, 'kl': 0.333984375, 'epoch': 0.1} 10%|█ | 448/4286 [2:32:47<16:01:57, 15.04s/it] 10%|█ | 449/4286 [2:33:01<15:45:51, 14.79s/it] {'loss': 0.0131, 'grad_norm': 3.225300570004697, 'learning_rate': 8.952403173121792e-07, 'completion_length': 112.1785774230957, 'rewards/only_full_func_accuracy_reward': 0.3675595670938492, 'rewards/format_reward': 1.0, 'reward': 1.3675596117973328, 'reward_std': 0.09274118952453136, 'kl': 0.326171875, 'epoch': 0.1} 10%|█ | 449/4286 [2:33:01<15:45:51, 14.79s/it] 10%|█ | 450/4286 [2:33:20<17:08:47, 16.09s/it] {'loss': 0.02, 'grad_norm': 4.982050730645506, 'learning_rate': 
8.950069995333645e-07, 'completion_length': 121.98215103149414, 'rewards/only_full_func_accuracy_reward': 0.4181548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.40029776096344, 'reward_std': 0.09768378920853138, 'kl': 0.4990234375, 'epoch': 0.1} 10%|█ | 450/4286 [2:33:20<17:08:47, 16.09s/it]
11%|█ | 451/4286 [2:33:40<18:21:53, 17.24s/it] {'loss': 0.0232, 'grad_norm': 3.6678609550639543, 'learning_rate': 8.947736817545497e-07, 'completion_length': 120.16072082519531, 'rewards/only_full_func_accuracy_reward': 0.3601190596818924, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.342262089252472, 'reward_std': 0.0773809514939785, 'kl': 0.5791015625, 'epoch': 0.11}
11%|█ | 452/4286 [2:33:54<17:25:04, 16.35s/it] {'loss': 0.0145, 'grad_norm': 2.6095581594887585, 'learning_rate': 8.94540363975735e-07, 'completion_length': 116.73215103149414, 'rewards/only_full_func_accuracy_reward': 0.2455357238650322, 'rewards/format_reward': 1.0, 'reward': 1.2455358505249023, 'reward_std': 0.04145298898220062, 'kl': 0.361328125, 'epoch': 0.11}
11%|█ | 453/4286 [2:34:08<16:45:54, 15.75s/it] {'loss': 0.0135, 'grad_norm': 1.8171300535565555, 'learning_rate': 8.943070461969202e-07, 'completion_length': 111.41071701049805, 'rewards/only_full_func_accuracy_reward': 0.401785746216774, 'rewards/format_reward': 1.0, 'reward': 1.4017858505249023, 'reward_std': 0.05357143096625805, 'kl': 0.3359375, 'epoch': 0.11}
11%|█ | 454/4286 [2:34:25<17:05:30, 16.06s/it] {'loss': 0.0291, 'grad_norm': 2.5155349925324275, 'learning_rate': 8.940737284181055e-07, 'completion_length': 118.83929061889648, 'rewards/only_full_func_accuracy_reward': 0.5074405074119568, 'rewards/format_reward': 1.0, 'reward': 1.5074405670166016, 'reward_std': 0.0803571492433548, 'kl': 0.7236328125, 'epoch': 0.11}
11%|█ | 455/4286 [2:34:42<17:27:23, 16.40s/it] {'loss': 0.0276, 'grad_norm': 2.4532728782542477, 'learning_rate': 8.938404106392907e-07, 'completion_length': 116.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.430059552192688, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3764881491661072, 'reward_std': 0.22086887806653976, 'kl': 0.69140625, 'epoch': 0.11}
11%|█ | 456/4286 [2:35:00<17:55:16, 16.85s/it] {'loss': 0.054, 'grad_norm': 3.638822000082642, 'learning_rate': 8.93607092860476e-07, 'completion_length': 117.35715103149414, 'rewards/only_full_func_accuracy_reward': 0.3511904925107956, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2797619700431824, 'reward_std': 0.15458129346370697, 'kl': 1.353515625, 'epoch': 0.11}
11%|█ | 457/4286 [2:35:25<20:31:23, 19.30s/it] {'loss': 0.1194, 'grad_norm': 6.541006305989787, 'learning_rate': 8.933737750816612e-07, 'completion_length': 139.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.4538690745830536, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.328869104385376, 'reward_std': 0.41450078785419464, 'kl': 2.984375, 'epoch': 0.11}
11%|█ | 458/4286 [2:35:45<20:39:30, 19.43s/it] {'loss': 0.082, 'grad_norm': 7.788543478372078, 'learning_rate': 8.931404573028464e-07, 'completion_length': 127.16071701049805, 'rewards/only_full_func_accuracy_reward': 0.339285746216774, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.196428656578064, 'reward_std': 0.2864900678396225, 'kl': 2.052734375, 'epoch': 0.11}
11%|█ | 459/4286 [2:36:10<22:21:35, 21.03s/it] {'loss': 0.1049, 'grad_norm': 4.6295530174487, 'learning_rate': 8.929071395240318e-07, 'completion_length': 133.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.4345238357782364, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.2738096714019775, 'reward_std': 0.3572751134634018, 'kl': 2.6171875, 'epoch': 0.11}
11%|█ | 460/4286 [2:36:35<23:36:07, 22.21s/it] {'loss': 0.1453, 'grad_norm': 5.698367540838051, 'learning_rate': 8.92673821745217e-07, 'completion_length': 141.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.3244047909975052, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.1279762983322144, 'reward_std': 0.4952087253332138, 'kl': 3.62890625, 'epoch': 0.11}
11%|█ | 461/4286 [2:37:01<24:59:40, 23.52s/it] {'loss': 0.1412, 'grad_norm': 9.26672908033604, 'learning_rate': 8.924405039664022e-07, 'completion_length': 146.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.3125000223517418, 'rewards/format_reward': 0.785714328289032, 'reward': 1.0982143878936768, 'reward_std': 0.3879442662000656, 'kl': 3.5390625, 'epoch': 0.11}
11%|█ | 462/4286 [2:37:16<22:05:29, 20.80s/it] {'loss': 0.0215, 'grad_norm': 2.3290467381910682, 'learning_rate': 8.922071861875875e-07, 'completion_length': 110.0535774230957, 'rewards/only_full_func_accuracy_reward': 0.5208333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5029762983322144, 'reward_std': 0.12046193704009056, 'kl': 0.5380859375, 'epoch': 0.11}
11%|█ | 463/4286 [2:37:37<22:03:46, 20.78s/it] {'loss': 0.0759, 'grad_norm': 2.5733202221032436, 'learning_rate': 8.919738684087728e-07, 'completion_length': 135.60715103149414, 'rewards/only_full_func_accuracy_reward': 0.4047619551420212, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.2976191639900208, 'reward_std': 0.22314956784248352, 'kl': 1.8984375, 'epoch': 0.11}
11%|█ | 464/4286 [2:37:58<22:22:12, 21.07s/it] {'loss': 0.1337, 'grad_norm': 25.57361666774237, 'learning_rate': 8.91740550629958e-07, 'completion_length': 147.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.2946428880095482, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.0982144176959991, 'reward_std': 0.34880543500185013, 'kl': 3.3359375, 'epoch': 0.11}
11%|█ | 465/4286 [2:38:23<23:39:43, 22.29s/it] {'loss': 0.1041, 'grad_norm': 7.691301173018463, 'learning_rate': 8.915072328511432e-07, 'completion_length': 133.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.3318452537059784, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.135416716337204, 'reward_std': 0.4291394352912903, 'kl': 2.6015625, 'epoch': 0.11}
11%|█ | 466/4286 [2:38:48<24:27:32, 23.05s/it] {'loss': 0.1232, 'grad_norm': 8.302594250744672, 'learning_rate': 8.912739150723285e-07, 'completion_length': 151.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.2827381193637848, 'rewards/format_reward': 0.785714328289032, 'reward': 1.0684524774551392, 'reward_std': 0.38709530234336853, 'kl': 3.0859375, 'epoch': 0.11}
11%|█ | 467/4286 [2:39:08<23:21:09, 22.01s/it] {'loss': 0.0646, 'grad_norm': 3.396520214282787, 'learning_rate': 8.910405972935138e-07, 'completion_length': 123.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.4523809999227524, 'rewards/format_reward': 0.892857164144516, 'reward': 1.345238208770752, 'reward_std': 0.29540279507637024, 'kl': 1.6171875, 'epoch': 0.11}
11%|█ | 468/4286 [2:39:22<20:48:55, 19.63s/it] {'loss': 0.017, 'grad_norm': 2.0321759693126893, 'learning_rate': 8.90807279514699e-07, 'completion_length': 103.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.48363097012043, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.46577388048172, 'reward_std': 0.09895450621843338, 'kl': 0.4267578125, 'epoch': 0.11}
11%|█ | 469/4286 [2:39:47<22:25:35, 21.15s/it] {'loss': 0.0426, 'grad_norm': 4.690341805885383, 'learning_rate': 8.905739617358843e-07, 'completion_length': 124.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.3482143133878708, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.294642984867096, 'reward_std': 0.1804131120443344, 'kl': 1.064453125, 'epoch': 0.11}
[2025-03-02 07:47:23,438] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
11%|█ | 470/4286 [2:40:08<22:20:18, 21.07s/it] {'loss': 0.0493, 'grad_norm': 9.983452841829513, 'learning_rate': 8.903406439570695e-07, 'completion_length': 118.08929061889648, 'rewards/only_full_func_accuracy_reward': 0.3020833432674408, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2306548357009888, 'reward_std': 0.2392030954360962, 'kl': 1.234375, 'epoch': 0.11}
11%|█ | 471/4286 [2:40:23<20:23:07, 19.24s/it] {'loss': 0.023, 'grad_norm': 2.9334598766948345, 'learning_rate': 8.901073261782548e-07, 'completion_length': 108.62500381469727, 'rewards/only_full_func_accuracy_reward': 0.4196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.419642984867096, 'reward_std': 0.04434220213443041, 'kl': 0.5771484375, 'epoch': 0.11}
11%|█ | 472/4286 [2:40:37<18:47:46, 17.74s/it] {'loss': 0.0128, 'grad_norm': 1.4106644444656866, 'learning_rate': 8.898740083994401e-07, 'completion_length': 103.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.5297619700431824, 'rewards/format_reward': 1.0, 'reward': 1.5297619700431824, 'reward_std': 0.04627085104584694, 'kl': 0.3203125, 'epoch': 0.11}
11%|█ | 473/4286 [2:40:51<17:37:37, 16.64s/it] {'loss': 0.0129, 'grad_norm': 7.804189739935352, 'learning_rate': 8.896406906206253e-07, 'completion_length': 103.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.0892857201397419, 'kl': 0.32421875, 'epoch': 0.11}
11%|█ | 474/4286 [2:41:07<17:18:48, 16.35s/it] {'loss': 0.0184, 'grad_norm': 2.5342527944794195, 'learning_rate': 8.894073728418105e-07, 'completion_length': 108.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.4345238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4166667461395264, 'reward_std': 0.07142858020961285, 'kl': 0.4580078125, 'epoch': 0.11}
11%|█ | 475/4286 [2:41:20<16:31:56, 15.62s/it] {'loss': 0.0134, 'grad_norm': 1.6083100876047178, 'learning_rate': 8.891740550629959e-07, 'completion_length': 106.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.3422619253396988, 'rewards/format_reward': 1.0, 'reward': 1.3422619700431824, 'reward_std': 0.07167531549930573, 'kl': 0.333984375, 'epoch': 0.11}
11%|█ | 476/4286 [2:41:34<16:01:40, 15.14s/it] {'loss': 0.0111, 'grad_norm': 2.4242442857807074, 'learning_rate': 8.889407372841811e-07, 'completion_length': 113.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.4747023731470108, 'rewards/format_reward': 1.0, 'reward': 1.4747024774551392, 'reward_std': 0.055946310982108116, 'kl': 0.27734375, 'epoch': 0.11}
11%|█ | 477/4286 [2:41:48<15:37:40, 14.77s/it] {'loss': 0.0129, 'grad_norm': 2.4463389360054917, 'learning_rate': 8.887074195053663e-07, 'completion_length': 101.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.3333333432674408, 'rewards/format_reward': 1.0, 'reward': 1.333333432674408, 'reward_std': 0.041913408786058426, 'kl': 0.32421875, 'epoch': 0.11}
11%|█ | 478/4286 [2:42:02<15:21:24, 14.52s/it] {'loss': 0.0132, 'grad_norm': 1.598598782201101, 'learning_rate': 8.884741017265515e-07, 'completion_length': 105.01786422729492, 'rewards/only_full_func_accuracy_reward': 0.49851194024086, 'rewards/format_reward': 1.0, 'reward': 1.4985119700431824, 'reward_std': 0.04839899018406868, 'kl': 0.3291015625, 'epoch': 0.11}
11%|█ | 479/4286 [2:42:16<15:08:52, 14.32s/it] {'loss': 0.0122, 'grad_norm': 0.8605547746821859, 'learning_rate': 8.882407839477369e-07, 'completion_length': 109.83929061889648, 'rewards/only_full_func_accuracy_reward': 0.3333333730697632, 'rewards/format_reward': 1.0, 'reward': 1.333333432674408, 'reward_std': 0.0357142873108387, 'kl': 0.3056640625, 'epoch': 0.11}
11%|█ | 480/4286 [2:42:35<16:39:28, 15.76s/it] {'loss': 0.0227, 'grad_norm': 2.891805907025466, 'learning_rate': 8.880074661689221e-07, 'completion_length': 116.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.3199404925107956, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.302083432674408, 'reward_std': 0.13641267642378807, 'kl': 0.5673828125, 'epoch': 0.11}
11%|█ | 481/4286 [2:42:49<16:04:21, 15.21s/it] {'loss': 0.0124, 'grad_norm': 0.8093837545652827, 'learning_rate': 8.877741483901073e-07, 'completion_length': 108.4285774230957, 'rewards/only_full_func_accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3660715222358704, 'reward_std': 0.010309826582670212, 'kl': 0.310546875, 'epoch': 0.11}
11%|█ | 482/4286 [2:43:04<15:51:21, 15.01s/it] {'loss': 0.0127, 'grad_norm': 10.294326474212227, 'learning_rate': 8.875408306112926e-07, 'completion_length': 112.60715103149414, 'rewards/only_full_func_accuracy_reward': 0.4449404776096344, 'rewards/format_reward': 1.0, 'reward': 1.4449406266212463, 'reward_std': 0.06115180440247059, 'kl': 0.318359375, 'epoch': 0.11}
11%|█▏ | 483/4286 [2:43:18<15:38:14, 14.80s/it] {'loss': 0.013, 'grad_norm': 1.7666088271778566, 'learning_rate': 8.873075128324778e-07, 'completion_length': 112.58929061889648, 'rewards/only_full_func_accuracy_reward': 0.4389881044626236, 'rewards/format_reward': 1.0, 'reward': 1.438988208770752, 'reward_std': 0.05519942194223404, 'kl': 0.3232421875, 'epoch': 0.11}
[2025-03-02 07:50:48,749] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
11%|█▏ | 484/4286 [2:43:33<15:38:28, 14.81s/it] {'loss': 0.0127, 'grad_norm': 2.9067650551678375, 'learning_rate': 8.870741950536631e-07, 'completion_length': 110.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.4017857164144516, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.0773809552192688, 'kl': 0.31640625, 'epoch': 0.11}
11%|█▏ | 485/4286 [2:43:52<17:05:27, 16.19s/it] {'loss': 0.0332, 'grad_norm': 4.051299152542411, 'learning_rate': 8.868408772748484e-07, 'completion_length': 129.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5937501788139343, 'reward_std': 0.14622367173433304, 'kl': 0.83203125, 'epoch': 0.11}
11%|█▏ | 486/4286 [2:44:07<16:40:45, 15.80s/it] {'loss': 0.0191, 'grad_norm': 7.253272675959441, 'learning_rate': 8.866075594960336e-07, 'completion_length': 117.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.4226190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4047620296478271, 'reward_std': 0.11995860561728477, 'kl': 0.4794921875, 'epoch': 0.11}
11%|█▏ | 487/4286 [2:44:22<16:25:54, 15.57s/it] {'loss': 0.0123, 'grad_norm': 0.38809040860067917, 'learning_rate': 8.863742417172188e-07, 'completion_length': 113.1785774230957, 'rewards/only_full_func_accuracy_reward': 0.416666716337204, 'rewards/format_reward': 1.0, 'reward': 1.4166668057441711, 'reward_std': 0.011904762126505375, 'kl': 0.30859375, 'epoch': 0.11}
11%|█▏ | 488/4286 [2:44:40<17:00:58, 16.13s/it] {'loss': 0.0283, 'grad_norm': 2.2726172573409675, 'learning_rate': 8.861409239384041e-07, 'completion_length': 119.89286422729492, 'rewards/only_full_func_accuracy_reward': 0.4345238506793976, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3988096117973328, 'reward_std': 0.1071428656578064, 'kl': 0.7060546875, 'epoch': 0.11}
11%|█▏ | 489/4286 [2:44:57<17:15:29, 16.36s/it] {'loss': 0.0335, 'grad_norm': 3.354594089016756, 'learning_rate': 8.859076061595894e-07, 'completion_length': 119.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.2916666865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2738096714019775, 'reward_std': 0.12641431391239166, 'kl': 0.83984375, 'epoch': 0.11}
11%|█▏ | 490/4286 [2:45:11<16:43:37, 15.86s/it] {'loss': 0.0127, 'grad_norm': 6.695458240403191, 'learning_rate': 8.856742883807746e-07, 'completion_length': 114.46429061889648, 'rewards/only_full_func_accuracy_reward': 0.3854166716337204, 'rewards/format_reward': 1.0, 'reward': 1.3854167461395264, 'reward_std': 0.08503580279648304, 'kl': 0.3173828125, 'epoch': 0.11}
11%|█▏ | 491/4286 [2:45:27<16:40:30, 15.82s/it] {'loss': 0.0245, 'grad_norm': 3.5687247515137823, 'learning_rate': 8.854409706019598e-07, 'completion_length': 113.91072082519531, 'rewards/only_full_func_accuracy_reward': 0.5014881193637848, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.46577388048172, 'reward_std': 0.09895450435578823, 'kl': 0.6123046875, 'epoch': 0.11}
11%|█▏ | 492/4286 [2:45:44<16:56:17, 16.07s/it] {'loss': 0.0318, 'grad_norm': 3.1278715596703064, 'learning_rate': 8.852076528231452e-07, 'completion_length': 113.1785774230957, 'rewards/only_full_func_accuracy_reward': 0.575892835855484, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5401787161827087, 'reward_std': 0.2242012470960617, 'kl': 0.796875, 'epoch': 0.11}
12%|█▏ | 493/4286 [2:46:03<18:07:37, 17.20s/it] {'loss': 0.0586, 'grad_norm': 2.15685468385493, 'learning_rate': 8.849743350443304e-07, 'completion_length': 127.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.4389881193637848, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3854168057441711, 'reward_std': 0.11840658262372017, 'kl': 1.4658203125, 'epoch': 0.12}
12%|█▏ | 494/4286 [2:46:21<18:20:26, 17.41s/it] {'loss': 0.0454, 'grad_norm': 2.169324475638278, 'learning_rate': 8.847410172655156e-07, 'completion_length': 125.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.430059552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.412202537059784, 'reward_std': 0.0863095335662365, 'kl': 1.1328125, 'epoch': 0.12}
12%|█▏ | 495/4286 [2:46:42<19:15:12, 18.28s/it] {'loss': 0.0687, 'grad_norm': 2.3547039238618894, 'learning_rate': 8.845076994867009e-07, 'completion_length': 126.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.495535746216774, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3883929252624512, 'reward_std': 0.2677491381764412, 'kl': 1.71484375, 'epoch': 0.12}
12%|█▏ | 496/4286 [2:47:07<21:21:44, 20.29s/it] {'loss': 0.1569, 'grad_norm': 6.5653229956771195, 'learning_rate': 8.842743817078862e-07, 'completion_length': 163.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.803571492433548, 'reward': 1.1071429252624512, 'reward_std': 0.38789381086826324, 'kl': 3.921875, 'epoch': 0.12}
12%|█▏ | 497/4286 [2:47:27<21:27:28, 20.39s/it] {'loss': 0.0863, 'grad_norm': 4.438641029037376, 'learning_rate': 8.840410639290714e-07, 'completion_length': 134.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.892857164144516, 'reward': 1.330357313156128, 'reward_std': 0.28561191260814667, 'kl': 2.15625, 'epoch': 0.12}
12%|█▏ | 498/4286 [2:47:52<22:51:51, 21.73s/it] {'loss': 0.1518, 'grad_norm': 4.990812045306617, 'learning_rate': 8.838077461502567e-07, 'completion_length': 150.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.4166666865348816, 'rewards/format_reward': 0.7678571939468384, 'reward': 1.1845239400863647, 'reward_std': 0.4121510982513428, 'kl': 3.796875, 'epoch': 0.12}
12%|█▏ | 499/4286 [2:48:14<22:52:02, 21.74s/it] {'loss': 0.0891, 'grad_norm': 3.8357070110282283, 'learning_rate': 8.835744283714419e-07, 'completion_length': 145.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.32738097012043, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.2023810148239136, 'reward_std': 0.31235450506210327, 'kl': 2.22265625, 'epoch': 0.12}
12%|█▏ | 500/4286 [2:48:36<23:06:54, 21.98s/it] {'loss': 0.0902, 'grad_norm': 5.392780300912198, 'learning_rate': 8.833411105926272e-07, 'completion_length': 144.9285774230957, 'rewards/only_full_func_accuracy_reward': 0.3154762089252472, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2261905670166016, 'reward_std': 0.17649808526039124, 'kl': 2.26171875, 'epoch': 0.12}
12%|█▏ | 501/4286 [2:55:24<144:42:09, 137.63s/it] {'loss': 0.0746, 'grad_norm': 4.4229769844660005, 'learning_rate': 8.831077928138124e-07, 'completion_length': 155.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.2589286044239998, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.133928656578064, 'reward_std': 0.27852821350097656, 'kl': 1.859375, 'epoch': 0.12}
12%|█▏ | 502/4286 [2:55:42<107:05:35, 101.89s/it] {'loss': 0.0397, 'grad_norm': 2.5676160965301302, 'learning_rate': 8.828744750349977e-07, 'completion_length': 129.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.38988097012043, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3184524774551392, 'reward_std': 0.20413333177566528, 'kl': 0.99609375, 'epoch': 0.12}
12%|█▏ | 503/4286 [2:55:57<79:34:19, 75.72s/it] {'loss': 0.0167, 'grad_norm': 1.7446823869881043, 'learning_rate': 8.826411572561829e-07, 'completion_length': 112.60715103149414, 'rewards/only_full_func_accuracy_reward': 0.3482143059372902, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3303571939468384, 'reward_std': 0.08600886911153793, 'kl': 0.4189453125, 'epoch': 0.12}
12%|█▏ | 504/4286 [2:56:13<60:51:49, 57.93s/it] {'loss': 0.0294, 'grad_norm': 2.2891901423101744, 'learning_rate': 8.824078394773681e-07, 'completion_length': 129.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.4851190745830536, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4315477013587952, 'reward_std': 0.1911102570593357, 'kl': 0.734375, 'epoch': 0.12}
12%|█▏ | 505/4286 [2:56:32<48:19:58, 46.02s/it] {'loss': 0.0227, 'grad_norm': 2.020078059647909, 'learning_rate': 8.821745216985535e-07, 'completion_length': 136.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.2321428805589676, 'rewards/format_reward': 1.0, 'reward': 1.2321429252624512, 'reward_std': 0.03688185662031174, 'kl': 0.568359375, 'epoch': 0.12}
12%|█▏ | 506/4286 [2:56:52<40:13:51, 38.32s/it] {'loss': 0.0268, 'grad_norm': 2.978042747633364, 'learning_rate': 8.819412039197387e-07, 'completion_length': 136.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.3377976417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3199405670166016, 'reward_std': 0.08563542738556862, 'kl': 0.6669921875, 'epoch': 0.12}
12%|█▏ | 507/4286 [2:57:07<32:53:45, 31.34s/it] {'loss': 0.0121, 'grad_norm': 3.4411886472027193, 'learning_rate': 8.817078861409239e-07, 'completion_length': 120.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.4761904776096344, 'rewards/format_reward': 1.0, 'reward': 1.4761906266212463, 'reward_std': 0.0476190485060215, 'kl': 0.3037109375, 'epoch': 0.12}
12%|█▏ | 508/4286 [2:57:22<27:45:17, 26.45s/it] {'loss': 0.0115, 'grad_norm': 3.296862605789573, 'learning_rate': 8.814745683621092e-07, 'completion_length': 128.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.4538690447807312, 'rewards/format_reward': 1.0, 'reward': 1.4538691639900208, 'reward_std': 0.07825539447367191, 'kl': 0.287109375, 'epoch': 0.12}
12%|█▏ | 509/4286 [2:57:38<24:22:34, 23.23s/it] {'loss': 0.0109, 'grad_norm': 1.653926849823056, 'learning_rate': 8.812412505832945e-07, 'completion_length': 132.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.453869104385376, 'rewards/format_reward': 1.0, 'reward': 1.453869104385376, 'reward_std': 0.056120261549949646, 'kl': 0.2724609375, 'epoch': 0.12}
12%|█▏ | 510/4286 [2:57:53<21:50:15, 20.82s/it] {'loss': 0.0117, 'grad_norm': 1.926115819210627, 'learning_rate': 8.810079328044797e-07, 'completion_length': 126.4285774230957, 'rewards/only_full_func_accuracy_reward': 0.3943452835083008, 'rewards/format_reward': 1.0, 'reward': 1.3943453431129456, 'reward_std': 0.10005596466362476, 'kl': 0.29296875, 'epoch': 0.12}
12%|█▏ | 511/4286 [2:58:09<20:25:21, 19.48s/it] {'loss': 0.0112, 'grad_norm': 1.6284763488040161, 'learning_rate': 8.807746150256649e-07, 'completion_length': 141.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.2916667014360428, 'rewards/format_reward': 1.0, 'reward': 1.2916667461395264, 'reward_std': 0.09463846869766712, 'kl': 0.2802734375, 'epoch': 0.12}
12%|█▏ | 512/4286 [2:58:25<19:20:32, 18.45s/it] {'loss': 0.011, 'grad_norm': 1.3301878610603501, 'learning_rate': 8.805412972468502e-07, 'completion_length': 133.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.430059552192688, 'rewards/format_reward': 1.0, 'reward': 1.4300596714019775, 'reward_std': 0.04935433901846409, 'kl': 0.275390625, 'epoch': 0.12}
12%|█▏ | 513/4286 [2:58:45<19:49:43, 18.92s/it] {'loss': 0.0315, 'grad_norm': 2.4707611093605477, 'learning_rate': 8.803079794680355e-07, 'completion_length': 138.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.357142984867096, 'reward_std': 0.1061877142637968, 'kl': 0.787109375, 'epoch': 0.12}
12%|█▏ | 514/4286 [2:59:05<19:55:16, 19.01s/it] {'loss': 0.0221, 'grad_norm': 1.8530584264346366, 'learning_rate': 8.800746616892207e-07, 'completion_length': 142.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.4538690745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4360119700431824, 'reward_std': 0.05495268478989601, 'kl': 0.552734375, 'epoch': 0.12}
12%|█▏ | 515/4286 [2:59:21<19:06:14, 18.24s/it] {'loss': 0.0112, 'grad_norm': 1.8378980083508578, 'learning_rate': 8.79841343910406e-07, 'completion_length': 141.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.346726194024086, 'rewards/format_reward': 1.0, 'reward': 1.3467262387275696, 'reward_std': 0.06844069808721542, 'kl': 0.28125, 'epoch': 0.12}
12%|█▏ | 516/4286 [2:59:37<18:25:16, 17.59s/it] {'loss': 0.011, 'grad_norm': 1.1509647150590359, 'learning_rate': 8.796080261315912e-07, 'completion_length': 143.01786422729492, 'rewards/only_full_func_accuracy_reward': 0.2321428656578064, 'rewards/format_reward': 1.0, 'reward': 1.2321429252624512, 'reward_std': 0.011904764920473099, 'kl': 0.2744140625, 'epoch': 0.12}
12%|█▏ | 517/4286 [2:59:54<18:11:42, 17.38s/it] {'loss': 0.0166, 'grad_norm': 3.363479235368074, 'learning_rate': 8.793747083527765e-07, 'completion_length': 142.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.4080357551574707, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.390178620815277, 'reward_std': 0.09857528284192085, 'kl': 0.4150390625, 'epoch': 0.12}
12%|█▏ | 518/4286 [3:00:14<19:05:24, 18.24s/it] {'loss': 0.0275, 'grad_norm': 4.653551626092804, 'learning_rate': 8.791413905739618e-07, 'completion_length': 155.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.3526785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3348215818405151, 'reward_std': 0.11319411918520927, 'kl': 0.689453125, 'epoch': 0.12}
12%|█▏ | 519/4286 [3:00:34<19:36:54, 18.75s/it] {'loss': 0.0301, 'grad_norm': 1.7706498180977606, 'learning_rate': 8.78908072795147e-07, 'completion_length': 159.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.513988122344017, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4782739281654358, 'reward_std': 0.11779477261006832, 'kl': 0.748046875, 'epoch': 0.12}
12%|█▏ | 520/4286 [3:00:57<20:57:14, 20.03s/it] {'loss': 0.0649, 'grad_norm': 3.721316357791014, 'learning_rate': 8.786747550163322e-07, 'completion_length': 169.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.3229166865348816, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2514881491661072, 'reward_std': 0.24733296036720276, 'kl': 1.623046875, 'epoch': 0.12}
12%|█▏ | 521/4286 [3:01:18<21:02:54, 20.13s/it] {'loss': 0.038, 'grad_norm': 2.1178880456251283, 'learning_rate': 8.784414372375176e-07, 'completion_length': 157.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3928572535514832, 'reward_std': 0.12763753533363342, 'kl': 0.9482421875, 'epoch': 0.12}
12%|█▏ | 522/4286 [3:01:43<22:32:58, 21.57s/it] {'loss': 0.0471, 'grad_norm': 2.6410751090253224, 'learning_rate': 8.782081194587028e-07, 'completion_length': 160.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.38660717010498047, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.350892961025238, 'reward_std': 0.12238970398902893, 'kl': 1.17578125, 'epoch': 0.12}
12%|█▏ | 523/4286 [3:02:09<24:02:26, 23.00s/it] {'loss': 0.121, 'grad_norm': 3.616732863934688, 'learning_rate': 8.77974801679888e-07, 'completion_length': 187.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.2574405074119568, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.114583432674408, 'reward_std': 0.2741176187992096, 'kl': 3.015625, 'epoch': 0.12}
[2025-03-02 08:09:52,831] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
12%|█▏ | 524/4286 [3:02:37<25:36:31, 24.51s/it] {'loss': 0.1855, 'grad_norm': 7.0860914154208725, 'learning_rate': 8.777414839010732e-07, 'completion_length': 202.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.33690477907657623, 'rewards/format_reward': 0.7678571939468384, 'reward': 1.1047619581222534, 'reward_std': 0.42793357372283936, 'kl': 4.640625, 'epoch': 0.12}
[2025-03-02 08:10:22,859] [WARNING] [stage3.py:2134:step] 3 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
12%|█▏ | 525/4286 [3:03:07<27:19:57, 26.16s/it] {'loss': 0.2419, 'grad_norm': 20.804562850155254, 'learning_rate': 8.775081661222586e-07, 'completion_length': 222.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.7321428954601288, 'reward': 1.0892858505249023, 'reward_std': 0.2514236569404602, 'kl': 6.046875, 'epoch': 0.12}
[2025-03-02 08:10:49,322] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
12%|█▏ | 526/4286 [3:03:33<27:25:08, 26.25s/it] {'loss': 0.1324, 'grad_norm': 5.135110221989833, 'learning_rate': 8.772748483434438e-07, 'completion_length': 188.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.4032738357782364, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.2604168057441711, 'reward_std': 0.3543187379837036, 'kl': 3.3203125, 'epoch': 0.12}
12%|█▏ | 527/4286 [3:04:00<27:23:55, 26.24s/it] {'loss': 0.171, 'grad_norm': 5.500431770588612, 'learning_rate': 8.77041530564629e-07, 'completion_length': 217.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.316964328289032, 'rewards/format_reward': 0.7678571939468384, 'reward': 1.0848214626312256, 'reward_std': 0.4128701388835907, 'kl': 4.2734375, 'epoch': 0.12}
12%|█▏ | 528/4286 [3:04:27<27:38:43, 26.48s/it] {'loss': 0.1153, 'grad_norm': 4.925728703730332, 'learning_rate': 8.768082127858143e-07, 'completion_length': 210.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.3110119327902794, 'rewards/format_reward': 0.785714328289032, 'reward': 1.096726268529892, 'reward_std': 0.3668268248438835, 'kl': 2.8828125, 'epoch': 0.12}
12%|█▏ | 529/4286 [3:04:48<26:00:32, 24.92s/it] {'loss': 0.0311, 'grad_norm': 4.095464743843839, 'learning_rate': 8.765748950069996e-07, 'completion_length': 148.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4017858505249023, 'reward_std': 0.1672876924276352, 'kl': 0.775390625, 'epoch': 0.12}
12%|█▏ | 530/4286 [3:05:13<26:02:23, 24.96s/it] {'loss': 0.0332, 'grad_norm': 5.504130280565536, 'learning_rate': 8.763415772281848e-07, 'completion_length': 152.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5491071343421936, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4955357909202576, 'reward_std': 0.13439881429076195, 'kl': 0.830078125, 'epoch': 0.12}
12%|█▏ | 531/4286 [3:05:37<25:38:51, 24.59s/it] {'loss': 0.0364, 'grad_norm': 4.21508532746904, 'learning_rate': 8.761082594493701e-07, 'completion_length': 153.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.2931547909975052, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2395834922790527, 'reward_std': 0.1215964499861002, 'kl': 0.9130859375, 'epoch': 0.12}
12%|█▏ | 532/4286 [3:05:52<22:48:28, 21.87s/it] {'loss': 0.0107, 'grad_norm': 1.610805638328452, 'learning_rate': 8.758749416705553e-07, 'completion_length': 132.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.4732143431901932, 'rewards/format_reward': 1.0, 'reward': 1.4732144474983215, 'reward_std': 0.05038156360387802, 'kl': 0.2666015625, 'epoch': 0.12}
[2025-03-02 08:13:23,692] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
12%|█▏ | 533/4286 [3:06:08<20:49:01, 19.97s/it] {'loss': 0.0116, 'grad_norm': 1.8024389541737245, 'learning_rate': 8.756416238917405e-07, 'completion_length': 122.33929061889648, 'rewards/only_full_func_accuracy_reward': 0.37351194024086, 'rewards/format_reward': 1.0, 'reward': 1.3735120296478271, 'reward_std': 0.06458841636776924, 'kl': 0.2890625, 'epoch': 0.12}
12%|█▏ | 534/4286 [3:06:34<22:45:45, 21.84s/it] {'loss': 0.0353, 'grad_norm': 5.13731425818045, 'learning_rate': 8.754083061129258e-07, 'completion_length': 161.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.3883928805589676, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3169643878936768, 'reward_std': 0.18769367784261703, 'kl': 0.884765625, 'epoch': 0.12}
12%|█▏ | 535/4286 [3:06:56<22:40:09, 21.76s/it] {'loss': 0.0308, 'grad_norm': 3.448092971663821, 'learning_rate': 8.751749883341111e-07, 'completion_length': 152.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.3928571790456772, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571429252624512, 'reward_std': 0.2067384086549282, 'kl': 0.76953125, 'epoch': 0.12}
13%|█▎ | 536/4286 [3:07:12<20:54:23, 20.07s/it] {'loss': 0.01, 'grad_norm': 1.5115104983981402, 'learning_rate': 8.749416705552962e-07, 'completion_length': 132.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.04761905036866665, 'kl': 0.24951171875, 'epoch': 0.13}
13%|█▎ | 537/4286 [3:07:33<21:15:30, 20.41s/it] {'loss': 0.0264, 'grad_norm': 2.064107879083672, 'learning_rate':
8.747083527764814e-07, 'completion_length': 140.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.352678582072258, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.316964328289032, 'reward_std': 0.10486217215657234, 'kl': 0.6611328125, 'epoch': 0.13} 13%|█▎ | 537/4286 [3:07:33<21:15:30, 20.41s/it] 13%|█▎ | 538/4286 [3:07:49<19:52:55, 19.10s/it] {'loss': 0.0115, 'grad_norm': 1.3124778385209814, 'learning_rate': 8.744750349976668e-07, 'completion_length': 132.96429061889648, 'rewards/only_full_func_accuracy_reward': 0.5297619551420212, 'rewards/format_reward': 1.0, 'reward': 1.5297620296478271, 'reward_std': 0.037555962800979614, 'kl': 0.2861328125, 'epoch': 0.13} 13%|█▎ | 538/4286 [3:07:49<19:52:55, 19.10s/it] 13%|█▎ | 539/4286 [3:08:07<19:36:42, 18.84s/it] {'loss': 0.0237, 'grad_norm': 0.9420718876809061, 'learning_rate': 8.74241717218852e-07, 'completion_length': 135.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.4880952686071396, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.470238208770752, 'reward_std': 0.07008037716150284, 'kl': 0.5927734375, 'epoch': 0.13} 13%|█▎ | 539/4286 [3:08:07<19:36:42, 18.84s/it] 13%|█▎ | 540/4286 [3:08:23<18:40:22, 17.95s/it] {'loss': 0.0113, 'grad_norm': 0.8147072635679532, 'learning_rate': 8.740083994400372e-07, 'completion_length': 129.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.3065476417541504, 'rewards/format_reward': 1.0, 'reward': 1.3065477013587952, 'reward_std': 0.031603580340743065, 'kl': 0.2841796875, 'epoch': 0.13} 13%|█▎ | 540/4286 [3:08:23<18:40:22, 17.95s/it] 13%|█▎ | 541/4286 [3:08:44<19:39:17, 18.89s/it] {'loss': 0.0318, 'grad_norm': 1.88523000033768, 'learning_rate': 8.737750816612225e-07, 'completion_length': 139.25, 'rewards/only_full_func_accuracy_reward': 0.38988097012043, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3720239400863647, 'reward_std': 0.10546308197081089, 'kl': 0.7978515625, 'epoch': 0.13} 13%|█▎ | 541/4286 [3:08:44<19:39:17, 
18.89s/it] 13%|█▎ | 542/4286 [3:09:00<18:42:10, 17.98s/it] {'loss': 0.0115, 'grad_norm': 3.2920111074415095, 'learning_rate': 8.735417638824078e-07, 'completion_length': 131.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.4687500447034836, 'rewards/format_reward': 1.0, 'reward': 1.4687501192092896, 'reward_std': 0.06090507283806801, 'kl': 0.2890625, 'epoch': 0.13} 13%|█▎ | 542/4286 [3:09:00<18:42:10, 17.98s/it] 13%|█▎ | 543/4286 [3:09:19<18:56:42, 18.22s/it] {'loss': 0.0433, 'grad_norm': 2.3668247823599455, 'learning_rate': 8.73308446103593e-07, 'completion_length': 126.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2857143878936768, 'reward_std': 0.12088929861783981, 'kl': 1.08203125, 'epoch': 0.13} 13%|█▎ | 543/4286 [3:09:19<18:56:42, 18.22s/it] 13%|█▎ | 544/4286 [3:09:44<21:08:22, 20.34s/it] {'loss': 0.0636, 'grad_norm': 1.7737258292838893, 'learning_rate': 8.730751283247782e-07, 'completion_length': 155.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.3809524327516556, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3273810744285583, 'reward_std': 0.1781049445271492, 'kl': 1.59375, 'epoch': 0.13} 13%|█▎ | 544/4286 [3:09:44<21:08:22, 20.34s/it] 13%|█▎ | 545/4286 [3:10:03<20:44:15, 19.96s/it] {'loss': 0.0303, 'grad_norm': 1.5993426998965437, 'learning_rate': 8.728418105459635e-07, 'completion_length': 151.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.40684526413679123, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3711310625076294, 'reward_std': 0.06523960828781128, 'kl': 0.7587890625, 'epoch': 0.13} 13%|█▎ | 545/4286 [3:10:03<20:44:15, 19.96s/it] 13%|█▎ | 546/4286 [3:10:20<19:39:38, 18.92s/it] {'loss': 0.0178, 'grad_norm': 1.3745845468999054, 'learning_rate': 8.726084927671488e-07, 'completion_length': 136.375, 'rewards/only_full_func_accuracy_reward': 0.3482142984867096, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.3303572535514832, 'reward_std': 0.10553208738565445, 'kl': 0.4443359375, 'epoch': 0.13} 13%|█▎ | 546/4286 [3:10:20<19:39:38, 18.92s/it] 13%|█▎ | 547/4286 [3:10:43<20:57:33, 20.18s/it] {'loss': 0.0876, 'grad_norm': 4.950446945637646, 'learning_rate': 8.72375174988334e-07, 'completion_length': 166.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.3497024178504944, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.2247024178504944, 'reward_std': 0.3490256667137146, 'kl': 2.1875, 'epoch': 0.13} 13%|█▎ | 547/4286 [3:10:43<20:57:33, 20.18s/it] 13%|█▎ | 548/4286 [3:11:04<21:14:59, 20.47s/it] {'loss': 0.0531, 'grad_norm': 2.6370128464516522, 'learning_rate': 8.721418572095193e-07, 'completion_length': 163.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.3273809552192688, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2559524774551392, 'reward_std': 0.1896989569067955, 'kl': 1.328125, 'epoch': 0.13} 13%|█▎ | 548/4286 [3:11:04<21:14:59, 20.47s/it] 13%|█▎ | 549/4286 [3:11:21<20:16:33, 19.53s/it] {'loss': 0.0188, 'grad_norm': 1.6414456582334918, 'learning_rate': 8.719085394307045e-07, 'completion_length': 140.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.4375000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4196429252624512, 'reward_std': 0.10168395563960075, 'kl': 0.470703125, 'epoch': 0.13} 13%|█▎ | 549/4286 [3:11:21<20:16:33, 19.53s/it] 13%|█▎ | 550/4286 [3:11:43<20:59:42, 20.23s/it] {'loss': 0.0596, 'grad_norm': 2.6389325689859713, 'learning_rate': 8.716752216518897e-07, 'completion_length': 171.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.309523843228817, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2559524178504944, 'reward_std': 0.18978641368448734, 'kl': 1.4921875, 'epoch': 0.13} 13%|█▎ | 550/4286 [3:11:43<20:59:42, 20.23s/it] 13%|█▎ | 551/4286 [3:11:59<19:33:46, 18.86s/it] {'loss': 0.022, 'grad_norm': 2.127695754564891, 'learning_rate': 8.714419038730751e-07, 
'completion_length': 142.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.4925595372915268, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4747024774551392, 'reward_std': 0.11237061023712158, 'kl': 0.54931640625, 'epoch': 0.13} 13%|█▎ | 551/4286 [3:11:59<19:33:46, 18.86s/it] 13%|█▎ | 552/4286 [3:12:17<19:25:37, 18.73s/it] {'loss': 0.0427, 'grad_norm': 4.7841651718538305, 'learning_rate': 8.712085860942603e-07, 'completion_length': 141.51786422729492, 'rewards/only_full_func_accuracy_reward': 0.4806547909975052, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4270834922790527, 'reward_std': 0.125980862416327, 'kl': 1.0693359375, 'epoch': 0.13} 13%|█▎ | 552/4286 [3:12:17<19:25:37, 18.73s/it] 13%|█▎ | 553/4286 [3:12:36<19:26:40, 18.75s/it] {'loss': 0.0648, 'grad_norm': 2.963201035453107, 'learning_rate': 8.709752683154455e-07, 'completion_length': 154.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.4181548058986664, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3288692235946655, 'reward_std': 0.18228039145469666, 'kl': 1.62109375, 'epoch': 0.13} 13%|█▎ | 553/4286 [3:12:36<19:26:40, 18.75s/it] 13%|█▎ | 554/4286 [3:12:55<19:29:22, 18.80s/it] {'loss': 0.0215, 'grad_norm': 6.636463478013721, 'learning_rate': 8.707419505366308e-07, 'completion_length': 150.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.3199404999613762, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2842262983322144, 'reward_std': 0.163677129894495, 'kl': 0.53515625, 'epoch': 0.13} 13%|█▎ | 554/4286 [3:12:55<19:29:22, 18.80s/it] 13%|█▎ | 555/4286 [3:13:12<18:59:22, 18.32s/it] {'loss': 0.0131, 'grad_norm': 1.4017343003976264, 'learning_rate': 8.705086327578161e-07, 'completion_length': 158.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.44464288651943207, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4267858266830444, 'reward_std': 0.08555657416582108, 'kl': 0.32763671875, 'epoch': 0.13} 13%|█▎ | 555/4286 
[3:13:12<18:59:22, 18.32s/it] 13%|█▎ | 556/4286 [3:13:28<18:22:42, 17.74s/it] {'loss': 0.0113, 'grad_norm': 1.0060664740748744, 'learning_rate': 8.702753149790013e-07, 'completion_length': 158.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5178572088479996, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.037555959075689316, 'kl': 0.2822265625, 'epoch': 0.13} 13%|█▎ | 556/4286 [3:13:29<18:22:42, 17.74s/it] 13%|█▎ | 557/4286 [3:13:45<17:51:44, 17.24s/it] {'loss': 0.0196, 'grad_norm': 2.5001097041580826, 'learning_rate': 8.700419972001865e-07, 'completion_length': 134.96429061889648, 'rewards/only_full_func_accuracy_reward': 0.4494047909975052, 'rewards/format_reward': 1.0, 'reward': 1.4494048953056335, 'reward_std': 0.11177253723144531, 'kl': 0.4892578125, 'epoch': 0.13} 13%|█▎ | 557/4286 [3:13:45<17:51:44, 17.24s/it] 13%|█▎ | 558/4286 [3:14:01<17:33:36, 16.96s/it] {'loss': 0.0126, 'grad_norm': 2.6386880901476113, 'learning_rate': 8.698086794213718e-07, 'completion_length': 144.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.08014345541596413, 'kl': 0.314453125, 'epoch': 0.13} 13%|█▎ | 558/4286 [3:14:01<17:33:36, 16.96s/it] 13%|█▎ | 559/4286 [3:14:17<17:25:43, 16.83s/it] {'loss': 0.0152, 'grad_norm': 3.6816038988520865, 'learning_rate': 8.695753616425571e-07, 'completion_length': 143.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.4300595670938492, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4122024774551392, 'reward_std': 0.12914376333355904, 'kl': 0.3798828125, 'epoch': 0.13} 13%|█▎ | 559/4286 [3:14:17<17:25:43, 16.83s/it] 13%|█▎ | 560/4286 [3:14:34<17:28:47, 16.89s/it] {'loss': 0.0217, 'grad_norm': 2.0917127495143464, 'learning_rate': 8.693420438637423e-07, 'completion_length': 143.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.4970238208770752, 'rewards/format_reward': 
0.9285714626312256, 'reward': 1.4255954027175903, 'reward_std': 0.17428960651159286, 'kl': 0.54296875, 'epoch': 0.13} 13%|█▎ | 560/4286 [3:14:34<17:28:47, 16.89s/it] 13%|█▎ | 561/4286 [3:14:53<17:54:22, 17.31s/it] {'loss': 0.0386, 'grad_norm': 3.0342780032136254, 'learning_rate': 8.691087260849276e-07, 'completion_length': 151.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.3690476566553116, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2797619700431824, 'reward_std': 0.255227230489254, 'kl': 0.96484375, 'epoch': 0.13} 13%|█▎ | 561/4286 [3:14:53<17:54:22, 17.31s/it] 13%|█▎ | 562/4286 [3:15:13<18:54:37, 18.28s/it] {'loss': 0.0541, 'grad_norm': 4.555390212913784, 'learning_rate': 8.688754083061128e-07, 'completion_length': 155.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.4761905074119568, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3690477013587952, 'reward_std': 0.29308100044727325, 'kl': 1.3515625, 'epoch': 0.13} 13%|█▎ | 562/4286 [3:15:13<18:54:37, 18.28s/it] 13%|█▎ | 563/4286 [3:15:31<18:52:59, 18.26s/it] {'loss': 0.125, 'grad_norm': 4.877542853174328, 'learning_rate': 8.686420905272981e-07, 'completion_length': 148.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.0892857909202576, 'reward_std': 0.33365775644779205, 'kl': 3.1171875, 'epoch': 0.13} 13%|█▎ | 563/4286 [3:15:31<18:52:59, 18.26s/it] 13%|█▎ | 564/4286 [3:15:58<21:18:38, 20.61s/it] {'loss': 0.2209, 'grad_norm': 11.017529586334419, 'learning_rate': 8.684087727484834e-07, 'completion_length': 207.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.1354166716337204, 'rewards/format_reward': 0.517857164144516, 'reward': 0.6532738506793976, 'reward_std': 0.5766984820365906, 'kl': 5.515625, 'epoch': 0.13} 13%|█▎ | 564/4286 [3:15:58<21:18:38, 20.61s/it] 13%|█▎ | 565/4286 [3:16:22<22:20:29, 21.61s/it] {'loss': 0.1155, 'grad_norm': 4.9708313111739955, 'learning_rate': 
8.681754549696686e-07, 'completion_length': 170.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.3229167014360428, 'rewards/format_reward': 0.785714328289032, 'reward': 1.1086310148239136, 'reward_std': 0.4053483307361603, 'kl': 2.890625, 'epoch': 0.13} 13%|█▎ | 565/4286 [3:16:22<22:20:29, 21.61s/it] 13%|█▎ | 566/4286 [3:16:42<21:50:46, 21.14s/it] {'loss': 0.0649, 'grad_norm': 19.550370598851583, 'learning_rate': 8.679421371908538e-07, 'completion_length': 145.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.4345238506793976, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.30952388048172, 'reward_std': 0.3305308222770691, 'kl': 1.625, 'epoch': 0.13} 13%|█▎ | 566/4286 [3:16:42<21:50:46, 21.14s/it] 13%|█▎ | 567/4286 [3:17:02<21:35:40, 20.90s/it] {'loss': 0.0369, 'grad_norm': 3.4535695268100297, 'learning_rate': 8.677088194120391e-07, 'completion_length': 136.01786422729492, 'rewards/only_full_func_accuracy_reward': 0.4330357164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3973215818405151, 'reward_std': 0.3160755932331085, 'kl': 0.923828125, 'epoch': 0.13} 13%|█▎ | 567/4286 [3:17:02<21:35:40, 20.90s/it] 13%|█▎ | 568/4286 [3:17:19<20:25:38, 19.78s/it] {'loss': 0.0308, 'grad_norm': 3.660368067930068, 'learning_rate': 8.674755016332244e-07, 'completion_length': 137.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.367559552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3497024774551392, 'reward_std': 0.14714987576007843, 'kl': 0.76953125, 'epoch': 0.13} 13%|█▎ | 568/4286 [3:17:19<20:25:38, 19.78s/it] 13%|█▎ | 569/4286 [3:17:39<20:30:20, 19.86s/it] {'loss': 0.0361, 'grad_norm': 2.5554455176058695, 'learning_rate': 8.672421838544096e-07, 'completion_length': 137.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.4687500149011612, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3973215818405151, 'reward_std': 0.20044294744729996, 'kl': 0.90234375, 'epoch': 0.13} 13%|█▎ | 569/4286 
[3:17:39<20:30:20, 19.86s/it] 13%|█▎ | 570/4286 [3:17:56<19:33:25, 18.95s/it] {'loss': 0.0156, 'grad_norm': 1.1158570222445046, 'learning_rate': 8.670088660755948e-07, 'completion_length': 142.25, 'rewards/only_full_func_accuracy_reward': 0.4062500447034836, 'rewards/format_reward': 1.0, 'reward': 1.4062501192092896, 'reward_std': 0.04136601276695728, 'kl': 0.3896484375, 'epoch': 0.13} 13%|█▎ | 570/4286 [3:17:56<19:33:25, 18.95s/it] 13%|█▎ | 571/4286 [3:18:13<18:49:00, 18.23s/it] {'loss': 0.0117, 'grad_norm': 1.2968418399022592, 'learning_rate': 8.667755482967802e-07, 'completion_length': 138.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.5059524178504944, 'rewards/format_reward': 1.0, 'reward': 1.5059524774551392, 'reward_std': 0.06395573727786541, 'kl': 0.2919921875, 'epoch': 0.13} 13%|█▎ | 571/4286 [3:18:13<18:49:00, 18.23s/it] 13%|█▎ | 572/4286 [3:18:29<18:14:51, 17.69s/it] {'loss': 0.0111, 'grad_norm': 3.1141450736995457, 'learning_rate': 8.665422305179654e-07, 'completion_length': 138.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5252976566553116, 'rewards/format_reward': 1.0, 'reward': 1.52529776096344, 'reward_std': 0.10591553524136543, 'kl': 0.2783203125, 'epoch': 0.13} 13%|█▎ | 572/4286 [3:18:29<18:14:51, 17.69s/it] 13%|█▎ | 573/4286 [3:18:45<17:53:10, 17.34s/it] {'loss': 0.0118, 'grad_norm': 1.4321963567972449, 'learning_rate': 8.663089127391506e-07, 'completion_length': 126.1785774230957, 'rewards/only_full_func_accuracy_reward': 0.492559552192688, 'rewards/format_reward': 1.0, 'reward': 1.4925596117973328, 'reward_std': 0.0627467418089509, 'kl': 0.294921875, 'epoch': 0.13} 13%|█▎ | 573/4286 [3:18:45<17:53:10, 17.34s/it] 13%|█▎ | 574/4286 [3:19:01<17:12:41, 16.69s/it] {'loss': 0.0118, 'grad_norm': 1.5290167437673434, 'learning_rate': 8.660755949603359e-07, 'completion_length': 122.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5297619104385376, 'rewards/format_reward': 1.0, 'reward': 1.5297620296478271, 
'reward_std': 0.04123930633068085, 'kl': 0.294921875, 'epoch': 0.13} 13%|█▎ | 574/4286 [3:19:01<17:12:41, 16.69s/it] 13%|█▎ | 575/4286 [3:19:16<16:52:31, 16.37s/it] {'loss': 0.0129, 'grad_norm': 1.9148695297724507, 'learning_rate': 8.658422771815211e-07, 'completion_length': 129.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.352678582072258, 'rewards/format_reward': 1.0, 'reward': 1.3526785969734192, 'reward_std': 0.04900030139833689, 'kl': 0.322265625, 'epoch': 0.13} 13%|█▎ | 575/4286 [3:19:16<16:52:31, 16.37s/it] 13%|█▎ | 576/4286 [3:19:37<18:11:46, 17.66s/it] {'loss': 0.023, 'grad_norm': 1.101049010108734, 'learning_rate': 8.656089594027064e-07, 'completion_length': 140.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.3794642984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3616072535514832, 'reward_std': 0.07172521017491817, 'kl': 0.578125, 'epoch': 0.13} 13%|█▎ | 576/4286 [3:19:37<18:11:46, 17.66s/it] 13%|█▎ | 577/4286 [3:19:57<19:00:00, 18.44s/it] {'loss': 0.0203, 'grad_norm': 1.6364408502489434, 'learning_rate': 8.653756416238917e-07, 'completion_length': 135.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.657738208770752, 'reward_std': 0.10649015847593546, 'kl': 0.5078125, 'epoch': 0.13} 13%|█▎ | 577/4286 [3:19:57<19:00:00, 18.44s/it] 13%|█▎ | 578/4286 [3:20:14<18:31:05, 17.98s/it] {'loss': 0.0216, 'grad_norm': 1.152165829608336, 'learning_rate': 8.651423238450769e-07, 'completion_length': 130.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.3601190522313118, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.342262089252472, 'reward_std': 0.05909645278006792, 'kl': 0.5400390625, 'epoch': 0.13} 13%|█▎ | 578/4286 [3:20:14<18:31:05, 17.98s/it] 14%|█▎ | 579/4286 [3:20:30<17:57:03, 17.43s/it] {'loss': 0.0351, 'grad_norm': 2.137550062295508, 'learning_rate': 8.649090060662621e-07, 'completion_length': 
127.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.3392857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2857143878936768, 'reward_std': 0.19021038711071014, 'kl': 0.880859375, 'epoch': 0.14} 14%|█▎ | 579/4286 [3:20:30<17:57:03, 17.43s/it] 14%|█▎ | 580/4286 [3:20:45<17:15:02, 16.76s/it] {'loss': 0.0213, 'grad_norm': 1.0714676575117945, 'learning_rate': 8.646756882874474e-07, 'completion_length': 116.75000381469727, 'rewards/only_full_func_accuracy_reward': 0.364583358168602, 'rewards/format_reward': 1.0, 'reward': 1.3645833730697632, 'reward_std': 0.008928571827709675, 'kl': 0.5322265625, 'epoch': 0.14} 14%|█▎ | 580/4286 [3:20:45<17:15:02, 16.76s/it] 14%|█▎ | 581/4286 [3:21:03<17:38:04, 17.13s/it] {'loss': 0.0338, 'grad_norm': 5.623782521342871, 'learning_rate': 8.644423705086327e-07, 'completion_length': 127.32143020629883, 'rewards/only_full_func_accuracy_reward': 0.59226194024086, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5565477013587952, 'reward_std': 0.10630497708916664, 'kl': 0.845703125, 'epoch': 0.14} 14%|█▎ | 581/4286 [3:21:03<17:38:04, 17.13s/it] 14%|█▎ | 582/4286 [3:21:19<17:05:07, 16.61s/it] {'loss': 0.0317, 'grad_norm': 1.392066961244956, 'learning_rate': 8.642090527298179e-07, 'completion_length': 112.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.5059524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4880953431129456, 'reward_std': 0.0357142873108387, 'kl': 0.791015625, 'epoch': 0.14} 14%|█▎ | 582/4286 [3:21:19<17:05:07, 16.61s/it] 14%|█▎ | 583/4286 [3:21:34<16:33:02, 16.09s/it] {'loss': 0.0191, 'grad_norm': 1.6126455022374688, 'learning_rate': 8.639757349510031e-07, 'completion_length': 113.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5119047909975052, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.49404776096344, 'reward_std': 0.06888023018836975, 'kl': 0.4775390625, 'epoch': 0.14} 14%|█▎ | 583/4286 [3:21:34<16:33:02, 16.09s/it] 14%|█▎ | 
584/4286 [3:21:50<16:33:58, 16.11s/it] {'loss': 0.0229, 'grad_norm': 1.695524177095331, 'learning_rate': 8.637424171721885e-07, 'completion_length': 116.71429061889648, 'rewards/only_full_func_accuracy_reward': 0.3601190596818924, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3422619700431824, 'reward_std': 0.07503291219472885, 'kl': 0.5751953125, 'epoch': 0.14} 14%|█▎ | 584/4286 [3:21:50<16:33:58, 16.11s/it] 14%|█▎ | 585/4286 [3:22:06<16:39:08, 16.20s/it] {'loss': 0.0534, 'grad_norm': 4.160990992554599, 'learning_rate': 8.635090993933737e-07, 'completion_length': 128.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.4255952686071396, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3898810744285583, 'reward_std': 0.18334335088729858, 'kl': 1.333984375, 'epoch': 0.14} 14%|█▎ | 585/4286 [3:22:06<16:39:08, 16.20s/it] 14%|█▎ | 586/4286 [3:22:23<16:45:52, 16.31s/it] {'loss': 0.0224, 'grad_norm': 1.8701434268462762, 'learning_rate': 8.632757816145589e-07, 'completion_length': 123.51786422729492, 'rewards/only_full_func_accuracy_reward': 0.4196428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4017858505249023, 'reward_std': 0.1514887511730194, 'kl': 0.55859375, 'epoch': 0.14} 14%|█▎ | 586/4286 [3:22:23<16:45:52, 16.31s/it] 14%|█▎ | 587/4286 [3:22:38<16:16:42, 15.84s/it] {'loss': 0.0119, 'grad_norm': 1.3034249398724336, 'learning_rate': 8.630424638357442e-07, 'completion_length': 121.03572082519531, 'rewards/only_full_func_accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642858505249023, 'reward_std': 0.05951213277876377, 'kl': 0.298828125, 'epoch': 0.14} 14%|█▎ | 587/4286 [3:22:38<16:16:42, 15.84s/it] 14%|█▎ | 588/4286 [3:22:53<16:07:11, 15.69s/it] {'loss': 0.0125, 'grad_norm': 0.2936481874244874, 'learning_rate': 8.628091460569295e-07, 'completion_length': 130.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.4761905074119568, 'rewards/format_reward': 1.0, 'reward': 
1.4761905670166016, 'reward_std': 0.0, 'kl': 0.3134765625, 'epoch': 0.14} 14%|█▎ | 588/4286 [3:22:53<16:07:11, 15.69s/it] 14%|█▎ | 589/4286 [3:23:09<16:22:55, 15.95s/it] {'loss': 0.0238, 'grad_norm': 1.4129972341309371, 'learning_rate': 8.625758282781147e-07, 'completion_length': 118.96429061889648, 'rewards/only_full_func_accuracy_reward': 0.3779762238264084, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3601191639900208, 'reward_std': 0.10671550035476685, 'kl': 0.591796875, 'epoch': 0.14} 14%|█▎ | 589/4286 [3:23:09<16:22:55, 15.95s/it] 14%|█▍ | 590/4286 [3:23:25<16:13:32, 15.80s/it] {'loss': 0.0131, 'grad_norm': 0.6988322816871969, 'learning_rate': 8.623425104992999e-07, 'completion_length': 118.37500381469727, 'rewards/only_full_func_accuracy_reward': 0.5833333730697632, 'rewards/format_reward': 1.0, 'reward': 1.583333432674408, 'reward_std': 0.06185895949602127, 'kl': 0.328125, 'epoch': 0.14} 14%|█▍ | 590/4286 [3:23:25<16:13:32, 15.80s/it] 14%|█▍ | 591/4286 [3:23:42<16:38:21, 16.21s/it] {'loss': 0.0238, 'grad_norm': 1.7557718250961996, 'learning_rate': 8.621091927204852e-07, 'completion_length': 137.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.3318452537059784, 'rewards/format_reward': 1.0, 'reward': 1.3318453431129456, 'reward_std': 0.04900030791759491, 'kl': 0.5966796875, 'epoch': 0.14} 14%|█▍ | 591/4286 [3:23:42<16:38:21, 16.21s/it] 14%|█▍ | 592/4286 [3:23:58<16:36:16, 16.18s/it] {'loss': 0.0119, 'grad_norm': 2.16788944290105, 'learning_rate': 8.618758749416705e-07, 'completion_length': 142.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5193452835083008, 'rewards/format_reward': 1.0, 'reward': 1.5193453431129456, 'reward_std': 0.08301922678947449, 'kl': 0.2978515625, 'epoch': 0.14} 14%|█▍ | 592/4286 [3:23:58<16:36:16, 16.18s/it] 14%|█▍ | 593/4286 [3:24:14<16:32:52, 16.13s/it] {'loss': 0.0137, 'grad_norm': 1.3934476689969795, 'learning_rate': 8.616425571628557e-07, 'completion_length': 133.8214340209961, 
'rewards/only_full_func_accuracy_reward': 0.2648809775710106, 'rewards/format_reward': 1.0, 'reward': 1.2648810744285583, 'reward_std': 0.0297619067132473, 'kl': 0.341796875, 'epoch': 0.14} 14%|█▍ | 593/4286 [3:24:14<16:32:52, 16.13s/it] 14%|█▍ | 594/4286 [3:24:30<16:29:01, 16.07s/it] {'loss': 0.0172, 'grad_norm': 1.6444222056833147, 'learning_rate': 8.61409239384041e-07, 'completion_length': 125.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.4880952835083008, 'rewards/format_reward': 1.0, 'reward': 1.4880953431129456, 'reward_std': 0.06823870167136192, 'kl': 0.4287109375, 'epoch': 0.14} 14%|█▍ | 594/4286 [3:24:30<16:29:01, 16.07s/it] 14%|█▍ | 595/4286 [3:24:46<16:28:53, 16.08s/it] {'loss': 0.0127, 'grad_norm': 3.1032307658673752, 'learning_rate': 8.611759216052262e-07, 'completion_length': 131.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.409226194024086, 'rewards/format_reward': 1.0, 'reward': 1.4092262983322144, 'reward_std': 0.08563542738556862, 'kl': 0.3173828125, 'epoch': 0.14} 14%|█▍ | 595/4286 [3:24:46<16:28:53, 16.08s/it] 14%|█▍ | 596/4286 [3:25:03<16:45:55, 16.36s/it] {'loss': 0.0127, 'grad_norm': 1.4191374917468087, 'learning_rate': 8.609426038264115e-07, 'completion_length': 140.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.4508928954601288, 'rewards/format_reward': 1.0, 'reward': 1.450892984867096, 'reward_std': 0.05427858233451843, 'kl': 0.31640625, 'epoch': 0.14} 14%|█▍ | 596/4286 [3:25:03<16:45:55, 16.36s/it] 14%|█▍ | 597/4286 [3:25:21<17:02:46, 16.64s/it] {'loss': 0.0138, 'grad_norm': 4.335878027024507, 'learning_rate': 8.607092860475968e-07, 'completion_length': 138.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.2827381193637848, 'rewards/format_reward': 1.0, 'reward': 1.282738208770752, 'reward_std': 0.04166666977107525, 'kl': 0.3447265625, 'epoch': 0.14} 14%|█▍ | 597/4286 [3:25:21<17:02:46, 16.64s/it] 14%|█▍ | 598/4286 [3:25:38<17:22:29, 16.96s/it] {'loss': 0.0134, 'grad_norm': 
1.9095014758960978, 'learning_rate': 8.60475968268782e-07, 'completion_length': 143.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.4732142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4732144474983215, 'reward_std': 0.05884971283376217, 'kl': 0.3349609375, 'epoch': 0.14} 14%|█▍ | 598/4286 [3:25:38<17:22:29, 16.96s/it] 14%|█▍ | 599/4286 [3:25:54<17:00:23, 16.61s/it] {'loss': 0.0126, 'grad_norm': 0.8801057337024112, 'learning_rate': 8.602426504899672e-07, 'completion_length': 133.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.4032738208770752, 'rewards/format_reward': 1.0, 'reward': 1.4032739400863647, 'reward_std': 0.042458297684788704, 'kl': 0.3154296875, 'epoch': 0.14} 14%|█▍ | 599/4286 [3:25:54<17:00:23, 16.61s/it] 14%|█▍ | 600/4286 [3:26:10<16:51:55, 16.47s/it] {'loss': 0.0124, 'grad_norm': 1.7111631924830468, 'learning_rate': 8.600093327111526e-07, 'completion_length': 139.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.4181547909975052, 'rewards/format_reward': 1.0, 'reward': 1.4181548357009888, 'reward_std': 0.12360706552863121, 'kl': 0.310546875, 'epoch': 0.14} 14%|█▍ | 600/4286 [3:26:10<16:51:55, 16.47s/it] 14%|█▍ | 601/4286 [3:32:05<120:40:45, 117.90s/it] {'loss': 0.0123, 'grad_norm': 1.140615837146812, 'learning_rate': 8.597760149323378e-07, 'completion_length': 142.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642858505249023, 'reward_std': 0.029160594567656517, 'kl': 0.30859375, 'epoch': 0.14} 14%|█▍ | 601/4286 [3:32:05<120:40:45, 117.90s/it] 14%|█▍ | 602/4286 [3:32:21<89:24:47, 87.37s/it] {'loss': 0.0217, 'grad_norm': 1.021307727874241, 'learning_rate': 8.59542697153523e-07, 'completion_length': 133.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.4776785671710968, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4598215222358704, 'reward_std': 0.0565476194024086, 'kl': 0.54296875, 'epoch': 0.14} 14%|█▍ | 602/4286 
Per-step training metrics, steps 603-695 of 4286 (epochs 0.14-0.16, elapsed 3:32:38 to 4:01:44). Columns: len = completion_length, acc_rew = rewards/only_full_func_accuracy_reward, fmt_rew = rewards/format_reward, std = reward_std; values rounded to 4 decimals.

step  loss    grad_norm  lr          len     acc_rew  fmt_rew  reward  std     kl      epoch
603   0.0213   1.2425   8.5931e-07  130.30  0.4241   1.0000   1.4241  0.0387  0.5342  0.14
604   0.0473   3.0623   8.5908e-07  151.64  0.3080   0.9643   1.2723  0.1682  1.1816  0.14
605   0.0500   8.4626   8.5884e-07  155.52  0.4449   0.9286   1.3735  0.2134  1.2539  0.14
606   0.0722   4.0482   8.5861e-07  162.20  0.2961   0.9464   1.2426  0.1518  1.8047  0.14
607   0.0784   3.6174   8.5838e-07  162.46  0.3369   0.9643   1.3012  0.2508  1.9570  0.14
608   0.0447   2.2569   8.5814e-07  163.52  0.5131   0.9821   1.4952  0.1544  1.1191  0.14
609   0.0592  69.6159   8.5791e-07  164.00  0.3824   0.9464   1.3289  0.2611  1.4805  0.14
610   0.0355   2.4451   8.5768e-07  168.75  0.3095   1.0000   1.3095  0.2010  0.8848  0.14
611   0.0555   2.6466   8.5744e-07  175.52  0.2765   0.9821   1.2586  0.1314  1.3867  0.14
612   0.0563   4.9746   8.5721e-07  172.36  0.3851   0.9821   1.3673  0.1521  1.4102  0.14
613   0.0516   2.4010   8.5698e-07  177.00  0.3586   0.9643   1.3229  0.1869  1.2891  0.14
614   0.0465   3.4431   8.5674e-07  167.23  0.4613   0.9286   1.3899  0.2254  1.1641  0.14
615   0.0437   3.2334   8.5651e-07  150.39  0.3646   0.9464   1.3110  0.2180  1.0918  0.14
616   0.0351   3.1230   8.5628e-07  162.20  0.4807   0.9643   1.4449  0.2354  0.8770  0.14
617   0.0564   3.3415   8.5604e-07  166.29  0.3408   0.9464   1.2872  0.1800  1.4121  0.14
618   0.0584   3.9789   8.5581e-07  162.38  0.3943   0.9464   1.3408  0.2265  1.4609  0.14
619   0.0363   4.5028   8.5558e-07  156.68  0.5658   1.0000   1.5658  0.1238  0.9082  0.14
620   0.0214   2.6054   8.5534e-07  153.04  0.4740   0.9821   1.4561  0.1577  0.5352  0.14
621   0.0358   3.3561   8.5511e-07  166.86  0.3036   0.9286   1.2321  0.2564  0.8945  0.14
622   0.0490   3.6329   8.5488e-07  161.45  0.5301   0.9464   1.4765  0.2708  1.2266  0.15
623   0.0749   3.7895   8.5464e-07  158.50  0.5908   0.8750   1.4658  0.3287  1.8711  0.15
624   0.1522   6.2403   8.5441e-07  157.34  0.2409   0.7679   1.0088  0.4817  3.8047  0.15
625   0.1704   5.2349   8.5418e-07  156.38  0.3631   0.7500   1.1131  0.4348  4.2656  0.15
626   0.2503   8.0944   8.5394e-07  164.29  0.2753   0.7321   1.0074  0.5597  6.2500  0.15
627   0.2109   6.6065   8.5371e-07  157.21  0.2768   0.7500   1.0268  0.5545  5.2734  0.15
628   0.1648   6.1362   8.5348e-07  150.36  0.3259   0.7679   1.0938  0.5104  4.1094  0.15
629   0.0458   8.2644   8.5324e-07  144.30  0.3601   0.9464   1.3065  0.1807  1.1445  0.15
630   0.0799   3.6692   8.5301e-07  151.45  0.3363   0.9286   1.2649  0.2659  2.0000  0.15
631   0.0389  14.8068   8.5278e-07  139.75  0.2976   0.9464   1.2440  0.1976  0.9727  0.15
632   0.0335   2.0057   8.5254e-07  149.27  0.4063   0.9643   1.3705  0.1676  0.8379  0.15
633   0.0374   8.5182   8.5231e-07  150.05  0.3824   0.9464   1.3289  0.2278  0.9346  0.15
634   0.0569   2.6760   8.5208e-07  141.20  0.3824   0.9643   1.3467  0.1330  1.4258  0.15
635   0.0111   1.1263   8.5184e-07  141.09  0.3274   1.0000   1.3274  0.0238  0.2773  0.15
636   0.0126   2.5249   8.5161e-07  145.25  0.5208   1.0000   1.5208  0.0673  0.3154  0.15
637   0.0115   1.4275   8.5138e-07  144.75  0.3914   1.0000   1.3914  0.0703  0.2871  0.15
638   0.0124   1.8672   8.5114e-07  150.95  0.3824   1.0000   1.3824  0.0301  0.3096  0.15
639   0.0110   1.4851   8.5091e-07  177.70  0.3125   1.0000   1.3125  0.0677  0.2744  0.15
640   0.0110   6.1315   8.5068e-07  156.63  0.3542   1.0000   1.3542  0.0795  0.2744  0.15
641   0.0110   1.3447   8.5044e-07  171.32  0.4324   1.0000   1.4324  0.0840  0.2744  0.15
642   0.0159   5.3496   8.5021e-07  179.86  0.2887   0.9821   1.2708  0.0702  0.3975  0.15
643   0.0109   2.0520   8.4998e-07  181.89  0.3289   1.0000   1.3289  0.0747  0.2725  0.15
644   0.0205   2.6811   8.4974e-07  183.45  0.3438   1.0000   1.3438  0.1294  0.5127  0.15
645   0.0196   3.0487   8.4951e-07  165.34  0.4518   0.9821   1.4339  0.1592  0.4902  0.15
646   0.0150   2.1385   8.4928e-07  175.23  0.4494   0.9821   1.4315  0.1007  0.3745  0.15
647   0.0168   4.8239   8.4904e-07  188.39  0.4024   1.0000   1.4024  0.0840  0.4209  0.15
648   0.0280   1.7637   8.4881e-07  171.29  0.4717   0.9643   1.4360  0.1450  0.7012  0.15
649   0.0362   3.3763   8.4858e-07  174.91  0.4699   0.9643   1.4342  0.2483  0.9058  0.15
650   0.0207   6.3342   8.4834e-07  183.16  0.4827   1.0000   1.4827  0.0944  0.5176  0.15
651   0.0991   4.3776   8.4811e-07  164.75  0.2935   0.8929   1.1863  0.2546  2.4727  0.15
652   0.1073   6.0829   8.4788e-07  174.05  0.3259   0.8571   1.1830  0.2266  2.6875  0.15
653   0.0989   3.6250   8.4764e-07  165.38  0.3765   0.9107   1.2872  0.2409  2.4688  0.15
654   0.1053   4.6594   8.4741e-07  175.13  0.3884   0.8750   1.2634  0.2731  2.6367  0.15
655   0.0641   4.4972   8.4718e-07  174.73  0.4390   0.9464   1.3854  0.2441  1.6016  0.15
656   0.0819  10.0870   8.4694e-07  162.20  0.3571   0.8393   1.1964  0.3930  2.0469  0.15
657   0.0660   2.8031   8.4671e-07  163.43  0.4018   0.9464   1.3482  0.2243  1.6484  0.15
658   0.0742   5.7187   8.4648e-07  172.86  0.3586   0.9286   1.2872  0.2690  1.8555  0.15
659   0.0357   3.2650   8.4624e-07  145.20  0.4449   0.9821   1.4271  0.1646  0.8945  0.15
660   0.0390   4.1764   8.4601e-07  160.21  0.2693   0.9286   1.1979  0.2711  0.9766  0.15
661   0.0599   3.9619   8.4577e-07  166.07  0.3170   0.8929   1.2098  0.2951  1.4961  0.15
662   0.0427   2.4917   8.4554e-07  167.80  0.4330   0.9286   1.3616  0.1413  1.0645  0.15
663   0.0193   1.7413   8.4531e-07  151.41  0.3229   1.0000   1.3229  0.0565  0.4824  0.15
664   0.0123   0.8870   8.4508e-07  159.98  0.4345   1.0000   1.4345  0.0257  0.3086  0.15
665   0.0171   2.5890   8.4484e-07  160.84  0.4926   0.9821   1.4747  0.1438  0.4277  0.16
666   0.0151   2.3995   8.4461e-07  165.32  0.4732   0.9821   1.4554  0.1297  0.3770  0.16
667   0.0307   3.4202   8.4438e-07  154.07  0.4702   0.9643   1.4345  0.1759  0.7715  0.16
668   0.0162   1.7727   8.4414e-07  163.89  0.4315   1.0000   1.4315  0.1383  0.4053  0.16
669   0.0265   2.0274   8.4391e-07  153.95  0.4196   0.9643   1.3839  0.1689  0.6621  0.16
670   0.0148   1.3587   8.4368e-07  159.98  0.3676   1.0000   1.3676  0.0625  0.3711  0.16
671   0.0119  10.2915   8.4344e-07  157.34  0.4360   1.0000   1.4360  0.1626  0.2979  0.16
672   0.0243   2.9656   8.4321e-07  152.66  0.5744   0.9821   1.5565  0.1250  0.6074  0.16
673   0.0212   1.9568   8.4298e-07  144.84  0.3170   0.9821   1.2991  0.0972  0.5313  0.16
674   0.0298   3.1815   8.4274e-07  157.95  0.5446   0.9821   1.5268  0.1725  0.7471  0.16
675   0.0990   4.2362   8.4251e-07  164.46  0.4122   0.8750   1.2872  0.3053  2.4766  0.16
676   0.0821  10.1577   8.4228e-07  153.86  0.3512   0.8929   1.2440  0.2956  2.0508  0.16
677   0.0620   6.7403   8.4204e-07  160.64  0.3616   0.9107   1.2723  0.3123  1.5488  0.16
678   0.1009   7.4234   8.4181e-07  150.75  0.4345   0.9107   1.3452  0.2126  2.5195  0.16
679   0.0719   3.2728   8.4158e-07  145.21  0.6012   0.9464   1.5476  0.1741  1.8037  0.16
680   0.0638   2.7935   8.4134e-07  131.04  0.3824   0.9464   1.3289  0.1701  1.5977  0.16
681   0.0145   1.4438   8.4111e-07  156.04  0.5193   1.0000   1.5193  0.0625  0.3623  0.16
682   0.1345   3.6366   8.4088e-07  155.23  0.3304   0.8929   1.2232  0.3180  3.3594  0.16
683   0.0667   2.9720   8.4064e-07  147.04  0.4479   0.9464   1.3943  0.1875  1.6709  0.16
684   0.0111   0.9897   8.4041e-07  133.23  0.4851   1.0000   1.4851  0.0591  0.2783  0.16
685   0.0176   1.6927   8.4018e-07  134.63  0.5208   1.0000   1.5208  0.0435  0.4395  0.16
686   0.0103   0.7229   8.3994e-07  126.91  0.5714   1.0000   1.5714  0.0146  0.2568  0.16
687   0.0128   2.0556   8.3971e-07  159.25  0.4688   0.9821   1.4509  0.1124  0.3193  0.16
688   0.0105   1.3757   8.3948e-07  134.75  0.4940   1.0000   1.4940  0.0877  0.2622  0.16
689   0.0113   0.8007   8.3924e-07  131.41  0.5179   1.0000   1.5179  0.0758  0.2822  0.16
690   0.0123   2.2031   8.3901e-07  175.96  0.4033   1.0000   1.4033  0.0921  0.3066  0.16
691   0.0150   1.7148   8.3877e-07  155.82  0.3682   1.0000   1.3682  0.0655  0.3750  0.16
692   0.0105   1.4755   8.3854e-07  162.96  0.3720   1.0000   1.3720  0.0774  0.2622  0.16
693   0.0115   1.7038   8.3831e-07  149.16  0.4479   1.0000   1.4479  0.0988  0.2876  0.16
694   0.0298   1.9369   8.3808e-07  190.13  0.2440   0.9464   1.1905  0.1920  0.7461  0.16
695   0.0298   1.2666   8.3784e-07  (remainder of record truncated in log)

After step 681, DeepSpeed emitted:

[2025-03-02 09:05:08,406] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
'completion_length': 166.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.3779762089252472, 'rewards/format_reward': 1.0, 'reward': 1.3779762983322144, 'reward_std': 0.07882524654269218, 'kl': 0.744140625, 'epoch': 0.16} 16%|█▌ | 695/4286 [4:01:44<19:01:15, 19.07s/it] 16%|█▌ | 696/4286 [4:02:04<19:06:07, 19.16s/it] {'loss': 0.0243, 'grad_norm': 5.123590507174307, 'learning_rate': 8.37610825944937e-07, 'completion_length': 181.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.4806547909975052, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4449405670166016, 'reward_std': 0.1542729139328003, 'kl': 0.607421875, 'epoch': 0.16} 16%|█▌ | 696/4286 [4:02:04<19:06:07, 19.16s/it] 16%|█▋ | 697/4286 [4:02:21<18:41:03, 18.74s/it] {'loss': 0.0157, 'grad_norm': 1.6916012871218238, 'learning_rate': 8.373775081661222e-07, 'completion_length': 174.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.4895833730697632, 'rewards/format_reward': 1.0, 'reward': 1.489583432674408, 'reward_std': 0.0505952425301075, 'kl': 0.392578125, 'epoch': 0.16} 16%|█▋ | 697/4286 [4:02:21<18:41:03, 18.74s/it] 16%|█▋ | 698/4286 [4:02:42<19:09:43, 19.23s/it] {'loss': 0.0622, 'grad_norm': 3.2480724008392934, 'learning_rate': 8.371441903873074e-07, 'completion_length': 183.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.2604166865348816, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2068453431129456, 'reward_std': 0.24545500427484512, 'kl': 1.55078125, 'epoch': 0.16} 16%|█▋ | 698/4286 [4:02:42<19:09:43, 19.23s/it] 16%|█▋ | 699/4286 [4:03:03<19:41:32, 19.76s/it] {'loss': 0.0557, 'grad_norm': 3.8132887505851687, 'learning_rate': 8.369108726084927e-07, 'completion_length': 175.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.2857143133878708, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.1785714626312256, 'reward_std': 0.2820158749818802, 'kl': 1.39453125, 'epoch': 0.16} 16%|█▋ | 699/4286 [4:03:03<19:41:32, 19.76s/it] 16%|█▋ | 
700/4286 [4:03:25<20:32:50, 20.63s/it] {'loss': 0.0513, 'grad_norm': 3.359519797021155, 'learning_rate': 8.36677554829678e-07, 'completion_length': 167.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.4747024178504944, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4211310744285583, 'reward_std': 0.2489316165447235, 'kl': 1.28515625, 'epoch': 0.16} 16%|█▋ | 700/4286 [4:03:25<20:32:50, 20.63s/it] 16%|█▋ | 701/4286 [4:07:40<90:29:32, 90.87s/it] {'loss': 0.0301, 'grad_norm': 2.932199243412937, 'learning_rate': 8.364442370508632e-07, 'completion_length': 183.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.3943452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.376488208770752, 'reward_std': 0.10692917555570602, 'kl': 0.7548828125, 'epoch': 0.16} 16%|█▋ | 701/4286 [4:07:40<90:29:32, 90.87s/it] 16%|█▋ | 702/4286 [4:08:00<69:13:29, 69.53s/it] {'loss': 0.0272, 'grad_norm': 2.084792922918068, 'learning_rate': 8.362109192720485e-07, 'completion_length': 173.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.5059524029493332, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4523810744285583, 'reward_std': 0.2261904813349247, 'kl': 0.6796875, 'epoch': 0.16} 16%|█▋ | 702/4286 [4:08:00<69:13:29, 69.53s/it] 16%|█▋ | 703/4286 [4:08:21<54:42:12, 54.96s/it] {'loss': 0.0329, 'grad_norm': 5.856065936291708, 'learning_rate': 8.359776014932337e-07, 'completion_length': 184.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.3509920835494995, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3152778148651123, 'reward_std': 0.15666977316141129, 'kl': 0.826171875, 'epoch': 0.16} 16%|█▋ | 703/4286 [4:08:21<54:42:12, 54.96s/it] 16%|█▋ | 704/4286 [4:08:39<43:38:52, 43.87s/it] {'loss': 0.015, 'grad_norm': 1.10447480249891, 'learning_rate': 8.357442837144189e-07, 'completion_length': 160.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.510416716337204, 'rewards/format_reward': 1.0, 'reward': 
1.5104167461395264, 'reward_std': 0.04053215403109789, 'kl': 0.37548828125, 'epoch': 0.16} 16%|█▋ | 704/4286 [4:08:39<43:38:52, 43.87s/it] 16%|█▋ | 705/4286 [4:09:00<36:47:38, 36.99s/it] {'loss': 0.0632, 'grad_norm': 2.984736760598273, 'learning_rate': 8.355109659356042e-07, 'completion_length': 166.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.49511057138442993, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4415391683578491, 'reward_std': 0.21193456649780273, 'kl': 1.578125, 'epoch': 0.16} 16%|█▋ | 705/4286 [4:09:00<36:47:38, 36.99s/it] 16%|█▋ | 706/4286 [4:09:17<30:57:15, 31.13s/it] {'loss': 0.0291, 'grad_norm': 4.010146842358846, 'learning_rate': 8.352776481567895e-07, 'completion_length': 141.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.5000000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.130952388048172, 'kl': 0.7265625, 'epoch': 0.16} 16%|█▋ | 706/4286 [4:09:17<30:57:15, 31.13s/it] 16%|█▋ | 707/4286 [4:09:39<28:13:16, 28.39s/it] {'loss': 0.0626, 'grad_norm': 4.878086030050801, 'learning_rate': 8.350443303779747e-07, 'completion_length': 170.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.3422619253396988, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2529762387275696, 'reward_std': 0.25140121579170227, 'kl': 1.56640625, 'epoch': 0.16} 16%|█▋ | 707/4286 [4:09:39<28:13:16, 28.39s/it] 17%|█▋ | 708/4286 [4:09:59<25:41:24, 25.85s/it] {'loss': 0.0674, 'grad_norm': 3.8680113146946735, 'learning_rate': 8.348110125991599e-07, 'completion_length': 165.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.392857313156128, 'reward_std': 0.349436953663826, 'kl': 1.68359375, 'epoch': 0.17} 17%|█▋ | 708/4286 [4:09:59<25:41:24, 25.85s/it][2025-03-02 09:17:34,730] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. 
this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 17%|█▋ | 709/4286 [4:10:19<23:50:57, 24.00s/it] {'loss': 0.16, 'grad_norm': 6.117274406628632, 'learning_rate': 8.345776948203453e-07, 'completion_length': 137.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.424107164144516, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.2812501192092896, 'reward_std': 0.42171579599380493, 'kl': 3.9921875, 'epoch': 0.17} 17%|█▋ | 709/4286 [4:10:19<23:50:57, 24.00s/it] 17%|█▋ | 710/4286 [4:10:37<22:14:10, 22.39s/it] {'loss': 0.2065, 'grad_norm': 8.552547509649248, 'learning_rate': 8.343443770415305e-07, 'completion_length': 163.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.4247024357318878, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.228273868560791, 'reward_std': 0.4589183181524277, 'kl': 5.15625, 'epoch': 0.17} 17%|█▋ | 710/4286 [4:10:37<22:14:10, 22.39s/it] 17%|█▋ | 711/4286 [4:10:56<21:10:42, 21.33s/it] {'loss': 0.2897, 'grad_norm': 13.600970241409167, 'learning_rate': 8.341110592627157e-07, 'completion_length': 151.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.3422619253396988, 'rewards/format_reward': 0.785714328289032, 'reward': 1.1279762983322144, 'reward_std': 0.4258918762207031, 'kl': 7.234375, 'epoch': 0.17} 17%|█▋ | 711/4286 [4:10:56<21:10:42, 21.33s/it] 17%|█▋ | 712/4286 [4:11:19<21:41:34, 21.85s/it] {'loss': 0.1473, 'grad_norm': 4.3784738565842956, 'learning_rate': 8.33877741483901e-07, 'completion_length': 160.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.3869047611951828, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.2440477013587952, 'reward_std': 0.29384560137987137, 'kl': 
3.6796875, 'epoch': 0.17} 17%|█▋ | 712/4286 [4:11:19<21:41:34, 21.85s/it] 17%|█▋ | 713/4286 [4:11:37<20:33:43, 20.72s/it] {'loss': 0.1252, 'grad_norm': 4.210747609513753, 'learning_rate': 8.336444237050863e-07, 'completion_length': 135.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.5089286267757416, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4375001192092896, 'reward_std': 0.2618817761540413, 'kl': 3.125, 'epoch': 0.17} 17%|█▋ | 713/4286 [4:11:37<20:33:43, 20.72s/it] 17%|█▋ | 714/4286 [4:11:54<19:26:33, 19.59s/it] {'loss': 0.0846, 'grad_norm': 2.2330727392578447, 'learning_rate': 8.334111059262715e-07, 'completion_length': 142.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.412202388048172, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3586310148239136, 'reward_std': 0.21315120160579681, 'kl': 2.1171875, 'epoch': 0.17} 17%|█▋ | 714/4286 [4:11:54<19:26:33, 19.59s/it] 17%|█▋ | 715/4286 [4:12:12<18:41:31, 18.84s/it] {'loss': 0.0486, 'grad_norm': 2.4121089886276597, 'learning_rate': 8.331777881474568e-07, 'completion_length': 145.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.40089288353919983, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3830358386039734, 'reward_std': 0.06621909886598587, 'kl': 1.216796875, 'epoch': 0.17} 17%|█▋ | 715/4286 [4:12:12<18:41:31, 18.84s/it] 17%|█▋ | 716/4286 [4:12:31<19:00:04, 19.16s/it] {'loss': 0.0491, 'grad_norm': 4.973484191029984, 'learning_rate': 8.32944470368642e-07, 'completion_length': 158.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5615079700946808, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5436508059501648, 'reward_std': 0.2263459712266922, 'kl': 1.2265625, 'epoch': 0.17} 17%|█▋ | 716/4286 [4:12:31<19:00:04, 19.16s/it] 17%|█▋ | 717/4286 [4:12:49<18:29:02, 18.64s/it] {'loss': 0.0344, 'grad_norm': 3.4650386788964416, 'learning_rate': 8.327111525898273e-07, 'completion_length': 146.0178680419922, 
'rewards/only_full_func_accuracy_reward': 0.4315476417541504, 'rewards/format_reward': 1.0, 'reward': 1.4315477013587952, 'reward_std': 0.05792887508869171, 'kl': 0.859375, 'epoch': 0.17} 17%|█▋ | 717/4286 [4:12:49<18:29:02, 18.64s/it] 17%|█▋ | 718/4286 [4:13:07<18:16:16, 18.44s/it] {'loss': 0.0635, 'grad_norm': 3.681068037244331, 'learning_rate': 8.324778348110125e-07, 'completion_length': 133.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.3407738208770752, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3050596117973328, 'reward_std': 0.1648928001523018, 'kl': 1.5859375, 'epoch': 0.17} 17%|█▋ | 718/4286 [4:13:07<18:16:16, 18.44s/it] 17%|█▋ | 719/4286 [4:13:23<17:43:22, 17.89s/it] {'loss': 0.0685, 'grad_norm': 3.5314297960722585, 'learning_rate': 8.322445170321978e-07, 'completion_length': 145.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.41964291036129, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.383928656578064, 'reward_std': 0.24345649033784866, 'kl': 1.71484375, 'epoch': 0.17} 17%|█▋ | 719/4286 [4:13:23<17:43:22, 17.89s/it] 17%|█▋ | 720/4286 [4:13:41<17:44:01, 17.90s/it] {'loss': 0.078, 'grad_norm': 8.5475079293021, 'learning_rate': 8.32011199253383e-07, 'completion_length': 141.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.4494048058986664, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.395833432674408, 'reward_std': 0.263979010283947, 'kl': 1.95703125, 'epoch': 0.17} 17%|█▋ | 720/4286 [4:13:41<17:44:01, 17.90s/it] 17%|█▋ | 721/4286 [4:13:58<17:29:14, 17.66s/it] {'loss': 0.0731, 'grad_norm': 2.8092224863496993, 'learning_rate': 8.317778814745683e-07, 'completion_length': 130.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3392857313156128, 'reward_std': 0.15157204493880272, 'kl': 1.82421875, 'epoch': 0.17} 17%|█▋ | 721/4286 [4:13:58<17:29:14, 17.66s/it] 17%|█▋ | 722/4286 [4:14:20<18:42:19, 18.89s/it] 
{'loss': 0.101, 'grad_norm': 2.955095481065562, 'learning_rate': 8.315445636957536e-07, 'completion_length': 147.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.351190485060215, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2797619700431824, 'reward_std': 0.1701783835887909, 'kl': 2.5234375, 'epoch': 0.17} 17%|█▋ | 722/4286 [4:14:20<18:42:19, 18.89s/it] 17%|█▋ | 723/4286 [4:14:42<19:24:44, 19.61s/it] {'loss': 0.102, 'grad_norm': 3.8057769881573846, 'learning_rate': 8.313112459169388e-07, 'completion_length': 143.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.43660716712474823, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3651787042617798, 'reward_std': 0.2692810893058777, 'kl': 2.546875, 'epoch': 0.17} 17%|█▋ | 723/4286 [4:14:42<19:24:44, 19.61s/it] 17%|█▋ | 724/4286 [4:14:59<18:41:12, 18.89s/it] {'loss': 0.0552, 'grad_norm': 4.744162995327748, 'learning_rate': 8.31077928138124e-07, 'completion_length': 127.33929061889648, 'rewards/only_full_func_accuracy_reward': 0.5029762089252472, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4672620296478271, 'reward_std': 0.1923178806900978, 'kl': 1.3828125, 'epoch': 0.17} 17%|█▋ | 724/4286 [4:14:59<18:41:12, 18.89s/it] 17%|█▋ | 725/4286 [4:15:20<19:18:53, 19.53s/it] {'loss': 0.1188, 'grad_norm': 1.8271459562024508, 'learning_rate': 8.308446103593094e-07, 'completion_length': 153.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.4002976417541504, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3110119700431824, 'reward_std': 0.28507326543331146, 'kl': 2.96875, 'epoch': 0.17} 17%|█▋ | 725/4286 [4:15:20<19:18:53, 19.53s/it] 17%|█▋ | 726/4286 [4:15:44<20:43:59, 20.97s/it] {'loss': 0.0405, 'grad_norm': 5.4477844288858215, 'learning_rate': 8.306112925804946e-07, 'completion_length': 161.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.3571428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3035715818405151, 'reward_std': 
0.15785609185695648, 'kl': 1.0126953125, 'epoch': 0.17} 17%|█▋ | 726/4286 [4:15:44<20:43:59, 20.97s/it] 17%|█▋ | 727/4286 [4:16:04<20:21:33, 20.59s/it] {'loss': 0.0802, 'grad_norm': 3.1703460370430165, 'learning_rate': 8.303779748016798e-07, 'completion_length': 162.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.5104166865348816, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.438988208770752, 'reward_std': 0.1629994958639145, 'kl': 2.00390625, 'epoch': 0.17} 17%|█▋ | 727/4286 [4:16:04<20:21:33, 20.59s/it] 17%|█▋ | 728/4286 [4:16:29<21:51:15, 22.11s/it] {'loss': 0.1681, 'grad_norm': 156.14908243688583, 'learning_rate': 8.30144657022865e-07, 'completion_length': 180.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.3321428745985031, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.1892857551574707, 'reward_std': 0.3373855799436569, 'kl': 4.201171875, 'epoch': 0.17} 17%|█▋ | 728/4286 [4:16:29<21:51:15, 22.11s/it] 17%|█▋ | 729/4286 [4:16:56<23:12:32, 23.49s/it] {'loss': 0.2194, 'grad_norm': 8.006447163043974, 'learning_rate': 8.299113392440503e-07, 'completion_length': 171.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.2574404925107956, 'rewards/format_reward': 0.7678571939468384, 'reward': 1.0252977013587952, 'reward_std': 0.43576426804065704, 'kl': 5.5, 'epoch': 0.17} 17%|█▋ | 729/4286 [4:16:56<23:12:32, 23.49s/it] 17%|█▋ | 730/4286 [4:17:18<22:46:10, 23.05s/it] {'loss': 0.4247, 'grad_norm': 16.75412620072321, 'learning_rate': 8.296780214652356e-07, 'completion_length': 143.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.3184524178504944, 'rewards/format_reward': 0.6071428954601288, 'reward': 0.9255953133106232, 'reward_std': 0.8025565445423126, 'kl': 10.609375, 'epoch': 0.17} 17%|█▋ | 730/4286 [4:17:18<22:46:10, 23.05s/it] 17%|█▋ | 731/4286 [4:17:41<22:40:17, 22.96s/it] {'loss': 0.5072, 'grad_norm': 22.97467685350545, 'learning_rate': 8.294447036864208e-07, 'completion_length': 149.30358123779297, 
'rewards/only_full_func_accuracy_reward': 0.1502976305782795, 'rewards/format_reward': 0.5000000298023224, 'reward': 0.6502976715564728, 'reward_std': 0.5528820157051086, 'kl': 12.6875, 'epoch': 0.17} 17%|█▋ | 731/4286 [4:17:41<22:40:17, 22.96s/it] 17%|█▋ | 732/4286 [4:18:06<23:21:35, 23.66s/it] {'loss': 0.3778, 'grad_norm': 11.006314525205656, 'learning_rate': 8.292113859076061e-07, 'completion_length': 186.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.3080357313156128, 'rewards/format_reward': 0.5892857313156128, 'reward': 0.8973215222358704, 'reward_std': 0.5958743840456009, 'kl': 9.4375, 'epoch': 0.17} 17%|█▋ | 732/4286 [4:18:06<23:21:35, 23.66s/it] 17%|█▋ | 733/4286 [4:18:31<23:47:23, 24.10s/it] {'loss': 0.3173, 'grad_norm': 10.79989958781451, 'learning_rate': 8.289780681287913e-07, 'completion_length': 164.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.2247024103999138, 'rewards/format_reward': 0.660714328289032, 'reward': 0.885416716337204, 'reward_std': 0.6179159879684448, 'kl': 7.9375, 'epoch': 0.17} 17%|█▋ | 733/4286 [4:18:31<23:47:23, 24.10s/it][2025-03-02 09:26:12,809] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 17%|█▋ | 734/4286 [4:18:57<24:13:04, 24.55s/it] {'loss': 0.1199, 'grad_norm': 4.950644584112353, 'learning_rate': 8.287447503499766e-07, 'completion_length': 191.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.3976190611720085, 'rewards/format_reward': 0.8214285969734192, 'reward': 1.2190477848052979, 'reward_std': 0.4003375470638275, 'kl': 3.0, 'epoch': 0.17} 17%|█▋ | 734/4286 [4:18:57<24:13:04, 24.55s/it] 17%|█▋ | 735/4286 [4:19:13<21:46:52, 22.08s/it] {'loss': 0.1464, 'grad_norm': 3.485894510016048, 'learning_rate': 8.285114325711619e-07, 'completion_length': 140.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.2827381193637848, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.12202388048172, 'reward_std': 0.42457902431488037, 'kl': 3.6640625, 'epoch': 0.17} 17%|█▋ | 735/4286 [4:19:13<21:46:52, 22.08s/it] 17%|█▋ | 736/4286 [4:19:34<21:15:52, 21.56s/it] {'loss': 0.0517, 'grad_norm': 4.3590203659475755, 'learning_rate': 8.282781147923471e-07, 'completion_length': 164.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.479166716337204, 'rewards/format_reward': 0.892857164144516, 'reward': 1.37202388048172, 'reward_std': 0.29426996409893036, 'kl': 1.291015625, 'epoch': 0.17} 17%|█▋ | 736/4286 [4:19:34<21:15:52, 21.56s/it] 17%|█▋ | 737/4286 [4:19:52<20:15:18, 20.55s/it] {'loss': 0.0708, 'grad_norm': 4.090087532015043, 'learning_rate': 8.280447970135323e-07, 'completion_length': 155.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.500000074505806, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4285715222358704, 'reward_std': 0.22483553737401962, 'kl': 1.76953125, 'epoch': 0.17} 17%|█▋ | 737/4286 [4:19:52<20:15:18, 20.55s/it] 17%|█▋ | 738/4286 [4:20:09<19:15:30, 19.54s/it] {'loss': 0.0098, 'grad_norm': 0.9500695871598631, 
'learning_rate': 8.278114792347177e-07, 'completion_length': 170.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.035714282654225826, 'kl': 0.24609375, 'epoch': 0.17} 17%|█▋ | 738/4286 [4:20:09<19:15:30, 19.54s/it] 17%|█▋ | 739/4286 [4:20:28<19:10:02, 19.45s/it] {'loss': 0.026, 'grad_norm': 6292.917161706522, 'learning_rate': 8.275781614559029e-07, 'completion_length': 151.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.4598214626312256, 'rewards/format_reward': 1.0, 'reward': 1.4598215818405151, 'reward_std': 0.061151803471148014, 'kl': 0.6484375, 'epoch': 0.17} 17%|█▋ | 739/4286 [4:20:28<19:10:02, 19.45s/it] 17%|█▋ | 740/4286 [4:20:46<18:31:39, 18.81s/it] {'loss': 0.0092, 'grad_norm': 1.7765372382014657, 'learning_rate': 8.273448436770881e-07, 'completion_length': 151.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.4821428507566452, 'rewards/format_reward': 1.0, 'reward': 1.482142984867096, 'reward_std': 0.0503815608099103, 'kl': 0.23046875, 'epoch': 0.17} 17%|█▋ | 740/4286 [4:20:46<18:31:39, 18.81s/it] 17%|█▋ | 741/4286 [4:21:06<19:08:46, 19.44s/it] {'loss': 0.0241, 'grad_norm': 2.277527364660695, 'learning_rate': 8.271115258982733e-07, 'completion_length': 163.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.376488134264946, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3050596714019775, 'reward_std': 0.1346314661204815, 'kl': 0.60205078125, 'epoch': 0.17} 17%|█▋ | 741/4286 [4:21:06<19:08:46, 19.44s/it] 17%|█▋ | 742/4286 [4:21:28<19:45:10, 20.06s/it] {'loss': 0.0345, 'grad_norm': 2.5801007520970676, 'learning_rate': 8.268782081194587e-07, 'completion_length': 188.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.3845982551574707, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.33102685213089, 'reward_std': 0.21649043262004852, 'kl': 0.86279296875, 'epoch': 0.17} 17%|█▋ | 742/4286 
[4:21:28<19:45:10, 20.06s/it] 17%|█▋ | 743/4286 [4:21:50<20:27:02, 20.78s/it] {'loss': 0.0535, 'grad_norm': 2.864205985757053, 'learning_rate': 8.266448903406439e-07, 'completion_length': 200.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.24910718202590942, 'rewards/format_reward': 0.8214285969734192, 'reward': 1.070535808801651, 'reward_std': 0.19985748082399368, 'kl': 1.3349609375, 'epoch': 0.17} 17%|█▋ | 743/4286 [4:21:50<20:27:02, 20.78s/it] 17%|█▋ | 744/4286 [4:22:09<19:50:11, 20.16s/it] {'loss': 0.0224, 'grad_norm': 10.714563557794284, 'learning_rate': 8.264115725618291e-07, 'completion_length': 176.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.4616071730852127, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4080358147621155, 'reward_std': 0.12736738473176956, 'kl': 0.560546875, 'epoch': 0.17} 17%|█▋ | 744/4286 [4:22:09<19:50:11, 20.16s/it] 17%|█▋ | 745/4286 [4:22:33<20:55:35, 21.28s/it] {'loss': 0.0614, 'grad_norm': 7.843125790980134, 'learning_rate': 8.261782547830144e-07, 'completion_length': 193.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.3169642984867096, 'rewards/format_reward': 0.8571429252624512, 'reward': 1.1741071939468384, 'reward_std': 0.3714907467365265, 'kl': 1.5390625, 'epoch': 0.17} 17%|█▋ | 745/4286 [4:22:33<20:55:35, 21.28s/it] 17%|█▋ | 746/4286 [4:22:51<19:52:11, 20.21s/it] {'loss': 0.0314, 'grad_norm': 3.324005107723497, 'learning_rate': 8.259449370041997e-07, 'completion_length': 167.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.4419643133878708, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4062500596046448, 'reward_std': 0.17723165452480316, 'kl': 0.783203125, 'epoch': 0.17} 17%|█▋ | 746/4286 [4:22:51<19:52:11, 20.21s/it] 17%|█▋ | 747/4286 [4:23:13<20:31:02, 20.87s/it] {'loss': 0.0521, 'grad_norm': 5.158235631809322, 'learning_rate': 8.257116192253849e-07, 'completion_length': 177.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.4559524357318878, 
'rewards/format_reward': 0.9285714626312256, 'reward': 1.3845239281654358, 'reward_std': 0.17994464933872223, 'kl': 1.30078125, 'epoch': 0.17} 17%|█▋ | 747/4286 [4:23:13<20:31:02, 20.87s/it] 17%|█▋ | 748/4286 [4:23:35<20:43:20, 21.09s/it] {'loss': 0.0737, 'grad_norm': 4.418226240972418, 'learning_rate': 8.254783014465702e-07, 'completion_length': 174.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.3333333730697632, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.279762089252472, 'reward_std': 0.15824005007743835, 'kl': 1.83984375, 'epoch': 0.17} 17%|█▋ | 748/4286 [4:23:35<20:43:20, 21.09s/it] 17%|█▋ | 749/4286 [4:23:53<19:55:09, 20.27s/it] {'loss': 0.0496, 'grad_norm': 5.167101721442638, 'learning_rate': 8.252449836677554e-07, 'completion_length': 153.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.4568452686071396, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.40327388048172, 'reward_std': 0.24842369556427002, 'kl': 1.23828125, 'epoch': 0.17} 17%|█▋ | 749/4286 [4:23:53<19:55:09, 20.27s/it] 17%|█▋ | 750/4286 [4:24:17<20:50:45, 21.22s/it] {'loss': 0.0562, 'grad_norm': 4.512381739133486, 'learning_rate': 8.250116658889406e-07, 'completion_length': 166.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.2738095372915268, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2202381491661072, 'reward_std': 0.15669919177889824, 'kl': 1.40234375, 'epoch': 0.17} 17%|█▋ | 750/4286 [4:24:17<20:50:45, 21.22s/it] 18%|█▊ | 751/4286 [4:24:42<22:02:15, 22.44s/it] {'loss': 0.0438, 'grad_norm': 4.425860905068287, 'learning_rate': 8.247783481101259e-07, 'completion_length': 178.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.4627976715564728, 'rewards/format_reward': 0.910714328289032, 'reward': 1.373512089252472, 'reward_std': 0.2674119621515274, 'kl': 1.09765625, 'epoch': 0.18} 18%|█▊ | 751/4286 [4:24:42<22:02:15, 22.44s/it] 18%|█▊ | 752/4286 [4:24:59<20:36:10, 20.99s/it] {'loss': 0.0312, 'grad_norm': 
6.553111278134066, 'learning_rate': 8.245450303313112e-07, 'completion_length': 153.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.500000074505806, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.482142984867096, 'reward_std': 0.1989988461136818, 'kl': 0.78125, 'epoch': 0.18} 18%|█▊ | 752/4286 [4:24:59<20:36:10, 20.99s/it]
18%|█▊ | 753/4286 [4:25:16<19:21:44, 19.73s/it] {'loss': 0.0102, 'grad_norm': 2.5690621695875464, 'learning_rate': 8.243117125524964e-07, 'completion_length': 142.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.4062500149011612, 'rewards/format_reward': 1.0, 'reward': 1.4062501192092896, 'reward_std': 0.07066834159195423, 'kl': 0.2548828125, 'epoch': 0.18}
18%|█▊ | 754/4286 [4:25:34<18:54:35, 19.27s/it] {'loss': 0.0142, 'grad_norm': 3.0742123183320325, 'learning_rate': 8.240783947736816e-07, 'completion_length': 165.08928680419922, 'rewards/only_full_func_accuracy_reward': 0.3973214626312256, 'rewards/format_reward': 1.0, 'reward': 1.3973215818405151, 'reward_std': 0.04740537330508232, 'kl': 0.35546875, 'epoch': 0.18}
18%|█▊ | 755/4286 [4:25:52<18:28:35, 18.84s/it] {'loss': 0.0251, 'grad_norm': 2.935800912059479, 'learning_rate': 8.23845076994867e-07, 'completion_length': 154.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5639881193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5461310744285583, 'reward_std': 0.14434950053691864, 'kl': 0.6279296875, 'epoch': 0.18}
18%|█▊ | 756/4286 [4:26:10<18:10:06, 18.53s/it] {'loss': 0.0156, 'grad_norm': 3.419775076127813, 'learning_rate': 8.236117592160522e-07, 'completion_length': 157.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5044643431901932, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4866072535514832, 'reward_std': 0.14264129474759102, 'kl': 0.390625, 'epoch': 0.18}
18%|█▊ | 757/4286 [4:26:28<17:50:48, 18.21s/it] {'loss': 0.0174, 'grad_norm': 1.242280653984295, 'learning_rate': 8.233784414372374e-07, 'completion_length': 137.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5550595819950104, 'rewards/format_reward': 1.0, 'reward': 1.5550596714019775, 'reward_std': 0.04145298898220062, 'kl': 0.435546875, 'epoch': 0.18}
18%|█▊ | 758/4286 [4:26:45<17:34:08, 17.93s/it] {'loss': 0.0227, 'grad_norm': 2.4902770509348575, 'learning_rate': 8.231451236584227e-07, 'completion_length': 154.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.3988095670938492, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3630953431129456, 'reward_std': 0.1401907280087471, 'kl': 0.5703125, 'epoch': 0.18}
18%|█▊ | 759/4286 [4:27:04<18:02:47, 18.42s/it] {'loss': 0.0862, 'grad_norm': 7.7361483997972424, 'learning_rate': 8.22911805879608e-07, 'completion_length': 157.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.352678582072258, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2991071939468384, 'reward_std': 0.16357829421758652, 'kl': 2.15625, 'epoch': 0.18}
18%|█▊ | 760/4286 [4:27:20<17:20:11, 17.70s/it] {'loss': 0.0632, 'grad_norm': 6.8940311159225685, 'learning_rate': 8.226784881007932e-07, 'completion_length': 122.0535774230957, 'rewards/only_full_func_accuracy_reward': 0.4196428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4196429252624512, 'reward_std': 0.13127360492944717, 'kl': 1.578125, 'epoch': 0.18}
[2025-03-02 09:34:57,591] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
18%|█▊ | 761/4286 [4:27:42<18:23:39, 18.79s/it] {'loss': 0.0706, 'grad_norm': 7.630277687741993, 'learning_rate': 8.224451703219785e-07, 'completion_length': 145.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.339285746216774, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2678572535514832, 'reward_std': 0.1561436504125595, 'kl': 1.76953125, 'epoch': 0.18}
18%|█▊ | 762/4286 [4:28:00<18:20:54, 18.74s/it] {'loss': 0.0491, 'grad_norm': 5.1369184675211645, 'learning_rate': 8.222118525431637e-07, 'completion_length': 132.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.4077381193637848, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3720239400863647, 'reward_std': 0.1784486398100853, 'kl': 1.2265625, 'epoch': 0.18}
18%|█▊ | 763/4286 [4:28:20<18:33:23, 18.96s/it] {'loss': 0.0708, 'grad_norm': 4.410176116010069, 'learning_rate': 8.21978534764349e-07, 'completion_length': 137.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.3586309850215912, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2872024178504944, 'reward_std': 0.16556335240602493, 'kl': 1.765625, 'epoch': 0.18}
18%|█▊ | 764/4286 [4:28:38<18:12:13, 18.61s/it] {'loss': 0.0801, 'grad_norm': 7.940525000116736, 'learning_rate': 8.217452169855342e-07, 'completion_length': 150.83928680419922, 'rewards/only_full_func_accuracy_reward': 0.2752976417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2395834922790527, 'reward_std': 0.1700134500861168, 'kl': 2.0078125, 'epoch': 0.18}
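The DeepSpeed warning above suggests adding `get_accelerator().empty_cache()` calls to the training loop so that all ranks flush their allocator caches at the same step. A minimal sketch of that suggestion follows; the interval `EMPTY_CACHE_EVERY` and the `training_loop`/`train_step` names are hypothetical, and a stub accelerator is substituted when DeepSpeed is not importable so the sketch stays self-contained.

```python
# Sketch of the mitigation named in the warning: periodic, step-aligned cache
# flushes on every rank. Assumes DeepSpeed's accelerator abstraction; a no-op
# stub is used here if deepspeed is unavailable.
try:
    from deepspeed.accelerator import get_accelerator
except Exception:
    class _StubAccelerator:
        def empty_cache(self):
            pass  # stand-in for the real allocator-cache flush

    def get_accelerator():
        return _StubAccelerator()

EMPTY_CACHE_EVERY = 50  # hypothetical interval; tune to observed memory pressure


def training_loop(num_steps, train_step):
    for step in range(1, num_steps + 1):
        train_step(step)
        # Every rank reaches this point at the same step count, so the
        # flush happens on all ranks together, as the warning recommends.
        if step % EMPTY_CACHE_EVERY == 0:
            get_accelerator().empty_cache()
```

Because the flush is keyed on the shared step counter rather than on local memory pressure, ranks stay in sync and no rank stalls waiting on another's allocator.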
18%|█▊ | 765/4286 [4:28:55<17:53:14, 18.29s/it] {'loss': 0.0415, 'grad_norm': 3.915650776279664, 'learning_rate': 8.215118992067195e-07, 'completion_length': 139.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.4077381193637848, 'rewards/format_reward': 1.0, 'reward': 1.407738208770752, 'reward_std': 0.09070003405213356, 'kl': 1.037109375, 'epoch': 0.18}
18%|█▊ | 766/4286 [4:29:16<18:39:34, 19.08s/it] {'loss': 0.0468, 'grad_norm': 3.2594215179208628, 'learning_rate': 8.212785814279047e-07, 'completion_length': 141.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4315476566553116, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3958333730697632, 'reward_std': 0.16808098927140236, 'kl': 1.169921875, 'epoch': 0.18}
18%|█▊ | 767/4286 [4:29:33<18:00:25, 18.42s/it] {'loss': 0.0212, 'grad_norm': 5.609152930993468, 'learning_rate': 8.2104526364909e-07, 'completion_length': 145.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.3571428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3392857313156128, 'reward_std': 0.0952381081879139, 'kl': 0.5302734375, 'epoch': 0.18}
18%|█▊ | 768/4286 [4:29:49<17:10:04, 17.57s/it] {'loss': 0.0257, 'grad_norm': 2.611488063712958, 'learning_rate': 8.208119458702753e-07, 'completion_length': 124.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.3824404925107956, 'rewards/format_reward': 1.0, 'reward': 1.3824405670166016, 'reward_std': 0.05495268478989601, 'kl': 0.642578125, 'epoch': 0.18}
[2025-03-02 09:37:21,051] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
18%|█▊ | 769/4286 [4:30:05<16:53:13, 17.29s/it] {'loss': 0.019, 'grad_norm': 0.8631144310997783, 'learning_rate': 8.205786280914605e-07, 'completion_length': 134.07143783569336, 'rewards/only_full_func_accuracy_reward': 0.4315476417541504, 'rewards/format_reward': 1.0, 'reward': 1.4315477013587952, 'reward_std': 0.01785714365541935, 'kl': 0.474609375, 'epoch': 0.18}
18%|█▊ | 770/4286 [4:30:21<16:27:39, 16.85s/it] {'loss': 0.0173, 'grad_norm': 4.4793624709560715, 'learning_rate': 8.203453103126457e-07, 'completion_length': 116.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5580357611179352, 'rewards/format_reward': 1.0, 'reward': 1.5580357909202576, 'reward_std': 0.061985667794942856, 'kl': 0.4326171875, 'epoch': 0.18}
18%|█▊ | 771/4286 [4:30:38<16:21:37, 16.76s/it] {'loss': 0.0109, 'grad_norm': 2.3960749225138795, 'learning_rate': 8.201119925338311e-07, 'completion_length': 135.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5193452835083008, 'rewards/format_reward': 1.0, 'reward': 1.5193453431129456, 'reward_std': 0.025190782733261585, 'kl': 0.271484375, 'epoch': 0.18}
18%|█▊ | 772/4286 [4:30:54<16:07:27, 16.52s/it] {'loss': 0.01, 'grad_norm': 1.1890439530759447, 'learning_rate': 8.198786747550163e-07, 'completion_length': 141.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.4166666716337204, 'rewards/format_reward': 1.0, 'reward': 1.4166667461395264, 'reward_std': 0.013746436685323715, 'kl': 0.25, 'epoch': 0.18}
18%|█▊ | 773/4286 [4:31:10<16:12:00, 16.60s/it] {'loss': 0.0127, 'grad_norm': 4.6076782999751495, 'learning_rate': 8.196453569762015e-07, 'completion_length': 125.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.4776786118745804, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4598215818405151, 'reward_std': 0.07389794290065765, 'kl': 0.318359375, 'epoch': 0.18}
[2025-03-02 09:38:42,170] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
18%|█▊ | 774/4286 [4:31:26<16:00:53, 16.42s/it] {'loss': 0.0101, 'grad_norm': 2.0985803300315076, 'learning_rate': 8.194120391973867e-07, 'completion_length': 139.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4627976566553116, 'rewards/format_reward': 1.0, 'reward': 1.46279776096344, 'reward_std': 0.07029406167566776, 'kl': 0.2529296875, 'epoch': 0.18}
18%|█▊ | 775/4286 [4:31:42<15:50:36, 16.24s/it] {'loss': 0.0177, 'grad_norm': 4.446277421911403, 'learning_rate': 8.19178721418572e-07, 'completion_length': 140.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.4508928954601288, 'rewards/format_reward': 1.0, 'reward': 1.4508930444717407, 'reward_std': 0.10921961814165115, 'kl': 0.4423828125, 'epoch': 0.18}
18%|█▊ | 776/4286 [4:31:58<15:46:39, 16.18s/it] {'loss': 0.0125, 'grad_norm': 3.2485288346211676, 'learning_rate': 8.189454036397573e-07, 'completion_length': 129.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.4360119104385376, 'rewards/format_reward': 1.0, 'reward': 1.4360119700431824, 'reward_std': 0.03114316239953041, 'kl': 0.3115234375, 'epoch': 0.18}
18%|█▊ | 777/4286 [4:32:14<15:32:04, 15.94s/it] {'loss': 0.0107, 'grad_norm': 2.3124738384926586, 'learning_rate': 8.187120858609425e-07, 'completion_length': 124.10715103149414, 'rewards/only_full_func_accuracy_reward': 0.4345238506793976, 'rewards/format_reward': 1.0, 'reward': 1.4345239400863647, 'reward_std': 0.016262203454971313, 'kl': 0.2685546875, 'epoch': 0.18}
18%|█▊ | 778/4286 [4:32:30<15:33:54, 15.97s/it] {'loss': 0.0109, 'grad_norm': 2.15698123101954, 'learning_rate': 8.184787680821278e-07, 'completion_length': 127.9285774230957, 'rewards/only_full_func_accuracy_reward': 0.471726194024086, 'rewards/format_reward': 1.0, 'reward': 1.4717262983322144, 'reward_std': 0.10097679495811462, 'kl': 0.271484375, 'epoch': 0.18}
18%|█▊ | 779/4286 [4:32:46<15:35:56, 16.01s/it] {'loss': 0.0103, 'grad_norm': 1.6882384932678698, 'learning_rate': 8.18245450303313e-07, 'completion_length': 130.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.4270833879709244, 'rewards/format_reward': 1.0, 'reward': 1.427083432674408, 'reward_std': 0.0620726402848959, 'kl': 0.2578125, 'epoch': 0.18}
18%|█▊ | 780/4286 [4:33:02<15:42:18, 16.13s/it] {'loss': 0.0116, 'grad_norm': 3.3541138043102947, 'learning_rate': 8.180121325244983e-07, 'completion_length': 130.78571701049805, 'rewards/only_full_func_accuracy_reward': 0.4598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.4598215222358704, 'reward_std': 0.014880949631333351, 'kl': 0.2900390625, 'epoch': 0.18}
18%|█▊ | 781/4286 [4:33:21<16:31:41, 16.98s/it] {'loss': 0.0188, 'grad_norm': 19.319081745005978, 'learning_rate': 8.177788147456836e-07, 'completion_length': 140.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.4136905074119568, 'rewards/format_reward': 1.0, 'reward': 1.4136905670166016, 'reward_std': 0.05289733037352562, 'kl': 0.47216796875, 'epoch': 0.18}
18%|█▊ | 782/4286 [4:33:37<16:07:32, 16.57s/it] {'loss': 0.0107, 'grad_norm': 1.5259928134847165, 'learning_rate': 8.175454969668688e-07, 'completion_length': 124.21429061889648, 'rewards/only_full_func_accuracy_reward': 0.5550595223903656, 'rewards/format_reward': 1.0, 'reward': 1.5550596714019775, 'reward_std': 0.09166059270501137, 'kl': 0.2666015625, 'epoch': 0.18}
18%|█▊ | 783/4286 [4:33:55<16:31:07, 16.98s/it] {'loss': 0.0204, 'grad_norm': 3.652583872860468, 'learning_rate': 8.17312179188054e-07, 'completion_length': 143.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5877977013587952, 'reward_std': 0.11917800828814507, 'kl': 0.51171875, 'epoch': 0.18}
18%|█▊ | 784/4286 [4:34:11<16:28:31, 16.94s/it] {'loss': 0.0105, 'grad_norm': 2.2195741812570344, 'learning_rate': 8.170788614092394e-07, 'completion_length': 149.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.4375, 'rewards/format_reward': 1.0, 'reward': 1.4375001192092896, 'reward_std': 0.0416666641831398, 'kl': 0.2626953125, 'epoch': 0.18}
18%|█▊ | 785/4286 [4:34:27<16:00:01, 16.45s/it] {'loss': 0.0118, 'grad_norm': 21.841752576481873, 'learning_rate': 8.168455436304246e-07, 'completion_length': 115.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.6547619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6547620296478271, 'reward_std': 0.0357142873108387, 'kl': 0.294921875, 'epoch': 0.18}
18%|█▊ | 786/4286 [4:34:49<17:32:51, 18.05s/it] {'loss': 0.0543, 'grad_norm': 6.10598991861646, 'learning_rate': 8.166122258516098e-07, 'completion_length': 163.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.3258928805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2901787161827087, 'reward_std': 0.1531151942908764, 'kl': 1.353515625, 'epoch': 0.18}
18%|█▊ | 787/4286 [4:35:05<16:58:14, 17.46s/it] {'loss': 0.0106, 'grad_norm': 2.1085680296575218, 'learning_rate': 8.16378908072795e-07, 'completion_length': 133.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.538690522313118, 'rewards/format_reward': 1.0, 'reward': 1.5386905670166016, 'reward_std': 0.02221459150314331, 'kl': 0.263671875, 'epoch': 0.18}
18%|█▊ | 788/4286 [4:35:22<16:49:11, 17.31s/it] {'loss': 0.0177, 'grad_norm': 3.663322561763701, 'learning_rate': 8.161455902939804e-07, 'completion_length': 151.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.4657738357782364, 'rewards/format_reward': 1.0, 'reward': 1.4657739400863647, 'reward_std': 0.13795407861471176, 'kl': 0.4423828125, 'epoch': 0.18}
18%|█▊ | 789/4286 [4:35:43<17:57:54, 18.49s/it] {'loss': 0.0208, 'grad_norm': 1.6832987296988544, 'learning_rate': 8.159122725151656e-07, 'completion_length': 157.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.4288690835237503, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4110119938850403, 'reward_std': 0.08055352978408337, 'kl': 0.5224609375, 'epoch': 0.18}
18%|█▊ | 790/4286 [4:36:07<19:31:15, 20.10s/it] {'loss': 0.0687, 'grad_norm': 4.070736215480188, 'learning_rate': 8.156789547363508e-07, 'completion_length': 167.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.3281250223517418, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.2209821939468384, 'reward_std': 0.25497742742300034, 'kl': 1.71484375, 'epoch': 0.18}
18%|█▊ | 791/4286 [4:36:25<18:52:22, 19.44s/it]
{'loss': 0.032, 'grad_norm': 5.7711940246876114, 'learning_rate': 8.154456369575361e-07, 'completion_length': 139.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.4330357611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3973215818405151, 'reward_std': 0.12435895949602127, 'kl': 0.798828125, 'epoch': 0.18}
18%|█▊ | 792/4286 [4:36:41<18:05:23, 18.64s/it] {'loss': 0.0212, 'grad_norm': 2.7427456372079004, 'learning_rate': 8.152123191787214e-07, 'completion_length': 140.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.424107164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4062500596046448, 'reward_std': 0.1041666641831398, 'kl': 0.529296875, 'epoch': 0.18}
19%|█▊ | 793/4286 [4:36:59<17:46:19, 18.32s/it] {'loss': 0.0358, 'grad_norm': 6.78282199831283, 'learning_rate': 8.149790013999066e-07, 'completion_length': 143.96428680419922, 'rewards/only_full_func_accuracy_reward': 0.4708333760499954, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3994048833847046, 'reward_std': 0.2379295602440834, 'kl': 0.89453125, 'epoch': 0.19}
19%|█▊ | 794/4286 [4:37:18<18:02:42, 18.60s/it] {'loss': 0.038, 'grad_norm': 3.7190541744887797, 'learning_rate': 8.147456836210919e-07, 'completion_length': 154.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4389881193637848, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3854168057441711, 'reward_std': 0.18370355665683746, 'kl': 0.947265625, 'epoch': 0.19}
19%|█▊ | 795/4286 [4:37:36<17:40:24, 18.23s/it] {'loss': 0.0122, 'grad_norm': 2.3102976696548323, 'learning_rate': 8.145123658422771e-07, 'completion_length': 154.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.6116071939468384, 'rewards/format_reward': 1.0, 'reward': 1.6116072535514832, 'reward_std': 0.05193009041249752, 'kl': 0.3046875, 'epoch': 0.19}
19%|█▊ | 796/4286 [4:37:53<17:30:19, 18.06s/it] {'loss': 0.0118, 'grad_norm': 1.7246799151415801, 'learning_rate': 8.142790480634624e-07, 'completion_length': 163.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5163690894842148, 'rewards/format_reward': 1.0, 'reward': 1.5163691639900208, 'reward_std': 0.1151372455060482, 'kl': 0.29443359375, 'epoch': 0.19}
19%|█▊ | 797/4286 [4:38:11<17:23:36, 17.95s/it] {'loss': 0.0141, 'grad_norm': 3.28810176200829, 'learning_rate': 8.140457302846476e-07, 'completion_length': 149.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4449405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.427083432674408, 'reward_std': 0.11196072399616241, 'kl': 0.3515625, 'epoch': 0.19}
19%|█▊ | 798/4286 [4:38:32<18:15:02, 18.84s/it] {'loss': 0.021, 'grad_norm': 5.87067974468117, 'learning_rate': 8.138124125058329e-07, 'completion_length': 165.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.4330357238650322, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.415178656578064, 'reward_std': 0.1461014747619629, 'kl': 0.52685546875, 'epoch': 0.19}
19%|█▊ | 799/4286 [4:38:52<18:33:55, 19.17s/it] {'loss': 0.0266, 'grad_norm': 2.651140583144584, 'learning_rate': 8.135790947270181e-07, 'completion_length': 169.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.5119047462940216, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4761906862258911, 'reward_std': 0.24183834344148636, 'kl': 0.66796875, 'epoch': 0.19}
19%|█▊ | 800/4286 [4:39:12<18:45:38, 19.37s/it] {'loss': 0.0376, 'grad_norm': 3.772730954851421, 'learning_rate': 8.133457769482033e-07, 'completion_length': 156.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5779762417078018, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5244048833847046, 'reward_std': 0.09854403510689735, 'kl': 0.94140625, 'epoch': 0.19}
19%|█▊ | 801/4286 [4:46:51<146:25:19, 151.25s/it] {'loss': 0.0397, 'grad_norm': 3.0098068072832795, 'learning_rate': 8.131124591693887e-07, 'completion_length': 161.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.5041667073965073, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4684524536132812, 'reward_std': 0.09561743773519993, 'kl': 0.9921875, 'epoch': 0.19}
19%|█▊ | 802/4286 [4:47:07<107:20:23, 110.91s/it] {'loss': 0.0136, 'grad_norm': 1.1073807367301944, 'learning_rate': 8.128791413905739e-07, 'completion_length': 146.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4925595372915268, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4747024774551392, 'reward_std': 0.0565476231276989, 'kl': 0.34130859375, 'epoch': 0.19}
19%|█▊ | 803/4286 [4:47:25<80:11:07, 82.88s/it] {'loss': 0.0236, 'grad_norm': 2.603285526941841, 'learning_rate': 8.126458236117591e-07, 'completion_length': 149.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.3928571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750001192092896, 'reward_std': 0.06731786578893661, 'kl': 0.5908203125, 'epoch': 0.19}
19%|█▉ | 804/4286 [4:47:42<61:11:35, 63.27s/it] {'loss': 0.0098, 'grad_norm': 1.2568057601188718, 'learning_rate': 8.124125058329444e-07, 'completion_length': 149.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.48363097012043, 'rewards/format_reward': 1.0, 'reward': 1.4836310744285583, 'reward_std': 0.07238246500492096, 'kl': 0.244140625, 'epoch': 0.19}
19%|█▉ | 805/4286 [4:48:00<47:53:42, 49.53s/it] {'loss': 0.0319, 'grad_norm': 2.518962340980549, 'learning_rate': 8.121791880541297e-07, 'completion_length': 146.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.4836310148239136, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4657739400863647, 'reward_std': 0.05929445615038276, 'kl': 0.796875, 'epoch': 0.19}
19%|█▉ | 806/4286 [4:48:21<39:35:03, 40.95s/it] {'loss': 0.0242, 'grad_norm': 2.2550439319383915, 'learning_rate': 8.119458702753149e-07, 'completion_length': 150.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.3541667014360428, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3363096117973328, 'reward_std': 0.12943191826343536, 'kl': 0.603515625, 'epoch': 0.19}
[2025-03-02 09:55:57,630] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
19%|█▉ | 807/4286 [4:48:42<33:47:34, 34.97s/it] {'loss': 0.021, 'grad_norm': 16.663588732968897, 'learning_rate': 8.117125524965002e-07, 'completion_length': 138.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.4494047611951828, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4315477013587952, 'reward_std': 0.11745268851518631, 'kl': 0.52392578125, 'epoch': 0.19}
19%|█▉ | 808/4286 [4:49:00<28:53:46, 29.91s/it] {'loss': 0.0269, 'grad_norm': 7.130600301193904, 'learning_rate': 8.114792347176854e-07, 'completion_length': 159.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.4910714775323868, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4732144474983215, 'reward_std': 0.11807582527399063, 'kl': 0.6728515625, 'epoch': 0.19}
19%|█▉ | 809/4286 [4:49:17<25:18:43, 26.21s/it] {'loss': 0.015, 'grad_norm': 3.3682635165764543, 'learning_rate': 8.112459169388707e-07, 'completion_length': 159.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.575396865606308, 'rewards/format_reward': 1.0, 'reward': 1.5753969550132751, 'reward_std': 0.10757216438651085, 'kl': 0.3759765625, 'epoch': 0.19}
19%|█▉ | 810/4286 [4:49:39<24:05:49, 24.96s/it] {'loss': 0.0278, 'grad_norm': 3.25695478862096, 'learning_rate': 8.110125991600559e-07, 'completion_length': 169.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.4747024029493332, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4568453431129456, 'reward_std': 0.12812890484929085, 'kl': 0.6943359375, 'epoch': 0.19}
19%|█▉ | 811/4286 [4:49:57<21:53:23, 22.68s/it] {'loss': 0.0103, 'grad_norm': 1.7884932690364528, 'learning_rate': 8.107792813812412e-07, 'completion_length': 140.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.023809527046978474, 'kl': 0.255859375, 'epoch': 0.19}
19%|█▉ | 812/4286 [4:50:14<20:24:24, 21.15s/it] {'loss': 0.0102, 'grad_norm': 1.5266694409089219, 'learning_rate': 8.105459636024264e-07, 'completion_length': 159.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5654762387275696, 'rewards/format_reward': 1.0, 'reward': 1.5654763579368591, 'reward_std': 0.04191340692341328, 'kl': 0.25537109375, 'epoch': 0.19}
19%|█▉ | 813/4286 [4:50:32<19:29:02, 20.20s/it] {'loss': 0.0327, 'grad_norm': 19.270599273517217, 'learning_rate': 8.103126458236117e-07, 'completion_length': 161.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.329464316368103, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3116071820259094, 'reward_std': 0.14607790112495422, 'kl': 0.81494140625, 'epoch': 0.19}
19%|█▉ | 814/4286 [4:50:51<19:08:18, 19.84s/it] {'loss': 0.0297, 'grad_norm': 1.7634199492759193, 'learning_rate': 8.10079328044797e-07, 'completion_length': 159.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3928571939468384, 'reward_std': 0.11899879202246666, 'kl': 0.74462890625, 'epoch': 0.19}
19%|█▉ | 815/4286 [4:51:08<18:11:58, 18.88s/it] {'loss': 0.0131, 'grad_norm': 60.2227845183057, 'learning_rate': 8.098460102659822e-07, 'completion_length': 138.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.441964328289032, 'rewards/format_reward': 1.0, 'reward': 1.441964328289032, 'reward_std': 0.06869912147521973, 'kl': 0.330078125, 'epoch': 0.19}
19%|█▉ | 816/4286 [4:51:27<18:07:57, 18.81s/it] {'loss': 0.0414, 'grad_norm': 2.1454360163252226, 'learning_rate': 8.096126924871674e-07, 'completion_length': 159.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.2946428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2767858505249023, 'reward_std': 0.0889533357694745, 'kl': 1.0390625, 'epoch': 0.19}
19%|█▉ | 817/4286 [4:51:46<18:20:38, 19.04s/it] {'loss': 0.0201, 'grad_norm': 0.8676818507330638, 'learning_rate': 8.093793747083528e-07, 'completion_length': 145.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.5312500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5133930444717407, 'reward_std': 0.0625000037252903, 'kl': 0.5029296875, 'epoch': 0.19}
19%|█▉ | 818/4286 [4:52:03<17:36:35, 18.28s/it] {'loss': 0.0102, 'grad_norm': 4.868968292284184, 'learning_rate': 8.09146056929538e-07, 'completion_length': 138.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.504464328289032, 'rewards/format_reward': 1.0, 'reward': 1.5044644474983215, 'reward_std': 0.027706552296876907, 'kl': 0.25439453125, 'epoch': 0.19}
19%|█▉ | 819/4286 [4:52:20<17:21:31, 18.02s/it] {'loss': 0.0194, 'grad_norm': 2.4881484581921933, 'learning_rate': 8.089127391507232e-07, 'completion_length': 158.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.504464328289032, 'rewards/format_reward': 1.0, 'reward': 1.504464328289032, 'reward_std': 0.06299347430467606, 'kl': 0.4833984375, 'epoch': 0.19}
19%|█▉ | 820/4286 [4:52:38<17:14:13, 17.90s/it] {'loss': 0.0282, 'grad_norm': 3.4070118097738624, 'learning_rate': 8.086794213719084e-07, 'completion_length': 144.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4732143878936768, 'reward_std': 0.12028379365801811, 'kl': 0.70556640625, 'epoch': 0.19}
19%|█▉ | 821/4286 [4:52:54<16:51:30, 17.52s/it] {'loss': 0.0166, 'grad_norm': 1.0559450903983587, 'learning_rate': 8.084461035930938e-07, 'completion_length': 131.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4434524178504944, 'rewards/format_reward': 1.0, 'reward': 1.443452537059784, 'reward_std': 0.04115233523771167, 'kl': 0.4140625, 'epoch': 0.19}
19%|█▉ | 822/4286 [4:53:13<17:04:15, 17.74s/it] {'loss': 0.0493, 'grad_norm': 10.04532699206719, 'learning_rate': 8.08212785814279e-07, 'completion_length': 142.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.3437500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.325892984867096, 'reward_std': 0.11429983004927635, 'kl': 1.234375, 'epoch': 0.19}
19%|█▉ | 823/4286 [4:53:34<17:59:17, 18.70s/it] {'loss': 0.0637, 'grad_norm': 3.3342183949336968, 'learning_rate': 8.079794680354642e-07, 'completion_length': 167.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.4895833879709244, 'rewards/format_reward': 1.0, 'reward': 1.4895834922790527, 'reward_std': 0.12478631734848022, 'kl': 1.58984375, 'epoch': 0.19}
19%|█▉ | 824/4286 [4:53:50<17:21:26, 18.05s/it] {'loss': 0.0592, 'grad_norm': 7.431720799696952, 'learning_rate': 8.077461502566495e-07, 'completion_length': 148.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.4122024029493332, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3586310744285583, 'reward_std': 0.19542967528104782, 'kl': 1.48046875, 'epoch': 0.19}
19%|█▉ | 825/4286 [4:54:14<19:09:58, 19.94s/it] {'loss': 0.1052, 'grad_norm': 4.835425285370613, 'learning_rate': 8.075128324778347e-07, 'completion_length': 169.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.4806548058986664, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.3556548357009888, 'reward_std': 0.2841580808162689, 'kl': 2.6328125, 'epoch': 0.19}
19%|█▉ | 826/4286 [4:54:35<19:26:37, 20.23s/it] {'loss': 0.0412, 'grad_norm': 2.3173915429682164, 'learning_rate': 8.0727951469902e-07, 'completion_length': 159.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4107144474983215, 'reward_std': 0.15191611647605896, 'kl': 1.033203125, 'epoch': 0.19}
[2025-03-02 10:02:08,269] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
19%|█▉ | 827/4286 [4:54:52<18:30:07, 19.26s/it] {'loss': 0.0195, 'grad_norm': 7.295948532770335, 'learning_rate': 8.070461969202053e-07, 'completion_length': 146.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.4568452835083008, 'rewards/format_reward': 1.0, 'reward': 1.4568453431129456, 'reward_std': 0.09342947602272034, 'kl': 0.48828125, 'epoch': 0.19}
[2025-03-02 10:02:28,011] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
19%|█▉ | 828/4286 [4:55:12<18:38:11, 19.40s/it] {'loss': 0.0602, 'grad_norm': 5.072796078984047, 'learning_rate': 8.068128791413905e-07, 'completion_length': 144.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.3452381193637848, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3095239400863647, 'reward_std': 0.13731923699378967, 'kl': 1.50390625, 'epoch': 0.19}
19%|█▉ | 829/4286 [4:55:29<17:55:24, 18.66s/it] {'loss': 0.029, 'grad_norm': 4.8255353989255, 'learning_rate': 8.065795613625757e-07, 'completion_length': 142.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3660715222358704, 'reward_std': 0.06983364000916481, 'kl': 0.724609375, 'epoch': 0.19}
19%|█▉ | 830/4286 [4:55:45<17:14:26, 17.96s/it] {'loss': 0.0232, 'grad_norm': 2.379300770475035, 'learning_rate': 8.063462435837611e-07, 'completion_length': 141.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.5074405372142792, 'rewards/format_reward': 1.0, 'reward': 1.5074405670166016, 'reward_std': 0.07079148106276989, 'kl': 0.5791015625, 'epoch': 0.19}
19%|█▉ | 831/4286 [4:56:02<16:54:05, 17.61s/it] {'loss': 0.0179, 'grad_norm': 4.604717878239206, 'learning_rate': 8.061129258049463e-07, 'completion_length': 130.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.4806547909975052, 'rewards/format_reward': 1.0, 'reward': 1.4806548357009888, 'reward_std': 0.05495268292725086, 'kl': 0.447265625, 'epoch': 0.19}
19%|█▉ | 832/4286 [4:56:20<16:53:51, 17.61s/it] {'loss': 0.042, 'grad_norm': 6.005977195162468, 'learning_rate': 8.058796080261315e-07, 'completion_length': 146.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.4880952537059784, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4523810744285583, 'reward_std': 0.10407906491309404, 'kl': 1.0478515625, 'epoch': 0.19}
19%|█▉ | 833/4286 [4:56:36<16:26:05, 17.13s/it] {'loss': 0.0112, 'grad_norm': 3.058187242264597, 'learning_rate': 8.056462902473167e-07, 'completion_length': 128.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.3854167014360428, 'rewards/format_reward': 1.0, 'reward': 1.3854167461395264, 'reward_std': 0.1032249704003334, 'kl': 0.28076171875, 'epoch': 0.19}
19%|█▉ | 834/4286 [4:56:53<16:27:09, 17.16s/it] {'loss': 0.0177, 'grad_norm': 5.76044311978588, 'learning_rate': 8.054129724685021e-07, 'completion_length': 142.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.35863097012043, 'rewards/format_reward': 1.0, 'reward': 1.3586310148239136, 'reward_std': 0.06312102824449539, 'kl': 0.443359375, 'epoch': 0.19}
19%|█▉ | 835/4286 [4:57:14<17:29:29, 18.25s/it] {'loss': 0.0261, 'grad_norm': 1.7673916655617417, 'learning_rate': 8.051796546896873e-07, 'completion_length': 132.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3928571939468384, 'reward_std': 0.0892857201397419, 'kl': 0.654296875, 'epoch': 0.19}
20%|█▉ | 836/4286 [4:57:30<16:56:37, 17.68s/it] {'loss': 0.0288, 'grad_norm': 1.7972148269862878, 'learning_rate': 8.049463369108725e-07, 'completion_length': 135.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6458333134651184, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6101192235946655, 'reward_std': 0.1130952388048172, 'kl': 0.71923828125, 'epoch': 0.2}
20%|█▉ | 837/4286 [4:57:47<16:46:40, 17.51s/it] {'loss': 0.0141, 'grad_norm': 3.8008464234670813, 'learning_rate': 8.047130191320578e-07, 'completion_length': 143.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.4717262238264084, 'rewards/format_reward': 1.0, 'reward': 1.4717262983322144, 'reward_std': 0.08336639404296875, 'kl': 0.353515625, 'epoch': 0.2}
20%|█▉ | 838/4286 [4:58:04<16:27:31, 17.18s/it] {'loss': 0.0216, 'grad_norm': 4.060006779518737, 'learning_rate': 8.044797013532431e-07, 'completion_length': 137.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.406250037252903, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.388392984867096, 'reward_std': 0.13637294992804527, 'kl': 0.5390625, 'epoch': 0.2}
20%|█▉ | 839/4286 [4:58:20<16:15:49, 16.99s/it] {'loss': 0.0135, 'grad_norm': 1.8638799920679046, 'learning_rate': 8.042463835744283e-07, 'completion_length': 139.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4017858505249023, 'reward_std': 0.07419108226895332, 'kl': 0.3388671875, 'epoch': 0.2}
20%|█▉ | 840/4286 [4:58:36<15:57:30, 16.67s/it] {'loss': 0.0234, 'grad_norm': 14.996129602516348, 'learning_rate': 8.040130657956136e-07, 'completion_length': 131.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.38988097012043, 'rewards/format_reward': 1.0, 'reward': 1.3898810744285583, 'reward_std': 0.04970746114850044, 'kl': 0.5830078125, 'epoch': 0.2}
20%|█▉ | 841/4286 [4:58:54<16:18:03, 17.03s/it] {'loss': 0.0242, 'grad_norm': 2.038332266408726, 'learning_rate': 8.037797480167988e-07, 'completion_length': 134.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.444940522313118, 'rewards/format_reward': 1.0, 'reward': 1.4449405670166016, 'reward_std': 0.11310235410928726, 'kl': 0.60546875, 'epoch': 0.2}
20%|█▉ | 842/4286 [4:59:11<16:17:09, 17.02s/it] {'loss': 0.0118, 'grad_norm': 2.588191639437038, 'learning_rate': 8.035464302379841e-07, 'completion_length': 146.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.418154776096344, 'rewards/format_reward': 1.0, 'reward': 1.4181548953056335, 'reward_std': 0.06856658682227135, 'kl': 0.294921875, 'epoch': 0.2}
20%|█▉ | 843/4286 [4:59:29<16:28:38, 17.23s/it] {'loss': 0.0102, 'grad_norm': 1.1705562785433572, 'learning_rate': 8.033131124591693e-07, 'completion_length': 147.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4047619104385376, 'rewards/format_reward': 1.0, 'reward': 1.4047620296478271, 'reward_std': 0.02380952797830105, 'kl': 0.25634765625, 'epoch': 0.2}
20%|█▉ | 844/4286 [4:59:46<16:20:35, 17.09s/it] {'loss': 0.0208, 'grad_norm': 1.9684245506232452, 'learning_rate': 8.030797946803546e-07, 'completion_length': 143.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.09890817478299141, 'kl': 0.5224609375, 'epoch': 0.2}
20%|█▉ | 845/4286 [5:00:02<16:11:28, 16.94s/it] {'loss': 0.0098, 'grad_norm': 0.9989390676974494, 'learning_rate': 8.028464769015398e-07, 'completion_length': 137.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5029762238264084, 'rewards/format_reward': 1.0, 'reward': 1.5029762983322144, 'reward_std': 0.048770660534501076, 'kl': 0.24365234375, 'epoch': 0.2}
20%|█▉ | 846/4286 [5:00:18<15:57:32, 16.70s/it] {'loss': 0.0114, 'grad_norm': 2.869712410624718, 'learning_rate': 8.02613159122725e-07, 'completion_length': 131.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.4315476715564728, 'rewards/format_reward': 1.0, 'reward': 1.4315477013587952, 'reward_std': 0.0773809552192688, 'kl': 0.28515625, 'epoch': 0.2}
20%|█▉ | 847/4286 [5:00:35<15:53:59, 16.64s/it] {'loss': 0.0716, 'grad_norm': 8.935713091794266, 'learning_rate': 8.023798413439104e-07, 'completion_length': 132.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.2440476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2261905670166016, 'reward_std': 0.08769078552722931, 'kl': 1.79296875, 'epoch': 0.2}
20%|█▉ | 848/4286 [5:00:52<16:09:07, 16.91s/it] {'loss': 0.0291, 'grad_norm': 3.570277516216613, 'learning_rate': 8.021465235650956e-07, 'completion_length': 133.1785774230957, 'rewards/only_full_func_accuracy_reward': 0.379464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3616071939468384, 'reward_std': 0.1204288899898529, 'kl': 0.7255859375, 'epoch': 0.2}
20%|█▉ | 849/4286 [5:01:09<16:04:53, 16.84s/it] {'loss': 0.0201, 'grad_norm': 1.9123971063628349, 'learning_rate': 8.019132057862808e-07, 'completion_length': 146.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.10020089522004128, 'kl': 0.5029296875, 'epoch': 0.2}
20%|█▉ | 850/4286 [5:01:26<16:14:51, 17.02s/it] {'loss': 0.0749, 'grad_norm': 4.249508421391924, 'learning_rate': 8.016798880074662e-07, 'completion_length': 135.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.2663690745830536, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.21279776096344, 'reward_std': 0.19386685639619827, 'kl': 1.875, 'epoch': 0.2}
20%|█▉ | 851/4286 [5:01:43<16:03:34, 16.83s/it]
{'loss': 0.0384, 'grad_norm': 5.763332713934828, 'learning_rate': 8.014465702286514e-07, 'completion_length': 127.14286422729492, 'rewards/only_full_func_accuracy_reward': 0.4568452835083008, 'rewards/format_reward': 1.0, 'reward': 1.4568453431129456, 'reward_std': 0.0799297858029604, 'kl': 0.958984375, 'epoch': 0.2}
20%|█▉ | 852/4286 [5:01:59<15:51:29, 16.62s/it] {'loss': 0.0686, 'grad_norm': 4.630899212137408, 'learning_rate': 8.012132524498366e-07, 'completion_length': 121.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.4136904925107956, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.395833432674408, 'reward_std': 0.136904776096344, 'kl': 1.7109375, 'epoch': 0.2}
20%|█▉ | 853/4286 [5:02:15<15:38:50, 16.41s/it] {'loss': 0.0923, 'grad_norm': 3.862032684971238, 'learning_rate': 8.009799346710219e-07, 'completion_length': 127.71429061889648, 'rewards/only_full_func_accuracy_reward': 0.5625000447034836, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5089287161827087, 'reward_std': 0.2722145989537239, 'kl': 2.3125, 'epoch': 0.2}
20%|█▉ | 854/4286 [5:02:31<15:35:07, 16.35s/it] {'loss': 0.0744, 'grad_norm': 5.120116139673462, 'learning_rate': 8.007466168922071e-07, 'completion_length': 114.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5952381193637848, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5416668057441711, 'reward_std': 0.15051203407347202, 'kl': 1.865234375, 'epoch': 0.2}
20%|█▉ | 855/4286 [5:02:48<15:36:25, 16.38s/it] {'loss': 0.0466, 'grad_norm': 3.3635370756343432, 'learning_rate': 8.005132991133924e-07, 'completion_length': 130.48215103149414, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5595239400863647, 'reward_std': 0.18249697983264923, 'kl': 1.166015625, 'epoch': 0.2}
20%|█▉ | 856/4286 [5:03:06<16:06:06, 16.90s/it] {'loss': 0.0848, 'grad_norm': 5.0257838883678065, 'learning_rate': 8.002799813345776e-07, 'completion_length': 143.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.4538690894842148, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3824406266212463, 'reward_std': 0.1965876966714859, 'kl': 2.115234375, 'epoch': 0.2}
20%|█▉ | 857/4286 [5:03:24<16:30:05, 17.32s/it] {'loss': 0.12, 'grad_norm': 5.86213310920025, 'learning_rate': 8.000466635557629e-07, 'completion_length': 137.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.3675595372915268, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.313988208770752, 'reward_std': 0.27606065571308136, 'kl': 3.0, 'epoch': 0.2}
[2025-03-02 10:10:56,817] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
20%|██ | 858/4286 [5:03:41<16:24:01, 17.22s/it] {'loss': 0.1358, 'grad_norm': 4.984215155957157, 'learning_rate': 7.998133457769481e-07, 'completion_length': 127.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.4642857611179352, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3928572535514832, 'reward_std': 0.34888888895511627, 'kl': 3.3984375, 'epoch': 0.2}
20%|██ | 859/4286 [5:03:58<16:22:34, 17.20s/it] {'loss': 0.1985, 'grad_norm': 10.273960818522648, 'learning_rate': 7.995800279981334e-07, 'completion_length': 130.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.401785746216774, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.2589285969734192, 'reward_std': 0.3442546874284744, 'kl': 4.96875, 'epoch': 0.2}
20%|██ | 860/4286 [5:04:19<17:24:09, 18.29s/it] {'loss': 0.0472, 'grad_norm': 3.9618694294563768, 'learning_rate': 7.993467102193187e-07, 'completion_length': 156.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.4300595670938492, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3943453431129456, 'reward_std': 0.1576547995209694, 'kl': 1.1796875, 'epoch': 0.2}
20%|██ | 861/4286 [5:04:40<18:18:40, 19.25s/it] {'loss': 0.0823, 'grad_norm': 5.6119269563779675, 'learning_rate': 7.991133924405039e-07, 'completion_length': 154.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.236607164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.1830358505249023, 'reward_std': 0.15269088372588158, 'kl': 2.060546875, 'epoch': 0.2}
20%|██ | 862/4286 [5:04:56<17:16:09, 18.16s/it] {'loss': 0.063, 'grad_norm': 7.209790078343441, 'learning_rate': 7.988800746616891e-07, 'completion_length': 129.23215103149414, 'rewards/only_full_func_accuracy_reward': 0.5059524029493332, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4880953431129456, 'reward_std': 0.11685137823224068, 'kl': 1.5751953125, 'epoch': 0.2}
20%|██ | 863/4286 [5:05:13<17:01:14, 17.90s/it] {'loss': 0.0149, 'grad_norm': 7.569464186095336, 'learning_rate': 7.986467568828745e-07, 'completion_length': 154.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.550595298409462, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.532738208770752, 'reward_std': 0.08491658791899681, 'kl': 0.37109375, 'epoch': 0.2}
20%|██ | 864/4286 [5:05:31<16:54:30, 17.79s/it] {'loss': 0.0167, 'grad_norm': 1.3526065887344518, 'learning_rate': 7.984134391040597e-07, 'completion_length': 145.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6681548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6681549549102783, 'reward_std': 0.034954057075083256, 'kl': 0.4150390625, 'epoch': 0.2}
20%|██ | 865/4286 [5:05:48<16:42:09, 17.58s/it] {'loss': 0.0402, 'grad_norm': 3.7675417856029396, 'learning_rate': 7.981801213252449e-07, 'completion_length': 138.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4464286118745804, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4107143878936768, 'reward_std': 0.14959122240543365, 'kl': 1.005859375, 'epoch': 0.2}
20%|██ | 866/4286 [5:06:05<16:38:16, 17.51s/it] {'loss': 0.032, 'grad_norm': 2.5433119227209233, 'learning_rate': 7.979468035464301e-07, 'completion_length': 153.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.3913690894842148, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3735119700431824, 'reward_std': 0.1414642110466957, 'kl': 0.80078125, 'epoch': 0.2}
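In the step records above, the logged `reward` appears to be the sum of the two component rewards, `rewards/only_full_func_accuracy_reward` and `rewards/format_reward`, up to float32 rounding. A minimal sketch that checks this against the step-865 record copied verbatim from the log (the dicts are valid Python literals, so `ast.literal_eval` can parse them):

```python
import ast
import math

# Step-865 record, copied verbatim from the log above.
record_text = ("{'loss': 0.0402, 'grad_norm': 3.7675417856029396, "
               "'learning_rate': 7.981801213252449e-07, "
               "'completion_length': 138.0357208251953, "
               "'rewards/only_full_func_accuracy_reward': 0.4464286118745804, "
               "'rewards/format_reward': 0.9642857313156128, "
               "'reward': 1.4107143878936768, "
               "'reward_std': 0.14959122240543365, 'kl': 1.005859375, 'epoch': 0.2}")

record = ast.literal_eval(record_text)  # parse the dict literal safely

# Total reward should equal accuracy reward + format reward
# (small discrepancies come from float32 accumulation in the trainer).
total = (record['rewards/only_full_func_accuracy_reward']
         + record['rewards/format_reward'])
assert math.isclose(total, record['reward'], abs_tol=1e-5)
```

The same identity holds for every record in this section, which is a quick sanity check when eyeballing whether a reward dip comes from the accuracy term or the format term.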
20%|██ | 867/4286 [5:06:23<16:37:06, 17.50s/it] {'loss': 0.0108, 'grad_norm': 3.3857163033878632, 'learning_rate': 7.977134857676155e-07, 'completion_length': 155.0, 'rewards/only_full_func_accuracy_reward': 0.4255952686071396, 'rewards/format_reward': 1.0, 'reward': 1.4255953431129456, 'reward_std': 0.12010039389133453, 'kl': 0.26953125, 'epoch': 0.2}
20%|██ | 868/4286 [5:06:39<16:21:10, 17.22s/it] {'loss': 0.0104, 'grad_norm': 3.593149104911992, 'learning_rate': 7.974801679888007e-07, 'completion_length': 142.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.392857313156128, 'reward_std': 0.11232833191752434, 'kl': 0.2587890625, 'epoch': 0.2}
20%|██ | 869/4286 [5:06:57<16:20:25, 17.22s/it] {'loss': 0.01, 'grad_norm': 2.1947654071714333, 'learning_rate': 7.972468502099859e-07, 'completion_length': 164.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.4360119551420212, 'rewards/format_reward': 1.0, 'reward': 1.4360119700431824, 'reward_std': 0.0831824541091919, 'kl': 0.25, 'epoch': 0.2}
20%|██ | 870/4286 [5:07:14<16:30:49, 17.40s/it] {'loss': 0.0116, 'grad_norm': 2.0751665354439033, 'learning_rate': 7.970135324311712e-07, 'completion_length': 160.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.461309552192688, 'rewards/format_reward': 1.0, 'reward': 1.4613096117973328, 'reward_std': 0.06487488746643066, 'kl': 0.28857421875, 'epoch': 0.2}
20%|██ | 871/4286 [5:07:31<16:08:39, 17.02s/it] {'loss': 0.0113, 'grad_norm': 44.02732842290429, 'learning_rate': 7.967802146523565e-07, 'completion_length': 145.375, 'rewards/only_full_func_accuracy_reward': 0.433035746216774, 'rewards/format_reward': 1.0, 'reward': 1.4330357909202576, 'reward_std': 0.10650181770324707, 'kl': 0.28125, 'epoch': 0.2}
20%|██ | 872/4286 [5:07:48<16:11:04, 17.07s/it] {'loss': 0.0152, 'grad_norm': 2.8797156584290304, 'learning_rate': 7.965468968735417e-07, 'completion_length': 158.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.4449405074119568, 'rewards/format_reward': 1.0, 'reward': 1.4449406862258911, 'reward_std': 0.0803571343421936, 'kl': 0.3798828125, 'epoch': 0.2}
20%|██ | 873/4286 [5:08:05<16:11:34, 17.08s/it] {'loss': 0.0222, 'grad_norm': 30.50268596490256, 'learning_rate': 7.96313579094727e-07, 'completion_length': 158.875, 'rewards/only_full_func_accuracy_reward': 0.4122024029493332, 'rewards/format_reward': 1.0, 'reward': 1.4122024774551392, 'reward_std': 0.043621594086289406, 'kl': 0.55615234375, 'epoch': 0.2}
20%|██ | 874/4286 [5:08:24<16:39:46, 17.58s/it] {'loss': 0.0392, 'grad_norm': 12.303137861052063, 'learning_rate': 7.960802613159122e-07, 'completion_length': 156.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.4553571790456772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4375001192092896, 'reward_std': 0.13521833717823029, 'kl': 0.982421875, 'epoch': 0.2}
20%|██ | 875/4286 [5:08:40<16:28:38, 17.39s/it] {'loss': 0.0089, 'grad_norm': 1.5378206087459598, 'learning_rate': 7.958469435370974e-07, 'completion_length': 151.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.5565477013587952, 'rewards/format_reward': 1.0, 'reward': 1.5565477013587952, 'reward_std': 0.05038155987858772, 'kl': 0.22314453125, 'epoch': 0.2}
20%|██ | 876/4286 [5:08:59<16:44:56, 17.68s/it] {'loss': 0.0154, 'grad_norm': 1.6730710096595534, 'learning_rate': 7.956136257582828e-07, 'completion_length': 169.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.519345298409462, 'rewards/format_reward': 1.0, 'reward': 1.5193453431129456, 'reward_std': 0.06492296606302261, 'kl': 0.38671875, 'epoch': 0.2}
20%|██ | 877/4286 [5:09:17<16:54:22, 17.85s/it] {'loss': 0.0324, 'grad_norm': 8.767581200118249, 'learning_rate': 7.95380307979468e-07, 'completion_length': 157.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.3895833492279053, 'rewards/format_reward': 1.0, 'reward': 1.3895834684371948, 'reward_std': 0.11782998964190483, 'kl': 0.8125, 'epoch': 0.2}
20%|██ | 878/4286 [5:09:34<16:36:09, 17.54s/it] {'loss': 0.0179, 'grad_norm': 3.1518708399723887, 'learning_rate': 7.951469902006532e-07, 'completion_length': 161.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.3898809999227524, 'rewards/format_reward': 1.0, 'reward': 1.3898810148239136, 'reward_std': 0.09112739190459251, 'kl': 0.447265625, 'epoch': 0.2}
21%|██ | 879/4286 [5:09:52<16:39:56, 17.61s/it] {'loss': 0.0334, 'grad_norm': 2.533615072905347, 'learning_rate': 7.949136724218384e-07, 'completion_length': 154.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.476190522313118, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.458333432674408, 'reward_std': 0.08333333395421505, 'kl': 0.8359375, 'epoch': 0.21}
21%|██ | 880/4286 [5:10:16<18:38:44, 19.71s/it] {'loss': 0.1053, 'grad_norm': 5.96045746672421, 'learning_rate': 7.946803546430238e-07, 'completion_length': 154.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.4151785969734192, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3080357313156128, 'reward_std': 0.22839491069316864, 'kl': 2.6328125, 'epoch': 0.21}
21%|██ | 881/4286 [5:10:38<19:15:04, 20.35s/it] {'loss': 0.0463, 'grad_norm': 10.252190438129327, 'learning_rate': 7.94447036864209e-07, 'completion_length': 153.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.3541667014360428, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3005953431129456, 'reward_std': 0.13777105137705803, 'kl': 1.15625, 'epoch': 0.21}
21%|██ | 882/4286 [5:11:00<19:37:19, 20.75s/it] {'loss': 0.0719, 'grad_norm': 3.9217797844713167, 'learning_rate': 7.942137190853942e-07, 'completion_length': 173.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.38601192831993103, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3324406147003174, 'reward_std': 0.1116496566683054, 'kl': 1.796875, 'epoch': 0.21}
21%|██ | 883/4286 [5:11:16<18:15:06, 19.31s/it] {'loss': 0.012, 'grad_norm': 2.125007544904859, 'learning_rate': 7.939804013065795e-07, 'completion_length': 141.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.5803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.580357313156128, 'reward_std': 0.07419108599424362, 'kl': 0.298828125, 'epoch': 0.21}
21%|██ | 884/4286 [5:11:33<17:46:08, 18.80s/it] {'loss': 0.0351, 'grad_norm': 3.703920959081782, 'learning_rate': 7.937470835277648e-07, 'completion_length': 142.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.3586309850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3229167461395264, 'reward_std': 0.17667584121227264, 'kl': 0.8759765625, 'epoch': 0.21}
21%|██ | 885/4286 [5:11:51<17:32:52, 18.57s/it] {'loss': 0.0504, 'grad_norm': 3.0036606668061743, 'learning_rate': 7.9351376574895e-07, 'completion_length': 141.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.553571492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178572535514832, 'reward_std': 0.18903036788105965, 'kl': 1.26171875, 'epoch': 0.21}
21%|██ | 886/4286 [5:12:10<17:37:16, 18.66s/it] {'loss': 0.0373, 'grad_norm': 2.7292500117583103, 'learning_rate': 7.932804479701353e-07, 'completion_length': 148.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5565476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5386905670166016, 'reward_std': 0.11745267733931541, 'kl': 0.92822265625, 'epoch': 0.21}
21%|██ | 887/4286 [5:12:29<17:41:49, 18.74s/it] {'loss': 0.059, 'grad_norm': 4.652067395331467, 'learning_rate': 7.930471301913205e-07, 'completion_length': 146.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4107144474983215, 'reward_std': 0.22333232313394547, 'kl': 1.4765625, 'epoch': 0.21}
21%|██ | 888/4286 [5:12:46<17:15:48, 18.29s/it] {'loss': 0.0341, 'grad_norm': 4.737154173818517, 'learning_rate': 7.928138124125058e-07, 'completion_length': 145.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.488095298409462, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.470238208770752, 'reward_std': 0.07431714981794357, 'kl': 0.8515625, 'epoch': 0.21}
21%|██ | 889/4286 [5:13:03<16:38:55, 17.64s/it] {'loss': 0.0597, 'grad_norm': 3.8212154037520305, 'learning_rate': 7.92580494633691e-07, 'completion_length': 138.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.4345238357782364, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3988096714019775, 'reward_std': 0.16387204825878143, 'kl': 1.4921875, 'epoch': 0.21}
21%|██ | 890/4286 [5:13:19<16:14:20, 17.21s/it] {'loss': 0.0106, 'grad_norm': 11.698976522025136, 'learning_rate': 7.923471768548763e-07, 'completion_length': 134.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.479166716337204, 'rewards/format_reward': 1.0, 'reward': 1.4791667461395264, 'reward_std': 0.071001211181283, 'kl': 0.2646484375, 'epoch': 0.21}
21%|██ | 891/4286 [5:13:35<16:00:30, 16.98s/it] {'loss': 0.0479, 'grad_norm': 1.8862043623417861, 'learning_rate': 7.921138590760615e-07, 'completion_length': 127.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.48363097012043, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4300596714019775, 'reward_std': 0.1517857164144516, 'kl': 1.193359375, 'epoch': 0.21}
21%|██ | 892/4286 [5:13:52<15:50:36, 16.81s/it] {'loss': 0.0488, 'grad_norm': 2.5848159831319952, 'learning_rate': 7.918805412972468e-07, 'completion_length': 126.60715103149414, 'rewards/only_full_func_accuracy_reward': 0.3363095372915268, 'rewards/format_reward': 1.0, 'reward': 1.3363096117973328, 'reward_std': 0.10788307711482048, 'kl': 1.216796875, 'epoch': 0.21}
21%|██ | 893/4286 [5:14:11<16:35:59, 17.61s/it] {'loss': 0.0185, 'grad_norm': 3.468083857225214, 'learning_rate': 7.916472235184321e-07, 'completion_length': 133.6785774230957, 'rewards/only_full_func_accuracy_reward': 0.4122024029493332, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3943453431129456, 'reward_std': 0.10969168692827225, 'kl': 0.46337890625, 'epoch': 0.21}
21%|██ | 894/4286 [5:14:27<16:08:38, 17.13s/it] {'loss': 0.0504, 'grad_norm': 2.9215957147328147, 'learning_rate': 7.914139057396173e-07, 'completion_length': 137.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.4627976268529892, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.427083432674408, 'reward_std': 0.1534467712044716, 'kl': 1.2607421875, 'epoch': 0.21}
21%|██ | 895/4286 [5:14:44<16:01:12, 17.01s/it] {'loss': 0.0103, 'grad_norm': 1.521930299039329, 'learning_rate': 7.911805879608025e-07, 'completion_length': 135.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.3913690745830536, 'rewards/format_reward': 1.0, 'reward': 1.3913691639900208, 'reward_std': 0.037095542065799236, 'kl': 0.2568359375, 'epoch': 0.21}
21%|██ | 896/4286 [5:14:59<15:36:53, 16.58s/it] {'loss': 0.0189, 'grad_norm': 2.027846792761511, 'learning_rate': 7.909472701819879e-07, 'completion_length': 119.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.08315271139144897, 'kl': 0.4716796875, 'epoch': 0.21}
21%|██ | 897/4286 [5:15:15<15:27:06, 16.41s/it] {'loss': 0.0197, 'grad_norm': 4.735888223289726, 'learning_rate': 7.907139524031731e-07, 'completion_length': 124.00000381469727, 'rewards/only_full_func_accuracy_reward': 0.5461310148239136, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5282739400863647, 'reward_std': 0.09066697023808956, 'kl': 0.4921875, 'epoch': 0.21}
21%|██ | 898/4286 [5:15:31<15:07:02, 16.06s/it] {'loss': 0.0149, 'grad_norm': 2.139976081095916, 'learning_rate': 7.904806346243583e-07, 'completion_length': 124.78572082519531, 'rewards/only_full_func_accuracy_reward': 0.4895833730697632, 'rewards/format_reward': 1.0, 'reward': 1.489583432674408, 'reward_std': 0.10005596652626991, 'kl': 0.37109375, 'epoch': 0.21}
21%|██ | 899/4286 [5:15:46<14:48:09, 15.73s/it] {'loss': 0.0312, 'grad_norm': 10.327520737850676, 'learning_rate': 7.902473168455436e-07, 'completion_length': 116.96429061889648, 'rewards/only_full_func_accuracy_reward': 0.4776785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4598215818405151, 'reward_std': 0.1547161042690277, 'kl': 0.779296875, 'epoch': 0.21}
21%|██ | 900/4286 [5:16:01<14:36:21, 15.53s/it] {'loss': 0.031, 'grad_norm': 6.313025492966126, 'learning_rate': 7.900139990667289e-07, 'completion_length': 116.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.3720238357782364, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3541668057441711, 'reward_std': 0.0892857164144516, 'kl': 0.77587890625, 'epoch': 0.21}
21%|██ | 901/4286 [5:20:42<89:31:56, 95.22s/it] {'loss': 0.0108, 'grad_norm': 3.965290345024989, 'learning_rate': 7.897806812879141e-07, 'completion_length': 135.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.4985119253396988, 'rewards/format_reward': 1.0, 'reward': 1.4985119700431824, 'reward_std': 0.058389296755194664, 'kl': 0.27099609375, 'epoch': 0.21}
21%|██ | 902/4286 [5:20:57<66:53:42, 71.17s/it] {'loss': 0.032, 'grad_norm': 2.9403354620306508, 'learning_rate': 7.895473635090993e-07, 'completion_length': 108.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.5386905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.520833432674408, 'reward_std': 0.09800060419365764, 'kl': 0.80224609375, 'epoch': 0.21}
21%|██ | 903/4286 [5:21:12<51:05:58, 54.38s/it] {'loss': 0.0717, 'grad_norm': 3.600323645990727, 'learning_rate': 7.893140457302846e-07, 'completion_length': 121.28571701049805, 'rewards/only_full_func_accuracy_reward': 0.5059524327516556, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4880953431129456, 'reward_std': 0.08660616725683212, 'kl': 1.7890625, 'epoch': 0.21}
21%|██ | 904/4286 [5:21:27<40:02:05, 42.62s/it] {'loss': 0.0258, 'grad_norm': 1.8628327423624202, 'learning_rate': 7.890807279514698e-07, 'completion_length': 120.75000381469727, 'rewards/only_full_func_accuracy_reward': 0.6264881193637848, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.1357702501118183, 'kl': 0.646484375, 'epoch': 0.21}
21%|██ | 905/4286 [5:21:47<33:33:23, 35.73s/it] {'loss': 0.0981, 'grad_norm': 4.049178924396786, 'learning_rate': 7.888474101726551e-07, 'completion_length': 136.35715103149414, 'rewards/only_full_func_accuracy_reward': 0.2797619253396988, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.208333432674408, 'reward_std': 0.21938148885965347, 'kl': 2.44921875, 'epoch': 0.21}
21%|██ | 906/4286 [5:22:02<27:46:54, 29.59s/it] {'loss': 0.0337, 'grad_norm': 3.7746316863679024, 'learning_rate': 7.886140923938404e-07, 'completion_length': 114.53572082519531, 'rewards/only_full_func_accuracy_reward': 0.4866071939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4687501788139343, 'reward_std': 0.1160714328289032, 'kl': 0.837890625, 'epoch': 0.21}
21%|██ | 907/4286 [5:22:18<23:56:15, 25.50s/it] {'loss': 0.0737, 'grad_norm': 4.384887914000974, 'learning_rate': 7.883807746150256e-07, 'completion_length': 123.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.4880952686071396, 'rewards/format_reward': 1.0, 'reward': 1.4880953431129456, 'reward_std': 0.06881999969482422, 'kl': 1.84375, 'epoch': 0.21}
21%|██ | 908/4286 [5:22:34<21:03:57, 22.45s/it] {'loss': 0.0427, 'grad_norm': 2.1905349478424005, 'learning_rate': 7.881474568362108e-07, 'completion_length': 116.69643020629883, 'rewards/only_full_func_accuracy_reward': 0.6309524476528168, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.08769078738987446, 'kl': 1.06884765625, 'epoch': 0.21}
21%|██ | 909/4286 [5:22:48<18:50:41, 20.09s/it] {'loss': 0.0489, 'grad_norm': 5.7876712281337666, 'learning_rate': 7.879141390573962e-07, 'completion_length': 109.07143020629883, 'rewards/only_full_func_accuracy_reward': 0.5223214775323868, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.504464328289032, 'reward_std': 0.11472323536872864, 'kl': 1.21875, 'epoch': 0.21}
21%|██ | 910/4286 [5:23:04<17:46:01, 18.95s/it] {'loss': 0.0621, 'grad_norm': 3.4429675841552942, 'learning_rate': 7.876808212785814e-07, 'completion_length': 119.0535774230957, 'rewards/only_full_func_accuracy_reward': 0.4866072088479996, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4508930444717407, 'reward_std': 0.1398809552192688, 'kl': 1.5546875, 'epoch': 0.21}
21%|██▏ | 911/4286 [5:23:19<16:32:22, 17.64s/it] {'loss': 0.0177, 'grad_norm': 2.6717840839923315, 'learning_rate': 7.874475034997666e-07, 'completion_length': 117.75000381469727, 'rewards/only_full_func_accuracy_reward': 0.6086309999227524, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.0474053667858243, 'kl': 0.439453125, 'epoch': 0.21}
21%|██▏ | 912/4286 [5:23:35<16:08:58, 17.23s/it] {'loss': 0.0198, 'grad_norm': 3.1632523301822997, 'learning_rate': 7.872141857209518e-07, 'completion_length': 127.03572082519531, 'rewards/only_full_func_accuracy_reward': 0.525297611951828, 'rewards/format_reward': 1.0, 'reward': 1.52529776096344, 'reward_std': 0.09342947974801064, 'kl': 0.49560546875, 'epoch': 0.21}
21%|██▏ | 913/4286 [5:23:50<15:33:01, 16.60s/it] {'loss': 0.0117, 'grad_norm': 1.0943581039854242, 'learning_rate': 7.869808679421372e-07, 'completion_length': 124.8035774230957, 'rewards/only_full_func_accuracy_reward': 0.4092262089252472, 'rewards/format_reward': 1.0, 'reward': 1.4092262983322144, 'reward_std': 0.06250000465661287, 'kl': 0.2939453125, 'epoch': 0.21}
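The recurring stage3.py warning above recommends adding `get_accelerator().empty_cache()` calls so that all ranks flush the allocator cache at the same point in the loop, instead of suffering flushes mid-step under memory pressure. A hedged sketch of what that could look like; the `FLUSH_EVERY` interval and the `training_step` stub are hypothetical placeholders, not from this run, and the import falls back to a no-op when DeepSpeed is not installed:

```python
# Sketch of the fix suggested by the stage3.py warning: every rank flushes the
# allocator cache at the same step, so flushes are synchronized across ranks.
try:
    from deepspeed.accelerator import get_accelerator

    def flush_cache():
        # DeepSpeed's accelerator-agnostic cache flush (empty_cache on CUDA).
        get_accelerator().empty_cache()
except ImportError:
    def flush_cache():  # no-op fallback so the sketch runs without DeepSpeed
        pass

FLUSH_EVERY = 50  # hypothetical interval; tune to how often the warning fires

def training_step(step):
    # forward / backward / optimizer.step() would go here (omitted)
    if step % FLUSH_EVERY == 0:
        flush_cache()  # every rank reaches this at the same step

for step in range(1, 101):
    training_step(step)
```

Whether this helps depends on why memory pressure builds up; the warning's first suggestion (reducing memory consumption, e.g. shorter completions or a smaller per-device batch) is the more direct fix, with synchronized `empty_cache()` as the fallback.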
'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.1357702501118183, 'kl': 0.646484375, 'epoch': 0.21} 21%|██ | 904/4286 [5:21:27<40:02:05, 42.62s/it] 21%|██ | 905/4286 [5:21:47<33:33:23, 35.73s/it] {'loss': 0.0981, 'grad_norm': 4.049178924396786, 'learning_rate': 7.888474101726551e-07, 'completion_length': 136.35715103149414, 'rewards/only_full_func_accuracy_reward': 0.2797619253396988, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.208333432674408, 'reward_std': 0.21938148885965347, 'kl': 2.44921875, 'epoch': 0.21} 21%|██ | 905/4286 [5:21:47<33:33:23, 35.73s/it] 21%|██ | 906/4286 [5:22:02<27:46:54, 29.59s/it] {'loss': 0.0337, 'grad_norm': 3.7746316863679024, 'learning_rate': 7.886140923938404e-07, 'completion_length': 114.53572082519531, 'rewards/only_full_func_accuracy_reward': 0.4866071939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4687501788139343, 'reward_std': 0.1160714328289032, 'kl': 0.837890625, 'epoch': 0.21} 21%|██ | 906/4286 [5:22:02<27:46:54, 29.59s/it] 21%|██ | 907/4286 [5:22:18<23:56:15, 25.50s/it] {'loss': 0.0737, 'grad_norm': 4.384887914000974, 'learning_rate': 7.883807746150256e-07, 'completion_length': 123.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.4880952686071396, 'rewards/format_reward': 1.0, 'reward': 1.4880953431129456, 'reward_std': 0.06881999969482422, 'kl': 1.84375, 'epoch': 0.21} 21%|██ | 907/4286 [5:22:18<23:56:15, 25.50s/it] 21%|██ | 908/4286 [5:22:34<21:03:57, 22.45s/it] {'loss': 0.0427, 'grad_norm': 2.1905349478424005, 'learning_rate': 7.881474568362108e-07, 'completion_length': 116.69643020629883, 'rewards/only_full_func_accuracy_reward': 0.6309524476528168, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.08769078738987446, 'kl': 1.06884765625, 'epoch': 0.21} 21%|██ | 908/4286 [5:22:34<21:03:57, 22.45s/it] 21%|██ | 909/4286 [5:22:48<18:50:41, 20.09s/it] {'loss': 0.0489, 'grad_norm': 5.7876712281337666, 'learning_rate': 
7.879141390573962e-07, 'completion_length': 109.07143020629883, 'rewards/only_full_func_accuracy_reward': 0.5223214775323868, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.504464328289032, 'reward_std': 0.11472323536872864, 'kl': 1.21875, 'epoch': 0.21} 21%|██ | 909/4286 [5:22:48<18:50:41, 20.09s/it] 21%|██ | 910/4286 [5:23:04<17:46:01, 18.95s/it] {'loss': 0.0621, 'grad_norm': 3.4429675841552942, 'learning_rate': 7.876808212785814e-07, 'completion_length': 119.0535774230957, 'rewards/only_full_func_accuracy_reward': 0.4866072088479996, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4508930444717407, 'reward_std': 0.1398809552192688, 'kl': 1.5546875, 'epoch': 0.21} 21%|██ | 910/4286 [5:23:04<17:46:01, 18.95s/it] 21%|██▏ | 911/4286 [5:23:19<16:32:22, 17.64s/it] {'loss': 0.0177, 'grad_norm': 2.6717840839923315, 'learning_rate': 7.874475034997666e-07, 'completion_length': 117.75000381469727, 'rewards/only_full_func_accuracy_reward': 0.6086309999227524, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.0474053667858243, 'kl': 0.439453125, 'epoch': 0.21} 21%|██▏ | 911/4286 [5:23:19<16:32:22, 17.64s/it] 21%|██▏ | 912/4286 [5:23:35<16:08:58, 17.23s/it] {'loss': 0.0198, 'grad_norm': 3.1632523301822997, 'learning_rate': 7.872141857209518e-07, 'completion_length': 127.03572082519531, 'rewards/only_full_func_accuracy_reward': 0.525297611951828, 'rewards/format_reward': 1.0, 'reward': 1.52529776096344, 'reward_std': 0.09342947974801064, 'kl': 0.49560546875, 'epoch': 0.21} 21%|██▏ | 912/4286 [5:23:35<16:08:58, 17.23s/it] 21%|██▏ | 913/4286 [5:23:50<15:33:01, 16.60s/it] {'loss': 0.0117, 'grad_norm': 1.0943581039854242, 'learning_rate': 7.869808679421372e-07, 'completion_length': 124.8035774230957, 'rewards/only_full_func_accuracy_reward': 0.4092262089252472, 'rewards/format_reward': 1.0, 'reward': 1.4092262983322144, 'reward_std': 0.06250000465661287, 'kl': 0.2939453125, 'epoch': 0.21} 21%|██▏ | 913/4286 [5:23:50<15:33:01, 16.60s/it] 
21%|██▏ | 914/4286 [5:24:06<15:08:56, 16.17s/it] {'loss': 0.0245, 'grad_norm': 3.2090942258192747, 'learning_rate': 7.867475501633224e-07, 'completion_length': 130.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.5520833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5342263579368591, 'reward_std': 0.13173232600092888, 'kl': 0.61279296875, 'epoch': 0.21} 21%|██▏ | 914/4286 [5:24:06<15:08:56, 16.17s/it] 21%|██▏ | 915/4286 [5:24:28<17:00:59, 18.17s/it] {'loss': 0.0598, 'grad_norm': 5.4256135764186695, 'learning_rate': 7.865142323845076e-07, 'completion_length': 133.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.4241071790456772, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3705357909202576, 'reward_std': 0.16410495713353157, 'kl': 1.4921875, 'epoch': 0.21} 21%|██▏ | 915/4286 [5:24:28<17:00:59, 18.17s/it] 21%|██▏ | 916/4286 [5:24:46<16:57:18, 18.11s/it] {'loss': 0.0207, 'grad_norm': 9.158790775242547, 'learning_rate': 7.862809146056929e-07, 'completion_length': 143.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4583333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4404762983322144, 'reward_std': 0.06769215315580368, 'kl': 0.51611328125, 'epoch': 0.21} 21%|██▏ | 916/4286 [5:24:46<16:57:18, 18.11s/it] 21%|██▏ | 917/4286 [5:25:03<16:26:17, 17.57s/it] {'loss': 0.0134, 'grad_norm': 4.906257157684802, 'learning_rate': 7.860475968268782e-07, 'completion_length': 129.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4136905074119568, 'rewards/format_reward': 1.0, 'reward': 1.4136905670166016, 'reward_std': 0.10808509588241577, 'kl': 0.333984375, 'epoch': 0.21} 21%|██▏ | 917/4286 [5:25:03<16:26:17, 17.57s/it] 21%|██▏ | 918/4286 [5:25:19<16:00:17, 17.11s/it] {'loss': 0.01, 'grad_norm': 8.111482902392584, 'learning_rate': 7.858142790480634e-07, 'completion_length': 125.16071701049805, 'rewards/only_full_func_accuracy_reward': 0.476190522313118, 'rewards/format_reward': 1.0, 'reward': 
1.4761906266212463, 'reward_std': 0.10787045955657959, 'kl': 0.24951171875, 'epoch': 0.21} 21%|██▏ | 918/4286 [5:25:19<16:00:17, 17.11s/it] 21%|██▏ | 919/4286 [5:25:35<15:40:52, 16.77s/it] {'loss': 0.0655, 'grad_norm': 3.56136315452589, 'learning_rate': 7.855809612692487e-07, 'completion_length': 140.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.453869104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4181548953056335, 'reward_std': 0.11840657889842987, 'kl': 1.6435546875, 'epoch': 0.21} 21%|██▏ | 919/4286 [5:25:35<15:40:52, 16.77s/it] 21%|██▏ | 920/4286 [5:25:52<15:44:20, 16.83s/it] {'loss': 0.0563, 'grad_norm': 3.6262710849937076, 'learning_rate': 7.853476434904339e-07, 'completion_length': 142.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.3526785969734192, 'rewards/format_reward': 1.0, 'reward': 1.3526787161827087, 'reward_std': 0.06514770910143852, 'kl': 1.41015625, 'epoch': 0.21} 21%|██▏ | 920/4286 [5:25:52<15:44:20, 16.83s/it] 21%|██▏ | 921/4286 [5:26:07<15:22:07, 16.44s/it] {'loss': 0.0195, 'grad_norm': 1.8715926117637065, 'learning_rate': 7.851143257116192e-07, 'completion_length': 138.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.4092262238264084, 'rewards/format_reward': 1.0, 'reward': 1.4092262983322144, 'reward_std': 0.008928571827709675, 'kl': 0.48828125, 'epoch': 0.21} 21%|██▏ | 921/4286 [5:26:07<15:22:07, 16.44s/it] 22%|██▏ | 922/4286 [5:26:24<15:20:17, 16.41s/it] {'loss': 0.0408, 'grad_norm': 3.927324581744601, 'learning_rate': 7.848810079328045e-07, 'completion_length': 136.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.5910714566707611, 'rewards/format_reward': 1.0, 'reward': 1.5910715460777283, 'reward_std': 0.06715028919279575, 'kl': 1.021484375, 'epoch': 0.22} 22%|██▏ | 922/4286 [5:26:24<15:20:17, 16.41s/it] 22%|██▏ | 923/4286 [5:26:40<15:15:25, 16.33s/it] {'loss': 0.0248, 'grad_norm': 2.201078527050963, 'learning_rate': 7.846476901539897e-07, 'completion_length': 
137.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4583333283662796, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4404762983322144, 'reward_std': 0.09842910245060921, 'kl': 0.6171875, 'epoch': 0.22} 22%|██▏ | 923/4286 [5:26:40<15:15:25, 16.33s/it] 22%|██▏ | 924/4286 [5:26:56<15:15:11, 16.33s/it] {'loss': 0.0459, 'grad_norm': 2.584685347503284, 'learning_rate': 7.844143723751749e-07, 'completion_length': 137.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.4776785969734192, 'rewards/format_reward': 1.0, 'reward': 1.4776787161827087, 'reward_std': 0.0771672772243619, 'kl': 1.146484375, 'epoch': 0.22} 22%|██▏ | 924/4286 [5:26:56<15:15:11, 16.33s/it] 22%|██▏ | 925/4286 [5:27:21<17:34:45, 18.83s/it] {'loss': 0.0877, 'grad_norm': 3.878460185795673, 'learning_rate': 7.841810545963601e-07, 'completion_length': 151.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.40284866094589233, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3492772579193115, 'reward_std': 0.202383391559124, 'kl': 2.1953125, 'epoch': 0.22} 22%|██▏ | 925/4286 [5:27:21<17:34:45, 18.83s/it] 22%|██▏ | 926/4286 [5:27:38<17:02:47, 18.26s/it] {'loss': 0.0714, 'grad_norm': 2.4129008347050034, 'learning_rate': 7.839477368175455e-07, 'completion_length': 141.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.42619049549102783, 'rewards/format_reward': 1.0, 'reward': 1.4261906147003174, 'reward_std': 0.10404741019010544, 'kl': 1.7890625, 'epoch': 0.22} 22%|██▏ | 926/4286 [5:27:38<17:02:47, 18.26s/it] 22%|██▏ | 927/4286 [5:27:55<16:46:24, 17.98s/it] {'loss': 0.0523, 'grad_norm': 5.719687905025395, 'learning_rate': 7.837144190387307e-07, 'completion_length': 133.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.4255952686071396, 'rewards/format_reward': 1.0, 'reward': 1.4255953431129456, 'reward_std': 0.06136548891663551, 'kl': 1.30859375, 'epoch': 0.22} 22%|██▏ | 927/4286 [5:27:55<16:46:24, 17.98s/it] 22%|██▏ | 928/4286 [5:28:13<16:42:15, 
17.91s/it] {'loss': 0.0708, 'grad_norm': 8.082541468130303, 'learning_rate': 7.834811012599159e-07, 'completion_length': 150.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.4571428894996643, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4392857551574707, 'reward_std': 0.15520573034882545, 'kl': 1.76953125, 'epoch': 0.22} 22%|██▏ | 928/4286 [5:28:13<16:42:15, 17.91s/it] 22%|██▏ | 929/4286 [5:28:30<16:34:21, 17.77s/it] {'loss': 0.0378, 'grad_norm': 8.074390701132478, 'learning_rate': 7.832477834811012e-07, 'completion_length': 142.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.4928571879863739, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.457142949104309, 'reward_std': 0.17062390595674515, 'kl': 0.9423828125, 'epoch': 0.22} 22%|██▏ | 929/4286 [5:28:30<16:34:21, 17.77s/it] 22%|██▏ | 930/4286 [5:28:48<16:30:25, 17.71s/it] {'loss': 0.0278, 'grad_norm': 14.183170677932225, 'learning_rate': 7.830144657022865e-07, 'completion_length': 142.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.4479166865348816, 'rewards/format_reward': 1.0, 'reward': 1.4479167461395264, 'reward_std': 0.08152472600340843, 'kl': 0.697265625, 'epoch': 0.22} 22%|██▏ | 930/4286 [5:28:48<16:30:25, 17.71s/it] 22%|██▏ | 931/4286 [5:29:05<16:20:46, 17.54s/it] {'loss': 0.0455, 'grad_norm': 6.237283277641339, 'learning_rate': 7.827811479234717e-07, 'completion_length': 136.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.4940476268529892, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4761905670166016, 'reward_std': 0.1944977566599846, 'kl': 1.138671875, 'epoch': 0.22} 22%|██▏ | 931/4286 [5:29:05<16:20:46, 17.54s/it] 22%|██▏ | 932/4286 [5:29:23<16:29:44, 17.71s/it] {'loss': 0.0717, 'grad_norm': 5.037647716638248, 'learning_rate': 7.82547830144657e-07, 'completion_length': 146.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.48184525966644287, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4639881253242493, 
'reward_std': 0.16061294078826904, 'kl': 1.79296875, 'epoch': 0.22} 22%|██▏ | 932/4286 [5:29:23<16:29:44, 17.71s/it] 22%|██▏ | 933/4286 [5:29:39<16:10:21, 17.36s/it] {'loss': 0.0921, 'grad_norm': 5.173087174720211, 'learning_rate': 7.823145123658422e-07, 'completion_length': 134.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.447916716337204, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.376488208770752, 'reward_std': 0.3384437710046768, 'kl': 2.296875, 'epoch': 0.22} 22%|██▏ | 933/4286 [5:29:39<16:10:21, 17.36s/it] 22%|██▏ | 934/4286 [5:29:59<16:48:47, 18.06s/it] {'loss': 0.0959, 'grad_norm': 4.133396561692214, 'learning_rate': 7.820811945870275e-07, 'completion_length': 147.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.543154776096344, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4717262983322144, 'reward_std': 0.26172637939453125, 'kl': 2.390625, 'epoch': 0.22} 22%|██▏ | 934/4286 [5:29:59<16:48:47, 18.06s/it] 22%|██▏ | 935/4286 [5:30:16<16:32:05, 17.76s/it] {'loss': 0.0779, 'grad_norm': 6.009414140355868, 'learning_rate': 7.818478768082127e-07, 'completion_length': 139.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571428656578064, 'reward_std': 0.2069520354270935, 'kl': 1.94921875, 'epoch': 0.22} 22%|██▏ | 935/4286 [5:30:16<16:32:05, 17.76s/it] 22%|██▏ | 936/4286 [5:30:41<18:30:45, 19.89s/it] {'loss': 0.1401, 'grad_norm': 3.4167651213249894, 'learning_rate': 7.81614559029398e-07, 'completion_length': 187.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.43273812532424927, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.2898810505867004, 'reward_std': 0.39971470832824707, 'kl': 3.5, 'epoch': 0.22} 22%|██▏ | 936/4286 [5:30:41<18:30:45, 19.89s/it] 22%|██▏ | 937/4286 [5:31:02<18:48:57, 20.23s/it] {'loss': 0.0524, 'grad_norm': 8.720430181507975, 'learning_rate': 7.813812412505832e-07, 'completion_length': 
167.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.3418154865503311, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3239584565162659, 'reward_std': 0.1799655184149742, 'kl': 1.30859375, 'epoch': 0.22} 22%|██▏ | 937/4286 [5:31:02<18:48:57, 20.23s/it] 22%|██▏ | 938/4286 [5:31:24<19:17:22, 20.74s/it] {'loss': 0.0979, 'grad_norm': 6.326026164905572, 'learning_rate': 7.811479234717685e-07, 'completion_length': 166.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.340773843228817, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.2336310148239136, 'reward_std': 0.31039440631866455, 'kl': 2.4375, 'epoch': 0.22} 22%|██▏ | 938/4286 [5:31:24<19:17:22, 20.74s/it] 22%|██▏ | 939/4286 [5:31:41<18:16:18, 19.65s/it] {'loss': 0.0703, 'grad_norm': 4.93546187619673, 'learning_rate': 7.809146056929538e-07, 'completion_length': 149.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.4985118955373764, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4627977013587952, 'reward_std': 0.23559240996837616, 'kl': 1.7578125, 'epoch': 0.22} 22%|██▏ | 939/4286 [5:31:41<18:16:18, 19.65s/it][2025-03-02 10:39:17,248] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 22%|██▏ | 940/4286 [5:32:01<18:25:06, 19.82s/it] {'loss': 0.0757, 'grad_norm': 5.7667391029092006, 'learning_rate': 7.80681287914139e-07, 'completion_length': 167.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.47746603190898895, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4417518377304077, 'reward_std': 0.1954110935330391, 'kl': 1.89453125, 'epoch': 0.22} 22%|██▏ | 940/4286 [5:32:01<18:25:06, 19.82s/it] 22%|██▏ | 941/4286 [5:32:21<18:16:10, 19.66s/it] {'loss': 0.12, 'grad_norm': 6.143526556141274, 'learning_rate': 7.804479701353242e-07, 'completion_length': 154.08928680419922, 'rewards/only_full_func_accuracy_reward': 0.427083358168602, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3735120296478271, 'reward_std': 0.20087750256061554, 'kl': 3.0078125, 'epoch': 0.22} 22%|██▏ | 941/4286 [5:32:21<18:16:10, 19.66s/it] 22%|██▏ | 942/4286 [5:32:38<17:42:21, 19.06s/it] {'loss': 0.1439, 'grad_norm': 6.645017863582971, 'learning_rate': 7.802146523565096e-07, 'completion_length': 159.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.4125000089406967, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3410715460777283, 'reward_std': 0.26749545335769653, 'kl': 3.59375, 'epoch': 0.22} 22%|██▏ | 942/4286 [5:32:38<17:42:21, 19.06s/it] 22%|██▏ | 943/4286 [5:32:59<18:08:31, 19.54s/it] {'loss': 0.109, 'grad_norm': 6.795272547540573, 'learning_rate': 7.799813345776948e-07, 'completion_length': 169.96428680419922, 'rewards/only_full_func_accuracy_reward': 0.3616071790456772, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2723215222358704, 'reward_std': 0.2932039201259613, 'kl': 2.71875, 'epoch': 0.22} 22%|██▏ | 943/4286 [5:32:59<18:08:31, 19.54s/it] 22%|██▏ | 944/4286 [5:33:21<18:43:03, 20.16s/it] {'loss': 0.149, 'grad_norm': 
4.534891467162871, 'learning_rate': 7.7974801679888e-07, 'completion_length': 178.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.3291667103767395, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2398810982704163, 'reward_std': 0.302002876996994, 'kl': 3.7265625, 'epoch': 0.22} 22%|██▏ | 944/4286 [5:33:21<18:43:03, 20.16s/it] 22%|██▏ | 945/4286 [5:33:42<19:07:58, 20.62s/it] {'loss': 0.0937, 'grad_norm': 6.281728620828877, 'learning_rate': 7.795146990200653e-07, 'completion_length': 173.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.43869052827358246, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4029763340950012, 'reward_std': 0.1668248139321804, 'kl': 2.34375, 'epoch': 0.22} 22%|██▏ | 945/4286 [5:33:42<19:07:58, 20.62s/it] 22%|██▏ | 946/4286 [5:34:02<18:45:04, 20.21s/it] {'loss': 0.0719, 'grad_norm': 8.449680237601159, 'learning_rate': 7.792813812412506e-07, 'completion_length': 170.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.41686511039733887, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3632937669754028, 'reward_std': 0.23590171337127686, 'kl': 1.79296875, 'epoch': 0.22} 22%|██▏ | 946/4286 [5:34:02<18:45:04, 20.21s/it] 22%|██▏ | 947/4286 [5:34:23<19:13:47, 20.73s/it] {'loss': 0.0642, 'grad_norm': 26.725872826412264, 'learning_rate': 7.790480634624358e-07, 'completion_length': 176.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.42767859995365143, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3562501668930054, 'reward_std': 0.23328383266925812, 'kl': 1.609375, 'epoch': 0.22} 22%|██▏ | 947/4286 [5:34:23<19:13:47, 20.73s/it] 22%|██▏ | 948/4286 [5:34:41<18:19:52, 19.77s/it] {'loss': 0.0407, 'grad_norm': 2.6072115018149566, 'learning_rate': 7.78814745683621e-07, 'completion_length': 154.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.5059524327516556, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4880953431129456, 'reward_std': 0.17708319425582886, 
'kl': 1.017578125, 'epoch': 0.22} 22%|██▏ | 948/4286 [5:34:41<18:19:52, 19.77s/it] 22%|██▏ | 949/4286 [5:35:02<18:40:52, 20.15s/it] {'loss': 0.0226, 'grad_norm': 4.388671020470896, 'learning_rate': 7.785814279048063e-07, 'completion_length': 170.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.45148810744285583, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.415773868560791, 'reward_std': 0.13101869821548462, 'kl': 0.5634765625, 'epoch': 0.22} 22%|██▏ | 949/4286 [5:35:02<18:40:52, 20.15s/it] 22%|██▏ | 950/4286 [5:35:19<17:52:59, 19.30s/it] {'loss': 0.0222, 'grad_norm': 5.365532117346111, 'learning_rate': 7.783481101259915e-07, 'completion_length': 159.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.46547621488571167, 'rewards/format_reward': 1.0, 'reward': 1.4654762148857117, 'reward_std': 0.10982584953308105, 'kl': 0.55322265625, 'epoch': 0.22} 22%|██▏ | 950/4286 [5:35:19<17:52:59, 19.30s/it] 22%|██▏ | 951/4286 [5:35:41<18:35:20, 20.07s/it] {'loss': 0.0245, 'grad_norm': 3.9604923938951795, 'learning_rate': 7.781147923471768e-07, 'completion_length': 183.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.616071492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5982143878936768, 'reward_std': 0.14113281667232513, 'kl': 0.61328125, 'epoch': 0.22} 22%|██▏ | 951/4286 [5:35:41<18:35:20, 20.07s/it] 22%|██▏ | 952/4286 [5:35:59<17:49:52, 19.25s/it] {'loss': 0.0121, 'grad_norm': 3.9080285461515163, 'learning_rate': 7.778814745683621e-07, 'completion_length': 149.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.4261905252933502, 'rewards/format_reward': 1.0, 'reward': 1.4261905550956726, 'reward_std': 0.07240444235503674, 'kl': 0.3037109375, 'epoch': 0.22} 22%|██▏ | 952/4286 [5:35:59<17:49:52, 19.25s/it] 22%|██▏ | 953/4286 [5:36:16<17:16:49, 18.66s/it] {'loss': 0.01, 'grad_norm': 3.4404940326046223, 'learning_rate': 7.776481567895473e-07, 'completion_length': 160.8214340209961, 
'rewards/only_full_func_accuracy_reward': 0.4694727957248688, 'rewards/format_reward': 1.0, 'reward': 1.469472885131836, 'reward_std': 0.04718721937388182, 'kl': 0.24951171875, 'epoch': 0.22} 22%|██▏ | 953/4286 [5:36:16<17:16:49, 18.66s/it] 22%|██▏ | 954/4286 [5:36:33<16:46:31, 18.12s/it] {'loss': 0.0235, 'grad_norm': 3.8440053028564063, 'learning_rate': 7.774148390107325e-07, 'completion_length': 146.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.578869104385376, 'rewards/format_reward': 1.0, 'reward': 1.5788691639900208, 'reward_std': 0.1312018446624279, 'kl': 0.58984375, 'epoch': 0.22} 22%|██▏ | 954/4286 [5:36:33<16:46:31, 18.12s/it] 22%|██▏ | 955/4286 [5:36:50<16:27:37, 17.79s/it] {'loss': 0.009, 'grad_norm': 9.392811492734788, 'learning_rate': 7.771815212319179e-07, 'completion_length': 159.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.3943452537059784, 'rewards/format_reward': 1.0, 'reward': 1.3943453431129456, 'reward_std': 0.11832712218165398, 'kl': 0.22412109375, 'epoch': 0.22} 22%|██▏ | 955/4286 [5:36:50<16:27:37, 17.79s/it] 22%|██▏ | 956/4286 [5:37:08<16:36:17, 17.95s/it] {'loss': 0.0182, 'grad_norm': 10.144741252566147, 'learning_rate': 7.769482034531031e-07, 'completion_length': 156.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.6420068293809891, 'rewards/format_reward': 1.0, 'reward': 1.6420068740844727, 'reward_std': 0.07670862227678299, 'kl': 0.4560546875, 'epoch': 0.22} 22%|██▏ | 956/4286 [5:37:08<16:36:17, 17.95s/it] 22%|██▏ | 957/4286 [5:37:27<16:52:25, 18.25s/it] {'loss': 0.0248, 'grad_norm': 17.843559896126326, 'learning_rate': 7.767148856742883e-07, 'completion_length': 160.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.46934525668621063, 'rewards/format_reward': 1.0, 'reward': 1.4693453907966614, 'reward_std': 0.14954517781734467, 'kl': 0.6201171875, 'epoch': 0.22} 22%|██▏ | 957/4286 [5:37:27<16:52:25, 18.25s/it] 22%|██▏ | 958/4286 [5:37:50<18:02:48, 19.52s/it] {'loss': 0.0621, 'grad_norm': 
5.374137993145502, 'learning_rate': 7.764815678954735e-07, 'completion_length': 175.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.2931547835469246, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2395833730697632, 'reward_std': 0.15915242210030556, 'kl': 1.55078125, 'epoch': 0.22} 22%|██▏ | 958/4286 [5:37:50<18:02:48, 19.52s/it] 22%|██▏ | 959/4286 [5:38:07<17:25:28, 18.85s/it] {'loss': 0.0177, 'grad_norm': 4.635063621850708, 'learning_rate': 7.762482501166589e-07, 'completion_length': 157.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.48928575217723846, 'rewards/format_reward': 1.0, 'reward': 1.4892858266830444, 'reward_std': 0.10898284986615181, 'kl': 0.44140625, 'epoch': 0.22} 22%|██▏ | 959/4286 [5:38:07<17:25:28, 18.85s/it] 22%|██▏ | 960/4286 [5:38:24<16:54:59, 18.31s/it] {'loss': 0.0247, 'grad_norm': 4.746232147243405, 'learning_rate': 7.760149323378441e-07, 'completion_length': 157.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.5386905074119568, 'rewards/format_reward': 1.0, 'reward': 1.5386905670166016, 'reward_std': 0.08512193337082863, 'kl': 0.6171875, 'epoch': 0.22} 22%|██▏ | 960/4286 [5:38:24<16:54:59, 18.31s/it] 22%|██▏ | 961/4286 [5:38:43<17:12:16, 18.63s/it] {'loss': 0.0332, 'grad_norm': 12.346434630624767, 'learning_rate': 7.757816145590293e-07, 'completion_length': 158.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.3857143223285675, 'rewards/format_reward': 1.0, 'reward': 1.3857142925262451, 'reward_std': 0.09708873927593231, 'kl': 0.828125, 'epoch': 0.22} 22%|██▏ | 961/4286 [5:38:43<17:12:16, 18.63s/it] 22%|██▏ | 962/4286 [5:39:02<17:15:54, 18.70s/it] {'loss': 0.0303, 'grad_norm': 7.64373510310957, 'learning_rate': 7.755482967802146e-07, 'completion_length': 169.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.46934525668621063, 'rewards/format_reward': 1.0, 'reward': 1.4693452715873718, 'reward_std': 0.12594518065452576, 'kl': 0.7578125, 'epoch': 0.22} 22%|██▏ | 962/4286 
[5:39:02<17:15:54, 18.70s/it] 22%|██▏ | 963/4286 [5:39:22<17:37:14, 19.09s/it] {'loss': 0.0665, 'grad_norm': 7.958217141246122, 'learning_rate': 7.753149790013999e-07, 'completion_length': 150.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.4940476566553116, 'rewards/format_reward': 1.0, 'reward': 1.49404776096344, 'reward_std': 0.046605405397713184, 'kl': 1.6630859375, 'epoch': 0.22} 22%|██▏ | 963/4286 [5:39:22<17:37:14, 19.09s/it] 22%|██▏ | 964/4286 [5:39:42<17:43:28, 19.21s/it] {'loss': 0.0523, 'grad_norm': 17.144765537524776, 'learning_rate': 7.750816612225851e-07, 'completion_length': 182.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.3467262089252472, 'rewards/format_reward': 1.0, 'reward': 1.3467262387275696, 'reward_std': 0.19456805288791656, 'kl': 1.3046875, 'epoch': 0.22} 22%|██▏ | 964/4286 [5:39:42<17:43:28, 19.21s/it] 23%|██▎ | 965/4286 [5:40:00<17:30:36, 18.98s/it] {'loss': 0.0272, 'grad_norm': 6.411842215005899, 'learning_rate': 7.748483434437704e-07, 'completion_length': 158.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5252976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5074405670166016, 'reward_std': 0.14540597796440125, 'kl': 0.6796875, 'epoch': 0.23} 23%|██▎ | 965/4286 [5:40:00<17:30:36, 18.98s/it] 23%|██▎ | 966/4286 [5:40:23<18:35:09, 20.15s/it] {'loss': 0.0784, 'grad_norm': 62.94628332329281, 'learning_rate': 7.746150256649556e-07, 'completion_length': 170.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.2647186294198036, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2290043830871582, 'reward_std': 0.17780663818120956, 'kl': 1.9609375, 'epoch': 0.23} 23%|██▎ | 966/4286 [5:40:23<18:35:09, 20.15s/it] 23%|██▎ | 967/4286 [5:40:43<18:39:54, 20.25s/it] {'loss': 0.0545, 'grad_norm': 8.106693071695073, 'learning_rate': 7.743817078861409e-07, 'completion_length': 156.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.3998724967241287, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.3820154070854187, 'reward_std': 0.20183801651000977, 'kl': 1.36328125, 'epoch': 0.23} 23%|██▎ | 967/4286 [5:40:43<18:39:54, 20.25s/it] 23%|██▎ | 968/4286 [5:41:02<18:07:31, 19.67s/it] {'loss': 0.0254, 'grad_norm': 14.607150221032317, 'learning_rate': 7.741483901073262e-07, 'completion_length': 150.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.5122024267911911, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4943453669548035, 'reward_std': 0.23435164988040924, 'kl': 0.634765625, 'epoch': 0.23} 23%|██▎ | 968/4286 [5:41:02<18:07:31, 19.67s/it] 23%|██▎ | 969/4286 [5:41:19<17:31:40, 19.02s/it] {'loss': 0.0555, 'grad_norm': 8.912290933545675, 'learning_rate': 7.739150723285114e-07, 'completion_length': 145.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.318452388048172, 'rewards/format_reward': 1.0, 'reward': 1.318452537059784, 'reward_std': 0.10234637558460236, 'kl': 1.38671875, 'epoch': 0.23} 23%|██▎ | 969/4286 [5:41:19<17:31:40, 19.02s/it] 23%|██▎ | 970/4286 [5:41:37<17:10:54, 18.65s/it] {'loss': 0.0224, 'grad_norm': 19.39170424709685, 'learning_rate': 7.736817545496966e-07, 'completion_length': 158.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.566964328289032, 'rewards/format_reward': 1.0, 'reward': 1.5669644474983215, 'reward_std': 0.14439816772937775, 'kl': 0.560546875, 'epoch': 0.23} 23%|██▎ | 970/4286 [5:41:37<17:10:54, 18.65s/it] 23%|██▎ | 971/4286 [5:41:55<17:02:08, 18.50s/it] {'loss': 0.0173, 'grad_norm': 8.55579409682921, 'learning_rate': 7.734484367708819e-07, 'completion_length': 161.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.4854167103767395, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4675596356391907, 'reward_std': 0.14258311688899994, 'kl': 0.431640625, 'epoch': 0.23} 23%|██▎ | 971/4286 [5:41:55<17:02:08, 18.50s/it] 23%|██▎ | 972/4286 [5:42:14<17:10:26, 18.66s/it] {'loss': 0.0336, 'grad_norm': 4.195397050097357, 'learning_rate': 
7.732151189920672e-07, 'completion_length': 152.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.4291667193174362, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4113096594810486, 'reward_std': 0.07714043371379375, 'kl': 0.83984375, 'epoch': 0.23} 23%|██▎ | 972/4286 [5:42:14<17:10:26, 18.66s/it] 23%|██▎ | 973/4286 [5:42:32<16:49:35, 18.28s/it] {'loss': 0.0181, 'grad_norm': 6.604079330348119, 'learning_rate': 7.729818012132524e-07, 'completion_length': 160.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.4747024178504944, 'rewards/format_reward': 1.0, 'reward': 1.4747024774551392, 'reward_std': 0.09056831896305084, 'kl': 0.4521484375, 'epoch': 0.23} 23%|██▎ | 973/4286 [5:42:32<16:49:35, 18.28s/it] 23%|██▎ | 974/4286 [5:42:50<16:45:33, 18.22s/it] {'loss': 0.0263, 'grad_norm': 5.2627433485188195, 'learning_rate': 7.727484834344376e-07, 'completion_length': 158.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.4479166716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4300596117973328, 'reward_std': 0.18465755879878998, 'kl': 0.658203125, 'epoch': 0.23} 23%|██▎ | 974/4286 [5:42:50<16:45:33, 18.22s/it] 23%|██▎ | 975/4286 [5:43:05<16:06:16, 17.51s/it] {'loss': 0.0185, 'grad_norm': 2.5076097601938607, 'learning_rate': 7.72515165655623e-07, 'completion_length': 138.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5565476715564728, 'rewards/format_reward': 1.0, 'reward': 1.5565477013587952, 'reward_std': 0.08189816027879715, 'kl': 0.462890625, 'epoch': 0.23} 23%|██▎ | 975/4286 [5:43:05<16:06:16, 17.51s/it] 23%|██▎ | 976/4286 [5:43:23<16:10:22, 17.59s/it] {'loss': 0.0095, 'grad_norm': 17.25408044545066, 'learning_rate': 7.722818478768082e-07, 'completion_length': 160.25, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.598214328289032, 'reward_std': 0.09548483043909073, 'kl': 0.23779296875, 'epoch': 0.23} 23%|██▎ | 976/4286 [5:43:23<16:10:22, 17.59s/it] 
23%|██▎ | 977/4286 [5:43:41<16:13:17, 17.65s/it] {'loss': 0.0292, 'grad_norm': 5.228845071643081, 'learning_rate': 7.720485300979934e-07, 'completion_length': 164.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.4910714775323868, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4553572535514832, 'reward_std': 0.15210112184286118, 'kl': 0.73046875, 'epoch': 0.23}
23%|██▎ | 978/4286 [5:44:00<16:27:54, 17.92s/it] {'loss': 0.0422, 'grad_norm': 6.595724952702475, 'learning_rate': 7.718152123191787e-07, 'completion_length': 157.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.34226194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3244048953056335, 'reward_std': 0.10467883199453354, 'kl': 1.052734375, 'epoch': 0.23}
23%|██▎ | 979/4286 [5:44:17<16:11:09, 17.62s/it] {'loss': 0.0317, 'grad_norm': 11.212473586087139, 'learning_rate': 7.715818945403639e-07, 'completion_length': 147.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5390625, 'rewards/format_reward': 1.0, 'reward': 1.5390626192092896, 'reward_std': 0.11457234993577003, 'kl': 0.7919921875, 'epoch': 0.23}
23%|██▎ | 980/4286 [5:44:36<16:39:11, 18.13s/it] {'loss': 0.1773, 'grad_norm': 5140.068227648574, 'learning_rate': 7.713485767615492e-07, 'completion_length': 170.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.3973214477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3794643878936768, 'reward_std': 0.10252152383327484, 'kl': 4.42578125, 'epoch': 0.23}
23%|██▎ | 981/4286 [5:44:54<16:38:29, 18.13s/it] {'loss': 0.0358, 'grad_norm': 5.801367065001799, 'learning_rate': 7.711152589827344e-07, 'completion_length': 152.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.4583333432674408, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4047619700431824, 'reward_std': 0.12221009097993374, 'kl': 0.892578125, 'epoch': 0.23}
23%|██▎ | 982/4286 [5:45:13<16:48:14, 18.31s/it] {'loss': 0.0356, 'grad_norm': 4.828107674085641, 'learning_rate': 7.708819412039197e-07, 'completion_length': 153.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.4699404835700989, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4342262744903564, 'reward_std': 0.1353127434849739, 'kl': 0.888671875, 'epoch': 0.23}
23%|██▎ | 983/4286 [5:45:33<17:27:46, 19.03s/it] {'loss': 0.0914, 'grad_norm': 9.01925469589193, 'learning_rate': 7.706486234251049e-07, 'completion_length': 190.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.31785716116428375, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2642858028411865, 'reward_std': 0.2153090052306652, 'kl': 2.2890625, 'epoch': 0.23}
23%|██▎ | 984/4286 [5:45:54<17:55:35, 19.54s/it] {'loss': 0.0312, 'grad_norm': 7.440381867957189, 'learning_rate': 7.704153056462902e-07, 'completion_length': 170.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.4627976268529892, 'rewards/format_reward': 1.0, 'reward': 1.46279776096344, 'reward_std': 0.15855278819799423, 'kl': 0.7802734375, 'epoch': 0.23}
23%|██▎ | 985/4286 [5:46:14<17:53:15, 19.51s/it] {'loss': 0.0864, 'grad_norm': 8.094745020394079, 'learning_rate': 7.701819878674755e-07, 'completion_length': 168.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.48273812234401703, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3934524655342102, 'reward_std': 0.332465335726738, 'kl': 2.16015625, 'epoch': 0.23}
23%|██▎ | 986/4286 [5:46:30<16:55:42, 18.47s/it] {'loss': 0.0348, 'grad_norm': 4.22508356066484, 'learning_rate': 7.699486700886607e-07, 'completion_length': 123.57143020629883, 'rewards/only_full_func_accuracy_reward': 0.5997024327516556, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5818454027175903, 'reward_std': 0.1220238208770752, 'kl': 0.87109375, 'epoch': 0.23}
23%|██▎ | 987/4286 [5:46:50<17:19:44, 18.91s/it] {'loss': 0.0537, 'grad_norm': 13.668912394412153, 'learning_rate': 7.697153523098459e-07, 'completion_length': 164.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5029762238264084, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4494048953056335, 'reward_std': 0.160714291036129, 'kl': 1.337890625, 'epoch': 0.23}
23%|██▎ | 988/4286 [5:47:09<17:30:46, 19.12s/it] {'loss': 0.0701, 'grad_norm': 45.73359706866669, 'learning_rate': 7.694820345310313e-07, 'completion_length': 170.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.460416704416275, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4247024655342102, 'reward_std': 0.1722782775759697, 'kl': 1.75390625, 'epoch': 0.23}
23%|██▎ | 989/4286 [5:47:27<17:02:12, 18.60s/it] {'loss': 0.055, 'grad_norm': 10.577141192881557, 'learning_rate': 7.692487167522165e-07, 'completion_length': 154.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.407738134264946, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3720239400863647, 'reward_std': 0.1554226651787758, 'kl': 1.375, 'epoch': 0.23}
23%|██▎ | 990/4286 [5:47:49<17:57:46, 19.62s/it] {'loss': 0.1036, 'grad_norm': 8.597393140084467, 'learning_rate': 7.690153989734017e-07, 'completion_length': 178.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.3690476566553116, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.24404776096344, 'reward_std': 0.3614310175180435, 'kl': 2.5859375, 'epoch': 0.23}
23%|██▎ | 991/4286 [5:48:07<17:45:04, 19.39s/it] {'loss': 0.0389, 'grad_norm': 23.10650944789998, 'learning_rate': 7.68782081194587e-07, 'completion_length': 167.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.351190522313118, 'rewards/format_reward': 1.0, 'reward': 1.3511905670166016, 'reward_std': 0.1254306584596634, 'kl': 0.97265625, 'epoch': 0.23}
23%|██▎ | 992/4286 [5:48:32<19:10:09, 20.95s/it] {'loss': 0.0412, 'grad_norm': 3.8962651561588064, 'learning_rate': 7.685487634157723e-07, 'completion_length': 171.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.42772112786769867, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3562925457954407, 'reward_std': 0.13807456195354462, 'kl': 1.03125, 'epoch': 0.23}
23%|██▎ | 993/4286 [5:48:51<18:41:43, 20.44s/it] {'loss': 0.0296, 'grad_norm': 1.811769152349972, 'learning_rate': 7.683154456369575e-07, 'completion_length': 190.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5214286148548126, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4857144355773926, 'reward_std': 0.07077575381845236, 'kl': 0.74267578125, 'epoch': 0.23}
23%|██▎ | 994/4286 [5:49:09<17:53:52, 19.57s/it] {'loss': 0.0139, 'grad_norm': 3.417070397594248, 'learning_rate': 7.680821278581427e-07, 'completion_length': 177.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.4583333879709244, 'rewards/format_reward': 1.0, 'reward': 1.4583334922790527, 'reward_std': 0.058175613172352314, 'kl': 0.345703125, 'epoch': 0.23}
23%|██▎ | 995/4286 [5:49:31<18:39:24, 20.41s/it] {'loss': 0.0424, 'grad_norm': 3.7452916854350278, 'learning_rate': 7.67848810079328e-07, 'completion_length': 175.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.5580357313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5223215222358704, 'reward_std': 0.22550421208143234, 'kl': 1.060546875, 'epoch': 0.23}
23%|██▎ | 996/4286 [5:49:48<17:44:58, 19.42s/it] {'loss': 0.0367, 'grad_norm': 1.8812563314386577, 'learning_rate': 7.676154923005133e-07, 'completion_length': 147.76786422729492, 'rewards/only_full_func_accuracy_reward': 0.5267857015132904, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5089287161827087, 'reward_std': 0.10738959535956383, 'kl': 0.91796875, 'epoch': 0.23}
23%|██▎ | 997/4286 [5:50:06<17:15:59, 18.90s/it] {'loss': 0.0095, 'grad_norm': 4.505687531443919, 'learning_rate': 7.673821745216985e-07, 'completion_length': 165.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5491071790456772, 'rewards/format_reward': 1.0, 'reward': 1.549107313156128, 'reward_std': 0.049000296741724014, 'kl': 0.23779296875, 'epoch': 0.23}
23%|██▎ | 998/4286 [5:50:24<17:00:33, 18.62s/it] {'loss': 0.0223, 'grad_norm': 4.348780319643523, 'learning_rate': 7.671488567428838e-07, 'completion_length': 175.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.4895833432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4717262387275696, 'reward_std': 0.14767501130700111, 'kl': 0.55908203125, 'epoch': 0.23}
23%|██▎ | 999/4286 [5:50:43<17:01:06, 18.64s/it] {'loss': 0.0429, 'grad_norm': 4.729000918161586, 'learning_rate': 7.66915538964069e-07, 'completion_length': 166.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.3657738268375397, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3479167222976685, 'reward_std': 0.12297377735376358, 'kl': 1.0703125, 'epoch': 0.23}
23%|██▎ | 1000/4286 [5:51:01<16:55:17, 18.54s/it] {'loss': 0.009, 'grad_norm': 1.8226689141848176,
'learning_rate': 7.666822211852542e-07, 'completion_length': 173.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5157738626003265, 'rewards/format_reward': 1.0, 'reward': 1.5157738327980042, 'reward_std': 0.04107142798602581, 'kl': 0.2255859375, 'epoch': 0.23}
[2025-03-02 11:02:51,221] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
23%|██▎ | 1001/4286 [5:55:35<86:57:47, 95.30s/it] {'loss': 0.0545, 'grad_norm': 6.814589122171578, 'learning_rate': 7.664489034064396e-07, 'completion_length': 183.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.40940938889980316, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3379809260368347, 'reward_std': 0.21950192004442215, 'kl': 1.36328125, 'epoch': 0.23}
23%|██▎ | 1002/4286 [5:55:53<65:42:04, 72.02s/it] {'loss': 0.0082, 'grad_norm': 1.540721664513842, 'learning_rate': 7.662155856276248e-07, 'completion_length': 164.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.6166667640209198, 'rewards/format_reward': 1.0, 'reward': 1.6166667342185974, 'reward_std': 0.04139285534620285, 'kl': 0.2041015625, 'epoch': 0.23}
23%|██▎ | 1003/4286 [5:56:12<51:14:44, 56.19s/it] {'loss': 0.0328, 'grad_norm': 4.581479163752277, 'learning_rate': 7.6598226784881e-07, 'completion_length': 186.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.4586309641599655, 'rewards/format_reward': 1.0, 'reward': 1.4586310386657715, 'reward_std': 0.16407203301787376, 'kl': 0.822265625, 'epoch': 0.23}
23%|██▎ | 1004/4286 [5:56:30<40:45:35, 44.71s/it] {'loss': 0.0371, 'grad_norm': 3.5507196495524225, 'learning_rate': 7.657489500699952e-07, 'completion_length': 157.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.3898809850215912, 'rewards/format_reward': 1.0, 'reward': 1.3898810744285583, 'reward_std': 0.09421682357788086, 'kl': 0.927734375, 'epoch': 0.23}
23%|██▎ | 1005/4286 [5:56:48<33:29:43, 36.75s/it] {'loss': 0.0225, 'grad_norm': 2.6690652185236106, 'learning_rate': 7.655156322911806e-07, 'completion_length': 166.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.4568452686071396, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.438988208770752, 'reward_std': 0.07557233795523643, 'kl': 0.5615234375, 'epoch': 0.23}
23%|██▎ | 1006/4286 [5:57:09<28:56:02, 31.76s/it] {'loss': 0.0596, 'grad_norm': 3.9925092301194103, 'learning_rate': 7.652823145123658e-07, 'completion_length': 179.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.3643353283405304, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3286212086677551, 'reward_std': 0.1539093367755413, 'kl': 1.4912109375, 'epoch': 0.23}
23%|██▎ | 1007/4286 [5:57:26<25:03:59, 27.52s/it] {'loss': 0.011, 'grad_norm': 1.3148516760031081, 'learning_rate': 7.65048996733551e-07, 'completion_length': 165.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.4360119253396988, 'rewards/format_reward': 1.0, 'reward': 1.4360119700431824, 'reward_std': 0.048026394098997116, 'kl': 0.275390625, 'epoch': 0.23}
24%|██▎ | 1008/4286 [5:57:46<22:56:45, 25.20s/it] {'loss': 0.0381, 'grad_norm': 4.9493368918071505, 'learning_rate': 7.648156789547363e-07, 'completion_length': 172.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.2619047909975052, 'rewards/format_reward': 1.0, 'reward': 1.2619048357009888, 'reward_std': 0.1194644421339035, 'kl': 0.953125, 'epoch': 0.24}
24%|██▎ | 1009/4286 [5:58:04<21:04:55, 23.16s/it] {'loss': 0.0277, 'grad_norm': 6.484666176660109, 'learning_rate': 7.645823611759216e-07, 'completion_length': 160.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5139881074428558, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4961310625076294, 'reward_std': 0.15343285351991653, 'kl': 0.6904296875, 'epoch': 0.24}
24%|██▎ | 1010/4286 [5:58:24<20:10:40, 22.17s/it] {'loss': 0.0827, 'grad_norm': 7.368877585050569, 'learning_rate': 7.643490433971068e-07, 'completion_length': 171.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.23571429401636124, 'rewards/format_reward': 1.0, 'reward': 1.2357143759727478, 'reward_std': 0.09092770516872406, 'kl': 2.06640625, 'epoch': 0.24}
24%|██▎ | 1011/4286 [5:58:43<19:20:12, 21.26s/it] {'loss': 0.0999, 'grad_norm': 6.440505429365749, 'learning_rate': 7.641157256182921e-07, 'completion_length': 147.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.4851190745830536, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4315477013587952, 'reward_std': 0.2533879205584526, 'kl': 2.49609375, 'epoch': 0.24}
24%|██▎ | 1012/4286 [5:59:03<18:51:31, 20.74s/it] {'loss': 0.0858, 'grad_norm': 6.799494738047647, 'learning_rate': 7.638824078394773e-07, 'completion_length': 160.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.495535746216774, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4776787161827087, 'reward_std': 0.20476176589727402, 'kl': 2.1484375, 'epoch': 0.24}
24%|██▎ | 1013/4286 [5:59:21<18:10:27, 19.99s/it] {'loss': 0.0602, 'grad_norm': 8.649694307490895, 'learning_rate': 7.636490900606626e-07, 'completion_length': 163.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.4053572118282318, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.369642972946167, 'reward_std': 0.17454785853624344, 'kl': 1.5078125, 'epoch': 0.24}
24%|██▎ | 1014/4286 [5:59:42<18:30:42, 20.37s/it] {'loss': 0.0773, 'grad_norm': 8.91656791392227, 'learning_rate': 7.634157722818479e-07, 'completion_length': 158.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.37460319697856903, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.303174614906311, 'reward_std': 0.17965156584978104, 'kl': 1.927734375, 'epoch': 0.24}
24%|██▎ | 1015/4286 [6:00:00<17:53:39, 19.69s/it] {'loss': 0.0488, 'grad_norm': 4.222291346543651, 'learning_rate': 7.631824545030331e-07, 'completion_length': 152.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.3794643133878708, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.325892984867096, 'reward_std': 0.14391634985804558, 'kl': 1.21875, 'epoch': 0.24}
24%|██▎ | 1016/4286 [6:00:17<17:05:41, 18.82s/it] {'loss': 0.029, 'grad_norm': 13.868005399596658, 'learning_rate': 7.629491367242183e-07, 'completion_length': 144.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.409226194024086, 'rewards/format_reward': 1.0, 'reward': 1.4092262983322144, 'reward_std': 0.0853356160223484, 'kl': 0.7255859375, 'epoch': 0.24}
24%|██▎ | 1017/4286 [6:00:38<17:42:40, 19.50s/it] {'loss': 0.0254, 'grad_norm': 3.045057736487943, 'learning_rate': 7.627158189454036e-07, 'completion_length': 155.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.3809524029493332, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3630953431129456, 'reward_std': 0.1234050765633583, 'kl': 0.634765625, 'epoch': 0.24}
24%|██▍ | 1018/4286 [6:00:59<18:05:12, 19.92s/it] {'loss': 0.0304, 'grad_norm': 6.72504421072149, 'learning_rate': 7.624825011665889e-07, 'completion_length': 143.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.3511904925107956, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.333333432674408, 'reward_std': 0.125477802939713, 'kl': 0.759765625, 'epoch': 0.24}
24%|██▍ | 1019/4286 [6:01:17<17:33:58, 19.36s/it] {'loss': 0.0231, 'grad_norm': 1.7681775441959233, 'learning_rate': 7.622491833877741e-07, 'completion_length': 143.75, 'rewards/only_full_func_accuracy_reward': 0.4717262089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.453869104385376, 'reward_std': 0.09821428824216127, 'kl': 0.5771484375, 'epoch': 0.24}
24%|██▍ | 1020/4286 [6:01:34<16:49:50, 18.55s/it] {'loss': 0.0104, 'grad_norm': 2.829663246880316, 'learning_rate': 7.620158656089593e-07, 'completion_length': 142.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.3988095819950104, 'rewards/format_reward': 1.0, 'reward': 1.3988096117973328, 'reward_std': 0.07419108599424362, 'kl': 0.259765625, 'epoch': 0.24}
24%|██▍ | 1021/4286 [6:01:54<17:14:12, 19.01s/it] {'loss': 0.0396, 'grad_norm': 5.35752364132608, 'learning_rate': 7.617825478301447e-07, 'completion_length': 150.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.4836309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.46577388048172, 'reward_std': 0.12469702307134867, 'kl': 0.986328125, 'epoch': 0.24}
24%|██▍ | 1022/4286 [6:02:12<16:50:37, 18.58s/it] {'loss': 0.0242, 'grad_norm': 8.850588651038608, 'learning_rate': 7.615492300513299e-07,
'completion_length': 143.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.3816220462322235, 'rewards/format_reward': 1.0, 'reward': 1.3816221952438354, 'reward_std': 0.10144967213273048, 'kl': 0.603515625, 'epoch': 0.24}
24%|██▍ | 1023/4286 [6:02:28<16:20:32, 18.03s/it] {'loss': 0.0152, 'grad_norm': 4.02579346388235, 'learning_rate': 7.613159122725151e-07, 'completion_length': 137.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.09150167182087898, 'kl': 0.37939453125, 'epoch': 0.24}
24%|██▍ | 1024/4286 [6:02:45<16:04:34, 17.74s/it] {'loss': 0.0135, 'grad_norm': 1.4974122109483, 'learning_rate': 7.610825944937004e-07, 'completion_length': 133.66072463989258, 'rewards/only_full_func_accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.06388125568628311, 'kl': 0.33740234375, 'epoch': 0.24}
24%|██▍ | 1025/4286 [6:03:02<15:51:42, 17.51s/it] {'loss': 0.0084, 'grad_norm': 3.149353313858085, 'learning_rate': 7.608492767148857e-07, 'completion_length': 138.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5758928805589676, 'rewards/format_reward': 1.0, 'reward': 1.5758929252624512, 'reward_std': 0.11143932491540909, 'kl': 0.2099609375, 'epoch': 0.24}
24%|██▍ | 1026/4286 [6:03:20<15:51:23, 17.51s/it] {'loss': 0.014, 'grad_norm': 2.9708397018492896, 'learning_rate': 7.606159589360709e-07, 'completion_length': 130.50000381469727, 'rewards/only_full_func_accuracy_reward': 0.6232143342494965, 'rewards/format_reward': 1.0, 'reward': 1.6232144236564636, 'reward_std': 0.04553450271487236, 'kl': 0.349609375, 'epoch': 0.24}
24%|██▍ | 1027/4286 [6:03:40<16:26:07, 18.16s/it] {'loss': 0.0303, 'grad_norm': 3.1446223627301433, 'learning_rate': 7.603826411572561e-07, 'completion_length': 127.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.5294642746448517, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5116072297096252, 'reward_std': 0.11580722406506538, 'kl': 0.7568359375, 'epoch': 0.24}
24%|██▍ | 1028/4286 [6:03:57<16:09:40, 17.86s/it] {'loss': 0.0208, 'grad_norm': 8.155227274271189, 'learning_rate': 7.601493233784414e-07, 'completion_length': 136.91072463989258, 'rewards/only_full_func_accuracy_reward': 0.523809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5059524774551392, 'reward_std': 0.13173258118331432, 'kl': 0.5205078125, 'epoch': 0.24}
24%|██▍ | 1029/4286 [6:04:17<16:56:59, 18.73s/it] {'loss': 0.0234, 'grad_norm': 14.968777683516945, 'learning_rate': 7.599160055996266e-07, 'completion_length': 137.00000381469727, 'rewards/only_full_func_accuracy_reward': 0.5178571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.11082621291279793, 'kl': 0.583984375, 'epoch': 0.24}
24%|██▍ | 1030/4286 [6:04:34<16:24:25, 18.14s/it] {'loss': 0.0217, 'grad_norm': 16.474133686809978, 'learning_rate': 7.596826878208119e-07, 'completion_length': 135.28571701049805, 'rewards/only_full_func_accuracy_reward': 0.4303571730852127, 'rewards/format_reward': 1.0, 'reward': 1.430357277393341, 'reward_std': 0.10777115635573864, 'kl': 0.5419921875, 'epoch': 0.24}
24%|██▍ | 1031/4286 [6:04:50<15:46:41, 17.45s/it] {'loss': 0.0128, 'grad_norm': 4.557796635277855, 'learning_rate': 7.594493700419972e-07, 'completion_length': 123.08929061889648, 'rewards/only_full_func_accuracy_reward': 0.438988134264946, 'rewards/format_reward': 1.0, 'reward': 1.438988208770752, 'reward_std': 0.08041993714869022, 'kl': 0.31982421875, 'epoch': 0.24}
24%|██▍ | 1032/4286 [6:05:07<15:37:41, 17.29s/it] {'loss': 0.0144, 'grad_norm': 6.554159899644463, 'learning_rate': 7.592160522631824e-07, 'completion_length': 145.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6119047999382019, 'rewards/format_reward': 1.0, 'reward': 1.6119049191474915, 'reward_std': 0.11722506955265999, 'kl': 0.361328125, 'epoch': 0.24}
24%|██▍ | 1033/4286 [6:05:27<16:29:14, 18.25s/it] {'loss': 0.0439, 'grad_norm': 4.52291094499435, 'learning_rate': 7.589827344843676e-07, 'completion_length': 140.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.3574405014514923, 'rewards/format_reward': 1.0, 'reward': 1.3574405312538147, 'reward_std': 0.06991519220173359, 'kl': 1.09765625, 'epoch': 0.24}
24%|██▍ | 1034/4286 [6:05:44<15:57:39, 17.67s/it] {'loss': 0.0298, 'grad_norm': 5.88996060856084, 'learning_rate': 7.58749416705553e-07, 'completion_length': 133.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.4255952537059784, 'rewards/format_reward': 1.0, 'reward': 1.4255953431129456, 'reward_std': 0.12297770753502846, 'kl': 0.74609375, 'epoch': 0.24}
24%|██▍ | 1035/4286 [6:06:01<15:46:06, 17.46s/it] {'loss': 0.02, 'grad_norm': 3.015101950812057, 'learning_rate': 7.585160989267382e-07, 'completion_length': 134.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.4475446790456772, 'rewards/format_reward': 1.0, 'reward': 1.4475447535514832, 'reward_std': 0.05289181135594845, 'kl': 0.5009765625, 'epoch': 0.24}
24%|██▍ | 1036/4286 [6:06:19<15:58:48, 17.70s/it] {'loss': 0.0191, 'grad_norm': 8.9325959308842, 'learning_rate': 7.582827811479234e-07, 'completion_length': 147.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.5531462728977203, 'rewards/format_reward': 1.0, 'reward': 1.5531463623046875, 'reward_std': 0.15518487989902496, 'kl': 0.478515625, 'epoch': 0.24}
24%|██▍ | 1037/4286 [6:06:37<16:06:25, 17.85s/it] {'loss': 0.0293, 'grad_norm': 4.780547487850013, 'learning_rate': 7.580494633691087e-07, 'completion_length': 136.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5860119462013245, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5681548118591309, 'reward_std': 0.10553007666021585, 'kl': 0.732421875, 'epoch': 0.24}
24%|██▍ | 1038/4286 [6:06:55<16:03:40, 17.80s/it] {'loss': 0.0328, 'grad_norm': 4.5391152120519855, 'learning_rate': 7.57816145590294e-07, 'completion_length': 136.39286422729492, 'rewards/only_full_func_accuracy_reward': 0.32738097012043, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.30952388048172, 'reward_std': 0.0773809589445591, 'kl': 0.8203125, 'epoch': 0.24}
24%|██▍ | 1039/4286 [6:07:14<16:18:25, 18.08s/it] {'loss': 0.0499, 'grad_norm': 4.3341314536537645, 'learning_rate': 7.575828278114792e-07, 'completion_length': 144.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.3886634558439255, 'rewards/format_reward': 1.0, 'reward': 1.3886635899543762, 'reward_std': 0.1251186989247799, 'kl': 1.244140625, 'epoch': 0.24}
24%|██▍ | 1040/4286 [6:07:31<16:05:00, 17.84s/it] {'loss': 0.0198, 'grad_norm': 5.30798485238513, 'learning_rate': 7.573495100326644e-07, 'completion_length': 139.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.3407738208770752, 'rewards/format_reward': 1.0, 'reward': 1.3407739400863647, 'reward_std': 0.09149500355124474, 'kl': 0.49609375, 'epoch': 0.24}
24%|██▍ | 1041/4286 [6:07:47<15:40:49, 17.40s/it]
{'loss': 0.0137, 'grad_norm': 4.008738927471544, 'learning_rate': 7.571161922538497e-07, 'completion_length': 132.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6101192235946655, 'reward_std': 0.1171528808772564, 'kl': 0.3427734375, 'epoch': 0.24}
[2025-03-02 11:15:23,800] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
24%|██▍ | 1042/4286 [6:08:08<16:32:54, 18.36s/it] {'loss': 0.0377, 'grad_norm': 7.024777086622673, 'learning_rate': 7.56882874475035e-07, 'completion_length': 162.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4375000447034836, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4017857909202576, 'reward_std': 0.1506734099239111, 'kl': 0.943359375, 'epoch': 0.24}
24%|██▍ | 1043/4286 [6:08:26<16:29:07, 18.30s/it] {'loss': 0.032, 'grad_norm': 7.07571888395657, 'learning_rate': 7.566495566962202e-07, 'completion_length': 148.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.4776786118745804, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4598215818405151, 'reward_std': 0.10533424839377403, 'kl': 0.7998046875, 'epoch': 0.24}
24%|██▍ | 1044/4286 [6:08:47<17:11:11, 19.08s/it] {'loss': 0.0906, 'grad_norm': 11.5604541673788, 'learning_rate': 7.564162389174055e-07, 'completion_length': 164.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.2812500074505806, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2455358505249023, 'reward_std': 0.2288956195116043, 'kl': 2.265625, 'epoch': 0.24}
24%|██▍ | 1045/4286 [6:09:06<17:07:22, 19.02s/it] {'loss': 0.0434, 'grad_norm': 4.992394695608245, 'learning_rate': 7.561829211385907e-07, 'completion_length': 160.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.4404762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4226192235946655, 'reward_std': 0.1113918386399746, 'kl': 1.083984375, 'epoch': 0.24}
24%|██▍ | 1046/4286 [6:09:25<17:11:06, 19.09s/it] {'loss': 0.0541, 'grad_norm': 5.590168002319825, 'learning_rate': 7.55949603359776e-07, 'completion_length': 159.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.455357164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4196428656578064, 'reward_std': 0.15743745118379593, 'kl': 1.35546875, 'epoch': 0.24}
24%|██▍ | 1047/4286 [6:09:44<17:00:43, 18.91s/it] {'loss': 0.021, 'grad_norm': 10.453008961425228, 'learning_rate': 7.557162855809613e-07, 'completion_length': 156.125, 'rewards/only_full_func_accuracy_reward': 0.46125994622707367, 'rewards/format_reward': 1.0, 'reward': 1.4612599611282349, 'reward_std': 0.11068311706185341, 'kl': 0.5244140625, 'epoch': 0.24}
24%|██▍ | 1048/4286 [6:10:05<17:36:08, 19.57s/it] {'loss': 0.0585, 'grad_norm': 5.374181781369416, 'learning_rate': 7.554829678021465e-07, 'completion_length': 159.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.4745536148548126, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4566965103149414, 'reward_std': 0.20669718086719513, 'kl': 1.46484375, 'epoch': 0.24}
24%|██▍ | 1049/4286 [6:10:22<16:59:19, 18.89s/it] {'loss': 0.0453, 'grad_norm': 4.841042753511163, 'learning_rate': 7.552496500233317e-07, 'completion_length': 146.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.4092262238264084, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3913691639900208, 'reward_std': 0.11607256904244423, 'kl': 1.12890625, 'epoch': 0.24}
24%|██▍ | 1050/4286 [6:10:41<17:02:06, 18.95s/it] {'loss': 0.0493, 'grad_norm': 4.304272253531451, 'learning_rate': 7.55016332244517e-07, 'completion_length': 155.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.4933248311281204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4576106667518616, 'reward_std': 0.1404216941446066, 'kl': 1.234375, 'epoch': 0.24}
25%|██▍ | 1051/4286 [6:11:00<16:55:49, 18.84s/it] {'loss': 0.0274, 'grad_norm': 2.2060656568212664, 'learning_rate': 7.547830144657023e-07, 'completion_length': 145.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.3883928954601288, 'rewards/format_reward': 1.0, 'reward': 1.3883929252624512, 'reward_std': 0.025190782733261585, 'kl': 0.68359375, 'epoch': 0.25}
25%|██▍ | 1052/4286 [6:11:18<16:44:39, 18.64s/it] {'loss': 0.0222, 'grad_norm': 14.949459281908393, 'learning_rate': 7.545496966868875e-07, 'completion_length': 159.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.4226190894842148, 'rewards/format_reward': 1.0, 'reward': 1.4226192235946655, 'reward_std': 0.07959692180156708, 'kl': 0.556640625, 'epoch': 0.25}
25%|██▍ | 1053/4286 [6:11:35<16:26:50, 18.31s/it] {'loss': 0.0437, 'grad_norm': 5.105134392732437, 'learning_rate': 7.543163789080727e-07, 'completion_length': 153.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5000000447034836, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.1547619104385376, 'kl': 1.091796875, 'epoch': 0.25}
25%|██▍ | 1054/4286 [6:11:54<16:30:10, 18.38s/it] {'loss': 0.0271, 'grad_norm': 1.3785800898941827, 'learning_rate': 7.54083061129258e-07, 'completion_length': 154.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.4455782473087311, 'rewards/format_reward': 1.0, 'reward': 1.4455782771110535, 'reward_std': 0.06558194011449814, 'kl': 0.677734375, 'epoch': 0.25}
25%|██▍ | 1055/4286 [6:12:12<16:29:53, 18.38s/it] {'loss': 0.0149, 'grad_norm': 4.657168978031672, 'learning_rate': 7.538497433504433e-07, 'completion_length': 176.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6205357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6205357313156128, 'reward_std': 0.1298883892595768, 'kl': 0.3740234375, 'epoch': 0.25}
25%|██▍ | 1056/4286 [6:12:34<17:20:33, 19.33s/it] {'loss': 0.0449, 'grad_norm': 36.585533221014856, 'learning_rate': 7.536164255716285e-07, 'completion_length': 172.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.3809524178504944, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.345238208770752, 'reward_std': 0.19821718335151672, 'kl': 1.12109375, 'epoch': 0.25}
25%|██▍ | 1057/4286 [6:12:54<17:27:55, 19.47s/it] {'loss': 0.0329, 'grad_norm': 20.092881573837634, 'learning_rate': 7.533831077928138e-07, 'completion_length': 168.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4017857387661934, 'rewards/format_reward': 1.0, 'reward': 1.4017857909202576, 'reward_std': 0.08146731369197369, 'kl': 0.82421875, 'epoch': 0.25}
25%|██▍ | 1058/4286 [6:13:13<17:21:13, 19.35s/it] {'loss': 0.0241, 'grad_norm': 3.3719791980092166, 'learning_rate': 7.53149790013999e-07, 'completion_length': 160.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.4394841492176056, 'rewards/format_reward': 0.9821428656578064, 'reward':
1.4216270446777344, 'reward_std': 0.09570176899433136, 'kl': 0.599609375, 'epoch': 0.25}
25%|██▍ | 1059/4286 [6:13:31<17:10:37, 19.16s/it] {'loss': 0.0145, 'grad_norm': 7.7727737119845886, 'learning_rate': 7.529164722351843e-07, 'completion_length': 164.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.541666716337204, 'rewards/format_reward': 1.0, 'reward': 1.5416667461395264, 'reward_std': 0.06705944333225489, 'kl': 0.361328125, 'epoch': 0.25}
25%|██▍ | 1060/4286 [6:13:49<16:47:37, 18.74s/it] {'loss': 0.0376, 'grad_norm': 8.65030485941807, 'learning_rate': 7.526831544563696e-07, 'completion_length': 160.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.5520833432674408, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4985119700431824, 'reward_std': 0.2512981966137886, 'kl': 0.94140625, 'epoch': 0.25}
25%|██▍ | 1061/4286 [6:14:11<17:33:42, 19.60s/it] {'loss': 0.0535, 'grad_norm': 4.38644845298429, 'learning_rate': 7.524498366775548e-07, 'completion_length': 175.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.32351192831993103, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2699405550956726, 'reward_std': 0.21951457113027573, 'kl': 1.3359375, 'epoch': 0.25}
25%|██▍ | 1062/4286 [6:14:36<19:08:23, 21.37s/it] {'loss': 0.1068, 'grad_norm': 6.368637812812446, 'learning_rate': 7.5221651889874e-07, 'completion_length': 203.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.4066220447421074, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.2637649774551392, 'reward_std': 0.3625364676117897, 'kl': 2.66796875, 'epoch': 0.25}
25%|██▍ | 1063/4286 [6:14:58<19:20:30, 21.60s/it] {'loss': 0.0633, 'grad_norm': 7.585127383510452, 'learning_rate': 7.519832011199253e-07, 'completion_length': 183.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.46726194024086, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.360119104385376, 'reward_std': 0.30712203681468964, 'kl': 1.58203125, 'epoch': 0.25}
25%|██▍ | 1064/4286 [6:15:17<18:35:17, 20.77s/it] {'loss': 0.0323, 'grad_norm': 2.103912686596349, 'learning_rate': 7.517498833411106e-07, 'completion_length': 167.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.505059540271759, 'rewards/format_reward': 1.0, 'reward': 1.5050595998764038, 'reward_std': 0.05391779914498329, 'kl': 0.806640625, 'epoch': 0.25}
25%|██▍ | 1065/4286 [6:15:41<19:26:17, 21.73s/it] {'loss': 0.0585, 'grad_norm': 5.868159450822158, 'learning_rate': 7.515165655622958e-07, 'completion_length': 161.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.3783482313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.360491156578064, 'reward_std': 0.16944840922951698, 'kl': 1.458984375, 'epoch': 0.25}
25%|██▍ | 1066/4286 [6:16:00<18:40:01, 20.87s/it] {'loss': 0.1347, 'grad_norm': 8.212280543633666, 'learning_rate': 7.51283247783481e-07, 'completion_length': 166.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.2592262029647827, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2056549191474915, 'reward_std': 0.16990172117948532, 'kl': 3.359375, 'epoch': 0.25}
25%|██▍ | 1067/4286 [6:16:21<18:33:04, 20.75s/it] {'loss': 0.0395, 'grad_norm': 3.310047099901253, 'learning_rate': 7.510499300046664e-07, 'completion_length': 173.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.5086309909820557, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.437202513217926, 'reward_std': 0.16433580964803696, 'kl': 0.986328125, 'epoch': 0.25}
25%|██▍ | 1068/4286 [6:16:43<19:05:36, 21.36s/it] {'loss': 0.0557, 'grad_norm': 3.718110004003612, 'learning_rate': 7.508166122258516e-07, 'completion_length': 177.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.566964328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5491071939468384, 'reward_std': 0.1041666716337204, 'kl': 1.390625, 'epoch': 0.25}
25%|██▍ | 1069/4286 [6:17:04<18:54:40, 21.16s/it] {'loss': 0.0332, 'grad_norm': 3.330595529528032, 'learning_rate': 7.505832944470368e-07, 'completion_length': 169.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.3184524029493332, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.282738208770752, 'reward_std': 0.13176707178354263, 'kl': 0.828125, 'epoch': 0.25}
25%|██▍ | 1070/4286 [6:17:21<17:49:46, 19.96s/it] {'loss': 0.0115, 'grad_norm': 44.76288201639259, 'learning_rate': 7.503499766682221e-07, 'completion_length': 158.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5267857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5267858505249023, 'reward_std': 0.005952383857220411, 'kl': 0.2861328125, 'epoch': 0.25}
25%|██▍ | 1071/4286 [6:17:40<17:25:51, 19.52s/it] {'loss': 0.0073, 'grad_norm': 2.1152452779804074, 'learning_rate': 7.501166588894074e-07, 'completion_length': 170.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5520833730697632, 'rewards/format_reward': 1.0, 'reward': 1.552083432674408, 'reward_std': 0.1132010854780674, 'kl': 0.1826171875, 'epoch': 0.25}
25%|██▌ | 1072/4286 [6:17:58<17:00:00, 19.04s/it] {'loss': 0.0086, 'grad_norm': 1.7115177917439757, 'learning_rate': 7.498833411105926e-07, 'completion_length': 156.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128,
'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.03411934897303581, 'kl': 0.21484375, 'epoch': 0.25} 25%|██▌ | 1072/4286 [6:17:58<17:00:00, 19.04s/it] 25%|██▌ | 1073/4286 [6:18:18<17:26:43, 19.55s/it] {'loss': 0.0132, 'grad_norm': 2.818929382795799, 'learning_rate': 7.496500233317778e-07, 'completion_length': 179.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.5982143580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803571939468384, 'reward_std': 0.15081598609685898, 'kl': 0.33154296875, 'epoch': 0.25} 25%|██▌ | 1073/4286 [6:18:18<17:26:43, 19.55s/it] 25%|██▌ | 1074/4286 [6:18:38<17:27:25, 19.57s/it] {'loss': 0.008, 'grad_norm': 1.6627205907367855, 'learning_rate': 7.494167055529631e-07, 'completion_length': 184.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5833333730697632, 'rewards/format_reward': 1.0, 'reward': 1.583333432674408, 'reward_std': 0.032524410635232925, 'kl': 0.2001953125, 'epoch': 0.25} 25%|██▌ | 1074/4286 [6:18:38<17:27:25, 19.57s/it] 25%|██▌ | 1075/4286 [6:18:56<17:02:16, 19.10s/it] {'loss': 0.0086, 'grad_norm': 2.2090306971309803, 'learning_rate': 7.491833877741483e-07, 'completion_length': 149.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.4776786118745804, 'rewards/format_reward': 1.0, 'reward': 1.4776787161827087, 'reward_std': 0.046213832683861256, 'kl': 0.21630859375, 'epoch': 0.25} 25%|██▌ | 1075/4286 [6:18:56<17:02:16, 19.10s/it] 25%|██▌ | 1076/4286 [6:19:16<17:17:30, 19.39s/it] {'loss': 0.0131, 'grad_norm': 3.0277464623225625, 'learning_rate': 7.489500699953336e-07, 'completion_length': 197.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5502976477146149, 'rewards/format_reward': 1.0, 'reward': 1.5502977967262268, 'reward_std': 0.029581504873931408, 'kl': 0.32763671875, 'epoch': 0.25} 25%|██▌ | 1076/4286 [6:19:16<17:17:30, 19.39s/it] 25%|██▌ | 1077/4286 [6:19:34<16:47:55, 18.85s/it] {'loss': 0.0082, 'grad_norm': 3.884544444577132, 
'learning_rate': 7.487167522165189e-07, 'completion_length': 167.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5863095819950104, 'rewards/format_reward': 1.0, 'reward': 1.5863096714019775, 'reward_std': 0.06596966832876205, 'kl': 0.20458984375, 'epoch': 0.25} 25%|██▌ | 1077/4286 [6:19:34<16:47:55, 18.85s/it] 25%|██▌ | 1078/4286 [6:19:55<17:27:47, 19.60s/it] {'loss': 0.0212, 'grad_norm': 1.4236767924877431, 'learning_rate': 7.484834344377041e-07, 'completion_length': 191.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.427083358168602, 'rewards/format_reward': 1.0, 'reward': 1.427083432674408, 'reward_std': 0.057280393317341805, 'kl': 0.529296875, 'epoch': 0.25} 25%|██▌ | 1078/4286 [6:19:55<17:27:47, 19.60s/it] 25%|██▌ | 1079/4286 [6:20:14<17:10:39, 19.28s/it] {'loss': 0.0076, 'grad_norm': 2.351401809352262, 'learning_rate': 7.482501166588893e-07, 'completion_length': 172.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5654762536287308, 'rewards/format_reward': 1.0, 'reward': 1.5654762983322144, 'reward_std': 0.09580604359507561, 'kl': 0.189453125, 'epoch': 0.25} 25%|██▌ | 1079/4286 [6:20:14<17:10:39, 19.28s/it] 25%|██▌ | 1080/4286 [6:20:32<16:57:02, 19.03s/it] {'loss': 0.0211, 'grad_norm': 1.6043452828914173, 'learning_rate': 7.480167988800747e-07, 'completion_length': 163.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.4166667014360428, 'rewards/format_reward': 1.0, 'reward': 1.4166668057441711, 'reward_std': 0.04710471536964178, 'kl': 0.52783203125, 'epoch': 0.25} 25%|██▌ | 1080/4286 [6:20:32<16:57:02, 19.03s/it] 25%|██▌ | 1081/4286 [6:20:51<16:48:09, 18.87s/it] {'loss': 0.0078, 'grad_norm': 0.3138579894023327, 'learning_rate': 7.477834811012599e-07, 'completion_length': 175.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.6369047462940216, 'rewards/format_reward': 1.0, 'reward': 1.6369048953056335, 'reward_std': 0.0, 'kl': 0.1943359375, 'epoch': 0.25} 25%|██▌ | 1081/4286 [6:20:51<16:48:09, 18.87s/it] 
25%|██▌ | 1082/4286 [6:21:09<16:42:59, 18.78s/it] {'loss': 0.0073, 'grad_norm': 0.8748105525492867, 'learning_rate': 7.475501633224451e-07, 'completion_length': 168.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.5401785969734192, 'rewards/format_reward': 1.0, 'reward': 1.5401785969734192, 'reward_std': 0.023595843696966767, 'kl': 0.18212890625, 'epoch': 0.25} 25%|██▌ | 1082/4286 [6:21:09<16:42:59, 18.78s/it] 25%|██▌ | 1083/4286 [6:21:28<16:48:24, 18.89s/it] {'loss': 0.0079, 'grad_norm': 4.505989133223184, 'learning_rate': 7.473168455436303e-07, 'completion_length': 191.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.4735119342803955, 'rewards/format_reward': 1.0, 'reward': 1.4735119342803955, 'reward_std': 0.11048955097794533, 'kl': 0.19677734375, 'epoch': 0.25} 25%|██▌ | 1083/4286 [6:21:28<16:48:24, 18.89s/it] 25%|██▌ | 1084/4286 [6:21:48<17:01:27, 19.14s/it] {'loss': 0.0101, 'grad_norm': 3.61054998336682, 'learning_rate': 7.470835277648157e-07, 'completion_length': 172.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.47172626852989197, 'rewards/format_reward': 1.0, 'reward': 1.4717262983322144, 'reward_std': 0.031143165193498135, 'kl': 0.25146484375, 'epoch': 0.25} 25%|██▌ | 1084/4286 [6:21:48<17:01:27, 19.14s/it] 25%|██▌ | 1085/4286 [6:22:06<16:45:05, 18.84s/it] {'loss': 0.0077, 'grad_norm': 1.551249081105707, 'learning_rate': 7.468502099860009e-07, 'completion_length': 182.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5324404835700989, 'rewards/format_reward': 1.0, 'reward': 1.5324406027793884, 'reward_std': 0.042261906899511814, 'kl': 0.19140625, 'epoch': 0.25} 25%|██▌ | 1085/4286 [6:22:06<16:45:05, 18.84s/it] 25%|██▌ | 1086/4286 [6:22:27<17:22:56, 19.56s/it] {'loss': 0.0108, 'grad_norm': 1.7625854738810711, 'learning_rate': 7.466168922071861e-07, 'completion_length': 186.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.418154776096344, 'rewards/format_reward': 1.0, 'reward': 1.4181548357009888, 
'reward_std': 0.050167880952358246, 'kl': 0.26953125, 'epoch': 0.25} 25%|██▌ | 1086/4286 [6:22:27<17:22:56, 19.56s/it][2025-03-02 11:30:06,943] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 25%|██▌ | 1087/4286 [6:22:51<18:29:57, 20.82s/it] {'loss': 0.0352, 'grad_norm': 7.865906998879902, 'learning_rate': 7.463835744283714e-07, 'completion_length': 208.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.24910057336091995, 'kl': 0.880859375, 'epoch': 0.25} 25%|██▌ | 1087/4286 [6:22:51<18:29:57, 20.82s/it] 25%|██▌ | 1088/4286 [6:23:10<18:03:53, 20.34s/it] {'loss': 0.0282, 'grad_norm': 16.289170073898546, 'learning_rate': 7.461502566495567e-07, 'completion_length': 165.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.1041666641831398, 'kl': 0.705078125, 'epoch': 0.25} 25%|██▌ | 1088/4286 [6:23:10<18:03:53, 20.34s/it][2025-03-02 11:30:46,712] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 25%|██▌ | 1089/4286 [6:23:31<18:06:48, 20.40s/it] {'loss': 0.0222, 'grad_norm': 136.46689234458293, 'learning_rate': 7.459169388707419e-07, 'completion_length': 180.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.6498016119003296, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6319445371627808, 'reward_std': 0.14857426658272743, 'kl': 0.5556640625, 'epoch': 0.25} 25%|██▌ | 1089/4286 [6:23:31<18:06:48, 20.40s/it] 25%|██▌ | 1090/4286 [6:23:52<18:18:50, 20.63s/it] {'loss': 0.037, 'grad_norm': 14.03133313383292, 'learning_rate': 7.456836210919272e-07, 'completion_length': 212.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5302579998970032, 'rewards/format_reward': 1.0, 'reward': 1.530258059501648, 'reward_std': 0.09608771651983261, 'kl': 0.9296875, 'epoch': 0.25} 25%|██▌ | 1090/4286 [6:23:52<18:18:50, 20.63s/it][2025-03-02 11:31:28,799] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 25%|██▌ | 1091/4286 [6:24:13<18:23:04, 20.71s/it] {'loss': 0.0189, 'grad_norm': 10.659322814699783, 'learning_rate': 7.454503033131124e-07, 'completion_length': 188.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.3668155074119568, 'rewards/format_reward': 1.0, 'reward': 1.3668155670166016, 'reward_std': 0.0635011438280344, 'kl': 0.474609375, 'epoch': 0.25} 25%|██▌ | 1091/4286 [6:24:13<18:23:04, 20.71s/it] 25%|██▌ | 1092/4286 [6:24:35<18:39:20, 21.03s/it] {'loss': 0.0255, 'grad_norm': 2.635518087322065, 'learning_rate': 7.452169855342977e-07, 'completion_length': 201.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5610119551420212, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5431548953056335, 'reward_std': 0.13969411700963974, 'kl': 0.6396484375, 'epoch': 0.25} 25%|██▌ | 1092/4286 [6:24:35<18:39:20, 21.03s/it] 26%|██▌ | 1093/4286 [6:24:55<18:22:10, 20.71s/it] {'loss': 0.0562, 'grad_norm': 4.832865889294047, 'learning_rate': 7.44983667755483e-07, 'completion_length': 195.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.3169643133878708, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2455358505249023, 'reward_std': 0.25758039951324463, 'kl': 1.40625, 'epoch': 0.26} 26%|██▌ | 1093/4286 [6:24:55<18:22:10, 20.71s/it] 26%|██▌ | 1094/4286 [6:25:14<17:55:14, 20.21s/it] {'loss': 0.0543, 'grad_norm': 3.881489862419779, 'learning_rate': 7.447503499766682e-07, 'completion_length': 157.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5550596117973328, 'reward_std': 0.3065476417541504, 'kl': 1.35546875, 'epoch': 0.26} 26%|██▌ | 1094/4286 [6:25:14<17:55:14, 20.21s/it] 26%|██▌ | 1095/4286 [6:25:34<17:55:39, 20.23s/it] {'loss': 0.0748, 'grad_norm': 
3.0262409704437405, 'learning_rate': 7.445170321978534e-07, 'completion_length': 186.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4196429252624512, 'reward_std': 0.20075097680091858, 'kl': 1.87109375, 'epoch': 0.26} 26%|██▌ | 1095/4286 [6:25:34<17:55:39, 20.23s/it] 26%|██▌ | 1096/4286 [6:25:55<18:05:37, 20.42s/it] {'loss': 0.0494, 'grad_norm': 3.685853234742059, 'learning_rate': 7.442837144190387e-07, 'completion_length': 215.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.4761905372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4583333730697632, 'reward_std': 0.18912799656391144, 'kl': 1.234375, 'epoch': 0.26} 26%|██▌ | 1096/4286 [6:25:55<18:05:37, 20.42s/it] 26%|██▌ | 1097/4286 [6:26:16<18:16:08, 20.62s/it] {'loss': 0.0791, 'grad_norm': 3.110022772044988, 'learning_rate': 7.44050396640224e-07, 'completion_length': 210.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.28363097459077835, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2479167580604553, 'reward_std': 0.16577187925577164, 'kl': 1.97265625, 'epoch': 0.26} 26%|██▌ | 1097/4286 [6:26:16<18:16:08, 20.62s/it] 26%|██▌ | 1098/4286 [6:26:40<19:08:07, 21.61s/it] {'loss': 0.2143, 'grad_norm': 7.6836458304435045, 'learning_rate': 7.438170788614092e-07, 'completion_length': 208.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.1752232313156128, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.0502232611179352, 'reward_std': 0.316228449344635, 'kl': 5.34375, 'epoch': 0.26} 26%|██▌ | 1098/4286 [6:26:40<19:08:07, 21.61s/it] 26%|██▌ | 1099/4286 [6:27:01<19:08:15, 21.62s/it] {'loss': 0.1879, 'grad_norm': 9.734614206021984, 'learning_rate': 7.435837610825944e-07, 'completion_length': 184.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.3958333730697632, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.235119104385376, 'reward_std': 
0.42403310537338257, 'kl': 4.703125, 'epoch': 0.26} 26%|██▌ | 1099/4286 [6:27:01<19:08:15, 21.62s/it] 26%|██▌ | 1100/4286 [6:27:22<18:52:35, 21.33s/it] {'loss': 0.0941, 'grad_norm': 7.840986477918614, 'learning_rate': 7.433504433037798e-07, 'completion_length': 185.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.333333358168602, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2797619104385376, 'reward_std': 0.22558382898569107, 'kl': 2.34765625, 'epoch': 0.26} 26%|██▌ | 1100/4286 [6:27:22<18:52:35, 21.33s/it][2025-03-02 11:39:41,754] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 26%|██▌ | 1101/4286 [6:32:26<93:49:45, 106.06s/it] {'loss': 0.0951, 'grad_norm': 4.749281368484085, 'learning_rate': 7.43117125524965e-07, 'completion_length': 171.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.3363095372915268, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3005953431129456, 'reward_std': 0.15752442181110382, 'kl': 2.37890625, 'epoch': 0.26} 26%|██▌ | 1101/4286 [6:32:26<93:49:45, 106.06s/it] 26%|██▌ | 1102/4286 [6:32:46<70:56:02, 80.20s/it] {'loss': 0.1166, 'grad_norm': 8.508275160845079, 'learning_rate': 7.428838077461502e-07, 'completion_length': 178.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5363839566707611, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5185269117355347, 'reward_std': 0.21683388203382492, 'kl': 2.9140625, 'epoch': 0.26} 26%|██▌ | 1102/4286 [6:32:46<70:56:02, 80.20s/it] 26%|██▌ | 1103/4286 [6:33:05<54:51:05, 62.04s/it] {'loss': 0.1119, 'grad_norm': 8.883936202911691, 'learning_rate': 
7.426504899673355e-07, 'completion_length': 189.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.51636902987957, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.46279776096344, 'reward_std': 0.264298640191555, 'kl': 2.796875, 'epoch': 0.26} 26%|██▌ | 1103/4286 [6:33:05<54:51:05, 62.04s/it][2025-03-02 11:40:41,946] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 26%|██▌ | 1104/4286 [6:33:26<43:51:44, 49.62s/it] {'loss': 0.087, 'grad_norm': 4.419716263584451, 'learning_rate': 7.424171721885207e-07, 'completion_length': 183.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5660715103149414, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5482143759727478, 'reward_std': 0.28728635609149933, 'kl': 2.171875, 'epoch': 0.26} 26%|██▌ | 1104/4286 [6:33:26<43:51:44, 49.62s/it] 26%|██▌ | 1105/4286 [6:33:46<35:53:52, 40.63s/it] {'loss': 0.0744, 'grad_norm': 3.132353173117686, 'learning_rate': 7.42183854409706e-07, 'completion_length': 158.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.574404776096344, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5386906266212463, 'reward_std': 0.14472959004342556, 'kl': 1.8623046875, 'epoch': 0.26} 26%|██▌ | 1105/4286 [6:33:46<35:53:52, 40.63s/it] 26%|██▌ | 1106/4286 [6:34:09<31:20:30, 35.48s/it] {'loss': 0.0914, 'grad_norm': 2.7802808757581348, 'learning_rate': 7.419505366308912e-07, 'completion_length': 191.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.3943452686071396, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3586310148239136, 'reward_std': 0.1955621913075447, 
'kl': 2.28515625, 'epoch': 0.26} 26%|██▌ | 1106/4286 [6:34:09<31:20:30, 35.48s/it] 26%|██▌ | 1107/4286 [6:34:28<27:02:58, 30.63s/it] {'loss': 0.1144, 'grad_norm': 5.7589104679274765, 'learning_rate': 7.417172188520765e-07, 'completion_length': 172.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.3154762089252472, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2797619700431824, 'reward_std': 0.13831908255815506, 'kl': 2.85546875, 'epoch': 0.26} 26%|██▌ | 1107/4286 [6:34:28<27:02:58, 30.63s/it] 26%|██▌ | 1108/4286 [6:34:47<23:53:22, 27.06s/it] {'loss': 0.0339, 'grad_norm': 2.377667508697919, 'learning_rate': 7.414839010732617e-07, 'completion_length': 167.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5505952835083008, 'rewards/format_reward': 1.0, 'reward': 1.5505953431129456, 'reward_std': 0.08333333674818277, 'kl': 0.84765625, 'epoch': 0.26} 26%|██▌ | 1108/4286 [6:34:47<23:53:22, 27.06s/it] 26%|██▌ | 1109/4286 [6:35:10<22:51:38, 25.90s/it] {'loss': 0.1076, 'grad_norm': 5.040844885585855, 'learning_rate': 7.41250583294447e-07, 'completion_length': 188.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5074405372142792, 'rewards/format_reward': 1.0, 'reward': 1.5074405670166016, 'reward_std': 0.16776952147483826, 'kl': 2.6875, 'epoch': 0.26} 26%|██▌ | 1109/4286 [6:35:10<22:51:38, 25.90s/it] 26%|██▌ | 1110/4286 [6:35:32<21:45:25, 24.66s/it] {'loss': 0.0736, 'grad_norm': 2.8429131989748897, 'learning_rate': 7.410172655156323e-07, 'completion_length': 177.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.3169642984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2812501192092896, 'reward_std': 0.10043025761842728, 'kl': 1.841796875, 'epoch': 0.26} 26%|██▌ | 1110/4286 [6:35:32<21:45:25, 24.66s/it] 26%|██▌ | 1111/4286 [6:35:52<20:35:57, 23.36s/it] {'loss': 0.0947, 'grad_norm': 8.926841817879746, 'learning_rate': 7.407839477368175e-07, 'completion_length': 157.35714721679688, 
'rewards/only_full_func_accuracy_reward': 0.549107164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4955357909202576, 'reward_std': 0.17462344840168953, 'kl': 2.36328125, 'epoch': 0.26} 26%|██▌ | 1111/4286 [6:35:53<20:35:57, 23.36s/it] 26%|██▌ | 1112/4286 [6:36:16<20:36:44, 23.38s/it] {'loss': 0.1107, 'grad_norm': 2.8256406071638316, 'learning_rate': 7.405506299580027e-07, 'completion_length': 224.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.42240649461746216, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3688351511955261, 'reward_std': 0.2598460912704468, 'kl': 2.765625, 'epoch': 0.26} 26%|██▌ | 1112/4286 [6:36:16<20:36:44, 23.38s/it] 26%|██▌ | 1113/4286 [6:36:38<20:14:49, 22.97s/it] {'loss': 0.071, 'grad_norm': 6.008472051515825, 'learning_rate': 7.403173121791881e-07, 'completion_length': 177.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.4184524267911911, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4005953073501587, 'reward_std': 0.16455103084445, 'kl': 1.77734375, 'epoch': 0.26} 26%|██▌ | 1113/4286 [6:36:38<20:14:49, 22.97s/it] 26%|██▌ | 1114/4286 [6:36:59<19:41:48, 22.35s/it] {'loss': 0.0734, 'grad_norm': 2.6515945654727306, 'learning_rate': 7.400839944003733e-07, 'completion_length': 185.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.4032738208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3854168057441711, 'reward_std': 0.11439166218042374, 'kl': 1.83203125, 'epoch': 0.26} 26%|██▌ | 1114/4286 [6:36:59<19:41:48, 22.35s/it] 26%|██▌ | 1115/4286 [6:37:19<19:05:41, 21.68s/it] {'loss': 0.0516, 'grad_norm': 2.468023161313978, 'learning_rate': 7.398506766215585e-07, 'completion_length': 192.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.438988134264946, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4211310744285583, 'reward_std': 0.18120671063661575, 'kl': 1.29296875, 'epoch': 0.26} 26%|██▌ | 1115/4286 [6:37:19<19:05:41, 21.68s/it] 26%|██▌ | 
1116/4286 [6:37:37<18:15:16, 20.73s/it] {'loss': 0.0652, 'grad_norm': 2.277031416835616, 'learning_rate': 7.396173588427438e-07, 'completion_length': 156.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.395833358168602, 'rewards/format_reward': 1.0, 'reward': 1.395833432674408, 'reward_std': 0.08173839747905731, 'kl': 1.626953125, 'epoch': 0.26} 26%|██▌ | 1116/4286 [6:37:37<18:15:16, 20.73s/it] 26%|██▌ | 1117/4286 [6:38:03<19:27:38, 22.11s/it] {'loss': 0.0835, 'grad_norm': 2.9666361886135952, 'learning_rate': 7.393840410639291e-07, 'completion_length': 216.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.3963293880224228, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3784723281860352, 'reward_std': 0.15999729931354523, 'kl': 2.0859375, 'epoch': 0.26} 26%|██▌ | 1117/4286 [6:38:03<19:27:38, 22.11s/it] 26%|██▌ | 1118/4286 [6:38:24<19:11:46, 21.81s/it] {'loss': 0.0341, 'grad_norm': 2.1992949531807953, 'learning_rate': 7.391507232851143e-07, 'completion_length': 197.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.367559552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3497024774551392, 'reward_std': 0.09039165824651718, 'kl': 0.849609375, 'epoch': 0.26} 26%|██▌ | 1118/4286 [6:38:24<19:11:46, 21.81s/it] 26%|██▌ | 1119/4286 [6:38:43<18:21:22, 20.87s/it] {'loss': 0.008, 'grad_norm': 1.444505509931821, 'learning_rate': 7.389174055062995e-07, 'completion_length': 183.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.586309552192688, 'rewards/format_reward': 1.0, 'reward': 1.5863096117973328, 'reward_std': 0.06446255184710026, 'kl': 0.19921875, 'epoch': 0.26} 26%|██▌ | 1119/4286 [6:38:43<18:21:22, 20.87s/it] 26%|██▌ | 1120/4286 [6:39:02<17:57:36, 20.42s/it] {'loss': 0.0178, 'grad_norm': 37.14736032332054, 'learning_rate': 7.386840877274848e-07, 'completion_length': 187.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.549107164144516, 'rewards/format_reward': 1.0, 'reward': 1.5491072535514832, 
'reward_std': 0.0625000037252903, 'kl': 0.447265625, 'epoch': 0.26} 26%|██▌ | 1120/4286 [6:39:02<17:57:36, 20.42s/it] 26%|██▌ | 1121/4286 [6:39:20<17:24:51, 19.81s/it] {'loss': 0.0191, 'grad_norm': 1.393923159823891, 'learning_rate': 7.384507699486701e-07, 'completion_length': 173.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.014880950096994638, 'kl': 0.47607421875, 'epoch': 0.26} 26%|██▌ | 1121/4286 [6:39:20<17:24:51, 19.81s/it] 26%|██▌ | 1122/4286 [6:39:41<17:41:54, 20.14s/it] {'loss': 0.0086, 'grad_norm': 4.849943798514749, 'learning_rate': 7.382174521698553e-07, 'completion_length': 203.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.4553571939468384, 'rewards/format_reward': 1.0, 'reward': 1.4553572535514832, 'reward_std': 0.06983363255858421, 'kl': 0.21484375, 'epoch': 0.26} 26%|██▌ | 1122/4286 [6:39:41<17:41:54, 20.14s/it] 26%|██▌ | 1123/4286 [6:40:00<17:25:54, 19.84s/it] {'loss': 0.0262, 'grad_norm': 6.209760077201278, 'learning_rate': 7.379841343910406e-07, 'completion_length': 162.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.05130239203572273, 'kl': 0.65771484375, 'epoch': 0.26} 26%|██▌ | 1123/4286 [6:40:00<17:25:54, 19.84s/it] 26%|██▌ | 1124/4286 [6:40:21<17:33:59, 20.00s/it] {'loss': 0.0168, 'grad_norm': 2.2852402096111946, 'learning_rate': 7.377508166122258e-07, 'completion_length': 224.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.5491072237491608, 'rewards/format_reward': 1.0, 'reward': 1.549107313156128, 'reward_std': 0.08180620893836021, 'kl': 0.4208984375, 'epoch': 0.26} 26%|██▌ | 1124/4286 [6:40:21<17:33:59, 20.00s/it] 26%|██▌ | 1125/4286 [6:40:39<17:10:10, 19.55s/it] {'loss': 0.0078, 'grad_norm': 1.2254681424639198, 'learning_rate': 7.37517498833411e-07, 'completion_length': 181.1964340209961, 
'rewards/only_full_func_accuracy_reward': 0.7425596117973328, 'rewards/format_reward': 1.0, 'reward': 1.7425596117973328, 'reward_std': 0.05059523321688175, 'kl': 0.1953125, 'epoch': 0.26} 26%|██▌ | 1125/4286 [6:40:39<17:10:10, 19.55s/it]
26%|██▋ | 1126/4286 [6:41:02<17:57:07, 20.45s/it] {'loss': 0.0201, 'grad_norm': 1.1578155171493791, 'learning_rate': 7.372841810545964e-07, 'completion_length': 217.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.48363102972507477, 'rewards/format_reward': 1.0, 'reward': 1.4836310744285583, 'reward_std': 0.06250000186264515, 'kl': 0.5029296875, 'epoch': 0.26}
26%|██▋ | 1127/4286 [6:41:23<18:05:20, 20.61s/it] {'loss': 0.0089, 'grad_norm': 3.2822142580443323, 'learning_rate': 7.370508632757816e-07, 'completion_length': 191.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.447916716337204, 'rewards/format_reward': 1.0, 'reward': 1.4479167461395264, 'reward_std': 0.08244555629789829, 'kl': 0.22314453125, 'epoch': 0.26}
26%|██▋ | 1128/4286 [6:41:42<17:38:27, 20.11s/it] {'loss': 0.0156, 'grad_norm': 2.148546745481821, 'learning_rate': 7.368175454969668e-07, 'completion_length': 199.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.07327024638652802, 'kl': 0.3896484375, 'epoch': 0.26}
26%|██▋ | 1129/4286 [6:42:00<17:01:32, 19.41s/it] {'loss': 0.0073, 'grad_norm': 3.145600294043069, 'learning_rate': 7.36584227718152e-07, 'completion_length': 172.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.7723214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7723215222358704, 'reward_std': 0.07624643296003342, 'kl': 0.18310546875, 'epoch': 0.26}
26%|██▋ | 1130/4286 [6:42:19<16:59:21, 19.38s/it] {'loss': 0.0185, 'grad_norm': 4.315022440577194, 'learning_rate': 7.363509099393374e-07, 'completion_length': 181.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.42648814618587494, 'rewards/format_reward': 1.0, 'reward': 1.4264881610870361, 'reward_std': 0.11874674260616302, 'kl': 0.4619140625, 'epoch': 0.26}
26%|██▋ | 1131/4286 [6:42:37<16:42:16, 19.06s/it] {'loss': 0.0139, 'grad_norm': 1.8206221838211938, 'learning_rate': 7.361175921605226e-07, 'completion_length': 186.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.3779762089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3601192235946655, 'reward_std': 0.10604140162467957, 'kl': 0.3486328125, 'epoch': 0.26}
26%|██▋ | 1132/4286 [6:42:55<16:30:18, 18.84s/it] {'loss': 0.0291, 'grad_norm': 11.744870703852628, 'learning_rate': 7.358842743817078e-07, 'completion_length': 169.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5639881044626236, 'rewards/format_reward': 1.0, 'reward': 1.563988208770752, 'reward_std': 0.10257173702120781, 'kl': 0.728515625, 'epoch': 0.26}
26%|██▋ | 1133/4286 [6:43:14<16:23:51, 18.72s/it] {'loss': 0.0119, 'grad_norm': 1.9211640407183261, 'learning_rate': 7.356509566028931e-07, 'completion_length': 185.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.5491071790456772, 'rewards/format_reward': 1.0, 'reward': 1.5491071939468384, 'reward_std': 0.059622688218951225, 'kl': 0.2978515625, 'epoch': 0.26}
26%|██▋ | 1134/4286 [6:43:35<16:53:59, 19.30s/it] {'loss': 0.0673, 'grad_norm': 8.35584495659174, 'learning_rate': 7.354176388240784e-07, 'completion_length': 213.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.447916716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4122024774551392, 'reward_std': 0.22893118858337402, 'kl': 1.6796875, 'epoch': 0.26}
26%|██▋ | 1135/4286 [6:43:53<16:35:00, 18.95s/it] {'loss': 0.0119, 'grad_norm': 2.310859221421734, 'learning_rate': 7.351843210452636e-07, 'completion_length': 161.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.5357144474983215, 'reward_std': 0.07516713440418243, 'kl': 0.296875, 'epoch': 0.26}
27%|██▋ | 1136/4286 [6:44:12<16:39:07, 19.03s/it] {'loss': 0.0382, 'grad_norm': 5.2941439288290875, 'learning_rate': 7.349510032664489e-07, 'completion_length': 206.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.3229166716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3050596117973328, 'reward_std': 0.17167320847511292, 'kl': 0.955078125, 'epoch': 0.27}
27%|██▋ | 1137/4286 [6:44:30<16:22:43, 18.72s/it] {'loss': 0.0216, 'grad_norm': 5.619268177900143, 'learning_rate': 7.347176854876341e-07, 'completion_length': 173.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.5044643133878708, 'rewards/format_reward': 1.0, 'reward': 1.504464328289032, 'reward_std': 0.05495269037783146, 'kl': 0.5390625, 'epoch': 0.27}
27%|██▋ | 1138/4286 [6:44:50<16:44:19, 19.14s/it] {'loss': 0.0719, 'grad_norm': 5.648397032462889, 'learning_rate': 7.344843677088194e-07, 'completion_length': 170.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.491071417927742, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4553572535514832, 'reward_std': 0.20709244161844254, 'kl': 1.796875, 'epoch': 0.27}
27%|██▋ | 1139/4286 [6:45:11<17:14:20, 19.72s/it] {'loss': 0.0564, 'grad_norm': 4.453952830270098, 'learning_rate': 7.342510499300047e-07, 'completion_length': 194.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.501488134264946, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4479167461395264, 'reward_std': 0.15654076263308525, 'kl': 1.404296875, 'epoch': 0.27}
27%|██▋ | 1140/4286 [6:45:31<17:21:58, 19.87s/it] {'loss': 0.0178, 'grad_norm': 3.990388992748776, 'learning_rate': 7.340177321511899e-07, 'completion_length': 192.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.4375000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4196430444717407, 'reward_std': 0.12684166803956032, 'kl': 0.4453125, 'epoch': 0.27}
27%|██▋ | 1141/4286 [6:45:51<17:15:42, 19.76s/it] {'loss': 0.0323, 'grad_norm': 2.958109451087374, 'learning_rate': 7.337844143723751e-07, 'completion_length': 184.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.4211309999227524, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.40327388048172, 'reward_std': 0.17763344198465347, 'kl': 0.8046875, 'epoch': 0.27}
27%|██▋ | 1142/4286 [6:46:14<18:05:29, 20.72s/it] {'loss': 0.0435, 'grad_norm': 2.2204868549800088, 'learning_rate': 7.335510965935604e-07, 'completion_length': 194.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.5391865521669388, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5213294625282288, 'reward_std': 0.1527293473482132, 'kl': 1.08984375, 'epoch': 0.27}
27%|██▋ | 1143/4286 [6:46:35<18:18:15, 20.97s/it] {'loss': 0.0308, 'grad_norm': 3.3333439375988525, 'learning_rate': 7.333177788147457e-07, 'completion_length': 210.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.3869047909975052, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3690477013587952, 'reward_std': 0.10562240332365036, 'kl': 0.76953125, 'epoch': 0.27}
27%|██▋ | 1144/4286 [6:46:55<17:56:34, 20.56s/it] {'loss': 0.0244, 'grad_norm': 3.0565326469297474, 'learning_rate': 7.330844610359309e-07, 'completion_length': 194.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5969671458005905, 'rewards/format_reward': 1.0, 'reward': 1.5969672203063965, 'reward_std': 0.09257495030760765, 'kl': 0.611328125, 'epoch': 0.27}
27%|██▋ | 1145/4286 [6:47:16<18:04:31, 20.72s/it] {'loss': 0.0299, 'grad_norm': 4.7577729121285595, 'learning_rate': 7.328511432571161e-07, 'completion_length': 203.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.3705357536673546, 'rewards/format_reward': 1.0, 'reward': 1.3705357909202576, 'reward_std': 0.10809675604104996, 'kl': 0.74609375, 'epoch': 0.27}
27%|██▋ | 1146/4286 [6:47:37<18:02:58, 20.69s/it] {'loss': 0.0286, 'grad_norm': 4.635234245588061, 'learning_rate': 7.326178254783015e-07, 'completion_length': 200.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.351190522313118, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3333334922790527, 'reward_std': 0.10831043124198914, 'kl': 0.71484375, 'epoch': 0.27}
27%|██▋ | 1147/4286 [6:47:56<17:36:29, 20.19s/it] {'loss': 0.0544, 'grad_norm': 9.711696509482259, 'learning_rate': 7.323845076994867e-07, 'completion_length': 182.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.4687500298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.415178656578064, 'reward_std': 0.30660583078861237, 'kl': 1.359375, 'epoch': 0.27}
27%|██▋ | 1148/4286 [6:48:15<17:16:47, 19.82s/it] {'loss': 0.0162, 'grad_norm': 6.05221224342897, 'learning_rate': 7.321511899206719e-07, 'completion_length': 169.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.4910714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4732143878936768, 'reward_std': 0.17113317549228668, 'kl': 0.404296875, 'epoch': 0.27}
27%|██▋ | 1149/4286 [6:48:34<17:14:53, 19.79s/it] {'loss': 0.0623, 'grad_norm': 5.960738635317397, 'learning_rate': 7.319178721418572e-07, 'completion_length': 190.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.38437505066394806, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3486608266830444, 'reward_std': 0.14548424258828163, 'kl': 1.55859375, 'epoch': 0.27}
27%|██▋ | 1150/4286 [6:48:54<17:11:11, 19.73s/it] {'loss': 0.0482, 'grad_norm': 4.0502989311991255, 'learning_rate': 7.316845543630425e-07, 'completion_length': 186.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.5758928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5580357909202576, 'reward_std': 0.15290232747793198, 'kl': 1.205078125, 'epoch': 0.27}
27%|██▋ | 1151/4286 [6:49:17<17:56:54, 20.61s/it] {'loss': 0.0514, 'grad_norm': 9.250818859245307, 'learning_rate': 7.314512365842277e-07, 'completion_length': 206.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.3541667014360428, 'rewards/format_reward': 1.0, 'reward': 1.3541667461395264, 'reward_std': 0.10188598558306694, 'kl': 1.2890625, 'epoch': 0.27}
27%|██▋ | 1152/4286 [6:49:37<18:00:19, 20.68s/it] {'loss': 0.0531, 'grad_norm': 4.977900998966882, 'learning_rate': 7.312179188054129e-07, 'completion_length': 184.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.368526816368103, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.350669801235199, 'reward_std': 0.11072051525115967, 'kl': 1.326171875, 'epoch': 0.27}
27%|██▋ | 1153/4286 [6:49:57<17:45:31, 20.41s/it] {'loss': 0.0138, 'grad_norm': 6.328867807049505, 'learning_rate': 7.309846010265982e-07, 'completion_length': 189.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.548214316368103, 'rewards/format_reward': 1.0, 'reward': 1.5482144355773926, 'reward_std': 0.06640603952109814, 'kl': 0.3466796875, 'epoch': 0.27}
27%|██▋ | 1154/4286 [6:50:16<17:14:44, 19.82s/it] {'loss': 0.009, 'grad_norm': 4.017672994977034, 'learning_rate': 7.307512832477834e-07, 'completion_length': 166.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.540178582072258, 'rewards/format_reward': 1.0, 'reward': 1.540178656578064, 'reward_std': 0.10037716943770647, 'kl': 0.22412109375, 'epoch': 0.27}
27%|██▋ | 1155/4286 [6:50:37<17:40:32, 20.32s/it] {'loss': 0.0303, 'grad_norm': 3.5905598160817433, 'learning_rate': 7.305179654689687e-07, 'completion_length': 188.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.5148810148239136, 'rewards/format_reward': 1.0, 'reward': 1.5148810744285583, 'reward_std': 0.1559775248169899, 'kl': 0.75390625, 'epoch': 0.27}
27%|██▋ | 1156/4286 [6:51:00<18:14:38, 20.98s/it] {'loss': 0.0394, 'grad_norm': 5.725603636484825, 'learning_rate': 7.30284647690154e-07, 'completion_length': 182.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.3928571790456772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750001192092896, 'reward_std': 0.13044018298387527, 'kl': 0.984375, 'epoch': 0.27}
[2025-03-02 11:58:36,851] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
27%|██▋ | 1157/4286 [6:51:21<18:18:13, 21.06s/it] {'loss': 0.0154, 'grad_norm': 6.763221129531096, 'learning_rate': 7.300513299113392e-07, 'completion_length': 196.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.4464287161827087, 'reward_std': 0.1290045604109764, 'kl': 0.384765625, 'epoch': 0.27}
27%|██▋ | 1158/4286 [6:51:41<18:02:30, 20.76s/it] {'loss': 0.0304, 'grad_norm': 8.310058018919532, 'learning_rate': 7.298180121325244e-07, 'completion_length': 184.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5806547850370407, 'rewards/format_reward': 1.0, 'reward': 1.5806549191474915, 'reward_std': 0.0517630772665143, 'kl': 0.76171875, 'epoch': 0.27}
27%|██▋ | 1159/4286 [6:52:02<17:57:50, 20.68s/it] {'loss': 0.0352, 'grad_norm': 8.514361262947089, 'learning_rate': 7.295846943537098e-07, 'completion_length': 202.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6073129773139954, 'rewards/format_reward': 1.0, 'reward': 1.607313096523285, 'reward_std': 0.12211842834949493, 'kl': 0.87890625, 'epoch': 0.27}
[2025-03-02 11:59:41,295] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
27%|██▋ | 1160/4286 [6:52:25<18:47:29, 21.64s/it] {'loss': 0.0328, 'grad_norm': 4.037997060643408, 'learning_rate': 7.29351376574895e-07, 'completion_length': 200.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.48674243688583374, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4510282278060913, 'reward_std': 0.23192675411701202, 'kl': 0.8203125, 'epoch': 0.27}
27%|██▋ | 1161/4286 [6:52:46<18:28:37, 21.29s/it] {'loss': 0.0366, 'grad_norm': 3.612232410037372, 'learning_rate': 7.291180587960802e-07, 'completion_length': 203.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.3258928805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3080357909202576, 'reward_std': 0.08420777320861816, 'kl': 0.9140625, 'epoch': 0.27}
27%|██▋ | 1162/4286 [6:53:06<18:08:11, 20.90s/it] {'loss': 0.0247, 'grad_norm': 4.440587406463724, 'learning_rate': 7.288847410172655e-07, 'completion_length': 192.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5011905133724213, 'rewards/format_reward': 1.0, 'reward': 1.5011906027793884, 'reward_std': 0.17111452668905258, 'kl': 0.6162109375, 'epoch': 0.27}
27%|██▋ | 1163/4286 [6:53:25<17:39:35, 20.36s/it] {'loss': 0.0116, 'grad_norm': 4.316467689969209, 'learning_rate': 7.286514232384508e-07, 'completion_length': 174.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.473214328289032, 'rewards/format_reward': 1.0, 'reward': 1.4732143878936768, 'reward_std': 0.07742243260145187, 'kl': 0.29052734375, 'epoch': 0.27}
27%|██▋ | 1164/4286 [6:53:45<17:35:38, 20.29s/it] {'loss': 0.0268, 'grad_norm': 4.281315553659354, 'learning_rate': 7.28418105459636e-07, 'completion_length': 189.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.375, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3571429252624512, 'reward_std': 0.09548066928982735, 'kl': 0.669921875, 'epoch': 0.27}
27%|██▋ | 1165/4286 [6:54:06<17:47:32, 20.52s/it] {'loss': 0.0418, 'grad_norm': 6.498822737832113, 'learning_rate': 7.281847876808212e-07, 'completion_length': 207.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.477678582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4598215222358704, 'reward_std': 0.12491239607334137, 'kl': 1.048828125, 'epoch': 0.27}
27%|██▋ | 1166/4286 [6:54:28<18:08:31, 20.93s/it] {'loss': 0.0222, 'grad_norm': 5.758571956851568, 'learning_rate': 7.279514699020065e-07, 'completion_length': 182.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.45595237612724304, 'rewards/format_reward': 1.0, 'reward': 1.4559524655342102, 'reward_std': 0.08644722774624825, 'kl': 0.556640625, 'epoch': 0.27}
27%|██▋ | 1167/4286 [6:54:48<17:56:15, 20.70s/it] {'loss': 0.0189, 'grad_norm': 2.9821279740780047, 'learning_rate': 7.277181521231918e-07, 'completion_length': 203.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.4791666865348816, 'rewards/format_reward': 1.0, 'reward': 1.4791667461395264, 'reward_std': 0.06815173290669918, 'kl': 0.47412109375, 'epoch': 0.27}
27%|██▋ | 1168/4286 [6:55:10<18:09:40, 20.97s/it] {'loss': 0.0249, 'grad_norm': 2.044955157705473, 'learning_rate': 7.27484834344377e-07, 'completion_length': 181.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000001192092896, 'reward_std': 0.071428582072258, 'kl': 0.623046875, 'epoch': 0.27}
27%|██▋ | 1169/4286 [6:55:31<18:15:26, 21.09s/it] {'loss': 0.0369, 'grad_norm': 4.608440142840684, 'learning_rate': 7.272515165655623e-07, 'completion_length': 193.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.4869047850370407, 'rewards/format_reward': 1.0, 'reward': 1.4869048595428467, 'reward_std': 0.07368208467960358, 'kl': 0.9228515625, 'epoch': 0.27}
27%|██▋ | 1170/4286 [6:55:50<17:41:35, 20.44s/it] {'loss': 0.0107, 'grad_norm': 2.8469368420085757, 'learning_rate': 7.270181987867475e-07, 'completion_length': 195.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.4613095372915268, 'rewards/format_reward': 1.0, 'reward': 1.4613096714019775, 'reward_std': 0.05222323536872864, 'kl': 0.26708984375, 'epoch': 0.27}
27%|██▋ | 1171/4286 [6:56:10<17:25:19, 20.13s/it] {'loss': 0.017, 'grad_norm': 1.860878127991246, 'learning_rate': 7.267848810079328e-07, 'completion_length': 161.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.4806547909975052, 'rewards/format_reward': 1.0, 'reward': 1.4806548953056335, 'reward_std': 0.06239968352019787, 'kl': 0.4248046875, 'epoch': 0.27}
27%|██▋ | 1172/4286 [6:56:29<17:21:55, 20.08s/it] {'loss': 0.0132, 'grad_norm': 2.0962884065833127, 'learning_rate': 7.265515632291181e-07, 'completion_length': 176.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.4300595372915268, 'rewards/format_reward': 1.0, 'reward': 1.430059552192688, 'reward_std': 0.09166059363633394, 'kl': 0.3310546875, 'epoch': 0.27}
27%|██▋ | 1173/4286 [6:56:50<17:33:44, 20.31s/it] {'loss': 0.0134, 'grad_norm': 1.1309537624314525, 'learning_rate': 7.263182454503033e-07, 'completion_length': 178.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5327381789684296, 'rewards/format_reward': 1.0, 'reward': 1.532738208770752, 'reward_std': 0.022214588709175587, 'kl': 0.3349609375, 'epoch': 0.27}
27%|██▋ | 1174/4286 [6:57:10<17:22:05, 20.09s/it] {'loss': 0.0118, 'grad_norm': 13.169683823275147, 'learning_rate': 7.260849276714885e-07, 'completion_length': 171.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.509523868560791, 'rewards/format_reward': 1.0, 'reward': 1.5095239281654358, 'reward_std': 0.05507789924740791, 'kl': 0.29541015625, 'epoch': 0.27}
[2025-03-02 12:04:48,668] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
27%|██▋ | 1175/4286 [6:57:33<18:05:15, 20.93s/it] {'loss': 0.0164, 'grad_norm': 5.268122112055055, 'learning_rate': 7.258516098926737e-07, 'completion_length': 168.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.4002976268529892, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3645833730697632, 'reward_std': 0.16820186376571655, 'kl': 0.4111328125, 'epoch': 0.27}
[2025-03-02 12:05:08,668] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
27%|██▋ | 1176/4286 [6:57:53<17:50:26, 20.65s/it] {'loss': 0.0137, 'grad_norm': 2.9530486284998894, 'learning_rate': 7.256182921138591e-07, 'completion_length': 161.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.44739586114883423, 'rewards/format_reward': 1.0, 'reward': 1.447395920753479, 'reward_std': 0.03327765315771103, 'kl': 0.34130859375, 'epoch': 0.27}
27%|██▋ | 1177/4286 [6:58:14<17:56:21, 20.77s/it] {'loss': 0.0081, 'grad_norm': 4.102120178355051, 'learning_rate': 7.253849743350443e-07, 'completion_length': 201.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.4270833879709244, 'rewards/format_reward': 1.0, 'reward': 1.427083432674408, 'reward_std': 0.11999655142426491, 'kl': 0.20361328125, 'epoch': 0.27}
27%|██▋ | 1178/4286 [6:58:35<18:05:05, 20.95s/it] {'loss': 0.0174, 'grad_norm': 2.565443135844162, 'learning_rate': 7.251516565562295e-07, 'completion_length': 169.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.5654762089252472, 'rewards/format_reward': 1.0, 'reward': 1.5654762983322144, 'reward_std': 0.08701668307185173, 'kl': 0.4345703125, 'epoch': 0.27}
28%|██▊ | 1179/4286 [6:58:57<18:23:12, 21.30s/it] {'loss': 0.0122, 'grad_norm': 2.411612498426104, 'learning_rate': 7.249183387774148e-07, 'completion_length': 170.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5848215818405151, 'reward_std': 0.08234240859746933, 'kl': 0.3037109375, 'epoch': 0.28}
28%|██▊ | 1180/4286 [6:59:18<18:06:14, 20.98s/it] {'loss': 0.0088, 'grad_norm': 4.410159612108068, 'learning_rate': 7.246850209986001e-07, 'completion_length': 182.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.10519503615796566, 'kl': 0.21923828125, 'epoch': 0.28}
28%|██▊ | 1181/4286 [6:59:39<18:06:53, 21.00s/it] {'loss': 0.0086, 'grad_norm': 0.2516776463594745, 'learning_rate': 7.244517032197853e-07, 'completion_length': 162.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.6190476417541504, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.0, 'kl': 0.21337890625, 'epoch': 0.28}
28%|██▊ | 1182/4286 [7:00:02<18:45:48, 21.76s/it] {'loss': 0.0081, 'grad_norm': 3.5417994128815278, 'learning_rate': 7.242183854409706e-07, 'completion_length': 221.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.4747024029493332, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4568453431129456, 'reward_std': 0.1301373913884163, 'kl': 0.201171875, 'epoch': 0.28}
28%|██▊ | 1183/4286 [7:00:22<18:09:07, 21.06s/it] {'loss': 0.008, 'grad_norm': 1.6260407816013942, 'learning_rate': 7.239850676621558e-07, 'completion_length': 176.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.504464328289032, 'rewards/format_reward': 1.0, 'reward': 1.504464328289032, 'reward_std': 0.03541363123804331, 'kl': 0.20068359375, 'epoch': 0.28}
28%|██▊ | 1184/4286 [7:00:43<18:20:39, 21.29s/it] {'loss': 0.0166, 'grad_norm': 2.185092644100462, 'learning_rate': 7.237517498833411e-07, 'completion_length': 178.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.07091423869132996, 'kl': 0.416015625, 'epoch': 0.28}
28%|██▊ | 1185/4286 [7:01:08<19:15:50, 22.36s/it] {'loss': 0.0153, 'grad_norm': 4.778638205627093, 'learning_rate': 7.235184321045264e-07, 'completion_length': 186.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.4444444924592972, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4265874028205872, 'reward_std': 0.07386324927210808, 'kl': 0.3828125, 'epoch': 0.28}
28%|██▊ | 1186/4286 [7:01:28<18:30:07, 21.49s/it] {'loss': 0.0181, 'grad_norm': 3.041047076411027, 'learning_rate': 7.232851143257116e-07, 'completion_length': 157.46429061889648, 'rewards/only_full_func_accuracy_reward': 0.42517009377479553, 'rewards/format_reward': 1.0, 'reward': 1.425170123577118, 'reward_std': 0.08970870077610016, 'kl': 0.451171875, 'epoch': 0.28}
28%|██▊ | 1187/4286 [7:01:49<18:21:54, 21.33s/it] {'loss': 0.0108, 'grad_norm': 1.6665932996251742, 'learning_rate': 7.230517965468968e-07, 'completion_length': 163.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6324406266212463, 'reward_std': 0.04900030419230461, 'kl': 0.26953125, 'epoch': 0.28}
[2025-03-02 12:09:30,010] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1188/4286 [7:02:14<19:25:12, 22.57s/it] {'loss': 0.014, 'grad_norm': 2.770216976307323, 'learning_rate': 7.228184787680821e-07, 'completion_length': 249.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5654762238264084, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.547619104385376, 'reward_std': 0.12960418313741684, 'kl': 0.35009765625, 'epoch': 0.28}
28%|██▊ | 1189/4286 [7:02:37<19:31:58, 22.71s/it] {'loss': 0.0122, 'grad_norm': 1.8436823780930132, 'learning_rate': 7.225851609892674e-07, 'completion_length': 229.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.272321455180645, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2008929252624512, 'reward_std': 0.10674666799604893, 'kl': 0.30517578125, 'epoch': 0.28}
28%|██▊ | 1190/4286 [7:03:02<20:12:10, 23.49s/it] {'loss': 0.0075, 'grad_norm': 2.589933253485172, 'learning_rate': 7.223518432104526e-07, 'completion_length': 240.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.4062500149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3883930444717407, 'reward_std': 0.10214436799287796, 'kl': 0.1884765625, 'epoch': 0.28}
28%|██▊ | 1191/4286 [7:03:29<20:52:20, 24.28s/it] {'loss': 0.0094, 'grad_norm': 2.324047992674918, 'learning_rate': 7.221185254316378e-07, 'completion_length': 209.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5639881193637848, 'rewards/format_reward': 1.0, 'reward': 1.563988208770752, 'reward_std': 0.03223127964884043, 'kl': 0.23583984375, 'epoch': 0.28}
28%|██▊ | 1192/4286 [7:03:51<20:28:06, 23.82s/it] {'loss': 0.0192, 'grad_norm': 2.406017568581392, 'learning_rate': 7.218852076528232e-07, 'completion_length': 199.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.407738134264946, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.37202388048172, 'reward_std': 0.1834355816245079, 'kl': 0.48046875, 'epoch': 0.28}
28%|██▊ | 1193/4286 [7:04:16<20:39:04, 24.04s/it] {'loss': 0.0235, 'grad_norm': 2.6154720777930063, 'learning_rate': 7.216518898740084e-07, 'completion_length': 211.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.379464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3616072535514832, 'reward_std': 0.11862867698073387, 'kl': 0.5869140625, 'epoch': 0.28}
[2025-03-02 12:11:59,429] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1194/4286 [7:04:44<21:34:45, 25.12s/it] {'loss': 0.0239, 'grad_norm': 4.424408651454973, 'learning_rate': 7.214185720951936e-07, 'completion_length': 223.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.6418651640415192, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5704366564750671, 'reward_std': 0.2130800038576126, 'kl': 0.59765625, 'epoch': 0.28}
28%|██▊ | 1195/4286 [7:05:10<21:58:15, 25.59s/it] {'loss': 0.0115, 'grad_norm': 13.163437861760153, 'learning_rate': 7.211852543163789e-07, 'completion_length': 207.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.4806548058986664, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4449406266212463, 'reward_std': 0.2077573835849762, 'kl': 0.2880859375, 'epoch': 0.28}
28%|██▊ | 1196/4286 [7:05:33<21:16:35, 24.79s/it] {'loss': 0.0153, 'grad_norm': 1.8435606000823466, 'learning_rate': 7.209519365375642e-07, 'completion_length': 229.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.4095238298177719, 'rewards/format_reward': 1.0, 'reward': 1.4095239043235779, 'reward_std': 0.08926679566502571, 'kl': 0.3837890625, 'epoch': 0.28}
28%|██▊ | 1197/4286 [7:05:55<20:23:35, 23.77s/it] {'loss': 0.057, 'grad_norm': 3.084305416039816, 'learning_rate': 7.207186187587494e-07, 'completion_length': 179.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.3482142835855484, 'rewards/format_reward': 1.0, 'reward': 1.3482143878936768, 'reward_std': 0.0535714365541935, 'kl': 1.42578125, 'epoch': 0.28}
28%|██▊ | 1198/4286 [7:06:16<19:40:39, 22.94s/it] {'loss': 0.0489, 'grad_norm': 7.653430878969255, 'learning_rate': 7.204853009799346e-07, 'completion_length': 194.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.262400820851326, 'rewards/format_reward': 1.0, 'reward': 1.2624009251594543, 'reward_std': 0.10193236917257309, 'kl': 1.220703125, 'epoch': 0.28}
[2025-03-02 12:13:55,812] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1199/4286 [7:06:40<20:02:43, 23.38s/it] {'loss': 0.0645, 'grad_norm': 4.389315430909886, 'learning_rate': 7.202519832011199e-07, 'completion_length': 206.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.4032738208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3854167461395264, 'reward_std': 0.1402643509209156, 'kl': 1.607421875, 'epoch': 0.28}
[2025-03-02 12:14:19,660] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1200/4286 [7:07:04<20:09:37, 23.52s/it] {'loss': 0.0696, 'grad_norm': 5.785428566718393, 'learning_rate': 7.200186654223051e-07, 'completion_length': 191.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.3256944641470909, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2721230387687683, 'reward_std': 0.20304393023252487, 'kl': 1.73828125, 'epoch': 0.28}
28%|██▊ | 1201/4286 [7:13:09<107:56:11, 125.95s/it] {'loss': 0.0647, 'grad_norm': 7.813532311792468, 'learning_rate': 7.197853476434904e-07, 'completion_length': 181.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.2187500074505806, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.1651785969734192, 'reward_std': 0.20767778158187866, 'kl': 1.6171875, 'epoch': 0.28}
[2025-03-02 12:20:46,122] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1202/4286 [7:13:30<81:03:11, 94.61s/it] {'loss': 0.0299, 'grad_norm': 4.823954358381243, 'learning_rate': 7.195520298646757e-07, 'completion_length': 180.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.58779776096344, 'reward_std': 0.13476894237101078, 'kl': 0.74609375, 'epoch': 0.28}
28%|██▊ | 1203/4286 [7:13:52<62:11:17, 72.62s/it] {'loss': 0.0456, 'grad_norm': 5.379021127255907, 'learning_rate': 7.193187120858609e-07, 'completion_length': 191.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.5461309850215912, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.492559552192688, 'reward_std': 0.24580169469118118, 'kl': 1.140625, 'epoch': 0.28}
[2025-03-02 12:21:29,926] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
28%|██▊ | 1204/4286 [7:14:14<49:18:00, 57.59s/it] {'loss': 0.0642, 'grad_norm': 11.993401976794921, 'learning_rate': 7.190853943070461e-07, 'completion_length': 200.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.4851190596818924, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4494048357009888, 'reward_std': 0.21078380942344666, 'kl': 1.60546875, 'epoch': 0.28}
28%|██▊ | 1205/4286 [7:14:32<39:13:28, 45.83s/it] {'loss': 0.0318, 'grad_norm': 3.319998417713221, 'learning_rate': 7.188520765282315e-07, 'completion_length': 166.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5029762089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4851191639900208, 'reward_std': 0.06547618843615055, 'kl': 0.794921875, 'epoch': 0.28}
28%|██▊ | 1206/4286 [7:14:52<32:25:29, 37.90s/it] {'loss': 0.0754, 'grad_norm': 4.215918344528791, 'learning_rate': 7.186187587494167e-07, 'completion_length': 151.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4375000596046448, 'reward_std': 0.27170250564813614, 'kl': 1.87890625, 'epoch': 0.28}
28%|██▊ | 1207/4286 [7:15:12<27:49:17, 32.53s/it] {'loss': 0.0385, 'grad_norm': 4.690014877065194, 'learning_rate': 7.183854409706019e-07, 'completion_length': 174.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.5848214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5848215818405151, 'reward_std': 0.11860860884189606, 'kl': 0.962890625, 'epoch': 0.28}
28%|██▊ | 1208/4286 [7:15:33<24:53:09, 29.11s/it] {'loss': 0.0218, 'grad_norm':
14.59209839579384, 'learning_rate': 7.181521231917872e-07, 'completion_length': 182.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5089286416769028, 'rewards/format_reward': 1.0, 'reward': 1.5089287161827087, 'reward_std': 0.07663404382765293, 'kl': 0.544921875, 'epoch': 0.28} 28%|██▊ | 1208/4286 [7:15:33<24:53:09, 29.11s/it] 28%|██▊ | 1209/4286 [7:15:55<22:57:54, 26.87s/it] {'loss': 0.0334, 'grad_norm': 3.765489024833096, 'learning_rate': 7.179188054129725e-07, 'completion_length': 186.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.5130208432674408, 'rewards/format_reward': 1.0, 'reward': 1.5130208730697632, 'reward_std': 0.09957936778664589, 'kl': 0.8359375, 'epoch': 0.28} 28%|██▊ | 1209/4286 [7:15:55<22:57:54, 26.87s/it] 28%|██▊ | 1210/4286 [7:16:14<21:09:42, 24.77s/it] {'loss': 0.0391, 'grad_norm': 4.981817395189295, 'learning_rate': 7.176854876341577e-07, 'completion_length': 170.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.4032738506793976, 'rewards/format_reward': 1.0, 'reward': 1.4032739400863647, 'reward_std': 0.04771793261170387, 'kl': 0.9765625, 'epoch': 0.28} 28%|██▊ | 1210/4286 [7:16:14<21:09:42, 24.77s/it] 28%|██▊ | 1211/4286 [7:16:33<19:40:58, 23.04s/it] {'loss': 0.0292, 'grad_norm': 4.741682034199902, 'learning_rate': 7.174521698553429e-07, 'completion_length': 153.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.3561508059501648, 'rewards/format_reward': 1.0, 'reward': 1.3561508655548096, 'reward_std': 0.058749277144670486, 'kl': 0.732421875, 'epoch': 0.28} 28%|██▊ | 1211/4286 [7:16:33<19:40:58, 23.04s/it] 28%|██▊ | 1212/4286 [7:16:53<18:42:07, 21.90s/it] {'loss': 0.0381, 'grad_norm': 2.8594550501322265, 'learning_rate': 7.172188520765282e-07, 'completion_length': 147.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.4335317760705948, 'rewards/format_reward': 1.0, 'reward': 1.4335318803787231, 'reward_std': 0.05725478194653988, 'kl': 0.947265625, 'epoch': 0.28} 28%|██▊ | 1212/4286 
[7:16:53<18:42:07, 21.90s/it] 28%|██▊ | 1213/4286 [7:17:12<17:54:01, 20.97s/it] {'loss': 0.0325, 'grad_norm': 3.8271890939317412, 'learning_rate': 7.169855342977135e-07, 'completion_length': 151.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5438244342803955, 'rewards/format_reward': 1.0, 'reward': 1.5438244938850403, 'reward_std': 0.0961309764534235, 'kl': 0.8125, 'epoch': 0.28} 28%|██▊ | 1213/4286 [7:17:12<17:54:01, 20.97s/it] 28%|██▊ | 1214/4286 [7:17:30<17:08:52, 20.10s/it] {'loss': 0.0131, 'grad_norm': 4.349762835533351, 'learning_rate': 7.167522165188987e-07, 'completion_length': 141.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.532242089509964, 'rewards/format_reward': 1.0, 'reward': 1.5322421193122864, 'reward_std': 0.038548026233911514, 'kl': 0.32861328125, 'epoch': 0.28} 28%|██▊ | 1214/4286 [7:17:30<17:08:52, 20.10s/it] 28%|██▊ | 1215/4286 [7:17:47<16:32:35, 19.39s/it] {'loss': 0.0374, 'grad_norm': 4.581895519944773, 'learning_rate': 7.16518898740084e-07, 'completion_length': 129.32143783569336, 'rewards/only_full_func_accuracy_reward': 0.605654776096344, 'rewards/format_reward': 1.0, 'reward': 1.6056548357009888, 'reward_std': 0.14104852825403214, 'kl': 0.931640625, 'epoch': 0.28} 28%|██▊ | 1215/4286 [7:17:47<16:32:35, 19.39s/it] 28%|██▊ | 1216/4286 [7:18:07<16:35:34, 19.46s/it] {'loss': 0.0262, 'grad_norm': 10.369977598983278, 'learning_rate': 7.162855809612692e-07, 'completion_length': 124.48215103149414, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160714626312256, 'reward_std': 0.08928571827709675, 'kl': 0.6572265625, 'epoch': 0.28} 28%|██▊ | 1216/4286 [7:18:07<16:35:34, 19.46s/it] 28%|██▊ | 1217/4286 [7:18:25<16:08:12, 18.93s/it] {'loss': 0.0279, 'grad_norm': 2.3744988413134376, 'learning_rate': 7.160522631824545e-07, 'completion_length': 140.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.4940476417541504, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.4761905670166016, 'reward_std': 0.04761904664337635, 'kl': 0.6962890625, 'epoch': 0.28} 28%|██▊ | 1217/4286 [7:18:25<16:08:12, 18.93s/it] 28%|██▊ | 1218/4286 [7:18:42<15:39:25, 18.37s/it] {'loss': 0.024, 'grad_norm': 4.696930160597805, 'learning_rate': 7.158189454036398e-07, 'completion_length': 131.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.523809552192688, 'rewards/format_reward': 1.0, 'reward': 1.5238096117973328, 'reward_std': 0.0952381007373333, 'kl': 0.6015625, 'epoch': 0.28} 28%|██▊ | 1218/4286 [7:18:42<15:39:25, 18.37s/it] 28%|██▊ | 1219/4286 [7:18:59<15:22:31, 18.05s/it] {'loss': 0.0122, 'grad_norm': 3.7471553819006385, 'learning_rate': 7.15585627624825e-07, 'completion_length': 124.12500381469727, 'rewards/only_full_func_accuracy_reward': 0.512276828289032, 'rewards/format_reward': 1.0, 'reward': 1.5122769474983215, 'reward_std': 0.07252619229257107, 'kl': 0.3056640625, 'epoch': 0.28} 28%|██▊ | 1219/4286 [7:18:59<15:22:31, 18.05s/it] 28%|██▊ | 1220/4286 [7:19:17<15:20:51, 18.02s/it] {'loss': 0.0348, 'grad_norm': 2.9098800149199047, 'learning_rate': 7.153523098460102e-07, 'completion_length': 131.32143783569336, 'rewards/only_full_func_accuracy_reward': 0.477678582072258, 'rewards/format_reward': 1.0, 'reward': 1.4776787161827087, 'reward_std': 0.041366010904312134, 'kl': 0.87109375, 'epoch': 0.28} 28%|██▊ | 1220/4286 [7:19:17<15:20:51, 18.02s/it] 28%|██▊ | 1221/4286 [7:19:38<16:02:07, 18.83s/it] {'loss': 0.0529, 'grad_norm': 4.0691676752361765, 'learning_rate': 7.151189920671955e-07, 'completion_length': 125.44643020629883, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.598214328289032, 'reward_std': 0.1834663823246956, 'kl': 1.326171875, 'epoch': 0.28} 28%|██▊ | 1221/4286 [7:19:38<16:02:07, 18.83s/it] 29%|██▊ | 1222/4286 [7:19:55<15:42:50, 18.46s/it] {'loss': 0.0317, 'grad_norm': 8.284445111958465, 'learning_rate': 7.148856742883808e-07, 
'completion_length': 126.32143783569336, 'rewards/only_full_func_accuracy_reward': 0.4836309850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4479167461395264, 'reward_std': 0.22597680240869522, 'kl': 0.791015625, 'epoch': 0.29} 29%|██▊ | 1222/4286 [7:19:55<15:42:50, 18.46s/it] 29%|██▊ | 1223/4286 [7:20:13<15:23:52, 18.10s/it] {'loss': 0.0799, 'grad_norm': 4.216408349704582, 'learning_rate': 7.14652356509566e-07, 'completion_length': 111.37500381469727, 'rewards/only_full_func_accuracy_reward': 0.4613095223903656, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4255953431129456, 'reward_std': 0.13749706000089645, 'kl': 1.9921875, 'epoch': 0.29} 29%|██▊ | 1223/4286 [7:20:13<15:23:52, 18.10s/it] 29%|██▊ | 1224/4286 [7:20:29<15:00:14, 17.64s/it] {'loss': 0.0535, 'grad_norm': 6.73539524685938, 'learning_rate': 7.144190387307512e-07, 'completion_length': 104.96429061889648, 'rewards/only_full_func_accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4196429252624512, 'reward_std': 0.18551077879965305, 'kl': 1.33984375, 'epoch': 0.29} 29%|██▊ | 1224/4286 [7:20:29<15:00:14, 17.64s/it] 29%|██▊ | 1225/4286 [7:20:46<14:55:07, 17.55s/it] {'loss': 0.0647, 'grad_norm': 451.3131504708098, 'learning_rate': 7.141857209519366e-07, 'completion_length': 120.6785774230957, 'rewards/only_full_func_accuracy_reward': 0.46398815512657166, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4282739162445068, 'reward_std': 0.14711357653141022, 'kl': 1.619140625, 'epoch': 0.29} 29%|██▊ | 1225/4286 [7:20:46<14:55:07, 17.55s/it] 29%|██▊ | 1226/4286 [7:21:04<14:47:54, 17.41s/it] {'loss': 0.0403, 'grad_norm': 3.8134687298648395, 'learning_rate': 7.139524031731218e-07, 'completion_length': 129.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.485119104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4494048953056335, 'reward_std': 0.11831296607851982, 'kl': 1.0078125, 'epoch': 0.29} 29%|██▊ | 
1226/4286 [7:21:04<14:47:54, 17.41s/it] 29%|██▊ | 1227/4286 [7:21:22<15:07:20, 17.80s/it] {'loss': 0.0423, 'grad_norm': 21.358996012856004, 'learning_rate': 7.13719085394307e-07, 'completion_length': 98.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.5729166865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5550596117973328, 'reward_std': 0.16109290719032288, 'kl': 1.0595703125, 'epoch': 0.29} 29%|██▊ | 1227/4286 [7:21:22<15:07:20, 17.80s/it] 29%|██▊ | 1228/4286 [7:21:39<14:55:05, 17.56s/it] {'loss': 0.0136, 'grad_norm': 5.949100706546981, 'learning_rate': 7.134857676154923e-07, 'completion_length': 123.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.5193452835083008, 'rewards/format_reward': 1.0, 'reward': 1.5193453431129456, 'reward_std': 0.046226111240684986, 'kl': 0.33935546875, 'epoch': 0.29} 29%|██▊ | 1228/4286 [7:21:39<14:55:05, 17.56s/it] 29%|██▊ | 1229/4286 [7:21:57<14:55:53, 17.58s/it] {'loss': 0.0407, 'grad_norm': 3.524242504182308, 'learning_rate': 7.132524498366775e-07, 'completion_length': 118.10715103149414, 'rewards/only_full_func_accuracy_reward': 0.6458334028720856, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.610119104385376, 'reward_std': 0.1353098303079605, 'kl': 1.017578125, 'epoch': 0.29} 29%|██▊ | 1229/4286 [7:21:57<14:55:53, 17.58s/it] 29%|██▊ | 1230/4286 [7:22:14<14:43:51, 17.35s/it] {'loss': 0.0696, 'grad_norm': 4.553740779752005, 'learning_rate': 7.130191320578628e-07, 'completion_length': 125.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.4062500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.388392984867096, 'reward_std': 0.0982142835855484, 'kl': 1.7314453125, 'epoch': 0.29} 29%|██▊ | 1230/4286 [7:22:14<14:43:51, 17.35s/it] 29%|██▊ | 1231/4286 [7:22:32<14:54:41, 17.57s/it] {'loss': 0.059, 'grad_norm': 7.067762857739206, 'learning_rate': 7.127858142790481e-07, 'completion_length': 125.8214340209961, 'rewards/only_full_func_accuracy_reward': 
0.5312500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.513392984867096, 'reward_std': 0.09410358220338821, 'kl': 1.48046875, 'epoch': 0.29} 29%|██▊ | 1231/4286 [7:22:32<14:54:41, 17.57s/it] 29%|██▊ | 1232/4286 [7:22:49<14:44:13, 17.37s/it] {'loss': 0.0514, 'grad_norm': 4.274011433281002, 'learning_rate': 7.125524965002333e-07, 'completion_length': 116.64286422729492, 'rewards/only_full_func_accuracy_reward': 0.4449404925107956, 'rewards/format_reward': 1.0, 'reward': 1.4449405670166016, 'reward_std': 0.06434167921543121, 'kl': 1.28515625, 'epoch': 0.29} 29%|██▊ | 1232/4286 [7:22:49<14:44:13, 17.37s/it] 29%|██▉ | 1233/4286 [7:23:06<14:41:52, 17.33s/it] {'loss': 0.0406, 'grad_norm': 2.9794478468999843, 'learning_rate': 7.123191787214185e-07, 'completion_length': 110.28571701049805, 'rewards/only_full_func_accuracy_reward': 0.5833333432674408, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5476191639900208, 'reward_std': 0.11131968349218369, 'kl': 1.013671875, 'epoch': 0.29} 29%|██▉ | 1233/4286 [7:23:06<14:41:52, 17.33s/it] 29%|██▉ | 1234/4286 [7:23:23<14:34:05, 17.18s/it] {'loss': 0.0501, 'grad_norm': 4.925655766402955, 'learning_rate': 7.120858609426038e-07, 'completion_length': 109.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4866071790456772, 'rewards/format_reward': 1.0, 'reward': 1.4866072535514832, 'reward_std': 0.03621618077158928, 'kl': 1.24609375, 'epoch': 0.29} 29%|██▉ | 1234/4286 [7:23:23<14:34:05, 17.18s/it] 29%|██▉ | 1235/4286 [7:23:42<14:59:15, 17.68s/it] {'loss': 0.0308, 'grad_norm': 5.2245200535544, 'learning_rate': 7.118525431637891e-07, 'completion_length': 139.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.541666716337204, 'rewards/format_reward': 1.0, 'reward': 1.5416668057441711, 'reward_std': 0.13831907510757446, 'kl': 0.7705078125, 'epoch': 0.29} 29%|██▉ | 1235/4286 [7:23:42<14:59:15, 17.68s/it] 29%|██▉ | 1236/4286 [7:23:58<14:45:16, 17.42s/it] {'loss': 0.0321, 'grad_norm': 
5.7163709166115035, 'learning_rate': 7.116192253849743e-07, 'completion_length': 123.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5543154776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5364583730697632, 'reward_std': 0.08184524066746235, 'kl': 0.8017578125, 'epoch': 0.29} 29%|██▉ | 1236/4286 [7:23:58<14:45:16, 17.42s/it] 29%|██▉ | 1237/4286 [7:24:16<14:41:00, 17.34s/it] {'loss': 0.038, 'grad_norm': 2.4838923539422613, 'learning_rate': 7.113859076061595e-07, 'completion_length': 129.33929061889648, 'rewards/only_full_func_accuracy_reward': 0.5877976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5699405670166016, 'reward_std': 0.12133403494954109, 'kl': 0.94921875, 'epoch': 0.29} 29%|██▉ | 1237/4286 [7:24:16<14:41:00, 17.34s/it] 29%|██▉ | 1238/4286 [7:24:33<14:44:27, 17.41s/it] {'loss': 0.0117, 'grad_norm': 4.416234845776262, 'learning_rate': 7.111525898273449e-07, 'completion_length': 143.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.427083358168602, 'rewards/format_reward': 1.0, 'reward': 1.4270834922790527, 'reward_std': 0.08999288082122803, 'kl': 0.29345703125, 'epoch': 0.29} 29%|██▉ | 1238/4286 [7:24:33<14:44:27, 17.41s/it] 29%|██▉ | 1239/4286 [7:24:49<14:22:23, 16.98s/it] {'loss': 0.01, 'grad_norm': 2.3415875711771252, 'learning_rate': 7.109192720485301e-07, 'completion_length': 118.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5238095670938492, 'rewards/format_reward': 1.0, 'reward': 1.5238096117973328, 'reward_std': 0.07419108599424362, 'kl': 0.25048828125, 'epoch': 0.29} 29%|██▉ | 1239/4286 [7:24:49<14:22:23, 16.98s/it] 29%|██▉ | 1240/4286 [7:25:07<14:35:10, 17.24s/it] {'loss': 0.0086, 'grad_norm': 3.6417924621952187, 'learning_rate': 7.106859542697153e-07, 'completion_length': 147.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 1.0, 'reward': 1.6116072535514832, 'reward_std': 0.040906441397964954, 'kl': 0.2138671875, 
'epoch': 0.29} 29%|██▉ | 1240/4286 [7:25:07<14:35:10, 17.24s/it] 29%|██▉ | 1241/4286 [7:25:24<14:34:41, 17.24s/it] {'loss': 0.0172, 'grad_norm': 48.245467624001215, 'learning_rate': 7.104526364909006e-07, 'completion_length': 135.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.5967262238264084, 'rewards/format_reward': 1.0, 'reward': 1.5967262983322144, 'reward_std': 0.04238656908273697, 'kl': 0.43017578125, 'epoch': 0.29} 29%|██▉ | 1241/4286 [7:25:24<14:34:41, 17.24s/it] 29%|██▉ | 1242/4286 [7:25:42<14:37:36, 17.30s/it] {'loss': 0.0169, 'grad_norm': 1.5032782710085706, 'learning_rate': 7.102193187120859e-07, 'completion_length': 131.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.06547619588673115, 'kl': 0.423828125, 'epoch': 0.29} 29%|██▉ | 1242/4286 [7:25:42<14:37:36, 17.30s/it] 29%|██▉ | 1243/4286 [7:25:58<14:29:17, 17.14s/it] {'loss': 0.0306, 'grad_norm': 11.25907766575086, 'learning_rate': 7.099860009332711e-07, 'completion_length': 122.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.5431548207998276, 'rewards/format_reward': 1.0, 'reward': 1.5431548357009888, 'reward_std': 0.08815119788050652, 'kl': 0.763671875, 'epoch': 0.29} 29%|██▉ | 1243/4286 [7:25:58<14:29:17, 17.14s/it] 29%|██▉ | 1244/4286 [7:26:16<14:29:24, 17.15s/it] {'loss': 0.0167, 'grad_norm': 2.9845933133666542, 'learning_rate': 7.097526831544563e-07, 'completion_length': 133.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.44851192831993103, 'rewards/format_reward': 1.0, 'reward': 1.4485120177268982, 'reward_std': 0.06726190447807312, 'kl': 0.4169921875, 'epoch': 0.29} 29%|██▉ | 1244/4286 [7:26:16<14:29:24, 17.15s/it] 29%|██▉ | 1245/4286 [7:26:32<14:13:25, 16.84s/it] {'loss': 0.0148, 'grad_norm': 83.60898708587182, 'learning_rate': 7.095193653756416e-07, 'completion_length': 109.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.4494048058986664, 
'rewards/format_reward': 1.0, 'reward': 1.4494048953056335, 'reward_std': 0.02221459336578846, 'kl': 0.36962890625, 'epoch': 0.29} 29%|██▉ | 1245/4286 [7:26:32<14:13:25, 16.84s/it] 29%|██▉ | 1246/4286 [7:26:51<14:46:47, 17.50s/it] {'loss': 0.0715, 'grad_norm': 6.14956979355258, 'learning_rate': 7.092860475968269e-07, 'completion_length': 139.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.4508928805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4330357909202576, 'reward_std': 0.0991867259144783, 'kl': 1.78515625, 'epoch': 0.29} 29%|██▉ | 1246/4286 [7:26:51<14:46:47, 17.50s/it] 29%|██▉ | 1247/4286 [7:27:09<14:54:39, 17.66s/it] {'loss': 0.0319, 'grad_norm': 4.197428628701566, 'learning_rate': 7.090527298180121e-07, 'completion_length': 122.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.4940476566553116, 'rewards/format_reward': 1.0, 'reward': 1.4940477013587952, 'reward_std': 0.0821499191224575, 'kl': 0.794921875, 'epoch': 0.29} 29%|██▉ | 1247/4286 [7:27:09<14:54:39, 17.66s/it] 29%|██▉ | 1248/4286 [7:27:28<15:10:56, 17.99s/it] {'loss': 0.0355, 'grad_norm': 4.659165482237173, 'learning_rate': 7.088194120391974e-07, 'completion_length': 140.57143783569336, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.10625508427619934, 'kl': 0.884765625, 'epoch': 0.29} 29%|██▉ | 1248/4286 [7:27:28<15:10:56, 17.99s/it] 29%|██▉ | 1249/4286 [7:27:45<15:08:34, 17.95s/it] {'loss': 0.0345, 'grad_norm': 10.431194882890471, 'learning_rate': 7.085860942603826e-07, 'completion_length': 138.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 1.0, 'reward': 1.455357313156128, 'reward_std': 0.08358007669448853, 'kl': 0.865234375, 'epoch': 0.29} 29%|██▉ | 1249/4286 [7:27:45<15:08:34, 17.95s/it] 29%|██▉ | 1250/4286 [7:28:02<14:54:31, 17.68s/it] {'loss': 0.0426, 'grad_norm': 3.9363472803815367, 'learning_rate': 
7.083527764815678e-07, 'completion_length': 119.78571701049805, 'rewards/only_full_func_accuracy_reward': 0.370535746216774, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3348215222358704, 'reward_std': 0.18615180253982544, 'kl': 1.064453125, 'epoch': 0.29} 29%|██▉ | 1250/4286 [7:28:02<14:54:31, 17.68s/it] 29%|██▉ | 1251/4286 [7:28:21<15:07:48, 17.95s/it] {'loss': 0.0657, 'grad_norm': 20.23681605583018, 'learning_rate': 7.081194587027532e-07, 'completion_length': 143.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.505952462553978, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4880953431129456, 'reward_std': 0.10868101939558983, 'kl': 1.64453125, 'epoch': 0.29} 29%|██▉ | 1251/4286 [7:28:21<15:07:48, 17.95s/it] 29%|██▉ | 1252/4286 [7:28:38<14:50:43, 17.61s/it] {'loss': 0.068, 'grad_norm': 5.908283744486684, 'learning_rate': 7.078861409239384e-07, 'completion_length': 124.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.52976194024086, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4940477013587952, 'reward_std': 0.19336090236902237, 'kl': 1.6953125, 'epoch': 0.29} 29%|██▉ | 1252/4286 [7:28:38<14:50:43, 17.61s/it] 29%|██▉ | 1253/4286 [7:28:56<14:53:13, 17.67s/it] {'loss': 0.0442, 'grad_norm': 15.587686690798844, 'learning_rate': 7.076528231451236e-07, 'completion_length': 133.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.4598214626312256, 'rewards/format_reward': 1.0, 'reward': 1.4598215222358704, 'reward_std': 0.1031927578151226, 'kl': 1.109375, 'epoch': 0.29} 29%|██▉ | 1253/4286 [7:28:56<14:53:13, 17.67s/it] 29%|██▉ | 1254/4286 [7:29:13<14:43:44, 17.49s/it] {'loss': 0.0895, 'grad_norm': 8.72187817340298, 'learning_rate': 7.07419505366309e-07, 'completion_length': 126.78572082519531, 'rewards/only_full_func_accuracy_reward': 0.4449405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.427083432674408, 'reward_std': 0.2469395026564598, 'kl': 2.2421875, 'epoch': 0.29} 29%|██▉ | 1254/4286 
[7:29:13<14:43:44, 17.49s/it] 29%|██▉ | 1255/4286 [7:29:31<14:54:40, 17.71s/it] {'loss': 0.0648, 'grad_norm': 11.273337280825363, 'learning_rate': 7.071861875874942e-07, 'completion_length': 111.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.4226190969347954, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3869048357009888, 'reward_std': 0.13771944865584373, 'kl': 1.62109375, 'epoch': 0.29} 29%|██▉ | 1255/4286 [7:29:31<14:54:40, 17.71s/it] 29%|██▉ | 1256/4286 [7:29:49<15:02:54, 17.88s/it] {'loss': 0.0932, 'grad_norm': 6.758658030123631, 'learning_rate': 7.069528698086794e-07, 'completion_length': 124.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.3735119253396988, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.33779776096344, 'reward_std': 0.19448721408843994, 'kl': 2.3359375, 'epoch': 0.29} 29%|██▉ | 1256/4286 [7:29:49<15:02:54, 17.88s/it] 29%|██▉ | 1257/4286 [7:30:07<15:00:54, 17.85s/it] {'loss': 0.0629, 'grad_norm': 5.540040632277562, 'learning_rate': 7.067195520298646e-07, 'completion_length': 124.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.4360119551420212, 'rewards/format_reward': 1.0, 'reward': 1.4360119700431824, 'reward_std': 0.13537286967039108, 'kl': 1.57421875, 'epoch': 0.29} 29%|██▉ | 1257/4286 [7:30:07<15:00:54, 17.85s/it] 29%|██▉ | 1258/4286 [7:30:29<16:03:28, 19.09s/it] {'loss': 0.0636, 'grad_norm': 4.039535572608405, 'learning_rate': 7.064862342510499e-07, 'completion_length': 140.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.4002976417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.364583432674408, 'reward_std': 0.14279241859912872, 'kl': 1.58984375, 'epoch': 0.29} 29%|██▉ | 1258/4286 [7:30:29<16:03:28, 19.09s/it] 29%|██▉ | 1259/4286 [7:30:49<16:20:38, 19.44s/it] {'loss': 0.1197, 'grad_norm': 7.199914239163742, 'learning_rate': 7.062529164722352e-07, 'completion_length': 148.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.4827381372451782, 
'rewards/format_reward': 0.9285714626312256, 'reward': 1.4113096594810486, 'reward_std': 0.2518298402428627, 'kl': 2.9921875, 'epoch': 0.29} 29%|██▉ | 1259/4286 [7:30:49<16:20:38, 19.44s/it] 29%|██▉ | 1260/4286 [7:31:06<15:45:24, 18.75s/it] {'loss': 0.0244, 'grad_norm': 4.2297085381505175, 'learning_rate': 7.060195986934204e-07, 'completion_length': 108.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.543154776096344, 'rewards/format_reward': 1.0, 'reward': 1.5431548357009888, 'reward_std': 0.0267857164144516, 'kl': 0.6083984375, 'epoch': 0.29} 29%|██▉ | 1260/4286 [7:31:06<15:45:24, 18.75s/it] 29%|██▉ | 1261/4286 [7:31:24<15:26:09, 18.37s/it] {'loss': 0.0447, 'grad_norm': 4.313075148611977, 'learning_rate': 7.057862809146057e-07, 'completion_length': 109.73215103149414, 'rewards/only_full_func_accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803572535514832, 'reward_std': 0.15610390715301037, 'kl': 1.12109375, 'epoch': 0.29} 29%|██▉ | 1261/4286 [7:31:24<15:26:09, 18.37s/it] 29%|██▉ | 1262/4286 [7:31:42<15:24:25, 18.34s/it] {'loss': 0.0174, 'grad_norm': 2.588722520144967, 'learning_rate': 7.055529631357909e-07, 'completion_length': 134.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041667461395264, 'reward_std': 0.0736328512430191, 'kl': 0.43701171875, 'epoch': 0.29} 29%|██▉ | 1262/4286 [7:31:42<15:24:25, 18.34s/it] 29%|██▉ | 1263/4286 [7:31:58<14:50:55, 17.68s/it] {'loss': 0.015, 'grad_norm': 4.977516818990222, 'learning_rate': 7.053196453569762e-07, 'completion_length': 110.25000381469727, 'rewards/only_full_func_accuracy_reward': 0.4404762238264084, 'rewards/format_reward': 1.0, 'reward': 1.4404762983322144, 'reward_std': 0.03633531183004379, 'kl': 0.3759765625, 'epoch': 0.29} 29%|██▉ | 1263/4286 [7:31:58<14:50:55, 17.68s/it] 29%|██▉ | 1264/4286 [7:32:15<14:30:53, 17.29s/it] {'loss': 0.0192, 'grad_norm': 3.7458556800434812, 
'learning_rate': 7.050863275781615e-07, 'completion_length': 123.76786422729492, 'rewards/only_full_func_accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.482142984867096, 'reward_std': 0.10494661144912243, 'kl': 0.48046875, 'epoch': 0.29} 29%|██▉ | 1264/4286 [7:32:15<14:30:53, 17.29s/it] 30%|██▉ | 1265/4286 [7:32:35<15:09:41, 18.07s/it] {'loss': 0.0655, 'grad_norm': 13.760723501076528, 'learning_rate': 7.048530097993467e-07, 'completion_length': 136.0535774230957, 'rewards/only_full_func_accuracy_reward': 0.4449404776096344, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4092262983322144, 'reward_std': 0.23768610507249832, 'kl': 1.63671875, 'epoch': 0.3} 30%|██▉ | 1265/4286 [7:32:35<15:09:41, 18.07s/it] 30%|██▉ | 1266/4286 [7:32:53<15:18:45, 18.25s/it] {'loss': 0.0353, 'grad_norm': 4.871403350049716, 'learning_rate': 7.046196920205319e-07, 'completion_length': 129.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5125425457954407, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4946855306625366, 'reward_std': 0.1317993849515915, 'kl': 0.8828125, 'epoch': 0.3} 30%|██▉ | 1266/4286 [7:32:53<15:18:45, 18.25s/it] 30%|██▉ | 1267/4286 [7:33:15<16:09:08, 19.26s/it] {'loss': 0.0239, 'grad_norm': 4.0929605294985345, 'learning_rate': 7.043863742417172e-07, 'completion_length': 148.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.4657738506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4479168057441711, 'reward_std': 0.07029405236244202, 'kl': 0.59765625, 'epoch': 0.3} 30%|██▉ | 1267/4286 [7:33:15<16:09:08, 19.26s/it] 30%|██▉ | 1268/4286 [7:33:33<15:54:42, 18.98s/it] {'loss': 0.0406, 'grad_norm': 13.240379506890772, 'learning_rate': 7.041530564629025e-07, 'completion_length': 161.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.3467262089252472, 'rewards/format_reward': 1.0, 'reward': 1.3467262983322144, 'reward_std': 0.08918799459934235, 'kl': 1.013671875, 'epoch': 0.3} 30%|██▉ 
| 1268/4286 [7:33:33<15:54:42, 18.98s/it] 30%|██▉ | 1269/4286 [7:33:52<15:48:41, 18.87s/it] {'loss': 0.0257, 'grad_norm': 39.65592906873047, 'learning_rate': 7.039197386840877e-07, 'completion_length': 142.08928680419922, 'rewards/only_full_func_accuracy_reward': 0.51339291036129, 'rewards/format_reward': 1.0, 'reward': 1.513392984867096, 'reward_std': 0.06090506538748741, 'kl': 0.6416015625, 'epoch': 0.3} 30%|██▉ | 1269/4286 [7:33:52<15:48:41, 18.87s/it] 30%|██▉ | 1270/4286 [7:34:12<16:03:03, 19.16s/it] {'loss': 0.034, 'grad_norm': 7.515077772672871, 'learning_rate': 7.036864209052729e-07, 'completion_length': 166.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5020833611488342, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4842262864112854, 'reward_std': 0.11196789890527725, 'kl': 0.85009765625, 'epoch': 0.3} 30%|██▉ | 1270/4286 [7:34:12<16:03:03, 19.16s/it] 30%|██▉ | 1271/4286 [7:34:30<15:56:48, 19.04s/it] {'loss': 0.0094, 'grad_norm': 7.3552271040410035, 'learning_rate': 7.034531031264583e-07, 'completion_length': 172.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.3630952537059784, 'rewards/format_reward': 1.0, 'reward': 1.3630953431129456, 'reward_std': 0.04400576837360859, 'kl': 0.2353515625, 'epoch': 0.3} 30%|██▉ | 1271/4286 [7:34:30<15:56:48, 19.04s/it] 30%|██▉ | 1272/4286 [7:34:50<16:08:35, 19.28s/it] {'loss': 0.017, 'grad_norm': 5.73948003626263, 'learning_rate': 7.032197853476435e-07, 'completion_length': 166.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.65327388048172, 'reward_std': 0.08404049091041088, 'kl': 0.42578125, 'epoch': 0.3} 30%|██▉ | 1272/4286 [7:34:50<16:08:35, 19.28s/it] 30%|██▉ | 1273/4286 [7:35:09<16:06:23, 19.24s/it] {'loss': 0.023, 'grad_norm': 3.55472257767113, 'learning_rate': 7.029864675688287e-07, 'completion_length': 171.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.4710317552089691, 'rewards/format_reward': 1.0, 
'reward': 1.4710317850112915, 'reward_std': 0.06792735680937767, 'kl': 0.5732421875, 'epoch': 0.3} 30%|██▉ | 1273/4286 [7:35:09<16:06:23, 19.24s/it] 30%|██▉ | 1274/4286 [7:35:28<15:59:18, 19.11s/it] {'loss': 0.0168, 'grad_norm': 5.7852857958757165, 'learning_rate': 7.02753149790014e-07, 'completion_length': 172.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6279762983322144, 'reward_std': 0.07419108413159847, 'kl': 0.419921875, 'epoch': 0.3} 30%|██▉ | 1274/4286 [7:35:28<15:59:18, 19.11s/it] 30%|██▉ | 1275/4286 [7:35:47<15:56:35, 19.06s/it] {'loss': 0.0132, 'grad_norm': 9.701841133415165, 'learning_rate': 7.025198320111993e-07, 'completion_length': 176.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6297619044780731, 'rewards/format_reward': 1.0, 'reward': 1.629762053489685, 'reward_std': 0.07061965763568878, 'kl': 0.330078125, 'epoch': 0.3} 30%|██▉ | 1275/4286 [7:35:47<15:56:35, 19.06s/it] 30%|██▉ | 1276/4286 [7:36:06<15:52:04, 18.98s/it] {'loss': 0.0171, 'grad_norm': 4.438903479902117, 'learning_rate': 7.022865142323845e-07, 'completion_length': 151.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5550595968961716, 'rewards/format_reward': 1.0, 'reward': 1.5550596714019775, 'reward_std': 0.12565408647060394, 'kl': 0.4267578125, 'epoch': 0.3} 30%|██▉ | 1276/4286 [7:36:06<15:52:04, 18.98s/it] 30%|██▉ | 1277/4286 [7:36:26<16:01:39, 19.18s/it] {'loss': 0.0252, 'grad_norm': 14.45807980675753, 'learning_rate': 7.020531964535698e-07, 'completion_length': 180.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.47023816406726837, 'rewards/format_reward': 1.0, 'reward': 1.470238208770752, 'reward_std': 0.08769077807664871, 'kl': 0.63037109375, 'epoch': 0.3} 30%|██▉ | 1277/4286 [7:36:26<16:01:39, 19.18s/it] 30%|██▉ | 1278/4286 [7:36:44<15:53:43, 19.02s/it] {'loss': 0.0215, 'grad_norm': 6.521808414745734, 'learning_rate': 7.01819878674755e-07, 'completion_length': 
160.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5163691192865372, 'rewards/format_reward': 1.0, 'reward': 1.5163692235946655, 'reward_std': 0.1370106115937233, 'kl': 0.53515625, 'epoch': 0.3} 30%|██▉ | 1278/4286 [7:36:44<15:53:43, 19.02s/it] 30%|██▉ | 1279/4286 [7:37:04<16:11:46, 19.39s/it] {'loss': 0.0315, 'grad_norm': 13.260925201436804, 'learning_rate': 7.015865608959402e-07, 'completion_length': 166.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.3973214626312256, 'rewards/format_reward': 1.0, 'reward': 1.3973215222358704, 'reward_std': 0.11403652280569077, 'kl': 0.787109375, 'epoch': 0.3} 30%|██▉ | 1279/4286 [7:37:04<16:11:46, 19.39s/it] 30%|██▉ | 1280/4286 [7:37:25<16:24:11, 19.64s/it] {'loss': 0.0578, 'grad_norm': 5.603053728963809, 'learning_rate': 7.013532431171255e-07, 'completion_length': 150.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.4568452686071396, 'rewards/format_reward': 1.0, 'reward': 1.4568453431129456, 'reward_std': 0.12301471456885338, 'kl': 1.4453125, 'epoch': 0.3} 30%|██▉ | 1280/4286 [7:37:25<16:24:11, 19.64s/it] 30%|██▉ | 1281/4286 [7:37:43<16:09:29, 19.36s/it] {'loss': 0.0219, 'grad_norm': 8.821444107109393, 'learning_rate': 7.011199253383108e-07, 'completion_length': 172.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.05967651307582855, 'kl': 0.544921875, 'epoch': 0.3} 30%|██▉ | 1281/4286 [7:37:43<16:09:29, 19.36s/it] 30%|██▉ | 1282/4286 [7:38:04<16:31:54, 19.81s/it] {'loss': 0.0463, 'grad_norm': 4.1233845510294955, 'learning_rate': 7.00886607559496e-07, 'completion_length': 190.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3214285969734192, 'reward_std': 0.15709706768393517, 'kl': 1.16015625, 'epoch': 0.3} 30%|██▉ | 1282/4286 [7:38:04<16:31:54, 19.81s/it] 30%|██▉ | 1283/4286 [7:38:23<16:21:38, 19.61s/it] 
{'loss': 0.0269, 'grad_norm': 5.759445420229571, 'learning_rate': 7.006532897806812e-07, 'completion_length': 182.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6160714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.12935745529830456, 'kl': 0.6708984375, 'epoch': 0.3} 30%|██▉ | 1283/4286 [7:38:23<16:21:38, 19.61s/it] 30%|██▉ | 1284/4286 [7:38:45<16:50:45, 20.20s/it] {'loss': 0.0412, 'grad_norm': 28.46602736583151, 'learning_rate': 7.004199720018666e-07, 'completion_length': 161.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5401785671710968, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4866071939468384, 'reward_std': 0.19636385142803192, 'kl': 1.029296875, 'epoch': 0.3} 30%|██▉ | 1284/4286 [7:38:45<16:50:45, 20.20s/it] 30%|██▉ | 1285/4286 [7:39:05<16:49:13, 20.18s/it] {'loss': 0.0948, 'grad_norm': 11.572512665361776, 'learning_rate': 7.001866542230518e-07, 'completion_length': 191.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.4002976566553116, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.364583432674408, 'reward_std': 0.09183454979211092, 'kl': 2.3671875, 'epoch': 0.3} 30%|██▉ | 1285/4286 [7:39:05<16:49:13, 20.18s/it] 30%|███ | 1286/4286 [7:39:25<16:40:07, 20.00s/it] {'loss': 0.0712, 'grad_norm': 7.424281025680123, 'learning_rate': 6.99953336444237e-07, 'completion_length': 170.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.401785746216774, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3303572535514832, 'reward_std': 0.2728928104043007, 'kl': 1.78515625, 'epoch': 0.3} 30%|███ | 1286/4286 [7:39:25<16:40:07, 20.00s/it] 30%|███ | 1287/4286 [7:39:43<16:19:25, 19.59s/it] {'loss': 0.0428, 'grad_norm': 7.911814049021749, 'learning_rate': 6.997200186654223e-07, 'completion_length': 162.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.4181547909975052, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.364583432674408, 'reward_std': 
0.21799591183662415, 'kl': 1.072265625, 'epoch': 0.3} 30%|███ | 1287/4286 [7:39:43<16:19:25, 19.59s/it] 30%|███ | 1288/4286 [7:40:02<16:01:19, 19.24s/it] {'loss': 0.0236, 'grad_norm': 25.365719033417182, 'learning_rate': 6.994867008866076e-07, 'completion_length': 165.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.4836309552192688, 'rewards/format_reward': 1.0, 'reward': 1.4836310148239136, 'reward_std': 0.11480056867003441, 'kl': 0.58984375, 'epoch': 0.3} 30%|███ | 1288/4286 [7:40:02<16:01:19, 19.24s/it] 30%|███ | 1289/4286 [7:40:21<16:02:54, 19.28s/it] {'loss': 0.0433, 'grad_norm': 7.985711103566347, 'learning_rate': 6.992533831077928e-07, 'completion_length': 155.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.56101194024086, 'rewards/format_reward': 1.0, 'reward': 1.561012089252472, 'reward_std': 0.1429084986448288, 'kl': 1.08203125, 'epoch': 0.3} 30%|███ | 1289/4286 [7:40:21<16:02:54, 19.28s/it] 30%|███ | 1290/4286 [7:40:40<15:58:03, 19.19s/it] {'loss': 0.0515, 'grad_norm': 6.3124970980826065, 'learning_rate': 6.99020065328978e-07, 'completion_length': 159.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.4675595462322235, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.449702501296997, 'reward_std': 0.13598697632551193, 'kl': 1.2890625, 'epoch': 0.3} 30%|███ | 1290/4286 [7:40:40<15:58:03, 19.19s/it] 30%|███ | 1291/4286 [7:41:00<16:07:07, 19.37s/it] {'loss': 0.0617, 'grad_norm': 3.485042860395492, 'learning_rate': 6.987867475501633e-07, 'completion_length': 174.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.37291668355464935, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3550596833229065, 'reward_std': 0.1328258216381073, 'kl': 1.5390625, 'epoch': 0.3} 30%|███ | 1291/4286 [7:41:00<16:07:07, 19.37s/it] 30%|███ | 1292/4286 [7:41:19<16:04:34, 19.33s/it] {'loss': 0.0231, 'grad_norm': 6.948577586567361, 'learning_rate': 6.985534297713486e-07, 'completion_length': 190.55358123779297, 
'rewards/only_full_func_accuracy_reward': 0.5744048207998276, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5565477013587952, 'reward_std': 0.15102548897266388, 'kl': 0.578125, 'epoch': 0.3} 30%|███ | 1292/4286 [7:41:19<16:04:34, 19.33s/it] 30%|███ | 1293/4286 [7:41:39<16:11:06, 19.47s/it] {'loss': 0.0677, 'grad_norm': 10.0172447286121, 'learning_rate': 6.983201119925338e-07, 'completion_length': 191.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.361607164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3437500596046448, 'reward_std': 0.15331296622753143, 'kl': 1.6953125, 'epoch': 0.3} 30%|███ | 1293/4286 [7:41:39<16:11:06, 19.47s/it] 30%|███ | 1294/4286 [7:42:00<16:40:48, 20.07s/it] {'loss': 0.0725, 'grad_norm': 5.431481126601624, 'learning_rate': 6.980867942137191e-07, 'completion_length': 172.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.41011908650398254, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3744048476219177, 'reward_std': 0.22503092885017395, 'kl': 1.80859375, 'epoch': 0.3} 30%|███ | 1294/4286 [7:42:00<16:40:48, 20.07s/it] 30%|███ | 1295/4286 [7:42:20<16:30:25, 19.87s/it] {'loss': 0.0301, 'grad_norm': 36.83066623251696, 'learning_rate': 6.978534764349043e-07, 'completion_length': 165.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5267857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5089287161827087, 'reward_std': 0.12181013450026512, 'kl': 0.751953125, 'epoch': 0.3} 30%|███ | 1295/4286 [7:42:20<16:30:25, 19.87s/it] 30%|███ | 1296/4286 [7:42:38<15:59:34, 19.26s/it] {'loss': 0.0238, 'grad_norm': 10.024063609316467, 'learning_rate': 6.976201586560896e-07, 'completion_length': 156.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.605654776096344, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.1639586165547371, 'kl': 0.59375, 'epoch': 0.3} 30%|███ | 1296/4286 [7:42:38<15:59:34, 19.26s/it] 30%|███ | 1297/4286 
[7:42:57<15:54:54, 19.17s/it] {'loss': 0.0474, 'grad_norm': 6.57290733957294, 'learning_rate': 6.973868408772749e-07, 'completion_length': 181.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.4598214775323868, 'rewards/format_reward': 1.0, 'reward': 1.4598215818405151, 'reward_std': 0.09669382125139236, 'kl': 1.18359375, 'epoch': 0.3} 30%|███ | 1297/4286 [7:42:57<15:54:54, 19.17s/it] 30%|███ | 1298/4286 [7:43:18<16:24:19, 19.77s/it] {'loss': 0.0487, 'grad_norm': 6.559572172373561, 'learning_rate': 6.971535230984601e-07, 'completion_length': 194.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.3973214775323868, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3794644474983215, 'reward_std': 0.1622699797153473, 'kl': 1.216796875, 'epoch': 0.3} 30%|███ | 1298/4286 [7:43:18<16:24:19, 19.77s/it] 30%|███ | 1299/4286 [7:43:38<16:29:49, 19.88s/it] {'loss': 0.0402, 'grad_norm': 5.95717116454871, 'learning_rate': 6.969202053196453e-07, 'completion_length': 178.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.4940476566553116, 'rewards/format_reward': 1.0, 'reward': 1.4940477013587952, 'reward_std': 0.08014346286654472, 'kl': 1.005859375, 'epoch': 0.3} 30%|███ | 1299/4286 [7:43:38<16:29:49, 19.88s/it] 30%|███ | 1300/4286 [7:43:58<16:27:39, 19.85s/it] {'loss': 0.0332, 'grad_norm': 3.0076179073112024, 'learning_rate': 6.966868875408307e-07, 'completion_length': 194.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4464287161827087, 'reward_std': 0.09735432453453541, 'kl': 0.83203125, 'epoch': 0.3} 30%|███ | 1300/4286 [7:43:58<16:27:39, 19.85s/it] 30%|███ | 1301/4286 [7:47:33<65:00:29, 78.40s/it] {'loss': 0.0542, 'grad_norm': 4.634715230296539, 'learning_rate': 6.964535697620159e-07, 'completion_length': 177.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4890873581171036, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4533730745315552, 
'reward_std': 0.16044735349714756, 'kl': 1.353515625, 'epoch': 0.3} 30%|███ | 1301/4286 [7:47:33<65:00:29, 78.40s/it] 30%|███ | 1302/4286 [7:47:50<49:49:49, 60.12s/it] {'loss': 0.0254, 'grad_norm': 7.229059670257655, 'learning_rate': 6.962202519832011e-07, 'completion_length': 170.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.4702381193637848, 'rewards/format_reward': 1.0, 'reward': 1.470238208770752, 'reward_std': 0.07090840861201286, 'kl': 0.634765625, 'epoch': 0.3} 30%|███ | 1302/4286 [7:47:50<49:49:49, 60.12s/it] 30%|███ | 1303/4286 [7:48:08<39:11:36, 47.30s/it] {'loss': 0.0176, 'grad_norm': 4.315359910084631, 'learning_rate': 6.959869342043863e-07, 'completion_length': 184.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.5401785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5223215818405151, 'reward_std': 0.11680420860648155, 'kl': 0.439453125, 'epoch': 0.3} 30%|███ | 1303/4286 [7:48:08<39:11:36, 47.30s/it] 30%|███ | 1304/4286 [7:48:25<31:53:43, 38.51s/it] {'loss': 0.0237, 'grad_norm': 6.248665555668526, 'learning_rate': 6.957536164255716e-07, 'completion_length': 177.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5714286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715818405151, 'reward_std': 0.12021519988775253, 'kl': 0.5908203125, 'epoch': 0.3} 30%|███ | 1304/4286 [7:48:25<31:53:43, 38.51s/it] 30%|███ | 1305/4286 [7:48:43<26:35:00, 32.10s/it] {'loss': 0.0196, 'grad_norm': 3.669315797329116, 'learning_rate': 6.955202986467569e-07, 'completion_length': 160.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5669643431901932, 'rewards/format_reward': 1.0, 'reward': 1.5669644474983215, 'reward_std': 0.08630952704697847, 'kl': 0.48828125, 'epoch': 0.3} 30%|███ | 1305/4286 [7:48:43<26:35:00, 32.10s/it] 30%|███ | 1306/4286 [7:49:03<23:32:02, 28.43s/it] {'loss': 0.0243, 'grad_norm': 7.776330857552665, 'learning_rate': 6.952869808679421e-07, 'completion_length': 
196.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.4613095670938492, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4434524774551392, 'reward_std': 0.12046192213892937, 'kl': 0.6083984375, 'epoch': 0.3} 30%|███ | 1306/4286 [7:49:03<23:32:02, 28.43s/it] 30%|███ | 1307/4286 [7:49:21<20:59:11, 25.36s/it] {'loss': 0.052, 'grad_norm': 6.365096564380402, 'learning_rate': 6.950536630891274e-07, 'completion_length': 181.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.4895833432674408, 'rewards/format_reward': 1.0, 'reward': 1.4895833730697632, 'reward_std': 0.14615455269813538, 'kl': 1.298828125, 'epoch': 0.3} 30%|███ | 1307/4286 [7:49:21<20:59:11, 25.36s/it] 31%|███ | 1308/4286 [7:49:38<18:52:07, 22.81s/it] {'loss': 0.0777, 'grad_norm': 3.1886461596803044, 'learning_rate': 6.948203453103126e-07, 'completion_length': 179.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.4226190745830536, 'rewards/format_reward': 1.0, 'reward': 1.422619104385376, 'reward_std': 0.06823869794607162, 'kl': 1.9453125, 'epoch': 0.31} 31%|███ | 1308/4286 [7:49:38<18:52:07, 22.81s/it] 31%|███ | 1309/4286 [7:49:56<17:47:09, 21.51s/it] {'loss': 0.0399, 'grad_norm': 6.356856937457728, 'learning_rate': 6.945870275314979e-07, 'completion_length': 210.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.4866072088479996, 'rewards/format_reward': 1.0, 'reward': 1.4866072535514832, 'reward_std': 0.11463626474142075, 'kl': 0.994140625, 'epoch': 0.31} 31%|███ | 1309/4286 [7:49:56<17:47:09, 21.51s/it] 31%|███ | 1310/4286 [7:50:15<17:01:35, 20.60s/it] {'loss': 0.0336, 'grad_norm': 3.353810708133347, 'learning_rate': 6.943537097526832e-07, 'completion_length': 171.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.45153066515922546, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4336735606193542, 'reward_std': 0.052688054740428925, 'kl': 0.8427734375, 'epoch': 0.31} 31%|███ | 1310/4286 [7:50:15<17:01:35, 20.60s/it] 31%|███ | 1311/4286 
[7:50:34<16:43:11, 20.23s/it] {'loss': 0.0485, 'grad_norm': 3.0534754699902247, 'learning_rate': 6.941203919738684e-07, 'completion_length': 189.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5940476208925247, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5404763221740723, 'reward_std': 0.25105053931474686, 'kl': 1.2109375, 'epoch': 0.31} 31%|███ | 1311/4286 [7:50:34<16:43:11, 20.23s/it] 31%|███ | 1312/4286 [7:50:54<16:40:09, 20.18s/it] {'loss': 0.062, 'grad_norm': 28.207523401320305, 'learning_rate': 6.938870741950536e-07, 'completion_length': 181.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.45297619700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4351191520690918, 'reward_std': 0.1694667786359787, 'kl': 1.55078125, 'epoch': 0.31} 31%|███ | 1312/4286 [7:50:54<16:40:09, 20.18s/it] 31%|███ | 1313/4286 [7:51:15<16:53:40, 20.46s/it] {'loss': 0.0726, 'grad_norm': 489.20978174075583, 'learning_rate': 6.936537564162389e-07, 'completion_length': 193.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.433035746216774, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4151785969734192, 'reward_std': 0.1220238208770752, 'kl': 1.8125, 'epoch': 0.31} 31%|███ | 1313/4286 [7:51:15<16:53:40, 20.46s/it][2025-03-02 12:58:53,049] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 31%|███ | 1314/4286 [7:51:37<17:17:58, 20.95s/it] {'loss': 0.0738, 'grad_norm': 17.356394956227668, 'learning_rate': 6.934204386374242e-07, 'completion_length': 200.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4285714626312256, 'reward_std': 0.20989125221967697, 'kl': 1.83984375, 'epoch': 0.31} 31%|███ | 1314/4286 [7:51:37<17:17:58, 20.95s/it] 31%|███ | 1315/4286 [7:51:58<17:13:33, 20.87s/it] {'loss': 0.0464, 'grad_norm': 6.489052625115774, 'learning_rate': 6.931871208586094e-07, 'completion_length': 210.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.516369104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4985119700431824, 'reward_std': 0.12604888528585434, 'kl': 1.162109375, 'epoch': 0.31} 31%|███ | 1315/4286 [7:51:58<17:13:33, 20.87s/it] 31%|███ | 1316/4286 [7:52:21<17:41:31, 21.44s/it] {'loss': 0.0722, 'grad_norm': 7.885788559883188, 'learning_rate': 6.929538030797946e-07, 'completion_length': 199.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.4467262178659439, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4288691282272339, 'reward_std': 0.16591788083314896, 'kl': 1.8046875, 'epoch': 0.31} 31%|███ | 1316/4286 [7:52:21<17:41:31, 21.44s/it] 31%|███ | 1317/4286 [7:52:40<17:12:28, 20.86s/it] {'loss': 0.0224, 'grad_norm': 27.702861890692315, 'learning_rate': 6.9272048530098e-07, 'completion_length': 194.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6907738745212555, 'rewards/format_reward': 1.0, 'reward': 1.6907739043235779, 'reward_std': 0.041943530552089214, 'kl': 0.560546875, 'epoch': 0.31} 31%|███ | 1317/4286 [7:52:40<17:12:28, 20.86s/it] 31%|███ | 1318/4286 [7:53:00<16:53:42, 20.49s/it] {'loss': 0.0643, 'grad_norm': 
7.096001053669567, 'learning_rate': 6.924871675221652e-07, 'completion_length': 185.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5288691073656082, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4931548833847046, 'reward_std': 0.1681511029601097, 'kl': 1.609375, 'epoch': 0.31} 31%|███ | 1318/4286 [7:53:00<16:53:42, 20.49s/it][2025-03-02 13:00:36,819] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 31%|███ | 1319/4286 [7:53:21<17:03:26, 20.70s/it] {'loss': 0.0788, 'grad_norm': 7.197769496564129, 'learning_rate': 6.922538497433504e-07, 'completion_length': 204.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.4925595372915268, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4568453431129456, 'reward_std': 0.25422023981809616, 'kl': 1.96875, 'epoch': 0.31} 31%|███ | 1319/4286 [7:53:21<17:03:26, 20.70s/it] 31%|███ | 1320/4286 [7:53:42<17:10:00, 20.84s/it] {'loss': 0.1157, 'grad_norm': 16.55336182941999, 'learning_rate': 6.920205319645357e-07, 'completion_length': 197.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.46339286863803864, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4098215103149414, 'reward_std': 0.2337297573685646, 'kl': 2.8828125, 'epoch': 0.31} 31%|███ | 1320/4286 [7:53:42<17:10:00, 20.84s/it] 31%|███ | 1321/4286 [7:54:02<16:59:38, 20.63s/it] {'loss': 0.0866, 'grad_norm': 12.081242188594416, 'learning_rate': 6.91787214185721e-07, 'completion_length': 198.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.352678582072258, 'rewards/format_reward': 0.9642857313156128, 'reward': 
1.316964328289032, 'reward_std': 0.22506487369537354, 'kl': 2.16796875, 'epoch': 0.31} 31%|███ | 1321/4286 [7:54:02<16:59:38, 20.63s/it] 31%|███ | 1322/4286 [7:54:25<17:26:36, 21.19s/it] {'loss': 0.0829, 'grad_norm': 27.3927107379722, 'learning_rate': 6.915538964069062e-07, 'completion_length': 216.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.345238134264946, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2916667461395264, 'reward_std': 0.1604129932820797, 'kl': 2.068359375, 'epoch': 0.31} 31%|███ | 1322/4286 [7:54:25<17:26:36, 21.19s/it] 31%|███ | 1323/4286 [7:54:46<17:29:09, 21.25s/it] {'loss': 0.0811, 'grad_norm': 55.086098714347, 'learning_rate': 6.913205786280915e-07, 'completion_length': 206.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.3690476417541504, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2976191639900208, 'reward_std': 0.25049906969070435, 'kl': 2.03125, 'epoch': 0.31} 31%|███ | 1323/4286 [7:54:46<17:29:09, 21.25s/it] 31%|███ | 1324/4286 [7:55:05<16:58:00, 20.62s/it] {'loss': 0.0841, 'grad_norm': 4.788171596326085, 'learning_rate': 6.910872608492767e-07, 'completion_length': 169.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5180272310972214, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4823130369186401, 'reward_std': 0.23451413214206696, 'kl': 2.1015625, 'epoch': 0.31} 31%|███ | 1324/4286 [7:55:05<16:58:00, 20.62s/it][2025-03-02 13:02:43,344] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 31%|███ | 1325/4286 [7:55:27<17:20:43, 21.09s/it] {'loss': 0.1204, 'grad_norm': 10.842787626338776, 'learning_rate': 6.90853943070462e-07, 'completion_length': 191.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.4345238357782364, 'rewards/format_reward': 1.0, 'reward': 1.4345239400863647, 'reward_std': 0.11669664271175861, 'kl': 3.00390625, 'epoch': 0.31} 31%|███ | 1325/4286 [7:55:27<17:20:43, 21.09s/it] 31%|███ | 1326/4286 [7:55:46<16:47:54, 20.43s/it] {'loss': 0.0873, 'grad_norm': 12.470008855119087, 'learning_rate': 6.906206252916472e-07, 'completion_length': 190.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5312500447034836, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.513392984867096, 'reward_std': 0.1050875149667263, 'kl': 2.18359375, 'epoch': 0.31} 31%|███ | 1326/4286 [7:55:46<16:47:54, 20.43s/it] 31%|███ | 1327/4286 [7:56:06<16:32:01, 20.12s/it] {'loss': 0.0964, 'grad_norm': 1.9553753542509347, 'learning_rate': 6.903873075128325e-07, 'completion_length': 181.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.4196428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3839287161827087, 'reward_std': 0.1750931739807129, 'kl': 2.40625, 'epoch': 0.31} 31%|███ | 1327/4286 [7:56:06<16:32:01, 20.12s/it] 31%|███ | 1328/4286 [7:56:25<16:15:21, 19.78s/it] {'loss': 0.043, 'grad_norm': 2.9033539813330846, 'learning_rate': 6.901539897340177e-07, 'completion_length': 180.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5994048118591309, 'rewards/format_reward': 1.0, 'reward': 1.5994048714637756, 'reward_std': 0.1100255660712719, 'kl': 1.080078125, 'epoch': 0.31} 31%|███ | 1328/4286 [7:56:25<16:15:21, 19.78s/it][2025-03-02 13:04:01,459] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last 
step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 31%|███ | 1329/4286 [7:56:46<16:30:28, 20.10s/it] {'loss': 0.0708, 'grad_norm': 3.9585219928246276, 'learning_rate': 6.899206719552029e-07, 'completion_length': 211.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5386905372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5208334922790527, 'reward_std': 0.1797804832458496, 'kl': 1.76953125, 'epoch': 0.31} 31%|███ | 1329/4286 [7:56:46<16:30:28, 20.10s/it] 31%|███ | 1330/4286 [7:57:05<16:21:38, 19.93s/it] {'loss': 0.0409, 'grad_norm': 9.420557451925715, 'learning_rate': 6.896873541763883e-07, 'completion_length': 210.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5104167014360428, 'rewards/format_reward': 1.0, 'reward': 1.5104167461395264, 'reward_std': 0.11815984547138214, 'kl': 1.021484375, 'epoch': 0.31} 31%|███ | 1330/4286 [7:57:05<16:21:38, 19.93s/it] 31%|███ | 1331/4286 [7:57:25<16:19:39, 19.89s/it] {'loss': 0.0297, 'grad_norm': 4.384268904580277, 'learning_rate': 6.894540363975735e-07, 'completion_length': 187.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7648809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.04927331767976284, 'kl': 0.7451171875, 'epoch': 0.31} 31%|███ | 1331/4286 [7:57:25<16:19:39, 19.89s/it] 31%|███ | 1332/4286 [7:57:48<17:08:32, 20.89s/it] {'loss': 0.1418, 'grad_norm': 33.91538003923572, 'learning_rate': 6.892207186187587e-07, 'completion_length': 189.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.3601190596818924, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2886905670166016, 'reward_std': 0.30293427407741547, 
'kl': 3.5546875, 'epoch': 0.31} 31%|███ | 1332/4286 [7:57:48<17:08:32, 20.89s/it] 31%|███ | 1333/4286 [7:58:09<17:11:30, 20.96s/it] {'loss': 0.032, 'grad_norm': 15.966076159739996, 'learning_rate': 6.88987400839944e-07, 'completion_length': 188.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.49851194024086, 'rewards/format_reward': 1.0, 'reward': 1.4985119700431824, 'reward_std': 0.041987037286162376, 'kl': 0.80224609375, 'epoch': 0.31} 31%|███ | 1333/4286 [7:58:09<17:11:30, 20.96s/it] 31%|███ | 1334/4286 [7:58:28<16:39:14, 20.31s/it] {'loss': 0.0412, 'grad_norm': 2.428878709743604, 'learning_rate': 6.887540830611293e-07, 'completion_length': 185.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.48809531331062317, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.470238208770752, 'reward_std': 0.06044464558362961, 'kl': 1.03125, 'epoch': 0.31} 31%|███ | 1334/4286 [7:58:28<16:39:14, 20.31s/it] 31%|███ | 1335/4286 [7:58:49<16:50:18, 20.54s/it] {'loss': 0.086, 'grad_norm': 5.370952642224413, 'learning_rate': 6.885207652823145e-07, 'completion_length': 208.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.5029762387275696, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4494048357009888, 'reward_std': 0.15476191509515047, 'kl': 2.1513671875, 'epoch': 0.31} 31%|███ | 1335/4286 [7:58:49<16:50:18, 20.54s/it] 31%|███ | 1336/4286 [7:59:09<16:38:30, 20.31s/it] {'loss': 0.067, 'grad_norm': 4.723669086190915, 'learning_rate': 6.882874475034997e-07, 'completion_length': 174.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.13384097442030907, 'kl': 1.6806640625, 'epoch': 0.31} 31%|███ | 1336/4286 [7:59:09<16:38:30, 20.31s/it] 31%|███ | 1337/4286 [7:59:28<16:24:51, 20.04s/it] {'loss': 0.0954, 'grad_norm': 9.336343617310531, 'learning_rate': 6.88054129724685e-07, 'completion_length': 168.00000762939453, 
'rewards/only_full_func_accuracy_reward': 0.5297619104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.49404776096344, 'reward_std': 0.1887117624282837, 'kl': 2.390625, 'epoch': 0.31} 31%|███ | 1337/4286 [7:59:28<16:24:51, 20.04s/it] 31%|███ | 1338/4286 [7:59:48<16:12:48, 19.80s/it] {'loss': 0.0671, 'grad_norm': 1.5903136642690778, 'learning_rate': 6.878208119458703e-07, 'completion_length': 197.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5401786267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5223215818405151, 'reward_std': 0.1220238171517849, 'kl': 1.67578125, 'epoch': 0.31} 31%|███ | 1338/4286 [7:59:48<16:12:48, 19.80s/it] 31%|███ | 1339/4286 [8:00:05<15:41:29, 19.17s/it] {'loss': 0.011, 'grad_norm': 7.962951608199821, 'learning_rate': 6.875874941670555e-07, 'completion_length': 175.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.4910714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.05154913291335106, 'kl': 0.2744140625, 'epoch': 0.31} 31%|███ | 1339/4286 [8:00:05<15:41:29, 19.17s/it] 31%|███▏ | 1340/4286 [8:00:28<16:28:42, 20.14s/it] {'loss': 0.0608, 'grad_norm': 4.611293025427415, 'learning_rate': 6.873541763882408e-07, 'completion_length': 204.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.574404776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5565477013587952, 'reward_std': 0.10192880779504776, 'kl': 1.51953125, 'epoch': 0.31} 31%|███▏ | 1340/4286 [8:00:28<16:28:42, 20.14s/it] 31%|███▏ | 1341/4286 [8:00:48<16:27:35, 20.12s/it] {'loss': 0.0763, 'grad_norm': 8.676465309968261, 'learning_rate': 6.87120858609426e-07, 'completion_length': 182.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4836309999227524, 'rewards/format_reward': 1.0, 'reward': 1.4836310744285583, 'reward_std': 0.09965858235955238, 'kl': 1.91015625, 'epoch': 0.31} 31%|███▏ | 1341/4286 [8:00:48<16:27:35, 20.12s/it] 31%|███▏ | 1342/4286 
[8:01:10<16:53:09, 20.65s/it] {'loss': 0.0349, 'grad_norm': 5.236711368485504, 'learning_rate': 6.868875408306113e-07, 'completion_length': 190.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5666666924953461, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5488095879554749, 'reward_std': 0.07021738588809967, 'kl': 0.873046875, 'epoch': 0.31} 31%|███▏ | 1342/4286 [8:01:10<16:53:09, 20.65s/it] 31%|███▏ | 1343/4286 [8:01:32<17:18:33, 21.17s/it] {'loss': 0.019, 'grad_norm': 1.553774913428795, 'learning_rate': 6.866542230517966e-07, 'completion_length': 235.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5312500596046448, 'rewards/format_reward': 1.0, 'reward': 1.5312501788139343, 'reward_std': 0.026785715483129025, 'kl': 0.4765625, 'epoch': 0.31} 31%|███▏ | 1343/4286 [8:01:32<17:18:33, 21.17s/it] 31%|███▏ | 1344/4286 [8:01:50<16:28:29, 20.16s/it] {'loss': 0.02, 'grad_norm': 1.344748378586487, 'learning_rate': 6.864209052729818e-07, 'completion_length': 170.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6130952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.02380952425301075, 'kl': 0.5, 'epoch': 0.31} 31%|███▏ | 1344/4286 [8:01:50<16:28:29, 20.16s/it] 31%|███▏ | 1345/4286 [8:02:12<16:56:39, 20.74s/it] {'loss': 0.0219, 'grad_norm': 4.4699569964087855, 'learning_rate': 6.86187587494167e-07, 'completion_length': 212.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5184524357318878, 'rewards/format_reward': 1.0, 'reward': 1.5184524655342102, 'reward_std': 0.04791685752570629, 'kl': 0.546875, 'epoch': 0.31} 31%|███▏ | 1345/4286 [8:02:12<16:56:39, 20.74s/it] 31%|███▏ | 1346/4286 [8:02:33<16:58:53, 20.79s/it] {'loss': 0.0318, 'grad_norm': 2.435508229647097, 'learning_rate': 6.859542697153524e-07, 'completion_length': 195.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.3988095819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3809524774551392, 
'reward_std': 0.08520827814936638, 'kl': 0.798828125, 'epoch': 0.31}
 31%|███▏ | 1347/4286 [8:02:53<16:54:15, 20.71s/it] {'loss': 0.032, 'grad_norm': 17.13355145059977, 'learning_rate': 6.857209519365376e-07, 'completion_length': 202.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.3898809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.37202388048172, 'reward_std': 0.11745268851518631, 'kl': 0.80078125, 'epoch': 0.31}
 31%|███▏ | 1348/4286 [8:03:12<16:21:36, 20.05s/it] {'loss': 0.0225, 'grad_norm': 3.877693030564876, 'learning_rate': 6.854876341577228e-07, 'completion_length': 184.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.04920230060815811, 'kl': 0.5634765625, 'epoch': 0.31}
 31%|███▏ | 1349/4286 [8:03:34<16:49:51, 20.63s/it] {'loss': 0.0142, 'grad_norm': 2.0358309812769653, 'learning_rate': 6.85254316378908e-07, 'completion_length': 206.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.5213294327259064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5034722089767456, 'reward_std': 0.11037563346326351, 'kl': 0.35546875, 'epoch': 0.31}
 31%|███▏ | 1350/4286 [8:03:53<16:26:22, 20.16s/it] {'loss': 0.0078, 'grad_norm': 2.4122647004327855, 'learning_rate': 6.850209986000934e-07, 'completion_length': 182.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.6696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.669642984867096, 'reward_std': 0.05038155987858772, 'kl': 0.1943359375, 'epoch': 0.31}
 32%|███▏ | 1351/4286 [8:04:19<17:54:42, 21.97s/it] {'loss': 0.0077, 'grad_norm': 2.1814281645119973, 'learning_rate': 6.847876808212786e-07, 'completion_length': 242.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.3794642984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3616072535514832, 'reward_std': 0.06483910605311394, 'kl': 0.19287109375, 'epoch': 0.32}
 32%|███▏ | 1352/4286 [8:04:39<17:25:25, 21.38s/it] {'loss': 0.0141, 'grad_norm': 3.485829290395152, 'learning_rate': 6.845543630424638e-07, 'completion_length': 206.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5839285850524902, 'rewards/format_reward': 1.0, 'reward': 1.5839287042617798, 'reward_std': 0.06368440575897694, 'kl': 0.35302734375, 'epoch': 0.32}
 32%|███▏ | 1353/4286 [8:04:59<16:56:52, 20.80s/it] {'loss': 0.0079, 'grad_norm': 0.7005501186703244, 'learning_rate': 6.843210452636491e-07, 'completion_length': 191.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6741071939468384, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.019238397479057312, 'kl': 0.197265625, 'epoch': 0.32}
 32%|███▏ | 1354/4286 [8:05:19<16:58:56, 20.85s/it] {'loss': 0.016, 'grad_norm': 5.367031116240167, 'learning_rate': 6.840877274848343e-07, 'completion_length': 190.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.6190476566553116, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.09824734926223755, 'kl': 0.4013671875, 'epoch': 0.32}
 32%|███▏ | 1355/4286 [8:05:41<17:05:45, 21.00s/it] {'loss': 0.0373, 'grad_norm': 8.538534842183655, 'learning_rate': 6.838544097060196e-07, 'completion_length': 223.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375001192092896, 'reward_std': 0.09959554672241211, 'kl': 0.931640625, 'epoch': 0.32}
 32%|███▏ | 1356/4286 [8:06:05<17:45:49, 21.83s/it] {'loss': 0.0278, 'grad_norm': 6.771006232963991, 'learning_rate': 6.836210919272049e-07, 'completion_length': 212.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.461309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4434524774551392, 'reward_std': 0.11497018858790398, 'kl': 0.693359375, 'epoch': 0.32}
 32%|███▏ | 1357/4286 [8:06:26<17:41:26, 21.74s/it] {'loss': 0.0273, 'grad_norm': 2.021127181010685, 'learning_rate': 6.833877741483901e-07, 'completion_length': 208.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.4583333730697632, 'rewards/format_reward': 1.0, 'reward': 1.4583334922790527, 'reward_std': 0.05357143096625805, 'kl': 0.6796875, 'epoch': 0.32}
 32%|███▏ | 1358/4286 [8:06:47<17:21:24, 21.34s/it] {'loss': 0.0125, 'grad_norm': 4.356199290774456, 'learning_rate': 6.831544563695753e-07, 'completion_length': 197.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.06547619216144085, 'kl': 0.3115234375, 'epoch': 0.32}
 32%|███▏ | 1359/4286 [8:07:05<16:39:39, 20.49s/it] {'loss': 0.0265, 'grad_norm': 16.57068844402876, 'learning_rate': 6.829211385907606e-07, 'completion_length': 191.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.5747024416923523, 'rewards/format_reward': 1.0, 'reward': 1.5747024416923523, 'reward_std': 0.11050061136484146, 'kl': 0.662109375, 'epoch': 0.32}
 32%|███▏ | 1360/4286 [8:07:24<16:19:21, 20.08s/it] {'loss': 0.0407, 'grad_norm': 3.891902960878873, 'learning_rate': 6.826878208119459e-07, 'completion_length': 190.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5848215222358704, 'reward_std': 0.12687214463949203, 'kl': 1.01953125, 'epoch': 0.32}
 32%|███▏ | 1361/4286 [8:07:46<16:39:46, 20.51s/it] {'loss': 0.0667, 'grad_norm': 8.570581412244138, 'learning_rate': 6.824545030331311e-07, 'completion_length': 192.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5312500447034836, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.513392984867096, 'reward_std': 0.1253131963312626, 'kl': 1.66796875, 'epoch': 0.32}
 32%|███▏ | 1362/4286 [8:08:06<16:29:31, 20.31s/it] {'loss': 0.0386, 'grad_norm': 3.7616896591788627, 'learning_rate': 6.822211852543163e-07, 'completion_length': 193.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5580357313156128, 'rewards/format_reward': 1.0, 'reward': 1.5580357313156128, 'reward_std': 0.08400873467326164, 'kl': 0.96484375, 'epoch': 0.32}
 32%|███▏ | 1363/4286 [8:08:27<16:44:15, 20.61s/it] {'loss': 0.0346, 'grad_norm': 5.952964704478578, 'learning_rate': 6.819878674755017e-07, 'completion_length': 197.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.4955357760190964, 'rewards/format_reward': 1.0, 'reward': 1.4955358505249023, 'reward_std': 0.0922619104385376, 'kl': 0.86328125, 'epoch': 0.32}
 32%|███▏ | 1364/4286 [8:08:46<16:18:04, 20.08s/it] {'loss': 0.0777, 'grad_norm': 6.40773721198766, 'learning_rate': 6.817545496966869e-07, 'completion_length': 166.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.4985119551420212, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4449405670166016, 'reward_std': 0.3072878420352936, 'kl': 1.9453125, 'epoch': 0.32}
 32%|███▏ | 1365/4286 [8:09:06<16:25:48, 20.25s/it] {'loss': 0.0483, 'grad_norm': 10.687787704161003, 'learning_rate': 6.815212319178721e-07, 'completion_length': 188.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.517857313156128, 'reward_std': 0.09092213213443756, 'kl': 1.205078125, 'epoch': 0.32}
 32%|███▏ | 1366/4286 [8:09:26<16:22:50, 20.20s/it] {'loss': 0.0751, 'grad_norm': 6.326680885481496, 'learning_rate': 6.812879141390574e-07, 'completion_length': 182.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.4189200848340988, 'rewards/format_reward': 1.0, 'reward': 1.4189201593399048, 'reward_std': 0.11528364941477776, 'kl': 1.87890625, 'epoch': 0.32}
 32%|███▏ | 1367/4286 [8:09:45<16:05:36, 19.85s/it] {'loss': 0.0656, 'grad_norm': 9.950691754170979, 'learning_rate': 6.810545963602427e-07, 'completion_length': 175.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.502604216337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4847471714019775, 'reward_std': 0.11924830172210932, 'kl': 1.63671875, 'epoch': 0.32}
 32%|███▏ | 1368/4286 [8:10:05<15:59:06, 19.72s/it] {'loss': 0.0564, 'grad_norm': 3.334820713066652, 'learning_rate': 6.808212785814279e-07, 'completion_length': 206.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.5535714775323868, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.0688802283257246, 'kl': 1.408203125, 'epoch': 0.32}
 32%|███▏ | 1369/4286 [8:10:27<16:34:37, 20.46s/it] {'loss': 0.0353, 'grad_norm': 9.163248149993379, 'learning_rate': 6.805879608026132e-07, 'completion_length': 204.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.3455357328057289, 'rewards/format_reward': 1.0, 'reward': 1.345535695552826, 'reward_std': 0.03511904925107956, 'kl': 0.87939453125, 'epoch': 0.32}
 32%|███▏ | 1370/4286 [8:10:46<16:14:46, 20.06s/it] {'loss': 0.0683, 'grad_norm': 4.061976647417745, 'learning_rate': 6.803546430237984e-07, 'completion_length': 196.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.3660714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3482144474983215, 'reward_std': 0.09707976877689362, 'kl': 1.703125, 'epoch': 0.32}
 32%|███▏ | 1371/4286 [8:11:10<17:14:44, 21.30s/it] {'loss': 0.035, 'grad_norm': 2.6270151693965857, 'learning_rate': 6.801213252449837e-07, 'completion_length': 235.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.3139881119132042, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2604168057441711, 'reward_std': 0.2175300493836403, 'kl': 0.873046875, 'epoch': 0.32}
 32%|███▏ | 1372/4286 [8:11:31<17:02:13, 21.05s/it] {'loss': 0.0629, 'grad_norm': 2.326610936591947, 'learning_rate': 6.798880074661689e-07, 'completion_length': 196.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.3809524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3630953431129456, 'reward_std': 0.1190476305782795, 'kl': 1.57421875, 'epoch': 0.32}
 32%|███▏ | 1373/4286 [8:11:51<16:53:08, 20.87s/it] {'loss': 0.0325, 'grad_norm': 2.1191668459945157, 'learning_rate': 6.796546896873542e-07, 'completion_length': 187.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5148809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.49702388048172, 'reward_std': 0.09959553927183151, 'kl': 0.8125, 'epoch': 0.32}
 32%|███▏ | 1374/4286 [8:12:11<16:34:46, 20.50s/it] {'loss': 0.0301, 'grad_norm': 1.4416697943628083, 'learning_rate': 6.794213719085394e-07, 'completion_length': 201.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.48363097012043, 'rewards/format_reward': 1.0, 'reward': 1.4836310744285583, 'reward_std': 0.0474053667858243, 'kl': 0.74951171875, 'epoch': 0.32}
 32%|███▏ | 1375/4286 [8:12:32<16:42:35, 20.66s/it] {'loss': 0.035, 'grad_norm': 1.633695746809303, 'learning_rate': 6.791880541297246e-07, 'completion_length': 188.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.3958333730697632, 'rewards/format_reward': 1.0, 'reward': 1.3958334922790527, 'reward_std': 0.08928571827709675, 'kl': 0.8740234375, 'epoch': 0.32}
 32%|███▏ | 1376/4286 [8:12:51<16:19:51, 20.20s/it] {'loss': 0.0343, 'grad_norm': 6.054786724646775, 'learning_rate': 6.7895473635091e-07, 'completion_length': 179.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.4791666865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4613096117973328, 'reward_std': 0.11745268851518631, 'kl': 0.859375, 'epoch': 0.32}
 32%|███▏ | 1377/4286 [8:13:11<16:12:00, 20.05s/it] {'loss': 0.0211, 'grad_norm': 7.153878094677164, 'learning_rate': 6.787214185720952e-07, 'completion_length': 212.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6324406266212463, 'reward_std': 0.10638262331485748, 'kl': 0.5263671875, 'epoch': 0.32}
 32%|███▏ | 1378/4286 [8:13:31<16:17:14, 20.16s/it] {'loss': 0.0249, 'grad_norm': 0.8632655457510142, 'learning_rate': 6.784881007932804e-07, 'completion_length': 183.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.6190476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6011905670166016, 'reward_std': 0.053571430034935474, 'kl': 0.623046875, 'epoch': 0.32}
 32%|███▏ | 1379/4286 [8:13:51<16:16:34, 20.16s/it] {'loss': 0.0071, 'grad_norm': 1.3033656369054365, 'learning_rate': 6.782547830144657e-07, 'completion_length': 203.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4002976417541504, 'rewards/format_reward': 1.0, 'reward': 1.4002977013587952, 'reward_std': 0.07422413676977158, 'kl': 0.17822265625, 'epoch': 0.32}
 32%|███▏ | 1380/4286 [8:14:10<15:54:16, 19.70s/it] {'loss': 0.0451, 'grad_norm': 2.422715006694378, 'learning_rate': 6.78021465235651e-07, 'completion_length': 160.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.380952388048172, 'rewards/format_reward': 1.0, 'reward': 1.3809524774551392, 'reward_std': 0.051976495422422886, 'kl': 1.12890625, 'epoch': 0.32}
 32%|███▏ | 1381/4286 [8:14:31<16:19:37, 20.23s/it] {'loss': 0.0307, 'grad_norm': 1.8545066970650845, 'learning_rate': 6.777881474568362e-07, 'completion_length': 202.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.450000062584877, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4321429133415222, 'reward_std': 0.10990537330508232, 'kl': 0.767578125, 'epoch': 0.32}
 32%|███▏ | 1382/4286 [8:14:52<16:23:03, 20.31s/it] {'loss': 0.0237, 'grad_norm': 6.942007990755401, 'learning_rate': 6.775548296780214e-07, 'completion_length': 224.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 1.0, 'reward': 1.6116072535514832, 'reward_std': 0.14976342022418976, 'kl': 0.58837890625, 'epoch': 0.32}
 32%|███▏ | 1383/4286 [8:15:11<16:11:31, 20.08s/it] {'loss': 0.0172, 'grad_norm': 1.7204837916408344, 'learning_rate': 6.773215118992067e-07, 'completion_length': 203.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.5580357313156128, 'rewards/format_reward': 1.0, 'reward': 1.5580358505249023, 'reward_std': 0.044642859138548374, 'kl': 0.4296875, 'epoch': 0.32}
 32%|███▏ | 1384/4286 [8:15:31<16:10:02, 20.06s/it] {'loss': 0.0275, 'grad_norm': 6.238133380889363, 'learning_rate': 6.77088194120392e-07, 'completion_length': 210.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5491072237491608, 'rewards/format_reward': 1.0, 'reward': 1.549107313156128, 'reward_std': 0.10926433280110359, 'kl': 0.68505859375, 'epoch': 0.32}
 32%|███▏ | 1385/4286 [8:15:51<16:08:32, 20.03s/it] {'loss': 0.0669, 'grad_norm': 2.045822276053139, 'learning_rate': 6.768548763415772e-07, 'completion_length': 205.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4553571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4375001192092896, 'reward_std': 0.14242978766560555, 'kl': 1.671875, 'epoch': 0.32}
 32%|███▏ | 1386/4286 [8:16:10<15:44:36, 19.54s/it] {'loss': 0.0086, 'grad_norm': 2.5898882005147823, 'learning_rate': 6.766215585627625e-07, 'completion_length': 177.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.5848214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5848215818405151, 'reward_std': 0.008928571827709675, 'kl': 0.21435546875, 'epoch': 0.32}
 32%|███▏ | 1387/4286 [8:16:30<15:52:47, 19.72s/it] {'loss': 0.0093, 'grad_norm': 2.3419302545536422, 'learning_rate': 6.763882407839477e-07, 'completion_length': 210.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5654762387275696, 'rewards/format_reward': 1.0, 'reward': 1.5654762983322144, 'reward_std': 0.07142857648432255, 'kl': 0.23193359375, 'epoch': 0.32}
 32%|███▏ | 1388/4286 [8:16:52<16:24:07, 20.38s/it] {'loss': 0.0261, 'grad_norm': 62.734266303737726, 'learning_rate': 6.76154923005133e-07, 'completion_length': 212.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5505952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.532738208770752, 'reward_std': 0.11745268851518631, 'kl': 0.6513671875, 'epoch': 0.32}
 32%|███▏ | 1389/4286 [8:17:13<16:29:26, 20.49s/it] {'loss': 0.0072, 'grad_norm': 3.3997744412994875, 'learning_rate': 6.759216052263183e-07, 'completion_length': 210.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5199404954910278, 'rewards/format_reward': 1.0, 'reward': 1.5199406147003174, 'reward_std': 0.02531914785504341, 'kl': 0.17919921875, 'epoch': 0.32}
 32%|███▏ | 1390/4286 [8:17:32<16:13:35, 20.17s/it] {'loss': 0.0288, 'grad_norm': 3.013749150648239, 'learning_rate': 6.756882874475035e-07, 'completion_length': 192.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7541667520999908, 'rewards/format_reward': 1.0, 'reward': 1.7541667819023132, 'reward_std': 0.046954947523772717, 'kl': 0.7177734375, 'epoch': 0.32}
 32%|███▏ | 1391/4286 [8:17:53<16:28:43, 20.49s/it] {'loss': 0.0162, 'grad_norm': 2.5439803372746765, 'learning_rate': 6.754549696686887e-07, 'completion_length': 221.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4714285731315613, 'rewards/format_reward': 1.0, 'reward': 1.471428632736206, 'reward_std': 0.06643369514495134, 'kl': 0.404296875, 'epoch': 0.32}
 32%|███▏ | 1392/4286 [8:18:12<16:04:54, 20.01s/it] {'loss': 0.0083, 'grad_norm': 2.6943154111303755, 'learning_rate': 6.752216518898741e-07, 'completion_length': 206.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5193453133106232, 'rewards/format_reward': 1.0, 'reward': 1.5193453431129456, 'reward_std': 0.044642859138548374, 'kl': 0.20703125, 'epoch': 0.32}
 33%|███▎ | 1393/4286 [8:18:33<16:21:02, 20.35s/it] {'loss': 0.0232, 'grad_norm': 2.0929595655617113, 'learning_rate': 6.749883341110593e-07, 'completion_length': 205.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5252976566553116, 'rewards/format_reward': 1.0, 'reward': 1.5252977013587952, 'reward_std': 0.04740536957979202, 'kl': 0.580078125, 'epoch': 0.33}
 33%|███▎ | 1394/4286 [8:18:54<16:23:09, 20.40s/it] {'loss': 0.0486, 'grad_norm': 2.265586909148342, 'learning_rate': 6.747550163322445e-07, 'completion_length': 209.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.4672619551420212, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4315477013587952, 'reward_std': 0.11456519365310669, 'kl': 1.21875, 'epoch': 0.33}
 33%|███▎ | 1395/4286 [8:19:14<16:26:09, 20.47s/it] {'loss': 0.0074, 'grad_norm': 1.1821845557806336, 'learning_rate': 6.745216985534297e-07, 'completion_length': 207.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.025651199743151665, 'kl': 0.1865234375, 'epoch': 0.33}
 33%|███▎ | 1396/4286 [8:19:38<17:10:45, 21.40s/it] {'loss': 0.0147, 'grad_norm': 1.6240646586399308, 'learning_rate': 6.742883807746151e-07, 'completion_length': 235.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.541666716337204, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4880953431129456, 'reward_std': 0.1438441090285778, 'kl': 0.3662109375, 'epoch': 0.33}
 33%|███▎ | 1397/4286 [8:20:01<17:31:50, 21.84s/it] {'loss': 0.0177, 'grad_norm': 7.166557466258794, 'learning_rate': 6.740550629958003e-07, 'completion_length': 249.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.4508928954601288, 'rewards/format_reward': 1.0, 'reward': 1.450892984867096, 'reward_std': 0.17584338039159775, 'kl': 0.4423828125, 'epoch': 0.33}
 33%|███▎ | 1398/4286 [8:20:23<17:39:25, 22.01s/it] {'loss': 0.0155, 'grad_norm': 4.486397590536918, 'learning_rate': 6.738217452169855e-07, 'completion_length': 209.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5681548118591309, 'rewards/format_reward': 1.0, 'reward': 1.5681548714637756, 'reward_std': 0.03447408974170685, 'kl': 0.38623046875, 'epoch': 0.33}
 33%|███▎ | 1399/4286 [8:20:45<17:35:16, 21.93s/it] {'loss': 0.0085, 'grad_norm': 4.3476868443404975, 'learning_rate': 6.735884274381708e-07, 'completion_length': 235.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.3590136170387268, 'rewards/format_reward': 1.0, 'reward': 1.3590136766433716, 'reward_std': 0.05762399733066559, 'kl': 0.21337890625, 'epoch': 0.33}
 33%|███▎ | 1400/4286 [8:21:07<17:28:10, 21.79s/it] {'loss': 0.0291, 'grad_norm': 5.291445440904353, 'learning_rate': 6.73355109659356e-07, 'completion_length': 239.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.5114796012639999, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.475765347480774, 'reward_std': 0.13080589845776558, 'kl': 0.72802734375, 'epoch': 0.33}
 33%|███▎ | 1401/4286 [8:24:40<63:36:22, 79.37s/it] {'loss': 0.009, 'grad_norm': 1.0196285270388172, 'learning_rate': 6.731217918805413e-07, 'completion_length': 206.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6413691341876984, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.04053215403109789, 'kl': 0.22607421875, 'epoch': 0.33}
 33%|███▎ | 1402/4286 [8:25:01<49:27:48, 61.74s/it] {'loss': 0.0201, 'grad_norm': 1.4367315332415076, 'learning_rate': 6.728884741017266e-07, 'completion_length': 208.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.5654762089252472, 'rewards/format_reward': 1.0, 'reward': 1.5654763579368591, 'reward_std': 0.10327015072107315, 'kl': 0.5029296875, 'epoch': 0.33}
 33%|███▎ | 1403/4286 [8:25:24<40:12:28, 50.21s/it] {'loss': 0.0072, 'grad_norm': 2.531867629964371, 'learning_rate': 6.726551563229118e-07, 'completion_length': 242.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.46646827459335327, 'rewards/format_reward': 1.0, 'reward': 1.466468334197998, 'reward_std': 0.07794829085469246, 'kl': 0.18017578125, 'epoch': 0.33}
 33%|███▎ | 1404/4286 [8:25:49<34:08:12, 42.64s/it] {'loss': 0.0132, 'grad_norm': 5.034045475289284, 'learning_rate': 6.72421838544097e-07, 'completion_length': 212.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6242559850215912, 'rewards/format_reward': 1.0, 'reward': 1.6242560744285583, 'reward_std': 0.15225330740213394, 'kl': 0.3291015625, 'epoch': 0.33}
 33%|███▎ | 1405/4286 [8:26:14<29:45:13, 37.18s/it] {'loss': 0.0161, 'grad_norm': 8.408816172821526, 'learning_rate': 6.721885207652823e-07, 'completion_length': 221.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.4430059492588043, 'rewards/format_reward': 1.0, 'reward': 1.4430060982704163, 'reward_std': 0.10484174638986588, 'kl': 0.4033203125, 'epoch': 0.33}
 33%|███▎ | 1406/4286 [8:26:35<25:59:11, 32.48s/it] {'loss': 0.0331, 'grad_norm': 1.5383503981468807, 'learning_rate': 6.719552029864676e-07, 'completion_length': 195.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7857143878936768, 'reward_std': 0.11363367736339569, 'kl': 0.82763671875, 'epoch': 0.33}
 33%|███▎ | 1407/4286 [8:26:59<23:47:41, 29.75s/it] {'loss': 0.0299, 'grad_norm': 9.605689923071479, 'learning_rate': 6.717218852076528e-07, 'completion_length': 216.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.45554032921791077, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4376832246780396, 'reward_std': 0.1803978681564331, 'kl': 0.7490234375, 'epoch': 0.33}
 33%|███▎ | 1408/4286 [8:27:23<22:32:37, 28.20s/it] {'loss': 0.0472, 'grad_norm': 5.335249577313406, 'learning_rate': 6.71488567428838e-07, 'completion_length': 203.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5386904925107956, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5029762983322144, 'reward_std': 0.16967161744832993, 'kl': 1.1796875, 'epoch': 0.33}
 33%|███▎ | 1409/4286 [8:27:44<20:41:31, 25.89s/it] {'loss': 0.0658, 'grad_norm': 5.319643897199726, 'learning_rate': 6.712552496500234e-07, 'completion_length': 198.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.36422260105609894, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3463655710220337, 'reward_std': 0.09080118127167225, 'kl': 1.640625, 'epoch': 0.33}
 33%|███▎ | 1410/4286 [8:28:04<19:26:37, 24.34s/it] {'loss': 0.0935, 'grad_norm': 6.004245970213289, 'learning_rate': 6.710219318712086e-07, 'completion_length': 200.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.4312642216682434, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3419784903526306, 'reward_std': 0.2482691928744316, 'kl': 2.3359375, 'epoch': 0.33}
 33%|███▎ | 1411/4286 [8:28:25<18:31:38, 23.20s/it] {'loss': 0.1012, 'grad_norm': 6.237216652732718, 'learning_rate': 6.707886140923938e-07, 'completion_length': 195.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.5062500536441803, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4705357551574707, 'reward_std': 0.2265281155705452, 'kl': 2.52734375, 'epoch': 0.33}
 33%|███▎ | 1412/4286 [8:28:46<18:07:32, 22.70s/it] {'loss': 0.0478, 'grad_norm': 8.733987363863301, 'learning_rate': 6.705552963135791e-07, 'completion_length': 194.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.4002976417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.364583432674408, 'reward_std': 0.12737488001585007, 'kl': 1.19140625, 'epoch': 0.33}
 33%|███▎ | 1413/4286 [8:29:06<17:24:58, 21.82s/it] {'loss': 0.074, 'grad_norm': 8.72709176153423, 'learning_rate': 6.703219785347644e-07, 'completion_length': 190.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.49486610293388367, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4770089983940125, 'reward_std': 0.17527684196829796, 'kl': 1.84765625, 'epoch': 0.33}
 33%|███▎ | 1414/4286 [8:29:26<16:56:26, 21.24s/it] {'loss': 0.0831, 'grad_norm': 7.57980623343871, 'learning_rate': 6.700886607559496e-07, 'completion_length': 198.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.3726615905761719, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3369473814964294, 'reward_std': 0.20132236182689667, 'kl': 2.078125, 'epoch': 0.33}
 33%|███▎ | 1415/4286 [8:29:45<16:25:22, 20.59s/it] {'loss': 0.0688, 'grad_norm': 15.382871375552241, 'learning_rate': 6.698553429771349e-07, 'completion_length': 197.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.4062500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.388392984867096, 'reward_std': 0.14195369184017181, 'kl': 1.72265625, 'epoch': 0.33}
 33%|███▎ | 1416/4286 [8:30:06<16:25:03, 20.59s/it] {'loss': 0.0359, 'grad_norm': 18.490168568000048, 'learning_rate': 6.696220251983201e-07, 'completion_length': 192.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5469967573881149, 'rewards/format_reward': 1.0, 'reward': 1.5469967722892761, 'reward_std': 0.18165891617536545, 'kl': 0.896484375, 'epoch': 0.33}
 33%|███▎ | 1417/4286 [8:30:26<16:26:51, 20.64s/it] {'loss': 0.0512, 'grad_norm': 5.708243867577471, 'learning_rate': 6.693887074195054e-07, 'completion_length': 206.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5251831859350204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4894689321517944, 'reward_std': 0.1846291609108448, 'kl': 1.28125, 'epoch': 0.33}
 33%|███▎ | 1418/4286 [8:30:47<16:26:09, 20.63s/it] {'loss': 0.0367, 'grad_norm': 4.257034843050742, 'learning_rate': 6.691553896406906e-07, 'completion_length': 212.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.4226190745830536, 'rewards/format_reward': 1.0, 'reward': 1.4226191639900208, 'reward_std': 0.039397635497152805, 'kl': 0.91650390625, 'epoch': 0.33}
 33%|███▎ | 1419/4286 [8:31:06<16:02:33, 20.14s/it] {'loss': 0.0225, 'grad_norm': 1.8809039689785543, 'learning_rate': 6.689220718618759e-07, 'completion_length': 187.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6190477013587952, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.04711223114281893, 'kl': 0.5615234375, 'epoch': 0.33}
 33%|███▎ | 1420/4286 [8:31:26<15:58:39, 20.07s/it] {'loss': 0.0075, 'grad_norm': 4.832179194027004, 'learning_rate': 6.686887540830611e-07, 'completion_length': 203.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6160715073347092, 'rewards/format_reward': 1.0, 'reward': 1.6160715818405151, 'reward_std': 0.05007292330265045, 'kl': 0.18603515625, 'epoch': 0.33}
 33%|███▎ | 1421/4286 [8:31:46<15:50:40, 19.91s/it] {'loss': 0.0369, 'grad_norm': 3.319034918377913, 'learning_rate': 6.684554363042464e-07, 'completion_length': 162.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.4485119581222534, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.412797749042511, 'reward_std': 0.12434929981827736, 'kl': 0.921875, 'epoch': 0.33}
 33%|███▎ | 1422/4286 [8:32:05<15:45:28, 19.81s/it] {'loss': 0.0113, 'grad_norm': 4.262994542709969, 'learning_rate': 6.682221185254317e-07, 'completion_length': 182.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6741071939468384, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.1696428582072258, 'kl': 0.2822265625, 'epoch': 0.33}
 33%|███▎ | 1423/4286 [8:32:24<15:33:13, 19.56s/it] {'loss': 0.0233, 'grad_norm': 2.444847292532653, 'learning_rate': 6.679888007466169e-07, 'completion_length': 188.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.5758928954601288, 'rewards/format_reward': 1.0, 'reward': 1.575892984867096, 'reward_std': 0.06826403178274632, 'kl': 0.58203125, 'epoch': 0.33}
 33%|███▎ | 1424/4286 [8:32:47<16:17:15, 20.49s/it] {'loss': 0.0229, 'grad_norm': 2.206171186525238, 'learning_rate': 6.677554829678021e-07, 'completion_length': 190.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5970238149166107, 'rewards/format_reward': 1.0, 'reward': 1.5970239043235779, 'reward_std': 0.059129735454916954, 'kl': 0.57421875, 'epoch': 0.33}
 33%|███▎ | 1425/4286 [8:33:08<16:25:56, 20.68s/it] {'loss': 0.0429, 'grad_norm': 5.976412397547039, 'learning_rate': 6.675221651889875e-07, 'completion_length': 188.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5205357521772385, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5026787519454956, 'reward_std': 0.05701344460248947, 'kl': 1.07421875, 'epoch': 0.33}
 33%|███▎ | 1426/4286 [8:33:29<16:28:18, 20.73s/it] {'loss': 0.0231, 'grad_norm': 4.944323183982708, 'learning_rate': 6.672888474101727e-07, 'completion_length': 187.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.3814484477043152, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3635913729667664, 'reward_std': 0.10964808985590935, 'kl': 0.578125, 'epoch': 0.33}
 33%|███▎ | 1427/4286 [8:33:48<16:05:00, 20.25s/it] {'loss': 0.0356, 'grad_norm': 2.408676827648423, 'learning_rate': 6.670555296313579e-07, 'completion_length': 183.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.5163690745830536, 'rewards/format_reward': 1.0, 'reward': 1.5163691639900208, 'reward_std': 0.026785715483129025, 'kl': 0.888671875, 'epoch': 0.33}
 33%|███▎ | 1428/4286 [8:34:07<15:49:02, 19.92s/it] {'loss': 0.0137, 'grad_norm': 3.242695383057115, 'learning_rate': 6.668222118525431e-07, 'completion_length': 176.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.65327388048172, 'reward_std': 0.08622623234987259, 'kl': 0.341796875, 'epoch': 0.33}
 33%|███▎ | 1429/4286 [8:34:27<15:54:31, 20.05s/it] {'loss': 0.0547, 'grad_norm': 6.025841878932626, 'learning_rate': 6.665888940737284e-07, 'completion_length': 183.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.49613095819950104, 'rewards/format_reward': 1.0, 'reward': 1.4961310029029846, 'reward_std': 0.09548514150083065, 'kl': 1.3671875, 'epoch': 0.33}
 33%|███▎ | 1430/4286 [8:34:49<16:18:13, 20.55s/it] {'loss': 0.07, 'grad_norm': 3.7165305069746672, 'learning_rate': 6.663555762949137e-07, 'completion_length': 190.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5294643491506577, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.475892961025238, 'reward_std': 0.2999497950077057, 'kl': 1.74609375, 'epoch': 0.33}
 33%|███▎ | 1431/4286 [8:35:08<15:55:21, 20.08s/it] {'loss': 0.0567, 'grad_norm': 3.3374149187764077, 'learning_rate': 6.661222585160989e-07, 'completion_length': 163.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.5595238208770752, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5059524774551392, 'reward_std': 0.21342912688851357, 'kl': 1.416015625, 'epoch': 0.33}
 33%|███▎ | 1432/4286 [8:35:27<15:37:15, 19.70s/it] {'loss': 0.0369, 'grad_norm': 7.378322311726763, 'learning_rate': 6.658889407372842e-07, 'completion_length': 183.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6250000894069672, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892858505249023, 'reward_std': 0.22642239183187485, 'kl': 0.923828125, 'epoch': 0.33}
 33%|███▎ | 1433/4286 [8:35:48<16:03:26, 20.26s/it] {'loss': 0.0682, 'grad_norm': 8.88610833028029, 'learning_rate': 6.656556229584694e-07, 'completion_length': 202.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6343254745006561, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5628969073295593, 'reward_std': 0.29785729199647903, 'kl': 1.703125, 'epoch': 0.33}
 33%|███▎ | 1434/4286 [8:36:09<16:11:06, 20.43s/it] {'loss': 0.0493, 'grad_norm': 3.953897974583842, 'learning_rate': 6.654223051796547e-07, 'completion_length': 205.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.4360119253396988, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4002976417541504, 'reward_std': 0.19364815950393677, 'kl': 1.232421875, 'epoch': 0.33}
 33%|███▎ | 1435/4286 [8:36:30<16:09:48, 20.41s/it] {'loss': 0.0399, 'grad_norm': 5.337863687986755, 'learning_rate': 6.6518898740084e-07, 'completion_length': 212.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.4258928745985031, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4080357551574707, 'reward_std': 0.12371939048171043, 'kl': 0.99609375, 'epoch': 0.33}
 34%|███▎ | 1436/4286 [8:36:49<15:48:19, 19.96s/it] {'loss': 0.0715, 'grad_norm': 5.50483576092783, 'learning_rate': 6.649556696220252e-07, 'completion_length': 173.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.446428656578064, 'reward_std': 0.24730359762907028, 'kl': 1.787109375, 'epoch': 0.34}
 34%|███▎ | 1437/4286 [8:37:12<16:36:13, 20.98s/it] {'loss': 0.0397, 'grad_norm': 6.4888594727209625, 'learning_rate': 6.647223518432104e-07, 'completion_length': 189.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6386905014514923, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6208334565162659, 'reward_std': 0.1530204713344574, 'kl': 0.994140625, 'epoch': 0.34}
 34%|███▎ | 1438/4286 [8:37:30<16:01:34, 20.26s/it] {'loss': 0.0486, 'grad_norm': 5.701696944875766, 'learning_rate': 6.644890340643958e-07, 'completion_length': 163.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.48363102972507477, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4300596117973328, 'reward_std': 0.2532974183559418, 'kl': 1.21484375, 'epoch': 0.34}
 34%|███▎ | 1439/4286 [8:37:51<16:08:20, 20.41s/it] {'loss': 0.0887, 'grad_norm': 7.125734662108499, 'learning_rate': 6.64255716285581e-07, 'completion_length': 178.25000762939453, 'rewards/only_full_func_accuracy_reward':
0.4285714700818062, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.14633962884545326, 'kl': 2.21484375, 'epoch': 0.34} 34%|███▎ | 1439/4286 [8:37:51<16:08:20, 20.41s/it] 34%|███▎ | 1440/4286 [8:38:09<15:36:31, 19.74s/it] {'loss': 0.0365, 'grad_norm': 6.716991781971142, 'learning_rate': 6.640223985067662e-07, 'completion_length': 175.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5040391534566879, 'rewards/format_reward': 1.0, 'reward': 1.5040391683578491, 'reward_std': 0.08715987205505371, 'kl': 0.912109375, 'epoch': 0.34} 34%|███▎ | 1440/4286 [8:38:09<15:36:31, 19.74s/it] 34%|███▎ | 1441/4286 [8:38:29<15:32:45, 19.67s/it] {'loss': 0.028, 'grad_norm': 2.9990693053215494, 'learning_rate': 6.637890807279514e-07, 'completion_length': 195.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.383928582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3660714626312256, 'reward_std': 0.08392562810331583, 'kl': 0.69970703125, 'epoch': 0.34} 34%|███▎ | 1441/4286 [8:38:29<15:32:45, 19.67s/it] 34%|███▎ | 1442/4286 [8:38:48<15:19:41, 19.40s/it] {'loss': 0.0343, 'grad_norm': 3.247105394617566, 'learning_rate': 6.635557629491368e-07, 'completion_length': 175.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5245536267757416, 'rewards/format_reward': 1.0, 'reward': 1.5245537161827087, 'reward_std': 0.0831121876835823, 'kl': 0.85693359375, 'epoch': 0.34} 34%|███▎ | 1442/4286 [8:38:48<15:19:41, 19.40s/it] 34%|███▎ | 1443/4286 [8:39:07<15:21:52, 19.46s/it] {'loss': 0.0112, 'grad_norm': 15.15199630068578, 'learning_rate': 6.63322445170322e-07, 'completion_length': 202.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.6354166865348816, 'rewards/format_reward': 1.0, 'reward': 1.6354168057441711, 'reward_std': 0.06649590097367764, 'kl': 0.28076171875, 'epoch': 0.34} 34%|███▎ | 1443/4286 [8:39:07<15:21:52, 19.46s/it] 34%|███▎ | 1444/4286 [8:39:30<16:11:58, 20.52s/it] {'loss': 0.0246, 'grad_norm': 
4.765960734680435, 'learning_rate': 6.630891273915072e-07, 'completion_length': 217.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.551190510392189, 'rewards/format_reward': 1.0, 'reward': 1.5511905550956726, 'reward_std': 0.028571434319019318, 'kl': 0.6142578125, 'epoch': 0.34} 34%|███▎ | 1444/4286 [8:39:30<16:11:58, 20.52s/it] 34%|███▎ | 1445/4286 [8:39:50<16:04:28, 20.37s/it] {'loss': 0.0431, 'grad_norm': 8.45693561230568, 'learning_rate': 6.628558096126925e-07, 'completion_length': 178.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.410714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.10237988829612732, 'kl': 1.07666015625, 'epoch': 0.34} 34%|███▎ | 1445/4286 [8:39:50<16:04:28, 20.37s/it] 34%|███▎ | 1446/4286 [8:40:12<16:16:45, 20.64s/it] {'loss': 0.1376, 'grad_norm': 560.517881782098, 'learning_rate': 6.626224918338778e-07, 'completion_length': 198.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.3958333730697632, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3601192235946655, 'reward_std': 0.17365185916423798, 'kl': 3.44140625, 'epoch': 0.34} 34%|███▎ | 1446/4286 [8:40:12<16:16:45, 20.64s/it] 34%|███▍ | 1447/4286 [8:40:31<16:01:04, 20.31s/it] {'loss': 0.0862, 'grad_norm': 2.9997290905395153, 'learning_rate': 6.62389174055063e-07, 'completion_length': 183.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.3690476417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.333333432674408, 'reward_std': 0.14574572443962097, 'kl': 2.150390625, 'epoch': 0.34} 34%|███▍ | 1447/4286 [8:40:31<16:01:04, 20.31s/it] 34%|███▍ | 1448/4286 [8:40:50<15:46:23, 20.01s/it] {'loss': 0.0571, 'grad_norm': 7.149630514656287, 'learning_rate': 6.621558562762483e-07, 'completion_length': 175.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5720238387584686, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5363096594810486, 'reward_std': 0.1535714427009225, 'kl': 
1.4287109375, 'epoch': 0.34} 34%|███▍ | 1448/4286 [8:40:50<15:46:23, 20.01s/it] 34%|███▍ | 1449/4286 [8:41:10<15:35:14, 19.78s/it] {'loss': 0.0868, 'grad_norm': 4.362286885452699, 'learning_rate': 6.619225384974335e-07, 'completion_length': 194.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.5431548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5252977013587952, 'reward_std': 0.14101510494947433, 'kl': 2.16796875, 'epoch': 0.34} 34%|███▍ | 1449/4286 [8:41:10<15:35:14, 19.78s/it] 34%|███▍ | 1450/4286 [8:41:28<15:20:05, 19.47s/it] {'loss': 0.0513, 'grad_norm': 18.43385060402818, 'learning_rate': 6.616892207186187e-07, 'completion_length': 164.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.5476190745830536, 'rewards/format_reward': 1.0, 'reward': 1.547619104385376, 'reward_std': 0.09967002458870411, 'kl': 1.28125, 'epoch': 0.34} 34%|███▍ | 1450/4286 [8:41:28<15:20:05, 19.47s/it] 34%|███▍ | 1451/4286 [8:41:48<15:28:24, 19.65s/it] {'loss': 0.0563, 'grad_norm': 5.053666846786341, 'learning_rate': 6.61455902939804e-07, 'completion_length': 197.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.4750000238418579, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.457142949104309, 'reward_std': 0.1350393109023571, 'kl': 1.41015625, 'epoch': 0.34} 34%|███▍ | 1451/4286 [8:41:48<15:28:24, 19.65s/it] 34%|███▍ | 1452/4286 [8:42:09<15:41:43, 19.94s/it] {'loss': 0.0752, 'grad_norm': 12.502142129140138, 'learning_rate': 6.612225851609893e-07, 'completion_length': 183.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.589285746216774, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5535715222358704, 'reward_std': 0.19582894444465637, 'kl': 1.8828125, 'epoch': 0.34} 34%|███▍ | 1452/4286 [8:42:09<15:41:43, 19.94s/it] 34%|███▍ | 1453/4286 [8:42:29<15:43:58, 19.99s/it] {'loss': 0.0575, 'grad_norm': 2.767420262086245, 'learning_rate': 6.609892673821745e-07, 'completion_length': 189.25000762939453, 
'rewards/only_full_func_accuracy_reward': 0.52976194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5119048357009888, 'reward_std': 0.06403519958257675, 'kl': 1.43408203125, 'epoch': 0.34} 34%|███▍ | 1453/4286 [8:42:29<15:43:58, 19.99s/it] 34%|███▍ | 1454/4286 [8:42:52<16:24:20, 20.85s/it] {'loss': 0.1059, 'grad_norm': 3.6479398765293403, 'learning_rate': 6.607559496033597e-07, 'completion_length': 170.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5595239400863647, 'reward_std': 0.20853110402822495, 'kl': 2.64453125, 'epoch': 0.34} 34%|███▍ | 1454/4286 [8:42:52<16:24:20, 20.85s/it] 34%|███▍ | 1455/4286 [8:43:15<16:57:22, 21.56s/it] {'loss': 0.1071, 'grad_norm': 5.7701383474629715, 'learning_rate': 6.605226318245451e-07, 'completion_length': 183.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.470089316368103, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4165179133415222, 'reward_std': 0.24657614529132843, 'kl': 2.6796875, 'epoch': 0.34} 34%|███▍ | 1455/4286 [8:43:15<16:57:22, 21.56s/it] 34%|███▍ | 1456/4286 [8:43:35<16:32:26, 21.04s/it] {'loss': 0.0497, 'grad_norm': 4.758284946659953, 'learning_rate': 6.602893140457303e-07, 'completion_length': 168.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.49166668951511383, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.473809540271759, 'reward_std': 0.08390592411160469, 'kl': 1.23828125, 'epoch': 0.34} 34%|███▍ | 1456/4286 [8:43:35<16:32:26, 21.04s/it] 34%|███▍ | 1457/4286 [8:43:55<16:19:26, 20.77s/it] {'loss': 0.114, 'grad_norm': 8.233817595695161, 'learning_rate': 6.600559962669155e-07, 'completion_length': 166.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.4553572088479996, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4375000596046448, 'reward_std': 0.10306542180478573, 'kl': 2.8515625, 'epoch': 0.34} 34%|███▍ | 1457/4286 [8:43:55<16:19:26, 
20.77s/it] 34%|███▍ | 1458/4286 [8:44:16<16:18:31, 20.76s/it] {'loss': 0.1107, 'grad_norm': 4.697483653434627, 'learning_rate': 6.598226784881008e-07, 'completion_length': 173.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.4836309999227524, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4300596117973328, 'reward_std': 0.1879473440349102, 'kl': 2.76953125, 'epoch': 0.34} 34%|███▍ | 1458/4286 [8:44:16<16:18:31, 20.76s/it][2025-03-02 13:51:54,407] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 34%|███▍ | 1459/4286 [8:44:39<16:43:40, 21.30s/it] {'loss': 0.0499, 'grad_norm': 5.118766433353563, 'learning_rate': 6.595893607092861e-07, 'completion_length': 169.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547620296478271, 'reward_std': 0.08014347404241562, 'kl': 1.24609375, 'epoch': 0.34} 34%|███▍ | 1459/4286 [8:44:39<16:43:40, 21.30s/it] 34%|███▍ | 1460/4286 [8:44:58<16:21:14, 20.83s/it] {'loss': 0.0942, 'grad_norm': 14.053156382988735, 'learning_rate': 6.593560429304713e-07, 'completion_length': 164.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.4449404776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.427083432674408, 'reward_std': 0.1775069823488593, 'kl': 2.349609375, 'epoch': 0.34} 34%|███▍ | 1460/4286 [8:44:58<16:21:14, 20.83s/it] 34%|███▍ | 1461/4286 [8:45:16<15:36:20, 19.89s/it] {'loss': 0.0298, 'grad_norm': 5.857685693401809, 'learning_rate': 6.591227251516566e-07, 'completion_length': 168.46429443359375, 
'rewards/only_full_func_accuracy_reward': 0.5130952894687653, 'rewards/format_reward': 1.0, 'reward': 1.5130953788757324, 'reward_std': 0.06766954436898232, 'kl': 0.7490234375, 'epoch': 0.34} 34%|███▍ | 1461/4286 [8:45:16<15:36:20, 19.89s/it] 34%|███▍ | 1462/4286 [8:45:34<15:08:37, 19.31s/it] {'loss': 0.0643, 'grad_norm': 95.61126947348669, 'learning_rate': 6.588894073728418e-07, 'completion_length': 148.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5377976894378662, 'rewards/format_reward': 1.0, 'reward': 1.537797749042511, 'reward_std': 0.06993326917290688, 'kl': 1.60546875, 'epoch': 0.34} 34%|███▍ | 1462/4286 [8:45:34<15:08:37, 19.31s/it] 34%|███▍ | 1463/4286 [8:45:53<15:11:39, 19.38s/it] {'loss': 0.0335, 'grad_norm': 3.7613448037632393, 'learning_rate': 6.586560895940271e-07, 'completion_length': 155.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.5625000596046448, 'rewards/format_reward': 1.0, 'reward': 1.5625001192092896, 'reward_std': 0.06990811973810196, 'kl': 0.837890625, 'epoch': 0.34} 34%|███▍ | 1463/4286 [8:45:53<15:11:39, 19.38s/it][2025-03-02 13:53:28,719] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 34%|███▍ | 1464/4286 [8:46:13<15:11:42, 19.38s/it] {'loss': 0.0517, 'grad_norm': 3.068873553080082, 'learning_rate': 6.584227718152123e-07, 'completion_length': 187.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.4732143431901932, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4375001192092896, 'reward_std': 0.0993882305920124, 'kl': 1.2890625, 'epoch': 0.34} 34%|███▍ | 1464/4286 [8:46:13<15:11:42, 19.38s/it] 34%|███▍ | 1465/4286 [8:46:33<15:15:57, 19.48s/it] {'loss': 0.008, 'grad_norm': 1.6254856860543523, 'learning_rate': 6.581894540363976e-07, 'completion_length': 174.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.691964328289032, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.07900894898921251, 'kl': 0.20068359375, 'epoch': 0.34} 34%|███▍ | 1465/4286 [8:46:33<15:15:57, 19.48s/it] 34%|███▍ | 1466/4286 [8:46:52<15:08:24, 19.33s/it] {'loss': 0.0155, 'grad_norm': 3.0492156329392466, 'learning_rate': 6.579561362575828e-07, 'completion_length': 167.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.5119048058986664, 'rewards/format_reward': 1.0, 'reward': 1.5119048357009888, 'reward_std': 0.043508343398571014, 'kl': 0.3857421875, 'epoch': 0.34} 34%|███▍ | 1466/4286 [8:46:52<15:08:24, 19.33s/it] 34%|███▍ | 1467/4286 [8:47:12<15:20:15, 19.59s/it] {'loss': 0.0487, 'grad_norm': 2.8427756401817033, 'learning_rate': 6.577228184787681e-07, 'completion_length': 162.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.48511911928653717, 'rewards/format_reward': 1.0, 'reward': 1.4851191639900208, 'reward_std': 0.1011904813349247, 'kl': 1.21484375, 'epoch': 0.34} 34%|███▍ | 1467/4286 [8:47:12<15:20:15, 19.59s/it] 34%|███▍ | 1468/4286 [8:47:29<14:46:41, 18.88s/it] {'loss': 0.0222, 'grad_norm': 3.5962605643799166, 
'learning_rate': 6.574895006999534e-07, 'completion_length': 148.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.0773809514939785, 'kl': 0.556640625, 'epoch': 0.34} 34%|███▍ | 1468/4286 [8:47:29<14:46:41, 18.88s/it] 34%|███▍ | 1469/4286 [8:47:47<14:41:53, 18.78s/it] {'loss': 0.0192, 'grad_norm': 11.102375498456215, 'learning_rate': 6.572561829211386e-07, 'completion_length': 175.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5595238357782364, 'rewards/format_reward': 1.0, 'reward': 1.5595239400863647, 'reward_std': 0.04007173329591751, 'kl': 0.478515625, 'epoch': 0.34} 34%|███▍ | 1469/4286 [8:47:48<14:41:53, 18.78s/it] 34%|███▍ | 1470/4286 [8:48:06<14:44:13, 18.84s/it] {'loss': 0.0287, 'grad_norm': 6.893697233759922, 'learning_rate': 6.570228651423238e-07, 'completion_length': 150.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.4925595819950104, 'rewards/format_reward': 1.0, 'reward': 1.4925596714019775, 'reward_std': 0.10441340506076813, 'kl': 0.71875, 'epoch': 0.34} 34%|███▍ | 1470/4286 [8:48:06<14:44:13, 18.84s/it] 34%|███▍ | 1471/4286 [8:48:25<14:44:50, 18.86s/it] {'loss': 0.0143, 'grad_norm': 3.1102759622351237, 'learning_rate': 6.567895473635092e-07, 'completion_length': 173.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.511904776096344, 'rewards/format_reward': 1.0, 'reward': 1.5119048953056335, 'reward_std': 0.07511192187666893, 'kl': 0.35595703125, 'epoch': 0.34} 34%|███▍ | 1471/4286 [8:48:25<14:44:50, 18.86s/it] 34%|███▍ | 1472/4286 [8:48:43<14:26:32, 18.48s/it] {'loss': 0.0157, 'grad_norm': 4.153138503116461, 'learning_rate': 6.565562295846944e-07, 'completion_length': 157.00000381469727, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.022795887663960457, 'kl': 0.392578125, 'epoch': 0.34} 34%|███▍ | 1472/4286 [8:48:43<14:26:32, 
18.48s/it] 34%|███▍ | 1473/4286 [8:49:01<14:24:09, 18.43s/it] {'loss': 0.0237, 'grad_norm': 6.152509888291957, 'learning_rate': 6.563229118058796e-07, 'completion_length': 160.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6145833730697632, 'rewards/format_reward': 1.0, 'reward': 1.614583432674408, 'reward_std': 0.10122353956103325, 'kl': 0.59375, 'epoch': 0.34} 34%|███▍ | 1473/4286 [8:49:01<14:24:09, 18.43s/it] 34%|███▍ | 1474/4286 [8:49:23<15:05:29, 19.32s/it] {'loss': 0.0105, 'grad_norm': 4.214348550786387, 'learning_rate': 6.560895940270648e-07, 'completion_length': 192.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.5327381193637848, 'rewards/format_reward': 1.0, 'reward': 1.532738208770752, 'reward_std': 0.03411935083568096, 'kl': 0.26123046875, 'epoch': 0.34} 34%|███▍ | 1474/4286 [8:49:23<15:05:29, 19.32s/it] 34%|███▍ | 1475/4286 [8:49:42<15:09:44, 19.42s/it] {'loss': 0.0384, 'grad_norm': 7.846634563392465, 'learning_rate': 6.558562762482502e-07, 'completion_length': 172.21428680419922, 'rewards/only_full_func_accuracy_reward': 0.486607164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.450892984867096, 'reward_std': 0.1617957390844822, 'kl': 0.958984375, 'epoch': 0.34} 34%|███▍ | 1475/4286 [8:49:42<15:09:44, 19.42s/it] 34%|███▍ | 1476/4286 [8:50:02<15:09:59, 19.43s/it] {'loss': 0.012, 'grad_norm': 5.001412321312849, 'learning_rate': 6.556229584694354e-07, 'completion_length': 191.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.5610119700431824, 'rewards/format_reward': 1.0, 'reward': 1.5610120296478271, 'reward_std': 0.12946783751249313, 'kl': 0.30029296875, 'epoch': 0.34} 34%|███▍ | 1476/4286 [8:50:02<15:09:59, 19.43s/it] 34%|███▍ | 1477/4286 [8:50:22<15:19:58, 19.65s/it] {'loss': 0.0615, 'grad_norm': 2.785388251915014, 'learning_rate': 6.553896406906206e-07, 'completion_length': 198.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.4782738536596298, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.4604167938232422, 'reward_std': 0.1090783984400332, 'kl': 1.5400390625, 'epoch': 0.34} 34%|███▍ | 1477/4286 [8:50:22<15:19:58, 19.65s/it] 34%|███▍ | 1478/4286 [8:50:40<15:02:17, 19.28s/it] {'loss': 0.0314, 'grad_norm': 8.839075575509847, 'learning_rate': 6.551563229118059e-07, 'completion_length': 181.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 1.0, 'reward': 1.602678656578064, 'reward_std': 0.14682424068450928, 'kl': 0.78515625, 'epoch': 0.34} 34%|███▍ | 1478/4286 [8:50:40<15:02:17, 19.28s/it] 35%|███▍ | 1479/4286 [8:51:00<15:03:56, 19.32s/it] {'loss': 0.023, 'grad_norm': 3.4220550835637993, 'learning_rate': 6.549230051329911e-07, 'completion_length': 186.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5133928656578064, 'rewards/format_reward': 1.0, 'reward': 1.513392984867096, 'reward_std': 0.04212708491832018, 'kl': 0.5771484375, 'epoch': 0.35} 35%|███▍ | 1479/4286 [8:51:00<15:03:56, 19.32s/it] 35%|███▍ | 1480/4286 [8:51:22<15:38:18, 20.06s/it] {'loss': 0.0157, 'grad_norm': 2.8880695303593913, 'learning_rate': 6.546896873541764e-07, 'completion_length': 172.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.6324405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.614583432674408, 'reward_std': 0.07716727256774902, 'kl': 0.390625, 'epoch': 0.35} 35%|███▍ | 1480/4286 [8:51:22<15:38:18, 20.06s/it] 35%|███▍ | 1481/4286 [8:51:42<15:38:07, 20.07s/it] {'loss': 0.036, 'grad_norm': 2.9948669518931395, 'learning_rate': 6.544563695753617e-07, 'completion_length': 205.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.4002976417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.364583432674408, 'reward_std': 0.20395587384700775, 'kl': 0.8984375, 'epoch': 0.35} 35%|███▍ | 1481/4286 [8:51:42<15:38:07, 20.07s/it] 35%|███▍ | 1482/4286 [8:52:01<15:23:06, 19.75s/it] {'loss': 0.0223, 'grad_norm': 1.9443851629301434, 'learning_rate': 
6.542230517965469e-07, 'completion_length': 178.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.5729166865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5550596714019775, 'reward_std': 0.12393318489193916, 'kl': 0.55810546875, 'epoch': 0.35} 35%|███▍ | 1482/4286 [8:52:01<15:23:06, 19.75s/it] 35%|███▍ | 1483/4286 [8:52:21<15:30:26, 19.92s/it] {'loss': 0.0678, 'grad_norm': 4.147804427185363, 'learning_rate': 6.539897340177321e-07, 'completion_length': 187.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.4568452686071396, 'rewards/format_reward': 1.0, 'reward': 1.4568454027175903, 'reward_std': 0.053270772099494934, 'kl': 1.69921875, 'epoch': 0.35} 35%|███▍ | 1483/4286 [8:52:21<15:30:26, 19.92s/it] 35%|███▍ | 1484/4286 [8:52:41<15:27:40, 19.86s/it] {'loss': 0.0407, 'grad_norm': 12.548759307147208, 'learning_rate': 6.537564162389175e-07, 'completion_length': 202.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5119048207998276, 'rewards/format_reward': 1.0, 'reward': 1.5119048953056335, 'reward_std': 0.11401608027517796, 'kl': 1.01416015625, 'epoch': 0.35} 35%|███▍ | 1484/4286 [8:52:41<15:27:40, 19.86s/it] 35%|███▍ | 1485/4286 [8:53:01<15:26:33, 19.85s/it] {'loss': 0.0292, 'grad_norm': 12.917277510352777, 'learning_rate': 6.535230984601027e-07, 'completion_length': 191.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.5133928805589676, 'rewards/format_reward': 1.0, 'reward': 1.513392984867096, 'reward_std': 0.06869912706315517, 'kl': 0.7275390625, 'epoch': 0.35} 35%|███▍ | 1485/4286 [8:53:01<15:26:33, 19.85s/it] 35%|███▍ | 1486/4286 [8:53:20<15:21:35, 19.75s/it] {'loss': 0.0079, 'grad_norm': 4.194062147787111, 'learning_rate': 6.532897806812879e-07, 'completion_length': 197.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5633928775787354, 'rewards/format_reward': 1.0, 'reward': 1.5633929371833801, 'reward_std': 0.15086285769939423, 'kl': 0.19677734375, 'epoch': 0.35} 35%|███▍ | 1486/4286 
[8:53:20<15:21:35, 19.75s/it] 35%|███▍ | 1487/4286 [8:53:39<15:13:27, 19.58s/it] {'loss': 0.0462, 'grad_norm': 2.1703729488034624, 'learning_rate': 6.530564629024731e-07, 'completion_length': 187.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.549107164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5133929252624512, 'reward_std': 0.14295602962374687, 'kl': 1.15869140625, 'epoch': 0.35} 35%|███▍ | 1487/4286 [8:53:39<15:13:27, 19.58s/it] 35%|███▍ | 1488/4286 [8:53:58<15:04:22, 19.39s/it] {'loss': 0.0672, 'grad_norm': 1.8526747407688025, 'learning_rate': 6.528231451236585e-07, 'completion_length': 179.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5669643133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5491071939468384, 'reward_std': 0.15731073915958405, 'kl': 1.67578125, 'epoch': 0.35} 35%|███▍ | 1488/4286 [8:53:58<15:04:22, 19.39s/it] 35%|███▍ | 1489/4286 [8:54:18<15:11:51, 19.56s/it] {'loss': 0.0443, 'grad_norm': 2.288002519448079, 'learning_rate': 6.525898273448437e-07, 'completion_length': 204.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.488095298409462, 'rewards/format_reward': 1.0, 'reward': 1.4880953431129456, 'reward_std': 0.06982195563614368, 'kl': 1.10546875, 'epoch': 0.35} 35%|███▍ | 1489/4286 [8:54:18<15:11:51, 19.56s/it] 35%|███▍ | 1490/4286 [8:54:39<15:35:45, 20.08s/it] {'loss': 0.1108, 'grad_norm': 2.7714216056766032, 'learning_rate': 6.523565095660289e-07, 'completion_length': 189.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5133928805589676, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4419643878936768, 'reward_std': 0.2512006387114525, 'kl': 2.765625, 'epoch': 0.35} 35%|███▍ | 1490/4286 [8:54:39<15:35:45, 20.08s/it] 35%|███▍ | 1491/4286 [8:54:59<15:25:26, 19.87s/it] {'loss': 0.0447, 'grad_norm': 2.5995530260212534, 'learning_rate': 6.521231917872142e-07, 'completion_length': 175.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.53125, 
'rewards/format_reward': 0.9642857313156128, 'reward': 1.4955357909202576, 'reward_std': 0.1367393359541893, 'kl': 1.1181640625, 'epoch': 0.35} 35%|███▍ | 1491/4286 [8:54:59<15:25:26, 19.87s/it] 35%|███▍ | 1492/4286 [8:55:20<15:39:27, 20.17s/it] {'loss': 0.0175, 'grad_norm': 1.7950937107032345, 'learning_rate': 6.518898740083995e-07, 'completion_length': 206.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6434524357318878, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6255953311920166, 'reward_std': 0.07661034166812897, 'kl': 0.4375, 'epoch': 0.35} 35%|███▍ | 1492/4286 [8:55:20<15:39:27, 20.17s/it] 35%|███▍ | 1493/4286 [8:55:40<15:37:40, 20.14s/it] {'loss': 0.0338, 'grad_norm': 1.7598659784558237, 'learning_rate': 6.516565562295847e-07, 'completion_length': 187.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6142857372760773, 'rewards/format_reward': 1.0, 'reward': 1.6142858266830444, 'reward_std': 0.06858276668936014, 'kl': 0.84716796875, 'epoch': 0.35} 35%|███▍ | 1493/4286 [8:55:40<15:37:40, 20.14s/it] 35%|███▍ | 1494/4286 [8:56:00<15:43:20, 20.27s/it] {'loss': 0.0251, 'grad_norm': 2.787357230005593, 'learning_rate': 6.5142323845077e-07, 'completion_length': 193.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5273810178041458, 'rewards/format_reward': 1.0, 'reward': 1.5273810625076294, 'reward_std': 0.028571427799761295, 'kl': 0.626953125, 'epoch': 0.35} 35%|███▍ | 1494/4286 [8:56:00<15:43:20, 20.27s/it] 35%|███▍ | 1495/4286 [8:56:20<15:29:37, 19.98s/it] {'loss': 0.0235, 'grad_norm': 1.8706576737841476, 'learning_rate': 6.511899206719552e-07, 'completion_length': 194.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.584821492433548, 'rewards/format_reward': 1.0, 'reward': 1.5848215818405151, 'reward_std': 0.05594631005078554, 'kl': 0.587890625, 'epoch': 0.35} 35%|███▍ | 1495/4286 [8:56:20<15:29:37, 19.98s/it] 35%|███▍ | 1496/4286 [8:56:39<15:23:56, 19.87s/it] {'loss': 0.029, 'grad_norm': 2.056031055521936, 
'learning_rate': 6.509566028931405e-07, 'completion_length': 178.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.7038690745830536, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.08907204307615757, 'kl': 0.728515625, 'epoch': 0.35} 35%|███▍ | 1496/4286 [8:56:39<15:23:56, 19.87s/it] 35%|███▍ | 1497/4286 [8:57:00<15:31:02, 20.03s/it] {'loss': 0.0728, 'grad_norm': 3.0569197424937142, 'learning_rate': 6.507232851143257e-07, 'completion_length': 197.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.5014881193637848, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4657739400863647, 'reward_std': 0.19078204035758972, 'kl': 1.8125, 'epoch': 0.35} 35%|███▍ | 1497/4286 [8:57:00<15:31:02, 20.03s/it] 35%|███▍ | 1498/4286 [8:57:19<15:23:04, 19.87s/it] {'loss': 0.0142, 'grad_norm': 2.7492855901037743, 'learning_rate': 6.50489967335511e-07, 'completion_length': 189.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.5661139935255051, 'rewards/format_reward': 1.0, 'reward': 1.566114068031311, 'reward_std': 0.03443877678364515, 'kl': 0.3544921875, 'epoch': 0.35} 35%|███▍ | 1498/4286 [8:57:19<15:23:04, 19.87s/it] 35%|███▍ | 1499/4286 [8:57:39<15:21:39, 19.84s/it] {'loss': 0.0583, 'grad_norm': 3.340310616712708, 'learning_rate': 6.502566495566962e-07, 'completion_length': 200.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.59226194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5744048953056335, 'reward_std': 0.14631515741348267, 'kl': 1.45703125, 'epoch': 0.35} 35%|███▍ | 1499/4286 [8:57:39<15:21:39, 19.84s/it] 35%|███▍ | 1500/4286 [8:58:01<15:46:11, 20.38s/it] {'loss': 0.0175, 'grad_norm': 6.183693537876374, 'learning_rate': 6.500233317778814e-07, 'completion_length': 211.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.43095239996910095, 'rewards/format_reward': 1.0, 'reward': 1.430952548980713, 'reward_std': 0.05150624178349972, 'kl': 0.435546875, 'epoch': 0.35} 35%|███▍ | 
1500/4286 [8:58:01<15:46:11, 20.38s/it] 35%|███▌ | 1501/4286 [9:01:45<63:09:54, 81.65s/it] {'loss': 0.0435, 'grad_norm': 1.6628112296556452, 'learning_rate': 6.497900139990668e-07, 'completion_length': 183.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5669643580913544, 'rewards/format_reward': 1.0, 'reward': 1.5669643878936768, 'reward_std': 0.0922619104385376, 'kl': 1.0859375, 'epoch': 0.35} 35%|███▌ | 1501/4286 [9:01:45<63:09:54, 81.65s/it] 35%|███▌ | 1502/4286 [9:02:06<48:59:03, 63.34s/it] {'loss': 0.0349, 'grad_norm': 4.583014381641504, 'learning_rate': 6.49556696220252e-07, 'completion_length': 207.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5238095372915268, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.505952537059784, 'reward_std': 0.14838217198848724, 'kl': 0.875, 'epoch': 0.35} 35%|███▌ | 1502/4286 [9:02:06<48:59:03, 63.34s/it] 35%|███▌ | 1503/4286 [9:02:27<39:17:21, 50.82s/it] {'loss': 0.0683, 'grad_norm': 3.0807549799274696, 'learning_rate': 6.493233784414372e-07, 'completion_length': 228.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.590277835726738, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5724207162857056, 'reward_std': 0.10915993340313435, 'kl': 1.7109375, 'epoch': 0.35} 35%|███▌ | 1503/4286 [9:02:27<39:17:21, 50.82s/it] 35%|███▌ | 1504/4286 [9:02:51<32:57:52, 42.66s/it] {'loss': 0.0363, 'grad_norm': 12.298929781802348, 'learning_rate': 6.490900606626225e-07, 'completion_length': 234.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.546726256608963, 'rewards/format_reward': 1.0, 'reward': 1.5467263460159302, 'reward_std': 0.10853204876184464, 'kl': 0.91015625, 'epoch': 0.35} 35%|███▌ | 1504/4286 [9:02:51<32:57:52, 42.66s/it] 35%|███▌ | 1505/4286 [9:03:16<28:48:02, 37.28s/it] {'loss': 0.0775, 'grad_norm': 6.614272773667157, 'learning_rate': 6.488567428838078e-07, 'completion_length': 227.33930206298828, 'rewards/only_full_func_accuracy_reward': 0.5483631193637848, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.5305060744285583, 'reward_std': 0.21264159679412842, 'kl': 1.94140625, 'epoch': 0.35} 35%|███▌ | 1505/4286 [9:03:16<28:48:02, 37.28s/it] 35%|███▌ | 1506/4286 [9:03:37<24:59:15, 32.36s/it] {'loss': 0.0484, 'grad_norm': 4.668717353064591, 'learning_rate': 6.48623425104993e-07, 'completion_length': 194.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5029762238264084, 'rewards/format_reward': 1.0, 'reward': 1.5029762983322144, 'reward_std': 0.0669918954372406, 'kl': 1.21484375, 'epoch': 0.35} 35%|███▌ | 1506/4286 [9:03:37<24:59:15, 32.36s/it] 35%|███▌ | 1507/4286 [9:03:57<22:17:47, 28.88s/it] {'loss': 0.071, 'grad_norm': 2.8877419952463423, 'learning_rate': 6.483901073261783e-07, 'completion_length': 209.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4769953787326813, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4412810802459717, 'reward_std': 0.20130904763936996, 'kl': 1.7734375, 'epoch': 0.35} 35%|███▌ | 1507/4286 [9:03:57<22:17:47, 28.88s/it] 35%|███▌ | 1508/4286 [9:04:17<20:06:08, 26.05s/it] {'loss': 0.0666, 'grad_norm': 14.500831423005167, 'learning_rate': 6.481567895473635e-07, 'completion_length': 199.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5312500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5133929252624512, 'reward_std': 0.18691687658429146, 'kl': 1.6640625, 'epoch': 0.35} 35%|███▌ | 1508/4286 [9:04:17<20:06:08, 26.05s/it] 35%|███▌ | 1509/4286 [9:04:37<18:46:53, 24.35s/it] {'loss': 0.0808, 'grad_norm': 7.417215261614071, 'learning_rate': 6.479234717685488e-07, 'completion_length': 206.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.5044643431901932, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4866072535514832, 'reward_std': 0.23144348710775375, 'kl': 2.015625, 'epoch': 0.35} 35%|███▌ | 1509/4286 [9:04:37<18:46:53, 24.35s/it] 35%|███▌ | 1510/4286 [9:05:01<18:40:01, 24.21s/it] {'loss': 0.0299, 
'grad_norm': 2.735136786070311, 'learning_rate': 6.47690153989734e-07, 'completion_length': 220.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.4038690999150276, 'rewards/format_reward': 1.0, 'reward': 1.4038691520690918, 'reward_std': 0.07812385354191065, 'kl': 0.748046875, 'epoch': 0.35} 35%|███▌ | 1510/4286 [9:05:01<18:40:01, 24.21s/it] 35%|███▌ | 1511/4286 [9:05:20<17:31:26, 22.73s/it] {'loss': 0.0532, 'grad_norm': 5.958880509113359, 'learning_rate': 6.474568362109193e-07, 'completion_length': 197.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.5492772459983826, 'rewards/format_reward': 1.0, 'reward': 1.5492773056030273, 'reward_std': 0.14154477417469025, 'kl': 1.326171875, 'epoch': 0.35} 35%|███▌ | 1511/4286 [9:05:20<17:31:26, 22.73s/it] 35%|███▌ | 1512/4286 [9:05:43<17:26:46, 22.64s/it] {'loss': 0.0943, 'grad_norm': 4.165645785418935, 'learning_rate': 6.472235184321045e-07, 'completion_length': 182.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6765873730182648, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6587302088737488, 'reward_std': 0.1337631419301033, 'kl': 2.359375, 'epoch': 0.35} 35%|███▌ | 1512/4286 [9:05:43<17:26:46, 22.64s/it] 35%|███▌ | 1513/4286 [9:06:05<17:24:35, 22.60s/it] {'loss': 0.0931, 'grad_norm': 7.852080021647145, 'learning_rate': 6.469902006532898e-07, 'completion_length': 175.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.4325397163629532, 'rewards/format_reward': 1.0, 'reward': 1.432539701461792, 'reward_std': 0.1301233321428299, 'kl': 2.328125, 'epoch': 0.35} 35%|███▌ | 1513/4286 [9:06:05<17:24:35, 22.60s/it] 35%|███▌ | 1514/4286 [9:06:25<16:42:26, 21.70s/it] {'loss': 0.094, 'grad_norm': 11.672898295487814, 'learning_rate': 6.467568828744751e-07, 'completion_length': 192.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5041667222976685, 'rewards/format_reward': 1.0, 'reward': 1.5041667222976685, 'reward_std': 0.1692127138376236, 'kl': 2.3515625, 'epoch': 0.35} 
35%|███▌ | 1514/4286 [9:06:25<16:42:26, 21.70s/it] 35%|███▌ | 1515/4286 [9:06:46<16:33:44, 21.52s/it] {'loss': 0.0804, 'grad_norm': 4.049306815862584, 'learning_rate': 6.465235650956603e-07, 'completion_length': 166.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.2723214477300644, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2366071939468384, 'reward_std': 0.11564593017101288, 'kl': 2.00390625, 'epoch': 0.35} 35%|███▌ | 1515/4286 [9:06:46<16:33:44, 21.52s/it] 35%|███▌ | 1516/4286 [9:07:07<16:24:42, 21.33s/it] {'loss': 0.0554, 'grad_norm': 9.092600062113277, 'learning_rate': 6.462902473168455e-07, 'completion_length': 205.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5535714775323868, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.12066227197647095, 'kl': 1.388671875, 'epoch': 0.35} 35%|███▌ | 1516/4286 [9:07:07<16:24:42, 21.33s/it] 35%|███▌ | 1517/4286 [9:07:29<16:35:58, 21.58s/it] {'loss': 0.0581, 'grad_norm': 5.321494055062, 'learning_rate': 6.460569295380309e-07, 'completion_length': 219.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.5800595432519913, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5264882445335388, 'reward_std': 0.2139335200190544, 'kl': 1.453125, 'epoch': 0.35} 35%|███▌ | 1517/4286 [9:07:29<16:35:58, 21.58s/it] 35%|███▌ | 1518/4286 [9:07:49<16:19:02, 21.22s/it] {'loss': 0.0248, 'grad_norm': 11.011076348434969, 'learning_rate': 6.458236117592161e-07, 'completion_length': 191.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.4806547611951828, 'rewards/format_reward': 1.0, 'reward': 1.4806548357009888, 'reward_std': 0.09651605784893036, 'kl': 0.619140625, 'epoch': 0.35} 35%|███▌ | 1518/4286 [9:07:49<16:19:02, 21.22s/it] 35%|███▌ | 1519/4286 [9:08:15<17:25:05, 22.66s/it] {'loss': 0.1029, 'grad_norm': 13.323315563012835, 'learning_rate': 6.455902939804013e-07, 'completion_length': 195.8571548461914, 'rewards/only_full_func_accuracy_reward': 
0.49414683878421783, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4584326148033142, 'reward_std': 0.17651528865098953, 'kl': 2.57421875, 'epoch': 0.35} 35%|███▌ | 1519/4286 [9:08:15<17:25:05, 22.66s/it] 35%|███▌ | 1520/4286 [9:08:37<17:12:23, 22.39s/it] {'loss': 0.0585, 'grad_norm': 3.2971565434979118, 'learning_rate': 6.453569762015865e-07, 'completion_length': 190.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.4367559850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4010418057441711, 'reward_std': 0.18834738433361053, 'kl': 1.466796875, 'epoch': 0.35} 35%|███▌ | 1520/4286 [9:08:37<17:12:23, 22.39s/it] 35%|███▌ | 1521/4286 [9:08:58<16:50:03, 21.92s/it] {'loss': 0.0192, 'grad_norm': 4.7667977512521205, 'learning_rate': 6.451236584227719e-07, 'completion_length': 205.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5861819982528687, 'rewards/format_reward': 1.0, 'reward': 1.5861820578575134, 'reward_std': 0.07717739045619965, 'kl': 0.47998046875, 'epoch': 0.35} 35%|███▌ | 1521/4286 [9:08:58<16:50:03, 21.92s/it] 36%|███▌ | 1522/4286 [9:09:18<16:24:55, 21.38s/it] {'loss': 0.0356, 'grad_norm': 4.767536604752182, 'learning_rate': 6.448903406439571e-07, 'completion_length': 213.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5437500476837158, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5258930325508118, 'reward_std': 0.21573548018932343, 'kl': 0.888671875, 'epoch': 0.36} 36%|███▌ | 1522/4286 [9:09:18<16:24:55, 21.38s/it] 36%|███▌ | 1523/4286 [9:09:41<16:48:04, 21.89s/it] {'loss': 0.0447, 'grad_norm': 4.032285713521833, 'learning_rate': 6.446570228651423e-07, 'completion_length': 239.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.4990079700946808, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.463293731212616, 'reward_std': 0.1916874274611473, 'kl': 1.1171875, 'epoch': 0.36} 36%|███▌ | 1523/4286 [9:09:41<16:48:04, 21.89s/it] 36%|███▌ | 1524/4286 [9:10:05<17:18:15, 22.55s/it] 
{'loss': 0.0385, 'grad_norm': 6.439737007911842, 'learning_rate': 6.444237050863276e-07, 'completion_length': 244.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.4364796280860901, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4007654786109924, 'reward_std': 0.20115907490253448, 'kl': 0.9609375, 'epoch': 0.36} 36%|███▌ | 1524/4286 [9:10:05<17:18:15, 22.55s/it] 36%|███▌ | 1525/4286 [9:10:25<16:38:29, 21.70s/it] {'loss': 0.0084, 'grad_norm': 1.182363294041843, 'learning_rate': 6.441903873075129e-07, 'completion_length': 202.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.01785714365541935, 'kl': 0.20947265625, 'epoch': 0.36} 36%|███▌ | 1525/4286 [9:10:25<16:38:29, 21.70s/it] 36%|███▌ | 1526/4286 [9:10:46<16:28:32, 21.49s/it] {'loss': 0.0236, 'grad_norm': 3.2675038341426483, 'learning_rate': 6.439570695286981e-07, 'completion_length': 230.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.5760281682014465, 'rewards/format_reward': 1.0, 'reward': 1.5760282278060913, 'reward_std': 0.02651515230536461, 'kl': 0.587890625, 'epoch': 0.36} 36%|███▌ | 1526/4286 [9:10:46<16:28:32, 21.49s/it] 36%|███▌ | 1527/4286 [9:11:07<16:24:11, 21.40s/it] {'loss': 0.043, 'grad_norm': 5.164006731459943, 'learning_rate': 6.437237517498834e-07, 'completion_length': 207.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.42219386994838715, 'rewards/format_reward': 1.0, 'reward': 1.422194004058838, 'reward_std': 0.09075066447257996, 'kl': 1.076171875, 'epoch': 0.36} 36%|███▌ | 1527/4286 [9:11:07<16:24:11, 21.40s/it] 36%|███▌ | 1528/4286 [9:11:28<16:09:46, 21.10s/it] {'loss': 0.037, 'grad_norm': 4.575325732931095, 'learning_rate': 6.434904339710686e-07, 'completion_length': 223.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.4479167014360428, 'rewards/format_reward': 1.0, 'reward': 1.4479167461395264, 'reward_std': 0.07808811403810978, 'kl': 
0.9267578125, 'epoch': 0.36} 36%|███▌ | 1528/4286 [9:11:28<16:09:46, 21.10s/it] 36%|███▌ | 1529/4286 [9:11:50<16:20:02, 21.33s/it] {'loss': 0.0377, 'grad_norm': 24.43803280775762, 'learning_rate': 6.432571161922538e-07, 'completion_length': 207.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.541666716337204, 'rewards/format_reward': 1.0, 'reward': 1.5416668057441711, 'reward_std': 0.10718433186411858, 'kl': 0.943359375, 'epoch': 0.36} 36%|███▌ | 1529/4286 [9:11:50<16:20:02, 21.33s/it] 36%|███▌ | 1530/4286 [9:12:10<16:09:43, 21.11s/it] {'loss': 0.026, 'grad_norm': 5.699655367596019, 'learning_rate': 6.430237984134392e-07, 'completion_length': 195.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.566964328289032, 'rewards/format_reward': 1.0, 'reward': 1.5669644474983215, 'reward_std': 0.054907362908124924, 'kl': 0.650390625, 'epoch': 0.36} 36%|███▌ | 1530/4286 [9:12:10<16:09:43, 21.11s/it] 36%|███▌ | 1531/4286 [9:12:30<15:50:08, 20.69s/it] {'loss': 0.0611, 'grad_norm': 5.1292974336322565, 'learning_rate': 6.427904806346244e-07, 'completion_length': 182.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.4136904925107956, 'rewards/format_reward': 1.0, 'reward': 1.4136905670166016, 'reward_std': 0.02627508156001568, 'kl': 1.52734375, 'epoch': 0.36} 36%|███▌ | 1531/4286 [9:12:30<15:50:08, 20.69s/it] 36%|███▌ | 1532/4286 [9:12:49<15:30:44, 20.28s/it] {'loss': 0.0912, 'grad_norm': 14.198153655287102, 'learning_rate': 6.425571628558096e-07, 'completion_length': 186.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6532739400863647, 'reward_std': 0.09650318510830402, 'kl': 2.27734375, 'epoch': 0.36} 36%|███▌ | 1532/4286 [9:12:49<15:30:44, 20.28s/it] 36%|███▌ | 1533/4286 [9:13:12<16:10:52, 21.16s/it] {'loss': 0.0522, 'grad_norm': 18.118518322549438, 'learning_rate': 6.423238450769948e-07, 'completion_length': 184.30358123779297, 
'rewards/only_full_func_accuracy_reward': 0.4846939146518707, 'rewards/format_reward': 1.0, 'reward': 1.4846938848495483, 'reward_std': 0.09893476217985153, 'kl': 1.2998046875, 'epoch': 0.36} 36%|███▌ | 1533/4286 [9:13:12<16:10:52, 21.16s/it] 36%|███▌ | 1534/4286 [9:13:32<15:55:20, 20.83s/it] {'loss': 0.0104, 'grad_norm': 7.704324940245526, 'learning_rate': 6.420905272981802e-07, 'completion_length': 185.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.635416716337204, 'rewards/format_reward': 1.0, 'reward': 1.6354167461395264, 'reward_std': 0.05876358598470688, 'kl': 0.25927734375, 'epoch': 0.36} 36%|███▌ | 1534/4286 [9:13:32<15:55:20, 20.83s/it] 36%|███▌ | 1535/4286 [9:13:52<15:35:05, 20.39s/it] {'loss': 0.0078, 'grad_norm': 4.364402703486361, 'learning_rate': 6.418572095193654e-07, 'completion_length': 184.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.03201759420335293, 'kl': 0.19482421875, 'epoch': 0.36} 36%|███▌ | 1535/4286 [9:13:52<15:35:05, 20.39s/it] 36%|███▌ | 1536/4286 [9:14:12<15:26:53, 20.22s/it] {'loss': 0.0245, 'grad_norm': 6.200015691556374, 'learning_rate': 6.416238917405506e-07, 'completion_length': 181.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6792092621326447, 'rewards/format_reward': 1.0, 'reward': 1.679209291934967, 'reward_std': 0.04398592188954353, 'kl': 0.61328125, 'epoch': 0.36} 36%|███▌ | 1536/4286 [9:14:12<15:26:53, 20.22s/it] 36%|███▌ | 1537/4286 [9:14:33<15:45:35, 20.64s/it] {'loss': 0.0264, 'grad_norm': 11.33418753904043, 'learning_rate': 6.413905739617359e-07, 'completion_length': 220.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.49851194024086, 'rewards/format_reward': 1.0, 'reward': 1.4985119700431824, 'reward_std': 0.1439708024263382, 'kl': 0.66015625, 'epoch': 0.36} 36%|███▌ | 1537/4286 [9:14:33<15:45:35, 20.64s/it] 36%|███▌ | 1538/4286 [9:14:54<15:52:08, 20.79s/it] {'loss': 0.0491, 
'grad_norm': 5.426507571307634, 'learning_rate': 6.411572561829212e-07, 'completion_length': 179.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.49404768645763397, 'rewards/format_reward': 1.0, 'reward': 1.4940477013587952, 'reward_std': 0.0773809589445591, 'kl': 1.2265625, 'epoch': 0.36} 36%|███▌ | 1538/4286 [9:14:54<15:52:08, 20.79s/it] 36%|███▌ | 1539/4286 [9:15:14<15:37:17, 20.47s/it] {'loss': 0.0284, 'grad_norm': 7.721779467265199, 'learning_rate': 6.409239384041064e-07, 'completion_length': 208.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.3869047611951828, 'rewards/format_reward': 1.0, 'reward': 1.3869048357009888, 'reward_std': 0.08769078180193901, 'kl': 0.708984375, 'epoch': 0.36} 36%|███▌ | 1539/4286 [9:15:14<15:37:17, 20.47s/it] 36%|███▌ | 1540/4286 [9:15:34<15:26:10, 20.24s/it] {'loss': 0.0451, 'grad_norm': 4.746083572620278, 'learning_rate': 6.406906206252917e-07, 'completion_length': 177.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.47827382385730743, 'rewards/format_reward': 1.0, 'reward': 1.478273868560791, 'reward_std': 0.056781137362122536, 'kl': 1.125, 'epoch': 0.36} 36%|███▌ | 1540/4286 [9:15:34<15:26:10, 20.24s/it] 36%|███▌ | 1541/4286 [9:15:55<15:35:36, 20.45s/it] {'loss': 0.0348, 'grad_norm': 4.015388445563787, 'learning_rate': 6.404573028464769e-07, 'completion_length': 212.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5059524029493332, 'rewards/format_reward': 1.0, 'reward': 1.5059524774551392, 'reward_std': 0.02816697023808956, 'kl': 0.8720703125, 'epoch': 0.36} 36%|███▌ | 1541/4286 [9:15:55<15:35:36, 20.45s/it] 36%|███▌ | 1542/4286 [9:16:17<16:00:18, 21.00s/it] {'loss': 0.0128, 'grad_norm': 4.399853580020183, 'learning_rate': 6.402239850676622e-07, 'completion_length': 214.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7244517803192139, 'rewards/format_reward': 1.0, 'reward': 1.7244518399238586, 'reward_std': 0.03979504480957985, 'kl': 0.3193359375, 'epoch': 0.36} 36%|███▌ | 
1542/4286 [9:16:17<16:00:18, 21.00s/it] 36%|███▌ | 1543/4286 [9:16:38<15:53:32, 20.86s/it] {'loss': 0.0095, 'grad_norm': 2.5656785640354642, 'learning_rate': 6.399906672888474e-07, 'completion_length': 207.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.5922619700431824, 'rewards/format_reward': 1.0, 'reward': 1.5922620296478271, 'reward_std': 0.09003208577632904, 'kl': 0.23681640625, 'epoch': 0.36} 36%|███▌ | 1543/4286 [9:16:38<15:53:32, 20.86s/it] 36%|███▌ | 1544/4286 [9:16:56<15:20:53, 20.15s/it] {'loss': 0.0085, 'grad_norm': 2.2728382577891693, 'learning_rate': 6.397573495100327e-07, 'completion_length': 184.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7901786267757416, 'rewards/format_reward': 1.0, 'reward': 1.7901787161827087, 'reward_std': 0.014880949631333351, 'kl': 0.212890625, 'epoch': 0.36} 36%|███▌ | 1544/4286 [9:16:56<15:20:53, 20.15s/it] 36%|███▌ | 1545/4286 [9:17:16<15:17:24, 20.08s/it] {'loss': 0.017, 'grad_norm': 3.205634001956723, 'learning_rate': 6.395240317312179e-07, 'completion_length': 171.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5223214775323868, 'rewards/format_reward': 1.0, 'reward': 1.5223215222358704, 'reward_std': 0.0474053705111146, 'kl': 0.4248046875, 'epoch': 0.36} 36%|███▌ | 1545/4286 [9:17:16<15:17:24, 20.08s/it] 36%|███▌ | 1546/4286 [9:17:34<14:51:23, 19.52s/it] {'loss': 0.0128, 'grad_norm': 2.1933357755163314, 'learning_rate': 6.392907139524032e-07, 'completion_length': 173.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.5252976566553116, 'rewards/format_reward': 1.0, 'reward': 1.5252977013587952, 'reward_std': 0.06526251137256622, 'kl': 0.31982421875, 'epoch': 0.36} 36%|███▌ | 1546/4286 [9:17:34<14:51:23, 19.52s/it] 36%|███▌ | 1547/4286 [9:17:55<15:06:00, 19.85s/it] {'loss': 0.0086, 'grad_norm': 3.2230910176771532, 'learning_rate': 6.390573961735885e-07, 'completion_length': 200.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5386905074119568, 
'rewards/format_reward': 1.0, 'reward': 1.5386905670166016, 'reward_std': 0.060301112942397594, 'kl': 0.21435546875, 'epoch': 0.36} 36%|███▌ | 1547/4286 [9:17:55<15:06:00, 19.85s/it] 36%|███▌ | 1548/4286 [9:18:16<15:21:24, 20.19s/it] {'loss': 0.014, 'grad_norm': 3.4150143464054477, 'learning_rate': 6.388240783947737e-07, 'completion_length': 197.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5788690745830536, 'rewards/format_reward': 1.0, 'reward': 1.578869104385376, 'reward_std': 0.05059524066746235, 'kl': 0.349609375, 'epoch': 0.36} 36%|███▌ | 1548/4286 [9:18:16<15:21:24, 20.19s/it] 36%|███▌ | 1549/4286 [9:18:36<15:21:41, 20.21s/it] {'loss': 0.0114, 'grad_norm': 5.772916462729152, 'learning_rate': 6.385907606159589e-07, 'completion_length': 204.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5505952686071396, 'rewards/format_reward': 1.0, 'reward': 1.5505953431129456, 'reward_std': 0.1300976611673832, 'kl': 0.28515625, 'epoch': 0.36} 36%|███▌ | 1549/4286 [9:18:36<15:21:41, 20.21s/it] 36%|███▌ | 1550/4286 [9:18:55<15:06:32, 19.88s/it] {'loss': 0.0104, 'grad_norm': 4.864790041485691, 'learning_rate': 6.383574428371443e-07, 'completion_length': 201.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.4970237910747528, 'rewards/format_reward': 1.0, 'reward': 1.4970239400863647, 'reward_std': 0.06392273120582104, 'kl': 0.25830078125, 'epoch': 0.36} 36%|███▌ | 1550/4286 [9:18:55<15:06:32, 19.88s/it] 36%|███▌ | 1551/4286 [9:19:15<15:12:23, 20.02s/it] {'loss': 0.0103, 'grad_norm': 5.136170605207944, 'learning_rate': 6.381241250583295e-07, 'completion_length': 220.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.3976190537214279, 'rewards/format_reward': 1.0, 'reward': 1.3976191282272339, 'reward_std': 0.08363982103765011, 'kl': 0.2568359375, 'epoch': 0.36} 36%|███▌ | 1551/4286 [9:19:15<15:12:23, 20.02s/it] 36%|███▌ | 1552/4286 [9:19:35<15:08:49, 19.95s/it] {'loss': 0.009, 'grad_norm': 0.4990204179101487, 'learning_rate': 
6.378908072795147e-07, 'completion_length': 185.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.438988134264946, 'rewards/format_reward': 1.0, 'reward': 1.438988208770752, 'reward_std': 0.008928571827709675, 'kl': 0.22509765625, 'epoch': 0.36} 36%|███▌ | 1552/4286 [9:19:35<15:08:49, 19.95s/it] 36%|███▌ | 1553/4286 [9:19:56<15:19:19, 20.18s/it] {'loss': 0.0118, 'grad_norm': 3.9387531277802736, 'learning_rate': 6.376574895007e-07, 'completion_length': 183.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.5907738208770752, 'rewards/format_reward': 1.0, 'reward': 1.5907739400863647, 'reward_std': 0.09066697210073471, 'kl': 0.294921875, 'epoch': 0.36} 36%|███▌ | 1553/4286 [9:19:56<15:19:19, 20.18s/it] 36%|███▋ | 1554/4286 [9:20:15<14:57:44, 19.72s/it] {'loss': 0.0167, 'grad_norm': 1.9368422486943497, 'learning_rate': 6.374241717218852e-07, 'completion_length': 199.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.5403486639261246, 'rewards/format_reward': 1.0, 'reward': 1.5403487086296082, 'reward_std': 0.037801156751811504, 'kl': 0.41796875, 'epoch': 0.36} 36%|███▋ | 1554/4286 [9:20:15<14:57:44, 19.72s/it] 36%|███▋ | 1555/4286 [9:20:34<14:55:39, 19.68s/it] {'loss': 0.036, 'grad_norm': 6.146217804515351, 'learning_rate': 6.371908539430705e-07, 'completion_length': 170.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.4627976417541504, 'rewards/format_reward': 1.0, 'reward': 1.46279776096344, 'reward_std': 0.0744047686457634, 'kl': 0.8984375, 'epoch': 0.36} 36%|███▋ | 1555/4286 [9:20:34<14:55:39, 19.68s/it] 36%|███▋ | 1556/4286 [9:20:53<14:42:40, 19.40s/it] {'loss': 0.0168, 'grad_norm': 2.80245861665331, 'learning_rate': 6.369575361642557e-07, 'completion_length': 178.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041667461395264, 'reward_std': 0.01785714365541935, 'kl': 0.42138671875, 'epoch': 0.36} 36%|███▋ | 1556/4286 [9:20:53<14:42:40, 19.40s/it] 36%|███▋ 
| 1557/4286 [9:21:13<14:45:07, 19.46s/it] {'loss': 0.0132, 'grad_norm': 6.61616219772456, 'learning_rate': 6.36724218385441e-07, 'completion_length': 196.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5034722983837128, 'rewards/format_reward': 1.0, 'reward': 1.5034723281860352, 'reward_std': 0.0525793693959713, 'kl': 0.3310546875, 'epoch': 0.36} 36%|███▋ | 1557/4286 [9:21:13<14:45:07, 19.46s/it] 36%|███▋ | 1558/4286 [9:21:34<15:05:14, 19.91s/it] {'loss': 0.0388, 'grad_norm': 5.6037208487605366, 'learning_rate': 6.364909006066262e-07, 'completion_length': 199.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.5517857372760773, 'rewards/format_reward': 1.0, 'reward': 1.5517857670783997, 'reward_std': 0.08952697366476059, 'kl': 0.970703125, 'epoch': 0.36} 36%|███▋ | 1558/4286 [9:21:34<15:05:14, 19.91s/it] 36%|███▋ | 1559/4286 [9:21:55<15:25:41, 20.37s/it] {'loss': 0.0186, 'grad_norm': 0.8798347852340228, 'learning_rate': 6.362575828278115e-07, 'completion_length': 226.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.4910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4910715222358704, 'reward_std': 0.01785714365541935, 'kl': 0.4658203125, 'epoch': 0.36} 36%|███▋ | 1559/4286 [9:21:55<15:25:41, 20.37s/it] 36%|███▋ | 1560/4286 [9:22:14<15:13:29, 20.11s/it] {'loss': 0.0211, 'grad_norm': 3.120195596977087, 'learning_rate': 6.360242650489968e-07, 'completion_length': 186.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.672619104385376, 'reward_std': 0.07780397683382034, 'kl': 0.5263671875, 'epoch': 0.36} 36%|███▋ | 1560/4286 [9:22:14<15:13:29, 20.11s/it] 36%|███▋ | 1561/4286 [9:22:35<15:21:20, 20.29s/it] {'loss': 0.05, 'grad_norm': 3.912308543419789, 'learning_rate': 6.35790947270182e-07, 'completion_length': 192.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5550595074892044, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5193453431129456, 
'reward_std': 0.1636904776096344, 'kl': 1.25390625, 'epoch': 0.36} 36%|███▋ | 1561/4286 [9:22:35<15:21:20, 20.29s/it] 36%|███▋ | 1562/4286 [9:22:56<15:28:31, 20.45s/it] {'loss': 0.0854, 'grad_norm': 13.206143453178976, 'learning_rate': 6.355576294913672e-07, 'completion_length': 211.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.4255952686071396, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3720239400863647, 'reward_std': 0.22817125916481018, 'kl': 2.1328125, 'epoch': 0.36} 36%|███▋ | 1562/4286 [9:22:56<15:28:31, 20.45s/it][2025-03-02 14:30:34,988] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 36%|███▋ | 1563/4286 [9:23:19<16:04:25, 21.25s/it] {'loss': 0.0812, 'grad_norm': 53.59339392918179, 'learning_rate': 6.353243117125526e-07, 'completion_length': 198.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.46636906266212463, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4485120177268982, 'reward_std': 0.15238384529948235, 'kl': 2.029296875, 'epoch': 0.36} 36%|███▋ | 1563/4286 [9:23:19<16:04:25, 21.25s/it] 36%|███▋ | 1564/4286 [9:23:40<15:53:31, 21.02s/it] {'loss': 0.1092, 'grad_norm': 22.68968675341954, 'learning_rate': 6.350909939337378e-07, 'completion_length': 159.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6011905670166016, 'reward_std': 0.2094884142279625, 'kl': 2.734375, 'epoch': 0.36} 36%|███▋ | 1564/4286 [9:23:40<15:53:31, 21.02s/it] 37%|███▋ | 1565/4286 [9:24:02<16:17:43, 21.56s/it] {'loss': 0.071, 'grad_norm': 5.567053903537711, 
'learning_rate': 6.34857676154923e-07, 'completion_length': 217.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.5684524178504944, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.532738208770752, 'reward_std': 0.13567392621189356, 'kl': 1.767578125, 'epoch': 0.37} 37%|███▋ | 1565/4286 [9:24:02<16:17:43, 21.56s/it] 37%|███▋ | 1566/4286 [9:24:26<16:45:33, 22.18s/it] {'loss': 0.0987, 'grad_norm': 13.180485877980281, 'learning_rate': 6.346243583761082e-07, 'completion_length': 237.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.508928582072258, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4553571939468384, 'reward_std': 0.24273563921451569, 'kl': 2.46484375, 'epoch': 0.37} 37%|███▋ | 1566/4286 [9:24:26<16:45:33, 22.18s/it] 37%|███▋ | 1567/4286 [9:24:47<16:31:31, 21.88s/it] {'loss': 0.0719, 'grad_norm': 2.1054340689077886, 'learning_rate': 6.343910405972936e-07, 'completion_length': 220.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5818453133106232, 'rewards/format_reward': 1.0, 'reward': 1.5818453431129456, 'reward_std': 0.0863095298409462, 'kl': 1.79638671875, 'epoch': 0.37} 37%|███▋ | 1567/4286 [9:24:47<16:31:31, 21.88s/it][2025-03-02 14:32:26,793] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 37%|███▋ | 1568/4286 [9:25:11<16:55:53, 22.43s/it] {'loss': 0.165, 'grad_norm': 10.223655939602297, 'learning_rate': 6.341577228184788e-07, 'completion_length': 199.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.3199404999613762, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.248512089252472, 'reward_std': 0.23452922701835632, 'kl': 4.1171875, 'epoch': 0.37} 37%|███▋ | 1568/4286 [9:25:11<16:55:53, 22.43s/it] 37%|███▋ | 1569/4286 [9:25:32<16:34:49, 21.97s/it] {'loss': 0.152, 'grad_norm': 8.959479560527734, 'learning_rate': 6.33924405039664e-07, 'completion_length': 221.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.3809524178504944, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2916668057441711, 'reward_std': 0.3072424679994583, 'kl': 3.8046875, 'epoch': 0.37} 37%|███▋ | 1569/4286 [9:25:32<16:34:49, 21.97s/it] 37%|███▋ | 1570/4286 [9:25:51<16:01:30, 21.24s/it] {'loss': 0.0424, 'grad_norm': 5.176156657763508, 'learning_rate': 6.336910872608493e-07, 'completion_length': 202.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.6264881193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6086310744285583, 'reward_std': 0.0863095261156559, 'kl': 1.05859375, 'epoch': 0.37} 37%|███▋ | 1570/4286 [9:25:51<16:01:30, 21.24s/it] 37%|███▋ | 1571/4286 [9:26:10<15:27:57, 20.51s/it] {'loss': 0.389, 'grad_norm': 17607.310386877773, 'learning_rate': 6.334577694820346e-07, 'completion_length': 192.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.4464286118745804, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285715818405151, 'reward_std': 0.1842254474759102, 'kl': 9.72265625, 'epoch': 0.37} 37%|███▋ | 1571/4286 [9:26:10<15:27:57, 20.51s/it] 37%|███▋ | 1572/4286 [9:26:31<15:32:37, 20.62s/it] {'loss': 0.0829, 
'grad_norm': 187.17054914182324, 'learning_rate': 6.332244517032198e-07, 'completion_length': 191.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4345238357782364, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3988096714019775, 'reward_std': 0.15476190485060215, 'kl': 2.078125, 'epoch': 0.37} 37%|███▋ | 1572/4286 [9:26:31<15:32:37, 20.62s/it] 37%|███▋ | 1573/4286 [9:26:51<15:18:24, 20.31s/it] {'loss': 0.0174, 'grad_norm': 1.2830314274726555, 'learning_rate': 6.329911339244051e-07, 'completion_length': 171.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.5238095819950104, 'rewards/format_reward': 1.0, 'reward': 1.5238096714019775, 'reward_std': 0.020619653165340424, 'kl': 0.435546875, 'epoch': 0.37} 37%|███▋ | 1573/4286 [9:26:51<15:18:24, 20.31s/it] 37%|███▋ | 1574/4286 [9:27:11<15:18:19, 20.32s/it] {'loss': 0.0627, 'grad_norm': 1.9825425966210535, 'learning_rate': 6.327578161455903e-07, 'completion_length': 212.08930206298828, 'rewards/only_full_func_accuracy_reward': 0.5386905074119568, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5029763579368591, 'reward_std': 0.18332118727266788, 'kl': 1.568359375, 'epoch': 0.37} 37%|███▋ | 1574/4286 [9:27:11<15:18:19, 20.32s/it] 37%|███▋ | 1575/4286 [9:27:31<15:18:26, 20.33s/it] {'loss': 0.0174, 'grad_norm': 2.5881672440960255, 'learning_rate': 6.325244983667755e-07, 'completion_length': 200.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.58779776096344, 'reward_std': 0.1145407184958458, 'kl': 0.43359375, 'epoch': 0.37} 37%|███▋ | 1575/4286 [9:27:31<15:18:26, 20.33s/it] 37%|███▋ | 1576/4286 [9:27:52<15:19:22, 20.36s/it] {'loss': 0.0354, 'grad_norm': 1.40887615393902, 'learning_rate': 6.322911805879609e-07, 'completion_length': 166.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.4523809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4345239400863647, 'reward_std': 
0.10554792359471321, 'kl': 0.88427734375, 'epoch': 0.37} 37%|███▋ | 1576/4286 [9:27:52<15:19:22, 20.36s/it] 37%|███▋ | 1577/4286 [9:28:13<15:25:34, 20.50s/it] {'loss': 0.0396, 'grad_norm': 4.8926190867994075, 'learning_rate': 6.320578628091461e-07, 'completion_length': 210.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.453869104385376, 'rewards/format_reward': 1.0, 'reward': 1.4538691639900208, 'reward_std': 0.09384516254067421, 'kl': 0.990234375, 'epoch': 0.37} 37%|███▋ | 1577/4286 [9:28:13<15:25:34, 20.50s/it] 37%|███▋ | 1578/4286 [9:28:33<15:21:11, 20.41s/it] {'loss': 0.0271, 'grad_norm': 0.7657800668956695, 'learning_rate': 6.318245450303313e-07, 'completion_length': 175.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.56101194024086, 'rewards/format_reward': 1.0, 'reward': 1.5610120296478271, 'reward_std': 0.026785715483129025, 'kl': 0.67529296875, 'epoch': 0.37} 37%|███▋ | 1578/4286 [9:28:33<15:21:11, 20.41s/it] 37%|███▋ | 1579/4286 [9:28:53<15:21:29, 20.42s/it] {'loss': 0.0293, 'grad_norm': 1.8381771237304596, 'learning_rate': 6.315912272515165e-07, 'completion_length': 217.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.568452388048172, 'rewards/format_reward': 1.0, 'reward': 1.5684524774551392, 'reward_std': 0.09823485091328621, 'kl': 0.7353515625, 'epoch': 0.37} 37%|███▋ | 1579/4286 [9:28:53<15:21:29, 20.42s/it] 37%|███▋ | 1580/4286 [9:29:14<15:22:24, 20.45s/it] {'loss': 0.0516, 'grad_norm': 2.655058134760914, 'learning_rate': 6.313579094727019e-07, 'completion_length': 199.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5401785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5223215222358704, 'reward_std': 0.12202381808310747, 'kl': 1.2919921875, 'epoch': 0.37} 37%|███▋ | 1580/4286 [9:29:14<15:22:24, 20.45s/it] 37%|███▋ | 1581/4286 [9:29:34<15:13:57, 20.27s/it] {'loss': 0.0485, 'grad_norm': 3.3155858594340146, 'learning_rate': 6.311245916938871e-07, 'completion_length': 182.50000762939453, 
'rewards/only_full_func_accuracy_reward': 0.3809524029493332, 'rewards/format_reward': 1.0, 'reward': 1.3809524774551392, 'reward_std': 0.08416137471795082, 'kl': 1.21533203125, 'epoch': 0.37}
37%|███▋ | 1582/4286 [9:29:54<15:16:51, 20.34s/it] {'loss': 0.0659, 'grad_norm': 6.973374834690835, 'learning_rate': 6.308912739150723e-07, 'completion_length': 198.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5252976566553116, 'rewards/format_reward': 1.0, 'reward': 1.5252977013587952, 'reward_std': 0.09066697396337986, 'kl': 1.6484375, 'epoch': 0.37}
37%|███▋ | 1583/4286 [9:30:17<15:49:36, 21.08s/it] {'loss': 0.0464, 'grad_norm': 4.320810225800464, 'learning_rate': 6.306579561362576e-07, 'completion_length': 207.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.583333432674408, 'reward_std': 0.10119047947227955, 'kl': 1.162109375, 'epoch': 0.37}
37%|███▋ | 1584/4286 [9:30:39<15:56:26, 21.24s/it] {'loss': 0.0249, 'grad_norm': 1.9156142101362403, 'learning_rate': 6.304246383574429e-07, 'completion_length': 213.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.617559552192688, 'rewards/format_reward': 1.0, 'reward': 1.6175596117973328, 'reward_std': 0.06869911402463913, 'kl': 0.6240234375, 'epoch': 0.37}
37%|███▋ | 1585/4286 [9:30:59<15:40:36, 20.89s/it] {'loss': 0.0494, 'grad_norm': 3.536526548604223, 'learning_rate': 6.301913205786281e-07, 'completion_length': 192.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6221591234207153, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6043020486831665, 'reward_std': 0.1366342008113861, 'kl': 1.23828125, 'epoch': 0.37}
37%|███▋ | 1586/4286 [9:31:20<15:45:26,
21.01s/it] {'loss': 0.0075, 'grad_norm': 1.262992362181673, 'learning_rate': 6.299580027998134e-07, 'completion_length': 221.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.4122024178504944, 'rewards/format_reward': 1.0, 'reward': 1.4122024774551392, 'reward_std': 0.03114316239953041, 'kl': 0.1875, 'epoch': 0.37}
37%|███▋ | 1587/4286 [9:31:41<15:42:53, 20.96s/it] {'loss': 0.021, 'grad_norm': 1.4667774949669565, 'learning_rate': 6.297246850209986e-07, 'completion_length': 187.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5431548207998276, 'rewards/format_reward': 1.0, 'reward': 1.5431548953056335, 'reward_std': 0.06907341070473194, 'kl': 0.5234375, 'epoch': 0.37}
37%|███▋ | 1588/4286 [9:32:02<15:40:18, 20.91s/it] {'loss': 0.0272, 'grad_norm': 2.4024432419700887, 'learning_rate': 6.294913672421839e-07, 'completion_length': 220.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5327381193637848, 'rewards/format_reward': 1.0, 'reward': 1.532738208770752, 'reward_std': 0.02976190857589245, 'kl': 0.68017578125, 'epoch': 0.37}
37%|███▋ | 1589/4286 [9:32:22<15:29:43, 20.68s/it] {'loss': 0.0229, 'grad_norm': 3.6240285493306206, 'learning_rate': 6.292580494633691e-07, 'completion_length': 209.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.4880952835083008, 'rewards/format_reward': 1.0, 'reward': 1.4880953431129456, 'reward_std': 0.06069139018654823, 'kl': 0.572265625, 'epoch': 0.37}
37%|███▋ | 1590/4286 [9:32:42<15:30:17, 20.70s/it] {'loss': 0.0215, 'grad_norm': 0.8969833346983744, 'learning_rate': 6.290247316845544e-07, 'completion_length': 207.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.586309552192688, 'rewards/format_reward': 1.0, 'reward': 1.5863096714019775, 'reward_std': 0.06228632479906082, 'kl':
0.53759765625, 'epoch': 0.37}
37%|███▋ | 1591/4286 [9:33:03<15:27:08, 20.64s/it] {'loss': 0.06, 'grad_norm': 1.5857163866135116, 'learning_rate': 6.287914139057396e-07, 'completion_length': 221.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.4940476566553116, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.458333432674408, 'reward_std': 0.16417229920625687, 'kl': 1.49609375, 'epoch': 0.37}
37%|███▋ | 1592/4286 [9:33:23<15:25:04, 20.60s/it] {'loss': 0.0222, 'grad_norm': 1.7330926762426901, 'learning_rate': 6.285580961269249e-07, 'completion_length': 224.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.0208333320915699, 'kl': 0.5546875, 'epoch': 0.37}
37%|███▋ | 1593/4286 [9:33:44<15:19:07, 20.48s/it] {'loss': 0.0069, 'grad_norm': 0.6776734769651719, 'learning_rate': 6.283247783481102e-07, 'completion_length': 194.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.026785715483129025, 'kl': 0.173828125, 'epoch': 0.37}
37%|███▋ | 1594/4286 [9:34:04<15:19:11, 20.49s/it] {'loss': 0.0251, 'grad_norm': 0.7276112506766679, 'learning_rate': 6.280914605692954e-07, 'completion_length': 213.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.4672619700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4494048357009888, 'reward_std': 0.05357143050059676, 'kl': 0.630859375, 'epoch': 0.37}
37%|███▋ | 1595/4286 [9:34:24<15:12:24, 20.34s/it] {'loss': 0.0231, 'grad_norm': 1.419550747673869, 'learning_rate': 6.278581427904806e-07, 'completion_length': 200.8928680419922,
'rewards/only_full_func_accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 1.0, 'reward': 1.598214328289032, 'reward_std': 0.04602411389350891, 'kl': 0.57373046875, 'epoch': 0.37}
37%|███▋ | 1596/4286 [9:34:44<15:00:19, 20.08s/it] {'loss': 0.0296, 'grad_norm': 1.1971328958784677, 'learning_rate': 6.27624825011666e-07, 'completion_length': 192.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.619047611951828, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6011906266212463, 'reward_std': 0.06210381165146828, 'kl': 0.7421875, 'epoch': 0.37}
37%|███▋ | 1597/4286 [9:35:04<14:59:14, 20.06s/it] {'loss': 0.0336, 'grad_norm': 2.0198907484667017, 'learning_rate': 6.273915072328512e-07, 'completion_length': 202.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.6130953133106232, 'rewards/format_reward': 1.0, 'reward': 1.6130954027175903, 'reward_std': 0.10260479152202606, 'kl': 0.841796875, 'epoch': 0.37}
37%|███▋ | 1598/4286 [9:35:24<15:02:22, 20.14s/it] {'loss': 0.0262, 'grad_norm': 3.3040788956636797, 'learning_rate': 6.271581894540364e-07, 'completion_length': 205.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6383928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6205359101295471, 'reward_std': 0.09938187152147293, 'kl': 0.65576171875, 'epoch': 0.37}
37%|███▋ | 1599/4286 [9:35:46<15:22:57, 20.61s/it] {'loss': 0.0076, 'grad_norm': 1.2381641809742177, 'learning_rate': 6.269248716752217e-07, 'completion_length': 241.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.4002976566553116, 'rewards/format_reward': 1.0, 'reward': 1.4002978205680847, 'reward_std': 0.0977869275957346, 'kl': 0.18896484375, 'epoch': 0.37}
37%|███▋ | 1600/4286
[9:36:07<15:27:24, 20.72s/it] {'loss': 0.0408, 'grad_norm': 3.712724315674042, 'learning_rate': 6.26691553896407e-07, 'completion_length': 234.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5119048058986664, 'rewards/format_reward': 1.0, 'reward': 1.5119048357009888, 'reward_std': 0.0357142873108387, 'kl': 1.01953125, 'epoch': 0.37}
37%|███▋ | 1601/4286 [9:41:08<78:15:53, 104.94s/it] {'loss': 0.0661, 'grad_norm': 2.3471547576396046, 'learning_rate': 6.264582361175922e-07, 'completion_length': 218.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5666666924953461, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5309525728225708, 'reward_std': 0.18759804964065552, 'kl': 1.6552734375, 'epoch': 0.37}
37%|███▋ | 1602/4286 [9:41:29<59:29:46, 79.80s/it] {'loss': 0.0073, 'grad_norm': 2.9356794737755214, 'learning_rate': 6.262249183387774e-07, 'completion_length': 252.14287567138672, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.09266925230622292, 'kl': 0.18359375, 'epoch': 0.37}
37%|███▋ | 1603/4286 [9:41:49<46:01:08, 61.75s/it] {'loss': 0.0154, 'grad_norm': 2.428325417408335, 'learning_rate': 6.259916005599627e-07, 'completion_length': 193.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7366072535514832, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.07280982099473476, 'kl': 0.3857421875, 'epoch': 0.37}
37%|███▋ | 1604/4286 [9:42:11<37:04:17, 49.76s/it] {'loss': 0.1006, 'grad_norm': 3.6684454955279833, 'learning_rate': 6.257582827811479e-07, 'completion_length': 229.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.5192708373069763, 'rewards/format_reward': 0.9285714626312256, 'reward':
1.4478423595428467, 'reward_std': 0.3298543654382229, 'kl': 2.51171875, 'epoch': 0.37}
37%|███▋ | 1605/4286 [9:42:35<31:24:52, 42.18s/it] {'loss': 0.0606, 'grad_norm': 2.0236315176817397, 'learning_rate': 6.255249650023332e-07, 'completion_length': 230.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.3482143208384514, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.294642984867096, 'reward_std': 0.20422262698411942, 'kl': 1.515625, 'epoch': 0.37}
37%|███▋ | 1606/4286 [9:42:55<26:23:13, 35.45s/it] {'loss': 0.0422, 'grad_norm': 3.4298903990154623, 'learning_rate': 6.252916472235185e-07, 'completion_length': 213.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.677083432674408, 'reward_std': 0.1160714328289032, 'kl': 1.052734375, 'epoch': 0.37}
37%|███▋ | 1607/4286 [9:43:14<22:45:58, 30.59s/it] {'loss': 0.0599, 'grad_norm': 4.746903740242083, 'learning_rate': 6.250583294447037e-07, 'completion_length': 179.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.629464328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5937500596046448, 'reward_std': 0.13587725907564163, 'kl': 1.49609375, 'epoch': 0.37}
38%|███▊ | 1608/4286 [9:43:35<20:38:04, 27.74s/it] {'loss': 0.0465, 'grad_norm': 25.706404344437512, 'learning_rate': 6.248250116658888e-07, 'completion_length': 230.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.5386905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.520833432674408, 'reward_std': 0.1233053058385849, 'kl': 1.1640625, 'epoch': 0.38}
[2025-03-02 14:51:11,677] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
38%|███▊ | 1609/4286 [9:43:56<19:01:52, 25.59s/it] {'loss': 0.1352, 'grad_norm': 6.114843566795821, 'learning_rate': 6.245916938870742e-07, 'completion_length': 183.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5744048357009888, 'reward_std': 0.27011343836784363, 'kl': 3.375, 'epoch': 0.38}
38%|███▊ | 1610/4286 [9:44:16<17:51:48, 24.03s/it] {'loss': 0.1106, 'grad_norm': 2.3297110011488655, 'learning_rate': 6.243583761082594e-07, 'completion_length': 186.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6116071790456772, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5401785969734192, 'reward_std': 0.25367169827222824, 'kl': 2.7578125, 'epoch': 0.38}
38%|███▊ | 1611/4286 [9:44:36<16:53:33, 22.73s/it] {'loss': 0.0589, 'grad_norm': 5.894796764101116, 'learning_rate': 6.241250583294446e-07, 'completion_length': 191.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6919642984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6562500596046448, 'reward_std': 0.20953397452831268, 'kl': 1.46875, 'epoch': 0.38}
38%|███▊ | 1612/4286 [9:44:56<16:14:43, 21.87s/it] {'loss': 0.071, 'grad_norm': 5.1433194854578, 'learning_rate': 6.238917405506298e-07, 'completion_length': 209.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4910714626312256, 'reward_std':
0.15025382488965988, 'kl': 1.77734375, 'epoch': 0.38}
38%|███▊ | 1613/4286 [9:45:17<16:08:53, 21.75s/it] {'loss': 0.0319, 'grad_norm': 5.341064283920645, 'learning_rate': 6.236584227718152e-07, 'completion_length': 197.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.4910714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4732144474983215, 'reward_std': 0.08173840120434761, 'kl': 0.7939453125, 'epoch': 0.38}
38%|███▊ | 1614/4286 [9:45:38<15:51:28, 21.37s/it] {'loss': 0.0762, 'grad_norm': 18.162461204704336, 'learning_rate': 6.234251049930004e-07, 'completion_length': 190.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5357143878936768, 'reward_std': 0.2648136019706726, 'kl': 1.90234375, 'epoch': 0.38}
38%|███▊ | 1615/4286 [9:45:57<15:23:12, 20.74s/it] {'loss': 0.0366, 'grad_norm': 5.516949710147253, 'learning_rate': 6.231917872141856e-07, 'completion_length': 199.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6056548207998276, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.16388414055109024, 'kl': 0.9169921875, 'epoch': 0.38}
38%|███▊ | 1616/4286 [9:46:19<15:40:49, 21.14s/it] {'loss': 0.1391, 'grad_norm': 3.749942860957503, 'learning_rate': 6.229584694353709e-07, 'completion_length': 178.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.55952388048172, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5238096117973328, 'reward_std': 0.18994100019335747, 'kl': 3.4765625, 'epoch': 0.38}
38%|███▊ | 1617/4286 [9:46:40<15:41:28, 21.16s/it] {'loss': 0.0728, 'grad_norm': 3.487294606743573, 'learning_rate': 6.227251516565562e-07, 'completion_length':
231.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.3943452835083008, 'rewards/format_reward': 1.0, 'reward': 1.3943453431129456, 'reward_std': 0.07280983403325081, 'kl': 1.818359375, 'epoch': 0.38}
38%|███▊ | 1618/4286 [9:47:00<15:25:20, 20.81s/it] {'loss': 0.0237, 'grad_norm': 1.9753766327949396, 'learning_rate': 6.224918338777414e-07, 'completion_length': 206.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.5238095670938492, 'rewards/format_reward': 1.0, 'reward': 1.5238096714019775, 'reward_std': 0.047619045712053776, 'kl': 0.5947265625, 'epoch': 0.38}
38%|███▊ | 1619/4286 [9:47:21<15:23:01, 20.77s/it] {'loss': 0.0693, 'grad_norm': 4.039297855309, 'learning_rate': 6.222585160989267e-07, 'completion_length': 211.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.4092262089252472, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3735119700431824, 'reward_std': 0.27384260296821594, 'kl': 1.734375, 'epoch': 0.38}
38%|███▊ | 1620/4286 [9:47:42<15:23:11, 20.78s/it] {'loss': 0.047, 'grad_norm': 10.786012477785608, 'learning_rate': 6.220251983201119e-07, 'completion_length': 212.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.3437500298023224, 'rewards/format_reward': 1.0, 'reward': 1.3437501192092896, 'reward_std': 0.08183727413415909, 'kl': 1.17578125, 'epoch': 0.38}
38%|███▊ | 1621/4286 [9:48:02<15:18:22, 20.68s/it] {'loss': 0.0604, 'grad_norm': 2.518333988111859, 'learning_rate': 6.217918805412971e-07, 'completion_length': 214.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.5958333909511566, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5601191520690918, 'reward_std': 0.198592908680439, 'kl': 1.515625, 'epoch': 0.38}
38%|███▊ | 1622/4286
[9:48:24<15:29:00, 20.92s/it] {'loss': 0.0517, 'grad_norm': 3.8362152701153005, 'learning_rate': 6.215585627624824e-07, 'completion_length': 233.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.5416666865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5238096117973328, 'reward_std': 0.12876783311367035, 'kl': 1.2890625, 'epoch': 0.38}
38%|███▊ | 1623/4286 [9:48:46<15:43:43, 21.26s/it] {'loss': 0.0859, 'grad_norm': 5.263918176654056, 'learning_rate': 6.213252449836677e-07, 'completion_length': 208.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.6190476417541504, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.5297619700431824, 'reward_std': 0.17219168692827225, 'kl': 2.14453125, 'epoch': 0.38}
38%|███▊ | 1624/4286 [9:49:08<15:56:12, 21.55s/it] {'loss': 0.1029, 'grad_norm': 3.2054802415215167, 'learning_rate': 6.210919272048529e-07, 'completion_length': 213.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.555059552192688, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4836310744285583, 'reward_std': 0.2219736948609352, 'kl': 2.578125, 'epoch': 0.38}
38%|███▊ | 1625/4286 [9:49:30<16:06:45, 21.80s/it] {'loss': 0.1075, 'grad_norm': 3.7017277072620915, 'learning_rate': 6.208586094260381e-07, 'completion_length': 224.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.4627976566553116, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4092262387275696, 'reward_std': 0.20465519279241562, 'kl': 2.6875, 'epoch': 0.38}
38%|███▊ | 1626/4286 [9:49:52<16:04:49, 21.76s/it] {'loss': 0.065, 'grad_norm': 5.579586891776615, 'learning_rate': 6.206252916472235e-07, 'completion_length': 247.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.4965773969888687, 'rewards/format_reward':
0.9464285969734192, 'reward': 1.4430060982704163, 'reward_std': 0.21563612669706345, 'kl': 1.625, 'epoch': 0.38}
38%|███▊ | 1627/4286 [9:50:12<15:44:40, 21.32s/it] {'loss': 0.0235, 'grad_norm': 5.24302217034691, 'learning_rate': 6.203919738684087e-07, 'completion_length': 206.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.5431548058986664, 'rewards/format_reward': 1.0, 'reward': 1.5431548357009888, 'reward_std': 0.03709553927183151, 'kl': 0.58837890625, 'epoch': 0.38}
38%|███▊ | 1628/4286 [9:50:33<15:34:46, 21.10s/it] {'loss': 0.0854, 'grad_norm': 1.95334439768364, 'learning_rate': 6.201586560895939e-07, 'completion_length': 222.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.580357164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5267857909202576, 'reward_std': 0.2260880544781685, 'kl': 2.1328125, 'epoch': 0.38}
38%|███▊ | 1629/4286 [9:50:53<15:28:11, 20.96s/it] {'loss': 0.0253, 'grad_norm': 1.9052774070684801, 'learning_rate': 6.199253383107792e-07, 'completion_length': 199.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.0476190522313118, 'kl': 0.630859375, 'epoch': 0.38}
38%|███▊ | 1630/4286 [9:51:14<15:25:16, 20.90s/it] {'loss': 0.0216, 'grad_norm': 1.3286545192619774, 'learning_rate': 6.196920205319645e-07, 'completion_length': 222.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142859101295471, 'reward_std': 0.04388262890279293, 'kl': 0.5400390625, 'epoch': 0.38}
38%|███▊ | 1631/4286 [9:51:36<15:36:54, 21.17s/it] {'loss': 0.0406, 'grad_norm': 1.6532640753971903, 'learning_rate':
6.194587027531497e-07, 'completion_length': 244.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.5837160348892212, 'rewards/format_reward': 1.0, 'reward': 1.5837161540985107, 'reward_std': 0.1241321973502636, 'kl': 1.0126953125, 'epoch': 0.38}
38%|███▊ | 1632/4286 [9:51:58<15:52:47, 21.54s/it] {'loss': 0.0251, 'grad_norm': 3.250131213526929, 'learning_rate': 6.19225384974335e-07, 'completion_length': 216.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5467262119054794, 'rewards/format_reward': 1.0, 'reward': 1.5467262864112854, 'reward_std': 0.09892453253269196, 'kl': 0.6279296875, 'epoch': 0.38}
38%|███▊ | 1633/4286 [9:52:20<15:56:05, 21.62s/it] {'loss': 0.0326, 'grad_norm': 3.4917718388658017, 'learning_rate': 6.189920671955202e-07, 'completion_length': 237.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.6208063662052155, 'rewards/format_reward': 1.0, 'reward': 1.6208064556121826, 'reward_std': 0.06043224409222603, 'kl': 0.818359375, 'epoch': 0.38}
38%|███▊ | 1634/4286 [9:52:41<15:45:49, 21.40s/it] {'loss': 0.0083, 'grad_norm': 2.3586034977668704, 'learning_rate': 6.187587494167055e-07, 'completion_length': 208.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6324404776096344, 'rewards/format_reward': 1.0, 'reward': 1.6324406266212463, 'reward_std': 0.0446428582072258, 'kl': 0.20703125, 'epoch': 0.38}
38%|███▊ | 1635/4286 [9:53:02<15:40:56, 21.30s/it] {'loss': 0.0069, 'grad_norm': 1.6694032496986242, 'learning_rate': 6.185254316378907e-07, 'completion_length': 219.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5142857134342194, 'rewards/format_reward': 1.0, 'reward': 1.5142857432365417, 'reward_std': 0.06192553602159023, 'kl': 0.1728515625, 'epoch': 0.38}
38%|███▊ | 1635/4286 [9:53:02<15:40:56,
21.30s/it]
[2025-03-02 15:00:42,176] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
38%|███▊ | 1636/4286 [9:53:26<16:17:30, 22.13s/it] {'loss': 0.0327, 'grad_norm': 8.137486535936006, 'learning_rate': 6.18292113859076e-07, 'completion_length': 251.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.39384925365448, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3581349849700928, 'reward_std': 0.16808954626321793, 'kl': 0.818359375, 'epoch': 0.38}
38%|███▊ | 1637/4286 [9:53:47<15:58:06, 21.70s/it] {'loss': 0.0594, 'grad_norm': 4.048205148999293, 'learning_rate': 6.180587960802612e-07, 'completion_length': 231.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.5271577835083008, 'rewards/format_reward': 1.0, 'reward': 1.5271578431129456, 'reward_std': 0.09072807803750038, 'kl': 1.484375, 'epoch': 0.38}
38%|███▊ | 1638/4286 [9:54:09<16:05:49, 21.88s/it] {'loss': 0.045, 'grad_norm': 2.0326021971953954, 'learning_rate': 6.178254783014465e-07, 'completion_length': 237.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.5705782771110535, 'rewards/format_reward': 1.0, 'reward': 1.5705783367156982, 'reward_std': 0.129956915974617, 'kl': 1.1240234375, 'epoch': 0.38}
38%|███▊ | 1639/4286 [9:54:30<15:47:07, 21.47s/it] {'loss': 0.0576, 'grad_norm': 6.353901928899119, 'learning_rate': 6.175921605226318e-07, 'completion_length': 211.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.47930197417736053,
'rewards/format_reward': 1.0, 'reward': 1.4793021082878113, 'reward_std': 0.064398561604321, 'kl': 1.44140625, 'epoch': 0.38}
38%|███▊ | 1640/4286 [9:54:53<16:07:12, 21.93s/it] {'loss': 0.0514, 'grad_norm': 6.284763259463998, 'learning_rate': 6.17358842743817e-07, 'completion_length': 249.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.4895833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4717262983322144, 'reward_std': 0.14296754449605942, 'kl': 1.28515625, 'epoch': 0.38}
[2025-03-02 15:02:32,490] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
38%|███▊ | 1641/4286 [9:55:17<16:31:29, 22.49s/it] {'loss': 0.0654, 'grad_norm': 10.366258189472795, 'learning_rate': 6.171255249650022e-07, 'completion_length': 217.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.4538690894842148, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4181548357009888, 'reward_std': 0.13414273783564568, 'kl': 1.63671875, 'epoch': 0.38}
38%|███▊ | 1642/4286 [9:55:39<16:24:51, 22.35s/it] {'loss': 0.0762, 'grad_norm': 10.570089881192658, 'learning_rate': 6.168922071861876e-07, 'completion_length': 224.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.4017857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3839287161827087, 'reward_std': 0.19730248302221298, 'kl': 1.90234375, 'epoch': 0.38}
38%|███▊ | 1643/4286 [9:56:02<16:33:34,
22.56s/it] {'loss': 0.0309, 'grad_norm': 13.036899491783844, 'learning_rate': 6.166588894073728e-07, 'completion_length': 239.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4732143431901932, 'rewards/format_reward': 1.0, 'reward': 1.473214328289032, 'reward_std': 0.06236079474911094, 'kl': 0.7724609375, 'epoch': 0.38}
38%|███▊ | 1644/4286 [9:56:23<16:16:23, 22.17s/it] {'loss': 0.0715, 'grad_norm': 11.478150018975487, 'learning_rate': 6.16425571628558e-07, 'completion_length': 209.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6324406266212463, 'reward_std': 0.16991522908210754, 'kl': 1.78125, 'epoch': 0.38}
38%|███▊ | 1645/4286 [9:56:48<16:50:33, 22.96s/it] {'loss': 0.1006, 'grad_norm': 8.188281636707922, 'learning_rate': 6.161922538497432e-07, 'completion_length': 229.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.486607164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4330357909202576, 'reward_std': 0.23416852951049805, 'kl': 2.5078125, 'epoch': 0.38}
38%|███▊ | 1646/4286 [9:57:11<16:48:44, 22.93s/it] {'loss': 0.1173, 'grad_norm': 13.748895771844262, 'learning_rate': 6.159589360709286e-07, 'completion_length': 226.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4657738357782364, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4479167461395264, 'reward_std': 0.15937451273202896, 'kl': 2.921875, 'epoch': 0.38}
38%|███▊ | 1647/4286 [9:57:37<17:38:11, 24.06s/it] {'loss': 0.1754, 'grad_norm': 20.334861833786686, 'learning_rate': 6.157256182921138e-07, 'completion_length': 261.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.34196431934833527, 'rewards/format_reward': 0.9285714626312256, 'reward':
1.2705357670783997, 'reward_std': 0.23539802432060242, 'kl': 4.390625, 'epoch': 0.38}
38%|███▊ | 1648/4286 [9:57:58<16:53:48, 23.06s/it] {'loss': 0.011, 'grad_norm': 3.0712833518187472, 'learning_rate': 6.15492300513299e-07, 'completion_length': 209.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 1.0, 'reward': 1.6235119700431824, 'reward_std': 0.0446428582072258, 'kl': 0.27392578125, 'epoch': 0.38}
38%|███▊ | 1649/4286 [9:58:21<16:45:55, 22.89s/it] {'loss': 0.0764, 'grad_norm': 12.785869069680661, 'learning_rate': 6.152589827344843e-07, 'completion_length': 243.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.4211309999227524, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.40327388048172, 'reward_std': 0.12846966832876205, 'kl': 1.912109375, 'epoch': 0.38}
[2025-03-02 15:05:59,527] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
38%|███▊ | 1650/4286 [9:58:44<16:48:53, 22.96s/it] {'loss': 0.1258, 'grad_norm': 7.75760814767672, 'learning_rate': 6.150256649556695e-07, 'completion_length': 213.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.2030930072069168, 'kl': 3.1484375, 'epoch': 0.38}
39%|███▊ | 1651/4286 [9:59:05<16:30:15, 22.55s/it] {'loss': 0.0432, 'grad_norm': 7.772473106095312, 'learning_rate': 6.147923471768548e-07, 'completion_length': 215.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6488095223903656, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.1505005583167076, 'kl': 1.080078125, 'epoch': 0.39}
39%|███▊ | 1652/4286 [9:59:27<16:22:50, 22.39s/it] {'loss': 0.0426, 'grad_norm': 3.974146848334602, 'learning_rate': 6.145590293980401e-07, 'completion_length': 229.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625001788139343, 'reward_std': 0.07225660979747772, 'kl': 1.0634765625, 'epoch': 0.39}
39%|███▊ | 1653/4286 [9:59:48<15:55:51, 21.78s/it] {'loss': 0.0124, 'grad_norm': 2.033196888129716, 'learning_rate': 6.143257116192253e-07, 'completion_length': 223.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.5133928507566452, 'rewards/format_reward': 1.0, 'reward': 1.513392984867096, 'reward_std': 0.040532149374485016, 'kl': 0.31005859375, 'epoch': 0.39}
39%|███▊ | 1654/4286 [10:00:10<15:58:15, 21.84s/it] {'loss': 0.0275, 'grad_norm': 10.776722201193072,
'learning_rate': 6.140923938404105e-07, 'completion_length': 222.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5830357670783997, 'rewards/format_reward': 1.0, 'reward': 1.5830358266830444, 'reward_std': 0.09343690052628517, 'kl': 0.6865234375, 'epoch': 0.39} 39%|███▊ | 1654/4286 [10:00:10<15:58:15, 21.84s/it][2025-03-02 15:07:46,900] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 39%|███▊ | 1655/4286 [10:00:31<15:52:18, 21.72s/it] {'loss': 0.0106, 'grad_norm': 4.681401918888808, 'learning_rate': 6.138590760615959e-07, 'completion_length': 209.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.6302083730697632, 'rewards/format_reward': 1.0, 'reward': 1.6302083730697632, 'reward_std': 0.05022844113409519, 'kl': 0.26416015625, 'epoch': 0.39} 39%|███▊ | 1655/4286 [10:00:31<15:52:18, 21.72s/it][2025-03-02 15:08:08,851] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 39%|███▊ | 1656/4286 [10:00:53<15:55:00, 21.79s/it] {'loss': 0.0367, 'grad_norm': 16.30976775411561, 'learning_rate': 6.136257582827811e-07, 'completion_length': 210.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.4184523969888687, 'rewards/format_reward': 1.0, 'reward': 1.4184524416923523, 'reward_std': 0.14328018575906754, 'kl': 0.91796875, 'epoch': 0.39} 39%|███▊ | 1656/4286 [10:00:53<15:55:00, 21.79s/it] 39%|███▊ | 1657/4286 [10:01:16<16:06:19, 22.05s/it] {'loss': 0.0192, 'grad_norm': 6.272327777937944, 'learning_rate': 6.133924405039663e-07, 'completion_length': 248.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5119048058986664, 'rewards/format_reward': 1.0, 'reward': 1.5119049549102783, 'reward_std': 0.08014345914125443, 'kl': 0.48046875, 'epoch': 0.39} 39%|███▊ | 1657/4286 [10:01:16<16:06:19, 22.05s/it] 39%|███▊ | 1658/4286 [10:01:38<16:09:32, 22.14s/it] {'loss': 0.0235, 'grad_norm': 5.380231299433179, 'learning_rate': 6.131591227251515e-07, 'completion_length': 232.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001788139343, 'reward_std': 0.1333707720041275, 'kl': 0.5859375, 'epoch': 0.39} 39%|███▊ | 1658/4286 [10:01:38<16:09:32, 22.14s/it] 39%|███▊ | 1659/4286 [10:02:02<16:30:14, 22.62s/it] {'loss': 0.012, 'grad_norm': 1.5518718229200017, 'learning_rate': 6.129258049463369e-07, 'completion_length': 200.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6547619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6369048953056335, 'reward_std': 0.049460720270872116, 'kl': 0.30029296875, 'epoch': 0.39} 39%|███▊ | 1659/4286 [10:02:02<16:30:14, 22.62s/it] 39%|███▊ | 1660/4286 [10:02:22<16:00:47, 21.95s/it] {'loss': 0.008, 'grad_norm': 
4.776434208623925, 'learning_rate': 6.126924871675221e-07, 'completion_length': 191.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 1.0, 'reward': 1.7232144474983215, 'reward_std': 0.03411935269832611, 'kl': 0.19873046875, 'epoch': 0.39} 39%|███▊ | 1660/4286 [10:02:22<16:00:47, 21.95s/it] 39%|███▉ | 1661/4286 [10:02:46<16:21:15, 22.43s/it] {'loss': 0.0219, 'grad_norm': 9.177225127912893, 'learning_rate': 6.124591693887073e-07, 'completion_length': 217.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5178572088479996, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000001788139343, 'reward_std': 0.08504730183631182, 'kl': 0.544921875, 'epoch': 0.39} 39%|███▉ | 1661/4286 [10:02:46<16:21:15, 22.43s/it] 39%|███▉ | 1662/4286 [10:03:07<16:08:35, 22.15s/it] {'loss': 0.0077, 'grad_norm': 2.5020586624442394, 'learning_rate': 6.122258516098926e-07, 'completion_length': 231.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.4345238506793976, 'rewards/format_reward': 1.0, 'reward': 1.4345239400863647, 'reward_std': 0.043508343398571014, 'kl': 0.19287109375, 'epoch': 0.39} 39%|███▉ | 1662/4286 [10:03:07<16:08:35, 22.15s/it] 39%|███▉ | 1663/4286 [10:03:30<16:20:25, 22.43s/it] {'loss': 0.0123, 'grad_norm': 4.961664636441818, 'learning_rate': 6.119925338310779e-07, 'completion_length': 239.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.08411327749490738, 'kl': 0.30859375, 'epoch': 0.39} 39%|███▉ | 1663/4286 [10:03:30<16:20:25, 22.43s/it] 39%|███▉ | 1664/4286 [10:03:54<16:40:03, 22.88s/it] {'loss': 0.0178, 'grad_norm': 1.5414945131730335, 'learning_rate': 6.117592160522631e-07, 'completion_length': 205.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5238095372915268, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.505952537059784, 'reward_std': 0.10454528033733368, 'kl': 
0.4443359375, 'epoch': 0.39} 39%|███▉ | 1664/4286 [10:03:54<16:40:03, 22.88s/it] 39%|███▉ | 1665/4286 [10:04:17<16:42:09, 22.94s/it] {'loss': 0.0154, 'grad_norm': 1.3630754720556808, 'learning_rate': 6.115258982734484e-07, 'completion_length': 223.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6160715222358704, 'reward_std': 0.08554929308593273, 'kl': 0.38671875, 'epoch': 0.39} 39%|███▉ | 1665/4286 [10:04:17<16:42:09, 22.94s/it] 39%|███▉ | 1666/4286 [10:04:40<16:42:48, 22.96s/it] {'loss': 0.0263, 'grad_norm': 3.1075446986731596, 'learning_rate': 6.112925804946336e-07, 'completion_length': 251.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.5221088975667953, 'rewards/format_reward': 1.0, 'reward': 1.5221089720726013, 'reward_std': 0.12417355552315712, 'kl': 0.65673828125, 'epoch': 0.39} 39%|███▉ | 1666/4286 [10:04:40<16:42:48, 22.96s/it] 39%|███▉ | 1667/4286 [10:05:02<16:21:50, 22.49s/it] {'loss': 0.0156, 'grad_norm': 4.930059616316447, 'learning_rate': 6.110592627158189e-07, 'completion_length': 233.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.047619045712053776, 'kl': 0.38916015625, 'epoch': 0.39} 39%|███▉ | 1667/4286 [10:05:02<16:21:50, 22.49s/it][2025-03-02 15:12:39,648] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 39%|███▉ | 1668/4286 [10:05:24<16:16:22, 22.38s/it] {'loss': 0.0192, 'grad_norm': 2.407334963751647, 'learning_rate': 6.108259449370041e-07, 'completion_length': 206.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5710034370422363, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.553146243095398, 'reward_std': 0.08886188082396984, 'kl': 0.4814453125, 'epoch': 0.39} 39%|███▉ | 1668/4286 [10:05:24<16:16:22, 22.38s/it] 39%|███▉ | 1669/4286 [10:05:46<16:10:42, 22.26s/it] {'loss': 0.039, 'grad_norm': 6.742267390044166, 'learning_rate': 6.105926271581894e-07, 'completion_length': 240.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.4434524327516556, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.407738208770752, 'reward_std': 0.12315377593040466, 'kl': 0.97265625, 'epoch': 0.39} 39%|███▉ | 1669/4286 [10:05:46<16:10:42, 22.26s/it][2025-03-02 15:13:23,491] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 39%|███▉ | 1670/4286 [10:06:08<16:05:17, 22.14s/it] {'loss': 0.0302, 'grad_norm': 5.361963173526305, 'learning_rate': 6.103593093793746e-07, 'completion_length': 219.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5610119253396988, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.543154776096344, 'reward_std': 0.11302565969526768, 'kl': 0.75390625, 'epoch': 0.39} 39%|███▉ | 1670/4286 [10:06:08<16:05:17, 22.14s/it] 39%|███▉ | 1671/4286 [10:06:28<15:41:12, 21.60s/it] {'loss': 0.0254, 'grad_norm': 1.1318284544619877, 'learning_rate': 6.101259916005598e-07, 'completion_length': 204.33930206298828, 'rewards/only_full_func_accuracy_reward': 0.5997024476528168, 'rewards/format_reward': 1.0, 'reward': 1.599702537059784, 'reward_std': 0.11748574674129486, 'kl': 0.6357421875, 'epoch': 0.39} 39%|███▉ | 1671/4286 [10:06:28<15:41:12, 21.60s/it][2025-03-02 15:14:03,910] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 39%|███▉ | 1672/4286 [10:06:48<15:21:13, 21.15s/it] {'loss': 0.0257, 'grad_norm': 1.3978879634103665, 'learning_rate': 6.098926738217452e-07, 'completion_length': 188.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7113096117973328, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.693452537059784, 'reward_std': 0.09141558222472668, 'kl': 0.64453125, 'epoch': 0.39} 39%|███▉ | 1672/4286 [10:06:48<15:21:13, 21.15s/it] 39%|███▉ | 1673/4286 [10:07:11<15:40:11, 21.59s/it] {'loss': 0.0682, 'grad_norm': 9.220821395424483, 'learning_rate': 6.096593560429304e-07, 'completion_length': 250.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5505952835083008, 'rewards/format_reward': 1.0, 'reward': 1.5505953431129456, 'reward_std': 0.1319284401834011, 'kl': 1.70703125, 'epoch': 0.39} 39%|███▉ | 1673/4286 [10:07:11<15:40:11, 21.59s/it] 39%|███▉ | 1674/4286 [10:07:33<15:44:47, 21.70s/it] {'loss': 0.0326, 'grad_norm': 9.913782942696509, 'learning_rate': 6.094260382641156e-07, 'completion_length': 252.96430206298828, 'rewards/only_full_func_accuracy_reward': 0.4434524178504944, 'rewards/format_reward': 1.0, 'reward': 1.443452537059784, 'reward_std': 0.04609858733601868, 'kl': 0.81640625, 'epoch': 0.39} 39%|███▉ | 1674/4286 [10:07:33<15:44:47, 21.70s/it] 39%|███▉ | 1675/4286 [10:07:56<16:01:57, 22.11s/it] {'loss': 0.0178, 'grad_norm': 4.410234200120257, 'learning_rate': 6.09192720485301e-07, 'completion_length': 245.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.4806547909975052, 'rewards/format_reward': 1.0, 'reward': 1.4806548357009888, 'reward_std': 0.05679436353966594, 'kl': 0.4453125, 'epoch': 0.39} 39%|███▉ | 1675/4286 [10:07:56<16:01:57, 22.11s/it][2025-03-02 15:15:36,649] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last 
step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 39%|███▉ | 1676/4286 [10:08:21<16:40:40, 23.00s/it] {'loss': 0.0552, 'grad_norm': 6.5227431478614095, 'learning_rate': 6.089594027064862e-07, 'completion_length': 275.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.4032738506793976, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3675596117973328, 'reward_std': 0.16317375004291534, 'kl': 1.3779296875, 'epoch': 0.39} 39%|███▉ | 1676/4286 [10:08:21<16:40:40, 23.00s/it][2025-03-02 15:16:00,541] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 39%|███▉ | 1677/4286 [10:08:45<16:51:51, 23.27s/it] {'loss': 0.0366, 'grad_norm': 5.6603923615713105, 'learning_rate': 6.087260849276714e-07, 'completion_length': 236.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6151786148548126, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5616071820259094, 'reward_std': 0.17294985800981522, 'kl': 0.916015625, 'epoch': 0.39} 39%|███▉ | 1677/4286 [10:08:45<16:51:51, 23.27s/it] 39%|███▉ | 1678/4286 [10:09:08<16:57:37, 23.41s/it] {'loss': 0.1236, 'grad_norm': 11.488679666444575, 'learning_rate': 6.084927671488567e-07, 'completion_length': 261.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.5186012536287308, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4471727013587952, 'reward_std': 0.17768792901188135, 'kl': 3.0830078125, 'epoch': 0.39} 39%|███▉ | 1678/4286 [10:09:08<16:57:37, 23.41s/it] 39%|███▉ | 1679/4286 [10:09:31<16:48:50, 23.22s/it] {'loss': 0.0499, 'grad_norm': 5.615827601031189, 'learning_rate': 6.082594493700419e-07, 'completion_length': 205.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.5773810297250748, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5416667461395264, 'reward_std': 0.16967591643333435, 'kl': 1.25, 'epoch': 0.39} 39%|███▉ | 1679/4286 [10:09:31<16:48:50, 23.22s/it] 39%|███▉ | 1680/4286 [10:09:53<16:30:07, 22.80s/it] {'loss': 0.0217, 'grad_norm': 11.927569331836722, 'learning_rate': 6.080261315912272e-07, 'completion_length': 218.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6711310744285583, 'reward_std': 0.06697607971727848, 'kl': 0.544921875, 'epoch': 0.39} 39%|███▉ | 1680/4286 [10:09:53<16:30:07, 22.80s/it] 39%|███▉ | 1681/4286 [10:10:16<16:27:25, 22.74s/it] 
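The `stage3.py` warning repeated throughout this log recommends adding `get_accelerator().empty_cache()` calls to the training loop so that all ranks flush their PyTorch allocator caches at the same time. A minimal sketch of that remedy, assuming DeepSpeed is installed; the flush interval, the `training_loop` wrapper, and the `train_step` callback are illustrative placeholders, not part of the original run:

```python
# Sketch of the remedy named in the stage3.py warning above: call
# get_accelerator().empty_cache() at the same step index on every rank so the
# allocator caches are flushed simultaneously. EMPTY_CACHE_INTERVAL and
# train_step are illustrative placeholders, not taken from the original run.
try:
    from deepspeed.accelerator import get_accelerator
    _accel = get_accelerator()
except Exception:  # no DeepSpeed available: fall back to a no-op stub
    class _NoOpAccelerator:
        def empty_cache(self):
            pass
    _accel = _NoOpAccelerator()

EMPTY_CACHE_INTERVAL = 50  # tune to how often the warning fires in your run


def training_loop(num_steps, train_step):
    for step in range(num_steps):
        train_step(step)
        # Every rank reaches this branch at the same step index, so all
        # ranks empty their caches together, as the warning recommends.
        if step % EMPTY_CACHE_INTERVAL == EMPTY_CACHE_INTERVAL - 1:
            _accel.empty_cache()


# usage: run a few dummy steps with a placeholder step function
training_loop(3, lambda step: None)
```

Flushing on a fixed interval trades a small synchronization cost for predictable allocator behavior; since the warnings here fire every few steps, addressing the underlying memory pressure (e.g. shorter completions or a smaller per-device batch) may help more than cache flushes alone.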
{'loss': 0.0598, 'grad_norm': 1.4329255563869758, 'learning_rate': 6.077928138124124e-07, 'completion_length': 240.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.6505953073501587, 'rewards/format_reward': 1.0, 'reward': 1.6505953669548035, 'reward_std': 0.09800117462873459, 'kl': 1.498046875, 'epoch': 0.39} 39%|███▉ | 1681/4286 [10:10:16<16:27:25, 22.74s/it] 39%|███▉ | 1682/4286 [10:10:38<16:23:57, 22.67s/it] {'loss': 0.0919, 'grad_norm': 3.2586592854350473, 'learning_rate': 6.075594960335977e-07, 'completion_length': 247.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6877126395702362, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6341413259506226, 'reward_std': 0.2977861762046814, 'kl': 2.29296875, 'epoch': 0.39} 39%|███▉ | 1682/4286 [10:10:38<16:23:57, 22.67s/it] 39%|███▉ | 1683/4286 [10:10:59<15:57:13, 22.06s/it] {'loss': 0.0886, 'grad_norm': 3.1937784773720206, 'learning_rate': 6.073261782547829e-07, 'completion_length': 207.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5699405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.552083432674408, 'reward_std': 0.1281739566475153, 'kl': 2.21484375, 'epoch': 0.39} 39%|███▉ | 1683/4286 [10:10:59<15:57:13, 22.06s/it] 39%|███▉ | 1684/4286 [10:11:21<15:57:24, 22.08s/it] {'loss': 0.0538, 'grad_norm': 4.138486618262937, 'learning_rate': 6.070928604759682e-07, 'completion_length': 216.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.593750074505806, 'rewards/format_reward': 1.0, 'reward': 1.5937501788139343, 'reward_std': 0.15434297546744347, 'kl': 1.34765625, 'epoch': 0.39} 39%|███▉ | 1684/4286 [10:11:21<15:57:24, 22.08s/it] 39%|███▉ | 1685/4286 [10:11:43<15:59:15, 22.13s/it] {'loss': 0.0453, 'grad_norm': 4.789726954059923, 'learning_rate': 6.068595426971535e-07, 'completion_length': 252.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.6767857372760773, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.658928632736206, 
'reward_std': 0.17507734149694443, 'kl': 1.1328125, 'epoch': 0.39} 39%|███▉ | 1685/4286 [10:11:43<15:59:15, 22.13s/it] 39%|███▉ | 1686/4286 [10:12:03<15:32:38, 21.52s/it] {'loss': 0.1673, 'grad_norm': 1648.27350380341, 'learning_rate': 6.066262249183387e-07, 'completion_length': 184.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 1.0, 'reward': 1.6116071939468384, 'reward_std': 0.053126778453588486, 'kl': 4.1875, 'epoch': 0.39} 39%|███▉ | 1686/4286 [10:12:03<15:32:38, 21.52s/it] 39%|███▉ | 1687/4286 [10:12:25<15:41:47, 21.74s/it] {'loss': 0.0677, 'grad_norm': 3.302371475070648, 'learning_rate': 6.063929071395239e-07, 'completion_length': 232.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.5758928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5580358505249023, 'reward_std': 0.18955840170383453, 'kl': 1.69140625, 'epoch': 0.39} 39%|███▉ | 1687/4286 [10:12:25<15:41:47, 21.74s/it] 39%|███▉ | 1688/4286 [10:12:47<15:38:00, 21.66s/it] {'loss': 0.0663, 'grad_norm': 2.979730094591844, 'learning_rate': 6.061595893607093e-07, 'completion_length': 213.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6205357611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5848215818405151, 'reward_std': 0.21683453768491745, 'kl': 1.65625, 'epoch': 0.39} 39%|███▉ | 1688/4286 [10:12:47<15:38:00, 21.66s/it] 39%|███▉ | 1689/4286 [10:13:09<15:47:30, 21.89s/it] {'loss': 0.0745, 'grad_norm': 5.477858780107194, 'learning_rate': 6.059262715818945e-07, 'completion_length': 238.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.3928571790456772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750001192092896, 'reward_std': 0.12340506911277771, 'kl': 1.8671875, 'epoch': 0.39} 39%|███▉ | 1689/4286 [10:13:09<15:47:30, 21.89s/it] 39%|███▉ | 1690/4286 [10:13:31<15:42:12, 21.78s/it] {'loss': 0.0264, 'grad_norm': 1.3612583198174144, 'learning_rate': 6.056929538030797e-07, 
'completion_length': 217.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.5744048207998276, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5565477013587952, 'reward_std': 0.14997709915041924, 'kl': 0.6640625, 'epoch': 0.39} 39%|███▉ | 1690/4286 [10:13:31<15:42:12, 21.78s/it] 39%|███▉ | 1691/4286 [10:13:53<15:51:30, 22.00s/it] {'loss': 0.0432, 'grad_norm': 5.739802482571994, 'learning_rate': 6.054596360242649e-07, 'completion_length': 238.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.501488134264946, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4836310744285583, 'reward_std': 0.10852411761879921, 'kl': 1.078125, 'epoch': 0.39} 39%|███▉ | 1691/4286 [10:13:53<15:51:30, 22.00s/it] 39%|███▉ | 1692/4286 [10:14:15<15:49:27, 21.96s/it] {'loss': 0.062, 'grad_norm': 18.63483745338078, 'learning_rate': 6.052263182454503e-07, 'completion_length': 221.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.5148809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.49702388048172, 'reward_std': 0.15655050799250603, 'kl': 1.546875, 'epoch': 0.39} 39%|███▉ | 1692/4286 [10:14:15<15:49:27, 21.96s/it] 40%|███▉ | 1693/4286 [10:14:37<15:39:47, 21.75s/it] {'loss': 0.0222, 'grad_norm': 1.6763520684730613, 'learning_rate': 6.049930004666355e-07, 'completion_length': 183.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 1.0, 'reward': 1.5997024774551392, 'reward_std': 0.133928582072258, 'kl': 0.55419921875, 'epoch': 0.4} 40%|███▉ | 1693/4286 [10:14:37<15:39:47, 21.75s/it] 40%|███▉ | 1694/4286 [10:14:58<15:35:28, 21.65s/it] {'loss': 0.0073, 'grad_norm': 0.0928616663798447, 'learning_rate': 6.047596826878207e-07, 'completion_length': 209.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.0, 'kl': 0.18212890625, 'epoch': 0.4} 40%|███▉ | 1694/4286 [10:14:58<15:35:28, 
21.65s/it] 40%|███▉ | 1695/4286 [10:15:22<16:05:28, 22.36s/it] {'loss': 0.039, 'grad_norm': 1.8836132921681012, 'learning_rate': 6.04526364909006e-07, 'completion_length': 237.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.4184524267911911, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.382738173007965, 'reward_std': 0.1568220593035221, 'kl': 0.978515625, 'epoch': 0.4} 40%|███▉ | 1695/4286 [10:15:22<16:05:28, 22.36s/it] 40%|███▉ | 1696/4286 [10:15:45<16:15:43, 22.60s/it] {'loss': 0.0406, 'grad_norm': 1.9565847161379566, 'learning_rate': 6.042930471301912e-07, 'completion_length': 245.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.4523809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4345239400863647, 'reward_std': 0.13095239363610744, 'kl': 1.0166015625, 'epoch': 0.4} 40%|███▉ | 1696/4286 [10:15:45<16:15:43, 22.60s/it] 40%|███▉ | 1697/4286 [10:16:06<15:57:03, 22.18s/it] {'loss': 0.0084, 'grad_norm': 1.4066360503385822, 'learning_rate': 6.040597293513765e-07, 'completion_length': 180.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.6770833432674408, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.09131267433986068, 'kl': 0.2109375, 'epoch': 0.4} 40%|███▉ | 1697/4286 [10:16:06<15:57:03, 22.18s/it] 40%|███▉ | 1698/4286 [10:16:29<16:00:29, 22.27s/it] {'loss': 0.0495, 'grad_norm': 3.0839543871820125, 'learning_rate': 6.038264115725618e-07, 'completion_length': 234.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.4270833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4092262983322144, 'reward_std': 0.08311965316534042, 'kl': 1.23583984375, 'epoch': 0.4} 40%|███▉ | 1698/4286 [10:16:29<16:00:29, 22.27s/it] 40%|███▉ | 1699/4286 [10:16:51<16:01:49, 22.31s/it] {'loss': 0.0228, 'grad_norm': 1.2763329095978797, 'learning_rate': 6.03593093793747e-07, 'completion_length': 216.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.0892857164144516, 'kl': 0.56982421875, 'epoch': 0.4} 40%|███▉ | 1699/4286 [10:16:51<16:01:49, 22.31s/it] 40%|███▉ | 1700/4286 [10:17:13<15:52:45, 22.11s/it] {'loss': 0.1029, 'grad_norm': 2.802740080720575, 'learning_rate': 6.033597760149322e-07, 'completion_length': 216.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.4747024327516556, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4211310744285583, 'reward_std': 0.22485071420669556, 'kl': 2.5703125, 'epoch': 0.4} 40%|███▉ | 1700/4286 [10:17:13<15:52:45, 22.11s/it] 40%|███▉ | 1701/4286 [10:20:36<54:58:49, 76.57s/it] {'loss': 0.0387, 'grad_norm': 1.3377270615626349, 'learning_rate': 6.031264582361176e-07, 'completion_length': 222.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6297619342803955, 'rewards/format_reward': 1.0, 'reward': 1.629762053489685, 'reward_std': 0.06592636369168758, 'kl': 0.966796875, 'epoch': 0.4} 40%|███▉ | 1701/4286 [10:20:36<54:58:49, 76.57s/it] 40%|███▉ | 1702/4286 [10:20:56<42:34:48, 59.32s/it] {'loss': 0.0077, 'grad_norm': 1.7750427181540203, 'learning_rate': 6.028931404573028e-07, 'completion_length': 174.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.008928571827709675, 'kl': 0.1923828125, 'epoch': 0.4} 40%|███▉ | 1702/4286 [10:20:56<42:34:48, 59.32s/it] 40%|███▉ | 1703/4286 [10:21:18<34:33:23, 48.16s/it] {'loss': 0.045, 'grad_norm': 2.583574462416703, 'learning_rate': 6.02659822678488e-07, 'completion_length': 247.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.541666716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.505952537059784, 'reward_std': 0.17751142382621765, 'kl': 1.125, 'epoch': 0.4} 40%|███▉ | 1703/4286 [10:21:18<34:33:23, 48.16s/it] 40%|███▉ | 1704/4286 [10:21:39<28:46:20, 40.12s/it] {'loss': 0.0084, 'grad_norm': 
5.085717008640922, 'learning_rate': 6.024265048996732e-07, 'completion_length': 212.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.06547619309276342, 'kl': 0.208984375, 'epoch': 0.4} 40%|███▉ | 1704/4286 [10:21:39<28:46:20, 40.12s/it] 40%|███▉ | 1705/4286 [10:21:59<24:32:03, 34.22s/it] {'loss': 0.0832, 'grad_norm': 3.5272992566843344, 'learning_rate': 6.021931871208586e-07, 'completion_length': 200.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.5163690894842148, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4985119700431824, 'reward_std': 0.08884670957922935, 'kl': 2.076171875, 'epoch': 0.4} 40%|███▉ | 1705/4286 [10:21:59<24:32:03, 34.22s/it] 40%|███▉ | 1706/4286 [10:22:19<21:23:06, 29.84s/it] {'loss': 0.0303, 'grad_norm': 1.0513740102241806, 'learning_rate': 6.019598693420438e-07, 'completion_length': 177.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 1.0, 'reward': 1.5997024774551392, 'reward_std': 0.0267857164144516, 'kl': 0.76171875, 'epoch': 0.4} 40%|███▉ | 1706/4286 [10:22:19<21:23:06, 29.84s/it] 40%|███▉ | 1707/4286 [10:22:40<19:25:26, 27.11s/it] {'loss': 0.0402, 'grad_norm': 3.304597282731937, 'learning_rate': 6.01726551563229e-07, 'completion_length': 215.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5297619700431824, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.49404776096344, 'reward_std': 0.20027920603752136, 'kl': 1.001953125, 'epoch': 0.4} 40%|███▉ | 1707/4286 [10:22:40<19:25:26, 27.11s/it] 40%|███▉ | 1708/4286 [10:22:58<17:31:21, 24.47s/it] {'loss': 0.1289, 'grad_norm': 5955.518502091692, 'learning_rate': 6.014932337844143e-07, 'completion_length': 172.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.05351835489273071, 'kl': 3.197265625, 
'epoch': 0.4} 40%|███▉ | 1708/4286 [10:22:58<17:31:21, 24.47s/it] 40%|███▉ | 1709/4286 [10:23:18<16:26:43, 22.97s/it] {'loss': 0.0081, 'grad_norm': 1.637485022339016, 'learning_rate': 6.012599160055996e-07, 'completion_length': 191.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.608631044626236, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.0446428544819355, 'kl': 0.20263671875, 'epoch': 0.4} 40%|███▉ | 1709/4286 [10:23:18<16:26:43, 22.97s/it] 40%|███▉ | 1710/4286 [10:23:41<16:26:46, 22.98s/it] {'loss': 0.0888, 'grad_norm': 11.961671276502837, 'learning_rate': 6.010265982267848e-07, 'completion_length': 232.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.3139881193637848, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2425596714019775, 'reward_std': 0.135821845382452, 'kl': 2.22265625, 'epoch': 0.4} 40%|███▉ | 1710/4286 [10:23:41<16:26:46, 22.98s/it] 40%|███▉ | 1711/4286 [10:24:02<16:07:01, 22.53s/it] {'loss': 0.0617, 'grad_norm': 4.178407927967054, 'learning_rate': 6.007932804479701e-07, 'completion_length': 209.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6116071939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5758929252624512, 'reward_std': 0.1911240667104721, 'kl': 1.541015625, 'epoch': 0.4} 40%|███▉ | 1711/4286 [10:24:02<16:07:01, 22.53s/it] 40%|███▉ | 1712/4286 [10:24:23<15:41:37, 21.95s/it] {'loss': 0.0676, 'grad_norm': 3.542994203399583, 'learning_rate': 6.005599626691553e-07, 'completion_length': 186.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5308532118797302, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4951390027999878, 'reward_std': 0.1509336568415165, 'kl': 1.6953125, 'epoch': 0.4} 40%|███▉ | 1712/4286 [10:24:23<15:41:37, 21.95s/it] 40%|███▉ | 1713/4286 [10:24:43<15:25:48, 21.59s/it] {'loss': 0.0252, 'grad_norm': 1.5554980793557298, 'learning_rate': 6.003266448903406e-07, 'completion_length': 196.0178680419922, 
'rewards/only_full_func_accuracy_reward': 0.727678656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7098215222358704, 'reward_std': 0.09212672710418701, 'kl': 0.6318359375, 'epoch': 0.4} 40%|███▉ | 1713/4286 [10:24:43<15:25:48, 21.59s/it] 40%|███▉ | 1714/4286 [10:25:07<15:45:24, 22.05s/it] {'loss': 0.0329, 'grad_norm': 2.6950764049854046, 'learning_rate': 6.000933271115258e-07, 'completion_length': 217.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5232143253087997, 'rewards/format_reward': 1.0, 'reward': 1.5232143998146057, 'reward_std': 0.14331317879259586, 'kl': 0.8232421875, 'epoch': 0.4} 40%|███▉ | 1714/4286 [10:25:07<15:45:24, 22.05s/it] 40%|████ | 1715/4286 [10:25:28<15:34:30, 21.81s/it] {'loss': 0.066, 'grad_norm': 9.338630333676493, 'learning_rate': 5.998600093327111e-07, 'completion_length': 214.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.54613097012043, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5282739400863647, 'reward_std': 0.09778692945837975, 'kl': 1.65283203125, 'epoch': 0.4} 40%|████ | 1715/4286 [10:25:28<15:34:30, 21.81s/it] 40%|████ | 1716/4286 [10:25:50<15:41:35, 21.98s/it] {'loss': 0.0183, 'grad_norm': 61.51631356073332, 'learning_rate': 5.996266915538963e-07, 'completion_length': 193.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.03755595162510872, 'kl': 0.45849609375, 'epoch': 0.4} 40%|████ | 1716/4286 [10:25:50<15:41:35, 21.98s/it] 40%|████ | 1717/4286 [10:26:11<15:31:46, 21.76s/it] {'loss': 0.0079, 'grad_norm': 1.3438082999971988, 'learning_rate': 5.993933737750816e-07, 'completion_length': 178.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5580357611179352, 'rewards/format_reward': 1.0, 'reward': 1.5580358505249023, 'reward_std': 0.06961995735764503, 'kl': 0.197265625, 'epoch': 0.4} 40%|████ | 1717/4286 [10:26:11<15:31:46, 21.76s/it] 40%|████ | 1718/4286 
[10:26:33<15:22:44, 21.56s/it] {'loss': 0.0725, 'grad_norm': 6.49695334963082, 'learning_rate': 5.991600559962669e-07, 'completion_length': 186.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6041666567325592, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.532738208770752, 'reward_std': 0.2355937585234642, 'kl': 1.80859375, 'epoch': 0.4} 40%|████ | 1718/4286 [10:26:33<15:22:44, 21.56s/it] 40%|████ | 1719/4286 [10:26:55<15:31:08, 21.76s/it] {'loss': 0.1234, 'grad_norm': 16.493929367969002, 'learning_rate': 5.989267382174521e-07, 'completion_length': 214.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.43065477907657623, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4127976894378662, 'reward_std': 0.10014599189162254, 'kl': 3.078125, 'epoch': 0.4} 40%|████ | 1719/4286 [10:26:55<15:31:08, 21.76s/it] 40%|████ | 1720/4286 [10:27:15<15:05:50, 21.18s/it] {'loss': 0.0336, 'grad_norm': 2.0696721629790926, 'learning_rate': 5.986934204386373e-07, 'completion_length': 169.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.4568452537059784, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.438988208770752, 'reward_std': 0.08760317135602236, 'kl': 0.84326171875, 'epoch': 0.4} 40%|████ | 1720/4286 [10:27:15<15:05:50, 21.18s/it] 40%|████ | 1721/4286 [10:27:35<15:00:21, 21.06s/it] {'loss': 0.0104, 'grad_norm': 0.9358250079625782, 'learning_rate': 5.984601026598227e-07, 'completion_length': 188.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.455357164144516, 'rewards/format_reward': 1.0, 'reward': 1.4553571939468384, 'reward_std': 0.01785714365541935, 'kl': 0.259765625, 'epoch': 0.4} 40%|████ | 1721/4286 [10:27:35<15:00:21, 21.06s/it] 40%|████ | 1722/4286 [10:27:55<14:42:11, 20.64s/it] {'loss': 0.0083, 'grad_norm': 3.364797955868783, 'learning_rate': 5.982267848810079e-07, 'completion_length': 168.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.647321492433548, 'rewards/format_reward': 1.0, 'reward': 
1.6473215818405151, 'reward_std': 0.12644404917955399, 'kl': 0.20654296875, 'epoch': 0.4} 40%|████ | 1722/4286 [10:27:55<14:42:11, 20.64s/it] 40%|████ | 1723/4286 [10:28:14<14:15:58, 20.04s/it] {'loss': 0.0094, 'grad_norm': 1.3437332861165572, 'learning_rate': 5.979934671021931e-07, 'completion_length': 158.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.53125, 'rewards/format_reward': 1.0, 'reward': 1.5312501192092896, 'reward_std': 0.09778692945837975, 'kl': 0.23583984375, 'epoch': 0.4} 40%|████ | 1723/4286 [10:28:14<14:15:58, 20.04s/it] 40%|████ | 1724/4286 [10:28:36<14:42:58, 20.68s/it] {'loss': 0.0075, 'grad_norm': 3.966646585814536, 'learning_rate': 5.977601493233784e-07, 'completion_length': 188.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6360118985176086, 'rewards/format_reward': 1.0, 'reward': 1.6360120177268982, 'reward_std': 0.13136059418320656, 'kl': 0.18701171875, 'epoch': 0.4} 40%|████ | 1724/4286 [10:28:36<14:42:58, 20.68s/it] 40%|████ | 1725/4286 [10:28:56<14:38:53, 20.59s/it] {'loss': 0.0074, 'grad_norm': 2.136234887825046, 'learning_rate': 5.975268315445636e-07, 'completion_length': 198.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6622024476528168, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.027706551365554333, 'kl': 0.18408203125, 'epoch': 0.4} 40%|████ | 1725/4286 [10:28:56<14:38:53, 20.59s/it] 40%|████ | 1726/4286 [10:29:15<14:10:59, 19.95s/it] {'loss': 0.013, 'grad_norm': 2.0524702079999497, 'learning_rate': 5.972935137657489e-07, 'completion_length': 178.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6616071462631226, 'rewards/format_reward': 1.0, 'reward': 1.6616072058677673, 'reward_std': 0.07235211879014969, 'kl': 0.32470703125, 'epoch': 0.4} 40%|████ | 1726/4286 [10:29:15<14:10:59, 19.95s/it] 40%|████ | 1727/4286 [10:29:35<14:14:42, 20.04s/it] {'loss': 0.0107, 'grad_norm': 1.5439164863305292, 'learning_rate': 5.970601959869341e-07, 'completion_length': 
192.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.7432540357112885, 'rewards/format_reward': 1.0, 'reward': 1.7432541251182556, 'reward_std': 0.04218742251396179, 'kl': 0.2685546875, 'epoch': 0.4} 40%|████ | 1727/4286 [10:29:35<14:14:42, 20.04s/it] 40%|████ | 1728/4286 [10:29:55<14:14:06, 20.03s/it] {'loss': 0.0213, 'grad_norm': 2.513465319845678, 'learning_rate': 5.968268782081194e-07, 'completion_length': 175.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.646428644657135, 'rewards/format_reward': 1.0, 'reward': 1.646428644657135, 'reward_std': 0.12813077494502068, 'kl': 0.533203125, 'epoch': 0.4} 40%|████ | 1728/4286 [10:29:55<14:14:06, 20.03s/it] 40%|████ | 1729/4286 [10:30:15<14:16:44, 20.10s/it] {'loss': 0.0077, 'grad_norm': 0.857767810570092, 'learning_rate': 5.965935604293046e-07, 'completion_length': 194.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.03160357475280762, 'kl': 0.193359375, 'epoch': 0.4} 40%|████ | 1729/4286 [10:30:15<14:16:44, 20.10s/it] 40%|████ | 1730/4286 [10:30:35<14:07:35, 19.90s/it] {'loss': 0.0083, 'grad_norm': 5.483619342655673, 'learning_rate': 5.963602426504899e-07, 'completion_length': 177.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5952380895614624, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.06915953475981951, 'kl': 0.20703125, 'epoch': 0.4} 40%|████ | 1730/4286 [10:30:35<14:07:35, 19.90s/it] 40%|████ | 1731/4286 [10:30:55<14:13:20, 20.04s/it] {'loss': 0.0078, 'grad_norm': 1.231435677779871, 'learning_rate': 5.961269248716752e-07, 'completion_length': 193.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5148809552192688, 'rewards/format_reward': 1.0, 'reward': 1.5148810744285583, 'reward_std': 0.03905686270445585, 'kl': 0.1943359375, 'epoch': 0.4} 40%|████ | 1731/4286 [10:30:55<14:13:20, 20.04s/it] 40%|████ | 1732/4286 [10:31:14<14:05:13, 
19.86s/it] {'loss': 0.007, 'grad_norm': 1.6237304513962472, 'learning_rate': 5.958936070928604e-07, 'completion_length': 191.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5669643580913544, 'rewards/format_reward': 1.0, 'reward': 1.566964328289032, 'reward_std': 0.020833334885537624, 'kl': 0.173828125, 'epoch': 0.4} 40%|████ | 1732/4286 [10:31:14<14:05:13, 19.86s/it] 40%|████ | 1733/4286 [10:31:35<14:17:51, 20.16s/it] {'loss': 0.0079, 'grad_norm': 1.3669614228535372, 'learning_rate': 5.956602893140456e-07, 'completion_length': 192.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5476190745830536, 'rewards/format_reward': 1.0, 'reward': 1.547619104385376, 'reward_std': 0.03571428917348385, 'kl': 0.19873046875, 'epoch': 0.4} 40%|████ | 1733/4286 [10:31:35<14:17:51, 20.16s/it] 40%|████ | 1734/4286 [10:31:54<14:04:11, 19.85s/it] {'loss': 0.0091, 'grad_norm': 4.696837745958046, 'learning_rate': 5.95426971535231e-07, 'completion_length': 177.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.42380955815315247, 'rewards/format_reward': 1.0, 'reward': 1.4238096475601196, 'reward_std': 0.03365878947079182, 'kl': 0.22607421875, 'epoch': 0.4} 40%|████ | 1734/4286 [10:31:54<14:04:11, 19.85s/it] 40%|████ | 1735/4286 [10:32:15<14:10:50, 20.01s/it] {'loss': 0.0082, 'grad_norm': 1.7867661266333412, 'learning_rate': 5.951936537564162e-07, 'completion_length': 175.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5729166865348816, 'rewards/format_reward': 1.0, 'reward': 1.5729168057441711, 'reward_std': 0.09023960679769516, 'kl': 0.20556640625, 'epoch': 0.4} 40%|████ | 1735/4286 [10:32:15<14:10:50, 20.01s/it] 41%|████ | 1736/4286 [10:32:36<14:24:48, 20.35s/it] {'loss': 0.0066, 'grad_norm': 1.5641685443893614, 'learning_rate': 5.949603359776014e-07, 'completion_length': 212.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5297619551420212, 'rewards/format_reward': 1.0, 'reward': 1.5297620296478271, 'reward_std': 0.09481073543429375, 
'kl': 0.16455078125, 'epoch': 0.41} 41%|████ | 1736/4286 [10:32:36<14:24:48, 20.35s/it] 41%|████ | 1737/4286 [10:32:57<14:35:50, 20.62s/it] {'loss': 0.0078, 'grad_norm': 1.3565963803177834, 'learning_rate': 5.947270181987866e-07, 'completion_length': 212.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 1.0, 'reward': 1.610119104385376, 'reward_std': 0.06228632479906082, 'kl': 0.19384765625, 'epoch': 0.41} 41%|████ | 1737/4286 [10:32:57<14:35:50, 20.62s/it] 41%|████ | 1738/4286 [10:33:18<14:34:42, 20.60s/it] {'loss': 0.0075, 'grad_norm': 2.276450610673388, 'learning_rate': 5.94493700419972e-07, 'completion_length': 212.21430206298828, 'rewards/only_full_func_accuracy_reward': 0.520833358168602, 'rewards/format_reward': 1.0, 'reward': 1.5208334922790527, 'reward_std': 0.11258090659976006, 'kl': 0.18701171875, 'epoch': 0.41} 41%|████ | 1738/4286 [10:33:18<14:34:42, 20.60s/it] 41%|████ | 1739/4286 [10:33:39<14:38:58, 20.71s/it] {'loss': 0.0071, 'grad_norm': 1.0390798958609018, 'learning_rate': 5.942603826411572e-07, 'completion_length': 213.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.049460723996162415, 'kl': 0.177734375, 'epoch': 0.41} 41%|████ | 1739/4286 [10:33:39<14:38:58, 20.71s/it] 41%|████ | 1740/4286 [10:34:00<14:51:31, 21.01s/it] {'loss': 0.0069, 'grad_norm': 1.9211636825254421, 'learning_rate': 5.940270648623424e-07, 'completion_length': 245.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.5252976417541504, 'rewards/format_reward': 1.0, 'reward': 1.5252977013587952, 'reward_std': 0.04999392572790384, 'kl': 0.173828125, 'epoch': 0.41} 41%|████ | 1740/4286 [10:34:00<14:51:31, 21.01s/it] 41%|████ | 1741/4286 [10:34:21<14:46:10, 20.89s/it] {'loss': 0.0072, 'grad_norm': 0.5876939471534018, 'learning_rate': 5.937937470835277e-07, 'completion_length': 213.50000762939453, 
'rewards/only_full_func_accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.01785714365541935, 'kl': 0.17919921875, 'epoch': 0.41} 41%|████ | 1741/4286 [10:34:21<14:46:10, 20.89s/it] 41%|████ | 1742/4286 [10:34:42<14:45:49, 20.89s/it] {'loss': 0.0193, 'grad_norm': 1.2612485659546706, 'learning_rate': 5.93560429304713e-07, 'completion_length': 199.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.5041667222976685, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4863096475601196, 'reward_std': 0.09365567192435265, 'kl': 0.48388671875, 'epoch': 0.41} 41%|████ | 1742/4286 [10:34:42<14:45:49, 20.89s/it] 41%|████ | 1743/4286 [10:35:04<15:02:48, 21.30s/it] {'loss': 0.0614, 'grad_norm': 15.508038002155498, 'learning_rate': 5.933271115258982e-07, 'completion_length': 249.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5372024327516556, 'rewards/format_reward': 1.0, 'reward': 1.5372024774551392, 'reward_std': 0.05901032313704491, 'kl': 1.5361328125, 'epoch': 0.41} 41%|████ | 1743/4286 [10:35:04<15:02:48, 21.30s/it] 41%|████ | 1744/4286 [10:35:24<14:41:05, 20.80s/it] {'loss': 0.0263, 'grad_norm': 13.032510996081523, 'learning_rate': 5.930937937470835e-07, 'completion_length': 197.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.4508928656578064, 'rewards/format_reward': 1.0, 'reward': 1.450892984867096, 'reward_std': 0.14879360795021057, 'kl': 0.658203125, 'epoch': 0.41} 41%|████ | 1744/4286 [10:35:24<14:41:05, 20.80s/it] 41%|████ | 1745/4286 [10:35:45<14:45:12, 20.90s/it] {'loss': 0.0189, 'grad_norm': 4.778753252657842, 'learning_rate': 5.928604759682687e-07, 'completion_length': 237.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.4687500447034836, 'rewards/format_reward': 1.0, 'reward': 1.4687500596046448, 'reward_std': 0.10005596652626991, 'kl': 0.47119140625, 'epoch': 0.41} 41%|████ | 1745/4286 [10:35:45<14:45:12, 20.90s/it] 41%|████ | 1746/4286 
[10:36:06<14:45:40, 20.92s/it] {'loss': 0.0158, 'grad_norm': 1.274044392788341, 'learning_rate': 5.92627158189454e-07, 'completion_length': 198.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5595238655805588, 'rewards/format_reward': 1.0, 'reward': 1.55952388048172, 'reward_std': 0.03134515322744846, 'kl': 0.39501953125, 'epoch': 0.41} 41%|████ | 1746/4286 [10:36:06<14:45:40, 20.92s/it] 41%|████ | 1747/4286 [10:36:28<15:02:26, 21.33s/it] {'loss': 0.0245, 'grad_norm': 1.6568712913031978, 'learning_rate': 5.923938404106393e-07, 'completion_length': 219.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6279762387275696, 'reward_std': 0.11079980060458183, 'kl': 0.60888671875, 'epoch': 0.41} 41%|████ | 1747/4286 [10:36:28<15:02:26, 21.33s/it] 41%|████ | 1748/4286 [10:36:52<15:37:09, 22.16s/it] {'loss': 0.0229, 'grad_norm': 2.2545850312929576, 'learning_rate': 5.921605226318245e-07, 'completion_length': 275.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.4654761999845505, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4476191401481628, 'reward_std': 0.1366935446858406, 'kl': 0.57373046875, 'epoch': 0.41} 41%|████ | 1748/4286 [10:36:52<15:37:09, 22.16s/it] 41%|████ | 1749/4286 [10:37:15<15:48:00, 22.42s/it] {'loss': 0.0465, 'grad_norm': 4.009404823198051, 'learning_rate': 5.919272048530097e-07, 'completion_length': 258.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5119048207998276, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4940478205680847, 'reward_std': 0.18600395694375038, 'kl': 1.16015625, 'epoch': 0.41} 41%|████ | 1749/4286 [10:37:15<15:48:00, 22.42s/it] 41%|████ | 1750/4286 [10:37:38<15:47:13, 22.41s/it] {'loss': 0.0248, 'grad_norm': 4.944154094388406, 'learning_rate': 5.916938870741949e-07, 'completion_length': 226.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.5241071879863739, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.5062501430511475, 'reward_std': 0.18676557391881943, 'kl': 0.619140625, 'epoch': 0.41} 41%|████ | 1750/4286 [10:37:38<15:47:13, 22.41s/it] 41%|████ | 1751/4286 [10:38:01<16:03:38, 22.81s/it] {'loss': 0.147, 'grad_norm': 7.137463171236603, 'learning_rate': 5.914605692953803e-07, 'completion_length': 283.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.357142984867096, 'reward_std': 0.2458983063697815, 'kl': 3.6796875, 'epoch': 0.41} 41%|████ | 1751/4286 [10:38:01<16:03:38, 22.81s/it] 41%|████ | 1752/4286 [10:38:27<16:34:12, 23.54s/it] {'loss': 0.1242, 'grad_norm': 6.129733726278144, 'learning_rate': 5.912272515165655e-07, 'completion_length': 226.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.4166667312383652, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3630953431129456, 'reward_std': 0.25682400166988373, 'kl': 3.109375, 'epoch': 0.41} 41%|████ | 1752/4286 [10:38:27<16:34:12, 23.54s/it] 41%|████ | 1753/4286 [10:38:50<16:33:00, 23.52s/it] {'loss': 0.2125, 'grad_norm': 5.097821286137508, 'learning_rate': 5.909939337377507e-07, 'completion_length': 259.3214416503906, 'rewards/only_full_func_accuracy_reward': 0.394940510392189, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3235119581222534, 'reward_std': 0.32934345304965973, 'kl': 5.3125, 'epoch': 0.41} 41%|████ | 1753/4286 [10:38:50<16:33:00, 23.52s/it] 41%|████ | 1754/4286 [10:39:17<17:08:05, 24.36s/it] {'loss': 0.172, 'grad_norm': 8.419205893618486, 'learning_rate': 5.90760615958936e-07, 'completion_length': 233.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.4029761850833893, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3136906027793884, 'reward_std': 0.29796332120895386, 'kl': 4.3046875, 'epoch': 0.41} 41%|████ | 1754/4286 [10:39:17<17:08:05, 24.36s/it] 41%|████ | 1755/4286 [10:39:42<17:16:57, 24.58s/it] {'loss': 0.1563, 'grad_norm': 9.375678152400098, 
'learning_rate': 5.905272981801213e-07, 'completion_length': 284.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.29672620445489883, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.2252976894378662, 'reward_std': 0.23004546761512756, 'kl': 3.8984375, 'epoch': 0.41} 41%|████ | 1755/4286 [10:39:42<17:16:57, 24.58s/it] 41%|████ | 1756/4286 [10:40:05<16:55:32, 24.08s/it] {'loss': 0.1241, 'grad_norm': 4.514830527299955, 'learning_rate': 5.902939804013065e-07, 'completion_length': 237.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.52976194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5119048357009888, 'reward_std': 0.19951868802309036, 'kl': 3.109375, 'epoch': 0.41} 41%|████ | 1756/4286 [10:40:05<16:55:32, 24.08s/it] 41%|████ | 1757/4286 [10:40:29<17:04:03, 24.30s/it] {'loss': 0.0477, 'grad_norm': 5.2337700829807785, 'learning_rate': 5.900606626224918e-07, 'completion_length': 277.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.4523809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4345239400863647, 'reward_std': 0.12859638035297394, 'kl': 1.1953125, 'epoch': 0.41} 41%|████ | 1757/4286 [10:40:29<17:04:03, 24.30s/it] 41%|████ | 1758/4286 [10:40:52<16:40:58, 23.76s/it] {'loss': 0.081, 'grad_norm': 10.441345596473221, 'learning_rate': 5.89827344843677e-07, 'completion_length': 196.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.543154776096344, 'rewards/format_reward': 1.0, 'reward': 1.5431548357009888, 'reward_std': 0.1561431661248207, 'kl': 2.03125, 'epoch': 0.41} 41%|████ | 1758/4286 [10:40:52<16:40:58, 23.76s/it] 41%|████ | 1759/4286 [10:41:16<16:51:28, 24.02s/it] {'loss': 0.0927, 'grad_norm': 5.955883500022998, 'learning_rate': 5.895940270648623e-07, 'completion_length': 228.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4166666865348816, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3630953431129456, 'reward_std': 0.20568817108869553, 'kl': 2.31640625, 
'epoch': 0.41} 41%|████ | 1759/4286 [10:41:16<16:51:28, 24.02s/it] 41%|████ | 1760/4286 [10:41:41<16:51:45, 24.03s/it] {'loss': 0.0645, 'grad_norm': 6.543296574737468, 'learning_rate': 5.893607092860475e-07, 'completion_length': 241.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5639881193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5461310744285583, 'reward_std': 0.20717743039131165, 'kl': 1.61328125, 'epoch': 0.41} 41%|████ | 1760/4286 [10:41:41<16:51:45, 24.03s/it]
[2025-03-02 15:49:22,593] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
41%|████ | 1761/4286 [10:42:07<17:18:32, 24.68s/it] {'loss': 0.1011, 'grad_norm': 4.68926857685287, 'learning_rate': 5.891273915072328e-07, 'completion_length': 243.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.463392898440361, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3919643759727478, 'reward_std': 0.25697287917137146, 'kl': 2.52734375, 'epoch': 0.41} 41%|████ | 1761/4286 [10:42:07<17:18:32, 24.68s/it]
[2025-03-02 15:49:46,859] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
41%|████ | 1762/4286 [10:42:31<17:12:55, 24.55s/it] {'loss': 0.0425, 'grad_norm': 2.676917990593286, 'learning_rate': 5.88894073728418e-07, 'completion_length': 238.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5535715222358704, 'reward_std': 0.18413107097148895, 'kl': 1.05859375, 'epoch': 0.41} 41%|████ | 1762/4286 [10:42:31<17:12:55, 24.55s/it] 41%|████ | 1763/4286 [10:42:58<17:38:05, 25.16s/it] {'loss': 0.0829, 'grad_norm': 6.013074237416256, 'learning_rate': 5.886607559496033e-07, 'completion_length': 267.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.4508928805589676, 'rewards/format_reward': 0.910714328289032, 'reward': 1.361607313156128, 'reward_std': 0.2870168387889862, 'kl': 2.07421875, 'epoch': 0.41} 41%|████ | 1763/4286 [10:42:58<17:38:05, 25.16s/it] 41%|████ | 1764/4286 [10:43:19<16:48:42, 24.00s/it] {'loss': 0.052, 'grad_norm': 4.262839898533073, 'learning_rate': 5.884274381707886e-07, 'completion_length': 207.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5461309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5282739400863647, 'reward_std': 0.13131877779960632, 'kl': 1.296875, 'epoch': 0.41} 41%|████ | 1764/4286 [10:43:19<16:48:42, 24.00s/it] 41%|████ | 1765/4286 [10:43:44<17:05:36, 24.41s/it] {'loss': 0.0775, 'grad_norm': 14.851593618213082, 'learning_rate': 5.881941203919738e-07, 'completion_length': 246.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.546301007270813, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4748725891113281, 'reward_std': 0.27969613298773766, 'kl': 1.9365234375, 'epoch': 0.41} 41%|████ | 1765/4286 [10:43:44<17:05:36, 24.41s/it] 41%|████ | 1766/4286 [10:44:07<16:40:50, 23.83s/it] {'loss':
0.0678, 'grad_norm': 2.606211756739994, 'learning_rate': 5.87960802613159e-07, 'completion_length': 227.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5907738208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5729167461395264, 'reward_std': 0.15400168299674988, 'kl': 1.69921875, 'epoch': 0.41} 41%|████ | 1766/4286 [10:44:07<16:40:50, 23.83s/it] 41%|████ | 1767/4286 [10:44:31<16:44:38, 23.93s/it] {'loss': 0.116, 'grad_norm': 5.815949234147338, 'learning_rate': 5.877274848343444e-07, 'completion_length': 224.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.3630952686071396, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3095239400863647, 'reward_std': 0.18295925110578537, 'kl': 2.90625, 'epoch': 0.41} 41%|████ | 1767/4286 [10:44:31<16:44:38, 23.93s/it] 41%|████▏ | 1768/4286 [10:44:55<16:50:34, 24.08s/it] {'loss': 0.1391, 'grad_norm': 7.474094447007085, 'learning_rate': 5.874941670555296e-07, 'completion_length': 235.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.5479167401790619, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.476488173007965, 'reward_std': 0.19131508097052574, 'kl': 3.4765625, 'epoch': 0.41} 41%|████▏ | 1768/4286 [10:44:55<16:50:34, 24.08s/it] 41%|████▏ | 1769/4286 [10:45:21<17:07:16, 24.49s/it] {'loss': 0.079, 'grad_norm': 2.2562293173868464, 'learning_rate': 5.872608492767148e-07, 'completion_length': 250.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.4583333879709244, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3869048953056335, 'reward_std': 0.2857804149389267, 'kl': 1.9765625, 'epoch': 0.41} 41%|████▏ | 1769/4286 [10:45:21<17:07:16, 24.49s/it] 41%|████▏ | 1770/4286 [10:45:42<16:26:50, 23.53s/it] {'loss': 0.1125, 'grad_norm': 3.5198312918510335, 'learning_rate': 5.870275314979e-07, 'completion_length': 205.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5580357611179352, 'rewards/format_reward': 0.9285714626312256, 'reward': 
1.4866072535514832, 'reward_std': 0.31222885847091675, 'kl': 2.8203125, 'epoch': 0.41} 41%|████▏ | 1770/4286 [10:45:42<16:26:50, 23.53s/it] 41%|████▏ | 1771/4286 [10:46:06<16:34:18, 23.72s/it] {'loss': 0.0571, 'grad_norm': 6.853935652494081, 'learning_rate': 5.867942137190854e-07, 'completion_length': 256.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5877977013587952, 'reward_std': 0.1096916925162077, 'kl': 1.427734375, 'epoch': 0.41} 41%|████▏ | 1771/4286 [10:46:06<16:34:18, 23.72s/it] 41%|████▏ | 1772/4286 [10:46:28<16:04:56, 23.03s/it] {'loss': 0.0641, 'grad_norm': 5.184841993647961, 'learning_rate': 5.865608959402706e-07, 'completion_length': 204.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5327381491661072, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4791668057441711, 'reward_std': 0.23052629828453064, 'kl': 1.60546875, 'epoch': 0.41} 41%|████▏ | 1772/4286 [10:46:28<16:04:56, 23.03s/it] 41%|████▏ | 1773/4286 [10:46:52<16:20:46, 23.42s/it] {'loss': 0.068, 'grad_norm': 7.924907272106295, 'learning_rate': 5.863275781614558e-07, 'completion_length': 246.08930206298828, 'rewards/only_full_func_accuracy_reward': 0.5818452835083008, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5461310744285583, 'reward_std': 0.17848382145166397, 'kl': 1.6953125, 'epoch': 0.41} 41%|████▏ | 1773/4286 [10:46:52<16:20:46, 23.42s/it] 41%|████▏ | 1774/4286 [10:47:16<16:23:56, 23.50s/it] {'loss': 0.152, 'grad_norm': 4.3444319145210875, 'learning_rate': 5.860942603826411e-07, 'completion_length': 232.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.4211309999227524, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3497024774551392, 'reward_std': 0.33045804500579834, 'kl': 3.796875, 'epoch': 0.41} 41%|████▏ | 1774/4286 [10:47:16<16:23:56, 23.50s/it] 41%|████▏ | 1775/4286 [10:47:38<16:09:15, 23.16s/it] {'loss': 0.0623, 'grad_norm': 2.625548651945492, 
'learning_rate': 5.858609426038263e-07, 'completion_length': 252.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.602678656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5848215818405151, 'reward_std': 0.1918574571609497, 'kl': 1.55078125, 'epoch': 0.41} 41%|████▏ | 1775/4286 [10:47:38<16:09:15, 23.16s/it] 41%|████▏ | 1776/4286 [10:48:02<16:16:22, 23.34s/it] {'loss': 0.1111, 'grad_norm': 4.853169428438104, 'learning_rate': 5.856276248250116e-07, 'completion_length': 250.33930206298828, 'rewards/only_full_func_accuracy_reward': 0.5461309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.52827388048172, 'reward_std': 0.14632681757211685, 'kl': 2.78125, 'epoch': 0.41} 41%|████▏ | 1776/4286 [10:48:02<16:16:22, 23.34s/it] 41%|████▏ | 1777/4286 [10:48:27<16:33:51, 23.77s/it] {'loss': 0.1034, 'grad_norm': 3.9004826683537903, 'learning_rate': 5.853943070461969e-07, 'completion_length': 185.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.5803571939468384, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5267858505249023, 'reward_std': 0.2202381044626236, 'kl': 2.5859375, 'epoch': 0.41} 41%|████▏ | 1777/4286 [10:48:27<16:33:51, 23.77s/it] 41%|████▏ | 1778/4286 [10:48:53<17:05:02, 24.52s/it] {'loss': 0.1184, 'grad_norm': 4.354232666161107, 'learning_rate': 5.851609892673821e-07, 'completion_length': 279.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.458333358168602, 'rewards/format_reward': 0.910714328289032, 'reward': 1.36904776096344, 'reward_std': 0.24478784948587418, 'kl': 2.953125, 'epoch': 0.41} 41%|████▏ | 1778/4286 [10:48:53<17:05:02, 24.52s/it] 42%|████▏ | 1779/4286 [10:49:18<17:08:10, 24.61s/it] {'loss': 0.0951, 'grad_norm': 3.67970913536624, 'learning_rate': 5.849276714885673e-07, 'completion_length': 253.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4285715222358704, 'reward_std': 0.2964356392621994, 
'kl': 2.37890625, 'epoch': 0.42} 42%|████▏ | 1779/4286 [10:49:18<17:08:10, 24.61s/it]
[2025-03-02 15:57:00,047] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
42%|████▏ | 1780/4286 [10:49:44<17:32:19, 25.20s/it] {'loss': 0.076, 'grad_norm': 3.137095089241962, 'learning_rate': 5.846943537097527e-07, 'completion_length': 262.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.4017857611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3660715222358704, 'reward_std': 0.16942918300628662, 'kl': 1.90234375, 'epoch': 0.42} 42%|████▏ | 1780/4286 [10:49:44<17:32:19, 25.20s/it] 42%|████▏ | 1781/4286 [10:50:09<17:21:20, 24.94s/it] {'loss': 0.037, 'grad_norm': 4.181876471960299, 'learning_rate': 5.844610359309379e-07, 'completion_length': 200.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5818452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.563988208770752, 'reward_std': 0.07530550099909306, 'kl': 0.92578125, 'epoch': 0.42} 42%|████▏ | 1781/4286 [10:50:09<17:21:20, 24.94s/it] 42%|████▏ | 1782/4286 [10:50:32<17:00:05, 24.44s/it] {'loss': 0.0916, 'grad_norm': 4.774113392628885, 'learning_rate': 5.842277181521231e-07, 'completion_length': 215.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.537202388048172, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.501488208770752, 'reward_std': 0.13190627843141556, 'kl': 2.2890625, 'epoch': 0.42} 42%|████▏ | 1782/4286 [10:50:32<17:00:05, 24.44s/it] 42%|████▏ | 1783/4286 [10:50:54<16:26:43, 23.65s/it] {'loss': 0.0232, 'grad_norm': 8.073674584225738, 'learning_rate':
5.839944003733083e-07, 'completion_length': 214.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5327381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5148810744285583, 'reward_std': 0.07142857648432255, 'kl': 0.580078125, 'epoch': 0.42} 42%|████▏ | 1783/4286 [10:50:54<16:26:43, 23.65s/it] 42%|████▏ | 1784/4286 [10:51:16<16:08:56, 23.24s/it] {'loss': 0.0355, 'grad_norm': 1.6929339548367046, 'learning_rate': 5.837610825944937e-07, 'completion_length': 264.3393020629883, 'rewards/only_full_func_accuracy_reward': 0.6577381789684296, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.0763673186302185, 'kl': 0.884765625, 'epoch': 0.42} 42%|████▏ | 1784/4286 [10:51:16<16:08:56, 23.24s/it] 42%|████▏ | 1785/4286 [10:51:38<16:00:00, 23.03s/it] {'loss': 0.04, 'grad_norm': 1.8161460819819346, 'learning_rate': 5.835277648156789e-07, 'completion_length': 225.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.674107164144516, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.13258038461208344, 'kl': 1.001953125, 'epoch': 0.42} 42%|████▏ | 1785/4286 [10:51:38<16:00:00, 23.03s/it]
[2025-03-02 15:59:16,797] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
42%|████▏ | 1786/4286 [10:52:01<15:52:55, 22.87s/it] {'loss': 0.0098, 'grad_norm': 1.413273987947194, 'learning_rate': 5.832944470368641e-07, 'completion_length': 275.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6339287161827087, 'reward_std': 0.13318806886672974, 'kl': 0.2451171875, 'epoch': 0.42} 42%|████▏ | 1786/4286 [10:52:01<15:52:55, 22.87s/it] 42%|████▏ | 1787/4286 [10:52:24<15:54:36, 22.92s/it] {'loss': 0.0659, 'grad_norm': 1.6918200781333934, 'learning_rate': 5.830611292580494e-07, 'completion_length': 226.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.5473214685916901, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.51160728931427, 'reward_std': 0.1465926617383957, 'kl': 1.6474609375, 'epoch': 0.42} 42%|████▏ | 1787/4286 [10:52:24<15:54:36, 22.92s/it] 42%|████▏ | 1788/4286 [10:52:47<15:59:26, 23.04s/it] {'loss': 0.0396, 'grad_norm': 1.3037399610582727, 'learning_rate': 5.828278114792347e-07, 'completion_length': 267.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.4166666865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3988096117973328, 'reward_std': 0.09236079268157482, 'kl': 0.98876953125, 'epoch': 0.42} 42%|████▏ | 1788/4286 [10:52:47<15:59:26, 23.04s/it] 42%|████▏ | 1789/4286 [10:53:11<16:04:28, 23.18s/it] {'loss': 0.0263, 'grad_norm': 1.002467392999254, 'learning_rate': 5.825944937004199e-07, 'completion_length': 274.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6770833432674408, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.05495268478989601, 'kl': 0.65673828125, 'epoch': 0.42} 42%|████▏ | 1789/4286 [10:53:11<16:04:28, 23.18s/it] 42%|████▏ | 1790/4286 [10:53:32<15:45:12, 22.72s/it] {'loss': 0.074,
'grad_norm': 2.0067995677174046, 'learning_rate': 5.823611759216052e-07, 'completion_length': 229.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.6473214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6294643878936768, 'reward_std': 0.12202381528913975, 'kl': 1.84765625, 'epoch': 0.42} 42%|████▏ | 1790/4286 [10:53:32<15:45:12, 22.72s/it]
[2025-03-02 16:01:13,665] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
42%|████▏ | 1791/4286 [10:53:58<16:17:40, 23.51s/it] {'loss': 0.0522, 'grad_norm': 2.933305545056111, 'learning_rate': 5.821278581427904e-07, 'completion_length': 302.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.6071429252624512, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.08121156692504883, 'kl': 1.3046875, 'epoch': 0.42} 42%|████▏ | 1791/4286 [10:53:58<16:17:40, 23.51s/it] 42%|████▏ | 1792/4286 [10:54:23<16:35:38, 23.95s/it] {'loss': 0.0341, 'grad_norm': 4.474713896933233, 'learning_rate': 5.818945403639757e-07, 'completion_length': 267.3928756713867, 'rewards/only_full_func_accuracy_reward': 0.5358383059501648, 'rewards/format_reward': 1.0, 'reward': 1.5358383655548096, 'reward_std': 0.10696351155638695, 'kl': 0.8515625, 'epoch': 0.42} 42%|████▏ | 1792/4286 [10:54:23<16:35:38, 23.95s/it] 42%|████▏ | 1793/4286 [10:54:49<16:59:16, 24.53s/it] {'loss': 0.0488, 'grad_norm': 3.4359262992574124, 'learning_rate': 5.816612225851609e-07, 'completion_length': 257.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5836309939622879, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5479167699813843, 'reward_std': 0.15666262991726398, 'kl': 1.2158203125, 'epoch': 0.42} 42%|████▏ | 1793/4286 [10:54:49<16:59:16, 24.53s/it]
[2025-03-02 16:02:32,425] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
42%|████▏ | 1794/4286 [10:55:17<17:40:46, 25.54s/it] {'loss': 0.0152, 'grad_norm': 3.893877011012369, 'learning_rate': 5.814279048063462e-07, 'completion_length': 298.2321472167969, 'rewards/only_full_func_accuracy_reward': 0.4895833432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4717262983322144, 'reward_std': 0.0863095261156559, 'kl': 0.3798828125, 'epoch': 0.42} 42%|████▏ | 1794/4286 [10:55:17<17:40:46, 25.54s/it]
[2025-03-02 16:03:01,275] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
42%|████▏ | 1795/4286 [10:55:45<18:21:35, 26.53s/it] {'loss': 0.061, 'grad_norm': 3.556091929702691, 'learning_rate': 5.811945870275314e-07, 'completion_length': 301.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.4761905074119568, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4226191639900208, 'reward_std': 0.16212860494852066, 'kl': 1.5234375, 'epoch': 0.42} 42%|████▏ | 1795/4286 [10:55:45<18:21:35, 26.53s/it]
[2025-03-02 16:03:29,711] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
42%|████▏ | 1796/4286 [10:56:14<18:44:49, 27.10s/it] {'loss': 0.0294, 'grad_norm': 13.177557297579964, 'learning_rate': 5.809612692487166e-07, 'completion_length': 342.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.4672619551420212, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3779762387275696, 'reward_std': 0.11368752829730511, 'kl': 0.734375, 'epoch': 0.42} 42%|████▏ | 1796/4286 [10:56:14<18:44:49, 27.10s/it]
[2025-03-02 16:03:59,743] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 42%|████▏ | 1797/4286 [10:56:44<19:20:48, 27.98s/it] {'loss': 0.0108, 'grad_norm': 5.795429919023441, 'learning_rate': 5.80727951469902e-07, 'completion_length': 375.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.4032738357782364, 'rewards/format_reward': 0.8571429252624512, 'reward': 1.2604167461395264, 'reward_std': 0.3647748678922653, 'kl': 0.26953125, 'epoch': 0.42} 42%|████▏ | 1797/4286 [10:56:44<19:20:48, 27.98s/it][2025-03-02 16:04:28,212] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 42%|████▏ | 1798/4286 [10:57:12<19:26:23, 28.13s/it] {'loss': 0.0178, 'grad_norm': 3.284685292956116, 'learning_rate': 5.804946336910872e-07, 'completion_length': 335.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.4293155074119568, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3400298357009888, 'reward_std': 0.26384104788303375, 'kl': 0.44482421875, 'epoch': 0.42} 42%|████▏ | 1798/4286 [10:57:12<19:26:23, 28.13s/it] 42%|████▏ | 1799/4286 [10:57:39<19:06:39, 27.66s/it] {'loss': 0.0092, 'grad_norm': 0.877572595350969, 'learning_rate': 5.802613159122724e-07, 'completion_length': 283.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.6681548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6681548357009888, 'reward_std': 0.0267857164144516, 'kl': 0.2294921875, 'epoch': 0.42} 42%|████▏ | 1799/4286 [10:57:39<19:06:39, 27.66s/it] 42%|████▏ | 1800/4286 
[10:58:05<18:49:21, 27.26s/it] {'loss': 0.031, 'grad_norm': 5.30631243798744, 'learning_rate': 5.800279981334577e-07, 'completion_length': 355.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.5291667133569717, 'rewards/format_reward': 1.0, 'reward': 1.5291667580604553, 'reward_std': 0.15217016264796257, 'kl': 0.775390625, 'epoch': 0.42} 42%|████▏ | 1800/4286 [10:58:05<18:49:21, 27.26s/it] 42%|████▏ | 1801/4286 [11:02:07<63:16:14, 91.66s/it] {'loss': 0.0344, 'grad_norm': 2.920832935701216, 'learning_rate': 5.79794680354643e-07, 'completion_length': 298.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5403770357370377, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5046627521514893, 'reward_std': 0.14851242117583752, 'kl': 0.8564453125, 'epoch': 0.42} 42%|████▏ | 1801/4286 [11:02:07<63:16:14, 91.66s/it] 42%|████▏ | 1802/4286 [11:02:32<49:26:19, 71.65s/it] {'loss': 0.0439, 'grad_norm': 2.845153426548898, 'learning_rate': 5.795613625758282e-07, 'completion_length': 300.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6949405670166016, 'reward_std': 0.11607143748551607, 'kl': 1.09716796875, 'epoch': 0.42} 42%|████▏ | 1802/4286 [11:02:32<49:26:19, 71.65s/it][2025-03-02 16:10:16,068] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 42%|████▏ | 1803/4286 [11:03:00<40:24:07, 58.58s/it] {'loss': 0.0549, 'grad_norm': 1464.5116689457493, 'learning_rate': 5.793280447970135e-07, 'completion_length': 357.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.52976194024086, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4583334922790527, 'reward_std': 0.28063084185123444, 'kl': 1.37109375, 'epoch': 0.42} 42%|████▏ | 1803/4286 [11:03:00<40:24:07, 58.58s/it][2025-03-02 16:10:44,012] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 42%|████▏ | 1804/4286 [11:03:28<34:02:59, 49.39s/it] {'loss': 0.0115, 'grad_norm': 2.6588354558966656, 'learning_rate': 5.790947270181987e-07, 'completion_length': 402.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.4303571879863739, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3767858147621155, 'reward_std': 0.1826893538236618, 'kl': 0.28759765625, 'epoch': 0.42} 42%|████▏ | 1804/4286 [11:03:28<34:02:59, 49.39s/it] 42%|████▏ | 1805/4286 [11:03:56<29:30:48, 42.82s/it] {'loss': 0.0124, 'grad_norm': 2.674496079499481, 'learning_rate': 5.78861409239384e-07, 'completion_length': 402.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.41428573429584503, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3964287042617798, 'reward_std': 0.105195302516222, 'kl': 0.310546875, 'epoch': 0.42} 42%|████▏ | 1805/4286 [11:03:56<29:30:48, 42.82s/it] 42%|████▏ | 
1806/4286 [11:04:24<26:33:20, 38.55s/it] {'loss': 0.007, 'grad_norm': 1.2580744138329338, 'learning_rate': 5.786280914605692e-07, 'completion_length': 363.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.1246555931866169, 'kl': 0.17578125, 'epoch': 0.42} 42%|████▏ | 1806/4286 [11:04:24<26:33:20, 38.55s/it] 42%|████▏ | 1807/4286 [11:04:52<24:25:15, 35.46s/it] {'loss': 0.117, 'grad_norm': 268.75891158479635, 'learning_rate': 5.783947736817545e-07, 'completion_length': 416.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.4434524029493332, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4255953431129456, 'reward_std': 0.07922262884676456, 'kl': 2.9296875, 'epoch': 0.42} 42%|████▏ | 1807/4286 [11:04:52<24:25:15, 35.46s/it] 42%|████▏ | 1808/4286 [11:05:21<22:58:08, 33.37s/it] {'loss': 0.0174, 'grad_norm': 4.60566827472959, 'learning_rate': 5.781614559029397e-07, 'completion_length': 433.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.4823129326105118, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3930273056030273, 'reward_std': 0.3255451023578644, 'kl': 0.43359375, 'epoch': 0.42} 42%|████▏ | 1808/4286 [11:05:21<22:58:08, 33.37s/it] 42%|████▏ | 1809/4286 [11:05:49<21:52:59, 31.80s/it] {'loss': 0.0079, 'grad_norm': 1.2792278268196347, 'learning_rate': 5.77928138124125e-07, 'completion_length': 392.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.4779762327671051, 'rewards/format_reward': 1.0, 'reward': 1.4779762029647827, 'reward_std': 0.05119401961565018, 'kl': 0.19873046875, 'epoch': 0.42} 42%|████▏ | 1809/4286 [11:05:49<21:52:59, 31.80s/it][2025-03-02 16:13:33,309] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. 
if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 42%|████▏ | 1810/4286 [11:06:17<21:09:14, 30.76s/it] {'loss': 0.0546, 'grad_norm': 8.245010938141297, 'learning_rate': 5.776948203453103e-07, 'completion_length': 340.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.5189201384782791, 'rewards/format_reward': 1.0, 'reward': 1.5189201831817627, 'reward_std': 0.09242865722626448, 'kl': 1.3671875, 'epoch': 0.42} 42%|████▏ | 1810/4286 [11:06:17<21:09:14, 30.76s/it] 42%|████▏ | 1811/4286 [11:06:43<20:06:04, 29.24s/it] {'loss': 0.0424, 'grad_norm': 3.5520777569024466, 'learning_rate': 5.774615025664955e-07, 'completion_length': 281.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.59226194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5744049549102783, 'reward_std': 0.08968471176922321, 'kl': 1.06005859375, 'epoch': 0.42} 42%|████▏ | 1811/4286 [11:06:43<20:06:04, 29.24s/it] 42%|████▏ | 1812/4286 [11:07:09<19:17:57, 28.08s/it] {'loss': 0.0264, 'grad_norm': 2.977721440344482, 'learning_rate': 5.772281847876807e-07, 'completion_length': 333.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.5174320340156555, 'rewards/format_reward': 1.0, 'reward': 1.5174320936203003, 'reward_std': 0.07373938523232937, 'kl': 0.658203125, 'epoch': 0.42} 42%|████▏ | 1812/4286 [11:07:09<19:17:57, 28.08s/it][2025-03-02 16:14:50,838] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 42%|████▏ | 1813/4286 [11:07:35<18:57:15, 27.59s/it] {'loss': 0.0425, 'grad_norm': 3.9942459215826913, 'learning_rate': 5.769948670088661e-07, 'completion_length': 305.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5000000447034836, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.15341370925307274, 'kl': 1.06640625, 'epoch': 0.42} 42%|████▏ | 1813/4286 [11:07:35<18:57:15, 27.59s/it] 42%|████▏ | 1814/4286 [11:08:01<18:37:20, 27.12s/it] {'loss': 0.0257, 'grad_norm': 14.472943683303333, 'learning_rate': 5.767615492300513e-07, 'completion_length': 303.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.42023812234401703, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4023810029029846, 'reward_std': 0.11831150949001312, 'kl': 0.6416015625, 'epoch': 0.42} 42%|████▏ | 1814/4286 [11:08:01<18:37:20, 27.12s/it] 42%|████▏ | 1815/4286 [11:08:26<18:10:24, 26.48s/it] {'loss': 0.0235, 'grad_norm': 6.941986552686446, 'learning_rate': 5.765282314512365e-07, 'completion_length': 269.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.49375002086162567, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.440178632736206, 'reward_std': 0.17450794205069542, 'kl': 0.583984375, 'epoch': 0.42} 42%|████▏ | 1815/4286 [11:08:26<18:10:24, 26.48s/it] 42%|████▏ | 1816/4286 [11:08:50<17:34:26, 25.61s/it] {'loss': 0.0496, 'grad_norm': 10.916108358452075, 'learning_rate': 5.762949136724217e-07, 'completion_length': 284.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5395833849906921, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.503869116306305, 'reward_std': 0.19422391802072525, 'kl': 1.236328125, 'epoch': 0.42} 42%|████▏ | 1816/4286 [11:08:50<17:34:26, 25.61s/it] 42%|████▏ | 1817/4286 [11:09:14<17:22:44, 
25.34s/it] {'loss': 0.0416, 'grad_norm': 3.883674477246182, 'learning_rate': 5.760615958936071e-07, 'completion_length': 302.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.5209686458110809, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.503111481666565, 'reward_std': 0.14888394251465797, 'kl': 1.0458984375, 'epoch': 0.42} 42%|████▏ | 1817/4286 [11:09:14<17:22:44, 25.34s/it] 42%|████▏ | 1818/4286 [11:09:40<17:27:10, 25.46s/it] {'loss': 0.0481, 'grad_norm': 6.5485427141349755, 'learning_rate': 5.758282781147923e-07, 'completion_length': 275.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.5580357313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5223215222358704, 'reward_std': 0.1791759394109249, 'kl': 1.203125, 'epoch': 0.42} 42%|████▏ | 1818/4286 [11:09:40<17:27:10, 25.46s/it] 42%|████▏ | 1819/4286 [11:10:06<17:37:44, 25.73s/it] {'loss': 0.086, 'grad_norm': 16.30040855574747, 'learning_rate': 5.755949603359775e-07, 'completion_length': 289.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.4297619163990021, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4119048714637756, 'reward_std': 0.15237760171294212, 'kl': 2.1484375, 'epoch': 0.42} 42%|████▏ | 1819/4286 [11:10:06<17:37:44, 25.73s/it][2025-03-02 16:17:48,417] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 42%|████▏ | 1820/4286 [11:10:33<17:43:10, 25.87s/it] {'loss': 0.0282, 'grad_norm': 14.870491241245954, 'learning_rate': 5.753616425571628e-07, 'completion_length': 330.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.37247027456760406, 'rewards/format_reward': 1.0, 'reward': 1.3724703192710876, 'reward_std': 0.1558115854859352, 'kl': 0.7041015625, 'epoch': 0.42} 42%|████▏ | 1820/4286 [11:10:33<17:43:10, 25.87s/it] 42%|████▏ | 1821/4286 [11:10:57<17:22:34, 25.38s/it] {'loss': 0.0191, 'grad_norm': 3.5934038479147805, 'learning_rate': 5.75128324778348e-07, 'completion_length': 267.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.5476190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5297620296478271, 'reward_std': 0.13379148207604885, 'kl': 0.4775390625, 'epoch': 0.42} 42%|████▏ | 1821/4286 [11:10:57<17:22:34, 25.38s/it] 43%|████▎ | 1822/4286 [11:11:22<17:15:38, 25.22s/it] {'loss': 0.038, 'grad_norm': 9.892656721824341, 'learning_rate': 5.748950069995333e-07, 'completion_length': 286.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.46726194024086, 'rewards/format_reward': 1.0, 'reward': 1.4672620296478271, 'reward_std': 0.10898454114794731, 'kl': 0.947265625, 'epoch': 0.43} 43%|████▎ | 1822/4286 [11:11:22<17:15:38, 25.22s/it] 43%|████▎ | 1823/4286 [11:11:47<17:17:11, 25.27s/it] {'loss': 0.0261, 'grad_norm': 3.7270791591245107, 'learning_rate': 5.746616892207186e-07, 'completion_length': 259.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.553146243095398, 'rewards/format_reward': 1.0, 'reward': 1.5531463623046875, 'reward_std': 0.08121174573898315, 'kl': 0.654296875, 'epoch': 0.43} 43%|████▎ | 1823/4286 [11:11:47<17:17:11, 25.27s/it] 43%|████▎ | 1824/4286 [11:12:13<17:28:42, 25.56s/it] {'loss': 0.0424, 'grad_norm': 
79.91279295140222, 'learning_rate': 5.744283714419038e-07, 'completion_length': 293.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.4355158805847168, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3998017311096191, 'reward_std': 0.16664674878120422, 'kl': 1.060546875, 'epoch': 0.43} 43%|████▎ | 1824/4286 [11:12:13<17:28:42, 25.56s/it] 43%|████▎ | 1825/4286 [11:12:43<18:18:27, 26.78s/it] {'loss': 0.031, 'grad_norm': 3.9255465577851276, 'learning_rate': 5.74195053663089e-07, 'completion_length': 345.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.4151785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3973215818405151, 'reward_std': 0.18493790552020073, 'kl': 0.7744140625, 'epoch': 0.43} 43%|████▎ | 1825/4286 [11:12:43<18:18:27, 26.78s/it] 43%|████▎ | 1826/4286 [11:13:09<18:06:13, 26.49s/it] {'loss': 0.0229, 'grad_norm': 7.56527088308257, 'learning_rate': 5.739617358842744e-07, 'completion_length': 254.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5601615905761719, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.542304515838623, 'reward_std': 0.12277435883879662, 'kl': 0.57275390625, 'epoch': 0.43} 43%|████▎ | 1826/4286 [11:13:09<18:06:13, 26.49s/it][2025-03-02 16:20:51,336] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 43%|████▎ | 1827/4286 [11:13:35<18:09:07, 26.58s/it] {'loss': 0.0983, 'grad_norm': 2.7062452965184485, 'learning_rate': 5.737284181054596e-07, 'completion_length': 267.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.4435877054929733, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4257306456565857, 'reward_std': 0.1614554412662983, 'kl': 2.453125, 'epoch': 0.43} 43%|████▎ | 1827/4286 [11:13:35<18:09:07, 26.58s/it] 43%|████▎ | 1828/4286 [11:14:04<18:35:43, 27.23s/it] {'loss': 0.0364, 'grad_norm': 5.202482202650252, 'learning_rate': 5.734951003266448e-07, 'completion_length': 328.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5898809731006622, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.572023868560791, 'reward_std': 0.1345054917037487, 'kl': 0.91015625, 'epoch': 0.43} 43%|████▎ | 1828/4286 [11:14:04<18:35:43, 27.23s/it] 43%|████▎ | 1829/4286 [11:14:31<18:28:51, 27.08s/it] {'loss': 0.0371, 'grad_norm': 7.672895248301131, 'learning_rate': 5.7326178254783e-07, 'completion_length': 295.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.4880952835083008, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4523810744285583, 'reward_std': 0.20655777677893639, 'kl': 0.921875, 'epoch': 0.43} 43%|████▎ | 1829/4286 [11:14:31<18:28:51, 27.08s/it] 43%|████▎ | 1830/4286 [11:14:59<18:35:12, 27.24s/it] {'loss': 0.0299, 'grad_norm': 2.02620413885829, 'learning_rate': 5.730284647690154e-07, 'completion_length': 272.9107360839844, 'rewards/only_full_func_accuracy_reward': 0.4508928805589676, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3794643878936768, 'reward_std': 0.19708716869354248, 'kl': 0.7451171875, 'epoch': 0.43} 43%|████▎ | 1830/4286 [11:14:59<18:35:12, 27.24s/it] 43%|████▎ | 1831/4286 [11:15:25<18:27:16, 27.06s/it] 
{'loss': 0.02, 'grad_norm': 3.0098055385246343, 'learning_rate': 5.727951469902006e-07, 'completion_length': 240.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.482142984867096, 'reward_std': 0.191608726978302, 'kl': 0.5, 'epoch': 0.43} 43%|████▎ | 1831/4286 [11:15:25<18:27:16, 27.06s/it] 43%|████▎ | 1832/4286 [11:15:52<18:18:33, 26.86s/it] {'loss': 0.0179, 'grad_norm': 3.7428769294530078, 'learning_rate': 5.725618292113858e-07, 'completion_length': 241.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6002976596355438, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5824406147003174, 'reward_std': 0.1936141923069954, 'kl': 0.44873046875, 'epoch': 0.43} 43%|████▎ | 1832/4286 [11:15:52<18:18:33, 26.86s/it] 43%|████▎ | 1833/4286 [11:16:18<18:18:18, 26.86s/it] {'loss': 0.0533, 'grad_norm': 2.8585743804255657, 'learning_rate': 5.723285114325711e-07, 'completion_length': 257.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.4970238506793976, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4255953431129456, 'reward_std': 0.15851253643631935, 'kl': 1.33154296875, 'epoch': 0.43} 43%|████▎ | 1833/4286 [11:16:18<18:18:18, 26.86s/it][2025-03-02 16:24:02,775] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 43%|████▎ | 1834/4286 [11:16:47<18:36:56, 27.33s/it] {'loss': 0.044, 'grad_norm': 2.8147133328895872, 'learning_rate': 5.720951936537564e-07, 'completion_length': 263.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.4898809790611267, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4720239639282227, 'reward_std': 0.12296219542622566, 'kl': 1.099609375, 'epoch': 0.43} 43%|████▎ | 1834/4286 [11:16:47<18:36:56, 27.33s/it][2025-03-02 16:24:29,263] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 43%|████▎ | 1835/4286 [11:17:13<18:26:09, 27.08s/it] {'loss': 0.058, 'grad_norm': 5.983878501176318, 'learning_rate': 5.718618758749416e-07, 'completion_length': 275.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.5595238208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5416667461395264, 'reward_std': 0.16682060062885284, 'kl': 1.451171875, 'epoch': 0.43} 43%|████▎ | 1835/4286 [11:17:13<18:26:09, 27.08s/it] 43%|████▎ | 1836/4286 [11:17:41<18:30:57, 27.21s/it] {'loss': 0.0107, 'grad_norm': 2.0741241675699746, 'learning_rate': 5.716285580961269e-07, 'completion_length': 289.9464416503906, 'rewards/only_full_func_accuracy_reward': 0.5422619730234146, 'rewards/format_reward': 1.0, 'reward': 1.5422620177268982, 'reward_std': 0.05413147993385792, 'kl': 0.2666015625, 'epoch': 0.43} 43%|████▎ | 1836/4286 [11:17:41<18:30:57, 27.21s/it] 43%|████▎ | 1837/4286 
[11:18:07<18:18:06, 26.90s/it] {'loss': 0.0262, 'grad_norm': 4.8599780481599, 'learning_rate': 5.713952403173121e-07, 'completion_length': 227.96430206298828, 'rewards/only_full_func_accuracy_reward': 0.4315476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4136905670166016, 'reward_std': 0.12185035087168217, 'kl': 0.65576171875, 'epoch': 0.43} 43%|████▎ | 1837/4286 [11:18:07<18:18:06, 26.90s/it][2025-03-02 16:25:50,146] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 43%|████▎ | 1838/4286 [11:18:34<18:21:03, 26.99s/it] {'loss': 0.0111, 'grad_norm': 3.959931679130942, 'learning_rate': 5.711619225384974e-07, 'completion_length': 234.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.56101194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5431548357009888, 'reward_std': 0.1160714328289032, 'kl': 0.279296875, 'epoch': 0.43} 43%|████▎ | 1838/4286 [11:18:34<18:21:03, 26.99s/it] 43%|████▎ | 1839/4286 [11:19:01<18:11:44, 26.77s/it] {'loss': 0.0297, 'grad_norm': 3.6635713223803674, 'learning_rate': 5.709286047596826e-07, 'completion_length': 235.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.6955357491970062, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.677678644657135, 'reward_std': 0.14886992424726486, 'kl': 0.7431640625, 'epoch': 0.43} 43%|████▎ | 1839/4286 [11:19:01<18:11:44, 26.77s/it] 43%|████▎ | 1840/4286 [11:19:26<17:56:25, 26.40s/it] {'loss': 0.0243, 'grad_norm': 2.3525971012097693, 'learning_rate': 5.706952869808679e-07, 'completion_length': 239.33929443359375, 'rewards/only_full_func_accuracy_reward': 
0.5758928656578064, 'rewards/format_reward': 1.0, 'reward': 1.5758929252624512, 'reward_std': 0.06969274766743183, 'kl': 0.61083984375, 'epoch': 0.43} 43%|████▎ | 1840/4286 [11:19:26<17:56:25, 26.40s/it] 43%|████▎ | 1841/4286 [11:19:52<17:47:38, 26.20s/it] {'loss': 0.0156, 'grad_norm': 101.59283812293735, 'learning_rate': 5.704619692020531e-07, 'completion_length': 260.0535888671875, 'rewards/only_full_func_accuracy_reward': 0.4449405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.427083432674408, 'reward_std': 0.13761193305253983, 'kl': 0.39208984375, 'epoch': 0.43} 43%|████▎ | 1841/4286 [11:19:52<17:47:38, 26.20s/it] 43%|████▎ | 1842/4286 [11:20:21<18:28:45, 27.22s/it] {'loss': 0.0245, 'grad_norm': 1.79990918754904, 'learning_rate': 5.702286514232384e-07, 'completion_length': 303.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.4985119700431824, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4449406266212463, 'reward_std': 0.15472646057605743, 'kl': 0.61328125, 'epoch': 0.43} 43%|████▎ | 1842/4286 [11:20:21<18:28:45, 27.22s/it] 43%|████▎ | 1843/4286 [11:20:49<18:31:15, 27.29s/it] {'loss': 0.0505, 'grad_norm': 13.79761799241522, 'learning_rate': 5.699953336444237e-07, 'completion_length': 237.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.5722069591283798, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5543498992919922, 'reward_std': 0.13512505032122135, 'kl': 1.26513671875, 'epoch': 0.43} 43%|████▎ | 1843/4286 [11:20:49<18:31:15, 27.29s/it] 43%|████▎ | 1844/4286 [11:21:14<18:08:09, 26.74s/it] {'loss': 0.0118, 'grad_norm': 1.850810203360984, 'learning_rate': 5.697620158656089e-07, 'completion_length': 244.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.5967262089252472, 'rewards/format_reward': 1.0, 'reward': 1.5967262983322144, 'reward_std': 0.06008904613554478, 'kl': 0.29541015625, 'epoch': 0.43} 43%|████▎ | 1844/4286 [11:21:14<18:08:09, 26.74s/it] 43%|████▎ | 1845/4286 [11:21:40<17:52:46, 
43%|████▎ | 1845/4286 [11:21:40<17:52:46, 26.37s/it] {'loss': 0.0412, 'grad_norm': 4.640739786591454, 'learning_rate': 5.695286980867941e-07, 'completion_length': 213.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6354166865348816, 'rewards/format_reward': 1.0, 'reward': 1.6354167461395264, 'reward_std': 0.08311965316534042, 'kl': 1.0283203125, 'epoch': 0.43}
43%|████▎ | 1846/4286 [11:22:07<18:08:20, 26.76s/it] {'loss': 0.0113, 'grad_norm': 12.501185139515758, 'learning_rate': 5.692953803079795e-07, 'completion_length': 250.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.5297619551420212, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5119048357009888, 'reward_std': 0.04761905036866665, 'kl': 0.283203125, 'epoch': 0.43}
43%|████▎ | 1847/4286 [11:22:33<17:58:03, 26.52s/it] {'loss': 0.0078, 'grad_norm': 2.326696693107733, 'learning_rate': 5.690620625291647e-07, 'completion_length': 278.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.6825397312641144, 'rewards/format_reward': 1.0, 'reward': 1.6825397610664368, 'reward_std': 0.03733126446604729, 'kl': 0.1943359375, 'epoch': 0.43}
43%|████▎ | 1848/4286 [11:23:01<18:11:21, 26.86s/it] {'loss': 0.021, 'grad_norm': 2.063119400274888, 'learning_rate': 5.688287447503499e-07, 'completion_length': 276.48216247558594, 'rewards/only_full_func_accuracy_reward': 0.48697349429130554, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4691163301467896, 'reward_std': 0.09797045588493347, 'kl': 0.525390625, 'epoch': 0.43}
43%|████▎ | 1849/4286 [11:23:26<17:46:55, 26.27s/it] {'loss': 0.0203, 'grad_norm': 8.125773973928634, 'learning_rate': 5.685954269715352e-07, 'completion_length': 216.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5464286208152771, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5107144117355347, 'reward_std': 0.16072016954421997, 'kl': 0.5078125, 'epoch': 0.43}
43%|████▎ | 1850/4286 [11:23:53<17:56:29, 26.51s/it] {'loss': 0.0076, 'grad_norm': 2.763321083264524, 'learning_rate': 5.683621091927204e-07, 'completion_length': 243.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.4961310029029846, 'rewards/format_reward': 1.0, 'reward': 1.4961311221122742, 'reward_std': 0.09912282228469849, 'kl': 0.189453125, 'epoch': 0.43}
43%|████▎ | 1851/4286 [11:24:19<17:45:46, 26.26s/it] {'loss': 0.0254, 'grad_norm': 4.528668866275405, 'learning_rate': 5.681287914139057e-07, 'completion_length': 200.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.4797619432210922, 'rewards/format_reward': 1.0, 'reward': 1.4797620177268982, 'reward_std': 0.05456013046205044, 'kl': 0.6328125, 'epoch': 0.43}
43%|████▎ | 1852/4286 [11:24:46<17:52:20, 26.43s/it] {'loss': 0.0209, 'grad_norm': 9.348657861187906, 'learning_rate': 5.678954736350909e-07, 'completion_length': 236.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.6208333671092987, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6029762625694275, 'reward_std': 0.12462559342384338, 'kl': 0.52490234375, 'epoch': 0.43}
[2025-03-02 16:32:27,383] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
43%|████▎ | 1853/4286 [11:25:12<17:45:35, 26.28s/it] {'loss': 0.0086, 'grad_norm': 1.4894770298512303, 'learning_rate': 5.676621558562762e-07, 'completion_length': 197.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5875000655651093, 'rewards/format_reward': 1.0, 'reward': 1.5875000953674316, 'reward_std': 0.05702805519104004, 'kl': 0.21435546875, 'epoch': 0.43}
43%|████▎ | 1854/4286 [11:25:39<17:54:19, 26.50s/it] {'loss': 0.0247, 'grad_norm': 2.104734419586483, 'learning_rate': 5.674288380774614e-07, 'completion_length': 266.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.40148812532424927, 'rewards/format_reward': 1.0, 'reward': 1.4014881253242493, 'reward_std': 0.08329595439136028, 'kl': 0.6162109375, 'epoch': 0.43}
[2025-03-02 16:33:21,444] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
43%|████▎ | 1855/4286 [11:26:06<18:00:15, 26.66s/it] {'loss': 0.0131, 'grad_norm': 1.6502539951656676, 'learning_rate': 5.671955202986467e-07, 'completion_length': 247.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6086310744285583, 'reward_std': 0.12109564617276192, 'kl': 0.3271484375, 'epoch': 0.43}
43%|████▎ | 1856/4286 [11:26:28<17:07:10, 25.36s/it] {'loss': 0.0098, 'grad_norm': 7.636587060728226, 'learning_rate': 5.66962202519832e-07, 'completion_length': 203.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.6089286208152771, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5910715460777283, 'reward_std': 0.09702010080218315, 'kl': 0.2451171875, 'epoch': 0.43}
43%|████▎ | 1857/4286 [11:26:54<17:11:23, 25.48s/it] {'loss': 0.0342, 'grad_norm': 9.638129772184257, 'learning_rate': 5.667288847410172e-07, 'completion_length': 267.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.5204294621944427, 'rewards/format_reward': 1.0, 'reward': 1.52042955160141, 'reward_std': 0.04883104283362627, 'kl': 0.85546875, 'epoch': 0.43}
43%|████▎ | 1858/4286 [11:27:18<17:01:19, 25.24s/it] {'loss': 0.0187, 'grad_norm': 2.9856805290622592, 'learning_rate': 5.664955669622024e-07, 'completion_length': 224.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.55555559694767, 'rewards/format_reward': 1.0, 'reward': 1.5555557012557983, 'reward_std': 0.07006688974797726, 'kl': 0.4677734375, 'epoch': 0.43}
43%|████▎ | 1859/4286 [11:27:42<16:43:58, 24.82s/it] {'loss': 0.0072, 'grad_norm': 1.6840405348057252, 'learning_rate': 5.662622491833878e-07, 'completion_length': 214.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6273809671401978, 'rewards/format_reward': 1.0, 'reward': 1.6273809671401978, 'reward_std': 0.043385256081819534, 'kl': 0.18115234375, 'epoch': 0.43}
43%|████▎ | 1860/4286 [11:28:06<16:37:03, 24.66s/it] {'loss': 0.0377, 'grad_norm': 3.256919449598852, 'learning_rate': 5.66028931404573e-07, 'completion_length': 190.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6104167103767395, 'rewards/format_reward': 1.0, 'reward': 1.610416829586029, 'reward_std': 0.09662698954343796, 'kl': 0.9404296875, 'epoch': 0.43}
43%|████▎ | 1861/4286 [11:28:32<16:51:15, 25.02s/it] {'loss': 0.017, 'grad_norm': 2.798917572267251, 'learning_rate': 5.657956136257582e-07, 'completion_length': 258.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.6211310029029846, 'rewards/format_reward': 1.0, 'reward': 1.6211310625076294, 'reward_std': 0.12853092700242996, 'kl': 0.42578125, 'epoch': 0.43}
43%|████▎ | 1862/4286 [11:28:58<16:58:48, 25.22s/it] {'loss': 0.019, 'grad_norm': 2.791314140256754, 'learning_rate': 5.655622958469434e-07, 'completion_length': 248.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5104167014360428, 'rewards/format_reward': 1.0, 'reward': 1.5104168057441711, 'reward_std': 0.05327077582478523, 'kl': 0.47412109375, 'epoch': 0.43}
43%|████▎ | 1863/4286 [11:29:24<17:11:12, 25.54s/it] {'loss': 0.0203, 'grad_norm': 1.7892366693108563, 'learning_rate': 5.653289780681288e-07, 'completion_length': 255.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.6770834028720856, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.026785715483129025, 'kl': 0.5078125, 'epoch': 0.43}
43%|████▎ | 1864/4286 [11:29:50<17:08:40, 25.48s/it] {'loss': 0.0235, 'grad_norm': 3.0051509658948117, 'learning_rate': 5.65095660289314e-07, 'completion_length': 250.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.5544643402099609, 'rewards/format_reward': 1.0, 'reward': 1.5544643998146057, 'reward_std': 0.09849779680371284, 'kl': 0.587890625, 'epoch': 0.43}
44%|████▎ | 1865/4286 [11:30:13<16:40:48, 24.80s/it] {'loss': 0.0074, 'grad_norm': 1.6561012502420014, 'learning_rate': 5.648623425104992e-07, 'completion_length': 226.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6818452775478363, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6639882326126099, 'reward_std': 0.13311965018510818, 'kl': 0.185546875, 'epoch': 0.44}
44%|████▎ | 1866/4286 [11:30:37<16:27:43, 24.49s/it] {'loss': 0.0174, 'grad_norm': 4.390350234864552, 'learning_rate': 5.646290247316845e-07, 'completion_length': 230.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.4589286148548126, 'rewards/format_reward': 1.0, 'reward': 1.458928644657135, 'reward_std': 0.038865149952471256, 'kl': 0.43408203125, 'epoch': 0.44}
44%|████▎ | 1867/4286 [11:31:04<17:07:27, 25.48s/it] {'loss': 0.0098, 'grad_norm': 1.9806085460721894, 'learning_rate': 5.643957069528698e-07, 'completion_length': 306.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6336309611797333, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6157739162445068, 'reward_std': 0.0639850590378046, 'kl': 0.24560546875, 'epoch': 0.44}
44%|████▎ | 1868/4286 [11:31:30<17:05:34, 25.45s/it] {'loss': 0.0121, 'grad_norm': 4.809070188458917, 'learning_rate': 5.64162389174055e-07, 'completion_length': 274.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.37857145071029663, 'rewards/format_reward': 1.0, 'reward': 1.3785715103149414, 'reward_std': 0.06428571604192257, 'kl': 0.3017578125, 'epoch': 0.44}
44%|████▎ | 1869/4286 [11:31:54<16:46:24, 24.98s/it] {'loss': 0.0162, 'grad_norm': 7.834936684901334, 'learning_rate': 5.639290713952403e-07, 'completion_length': 256.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.578869104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.561012089252472, 'reward_std': 0.1342395395040512, 'kl': 0.4052734375, 'epoch': 0.44}
44%|████▎ | 1870/4286 [11:32:19<16:52:23, 25.14s/it] {'loss': 0.0209, 'grad_norm': 4.428820302876029, 'learning_rate': 5.636957536164255e-07, 'completion_length': 324.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5263606011867523, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5085034370422363, 'reward_std': 0.15572195500135422, 'kl': 0.521484375, 'epoch': 0.44}
44%|████▎ | 1871/4286 [11:32:43<16:39:28, 24.83s/it] {'loss': 0.0137, 'grad_norm': 2.579358703711742, 'learning_rate': 5.634624358376107e-07, 'completion_length': 259.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.5291666984558105, 'rewards/format_reward': 1.0, 'reward': 1.5291667580604553, 'reward_std': 0.07722323201596737, 'kl': 0.34326171875, 'epoch': 0.44}
44%|████▎ | 1872/4286 [11:33:08<16:33:47, 24.70s/it] {'loss': 0.013, 'grad_norm': 8.244157004697906, 'learning_rate': 5.632291180587961e-07, 'completion_length': 278.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.47589291632175446, 'rewards/format_reward': 1.0, 'reward': 1.4758929014205933, 'reward_std': 0.11215220391750336, 'kl': 0.32373046875, 'epoch': 0.44}
44%|████▎ | 1873/4286 [11:33:31<16:19:16, 24.35s/it] {'loss': 0.0251, 'grad_norm': 5.309628015458686, 'learning_rate': 5.629958002799813e-07, 'completion_length': 243.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.730654776096344, 'rewards/format_reward': 1.0, 'reward': 1.7306548357009888, 'reward_std': 0.09502441436052322, 'kl': 0.6240234375, 'epoch': 0.44}
44%|████▎ | 1874/4286 [11:33:56<16:20:57, 24.40s/it] {'loss': 0.0385, 'grad_norm': 4.54898869860447, 'learning_rate': 5.627624825011665e-07, 'completion_length': 268.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.4789399355649948, 'rewards/format_reward': 1.0, 'reward': 1.4789398908615112, 'reward_std': 0.12655439600348473, 'kl': 0.966796875, 'epoch': 0.44}
44%|████▎ | 1875/4286 [11:34:18<15:49:36, 23.63s/it] {'loss': 0.038, 'grad_norm': 4.566221809482346, 'learning_rate': 5.625291647223517e-07, 'completion_length': 232.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.39523813128471375, 'rewards/format_reward': 1.0, 'reward': 1.3952381610870361, 'reward_std': 0.05733690410852432, 'kl': 0.953125, 'epoch': 0.44}
44%|████▍ | 1876/4286 [11:34:39<15:23:10, 22.98s/it] {'loss': 0.0084, 'grad_norm': 3.7824981058518654, 'learning_rate': 5.622958469435371e-07, 'completion_length': 199.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.5910714864730835, 'rewards/format_reward': 1.0, 'reward': 1.5910715460777283, 'reward_std': 0.03481839131563902, 'kl': 0.20947265625, 'epoch': 0.44}
44%|████▍ | 1877/4286 [11:35:00<14:58:44, 22.38s/it] {'loss': 0.0429, 'grad_norm': 5.457543236314801, 'learning_rate': 5.620625291647223e-07, 'completion_length': 208.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7148810029029846, 'rewards/format_reward': 1.0, 'reward': 1.7148810029029846, 'reward_std': 0.10761680081486702, 'kl': 1.0703125, 'epoch': 0.44}
44%|████▍ | 1878/4286 [11:35:23<15:02:52, 22.50s/it] {'loss': 0.0339, 'grad_norm': 4.419125263833349, 'learning_rate': 5.618292113859075e-07, 'completion_length': 252.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.44997167587280273, 'rewards/format_reward': 1.0, 'reward': 1.4499717354774475, 'reward_std': 0.10699397884309292, 'kl': 0.845703125, 'epoch': 0.44}
44%|████▍ | 1879/4286 [11:35:48<15:30:09, 23.19s/it] {'loss': 0.0445, 'grad_norm': 6.769285544380327, 'learning_rate': 5.615958936070928e-07, 'completion_length': 260.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.47663693130016327, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4409226775169373, 'reward_std': 0.1367020681500435, 'kl': 1.115234375, 'epoch': 0.44}
44%|████▍ | 1880/4286 [11:36:10<15:24:26, 23.05s/it] {'loss': 0.111, 'grad_norm': 5.926592391261626, 'learning_rate': 5.613625758282781e-07, 'completion_length': 228.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.4015873223543167, 'rewards/format_reward': 1.0, 'reward': 1.401587426662445, 'reward_std': 0.14635565131902695, 'kl': 2.7734375, 'epoch': 0.44}
44%|████▍ | 1881/4286 [11:36:32<15:11:10, 22.73s/it] {'loss': 0.0101, 'grad_norm': 4.06716446639275, 'learning_rate': 5.611292580494633e-07, 'completion_length': 250.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.5458333790302277, 'rewards/format_reward': 1.0, 'reward': 1.54583340883255, 'reward_std': 0.0806104950606823, 'kl': 0.2529296875, 'epoch': 0.44}
44%|████▍ | 1882/4286 [11:36:58<15:49:47, 23.71s/it] {'loss': 0.0213, 'grad_norm': 2.7114541005500556, 'learning_rate': 5.608959402706486e-07, 'completion_length': 257.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5914683043956757, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5557540655136108, 'reward_std': 0.15635015815496445, 'kl': 0.5302734375, 'epoch': 0.44}
44%|████▍ | 1883/4286 [11:37:20<15:26:17, 23.13s/it] {'loss': 0.0346, 'grad_norm': 10.897484410695172, 'learning_rate': 5.606626224918338e-07, 'completion_length': 204.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.633556604385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6156995296478271, 'reward_std': 0.1488572508096695, 'kl': 0.8671875, 'epoch': 0.44}
44%|████▍ | 1884/4286 [11:37:42<15:09:26, 22.72s/it] {'loss': 0.0382, 'grad_norm': 7.5697849923767295, 'learning_rate': 5.604293047130191e-07, 'completion_length': 226.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.5428571850061417, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5250001549720764, 'reward_std': 0.12479179725050926, 'kl': 0.95703125, 'epoch': 0.44}
44%|████▍ | 1885/4286 [11:38:05<15:19:30, 22.98s/it] {'loss': 0.0668, 'grad_norm': 4.805743679483639, 'learning_rate': 5.601959869342043e-07, 'completion_length': 212.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6257305145263672, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6078734397888184, 'reward_std': 0.1693749949336052, 'kl': 1.669921875, 'epoch': 0.44}
44%|████▍ | 1886/4286 [11:38:30<15:34:49, 23.37s/it] {'loss': 0.0904, 'grad_norm': 36.430798644688885, 'learning_rate': 5.599626691553896e-07, 'completion_length': 234.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.38575680553913116, 'rewards/format_reward': 1.0, 'reward': 1.3857569098472595, 'reward_std': 0.13468296453356743, 'kl': 2.2578125, 'epoch': 0.44}
44%|████▍ | 1887/4286 [11:38:52<15:27:24, 23.19s/it] {'loss': 0.0928, 'grad_norm': 21.52673392480305, 'learning_rate': 5.597293513765748e-07, 'completion_length': 220.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.45691612362861633, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4390589594841003, 'reward_std': 0.15798881277441978, 'kl': 2.32421875, 'epoch': 0.44}
44%|████▍ | 1888/4286 [11:39:17<15:45:18, 23.65s/it] {'loss': 0.1521, 'grad_norm': 10.574666128036919, 'learning_rate': 5.594960335977601e-07, 'completion_length': 208.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.531250074505806, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.4419643878936768, 'reward_std': 0.263938095420599, 'kl': 3.80078125, 'epoch': 0.44}
44%|████▍ | 1889/4286 [11:39:39<15:20:18, 23.04s/it] {'loss': 0.0757, 'grad_norm': 6.486079258828835, 'learning_rate': 5.592627158189454e-07, 'completion_length': 227.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.48630957305431366, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4505953192710876, 'reward_std': 0.18380988389253616, 'kl': 1.89453125, 'epoch': 0.44}
44%|████▍ | 1890/4286 [11:40:01<15:12:58, 22.86s/it] {'loss': 0.0387, 'grad_norm': 9.92599472290284, 'learning_rate': 5.590293980401306e-07, 'completion_length': 246.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.11390823312103748, 'kl': 0.96875, 'epoch': 0.44}
44%|████▍ | 1891/4286 [11:40:24<15:12:15, 22.85s/it] {'loss': 0.0565, 'grad_norm': 10.392733011133261, 'learning_rate': 5.587960802613158e-07, 'completion_length': 217.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.610714316368103, 'rewards/format_reward': 1.0, 'reward': 1.6107143759727478, 'reward_std': 0.1442035362124443, 'kl': 1.4140625, 'epoch': 0.44}
44%|████▍ | 1892/4286 [11:40:46<15:04:30, 22.67s/it] {'loss': 0.048, 'grad_norm': 3.7200254531676933, 'learning_rate': 5.585627624825012e-07, 'completion_length': 203.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.47976192831993103, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4619048833847046, 'reward_std': 0.14458894729614258, 'kl': 1.201171875, 'epoch': 0.44}
44%|████▍ | 1893/4286 [11:41:08<14:56:00, 22.47s/it] {'loss': 0.053, 'grad_norm': 26.269010965369716, 'learning_rate': 5.583294447036864e-07, 'completion_length': 212.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.4616071879863739, 'rewards/format_reward': 1.0, 'reward': 1.4616071581840515, 'reward_std': 0.15791695937514305, 'kl': 1.32421875, 'epoch': 0.44}
44%|████▍ | 1894/4286 [11:41:32<15:12:31, 22.89s/it] {'loss': 0.0349, 'grad_norm': 11.032738497701558, 'learning_rate': 5.580961269248716e-07, 'completion_length': 235.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.6130952537059784, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5595239400863647, 'reward_std': 0.24172164499759674, 'kl': 0.873046875, 'epoch': 0.44}
44%|████▍ | 1895/4286 [11:41:54<15:04:39, 22.70s/it] {'loss': 0.0513, 'grad_norm': 7.552611062496059, 'learning_rate': 5.578628091460569e-07, 'completion_length': 215.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.3898809850215912, 'rewards/format_reward': 1.0, 'reward': 1.3898810744285583, 'reward_std': 0.09893056377768517, 'kl': 1.28515625, 'epoch': 0.44}
44%|████▍ | 1896/4286 [11:42:17<14:58:42, 22.56s/it] {'loss': 0.0412, 'grad_norm': 10.072161236604083, 'learning_rate': 5.576294913672421e-07, 'completion_length': 239.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5729167461395264, 'reward_std': 0.14238565415143967, 'kl': 1.02734375, 'epoch': 0.44}
44%|████▍ | 1897/4286 [11:42:38<14:37:17, 22.03s/it] {'loss': 0.0257, 'grad_norm': 5.523980295254142, 'learning_rate': 5.573961735884274e-07, 'completion_length': 184.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.5293651223182678, 'rewards/format_reward': 1.0, 'reward': 1.5293651819229126, 'reward_std': 0.08788960054516792, 'kl': 0.642578125, 'epoch': 0.44}
44%|████▍ | 1898/4286 [11:42:59<14:32:19, 21.92s/it] {'loss': 0.0466, 'grad_norm': 6.4916472836396135, 'learning_rate': 5.571628558096126e-07, 'completion_length': 198.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5608630776405334, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.543006181716919, 'reward_std': 0.1381917092949152, 'kl': 1.162109375, 'epoch': 0.44}
44%|████▍ | 1899/4286 [11:43:23<14:54:09, 22.48s/it] {'loss': 0.0323, 'grad_norm': 15.14684579756764, 'learning_rate': 5.569295380307979e-07, 'completion_length': 224.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.5922619253396988, 'rewards/format_reward': 1.0, 'reward': 1.5922619104385376, 'reward_std': 0.07122771069407463, 'kl': 0.8056640625, 'epoch': 0.44}
44%|████▍ | 1900/4286 [11:43:45<14:43:28, 22.22s/it] {'loss': 0.024, 'grad_norm': 14.173663520882268, 'learning_rate': 5.566962202519831e-07, 'completion_length': 192.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6979166865348816, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.0565476194024086, 'kl': 0.5986328125, 'epoch': 0.44}
44%|████▍ | 1901/4286 [11:47:45<58:11:32, 87.84s/it] {'loss': 0.0416, 'grad_norm': 18.73323674287299, 'learning_rate': 5.564629024731684e-07, 'completion_length': 230.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.41056548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3927084803581238, 'reward_std': 0.15046683698892593, 'kl': 1.0390625, 'epoch': 0.44}
44%|████▍ | 1902/4286 [11:48:07<44:57:43, 67.90s/it] {'loss': 0.022, 'grad_norm': 2.5197854032003826, 'learning_rate': 5.562295846943537e-07, 'completion_length': 202.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5221726298332214, 'rewards/format_reward': 1.0, 'reward': 1.522172749042511, 'reward_std': 0.09853683412075043, 'kl': 0.548828125, 'epoch': 0.44}
44%|████▍ | 1903/4286 [11:48:29<35:45:38, 54.02s/it] {'loss': 0.0295, 'grad_norm': 5.930601990791656, 'learning_rate': 5.559962669155389e-07, 'completion_length': 205.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5669642984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5491072535514832, 'reward_std': 0.1494418866932392, 'kl': 0.7373046875, 'epoch': 0.44}
44%|████▍ | 1904/4286 [11:48:49<29:07:22, 44.01s/it] {'loss': 0.038, 'grad_norm': 6.9256633389737825, 'learning_rate': 5.557629491367241e-07, 'completion_length': 212.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.554166704416275, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5363095998764038, 'reward_std': 0.22559359669685364, 'kl': 0.951171875, 'epoch': 0.44}
44%|████▍ | 1905/4286 [11:49:11<24:38:32, 37.26s/it] {'loss': 0.0366, 'grad_norm': 45.830328305508566, 'learning_rate': 5.555296313579095e-07, 'completion_length': 213.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5532738566398621, 'rewards/format_reward': 1.0, 'reward': 1.5532739162445068, 'reward_std': 0.13083676993846893, 'kl': 0.916015625, 'epoch': 0.44}
44%|████▍ | 1906/4286 [11:49:32<21:29:39, 32.51s/it] {'loss': 0.0511, 'grad_norm': 5.893636634991765, 'learning_rate': 5.552963135790947e-07, 'completion_length': 194.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.438988134264946, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4211310148239136, 'reward_std': 0.1615498587489128, 'kl': 1.27734375, 'epoch': 0.44}
44%|████▍ | 1907/4286 [11:49:56<19:47:15, 29.94s/it] {'loss': 0.0373, 'grad_norm': 6.2851683102346065, 'learning_rate': 5.550629958002799e-07, 'completion_length': 225.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.5670996308326721, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5313854217529297, 'reward_std': 0.1726799439638853, 'kl': 0.931640625, 'epoch': 0.44}
45%|████▍ | 1908/4286 [11:50:20<18:30:01, 28.01s/it] {'loss': 0.0528, 'grad_norm': 6.408466204137753, 'learning_rate': 5.548296780214651e-07, 'completion_length': 206.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6300595998764038, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.576488196849823, 'reward_std': 0.2507767602801323, 'kl': 1.31640625, 'epoch': 0.45}
45%|████▍ | 1909/4286 [11:50:43<17:37:33, 26.69s/it] {'loss': 0.0364, 'grad_norm': 5.130263939037621, 'learning_rate': 5.545963602426505e-07, 'completion_length': 201.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6471726596355438, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5936012864112854, 'reward_std': 0.17031357437372208, 'kl': 0.91015625, 'epoch': 0.45}
45%|████▍ | 1910/4286 [11:51:07<16:57:15, 25.69s/it] {'loss': 0.0653, 'grad_norm': 6.587971679870321, 'learning_rate': 5.543630424638357e-07, 'completion_length': 192.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5723640322685242, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5545068979263306, 'reward_std': 0.17065874487161636, 'kl': 1.6328125, 'epoch': 0.45}
45%|████▍ | 1911/4286 [11:51:28<16:10:44, 24.52s/it] {'loss': 0.0429, 'grad_norm': 9.15234305415073, 'learning_rate': 5.541297246850209e-07, 'completion_length': 183.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.47857147455215454, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4607143998146057, 'reward_std': 0.13830504193902016, 'kl': 1.076171875, 'epoch': 0.45}
45%|████▍ | 1912/4286 [11:51:50<15:31:11, 23.53s/it] {'loss': 0.0443, 'grad_norm': 9.109759217145173, 'learning_rate': 5.538964069062062e-07, 'completion_length': 189.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5654762238264084, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5476191639900208, 'reward_std': 0.21012193709611893, 'kl': 1.109375, 'epoch': 0.45}
45%|████▍ | 1913/4286 [11:52:10<14:57:49, 22.70s/it] {'loss': 0.0721, 'grad_norm': 10.030577145078038, 'learning_rate': 5.536630891273915e-07, 'completion_length': 194.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.42946431040763855, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.393750011920929, 'reward_std': 0.18489493429660797, 'kl': 1.80078125, 'epoch': 0.45}
45%|████▍ | 1914/4286 [11:52:31<14:35:28, 22.15s/it] {'loss': 0.0369, 'grad_norm': 29.302713608894436, 'learning_rate': 5.534297713485767e-07, 'completion_length': 176.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5392857789993286, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.521428644657135, 'reward_std': 0.14616328850388527, 'kl': 0.921875, 'epoch': 0.45}
45%|████▍ | 1915/4286 [11:52:53<14:28:26, 21.98s/it] {'loss': 0.0849, 'grad_norm': 16.902077400869434, 'learning_rate': 5.53196453569762e-07, 'completion_length': 202.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.4211309850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3854168057441711, 'reward_std': 0.22557615488767624, 'kl': 2.125, 'epoch': 0.45}
45%|████▍ | 1916/4286 [11:53:13<14:13:31, 21.61s/it] {'loss': 0.0683, 'grad_norm': 14.290064194829768, 'learning_rate': 5.529631357909472e-07, 'completion_length': 175.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5568452775478363, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5211310386657715, 'reward_std': 0.2047925665974617, 'kl': 1.70703125, 'epoch': 0.45}
45%|████▍ | 1917/4286 [11:53:34<13:59:30, 21.26s/it] {'loss': 0.0373, 'grad_norm': 3.7233983386474745, 'learning_rate': 5.527298180121325e-07, 'completion_length': 180.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.5476190894842148, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5297620296478271, 'reward_std': 0.14123695716261864, 'kl': 0.931640625, 'epoch': 0.45}
45%|████▍ | 1918/4286 [11:53:54<13:44:49, 20.90s/it] {'loss': 0.0171, 'grad_norm': 5.500783393340775, 'learning_rate': 5.524965002333178e-07, 'completion_length': 156.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5639881193637848, 'rewards/format_reward': 1.0, 'reward': 1.563988208770752, 'reward_std': 0.07121489569544792, 'kl': 0.4267578125, 'epoch': 0.45}
[2025-03-02 17:01:29,179] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
45%|████▍ | 1919/4286 [11:54:13<13:25:34, 20.42s/it] {'loss': 0.0254, 'grad_norm': 2.605836760568098, 'learning_rate': 5.52263182454503e-07, 'completion_length': 178.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.3675595372915268, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3497024774551392, 'reward_std': 0.1270022876560688, 'kl': 0.6328125, 'epoch': 0.45}
45%|████▍ | 1920/4286 [11:54:36<13:47:36, 20.99s/it] {'loss': 0.0422, 'grad_norm': 5.619780352628716, 'learning_rate': 5.520298646756882e-07, 'completion_length': 197.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.485119104385376, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4315477013587952, 'reward_std': 0.21804355457425117, 'kl': 1.056640625, 'epoch': 0.45}
[2025-03-02 17:02:13,779] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
45%|████▍ | 1921/4286 [11:54:58<14:02:38, 21.38s/it] {'loss': 0.017, 'grad_norm': 3.087709852220199, 'learning_rate': 5.517965468968734e-07, 'completion_length': 172.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6089285910129547, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5732142925262451, 'reward_std': 0.17207976430654526, 'kl': 0.4248046875, 'epoch': 0.45}
45%|████▍ | 1922/4286 [11:55:19<13:59:10, 21.30s/it] {'loss': 0.0201, 'grad_norm': 1.1950550152659432, 'learning_rate': 5.515632291180588e-07, 'completion_length': 176.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.07695358991622925, 'kl': 0.501953125, 'epoch': 0.45}
45%|████▍ | 1923/4286 [11:55:38<13:26:50, 20.49s/it] {'loss': 0.0127, 'grad_norm': 1.7430226659763293, 'learning_rate': 5.51329911339244e-07, 'completion_length': 169.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.05448058620095253, 'kl': 0.318359375, 'epoch': 0.45}
45%|████▍ | 1924/4286 [11:55:57<13:11:27, 20.10s/it] {'loss': 0.0175, 'grad_norm': 38.07907366315637, 'learning_rate': 5.510965935604292e-07, 'completion_length': 195.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.4820685088634491, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642114043235779, 'reward_std': 0.12798580527305603, 'kl': 0.43896484375, 'epoch': 0.45}
[2025-03-02 17:03:34,287] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
45%|████▍ | 1925/4286 [11:56:18<13:28:37, 20.55s/it] {'loss': 0.0221, 'grad_norm': 2.607808561261169, 'learning_rate': 5.508632757816145e-07, 'completion_length': 168.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.7380953431129456, 'rewards/format_reward': 1.0, 'reward': 1.7380954027175903, 'reward_std': 0.0668761357665062, 'kl': 0.552734375, 'epoch': 0.45}
45%|████▍ | 1926/4286 [11:56:39<13:28:26, 20.55s/it] {'loss': 0.009, 'grad_norm': 3.7923247571512846, 'learning_rate': 5.506299580027998e-07, 'completion_length': 169.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7288691401481628, 'rewards/format_reward': 1.0, 'reward': 1.7288691997528076, 'reward_std': 0.0669628195464611, 'kl': 0.22509765625, 'epoch': 0.45}
45%|████▍ | 1927/4286 [11:57:02<14:00:46, 21.38s/it] {'loss': 0.0238, 'grad_norm': 3.195954875067913, 'learning_rate': 5.50396640223985e-07, 'completion_length': 184.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.6383928954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.602678656578064, 'reward_std': 0.1597948418930173, 'kl': 0.5947265625, 'epoch': 0.45}
[2025-03-02 17:04:39,751] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
45%|████▍ | 1928/4286 [11:57:24<14:02:41, 21.44s/it] {'loss': 0.0105, 'grad_norm': 3.353312009657487, 'learning_rate': 5.501633224451703e-07, 'completion_length': 152.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 1.0, 'reward': 1.7485119700431824, 'reward_std': 0.037095542065799236, 'kl': 0.26318359375, 'epoch': 0.45}
45%|████▌ | 1929/4286 [11:57:43<13:32:44, 20.69s/it] {'loss': 0.0088, 'grad_norm': 0.8545815163301117, 'learning_rate': 5.499300046663555e-07, 'completion_length': 168.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.013746436685323715, 'kl': 0.21923828125, 'epoch': 0.45}
45%|████▌ | 1930/4286 [11:58:06<13:56:18, 21.30s/it] {'loss': 0.0127, 'grad_norm': 1.6890906462675888, 'learning_rate': 5.496966868875408e-07, 'completion_length': 179.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5877977013587952, 'reward_std': 0.08852548897266388, 'kl': 0.3173828125, 'epoch': 0.45}
45%|████▌ | 1931/4286 [11:58:25<13:37:08, 20.82s/it] {'loss': 0.0178, 'grad_norm': 3.5658456798960074, 'learning_rate': 5.49463369108726e-07, 'completion_length': 174.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.5833333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5654762387275696, 'reward_std': 0.1190476231276989, 'kl': 0.4453125, 'epoch': 0.45}
45%|████▌ | 1932/4286 [11:58:46<13:34:11, 20.75s/it] {'loss': 0.0296,
'grad_norm': 1.789328202314433, 'learning_rate': 5.492300513299113e-07, 'completion_length': 164.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.14294010400772095, 'kl': 0.740234375, 'epoch': 0.45}
45%|████▌ | 1933/4286 [11:59:06<13:26:29, 20.57s/it] {'loss': 0.0127, 'grad_norm': 1.6057040937743499, 'learning_rate': 5.489967335510965e-07, 'completion_length': 165.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6820578873157501, 'rewards/format_reward': 1.0, 'reward': 1.6820579171180725, 'reward_std': 0.12746260315179825, 'kl': 0.3173828125, 'epoch': 0.45}
45%|████▌ | 1934/4286 [11:59:26<13:21:11, 20.44s/it] {'loss': 0.0287, 'grad_norm': 3.4452075834910496, 'learning_rate': 5.487634157722818e-07, 'completion_length': 161.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5193452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.501488208770752, 'reward_std': 0.1278255432844162, 'kl': 0.71875, 'epoch': 0.45}
45%|████▌ | 1935/4286 [11:59:47<13:26:31, 20.58s/it] {'loss': 0.0369, 'grad_norm': 11.026387704849533, 'learning_rate': 5.485300979934671e-07, 'completion_length': 180.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5014881193637848, 'rewards/format_reward': 1.0, 'reward': 1.501488208770752, 'reward_std': 0.04849523399025202, 'kl': 0.9208984375, 'epoch': 0.45}
45%|████▌ | 1936/4286 [12:00:03<12:30:43, 19.17s/it] {'loss': 0.0263, 'grad_norm': 2.572597117887468, 'learning_rate': 5.482967802146523e-07, 'completion_length': 138.78572463989258, 'rewards/only_full_func_accuracy_reward': 0.6547619998455048, 'rewards/format_reward': 1.0, 'reward': 1.654762089252472, 'reward_std': 0.08600887283682823, 'kl': 0.65771484375, 'epoch': 0.45}
45%|████▌ | 1937/4286 [12:00:20<12:11:31, 18.69s/it] {'loss': 0.035, 'grad_norm': 3.2644130473124755, 'learning_rate': 5.480634624358375e-07, 'completion_length': 137.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6369048953056335, 'reward_std': 0.07267062366008759, 'kl': 0.876953125, 'epoch': 0.45}
45%|████▌ | 1938/4286 [12:00:39<12:04:39, 18.52s/it] {'loss': 0.151, 'grad_norm': 9.308196046024927, 'learning_rate': 5.478301446570229e-07, 'completion_length': 152.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.476190522313118, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4583334922790527, 'reward_std': 0.17538157105445862, 'kl': 3.7734375, 'epoch': 0.45}
45%|████▌ | 1939/4286 [12:00:56<11:56:20, 18.31s/it] {'loss': 0.0573, 'grad_norm': 2.5975908919495123, 'learning_rate': 5.475968268782081e-07, 'completion_length': 153.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.0712779313325882, 'kl': 1.4296875, 'epoch': 0.45}
45%|████▌ | 1940/4286 [12:01:15<12:00:55, 18.44s/it] {'loss': 0.065, 'grad_norm': 9.45800226156132, 'learning_rate': 5.473635090993933e-07, 'completion_length': 148.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.489583358168602, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4717262983322144, 'reward_std': 0.12206529825925827, 'kl': 1.625, 'epoch': 0.45}
45%|████▌ | 1941/4286 [12:01:37<12:39:36, 19.44s/it] {'loss': 0.06, 'grad_norm': 4.0112547072656, 'learning_rate': 5.471301913205786e-07, 'completion_length': 175.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.07364453375339508, 'kl': 1.501953125, 'epoch': 0.45}
45%|████▌ | 1942/4286 [12:01:54<12:16:21, 18.85s/it] {'loss': 0.0765, 'grad_norm': 10.276330623754557, 'learning_rate': 5.468968735417639e-07, 'completion_length': 150.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.556547611951828, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5386905670166016, 'reward_std': 0.08173839747905731, 'kl': 1.91796875, 'epoch': 0.45}
45%|████▌ | 1943/4286 [12:02:14<12:29:47, 19.20s/it] {'loss': 0.0613, 'grad_norm': 4.542248971174136, 'learning_rate': 5.466635557629491e-07, 'completion_length': 163.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.563988208770752, 'reward_std': 0.18753990530967712, 'kl': 1.53125, 'epoch': 0.45}
45%|████▌ | 1944/4286 [12:02:32<12:10:38, 18.72s/it] {'loss': 0.0291, 'grad_norm': 6.49227509621289, 'learning_rate': 5.464302379841343e-07, 'completion_length': 143.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.10260478407144547, 'kl': 0.72705078125, 'epoch': 0.45}
45%|████▌ | 1945/4286 [12:02:51<12:16:34, 18.88s/it] {'loss': 0.117, 'grad_norm': 5.479549205474643, 'learning_rate': 5.461969202053196e-07, 'completion_length': 154.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5252976417541504, 'rewards/format_reward': 1.0, 'reward': 1.5252977013587952, 'reward_std': 0.20700763911008835, 'kl': 2.9296875, 'epoch': 0.45}
45%|████▌ | 1946/4286 [12:03:11<12:22:25, 19.04s/it] {'loss': 0.0755, 'grad_norm': 6.67641543662343, 'learning_rate': 5.459636024265048e-07, 'completion_length': 158.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.430059552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4122024774551392, 'reward_std': 0.1022719256579876, 'kl': 1.88671875, 'epoch': 0.45}
45%|████▌ | 1947/4286 [12:03:28<12:07:40, 18.67s/it] {'loss': 0.0413, 'grad_norm': 8.960256971773715, 'learning_rate': 5.457302846476901e-07, 'completion_length': 157.5, 'rewards/only_full_func_accuracy_reward': 0.5208333432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5029762387275696, 'reward_std': 0.07854853011667728, 'kl': 1.033203125, 'epoch': 0.45}
45%|████▌ | 1948/4286 [12:03:49<12:29:35, 19.24s/it] {'loss': 0.064, 'grad_norm': 2.3510460792850374, 'learning_rate': 5.454969668688754e-07, 'completion_length': 147.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.5312500298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4776785969734192, 'reward_std': 0.23257070779800415, 'kl': 1.6015625, 'epoch': 0.45}
45%|████▌ | 1949/4286 [12:04:08<12:29:30, 19.24s/it] {'loss': 0.0488, 'grad_norm': 5.948194118949034, 'learning_rate': 5.452636490900606e-07, 'completion_length': 154.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6413691639900208, 'reward_std': 0.13816103339195251, 'kl': 1.2177734375, 'epoch': 0.45}
45%|████▌ | 1950/4286 [12:04:30<12:55:32, 19.92s/it] {'loss': 0.0583, 'grad_norm': 4.20226028169896, 'learning_rate': 5.450303313112458e-07, 'completion_length': 158.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6547619551420212,
'rewards/format_reward': 1.0, 'reward': 1.6547619700431824, 'reward_std': 0.04500924050807953, 'kl': 1.4609375, 'epoch': 0.45}
46%|████▌ | 1951/4286 [12:04:48<12:38:53, 19.50s/it] {'loss': 0.0173, 'grad_norm': 8.7064519225983, 'learning_rate': 5.447970135324312e-07, 'completion_length': 157.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.601190522313118, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.583333432674408, 'reward_std': 0.09296905621886253, 'kl': 0.43408203125, 'epoch': 0.46}
46%|████▌ | 1952/4286 [12:05:09<12:52:41, 19.86s/it] {'loss': 0.0736, 'grad_norm': 4.89227071597296, 'learning_rate': 5.445636957536164e-07, 'completion_length': 165.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.6026786118745804, 'rewards/format_reward': 1.0, 'reward': 1.6026787161827087, 'reward_std': 0.07708029821515083, 'kl': 1.83984375, 'epoch': 0.46}
46%|████▌ | 1953/4286 [12:05:29<12:55:41, 19.95s/it] {'loss': 0.0656, 'grad_norm': 5.363210460461363, 'learning_rate': 5.443303779748016e-07, 'completion_length': 167.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.635416716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5997024774551392, 'reward_std': 0.1952940635383129, 'kl': 1.640625, 'epoch': 0.46}
46%|████▌ | 1954/4286 [12:05:49<12:55:47, 19.96s/it] {'loss': 0.0658, 'grad_norm': 5.793824120818944, 'learning_rate': 5.440970601959868e-07, 'completion_length': 155.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.4553571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4375001788139343, 'reward_std': 0.21615087985992432, 'kl': 1.6484375, 'epoch': 0.46}
46%|████▌ | 1955/4286 [12:06:07<12:25:47, 19.20s/it] {'loss': 0.0285, 'grad_norm': 1.8640890685036746, 'learning_rate': 5.438637424171722e-07, 'completion_length': 149.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.605654776096344, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.06115180253982544, 'kl': 0.716796875, 'epoch': 0.46}
46%|████▌ | 1956/4286 [12:06:26<12:27:03, 19.24s/it] {'loss': 0.0317, 'grad_norm': 21.561340168958292, 'learning_rate': 5.436304246383574e-07, 'completion_length': 154.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.6581845879554749, 'rewards/format_reward': 1.0, 'reward': 1.6581846475601196, 'reward_std': 0.11453994736075401, 'kl': 0.79052734375, 'epoch': 0.46}
46%|████▌ | 1957/4286 [12:06:47<12:54:15, 19.95s/it] {'loss': 0.0185, 'grad_norm': 7.9402693142456755, 'learning_rate': 5.433971068595426e-07, 'completion_length': 174.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.647321492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6294644474983215, 'reward_std': 0.1273463536053896, 'kl': 0.4609375, 'epoch': 0.46}
46%|████▌ | 1958/4286 [12:07:06<12:37:01, 19.51s/it] {'loss': 0.0321, 'grad_norm': 3.696907446461152, 'learning_rate': 5.431637890807279e-07, 'completion_length': 166.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071430444717407, 'reward_std': 0.0952381044626236, 'kl': 0.802734375, 'epoch': 0.46}
46%|████▌ | 1959/4286 [12:07:26<12:41:56, 19.65s/it] {'loss': 0.034, 'grad_norm': 3.568581591215381, 'learning_rate': 5.429304713019132e-07, 'completion_length': 167.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5476190745830536, 'rewards/format_reward': 1.0, 'reward': 1.5476191639900208, 'reward_std': 0.06487487815320492, 'kl': 0.84765625, 'epoch': 0.46}
46%|████▌ | 1960/4286 [12:07:46<12:49:24, 19.85s/it] {'loss': 0.0459, 'grad_norm': 10.912971099502666, 'learning_rate': 5.426971535230984e-07, 'completion_length': 138.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5401786267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5044643878936768, 'reward_std': 0.11984077841043472, 'kl': 1.1474609375, 'epoch': 0.46}
46%|████▌ | 1961/4286 [12:08:04<12:24:36, 19.22s/it] {'loss': 0.0156, 'grad_norm': 12.889167057731106, 'learning_rate': 5.424638357442837e-07, 'completion_length': 147.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041667461395264, 'reward_std': 0.17596286535263062, 'kl': 0.390625, 'epoch': 0.46}
46%|████▌ | 1962/4286 [12:08:22<12:04:36, 18.71s/it] {'loss': 0.045, 'grad_norm': 7.024333456241717, 'learning_rate': 5.422305179654689e-07, 'completion_length': 138.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.5334821790456772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5156251192092896, 'reward_std': 0.1425826996564865, 'kl': 1.123046875, 'epoch': 0.46}
46%|████▌ | 1963/4286 [12:08:39<11:47:23, 18.27s/it] {'loss': 0.0123, 'grad_norm': 4.983847655137488, 'learning_rate': 5.419972001866542e-07, 'completion_length': 153.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.06504883244633675, 'kl': 0.3076171875, 'epoch': 0.46}
46%|████▌ | 1964/4286 [12:08:59<12:11:11, 18.89s/it] {'loss': 0.0151, 'grad_norm': 3.3557502084602895, 'learning_rate': 5.417638824078395e-07, 'completion_length': 160.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5431548207998276, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4895834922790527, 'reward_std': 0.13755884021520615, 'kl': 0.3759765625, 'epoch': 0.46}
46%|████▌ | 1965/4286 [12:09:22<12:56:45, 20.08s/it] {'loss': 0.048, 'grad_norm': 3.8575491706723852, 'learning_rate': 5.415305646290247e-07, 'completion_length': 176.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5258928835391998, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4723215103149414, 'reward_std': 0.1795024275779724, 'kl': 1.203125, 'epoch': 0.46}
46%|████▌ | 1966/4286 [12:09:43<13:06:10, 20.33s/it] {'loss': 0.0364, 'grad_norm': 10.943296305770833, 'learning_rate': 5.412972468502099e-07, 'completion_length': 157.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5758928954601288, 'rewards/format_reward': 1.0, 'reward': 1.5758929252624512, 'reward_std': 0.14814724400639534, 'kl': 0.91015625, 'epoch': 0.46}
46%|████▌ | 1967/4286 [12:10:03<13:05:47, 20.33s/it] {'loss': 0.062, 'grad_norm': 5.55025915265149, 'learning_rate': 5.410639290713952e-07, 'completion_length': 170.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5520833879709244, 'rewards/format_reward': 1.0, 'reward': 1.5520834922790527, 'reward_std': 0.11010736227035522, 'kl': 1.55078125, 'epoch': 0.46}
[2025-03-02 17:17:41,681] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
46%|████▌ | 1968/4286 [12:10:26<13:31:33, 21.01s/it] {'loss': 0.0415, 'grad_norm': 4.765059222667915, 'learning_rate': 5.408306112925805e-07, 'completion_length': 168.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.6116071492433548, 'rewards/format_reward': 1.0, 'reward': 1.611607313156128, 'reward_std': 0.07708030194044113, 'kl': 1.041015625, 'epoch': 0.46}
46%|████▌ | 1969/4286 [12:10:50<14:09:38, 22.00s/it] {'loss': 0.0786, 'grad_norm': 8.927506388654047, 'learning_rate': 5.405972935137657e-07, 'completion_length': 180.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.392857313156128, 'reward_std': 0.18164650723338127, 'kl': 1.96875, 'epoch': 0.46}
46%|████▌ | 1970/4286 [12:11:10<13:40:49, 21.27s/it] {'loss': 0.0243, 'grad_norm': 5.92035982942805, 'learning_rate': 5.403639757349509e-07, 'completion_length': 168.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5431547909975052, 'rewards/format_reward': 1.0, 'reward': 1.5431548953056335, 'reward_std': 0.10756217688322067, 'kl': 0.6064453125, 'epoch': 0.46}
46%|████▌ | 1971/4286 [12:11:29<13:19:21, 20.72s/it] {'loss': 0.0267, 'grad_norm': 3.7857943942083687, 'learning_rate': 5.401306579561363e-07, 'completion_length': 147.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.5696429014205933, 'rewards/format_reward': 1.0, 'reward': 1.569642961025238, 'reward_std': 0.0436431672424078, 'kl': 0.666015625, 'epoch': 0.46}
46%|████▌ | 1972/4286 [12:11:47<12:46:28, 19.87s/it] {'loss': 0.0134, 'grad_norm': 5.035336014884325, 'learning_rate': 5.398973401773215e-07, 'completion_length': 154.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.1156440656632185, 'kl': 0.3349609375, 'epoch': 0.46}
46%|████▌ | 1973/4286 [12:12:09<13:06:08, 20.39s/it] {'loss': 0.0554, 'grad_norm': 7.9135565750212695, 'learning_rate': 5.396640223985067e-07, 'completion_length': 195.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.3854167014360428, 'rewards/format_reward': 1.0, 'reward': 1.3854167461395264, 'reward_std': 0.16808940097689629, 'kl': 1.3828125, 'epoch': 0.46}
46%|████▌ | 1974/4286 [12:12:27<12:48:14, 19.94s/it] {'loss': 0.0143, 'grad_norm': 2.929274882669686, 'learning_rate': 5.39430704619692e-07, 'completion_length': 171.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.06983364000916481, 'kl': 0.3564453125, 'epoch': 0.46}
46%|████▌ | 1975/4286 [12:12:48<12:52:11, 20.05s/it] {'loss': 0.0325, 'grad_norm': 5.505878872548926, 'learning_rate': 5.391973868408772e-07, 'completion_length': 170.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.48571428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4500001072883606, 'reward_std': 0.16266319528222084, 'kl': 0.8125, 'epoch': 0.46}
46%|████▌ | 1976/4286 [12:13:05<12:24:05, 19.33s/it] {'loss': 0.021, 'grad_norm': 5.06793402857528, 'learning_rate': 5.389640690620625e-07, 'completion_length': 141.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.1401614546775818, 'kl': 0.5263671875, 'epoch': 0.46}
46%|████▌ | 1977/4286 [12:13:26<12:36:40, 19.66s/it] {'loss': 0.0173, 'grad_norm': 25.51667860166418, 'learning_rate': 5.387307512832477e-07, 'completion_length': 163.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5267858505249023, 'reward_std': 0.11677858978509903, 'kl': 0.4326171875, 'epoch': 0.46}
46%|████▌ | 1978/4286 [12:13:45<12:31:03, 19.53s/it] {'loss': 0.0268, 'grad_norm': 63.33493576004307, 'learning_rate': 5.38497433504433e-07, 'completion_length': 169.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5684524178504944, 'rewards/format_reward': 1.0, 'reward': 1.5684524774551392, 'reward_std': 0.07074279710650444, 'kl': 0.6689453125, 'epoch': 0.46}
46%|████▌ | 1979/4286 [12:14:06<12:43:14, 19.85s/it] {'loss': 0.0253, 'grad_norm': 2.5456551329061043, 'learning_rate': 5.382641157256182e-07, 'completion_length': 155.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.636904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.0476190522313118, 'kl': 0.6318359375, 'epoch': 0.46}
46%|████▌ | 1980/4286 [12:14:25<12:39:24, 19.76s/it] {'loss': 0.0181, 'grad_norm': 6.489792326871409, 'learning_rate': 5.380307979468035e-07, 'completion_length': 161.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5663690716028214, 'rewards/format_reward': 1.0, 'reward': 1.5663691759109497, 'reward_std': 0.08323157206177711, 'kl': 0.4501953125, 'epoch': 0.46}
46%|████▌ | 1981/4286 [12:14:46<12:48:32, 20.01s/it] {'loss': 0.0359, 'grad_norm': 8.956658665538326, 'learning_rate': 5.377974801679888e-07, 'completion_length': 163.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6875001192092896, 'reward_std': 0.12215420603752136, 'kl': 0.896484375, 'epoch': 0.46}
46%|████▌ | 1982/4286 [12:15:03<12:19:48, 19.27s/it] {'loss': 0.0117, 'grad_norm': 7.300037662699928, 'learning_rate': 5.37564162389174e-07, 'completion_length': 151.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.7059524655342102, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6880953907966614, 'reward_std': 0.0929236114025116, 'kl': 0.2919921875, 'epoch': 0.46}
46%|████▋ | 1983/4286 [12:15:26<12:58:39, 20.29s/it] {'loss': 0.0336, 'grad_norm': 25.166686469663897, 'learning_rate': 5.373308446103592e-07, 'completion_length': 180.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6364583671092987, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6186013221740723, 'reward_std': 0.16080156713724136, 'kl': 0.83984375, 'epoch': 0.46}
46%|████▋ | 1984/4286 [12:15:43<12:15:05, 19.16s/it] {'loss': 0.0182, 'grad_norm': 3.463129699922104, 'learning_rate': 5.370975268315446e-07, 'completion_length': 130.07143783569336, 'rewards/only_full_func_accuracy_reward': 0.71726194024086, 'rewards/format_reward': 1.0, 'reward': 1.7172619700431824, 'reward_std': 0.12582803890109062, 'kl': 0.4541015625, 'epoch': 0.46}
[2025-03-02 17:23:20,261] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 46%|████▋ | 1985/4286 [12:16:04<12:45:17, 19.96s/it] {'loss': 0.065, 'grad_norm': 27.729856426019243, 'learning_rate': 5.368642090527298e-07, 'completion_length': 180.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.44600342214107513, 'rewards/format_reward': 1.0, 'reward': 1.446003496646881, 'reward_std': 0.13363420963287354, 'kl': 1.619140625, 'epoch': 0.46} 46%|████▋ | 1985/4286 [12:16:04<12:45:17, 19.96s/it] 46%|████▋ | 1986/4286 [12:16:22<12:13:35, 19.14s/it] {'loss': 0.0211, 'grad_norm': 2.4394734853469173, 'learning_rate': 5.36630891273915e-07, 'completion_length': 156.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7410715222358704, 'reward_std': 0.09137413650751114, 'kl': 0.5283203125, 'epoch': 0.46} 46%|████▋ | 1986/4286 [12:16:22<12:13:35, 19.14s/it] 46%|████▋ | 1987/4286 [12:16:44<12:45:46, 19.99s/it] {'loss': 0.0372, 'grad_norm': 16.73783555539301, 'learning_rate': 5.363975734951003e-07, 'completion_length': 164.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6279762983322144, 'reward_std': 0.12457264587283134, 'kl': 0.93359375, 'epoch': 0.46} 46%|████▋ | 1987/4286 [12:16:44<12:45:46, 19.99s/it] 46%|████▋ | 1988/4286 [12:17:01<12:10:19, 19.07s/it] {'loss': 0.0532, 'grad_norm': 7.055435082187604, 'learning_rate': 5.361642557162856e-07, 'completion_length': 156.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.4375000298023224, 'rewards/format_reward': 1.0, 'reward': 1.4375000596046448, 'reward_std': 0.08265924174338579, 'kl': 1.328125, 'epoch': 0.46} 46%|████▋ | 1988/4286 [12:17:01<12:10:19, 19.07s/it] 46%|████▋ | 1989/4286 [12:17:19<12:01:22, 18.84s/it] {'loss': 0.0358, 'grad_norm': 
13.309861498424851, 'learning_rate': 5.359309379374708e-07, 'completion_length': 154.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250001192092896, 'reward_std': 0.15552376210689545, 'kl': 0.8935546875, 'epoch': 0.46} 46%|████▋ | 1989/4286 [12:17:19<12:01:22, 18.84s/it] 46%|████▋ | 1990/4286 [12:17:40<12:26:28, 19.51s/it] {'loss': 0.0815, 'grad_norm': 9.536764293692125, 'learning_rate': 5.35697620158656e-07, 'completion_length': 159.71428680419922, 'rewards/only_full_func_accuracy_reward': 0.5848214626312256, 'rewards/format_reward': 1.0, 'reward': 1.5848214626312256, 'reward_std': 0.18654580414295197, 'kl': 2.0390625, 'epoch': 0.46} 46%|████▋ | 1990/4286 [12:17:40<12:26:28, 19.51s/it] 46%|████▋ | 1991/4286 [12:18:00<12:34:18, 19.72s/it] {'loss': 0.1011, 'grad_norm': 14.678296025672473, 'learning_rate': 5.354643023798413e-07, 'completion_length': 162.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5040391236543655, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.486182153224945, 'reward_std': 0.22962215542793274, 'kl': 2.5234375, 'epoch': 0.46} 46%|████▋ | 1991/4286 [12:18:00<12:34:18, 19.72s/it] 46%|████▋ | 1992/4286 [12:18:22<12:57:17, 20.33s/it] {'loss': 0.1485, 'grad_norm': 11.23311428184823, 'learning_rate': 5.352309846010266e-07, 'completion_length': 178.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.479166716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4613096117973328, 'reward_std': 0.17378663271665573, 'kl': 3.7109375, 'epoch': 0.46} 46%|████▋ | 1992/4286 [12:18:22<12:57:17, 20.33s/it] 47%|████▋ | 1993/4286 [12:18:43<13:01:47, 20.46s/it] {'loss': 0.112, 'grad_norm': 8.658698467325019, 'learning_rate': 5.349976668222118e-07, 'completion_length': 175.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.5803571939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.544642984867096, 'reward_std': 
0.1713969185948372, 'kl': 2.8046875, 'epoch': 0.47} 47%|████▋ | 1993/4286 [12:18:43<13:01:47, 20.46s/it] 47%|████▋ | 1994/4286 [12:19:02<12:51:40, 20.20s/it] {'loss': 0.1112, 'grad_norm': 10.741676480892751, 'learning_rate': 5.347643490433971e-07, 'completion_length': 158.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.523313507437706, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4875993132591248, 'reward_std': 0.27491486817598343, 'kl': 2.78125, 'epoch': 0.47} 47%|████▋ | 1994/4286 [12:19:02<12:51:40, 20.20s/it] 47%|████▋ | 1995/4286 [12:19:23<13:02:32, 20.49s/it] {'loss': 0.1115, 'grad_norm': 30.02193925619989, 'learning_rate': 5.345310312645823e-07, 'completion_length': 159.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5174320340156555, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.499574899673462, 'reward_std': 0.2107696831226349, 'kl': 2.7890625, 'epoch': 0.47} 47%|████▋ | 1995/4286 [12:19:23<13:02:32, 20.49s/it] 47%|████▋ | 1996/4286 [12:19:46<13:31:09, 21.25s/it] {'loss': 0.0527, 'grad_norm': 12.9438087194543, 'learning_rate': 5.342977134857675e-07, 'completion_length': 171.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6145833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5967262983322144, 'reward_std': 0.2094147801399231, 'kl': 1.3203125, 'epoch': 0.47} 47%|████▋ | 1996/4286 [12:19:46<13:31:09, 21.25s/it] 47%|████▋ | 1997/4286 [12:20:06<13:12:47, 20.78s/it] {'loss': 0.0805, 'grad_norm': 6.496248817332683, 'learning_rate': 5.340643957069529e-07, 'completion_length': 174.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6744047999382019, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.656547725200653, 'reward_std': 0.15280301868915558, 'kl': 2.015625, 'epoch': 0.47} 47%|████▋ | 1997/4286 [12:20:06<13:12:47, 20.78s/it] 47%|████▋ | 1998/4286 [12:20:26<12:57:20, 20.38s/it] {'loss': 0.0753, 'grad_norm': 9.251664441528977, 'learning_rate': 5.338310779281381e-07, 
'completion_length': 155.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.5967262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5788690447807312, 'reward_std': 0.1627165637910366, 'kl': 1.87890625, 'epoch': 0.47} 47%|████▋ | 1998/4286 [12:20:26<12:57:20, 20.38s/it] 47%|████▋ | 1999/4286 [12:20:48<13:24:48, 21.11s/it] {'loss': 0.041, 'grad_norm': 18.01412880136181, 'learning_rate': 5.335977601493233e-07, 'completion_length': 149.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.617559552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5997024774551392, 'reward_std': 0.1458333469927311, 'kl': 1.02734375, 'epoch': 0.47} 47%|████▋ | 1999/4286 [12:20:48<13:24:48, 21.11s/it] 47%|████▋ | 2000/4286 [12:21:07<12:52:06, 20.27s/it] {'loss': 0.023, 'grad_norm': 5.016701380542761, 'learning_rate': 5.333644423705085e-07, 'completion_length': 143.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.07199607044458389, 'kl': 0.572265625, 'epoch': 0.47} 47%|████▋ | 2000/4286 [12:21:07<12:52:06, 20.27s/it] 47%|████▋ | 2001/4286 [12:24:39<49:21:04, 77.75s/it] {'loss': 0.0337, 'grad_norm': 3.6637406785220086, 'learning_rate': 5.331311245916939e-07, 'completion_length': 176.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.4568452686071396, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.438988208770752, 'reward_std': 0.11196072399616241, 'kl': 0.841796875, 'epoch': 0.47} 47%|████▋ | 2001/4286 [12:24:39<49:21:04, 77.75s/it] 47%|████▋ | 2002/4286 [12:24:57<38:04:02, 60.00s/it] {'loss': 0.0275, 'grad_norm': 3.1422592917082377, 'learning_rate': 5.328978068128791e-07, 'completion_length': 166.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.5580357611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.540178656578064, 'reward_std': 0.11304131895303726, 'kl': 0.69140625, 'epoch': 0.47} 
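The per-step `{...}` metrics above are valid Python dict literals, so they can be pulled out of the raw log with the standard library alone. A minimal sketch (the `sample` line is copied from step 2000 above; the regex assumes each dict is flat, with no nested braces):

```python
# Sketch: extract the per-step metric dicts from raw trainer log text.
# Assumes each dict is a flat Python literal with no nested braces.
import ast
import re

def parse_metrics(log_text):
    """Return every metrics dict (identified by a 'loss' key) in log_text."""
    rows = []
    for match in re.finditer(r"\{[^{}]*\}", log_text):
        try:
            obj = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError):
            continue  # brace span that is not a Python literal
        if isinstance(obj, dict) and "loss" in obj:
            rows.append(obj)
    return rows

# One step copied (abridged) from the log above (step 2000), tqdm residue included.
sample = ("{'loss': 0.023, 'learning_rate': 5.333644423705085e-07, "
          "'reward': 1.7261905670166016, 'epoch': 0.47} "
          "47%|...| 2000/4286 [12:21:07<12:52:06, 20.27s/it]")
rows = parse_metrics(sample)
```

From here the rows can go straight into a DataFrame or a plot of reward/KL over steps.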
47%|████▋ | 2002/4286 [12:24:57<38:04:02, 60.00s/it] 47%|████▋ | 2003/4286 [12:25:16<30:13:10, 47.65s/it] {'loss': 0.0207, 'grad_norm': 10.788645547822705, 'learning_rate': 5.326644890340643e-07, 'completion_length': 155.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.65476194024086, 'rewards/format_reward': 1.0, 'reward': 1.6547620296478271, 'reward_std': 0.0715427789837122, 'kl': 0.51708984375, 'epoch': 0.47} 47%|████▋ | 2003/4286 [12:25:16<30:13:10, 47.65s/it] 47%|████▋ | 2004/4286 [12:25:37<25:14:10, 39.81s/it] {'loss': 0.0214, 'grad_norm': 31.00773517225923, 'learning_rate': 5.324311712552496e-07, 'completion_length': 179.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5224567353725433, 'rewards/format_reward': 1.0, 'reward': 1.5224568247795105, 'reward_std': 0.08178479503840208, 'kl': 0.5341796875, 'epoch': 0.47} 47%|████▋ | 2004/4286 [12:25:37<25:14:10, 39.81s/it] 47%|████▋ | 2005/4286 [12:25:56<21:06:25, 33.31s/it] {'loss': 0.0086, 'grad_norm': 1.3271711987137307, 'learning_rate': 5.321978534764349e-07, 'completion_length': 158.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.4851190894842148, 'rewards/format_reward': 1.0, 'reward': 1.485119104385376, 'reward_std': 0.029761905781924725, 'kl': 0.21533203125, 'epoch': 0.47} 47%|████▋ | 2005/4286 [12:25:56<21:06:25, 33.31s/it] 47%|████▋ | 2006/4286 [12:26:15<18:21:43, 28.99s/it] {'loss': 0.012, 'grad_norm': 2.524249727349864, 'learning_rate': 5.319645356976201e-07, 'completion_length': 147.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5766369104385376, 'rewards/format_reward': 1.0, 'reward': 1.5766370296478271, 'reward_std': 0.03810553625226021, 'kl': 0.2998046875, 'epoch': 0.47} 47%|████▋ | 2006/4286 [12:26:15<18:21:43, 28.99s/it] 47%|████▋ | 2007/4286 [12:26:37<17:10:51, 27.14s/it] {'loss': 0.0234, 'grad_norm': 2.2743035383239647, 'learning_rate': 5.317312179188054e-07, 'completion_length': 162.69644165039062, 'rewards/only_full_func_accuracy_reward': 
0.5773809850215912, 'rewards/format_reward': 1.0, 'reward': 1.5773810744285583, 'reward_std': 0.061365481466054916, 'kl': 0.5849609375, 'epoch': 0.47} 47%|████▋ | 2007/4286 [12:26:37<17:10:51, 27.14s/it] 47%|████▋ | 2008/4286 [12:26:58<15:59:23, 25.27s/it] {'loss': 0.0118, 'grad_norm': 4.80457221051494, 'learning_rate': 5.314979001399906e-07, 'completion_length': 168.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.38809526711702347, 'rewards/format_reward': 1.0, 'reward': 1.3880953788757324, 'reward_std': 0.0367397703230381, 'kl': 0.294921875, 'epoch': 0.47} 47%|████▋ | 2008/4286 [12:26:58<15:59:23, 25.27s/it] 47%|████▋ | 2009/4286 [12:27:19<15:03:30, 23.81s/it] {'loss': 0.0184, 'grad_norm': 1.4162705848823514, 'learning_rate': 5.312645823611759e-07, 'completion_length': 158.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.4866071790456772, 'rewards/format_reward': 1.0, 'reward': 1.4866072535514832, 'reward_std': 0.05059524439275265, 'kl': 0.45947265625, 'epoch': 0.47} 47%|████▋ | 2009/4286 [12:27:19<15:03:30, 23.81s/it] 47%|████▋ | 2010/4286 [12:27:39<14:27:53, 22.88s/it] {'loss': 0.0185, 'grad_norm': 1.8963191976976908, 'learning_rate': 5.310312645823612e-07, 'completion_length': 157.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.5803571939468384, 'rewards/format_reward': 1.0, 'reward': 1.5803572535514832, 'reward_std': 0.07912982068955898, 'kl': 0.46240234375, 'epoch': 0.47} 47%|████▋ | 2010/4286 [12:27:39<14:27:53, 22.88s/it][2025-03-02 17:35:18,466] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 47%|████▋ | 2011/4286 [12:28:03<14:31:23, 22.98s/it] {'loss': 0.011, 'grad_norm': 2.312105700507851, 'learning_rate': 5.307979468035464e-07, 'completion_length': 189.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.05817561782896519, 'kl': 0.275390625, 'epoch': 0.47} 47%|████▋ | 2011/4286 [12:28:03<14:31:23, 22.98s/it] 47%|████▋ | 2012/4286 [12:28:22<13:49:12, 21.88s/it] {'loss': 0.008, 'grad_norm': 0.6530585616438276, 'learning_rate': 5.305646290247316e-07, 'completion_length': 185.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.520833358168602, 'rewards/format_reward': 1.0, 'reward': 1.5208334922790527, 'reward_std': 0.0357142873108387, 'kl': 0.20068359375, 'epoch': 0.47} 47%|████▋ | 2012/4286 [12:28:22<13:49:12, 21.88s/it] 47%|████▋ | 2013/4286 [12:28:43<13:34:40, 21.50s/it] {'loss': 0.0125, 'grad_norm': 39.01958414367362, 'learning_rate': 5.303313112459169e-07, 'completion_length': 167.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.7633929550647736, 'rewards/format_reward': 1.0, 'reward': 1.7633930444717407, 'reward_std': 0.04053214658051729, 'kl': 0.3134765625, 'epoch': 0.47} 47%|████▋ | 2013/4286 [12:28:43<13:34:40, 21.50s/it] 47%|████▋ | 2014/4286 [12:29:04<13:28:45, 21.36s/it] {'loss': 0.0129, 'grad_norm': 1.566446094339745, 'learning_rate': 5.300979934671022e-07, 'completion_length': 179.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6187500357627869, 'rewards/format_reward': 1.0, 'reward': 1.6187500953674316, 'reward_std': 0.047477658838033676, 'kl': 0.322265625, 'epoch': 0.47} 47%|████▋ | 2014/4286 [12:29:04<13:28:45, 21.36s/it] 47%|████▋ | 2015/4286 [12:29:23<13:10:02, 20.87s/it] {'loss': 0.0138, 'grad_norm': 5.836568581210398, 
'learning_rate': 5.298646756882874e-07, 'completion_length': 168.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6011906266212463, 'reward_std': 0.06915953569114208, 'kl': 0.34521484375, 'epoch': 0.47} 47%|████▋ | 2015/4286 [12:29:23<13:10:02, 20.87s/it] 47%|████▋ | 2016/4286 [12:29:45<13:21:13, 21.18s/it] {'loss': 0.018, 'grad_norm': 49.945710157097, 'learning_rate': 5.296313579094726e-07, 'completion_length': 207.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.03134516254067421, 'kl': 0.44921875, 'epoch': 0.47} 47%|████▋ | 2016/4286 [12:29:45<13:21:13, 21.18s/it] 47%|████▋ | 2017/4286 [12:30:08<13:42:11, 21.74s/it] {'loss': 0.0196, 'grad_norm': 3.393113171600895, 'learning_rate': 5.29398040130658e-07, 'completion_length': 219.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4642857015132904, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.1071428693830967, 'kl': 0.490234375, 'epoch': 0.47} 47%|████▋ | 2017/4286 [12:30:08<13:42:11, 21.74s/it][2025-03-02 17:37:47,900] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 47%|████▋ | 2018/4286 [12:30:32<14:05:06, 22.36s/it] {'loss': 0.0564, 'grad_norm': 4.041654185729739, 'learning_rate': 5.291647223518432e-07, 'completion_length': 211.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.5688492655754089, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.55099219083786, 'reward_std': 0.15914808958768845, 'kl': 1.40625, 'epoch': 0.47} 47%|████▋ | 2018/4286 [12:30:32<14:05:06, 22.36s/it] 47%|████▋ | 2019/4286 [12:30:58<14:40:19, 23.30s/it] {'loss': 0.1411, 'grad_norm': 4290.721480735523, 'learning_rate': 5.289314045730284e-07, 'completion_length': 227.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6762330532073975, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6583759784698486, 'reward_std': 0.16906211525201797, 'kl': 3.53125, 'epoch': 0.47} 47%|████▋ | 2019/4286 [12:30:58<14:40:19, 23.30s/it][2025-03-02 17:38:36,336] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 47%|████▋ | 2020/4286 [12:31:20<14:35:51, 23.19s/it] {'loss': 0.0393, 'grad_norm': 1.6617532661105692, 'learning_rate': 5.286980867942137e-07, 'completion_length': 206.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.4583333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4404762983322144, 'reward_std': 0.13854118436574936, 'kl': 0.982421875, 'epoch': 0.47} 47%|████▋ | 2020/4286 [12:31:20<14:35:51, 23.19s/it] 47%|████▋ | 2021/4286 [12:31:45<14:47:48, 23.52s/it] {'loss': 0.082, 'grad_norm': 10.931341318460683, 'learning_rate': 5.28464769015399e-07, 'completion_length': 264.1607360839844, 'rewards/only_full_func_accuracy_reward': 0.4699404835700989, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4163691401481628, 'reward_std': 0.28255875408649445, 'kl': 2.05078125, 'epoch': 0.47} 47%|████▋ | 2021/4286 [12:31:45<14:47:48, 23.52s/it] 47%|████▋ | 2022/4286 [12:32:06<14:18:30, 22.75s/it] {'loss': 0.0423, 'grad_norm': 5.195188721054985, 'learning_rate': 5.282314512365842e-07, 'completion_length': 217.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6379754543304443, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6201183199882507, 'reward_std': 0.11603472009301186, 'kl': 1.0546875, 'epoch': 0.47} 47%|████▋ | 2022/4286 [12:32:06<14:18:30, 22.75s/it] 47%|████▋ | 2023/4286 [12:32:28<14:17:01, 22.72s/it] {'loss': 0.0511, 'grad_norm': 3.5852364842095725, 'learning_rate': 5.279981334577694e-07, 'completion_length': 214.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6345238387584686, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6166667938232422, 'reward_std': 0.060125263407826424, 'kl': 1.28125, 'epoch': 0.47} 47%|████▋ | 2023/4286 [12:32:28<14:17:01, 22.72s/it] 47%|████▋ | 2024/4286 [12:32:51<14:14:32, 
22.67s/it] {'loss': 0.0357, 'grad_norm': 3.604094294368502, 'learning_rate': 5.277648156789547e-07, 'completion_length': 193.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5040674954652786, 'rewards/format_reward': 1.0, 'reward': 1.504067599773407, 'reward_std': 0.1019042618572712, 'kl': 0.8935546875, 'epoch': 0.47} 47%|████▋ | 2024/4286 [12:32:51<14:14:32, 22.67s/it] 47%|████▋ | 2025/4286 [12:33:13<14:07:11, 22.48s/it] {'loss': 0.1007, 'grad_norm': 4.5802757022404235, 'learning_rate': 5.275314979001399e-07, 'completion_length': 219.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.4499729871749878, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4142587184906006, 'reward_std': 0.22931206598877907, 'kl': 2.505859375, 'epoch': 0.47} 47%|████▋ | 2025/4286 [12:33:13<14:07:11, 22.48s/it] 47%|████▋ | 2026/4286 [12:33:36<14:13:39, 22.66s/it] {'loss': 0.0414, 'grad_norm': 3.1368822788409885, 'learning_rate': 5.272981801213252e-07, 'completion_length': 222.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.4437500238418579, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3901787400245667, 'reward_std': 0.22755734622478485, 'kl': 1.03515625, 'epoch': 0.47} 47%|████▋ | 2026/4286 [12:33:36<14:13:39, 22.66s/it] 47%|████▋ | 2027/4286 [12:33:58<14:06:41, 22.49s/it] {'loss': 0.0689, 'grad_norm': 3.3882255397895062, 'learning_rate': 5.270648623425105e-07, 'completion_length': 213.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.410714328289032, 'reward_std': 0.22532308846712112, 'kl': 1.720703125, 'epoch': 0.47} 47%|████▋ | 2027/4286 [12:33:58<14:06:41, 22.49s/it] 47%|████▋ | 2028/4286 [12:34:20<14:02:35, 22.39s/it] {'loss': 0.0997, 'grad_norm': 5.15072314219003, 'learning_rate': 5.268315445636957e-07, 'completion_length': 194.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5288690775632858, 'rewards/format_reward': 0.9642857313156128, 
'reward': 1.4931548833847046, 'reward_std': 0.18889948725700378, 'kl': 2.484375, 'epoch': 0.47} 47%|████▋ | 2028/4286 [12:34:20<14:02:35, 22.39s/it] 47%|████▋ | 2029/4286 [12:34:44<14:15:12, 22.73s/it] {'loss': 0.0496, 'grad_norm': 5.909831558442691, 'learning_rate': 5.265982267848809e-07, 'completion_length': 204.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.4907738268375397, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4729166626930237, 'reward_std': 0.13591131940484047, 'kl': 1.2421875, 'epoch': 0.47} 47%|████▋ | 2029/4286 [12:34:44<14:15:12, 22.73s/it] 47%|████▋ | 2030/4286 [12:35:09<14:40:30, 23.42s/it] {'loss': 0.0785, 'grad_norm': 3.6611030409691128, 'learning_rate': 5.263649090060663e-07, 'completion_length': 202.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.5026786178350449, 'rewards/format_reward': 0.910714328289032, 'reward': 1.413392961025238, 'reward_std': 0.1961703523993492, 'kl': 1.96875, 'epoch': 0.47} 47%|████▋ | 2030/4286 [12:35:09<14:40:30, 23.42s/it] 47%|████▋ | 2031/4286 [12:35:32<14:39:13, 23.39s/it] {'loss': 0.0331, 'grad_norm': 2.8971411296519998, 'learning_rate': 5.261315912272515e-07, 'completion_length': 220.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.47083336114883423, 'rewards/format_reward': 1.0, 'reward': 1.470833420753479, 'reward_std': 0.06258090399205685, 'kl': 0.828125, 'epoch': 0.47} 47%|████▋ | 2031/4286 [12:35:32<14:39:13, 23.39s/it] 47%|████▋ | 2032/4286 [12:35:56<14:48:00, 23.64s/it] {'loss': 0.0512, 'grad_norm': 4.544766359784986, 'learning_rate': 5.258982734484367e-07, 'completion_length': 190.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.6741071939468384, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6205358505249023, 'reward_std': 0.1221887357532978, 'kl': 1.28125, 'epoch': 0.47} 47%|████▋ | 2032/4286 [12:35:56<14:48:00, 23.64s/it] 47%|████▋ | 2033/4286 [12:36:16<14:05:40, 22.52s/it] {'loss': 0.0197, 'grad_norm': 4.676991493868209, 
'learning_rate': 5.25664955669622e-07, 'completion_length': 181.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7053572237491608, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6875000596046448, 'reward_std': 0.09800060419365764, 'kl': 0.49609375, 'epoch': 0.47} 47%|████▋ | 2033/4286 [12:36:16<14:05:40, 22.52s/it] 47%|████▋ | 2034/4286 [12:36:39<14:03:43, 22.48s/it] {'loss': 0.0406, 'grad_norm': 4.904698144750403, 'learning_rate': 5.254316378908073e-07, 'completion_length': 204.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.6067177653312683, 'rewards/format_reward': 1.0, 'reward': 1.6067177653312683, 'reward_std': 0.10619350150227547, 'kl': 1.017578125, 'epoch': 0.47} 47%|████▋ | 2034/4286 [12:36:39<14:03:43, 22.48s/it] 47%|████▋ | 2035/4286 [12:36:58<13:22:57, 21.40s/it] {'loss': 0.0243, 'grad_norm': 3.2604989817147474, 'learning_rate': 5.251983201119925e-07, 'completion_length': 176.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5630952417850494, 'rewards/format_reward': 1.0, 'reward': 1.5630953311920166, 'reward_std': 0.0922505371272564, 'kl': 0.6083984375, 'epoch': 0.47} 47%|████▋ | 2035/4286 [12:36:58<13:22:57, 21.40s/it] 48%|████▊ | 2036/4286 [12:37:16<12:47:20, 20.46s/it] {'loss': 0.0107, 'grad_norm': 4.9861352328197075, 'learning_rate': 5.249650023331777e-07, 'completion_length': 182.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.4598214775323868, 'rewards/format_reward': 1.0, 'reward': 1.4598214626312256, 'reward_std': 0.061064330860972404, 'kl': 0.26708984375, 'epoch': 0.48} 48%|████▊ | 2036/4286 [12:37:16<12:47:20, 20.46s/it] 48%|████▊ | 2037/4286 [12:37:34<12:19:14, 19.72s/it] {'loss': 0.0111, 'grad_norm': 1.0985825484081464, 'learning_rate': 5.24731684554363e-07, 'completion_length': 173.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.736607164144516, 'rewards/format_reward': 1.0, 'reward': 1.7366072535514832, 'reward_std': 0.008928571827709675, 'kl': 0.27783203125, 'epoch': 0.48} 
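The stage3.py warnings above name the mitigation themselves: call `get_accelerator().empty_cache()` at the same point in the loop on every rank, so the allocator cache flushes stay synchronized. A minimal sketch of that pattern (the training-step body is a placeholder; the snippet falls back to a no-op accelerator so it stays runnable without DeepSpeed or a GPU):

```python
# Sketch of the mitigation suggested by the stage3.py warning: flush the
# allocator cache on all ranks at the same point in the training loop.
try:
    from deepspeed.accelerator import get_accelerator
except Exception:  # keep the sketch runnable without DeepSpeed/CUDA installed
    class _NoOpAccelerator:
        def empty_cache(self):
            pass  # stand-in for the real CUDA cache flush

    def get_accelerator():
        return _NoOpAccelerator()

def training_step(model_engine, batch):
    """Hypothetical step; model_engine and batch are placeholders."""
    # loss = model_engine(batch); model_engine.backward(loss); model_engine.step()
    # Flushing here, after the optimizer step, keeps all ranks aligned.
    get_accelerator().empty_cache()

training_step(None, None)
```

Note the warning's caveat still applies: frequent flushes mean sustained memory pressure, so reducing batch size or sequence length is the first-line fix.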
48%|████▊ | 2037/4286 [12:37:34<12:19:14, 19.72s/it] 48%|████▊ | 2038/4286 [12:37:54<12:26:48, 19.93s/it] {'loss': 0.0429, 'grad_norm': 7.708698420362585, 'learning_rate': 5.244983667755483e-07, 'completion_length': 186.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5902778208255768, 'rewards/format_reward': 1.0, 'reward': 1.590277910232544, 'reward_std': 0.09311597235500813, 'kl': 1.07421875, 'epoch': 0.48} 48%|████▊ | 2038/4286 [12:37:54<12:26:48, 19.93s/it] 48%|████▊ | 2039/4286 [12:38:13<12:09:07, 19.47s/it] {'loss': 0.0265, 'grad_norm': 3.076739620794761, 'learning_rate': 5.242650489967335e-07, 'completion_length': 169.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.5684524029493332, 'rewards/format_reward': 1.0, 'reward': 1.568452537059784, 'reward_std': 0.0892857201397419, 'kl': 0.6630859375, 'epoch': 0.48} 48%|████▊ | 2039/4286 [12:38:13<12:09:07, 19.47s/it] 48%|████▊ | 2040/4286 [12:38:31<11:51:57, 19.02s/it] {'loss': 0.012, 'grad_norm': 3.6915331185557205, 'learning_rate': 5.240317312179188e-07, 'completion_length': 179.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.531250074505806, 'rewards/format_reward': 1.0, 'reward': 1.5312501788139343, 'reward_std': 0.06526251509785652, 'kl': 0.2998046875, 'epoch': 0.48} 48%|████▊ | 2040/4286 [12:38:31<11:51:57, 19.02s/it] 48%|████▊ | 2041/4286 [12:38:51<12:07:01, 19.43s/it] {'loss': 0.0507, 'grad_norm': 4.985283506306645, 'learning_rate': 5.23798413439104e-07, 'completion_length': 179.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.5910714864730835, 'rewards/format_reward': 1.0, 'reward': 1.5910715460777283, 'reward_std': 0.11641731485724449, 'kl': 1.26953125, 'epoch': 0.48} 48%|████▊ | 2041/4286 [12:38:51<12:07:01, 19.43s/it] 48%|████▊ | 2042/4286 [12:39:11<12:16:15, 19.69s/it] {'loss': 0.016, 'grad_norm': 5.700393242433325, 'learning_rate': 5.235650956602893e-07, 'completion_length': 181.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.3235119432210922, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.3056548833847046, 'reward_std': 0.14853745512664318, 'kl': 0.400390625, 'epoch': 0.48} 48%|████▊ | 2042/4286 [12:39:11<12:16:15, 19.69s/it] 48%|████▊ | 2043/4286 [12:39:30<12:00:57, 19.29s/it] {'loss': 0.024, 'grad_norm': 4.682953027204091, 'learning_rate': 5.233317778814746e-07, 'completion_length': 172.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.06394429225474596, 'kl': 0.6025390625, 'epoch': 0.48} 48%|████▊ | 2043/4286 [12:39:30<12:00:57, 19.29s/it] 48%|████▊ | 2044/4286 [12:39:51<12:29:20, 20.05s/it] {'loss': 0.0219, 'grad_norm': 4.944106283715725, 'learning_rate': 5.230984601026598e-07, 'completion_length': 195.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.500595286488533, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4827382564544678, 'reward_std': 0.16964083537459373, 'kl': 0.5498046875, 'epoch': 0.48} 48%|████▊ | 2044/4286 [12:39:51<12:29:20, 20.05s/it] 48%|████▊ | 2045/4286 [12:40:10<12:10:38, 19.56s/it] {'loss': 0.0402, 'grad_norm': 3.472836936518198, 'learning_rate': 5.22865142323845e-07, 'completion_length': 165.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6428572237491608, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.05029458552598953, 'kl': 1.0048828125, 'epoch': 0.48} 48%|████▊ | 2045/4286 [12:40:10<12:10:38, 19.56s/it] 48%|████▊ | 2046/4286 [12:40:29<12:04:04, 19.39s/it] {'loss': 0.0743, 'grad_norm': 4.449372066510218, 'learning_rate': 5.226318245450302e-07, 'completion_length': 173.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6172619462013245, 'rewards/format_reward': 1.0, 'reward': 1.6172620058059692, 'reward_std': 0.144544068723917, 'kl': 1.861328125, 'epoch': 0.48} 48%|████▊ | 2046/4286 [12:40:29<12:04:04, 19.39s/it] 48%|████▊ | 2047/4286 [12:40:49<12:17:11, 19.76s/it] {'loss': 0.0615, 'grad_norm': 
2.3707246962767057, 'learning_rate': 5.223985067662156e-07, 'completion_length': 184.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.4880952686071396, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.470238208770752, 'reward_std': 0.11988946795463562, 'kl': 1.53515625, 'epoch': 0.48} 48%|████▊ | 2047/4286 [12:40:49<12:17:11, 19.76s/it] 48%|████▊ | 2048/4286 [12:41:12<12:42:14, 20.44s/it] {'loss': 0.0433, 'grad_norm': 9.697381781408208, 'learning_rate': 5.221651889874008e-07, 'completion_length': 184.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5360863655805588, 'rewards/format_reward': 1.0, 'reward': 1.5360864400863647, 'reward_std': 0.09314585477113724, 'kl': 1.083984375, 'epoch': 0.48} 48%|████▊ | 2048/4286 [12:41:12<12:42:14, 20.44s/it] 48%|████▊ | 2049/4286 [12:41:31<12:27:23, 20.05s/it] {'loss': 0.0247, 'grad_norm': 9.26111669461657, 'learning_rate': 5.21931871208586e-07, 'completion_length': 168.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.7089710831642151, 'rewards/format_reward': 1.0, 'reward': 1.7089711427688599, 'reward_std': 0.11181972920894623, 'kl': 0.6181640625, 'epoch': 0.48} 48%|████▊ | 2049/4286 [12:41:31<12:27:23, 20.05s/it] 48%|████▊ | 2050/4286 [12:41:51<12:32:08, 20.18s/it] {'loss': 0.0694, 'grad_norm': 4.784191908168346, 'learning_rate': 5.216985534297713e-07, 'completion_length': 183.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6000000238418579, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5821430087089539, 'reward_std': 0.2503435015678406, 'kl': 1.73828125, 'epoch': 0.48} 48%|████▊ | 2050/4286 [12:41:51<12:32:08, 20.18s/it] 48%|████▊ | 2051/4286 [12:42:14<13:06:15, 21.11s/it] {'loss': 0.0635, 'grad_norm': 2.9513750586287872, 'learning_rate': 5.214652356509566e-07, 'completion_length': 186.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.6580782830715179, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6223640441894531, 'reward_std': 
0.21275586634874344, 'kl': 1.591796875, 'epoch': 0.48} 48%|████▊ | 2051/4286 [12:42:14<13:06:15, 21.11s/it] 48%|████▊ | 2052/4286 [12:42:34<12:52:21, 20.74s/it] {'loss': 0.1141, 'grad_norm': 11.679589544783298, 'learning_rate': 5.212319178721418e-07, 'completion_length': 168.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5358843952417374, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.500170111656189, 'reward_std': 0.24031250923871994, 'kl': 2.8515625, 'epoch': 0.48} 48%|████▊ | 2052/4286 [12:42:34<12:52:21, 20.74s/it] 48%|████▊ | 2053/4286 [12:42:55<12:56:37, 20.87s/it] {'loss': 0.0428, 'grad_norm': 6.047275393434925, 'learning_rate': 5.209986000933271e-07, 'completion_length': 161.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6121032536029816, 'rewards/format_reward': 1.0, 'reward': 1.6121032238006592, 'reward_std': 0.12091638520359993, 'kl': 1.06640625, 'epoch': 0.48} 48%|████▊ | 2053/4286 [12:42:55<12:56:37, 20.87s/it] 48%|████▊ | 2054/4286 [12:43:13<12:23:54, 20.00s/it] {'loss': 0.0168, 'grad_norm': 2.829921705436254, 'learning_rate': 5.207652823145123e-07, 'completion_length': 159.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6860120296478271, 'reward_std': 0.12595389038324356, 'kl': 0.4228515625, 'epoch': 0.48} 48%|████▊ | 2054/4286 [12:43:13<12:23:54, 20.00s/it] 48%|████▊ | 2055/4286 [12:43:35<12:46:35, 20.62s/it] {'loss': 0.0318, 'grad_norm': 6.991122420972698, 'learning_rate': 5.205319645356976e-07, 'completion_length': 174.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.59077388048172, 'reward_std': 0.17955279909074306, 'kl': 0.794921875, 'epoch': 0.48} 48%|████▊ | 2055/4286 [12:43:35<12:46:35, 20.62s/it] 48%|████▊ | 2056/4286 [12:43:58<13:09:46, 21.25s/it] {'loss': 0.1311, 'grad_norm': 5.275183633452408, 'learning_rate': 5.202986467568829e-07, 
'completion_length': 178.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4017857909202576, 'reward_std': 0.3081566393375397, 'kl': 3.265625, 'epoch': 0.48} 48%|████▊ | 2056/4286 [12:43:58<13:09:46, 21.25s/it] 48%|████▊ | 2057/4286 [12:44:20<13:13:36, 21.36s/it] {'loss': 0.1058, 'grad_norm': 16.009630576136097, 'learning_rate': 5.200653289780681e-07, 'completion_length': 210.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.409722276031971, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3918652534484863, 'reward_std': 0.18010696023702621, 'kl': 2.6484375, 'epoch': 0.48} 48%|████▊ | 2057/4286 [12:44:20<13:13:36, 21.36s/it] 48%|████▊ | 2058/4286 [12:44:41<13:10:13, 21.28s/it] {'loss': 0.0672, 'grad_norm': 6.844869696125817, 'learning_rate': 5.198320111992533e-07, 'completion_length': 190.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.45238097012043, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.43452388048172, 'reward_std': 0.09755873493850231, 'kl': 1.68359375, 'epoch': 0.48} 48%|████▊ | 2058/4286 [12:44:41<13:10:13, 21.28s/it] 48%|████▊ | 2059/4286 [12:45:03<13:22:52, 21.63s/it] {'loss': 0.088, 'grad_norm': 23.891902253975157, 'learning_rate': 5.195986934204386e-07, 'completion_length': 184.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5573129504919052, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5037416219711304, 'reward_std': 0.23356082290410995, 'kl': 2.19921875, 'epoch': 0.48} 48%|████▊ | 2059/4286 [12:45:03<13:22:52, 21.63s/it] 48%|████▊ | 2060/4286 [12:45:25<13:19:38, 21.55s/it] {'loss': 0.0632, 'grad_norm': 8.169264218795028, 'learning_rate': 5.193653756416239e-07, 'completion_length': 191.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5291596353054047, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5113025307655334, 'reward_std': 0.14033278822898865, 'kl': 1.578125, 'epoch': 0.48} 
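The logged learning rate drops by a constant ~2.33e-10 between adjacent steps, which matches a linear decay from a base of ~1e-6 reaching zero at step 4286 (the run length shown in the progress bars). This is an inference from the numbers above, not a value read from the training config; a quick arithmetic check:

```python
# Inferred from the log, not from the config: a linear LR decay to zero at
# step 4286 from a base of ~1e-6 reproduces the logged learning rates.
TOTAL_STEPS = 4286  # run length shown in the progress bars

# Per-step decrement taken from two adjacent logged steps (1989 -> 1990).
decrement = 5.359309379374708e-07 - 5.35697620158656e-07
base_lr = decrement * TOTAL_STEPS  # ~1.0e-06 if the decay is linear to zero

def lr_at(step):
    return base_lr * (1 - step / TOTAL_STEPS)

# The log shows 5.333644423705085e-07 at step 2000 for comparison.
```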
48%|████▊ | 2060/4286 [12:45:25<13:19:38, 21.55s/it] 48%|████▊ | 2061/4286 [12:45:47<13:32:07, 21.90s/it] {'loss': 0.1111, 'grad_norm': 8.68292196558096, 'learning_rate': 5.191320578628091e-07, 'completion_length': 200.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.5219171047210693, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4326314330101013, 'reward_std': 0.31148336827754974, 'kl': 2.78125, 'epoch': 0.48} 48%|████▊ | 2061/4286 [12:45:47<13:32:07, 21.90s/it] 48%|████▊ | 2062/4286 [12:46:09<13:30:56, 21.88s/it] {'loss': 0.1744, 'grad_norm': 3.43309726072241, 'learning_rate': 5.188987400839943e-07, 'completion_length': 172.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5514881610870361, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.4443453550338745, 'reward_std': 0.35236673057079315, 'kl': 4.3671875, 'epoch': 0.48} 48%|████▊ | 2062/4286 [12:46:09<13:30:56, 21.88s/it] 48%|████▊ | 2063/4286 [12:46:30<13:12:52, 21.40s/it] {'loss': 0.1072, 'grad_norm': 2.987543230899952, 'learning_rate': 5.186654223051797e-07, 'completion_length': 166.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.567099541425705, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5313854217529297, 'reward_std': 0.22670727968215942, 'kl': 2.6796875, 'epoch': 0.48} 48%|████▊ | 2063/4286 [12:46:30<13:12:52, 21.40s/it] 48%|████▊ | 2064/4286 [12:46:49<12:47:02, 20.71s/it] {'loss': 0.0679, 'grad_norm': 4.176724320084544, 'learning_rate': 5.184321045263649e-07, 'completion_length': 176.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.5967262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5788691639900208, 'reward_std': 0.12550189718604088, 'kl': 1.69140625, 'epoch': 0.48} 48%|████▊ | 2064/4286 [12:46:49<12:47:02, 20.71s/it] 48%|████▊ | 2065/4286 [12:47:08<12:35:41, 20.41s/it] {'loss': 0.032, 'grad_norm': 3.312848519961569, 'learning_rate': 5.181987867475501e-07, 'completion_length': 156.6071548461914, 
'rewards/only_full_func_accuracy_reward': 0.5576105713844299, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5397535562515259, 'reward_std': 0.0773342177271843, 'kl': 0.80078125, 'epoch': 0.48} 48%|████▊ | 2065/4286 [12:47:08<12:35:41, 20.41s/it] 48%|████▊ | 2066/4286 [12:47:29<12:33:30, 20.37s/it] {'loss': 0.0235, 'grad_norm': 5.6886470815681, 'learning_rate': 5.179654689687354e-07, 'completion_length': 165.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.6860332190990448, 'rewards/format_reward': 1.0, 'reward': 1.6860332489013672, 'reward_std': 0.08753779157996178, 'kl': 0.5859375, 'epoch': 0.48} 48%|████▊ | 2066/4286 [12:47:29<12:33:30, 20.37s/it] 48%|████▊ | 2067/4286 [12:47:48<12:26:10, 20.18s/it] {'loss': 0.0495, 'grad_norm': 3.1682000967789845, 'learning_rate': 5.177321511899207e-07, 'completion_length': 167.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.12076430767774582, 'kl': 1.23828125, 'epoch': 0.48} 48%|████▊ | 2067/4286 [12:47:48<12:26:10, 20.18s/it] 48%|████▊ | 2068/4286 [12:48:08<12:21:36, 20.06s/it] {'loss': 0.022, 'grad_norm': 3.980563532890619, 'learning_rate': 5.174988334111059e-07, 'completion_length': 169.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5967262387275696, 'rewards/format_reward': 1.0, 'reward': 1.5967262983322144, 'reward_std': 0.06907340884208679, 'kl': 0.548828125, 'epoch': 0.48} 48%|████▊ | 2068/4286 [12:48:08<12:21:36, 20.06s/it] 48%|████▊ | 2069/4286 [12:48:28<12:15:01, 19.89s/it] {'loss': 0.0171, 'grad_norm': 2.982387759052414, 'learning_rate': 5.172655156322911e-07, 'completion_length': 160.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.7441829442977905, 'rewards/format_reward': 1.0, 'reward': 1.7441830039024353, 'reward_std': 0.08476635441184044, 'kl': 0.4287109375, 'epoch': 0.48} 48%|████▊ | 2069/4286 [12:48:28<12:15:01, 19.89s/it] 48%|████▊ | 2070/4286 
[12:48:46<11:57:53, 19.44s/it] {'loss': 0.0102, 'grad_norm': 3.2563813741546705, 'learning_rate': 5.170321978534764e-07, 'completion_length': 167.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5520833879709244, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5342263579368591, 'reward_std': 0.127976194024086, 'kl': 0.25341796875, 'epoch': 0.48} 48%|████▊ | 2070/4286 [12:48:46<11:57:53, 19.44s/it] 48%|████▊ | 2071/4286 [12:49:08<12:27:06, 20.24s/it] {'loss': 0.0089, 'grad_norm': 1.7209814624480562, 'learning_rate': 5.167988800746616e-07, 'completion_length': 173.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6532738208770752, 'rewards/format_reward': 1.0, 'reward': 1.65327388048172, 'reward_std': 0.09778692573308945, 'kl': 0.22216796875, 'epoch': 0.48} 48%|████▊ | 2071/4286 [12:49:08<12:27:06, 20.24s/it] 48%|████▊ | 2072/4286 [12:49:26<12:00:39, 19.53s/it] {'loss': 0.0135, 'grad_norm': 11.25899966857199, 'learning_rate': 5.165655622958469e-07, 'completion_length': 158.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5595238506793976, 'rewards/format_reward': 1.0, 'reward': 1.5595239400863647, 'reward_std': 0.13764496892690659, 'kl': 0.33837890625, 'epoch': 0.48} 48%|████▊ | 2072/4286 [12:49:26<12:00:39, 19.53s/it] 48%|████▊ | 2073/4286 [12:49:46<12:06:44, 19.70s/it] {'loss': 0.0201, 'grad_norm': 3.960609791323731, 'learning_rate': 5.163322445170322e-07, 'completion_length': 169.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.48720240592956543, 'rewards/format_reward': 1.0, 'reward': 1.487202525138855, 'reward_std': 0.043330405838787556, 'kl': 0.501953125, 'epoch': 0.48} 48%|████▊ | 2073/4286 [12:49:46<12:06:44, 19.70s/it] 48%|████▊ | 2074/4286 [12:50:05<12:01:16, 19.56s/it] {'loss': 0.0194, 'grad_norm': 5.3322154025743105, 'learning_rate': 5.160989267382174e-07, 'completion_length': 147.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 
1.68154776096344, 'reward_std': 0.08775761723518372, 'kl': 0.486328125, 'epoch': 0.48} 48%|████▊ | 2074/4286 [12:50:05<12:01:16, 19.56s/it] 48%|████▊ | 2075/4286 [12:50:24<11:50:53, 19.29s/it] {'loss': 0.0151, 'grad_norm': 6.85254531302125, 'learning_rate': 5.158656089594026e-07, 'completion_length': 157.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.7232143580913544, 'rewards/format_reward': 1.0, 'reward': 1.723214328289032, 'reward_std': 0.05940900556743145, 'kl': 0.3779296875, 'epoch': 0.48} 48%|████▊ | 2075/4286 [12:50:24<11:50:53, 19.29s/it] 48%|████▊ | 2076/4286 [12:50:42<11:36:17, 18.90s/it] {'loss': 0.0083, 'grad_norm': 2.0612283186341775, 'learning_rate': 5.15632291180588e-07, 'completion_length': 152.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.6488095819950104, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.04007173329591751, 'kl': 0.20751953125, 'epoch': 0.48} 48%|████▊ | 2076/4286 [12:50:42<11:36:17, 18.90s/it] 48%|████▊ | 2077/4286 [12:51:06<12:31:51, 20.42s/it] {'loss': 0.0297, 'grad_norm': 3.0182435539827943, 'learning_rate': 5.153989734017732e-07, 'completion_length': 185.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.5326236486434937, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5147665739059448, 'reward_std': 0.13678791746497154, 'kl': 0.744140625, 'epoch': 0.48} 48%|████▊ | 2077/4286 [12:51:06<12:31:51, 20.42s/it] 48%|████▊ | 2078/4286 [12:51:24<12:03:19, 19.66s/it] {'loss': 0.0082, 'grad_norm': 2.246664785070618, 'learning_rate': 5.151656556229584e-07, 'completion_length': 151.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.7276785969734192, 'rewards/format_reward': 1.0, 'reward': 1.7276787161827087, 'reward_std': 0.07716727070510387, 'kl': 0.2060546875, 'epoch': 0.48} 48%|████▊ | 2078/4286 [12:51:24<12:03:19, 19.66s/it] 49%|████▊ | 2079/4286 [12:51:41<11:39:07, 19.01s/it] {'loss': 0.04, 'grad_norm': 3.6748324624389928, 'learning_rate': 
5.149323378441437e-07, 'completion_length': 149.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5349702835083008, 'rewards/format_reward': 1.0, 'reward': 1.5349703431129456, 'reward_std': 0.038736362010240555, 'kl': 0.9970703125, 'epoch': 0.49} 49%|████▊ | 2079/4286 [12:51:41<11:39:07, 19.01s/it] 49%|████▊ | 2080/4286 [12:52:01<11:40:56, 19.06s/it] {'loss': 0.0488, 'grad_norm': 5.8166386041459734, 'learning_rate': 5.14699020065329e-07, 'completion_length': 163.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5937501192092896, 'reward_std': 0.1634950265288353, 'kl': 1.2197265625, 'epoch': 0.49} 49%|████▊ | 2080/4286 [12:52:01<11:40:56, 19.06s/it] 49%|████▊ | 2081/4286 [12:52:20<11:44:26, 19.17s/it] {'loss': 0.0346, 'grad_norm': 1.7018538604290685, 'learning_rate': 5.144657022865142e-07, 'completion_length': 154.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.8273809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8095239400863647, 'reward_std': 0.11024943739175797, 'kl': 0.8642578125, 'epoch': 0.49} 49%|████▊ | 2081/4286 [12:52:20<11:44:26, 19.17s/it] 49%|████▊ | 2082/4286 [12:52:42<12:19:21, 20.13s/it] {'loss': 0.0216, 'grad_norm': 1.2917903321463573, 'learning_rate': 5.142323845076994e-07, 'completion_length': 167.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 1.0, 'reward': 1.544642984867096, 'reward_std': 0.053571430034935474, 'kl': 0.5400390625, 'epoch': 0.49} 49%|████▊ | 2082/4286 [12:52:42<12:19:21, 20.13s/it] 49%|████▊ | 2083/4286 [12:53:04<12:31:59, 20.48s/it] {'loss': 0.0337, 'grad_norm': 3.7246864681935508, 'learning_rate': 5.139990667288847e-07, 'completion_length': 162.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.59226194024086, 'rewards/format_reward': 1.0, 'reward': 1.5922620296478271, 'reward_std': 0.0773809552192688, 'kl': 0.841796875, 'epoch': 0.49} 
49%|████▊ | 2083/4286 [12:53:04<12:31:59, 20.48s/it] 49%|████▊ | 2084/4286 [12:53:21<12:00:51, 19.64s/it] {'loss': 0.0153, 'grad_norm': 3.289100270443298, 'learning_rate': 5.1376574895007e-07, 'completion_length': 149.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.5133928954601288, 'rewards/format_reward': 1.0, 'reward': 1.5133929252624512, 'reward_std': 0.049839189276099205, 'kl': 0.3828125, 'epoch': 0.49} 49%|████▊ | 2084/4286 [12:53:21<12:00:51, 19.64s/it] 49%|████▊ | 2085/4286 [12:53:42<12:10:07, 19.90s/it] {'loss': 0.0357, 'grad_norm': 2.5158095397609572, 'learning_rate': 5.135324311712552e-07, 'completion_length': 183.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071430444717407, 'reward_std': 0.06620896607637405, 'kl': 0.890625, 'epoch': 0.49} 49%|████▊ | 2085/4286 [12:53:42<12:10:07, 19.90s/it] 49%|████▊ | 2086/4286 [12:54:01<12:00:50, 19.66s/it] {'loss': 0.0256, 'grad_norm': 3.3974535643429586, 'learning_rate': 5.132991133924405e-07, 'completion_length': 166.0, 'rewards/only_full_func_accuracy_reward': 0.71577388048172, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.079383235424757, 'kl': 0.6416015625, 'epoch': 0.49} 49%|████▊ | 2086/4286 [12:54:01<12:00:50, 19.66s/it] 49%|████▊ | 2087/4286 [12:54:20<11:57:07, 19.57s/it] {'loss': 0.0367, 'grad_norm': 1.5724204242063309, 'learning_rate': 5.130657956136257e-07, 'completion_length': 157.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6104167103767395, 'rewards/format_reward': 1.0, 'reward': 1.6104167699813843, 'reward_std': 0.07678571715950966, 'kl': 0.9140625, 'epoch': 0.49} 49%|████▊ | 2087/4286 [12:54:20<11:57:07, 19.57s/it] 49%|████▊ | 2088/4286 [12:54:41<12:13:10, 20.01s/it] {'loss': 0.0273, 'grad_norm': 6.016844877758627, 'learning_rate': 5.12832477834811e-07, 'completion_length': 156.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5833334028720856, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.5654762983322144, 'reward_std': 0.09770291019231081, 'kl': 0.68408203125, 'epoch': 0.49} 49%|████▊ | 2088/4286 [12:54:41<12:13:10, 20.01s/it] 49%|████▊ | 2089/4286 [12:55:02<12:22:36, 20.28s/it] {'loss': 0.0305, 'grad_norm': 4.293187895046343, 'learning_rate': 5.125991600559963e-07, 'completion_length': 181.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.3931547701358795, 'rewards/format_reward': 1.0, 'reward': 1.3931548595428467, 'reward_std': 0.0931149274110794, 'kl': 0.759765625, 'epoch': 0.49} 49%|████▊ | 2089/4286 [12:55:02<12:22:36, 20.28s/it] 49%|████▉ | 2090/4286 [12:55:23<12:32:34, 20.56s/it] {'loss': 0.0336, 'grad_norm': 1.8744709860204554, 'learning_rate': 5.123658422771815e-07, 'completion_length': 165.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.0476190485060215, 'kl': 0.837890625, 'epoch': 0.49} 49%|████▉ | 2090/4286 [12:55:23<12:32:34, 20.56s/it][2025-03-02 18:03:03,047] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 49%|████▉ | 2091/4286 [12:55:47<13:06:43, 21.50s/it] {'loss': 0.0546, 'grad_norm': 4.990562485057227, 'learning_rate': 5.121325244983667e-07, 'completion_length': 176.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.482142984867096, 'reward_std': 0.17736880853772163, 'kl': 1.369140625, 'epoch': 0.49} 49%|████▉ | 2091/4286 [12:55:47<13:06:43, 21.50s/it] 49%|████▉ | 2092/4286 [12:56:08<12:59:21, 21.31s/it] {'loss': 0.013, 'grad_norm': 10.092384057614726, 'learning_rate': 5.11899206719552e-07, 'completion_length': 191.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.10380570217967033, 'kl': 0.326171875, 'epoch': 0.49} 49%|████▉ | 2092/4286 [12:56:08<12:59:21, 21.31s/it] 49%|████▉ | 2093/4286 [12:56:30<13:03:27, 21.44s/it] {'loss': 0.0328, 'grad_norm': 2.877206410688211, 'learning_rate': 5.116658889407373e-07, 'completion_length': 188.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5744048058986664, 'rewards/format_reward': 1.0, 'reward': 1.5744048357009888, 'reward_std': 0.06998756900429726, 'kl': 0.822265625, 'epoch': 0.49} 49%|████▉ | 2093/4286 [12:56:30<13:03:27, 21.44s/it] 49%|████▉ | 2094/4286 [12:56:52<13:09:10, 21.60s/it] {'loss': 0.0221, 'grad_norm': 11.276620635749454, 'learning_rate': 5.114325711619225e-07, 'completion_length': 201.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.558630958199501, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.540773868560791, 'reward_std': 0.08511904627084732, 'kl': 0.5517578125, 'epoch': 0.49} 49%|████▉ | 2094/4286 [12:56:52<13:09:10, 21.60s/it] 49%|████▉ | 2095/4286 [12:57:13<13:06:12, 21.53s/it] {'loss': 0.0139, 
'grad_norm': 0.9425277815520994, 'learning_rate': 5.111992533831077e-07, 'completion_length': 192.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5044643431901932, 'rewards/format_reward': 1.0, 'reward': 1.5044643878936768, 'reward_std': 0.023595843696966767, 'kl': 0.349609375, 'epoch': 0.49} 49%|████▉ | 2095/4286 [12:57:13<13:06:12, 21.53s/it] 49%|████▉ | 2096/4286 [12:57:34<13:02:16, 21.43s/it] {'loss': 0.0465, 'grad_norm': 4.580459034694191, 'learning_rate': 5.10965935604293e-07, 'completion_length': 194.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.4508928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4330357909202576, 'reward_std': 0.08471459336578846, 'kl': 1.1650390625, 'epoch': 0.49} 49%|████▉ | 2096/4286 [12:57:34<13:02:16, 21.43s/it] 49%|████▉ | 2097/4286 [12:57:57<13:17:23, 21.86s/it] {'loss': 0.0575, 'grad_norm': 4.521167771614577, 'learning_rate': 5.107326178254783e-07, 'completion_length': 209.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.4985119551420212, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.46279776096344, 'reward_std': 0.2416376955807209, 'kl': 1.4375, 'epoch': 0.49} 49%|████▉ | 2097/4286 [12:57:57<13:17:23, 21.86s/it] 49%|████▉ | 2098/4286 [12:58:18<13:02:21, 21.45s/it] {'loss': 0.0267, 'grad_norm': 11.210214094688567, 'learning_rate': 5.104993000466635e-07, 'completion_length': 180.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.7276785969734192, 'rewards/format_reward': 1.0, 'reward': 1.727678656578064, 'reward_std': 0.008928571827709675, 'kl': 0.6669921875, 'epoch': 0.49} 49%|████▉ | 2098/4286 [12:58:18<13:02:21, 21.45s/it] 49%|████▉ | 2099/4286 [12:58:42<13:29:50, 22.22s/it] {'loss': 0.0516, 'grad_norm': 3.748851188670655, 'learning_rate': 5.102659822678488e-07, 'completion_length': 194.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5699405372142792, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5342262983322144, 'reward_std': 
0.19516676664352417, 'kl': 1.2890625, 'epoch': 0.49} 49%|████▉ | 2099/4286 [12:58:42<13:29:50, 22.22s/it] 49%|████▉ | 2100/4286 [12:59:07<13:59:33, 23.04s/it] {'loss': 0.0561, 'grad_norm': 5.417320248792137, 'learning_rate': 5.10032664489034e-07, 'completion_length': 214.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.495535746216774, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4598215818405151, 'reward_std': 0.1745065301656723, 'kl': 1.3984375, 'epoch': 0.49} 49%|████▉ | 2100/4286 [12:59:07<13:59:33, 23.04s/it][2025-03-02 18:10:32,690] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 49%|████▉ | 2101/4286 [13:03:17<55:20:28, 91.18s/it] {'loss': 0.0103, 'grad_norm': 3.623857360170787, 'learning_rate': 5.097993467102193e-07, 'completion_length': 179.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.7232143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7232144474983215, 'reward_std': 0.06990811694413424, 'kl': 0.25732421875, 'epoch': 0.49} 49%|████▉ | 2101/4286 [13:03:17<55:20:28, 91.18s/it] 49%|████▉ | 2102/4286 [13:03:38<42:35:27, 70.20s/it] {'loss': 0.0177, 'grad_norm': 4.502668568974617, 'learning_rate': 5.095660289314046e-07, 'completion_length': 198.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.5431548058986664, 'rewards/format_reward': 1.0, 'reward': 1.5431548953056335, 'reward_std': 0.08012753445655107, 'kl': 0.44140625, 'epoch': 0.49} 49%|████▉ | 2102/4286 [13:03:38<42:35:27, 70.20s/it] 49%|████▉ | 2103/4286 [13:04:01<33:54:30, 55.92s/it] {'loss': 0.0483, 'grad_norm': 6.618761862884529, 'learning_rate': 
5.093327111525898e-07, 'completion_length': 207.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6815476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6636906266212463, 'reward_std': 0.13756146281957626, 'kl': 1.208984375, 'epoch': 0.49} 49%|████▉ | 2103/4286 [13:04:01<33:54:30, 55.92s/it] 49%|████▉ | 2104/4286 [13:04:23<27:46:56, 45.84s/it] {'loss': 0.0462, 'grad_norm': 4.210440443866514, 'learning_rate': 5.09099393373775e-07, 'completion_length': 186.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.5282738953828812, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5104167461395264, 'reward_std': 0.12938889116048813, 'kl': 1.16015625, 'epoch': 0.49} 49%|████▉ | 2104/4286 [13:04:23<27:46:56, 45.84s/it] 49%|████▉ | 2105/4286 [13:04:45<23:28:36, 38.75s/it] {'loss': 0.1192, 'grad_norm': 13.339988704156005, 'learning_rate': 5.088660755949603e-07, 'completion_length': 218.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.5116071850061417, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4401786923408508, 'reward_std': 0.17542067728936672, 'kl': 2.986328125, 'epoch': 0.49} 49%|████▉ | 2105/4286 [13:04:45<23:28:36, 38.75s/it] 49%|████▉ | 2106/4286 [13:05:08<20:31:09, 33.89s/it] {'loss': 0.0785, 'grad_norm': 47.9086417384158, 'learning_rate': 5.086327578161456e-07, 'completion_length': 183.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5342262238264084, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.516369104385376, 'reward_std': 0.18940778076648712, 'kl': 1.96484375, 'epoch': 0.49} 49%|████▉ | 2106/4286 [13:05:08<20:31:09, 33.89s/it] 49%|████▉ | 2107/4286 [13:05:31<18:30:28, 30.58s/it] {'loss': 0.0527, 'grad_norm': 8.918127105532296, 'learning_rate': 5.083994400373308e-07, 'completion_length': 206.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5458333790302277, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5279763340950012, 'reward_std': 0.1852957047522068, 
'kl': 1.31640625, 'epoch': 0.49} 49%|████▉ | 2107/4286 [13:05:31<18:30:28, 30.58s/it] 49%|████▉ | 2108/4286 [13:05:54<17:09:43, 28.37s/it] {'loss': 0.1063, 'grad_norm': 8.924587111994814, 'learning_rate': 5.08166122258516e-07, 'completion_length': 207.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.48750001192092896, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4339286088943481, 'reward_std': 0.35179491341114044, 'kl': 2.65625, 'epoch': 0.49} 49%|████▉ | 2108/4286 [13:05:54<17:09:43, 28.37s/it] 49%|████▉ | 2109/4286 [13:06:12<15:23:38, 25.46s/it] {'loss': 0.0674, 'grad_norm': 12.459546546538455, 'learning_rate': 5.079328044797014e-07, 'completion_length': 179.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.4806547909975052, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4449405670166016, 'reward_std': 0.2601805552840233, 'kl': 1.6875, 'epoch': 0.49} 49%|████▉ | 2109/4286 [13:06:12<15:23:38, 25.46s/it] 49%|████▉ | 2110/4286 [13:06:34<14:40:02, 24.27s/it] {'loss': 0.1251, 'grad_norm': 8.445604933580158, 'learning_rate': 5.076994867008866e-07, 'completion_length': 196.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6175595223903656, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5818453431129456, 'reward_std': 0.2520708814263344, 'kl': 3.125, 'epoch': 0.49} 49%|████▉ | 2110/4286 [13:06:34<14:40:02, 24.27s/it] 49%|████▉ | 2111/4286 [13:06:55<14:04:42, 23.30s/it] {'loss': 0.1082, 'grad_norm': 11.946359348121192, 'learning_rate': 5.074661689220718e-07, 'completion_length': 165.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.5431547611951828, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4717262983322144, 'reward_std': 0.12071270495653152, 'kl': 2.7099609375, 'epoch': 0.49} 49%|████▉ | 2111/4286 [13:06:55<14:04:42, 23.30s/it] 49%|████▉ | 2112/4286 [13:07:17<13:52:12, 22.97s/it] {'loss': 0.0516, 'grad_norm': 11.515792334582377, 'learning_rate': 5.072328511432571e-07, 
'completion_length': 182.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5196428745985031, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5017857551574707, 'reward_std': 0.13977281004190445, 'kl': 1.2890625, 'epoch': 0.49} 49%|████▉ | 2112/4286 [13:07:17<13:52:12, 22.97s/it] 49%|████▉ | 2113/4286 [13:07:38<13:24:18, 22.21s/it] {'loss': 0.0613, 'grad_norm': 10.059803708663011, 'learning_rate': 5.069995333644424e-07, 'completion_length': 176.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5148809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4970239400863647, 'reward_std': 0.21271106600761414, 'kl': 1.525390625, 'epoch': 0.49} 49%|████▉ | 2113/4286 [13:07:38<13:24:18, 22.21s/it] 49%|████▉ | 2114/4286 [13:08:00<13:25:54, 22.26s/it] {'loss': 0.055, 'grad_norm': 4.459071946132872, 'learning_rate': 5.067662155856276e-07, 'completion_length': 194.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.0751119265332818, 'kl': 1.37109375, 'epoch': 0.49} 49%|████▉ | 2114/4286 [13:08:00<13:25:54, 22.26s/it] 49%|████▉ | 2115/4286 [13:08:21<13:12:10, 21.89s/it] {'loss': 0.0135, 'grad_norm': 2.7032081421179646, 'learning_rate': 5.065328978068128e-07, 'completion_length': 211.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.04007173143327236, 'kl': 0.3359375, 'epoch': 0.49} 49%|████▉ | 2115/4286 [13:08:21<13:12:10, 21.89s/it] 49%|████▉ | 2116/4286 [13:08:43<13:08:51, 21.81s/it] {'loss': 0.0574, 'grad_norm': 10.297340600934778, 'learning_rate': 5.062995800279981e-07, 'completion_length': 180.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.5302579700946808, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5124008059501648, 'reward_std': 0.1061508022248745, 'kl': 1.4326171875, 'epoch': 0.49} 49%|████▉ | 2116/4286 
[13:08:43<13:08:51, 21.81s/it][2025-03-02 18:16:21,850] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 49%|████▉ | 2117/4286 [13:09:06<13:24:43, 22.26s/it] {'loss': 0.0153, 'grad_norm': 8.740110851795203, 'learning_rate': 5.060662622491834e-07, 'completion_length': 186.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.647321492433548, 'rewards/format_reward': 1.0, 'reward': 1.6473215222358704, 'reward_std': 0.08067835960537195, 'kl': 0.380859375, 'epoch': 0.49} 49%|████▉ | 2117/4286 [13:09:06<13:24:43, 22.26s/it] 49%|████▉ | 2118/4286 [13:09:30<13:48:10, 22.92s/it] {'loss': 0.0118, 'grad_norm': 2.491097769620395, 'learning_rate': 5.058329444703686e-07, 'completion_length': 192.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.5699405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5520834922790527, 'reward_std': 0.05838929582387209, 'kl': 0.2958984375, 'epoch': 0.49} 49%|████▉ | 2118/4286 [13:09:30<13:48:10, 22.92s/it][2025-03-02 18:17:06,904] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 49%|████▉ | 2119/4286 [13:09:51<13:22:35, 22.22s/it] {'loss': 0.0436, 'grad_norm': 6.454871531443217, 'learning_rate': 5.055996266915539e-07, 'completion_length': 162.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.398809552192688, 'rewards/format_reward': 1.0, 'reward': 1.3988096714019775, 'reward_std': 0.0892857126891613, 'kl': 1.087890625, 'epoch': 0.49} 49%|████▉ | 2119/4286 [13:09:51<13:22:35, 22.22s/it] 49%|████▉ | 2120/4286 [13:10:13<13:24:38, 22.29s/it] {'loss': 0.0133, 'grad_norm': 15.017422849494464, 'learning_rate': 5.053663089127391e-07, 'completion_length': 203.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6101191639900208, 'reward_std': 0.09382909908890724, 'kl': 0.3330078125, 'epoch': 0.49} 49%|████▉ | 2120/4286 [13:10:13<13:24:38, 22.29s/it] 49%|████▉ | 2121/4286 [13:10:33<12:58:50, 21.58s/it] {'loss': 0.0528, 'grad_norm': 13.755811772391638, 'learning_rate': 5.051329911339243e-07, 'completion_length': 171.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5892857760190964, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.14905627444386482, 'kl': 1.318359375, 'epoch': 0.49} 49%|████▉ | 2121/4286 [13:10:33<12:58:50, 21.58s/it] 50%|████▉ | 2122/4286 [13:10:55<12:55:12, 21.49s/it] {'loss': 0.0193, 'grad_norm': 6.260491400237308, 'learning_rate': 5.048996733551097e-07, 'completion_length': 198.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.7026785612106323, 'rewards/format_reward': 1.0, 'reward': 1.702678620815277, 'reward_std': 0.12429671548306942, 'kl': 0.482421875, 'epoch': 0.5} 50%|████▉ | 2122/4286 [13:10:55<12:55:12, 21.49s/it] 50%|████▉ | 2123/4286 [13:11:16<12:49:24, 21.34s/it] {'loss': 0.0161, 'grad_norm': 
4.065575884282102, 'learning_rate': 5.046663555762949e-07, 'completion_length': 179.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6130952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.06609721668064594, 'kl': 0.40234375, 'epoch': 0.5} 50%|████▉ | 2123/4286 [13:11:16<12:49:24, 21.34s/it] 50%|████▉ | 2124/4286 [13:11:37<12:50:22, 21.38s/it] {'loss': 0.0283, 'grad_norm': 9.110786045276487, 'learning_rate': 5.044330377974801e-07, 'completion_length': 178.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.5163690894842148, 'rewards/format_reward': 1.0, 'reward': 1.5163691639900208, 'reward_std': 0.08898506313562393, 'kl': 0.7080078125, 'epoch': 0.5} 50%|████▉ | 2124/4286 [13:11:37<12:50:22, 21.38s/it] 50%|████▉ | 2125/4286 [13:12:00<13:04:31, 21.78s/it] {'loss': 0.054, 'grad_norm': 5.067575578179164, 'learning_rate': 5.041997200186654e-07, 'completion_length': 184.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.5394842028617859, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.521627128124237, 'reward_std': 0.11681405082345009, 'kl': 1.3515625, 'epoch': 0.5} 50%|████▉ | 2125/4286 [13:12:00<13:04:31, 21.78s/it] 50%|████▉ | 2126/4286 [13:12:20<12:51:14, 21.42s/it] {'loss': 0.0546, 'grad_norm': 6.051643934323573, 'learning_rate': 5.039664022398507e-07, 'completion_length': 171.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5252976715564728, 'rewards/format_reward': 1.0, 'reward': 1.52529776096344, 'reward_std': 0.0922619104385376, 'kl': 1.36328125, 'epoch': 0.5} 50%|████▉ | 2126/4286 [13:12:20<12:51:14, 21.42s/it] 50%|████▉ | 2127/4286 [13:12:42<12:47:08, 21.32s/it] {'loss': 0.0172, 'grad_norm': 2.5134888503305373, 'learning_rate': 5.037330844610359e-07, 'completion_length': 177.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.5193453133106232, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.501488208770752, 'reward_std': 0.05838929209858179, 'kl': 0.4296875, 
'epoch': 0.5} 50%|████▉ | 2127/4286 [13:12:42<12:47:08, 21.32s/it] 50%|████▉ | 2128/4286 [13:13:01<12:28:45, 20.82s/it] {'loss': 0.0299, 'grad_norm': 3.2554379187135445, 'learning_rate': 5.034997666822211e-07, 'completion_length': 167.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.68154776096344, 'reward_std': 0.04166666558012366, 'kl': 0.74609375, 'epoch': 0.5} 50%|████▉ | 2128/4286 [13:13:01<12:28:45, 20.82s/it] 50%|████▉ | 2129/4286 [13:13:22<12:24:22, 20.71s/it] {'loss': 0.0334, 'grad_norm': 1.5439987864204436, 'learning_rate': 5.032664489034064e-07, 'completion_length': 159.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.8359375, 'epoch': 0.5} 50%|████▉ | 2129/4286 [13:13:22<12:24:22, 20.71s/it][2025-03-02 18:20:55,012] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 50%|████▉ | 2130/4286 [13:13:39<11:49:35, 19.75s/it] {'loss': 0.0387, 'grad_norm': 5.203103500473575, 'learning_rate': 5.030331311245917e-07, 'completion_length': 159.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6800595372915268, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.008928571827709675, 'kl': 0.96875, 'epoch': 0.5} 50%|████▉ | 2130/4286 [13:13:39<11:49:35, 19.75s/it] 50%|████▉ | 2131/4286 [13:13:57<11:33:17, 19.30s/it] {'loss': 0.0108, 'grad_norm': 2.4298388060237843, 'learning_rate': 5.027998133457769e-07, 'completion_length': 155.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5699405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.552083432674408, 'reward_std': 0.10005595907568932, 'kl': 0.27001953125, 'epoch': 0.5} 50%|████▉ | 2131/4286 [13:13:57<11:33:17, 19.30s/it] 50%|████▉ | 2132/4286 [13:14:18<11:44:46, 19.63s/it] {'loss': 0.0335, 'grad_norm': 2.314651607676573, 'learning_rate': 5.025664955669622e-07, 'completion_length': 171.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.501488134264946, 'rewards/format_reward': 1.0, 'reward': 1.501488208770752, 'reward_std': 0.0982142835855484, 'kl': 0.8359375, 'epoch': 0.5} 50%|████▉ | 2132/4286 [13:14:18<11:44:46, 19.63s/it] 50%|████▉ | 2133/4286 [13:14:36<11:30:31, 19.24s/it] {'loss': 0.0416, 'grad_norm': 1.4987140935275514, 'learning_rate': 5.023331777881474e-07, 'completion_length': 156.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5863096117973328, 'reward_std': 0.11208160407841206, 'kl': 1.04052734375, 'epoch': 0.5} 50%|████▉ | 2133/4286 [13:14:36<11:30:31, 19.24s/it] 50%|████▉ | 2134/4286 [13:14:57<11:44:04, 19.63s/it] {'loss': 0.0281, 'grad_norm': 
250.75550667183055, 'learning_rate': 5.020998600093327e-07, 'completion_length': 173.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.5586309432983398, 'rewards/format_reward': 1.0, 'reward': 1.5586310625076294, 'reward_std': 0.08252713084220886, 'kl': 0.7041015625, 'epoch': 0.5} 50%|████▉ | 2134/4286 [13:14:57<11:44:04, 19.63s/it] 50%|████▉ | 2135/4286 [13:15:17<11:55:59, 19.97s/it] {'loss': 0.0101, 'grad_norm': 1.9494673928807036, 'learning_rate': 5.01866542230518e-07, 'completion_length': 159.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5833333730697632, 'reward_std': 0.09204822778701782, 'kl': 0.251953125, 'epoch': 0.5} 50%|████▉ | 2135/4286 [13:15:17<11:55:59, 19.97s/it] 50%|████▉ | 2136/4286 [13:15:36<11:43:25, 19.63s/it] {'loss': 0.0239, 'grad_norm': 3.3366936989697544, 'learning_rate': 5.016332244517032e-07, 'completion_length': 159.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.5763888657093048, 'rewards/format_reward': 1.0, 'reward': 1.576388955116272, 'reward_std': 0.046938207000494, 'kl': 0.5966796875, 'epoch': 0.5} 50%|████▉ | 2136/4286 [13:15:36<11:43:25, 19.63s/it] 50%|████▉ | 2137/4286 [13:15:55<11:28:12, 19.21s/it] {'loss': 0.0152, 'grad_norm': 2.462371862634917, 'learning_rate': 5.013999066728884e-07, 'completion_length': 173.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6223215162754059, 'rewards/format_reward': 1.0, 'reward': 1.6223215460777283, 'reward_std': 0.07041782326996326, 'kl': 0.37939453125, 'epoch': 0.5} 50%|████▉ | 2137/4286 [13:15:55<11:28:12, 19.21s/it] 50%|████▉ | 2138/4286 [13:16:13<11:24:06, 19.11s/it] {'loss': 0.0176, 'grad_norm': 4.436383738029039, 'learning_rate': 5.011665888940737e-07, 'completion_length': 162.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7023809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.11488383635878563, 'kl': 0.43994140625, 
'epoch': 0.5} 50%|████▉ | 2138/4286 [13:16:13<11:24:06, 19.11s/it] 50%|████▉ | 2139/4286 [13:16:34<11:37:40, 19.50s/it] {'loss': 0.0106, 'grad_norm': 3.9755575614574425, 'learning_rate': 5.00933271115259e-07, 'completion_length': 175.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6934524774551392, 'reward_std': 0.010309826582670212, 'kl': 0.265625, 'epoch': 0.5} 50%|████▉ | 2139/4286 [13:16:34<11:37:40, 19.50s/it] 50%|████▉ | 2140/4286 [13:16:50<11:03:08, 18.54s/it] {'loss': 0.0213, 'grad_norm': 1.2654125664722735, 'learning_rate': 5.006999533364442e-07, 'completion_length': 137.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.764881044626236, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.01785714365541935, 'kl': 0.53173828125, 'epoch': 0.5} 50%|████▉ | 2140/4286 [13:16:50<11:03:08, 18.54s/it] 50%|████▉ | 2141/4286 [13:17:09<11:03:17, 18.55s/it] {'loss': 0.0372, 'grad_norm': 5.920413752058222, 'learning_rate': 5.004666355576294e-07, 'completion_length': 157.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6383929252624512, 'reward_std': 0.0922619104385376, 'kl': 0.931640625, 'epoch': 0.5} 50%|████▉ | 2141/4286 [13:17:09<11:03:17, 18.55s/it] 50%|████▉ | 2142/4286 [13:17:25<10:39:10, 17.89s/it] {'loss': 0.0507, 'grad_norm': 3.9634259758972648, 'learning_rate': 5.002333177788148e-07, 'completion_length': 141.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.6517857909202576, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.16372354701161385, 'kl': 1.26953125, 'epoch': 0.5} 50%|████▉ | 2142/4286 [13:17:25<10:39:10, 17.89s/it] 50%|█████ | 2143/4286 [13:17:43<10:43:30, 18.02s/it] {'loss': 0.0319, 'grad_norm': 4.886626839074299, 'learning_rate': 5e-07, 'completion_length': 148.6964340209961, 'rewards/only_full_func_accuracy_reward': 
0.5997024178504944, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.563988208770752, 'reward_std': 0.13693781569600105, 'kl': 0.80078125, 'epoch': 0.5} 50%|█████ | 2143/4286 [13:17:43<10:43:30, 18.02s/it] 50%|█████ | 2144/4286 [13:18:03<11:03:15, 18.58s/it] {'loss': 0.0234, 'grad_norm': 6.761026125050649, 'learning_rate': 4.997666822211852e-07, 'completion_length': 168.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.65476194024086, 'rewards/format_reward': 1.0, 'reward': 1.6547620296478271, 'reward_std': 0.05633394047617912, 'kl': 0.58740234375, 'epoch': 0.5} 50%|█████ | 2144/4286 [13:18:03<11:03:15, 18.58s/it] 50%|█████ | 2145/4286 [13:18:24<11:27:14, 19.26s/it] {'loss': 0.0616, 'grad_norm': 1.7063220897742453, 'learning_rate': 4.995333644423705e-07, 'completion_length': 171.83928680419922, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.708333432674408, 'reward_std': 0.12773162126541138, 'kl': 1.5390625, 'epoch': 0.5} 50%|█████ | 2145/4286 [13:18:24<11:27:14, 19.26s/it] 50%|█████ | 2146/4286 [13:18:43<11:20:32, 19.08s/it] {'loss': 0.0108, 'grad_norm': 2.044987129672455, 'learning_rate': 4.993000466635557e-07, 'completion_length': 159.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6220238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6041668057441711, 'reward_std': 0.04602411389350891, 'kl': 0.26953125, 'epoch': 0.5} 50%|█████ | 2146/4286 [13:18:43<11:20:32, 19.08s/it] 50%|█████ | 2147/4286 [13:19:00<11:04:58, 18.65s/it] {'loss': 0.013, 'grad_norm': 1.560049224308445, 'learning_rate': 4.99066728884741e-07, 'completion_length': 151.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.7202381193637848, 'rewards/format_reward': 1.0, 'reward': 1.7202382683753967, 'reward_std': 0.07695359364151955, 'kl': 0.32373046875, 'epoch': 0.5} 50%|█████ | 2147/4286 [13:19:00<11:04:58, 18.65s/it] 50%|█████ | 2148/4286 [13:19:21<11:30:50, 19.39s/it] 
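A note on reading these records: in each step's metrics dict the logged `reward` is the sum of the two component rewards, `rewards/only_full_func_accuracy_reward` and `rewards/format_reward`, which can be checked against any record above. A minimal sketch of that check, using the values from step 2148's record (small float drift is expected since the values are logged in float32):

```python
# Sanity-check that the logged total reward equals the sum of its
# component rewards, using step 2148's record from this log.
record = {
    "rewards/only_full_func_accuracy_reward": 0.5576105415821075,
    "rewards/format_reward": 0.9821428656578064,
    "reward": 1.539753496646881,
}

component_sum = (
    record["rewards/only_full_func_accuracy_reward"]
    + record["rewards/format_reward"]
)

# float32 logging introduces drift well below 1e-3
assert abs(component_sum - record["reward"]) < 1e-3
```

The same relation holds for every step in this chunk, including the steps where `rewards/format_reward` drops below 1.0.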
{'loss': 0.0406, 'grad_norm': 18.74374576797804, 'learning_rate': 4.988334111059263e-07, 'completion_length': 171.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5576105415821075, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.539753496646881, 'reward_std': 0.11238371022045612, 'kl': 1.013671875, 'epoch': 0.5} 50%|█████ | 2148/4286 [13:19:21<11:30:50, 19.39s/it] 50%|█████ | 2149/4286 [13:19:42<11:38:27, 19.61s/it] {'loss': 0.0331, 'grad_norm': 5.740725927174286, 'learning_rate': 4.986000933271115e-07, 'completion_length': 188.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5720238387584686, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5541667938232422, 'reward_std': 0.05612075328826904, 'kl': 0.8271484375, 'epoch': 0.5} 50%|█████ | 2149/4286 [13:19:42<11:38:27, 19.61s/it] 50%|█████ | 2150/4286 [13:20:01<11:32:48, 19.46s/it] {'loss': 0.0233, 'grad_norm': 2.770831847126484, 'learning_rate': 4.983667755482967e-07, 'completion_length': 158.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.6190476417541504, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.09002592414617538, 'kl': 0.58056640625, 'epoch': 0.5} 50%|█████ | 2150/4286 [13:20:01<11:32:48, 19.46s/it] 50%|█████ | 2151/4286 [13:20:21<11:43:03, 19.76s/it] {'loss': 0.0102, 'grad_norm': 5.350213013415298, 'learning_rate': 4.98133457769482e-07, 'completion_length': 183.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.4255952835083008, 'rewards/format_reward': 1.0, 'reward': 1.4255953431129456, 'reward_std': 0.05706671625375748, 'kl': 0.25537109375, 'epoch': 0.5} 50%|█████ | 2151/4286 [13:20:21<11:43:03, 19.76s/it] 50%|█████ | 2152/4286 [13:20:38<11:13:42, 18.94s/it] {'loss': 0.0388, 'grad_norm': 5.017942506851833, 'learning_rate': 4.979001399906673e-07, 'completion_length': 167.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.6011904776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 
1.583333432674408, 'reward_std': 0.08739097788929939, 'kl': 0.970703125, 'epoch': 0.5} 50%|█████ | 2152/4286 [13:20:38<11:13:42, 18.94s/it] 50%|█████ | 2153/4286 [13:20:59<11:33:04, 19.50s/it] {'loss': 0.0661, 'grad_norm': 6.556169949279001, 'learning_rate': 4.976668222118525e-07, 'completion_length': 170.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.6488095819950104, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6130954027175903, 'reward_std': 0.0943833701312542, 'kl': 1.65625, 'epoch': 0.5} 50%|█████ | 2153/4286 [13:20:59<11:33:04, 19.50s/it] 50%|█████ | 2154/4286 [13:21:17<11:14:13, 18.97s/it] {'loss': 0.0576, 'grad_norm': 10.73125509682167, 'learning_rate': 4.974335044330377e-07, 'completion_length': 171.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6041668057441711, 'reward_std': 0.15752443298697472, 'kl': 1.443359375, 'epoch': 0.5} 50%|█████ | 2154/4286 [13:21:17<11:14:13, 18.97s/it] 50%|█████ | 2155/4286 [13:21:34<10:58:13, 18.53s/it] {'loss': 0.0329, 'grad_norm': 3.148893044385141, 'learning_rate': 4.972001866542231e-07, 'completion_length': 161.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.5476191192865372, 'rewards/format_reward': 1.0, 'reward': 1.547619104385376, 'reward_std': 0.09505747072398663, 'kl': 0.8232421875, 'epoch': 0.5} 50%|█████ | 2155/4286 [13:21:34<10:58:13, 18.53s/it] 50%|█████ | 2156/4286 [13:21:54<11:10:16, 18.88s/it] {'loss': 0.0537, 'grad_norm': 5.464395288721071, 'learning_rate': 4.969668688754083e-07, 'completion_length': 185.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.5726191252470016, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5369049310684204, 'reward_std': 0.11428571864962578, 'kl': 1.34765625, 'epoch': 0.5} 50%|█████ | 2156/4286 [13:21:54<11:10:16, 18.88s/it] 50%|█████ | 2157/4286 [13:22:12<11:01:49, 18.65s/it] {'loss': 0.108, 'grad_norm': 7.820852014983863, 'learning_rate': 
4.967335510965935e-07, 'completion_length': 177.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.6190476715564728, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.547619104385376, 'reward_std': 0.2277117446064949, 'kl': 2.6953125, 'epoch': 0.5} 50%|█████ | 2157/4286 [13:22:12<11:01:49, 18.65s/it] 50%|█████ | 2158/4286 [13:22:34<11:38:33, 19.70s/it] {'loss': 0.0337, 'grad_norm': 3.6645519189097815, 'learning_rate': 4.965002333177788e-07, 'completion_length': 194.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.623512089252472, 'reward_std': 0.14104852825403214, 'kl': 0.8388671875, 'epoch': 0.5} 50%|█████ | 2158/4286 [13:22:34<11:38:33, 19.70s/it] 50%|█████ | 2159/4286 [13:22:55<11:51:52, 20.08s/it] {'loss': 0.0648, 'grad_norm': 9.383845997711076, 'learning_rate': 4.962669155389641e-07, 'completion_length': 176.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.5853174924850464, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5674604177474976, 'reward_std': 0.1428571492433548, 'kl': 1.62109375, 'epoch': 0.5} 50%|█████ | 2159/4286 [13:22:55<11:51:52, 20.08s/it] 50%|█████ | 2160/4286 [13:23:16<11:56:18, 20.22s/it] {'loss': 0.105, 'grad_norm': 9.856905501422741, 'learning_rate': 4.960335977601493e-07, 'completion_length': 189.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.5025297701358795, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4846726655960083, 'reward_std': 0.21062549203634262, 'kl': 2.625, 'epoch': 0.5} 50%|█████ | 2160/4286 [13:23:16<11:56:18, 20.22s/it] 50%|█████ | 2161/4286 [13:23:35<11:44:22, 19.89s/it] {'loss': 0.0864, 'grad_norm': 5.590507463107153, 'learning_rate': 4.958002799813345e-07, 'completion_length': 168.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6770833432674408, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.641369104385376, 'reward_std': 0.1677745096385479, 'kl': 2.15625, 
'epoch': 0.5} 50%|█████ | 2161/4286 [13:23:35<11:44:22, 19.89s/it][2025-03-02 18:31:11,794] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 50%|█████ | 2162/4286 [13:23:56<11:56:37, 20.24s/it] {'loss': 0.0329, 'grad_norm': 7.408816024925081, 'learning_rate': 4.955669622025198e-07, 'completion_length': 197.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6565476655960083, 'rewards/format_reward': 1.0, 'reward': 1.6565477848052979, 'reward_std': 0.06682149693369865, 'kl': 0.8173828125, 'epoch': 0.5} 50%|█████ | 2162/4286 [13:23:56<11:56:37, 20.24s/it] 50%|█████ | 2163/4286 [13:24:17<12:06:16, 20.53s/it] {'loss': 0.0296, 'grad_norm': 4.330141665225299, 'learning_rate': 4.953336444237051e-07, 'completion_length': 177.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.5327381491661072, 'rewards/format_reward': 1.0, 'reward': 1.532738208770752, 'reward_std': 0.11426281556487083, 'kl': 0.740234375, 'epoch': 0.5} 50%|█████ | 2163/4286 [13:24:17<12:06:16, 20.53s/it] 50%|█████ | 2164/4286 [13:24:38<12:15:06, 20.79s/it] {'loss': 0.0453, 'grad_norm': 18.985083552940658, 'learning_rate': 4.951003266448903e-07, 'completion_length': 188.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.6517857909202576, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.10235805064439774, 'kl': 1.134765625, 'epoch': 0.5} 50%|█████ | 2164/4286 [13:24:38<12:15:06, 20.79s/it] 51%|█████ | 2165/4286 [13:25:00<12:18:20, 20.89s/it] {'loss': 0.0736, 'grad_norm': 11.647780675515913, 'learning_rate': 4.948670088660756e-07, 'completion_length': 186.1071548461914, 
'rewards/only_full_func_accuracy_reward': 0.5126488357782364, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4769346117973328, 'reward_std': 0.1918073557317257, 'kl': 1.84375, 'epoch': 0.51} 51%|█████ | 2165/4286 [13:25:00<12:18:20, 20.89s/it] 51%|█████ | 2166/4286 [13:25:20<12:11:03, 20.69s/it] {'loss': 0.0802, 'grad_norm': 12.269370393587609, 'learning_rate': 4.946336910872608e-07, 'completion_length': 183.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5791666805744171, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.561309576034546, 'reward_std': 0.12968577072024345, 'kl': 2.001953125, 'epoch': 0.51} 51%|█████ | 2166/4286 [13:25:20<12:11:03, 20.69s/it] 51%|█████ | 2167/4286 [13:25:43<12:33:59, 21.35s/it] {'loss': 0.0615, 'grad_norm': 10.567007458108888, 'learning_rate': 4.94400373308446e-07, 'completion_length': 190.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.5580357611179352, 'rewards/format_reward': 1.0, 'reward': 1.5580358505249023, 'reward_std': 0.07624644041061401, 'kl': 1.53515625, 'epoch': 0.51} 51%|█████ | 2167/4286 [13:25:43<12:33:59, 21.35s/it] 51%|█████ | 2168/4286 [13:26:06<12:56:08, 21.99s/it] {'loss': 0.0403, 'grad_norm': 4.1084607653099985, 'learning_rate': 4.941670555296314e-07, 'completion_length': 170.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.6220237910747528, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6041667461395264, 'reward_std': 0.1419577058404684, 'kl': 1.0068359375, 'epoch': 0.51} 51%|█████ | 2168/4286 [13:26:06<12:56:08, 21.99s/it] 51%|█████ | 2169/4286 [13:26:29<13:02:39, 22.18s/it] {'loss': 0.0868, 'grad_norm': 16.832461288802435, 'learning_rate': 4.939337377508166e-07, 'completion_length': 196.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.39196427166461945, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.338392972946167, 'reward_std': 0.16707567125558853, 'kl': 2.17578125, 'epoch': 0.51} 51%|█████ | 2169/4286 [13:26:29<13:02:39, 
22.18s/it] 51%|█████ | 2170/4286 [13:26:47<12:24:40, 21.12s/it] {'loss': 0.0289, 'grad_norm': 23.618711302297307, 'learning_rate': 4.937004199720018e-07, 'completion_length': 170.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.4910714775323868, 'rewards/format_reward': 1.0, 'reward': 1.4910715818405151, 'reward_std': 0.09665241092443466, 'kl': 0.7197265625, 'epoch': 0.51} 51%|█████ | 2170/4286 [13:26:47<12:24:40, 21.12s/it] 51%|█████ | 2171/4286 [13:27:08<12:17:49, 20.93s/it] {'loss': 0.0373, 'grad_norm': 4.120731659925786, 'learning_rate': 4.934671021931872e-07, 'completion_length': 174.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5511905401945114, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.533333420753479, 'reward_std': 0.07245405483990908, 'kl': 0.93017578125, 'epoch': 0.51} 51%|█████ | 2171/4286 [13:27:08<12:17:49, 20.93s/it] 51%|█████ | 2172/4286 [13:27:30<12:24:59, 21.14s/it] {'loss': 0.0084, 'grad_norm': 13.094977220902512, 'learning_rate': 4.932337844143724e-07, 'completion_length': 205.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6500000357627869, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6321429014205933, 'reward_std': 0.10453305207192898, 'kl': 0.2099609375, 'epoch': 0.51} 51%|█████ | 2172/4286 [13:27:30<12:24:59, 21.14s/it] 51%|█████ | 2173/4286 [13:27:49<12:05:06, 20.59s/it] {'loss': 0.0231, 'grad_norm': 7.040228159117024, 'learning_rate': 4.930004666355576e-07, 'completion_length': 164.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.7038690745830536, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.08789278380572796, 'kl': 0.5771484375, 'epoch': 0.51} 51%|█████ | 2173/4286 [13:27:49<12:05:06, 20.59s/it] 51%|█████ | 2174/4286 [13:28:11<12:20:06, 21.03s/it] {'loss': 0.0242, 'grad_norm': 4.819133108202985, 'learning_rate': 4.927671488567428e-07, 'completion_length': 216.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.46636906266212463, 
'rewards/format_reward': 1.0, 'reward': 1.4663691520690918, 'reward_std': 0.03291493561118841, 'kl': 0.60888671875, 'epoch': 0.51} 51%|█████ | 2174/4286 [13:28:11<12:20:06, 21.03s/it] 51%|█████ | 2175/4286 [13:28:31<12:06:21, 20.64s/it] {'loss': 0.0152, 'grad_norm': 1.6760602275932193, 'learning_rate': 4.925338310779281e-07, 'completion_length': 172.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6875001788139343, 'reward_std': 0.054813481867313385, 'kl': 0.38037109375, 'epoch': 0.51} 51%|█████ | 2175/4286 [13:28:31<12:06:21, 20.64s/it] 51%|█████ | 2176/4286 [13:28:49<11:45:41, 20.07s/it] {'loss': 0.0167, 'grad_norm': 1.3405173596886728, 'learning_rate': 4.923005132991134e-07, 'completion_length': 171.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.6220237910747528, 'rewards/format_reward': 1.0, 'reward': 1.6220239400863647, 'reward_std': 0.0416666679084301, 'kl': 0.4169921875, 'epoch': 0.51} 51%|█████ | 2176/4286 [13:28:49<11:45:41, 20.07s/it] 51%|█████ | 2177/4286 [13:29:10<11:52:43, 20.28s/it] {'loss': 0.0229, 'grad_norm': 1.502915086135611, 'learning_rate': 4.920671955202986e-07, 'completion_length': 189.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.08450091071426868, 'kl': 0.576171875, 'epoch': 0.51} 51%|█████ | 2177/4286 [13:29:10<11:52:43, 20.28s/it] 51%|█████ | 2178/4286 [13:29:31<11:53:40, 20.31s/it] {'loss': 0.0094, 'grad_norm': 2.4577695846111007, 'learning_rate': 4.918338777414839e-07, 'completion_length': 179.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.0357142873108387, 'kl': 0.234375, 'epoch': 0.51} 51%|█████ | 2178/4286 [13:29:31<11:53:40, 20.31s/it] 51%|█████ | 2179/4286 [13:29:53<12:18:30, 21.03s/it] {'loss': 0.0229, 'grad_norm': 5.384217476156773, 
'learning_rate': 4.916005599626691e-07, 'completion_length': 204.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.4434524327516556, 'rewards/format_reward': 1.0, 'reward': 1.4434524774551392, 'reward_std': 0.01969880983233452, 'kl': 0.56884765625, 'epoch': 0.51} 51%|█████ | 2179/4286 [13:29:53<12:18:30, 21.03s/it][2025-03-02 18:37:28,213] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 51%|█████ | 2180/4286 [13:30:12<11:57:11, 20.43s/it] {'loss': 0.0076, 'grad_norm': 1.5817285905828966, 'learning_rate': 4.913672421838544e-07, 'completion_length': 181.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5372024178504944, 'rewards/format_reward': 1.0, 'reward': 1.5372024774551392, 'reward_std': 0.038690478540956974, 'kl': 0.19140625, 'epoch': 0.51} 51%|█████ | 2180/4286 [13:30:12<11:57:11, 20.43s/it] 51%|█████ | 2181/4286 [13:30:34<12:05:42, 20.69s/it] {'loss': 0.0076, 'grad_norm': 1.2591149307561889, 'learning_rate': 4.911339244050397e-07, 'completion_length': 192.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041667461395264, 'reward_std': 0.045509777031838894, 'kl': 0.189453125, 'epoch': 0.51} 51%|█████ | 2181/4286 [13:30:34<12:05:42, 20.69s/it] 51%|█████ | 2182/4286 [13:30:56<12:27:09, 21.31s/it] {'loss': 0.0102, 'grad_norm': 3.2370628697804693, 'learning_rate': 4.909006066262249e-07, 'completion_length': 174.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.029761906247586012, 
'kl': 0.2548828125, 'epoch': 0.51} 51%|█████ | 2182/4286 [13:30:56<12:27:09, 21.31s/it] 51%|█████ | 2183/4286 [13:31:19<12:44:45, 21.82s/it] {'loss': 0.027, 'grad_norm': 2.455699237258455, 'learning_rate': 4.906672888474101e-07, 'completion_length': 197.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6775794327259064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6597222685813904, 'reward_std': 0.1253891885280609, 'kl': 0.67578125, 'epoch': 0.51} 51%|█████ | 2183/4286 [13:31:19<12:44:45, 21.82s/it][2025-03-02 18:38:57,373] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 51%|█████ | 2184/4286 [13:31:41<12:47:29, 21.91s/it] {'loss': 0.0135, 'grad_norm': 2.6513576508406933, 'learning_rate': 4.904339710685954e-07, 'completion_length': 201.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6220238208770752, 'rewards/format_reward': 1.0, 'reward': 1.6220239400863647, 'reward_std': 0.03847679682075977, 'kl': 0.337890625, 'epoch': 0.51} 51%|█████ | 2184/4286 [13:31:41<12:47:29, 21.91s/it] 51%|█████ | 2185/4286 [13:32:02<12:36:08, 21.59s/it] {'loss': 0.0268, 'grad_norm': 3.9698753635201482, 'learning_rate': 4.902006532897807e-07, 'completion_length': 191.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.65625, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.638392984867096, 'reward_std': 0.1279762089252472, 'kl': 0.67138671875, 'epoch': 0.51} 51%|█████ | 2185/4286 [13:32:02<12:36:08, 21.59s/it] 51%|█████ | 2186/4286 [13:32:20<11:55:35, 20.45s/it] {'loss': 0.0268, 'grad_norm': 4.534449707752623, 'learning_rate': 4.899673355109659e-07, 
'completion_length': 182.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.08172671496868134, 'kl': 0.671875, 'epoch': 0.51} 51%|█████ | 2186/4286 [13:32:20<11:55:35, 20.45s/it] 51%|█████ | 2187/4286 [13:32:39<11:43:58, 20.12s/it] {'loss': 0.0312, 'grad_norm': 1.9367551577544027, 'learning_rate': 4.897340177321511e-07, 'completion_length': 172.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 1.0, 'reward': 1.6026787161827087, 'reward_std': 0.07440476305782795, 'kl': 0.78125, 'epoch': 0.51} 51%|█████ | 2187/4286 [13:32:39<11:43:58, 20.12s/it] 51%|█████ | 2188/4286 [13:32:59<11:34:05, 19.85s/it] {'loss': 0.0283, 'grad_norm': 1.993571482359552, 'learning_rate': 4.895006999533365e-07, 'completion_length': 171.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.10307802259922028, 'kl': 0.7099609375, 'epoch': 0.51} 51%|█████ | 2188/4286 [13:32:59<11:34:05, 19.85s/it] 51%|█████ | 2189/4286 [13:33:19<11:36:42, 19.93s/it] {'loss': 0.0264, 'grad_norm': 8.243891343399964, 'learning_rate': 4.892673821745217e-07, 'completion_length': 188.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.14083484560251236, 'kl': 0.6591796875, 'epoch': 0.51} 51%|█████ | 2189/4286 [13:33:19<11:36:42, 19.93s/it] 51%|█████ | 2190/4286 [13:33:42<12:08:44, 20.86s/it] {'loss': 0.0365, 'grad_norm': 2.498638929253708, 'learning_rate': 4.890340643957069e-07, 'completion_length': 198.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.12068402953445911, 'kl': 0.9140625, 'epoch': 0.51} 51%|█████ | 2190/4286 
[13:33:42<12:08:44, 20.86s/it] 51%|█████ | 2191/4286 [13:34:00<11:36:44, 19.95s/it] {'loss': 0.0552, 'grad_norm': 8.879559487981163, 'learning_rate': 4.888007466168922e-07, 'completion_length': 161.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5758929252624512, 'reward_std': 0.13639339059591293, 'kl': 1.37890625, 'epoch': 0.51} 51%|█████ | 2191/4286 [13:34:00<11:36:44, 19.95s/it] 51%|█████ | 2192/4286 [13:34:20<11:35:26, 19.93s/it] {'loss': 0.0498, 'grad_norm': 1.8852801215853334, 'learning_rate': 4.885674288380775e-07, 'completion_length': 205.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.10554791986942291, 'kl': 1.2421875, 'epoch': 0.51} 51%|█████ | 2192/4286 [13:34:20<11:35:26, 19.93s/it] 51%|█████ | 2193/4286 [13:34:39<11:30:09, 19.78s/it] {'loss': 0.0458, 'grad_norm': 1.4816325031693565, 'learning_rate': 4.883341110592627e-07, 'completion_length': 186.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5223214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5044644474983215, 'reward_std': 0.12059662211686373, 'kl': 1.14697265625, 'epoch': 0.51} 51%|█████ | 2193/4286 [13:34:39<11:30:09, 19.78s/it] 51%|█████ | 2194/4286 [13:35:00<11:45:06, 20.22s/it] {'loss': 0.0134, 'grad_norm': 13.486142006694863, 'learning_rate': 4.88100793280448e-07, 'completion_length': 183.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6369048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6369049549102783, 'reward_std': 0.0902065597474575, 'kl': 0.333984375, 'epoch': 0.51} 51%|█████ | 2194/4286 [13:35:00<11:45:06, 20.22s/it] 51%|█████ | 2195/4286 [13:35:21<11:52:40, 20.45s/it] {'loss': 0.1072, 'grad_norm': 9.987876606766257, 'learning_rate': 4.878674755016332e-07, 'completion_length': 200.83929443359375, 'rewards/only_full_func_accuracy_reward': 
0.5000000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.06279663741588593, 'kl': 2.671875, 'epoch': 0.51} 51%|█████ | 2195/4286 [13:35:21<11:52:40, 20.45s/it] 51%|█████ | 2196/4286 [13:35:41<11:47:59, 20.33s/it] {'loss': 0.0729, 'grad_norm': 1.7890545189180471, 'learning_rate': 4.876341577228184e-07, 'completion_length': 187.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.5744048058986664, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.485119104385376, 'reward_std': 0.2263554334640503, 'kl': 1.82177734375, 'epoch': 0.51} 51%|█████ | 2196/4286 [13:35:41<11:47:59, 20.33s/it] 51%|█████▏ | 2197/4286 [13:36:03<11:59:28, 20.66s/it] {'loss': 0.0339, 'grad_norm': 1.567410621090699, 'learning_rate': 4.874008399440037e-07, 'completion_length': 197.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.648809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6309524774551392, 'reward_std': 0.11493691429495811, 'kl': 0.84375, 'epoch': 0.51} 51%|█████▏ | 2197/4286 [13:36:03<11:59:28, 20.66s/it] 51%|█████▏ | 2198/4286 [13:36:24<12:05:16, 20.84s/it] {'loss': 0.0518, 'grad_norm': 26.126355810992774, 'learning_rate': 4.87167522165189e-07, 'completion_length': 191.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5982143878936768, 'reward_std': 0.19383099675178528, 'kl': 1.296875, 'epoch': 0.51} 51%|█████▏ | 2198/4286 [13:36:24<12:05:16, 20.84s/it] 51%|█████▏ | 2199/4286 [13:36:48<12:39:17, 21.83s/it] {'loss': 0.0275, 'grad_norm': 2.523692416667399, 'learning_rate': 4.869342043863742e-07, 'completion_length': 200.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5818453431129456, 'reward_std': 0.09637044090777636, 'kl': 0.68505859375, 'epoch': 0.51} 51%|█████▏ | 2199/4286 [13:36:48<12:39:17, 21.83s/it] 51%|█████▏ | 
2200/4286 [13:37:11<12:46:52, 22.06s/it] {'loss': 0.0306, 'grad_norm': 1.1798706876670655, 'learning_rate': 4.867008866075594e-07, 'completion_length': 203.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.7083333134651184, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.052550166845321655, 'kl': 0.763671875, 'epoch': 0.51} 51%|█████▏ | 2200/4286 [13:37:11<12:46:52, 22.06s/it] 51%|█████▏ | 2201/4286 [13:41:47<56:53:39, 98.23s/it] {'loss': 0.0296, 'grad_norm': 0.6448383605559854, 'learning_rate': 4.864675688287448e-07, 'completion_length': 205.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5660714507102966, 'rewards/format_reward': 1.0, 'reward': 1.5660715103149414, 'reward_std': 0.028981779236346483, 'kl': 0.744140625, 'epoch': 0.51} 51%|█████▏ | 2201/4286 [13:41:47<56:53:39, 98.23s/it] 51%|█████▏ | 2202/4286 [13:42:09<43:41:48, 75.48s/it] {'loss': 0.0806, 'grad_norm': 11.72858880560967, 'learning_rate': 4.8623425104993e-07, 'completion_length': 213.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.4508928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4330357909202576, 'reward_std': 0.14209692552685738, 'kl': 2.015625, 'epoch': 0.51} 51%|█████▏ | 2202/4286 [13:42:09<43:41:48, 75.48s/it] 51%|█████▏ | 2203/4286 [13:42:30<34:13:26, 59.15s/it] {'loss': 0.0644, 'grad_norm': 6.118275865730042, 'learning_rate': 4.860009332711152e-07, 'completion_length': 187.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.7008928954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6651787161827087, 'reward_std': 0.2422073371708393, 'kl': 1.611328125, 'epoch': 0.51} 51%|█████▏ | 2203/4286 [13:42:30<34:13:26, 59.15s/it] 51%|█████▏ | 2204/4286 [13:42:53<27:51:44, 48.18s/it] {'loss': 0.0921, 'grad_norm': 2.6308039059733455, 'learning_rate': 4.857676154923005e-07, 'completion_length': 202.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4821428805589676, 
'rewards/format_reward': 1.0, 'reward': 1.4821430444717407, 'reward_std': 0.11986265704035759, 'kl': 2.3046875, 'epoch': 0.51} 51%|█████▏ | 2204/4286 [13:42:53<27:51:44, 48.18s/it] 51%|█████▏ | 2205/4286 [13:43:15<23:19:35, 40.35s/it] {'loss': 0.0294, 'grad_norm': 4.101366651705674, 'learning_rate': 4.855342977134858e-07, 'completion_length': 222.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.09539202600717545, 'kl': 0.73486328125, 'epoch': 0.51} 51%|█████▏ | 2205/4286 [13:43:15<23:19:35, 40.35s/it] 51%|█████▏ | 2206/4286 [13:43:37<20:09:14, 34.88s/it] {'loss': 0.0432, 'grad_norm': 4.32679941998436, 'learning_rate': 4.85300979934671e-07, 'completion_length': 214.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5104166865348816, 'rewards/format_reward': 1.0, 'reward': 1.5104168057441711, 'reward_std': 0.11781267449259758, 'kl': 1.078125, 'epoch': 0.51} 51%|█████▏ | 2206/4286 [13:43:37<20:09:14, 34.88s/it] 51%|█████▏ | 2207/4286 [13:44:01<18:15:01, 31.60s/it] {'loss': 0.0512, 'grad_norm': 4.445583729175563, 'learning_rate': 4.850676621558562e-07, 'completion_length': 212.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.4732143431901932, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4553572535514832, 'reward_std': 0.1132563091814518, 'kl': 1.28125, 'epoch': 0.51} 51%|█████▏ | 2207/4286 [13:44:01<18:15:01, 31.60s/it] 52%|█████▏ | 2208/4286 [13:44:21<16:20:22, 28.31s/it] {'loss': 0.0503, 'grad_norm': 2.6402174898152153, 'learning_rate': 4.848343443770415e-07, 'completion_length': 198.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.56101194024086, 'rewards/format_reward': 1.0, 'reward': 1.5610119700431824, 'reward_std': 0.08097816817462444, 'kl': 1.255859375, 'epoch': 0.52} 52%|█████▏ | 2208/4286 [13:44:21<16:20:22, 28.31s/it] 52%|█████▏ | 2209/4286 [13:44:45<15:30:31, 26.88s/it] {'loss': 0.0087, 'grad_norm': 
1.7293250316162125, 'learning_rate': 4.846010265982268e-07, 'completion_length': 199.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.7053572237491608, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.07954214885830879, 'kl': 0.2177734375, 'epoch': 0.52} 52%|█████▏ | 2209/4286 [13:44:45<15:30:31, 26.88s/it] 52%|█████▏ | 2210/4286 [13:45:06<14:27:55, 25.08s/it] {'loss': 0.0071, 'grad_norm': 3.5725396964577927, 'learning_rate': 4.84367708819412e-07, 'completion_length': 180.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.553571492433548, 'rewards/format_reward': 1.0, 'reward': 1.5535715818405151, 'reward_std': 0.011904759332537651, 'kl': 0.1767578125, 'epoch': 0.52} 52%|█████▏ | 2210/4286 [13:45:06<14:27:55, 25.08s/it][2025-03-02 18:52:45,797] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 52%|█████▏ | 2211/4286 [13:45:30<14:16:10, 24.76s/it] {'loss': 0.0205, 'grad_norm': 11.340188717091909, 'learning_rate': 4.841343910405973e-07, 'completion_length': 223.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.680059552192688, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.08169200271368027, 'kl': 0.5126953125, 'epoch': 0.52} 52%|█████▏ | 2211/4286 [13:45:30<14:16:10, 24.76s/it] 52%|█████▏ | 2212/4286 [13:45:53<14:00:14, 24.31s/it] {'loss': 0.0705, 'grad_norm': 6.128161988153069, 'learning_rate': 4.839010732617825e-07, 'completion_length': 232.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.3407738208770752, 'rewards/format_reward': 1.0, 'reward': 1.3407739400863647, 'reward_std': 0.08524864166975021, 'kl': 1.755859375, 'epoch': 0.52} 52%|█████▏ | 2212/4286 [13:45:53<14:00:14, 24.31s/it] 52%|█████▏ | 2213/4286 [13:46:16<13:43:02, 23.82s/it] {'loss': 0.0067, 'grad_norm': 0.640209736333033, 'learning_rate': 4.836677554829678e-07, 'completion_length': 205.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.020326515659689903, 'kl': 0.16650390625, 'epoch': 0.52} 52%|█████▏ | 2213/4286 [13:46:16<13:43:02, 23.82s/it] 52%|█████▏ | 2214/4286 [13:46:38<13:25:03, 23.31s/it] {'loss': 0.02, 'grad_norm': 1.260103962378728, 'learning_rate': 4.834344377041531e-07, 'completion_length': 212.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5273809880018234, 'rewards/format_reward': 1.0, 'reward': 1.5273811221122742, 'reward_std': 0.05975732812657952, 'kl': 0.50048828125, 'epoch': 0.52} 52%|█████▏ | 2214/4286 [13:46:38<13:25:03, 23.31s/it] 52%|█████▏ | 2215/4286 [13:47:01<13:19:24, 23.16s/it] {'loss': 0.0255, 'grad_norm': 
24.803448267644946, 'learning_rate': 4.832011199253383e-07, 'completion_length': 214.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6160715222358704, 'reward_std': 0.10475356131792068, 'kl': 0.6357421875, 'epoch': 0.52} 52%|█████▏ | 2215/4286 [13:47:01<13:19:24, 23.16s/it][2025-03-02 18:54:38,973] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. 52%|█████▏ | 2216/4286 [13:47:23<13:10:07, 22.90s/it] {'loss': 0.0097, 'grad_norm': 0.8019258893862682, 'learning_rate': 4.829678021465235e-07, 'completion_length': 202.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.523809552192688, 'rewards/format_reward': 1.0, 'reward': 1.5238096714019775, 'reward_std': 0.0, 'kl': 0.2421875, 'epoch': 0.52} 52%|█████▏ | 2216/4286 [13:47:23<13:10:07, 22.90s/it] 52%|█████▏ | 2217/4286 [13:47:46<13:08:14, 22.86s/it] {'loss': 0.0096, 'grad_norm': 1.7681873543434365, 'learning_rate': 4.827344843677089e-07, 'completion_length': 241.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4523809999227524, 'rewards/format_reward': 1.0, 'reward': 1.4523810148239136, 'reward_std': 0.14083484560251236, 'kl': 0.23974609375, 'epoch': 0.52} 52%|█████▏ | 2217/4286 [13:47:46<13:08:14, 22.86s/it] 52%|█████▏ | 2218/4286 [13:48:09<13:08:34, 22.88s/it] {'loss': 0.0079, 'grad_norm': 0.8583668060994039, 'learning_rate': 4.825011665888941e-07, 'completion_length': 201.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.59077388048172, 'rewards/format_reward': 1.0, 'reward': 1.5907739400863647, 'reward_std':
0.04740536957979202, 'kl': 0.197265625, 'epoch': 0.52} 52%|█████▏ | 2218/4286 [13:48:09<13:08:34, 22.88s/it] 52%|█████▏ | 2219/4286 [13:48:29<12:39:40, 22.05s/it] {'loss': 0.0069, 'grad_norm': 2.051956830095208, 'learning_rate': 4.822678488100793e-07, 'completion_length': 181.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5342262387275696, 'rewards/format_reward': 1.0, 'reward': 1.5342263579368591, 'reward_std': 0.020833336748182774, 'kl': 0.17236328125, 'epoch': 0.52} 52%|█████▏ | 2219/4286 [13:48:29<12:39:40, 22.05s/it] 52%|█████▏ | 2220/4286 [13:48:49<12:22:08, 21.55s/it] {'loss': 0.0125, 'grad_norm': 2.296466226809981, 'learning_rate': 4.820345310312645e-07, 'completion_length': 200.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5044643133878708, 'rewards/format_reward': 1.0, 'reward': 1.504464328289032, 'reward_std': 0.019238397479057312, 'kl': 0.31201171875, 'epoch': 0.52} 52%|█████▏ | 2220/4286 [13:48:49<12:22:08, 21.55s/it] 52%|█████▏ | 2221/4286 [13:49:12<12:38:02, 22.03s/it] {'loss': 0.0177, 'grad_norm': 3.7128104681456198, 'learning_rate': 4.818012132524499e-07, 'completion_length': 217.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.4345238357782364, 'rewards/format_reward': 1.0, 'reward': 1.4345239400863647, 'reward_std': 0.07142857275903225, 'kl': 0.43994140625, 'epoch': 0.52} 52%|█████▏ | 2221/4286 [13:49:12<12:38:02, 22.03s/it] 52%|█████▏ | 2222/4286 [13:49:34<12:38:04, 22.04s/it] {'loss': 0.0105, 'grad_norm': 1.198002214135794, 'learning_rate': 4.815678954736351e-07, 'completion_length': 214.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6130952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.03068273840472102, 'kl': 0.263671875, 'epoch': 0.52} 52%|█████▏ | 2222/4286 [13:49:34<12:38:04, 22.04s/it] 52%|█████▏ | 2223/4286 [13:49:56<12:35:35, 21.98s/it] {'loss': 0.0086, 'grad_norm': 0.9354558880718421, 'learning_rate': 4.813345776948203e-07, 'completion_length': 
208.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.680059552192688, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.06424124538898468, 'kl': 0.21435546875, 'epoch': 0.52} 52%|█████▏ | 2223/4286 [13:49:56<12:35:35, 21.98s/it] 52%|█████▏ | 2224/4286 [13:50:21<12:59:12, 22.67s/it] {'loss': 0.0077, 'grad_norm': 1.939403699795436, 'learning_rate': 4.811012599160056e-07, 'completion_length': 225.64287567138672, 'rewards/only_full_func_accuracy_reward': 0.4869047850370407, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4690477848052979, 'reward_std': 0.07310966216027737, 'kl': 0.19384765625, 'epoch': 0.52} 52%|█████▏ | 2224/4286 [13:50:21<12:59:12, 22.67s/it] 52%|█████▏ | 2225/4286 [13:50:45<13:11:48, 23.05s/it] {'loss': 0.007, 'grad_norm': 1.971830410789771, 'learning_rate': 4.808679421371908e-07, 'completion_length': 246.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6145834028720856, 'rewards/format_reward': 1.0, 'reward': 1.614583432674408, 'reward_std': 0.053517505526542664, 'kl': 0.17529296875, 'epoch': 0.52} 52%|█████▏ | 2225/4286 [13:50:45<13:11:48, 23.05s/it] 52%|█████▏ | 2226/4286 [13:51:07<13:08:08, 22.96s/it] {'loss': 0.0249, 'grad_norm': 2.735922241454581, 'learning_rate': 4.806346243583761e-07, 'completion_length': 235.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6417658925056458, 'rewards/format_reward': 1.0, 'reward': 1.6417659521102905, 'reward_std': 0.07067359425127506, 'kl': 0.625, 'epoch': 0.52} 52%|█████▏ | 2226/4286 [13:51:07<13:08:08, 22.96s/it] 52%|█████▏ | 2227/4286 [13:51:31<13:13:05, 23.11s/it] {'loss': 0.0164, 'grad_norm': 1.4491936990804575, 'learning_rate': 4.804013065795614e-07, 'completion_length': 218.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.07876220718026161, 'kl': 0.4091796875, 'epoch': 0.52} 52%|█████▏ | 2227/4286 [13:51:31<13:13:05, 23.11s/it] 
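The 'learning_rate' column in these steps falls by a constant ≈2.333e-9 per step, consistent with a linear decay to zero from an assumed peak of 1e-6 over the 4286 total steps shown in the progress bar. A minimal sketch under those assumptions (the function name and the peak/total values are inferred from the log, not confirmed by the trainer config):

```python
# Sketch only, not the trainer's actual scheduler: a linear decay whose
# values match the logged 'learning_rate' column.
# base_lr=1e-6 and total_steps=4286 are inferred from the log.
def linear_lr(step: int, base_lr: float = 1e-6, total_steps: int = 4286) -> float:
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    return base_lr * (1.0 - step / total_steps)

# Step 2224 above logs 'learning_rate': 4.811012599160056e-07.
lr_2224 = linear_lr(2224)
```

For example, linear_lr(2205) reproduces the logged 4.855342977134858e-07 at step 2205 to double precision.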
52%|█████▏ | 2228/4286 [13:51:54<13:12:52, 23.12s/it] {'loss': 0.0075, 'grad_norm': 1.8509869191137842, 'learning_rate': 4.801679888007466e-07, 'completion_length': 228.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5558035969734192, 'rewards/format_reward': 1.0, 'reward': 1.5558037161827087, 'reward_std': 0.07776311784982681, 'kl': 0.18798828125, 'epoch': 0.52} 52%|█████▏ | 2228/4286 [13:51:54<13:12:52, 23.12s/it] 52%|█████▏ | 2229/4286 [13:52:15<12:51:02, 22.49s/it] {'loss': 0.0593, 'grad_norm': 3.002633830817214, 'learning_rate': 4.799346710219318e-07, 'completion_length': 217.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6126701235771179, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5948129892349243, 'reward_std': 0.13620383106172085, 'kl': 1.47900390625, 'epoch': 0.52} 52%|█████▏ | 2229/4286 [13:52:15<12:51:02, 22.49s/it] 52%|█████▏ | 2230/4286 [13:52:38<12:59:42, 22.75s/it] {'loss': 0.0426, 'grad_norm': 3.2242778715469065, 'learning_rate': 4.797013532431171e-07, 'completion_length': 223.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5900298058986664, 'rewards/format_reward': 1.0, 'reward': 1.5900299549102783, 'reward_std': 0.1583118736743927, 'kl': 1.06640625, 'epoch': 0.52} 52%|█████▏ | 2230/4286 [13:52:38<12:59:42, 22.75s/it] 52%|█████▏ | 2231/4286 [13:53:01<12:58:33, 22.73s/it] {'loss': 0.0461, 'grad_norm': 7.319219832832849, 'learning_rate': 4.794680354643024e-07, 'completion_length': 218.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.508928582072258, 'rewards/format_reward': 1.0, 'reward': 1.508928656578064, 'reward_std': 0.14351791143417358, 'kl': 1.1552734375, 'epoch': 0.52} 52%|█████▏ | 2231/4286 [13:53:01<12:58:33, 22.73s/it] 52%|█████▏ | 2232/4286 [13:53:24<12:59:04, 22.76s/it] {'loss': 0.0719, 'grad_norm': 8.021553278279608, 'learning_rate': 4.792347176854876e-07, 'completion_length': 222.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5565476417541504, 
'rewards/format_reward': 1.0, 'reward': 1.5565477013587952, 'reward_std': 0.26372314989566803, 'kl': 1.80078125, 'epoch': 0.52} 52%|█████▏ | 2232/4286 [13:53:24<12:59:04, 22.76s/it] 52%|█████▏ | 2233/4286 [13:53:46<12:55:41, 22.67s/it] {'loss': 0.0779, 'grad_norm': 4.328222896732235, 'learning_rate': 4.790013999066728e-07, 'completion_length': 221.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.516964316368103, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4991072416305542, 'reward_std': 0.24335749447345734, 'kl': 1.9453125, 'epoch': 0.52} 52%|█████▏ | 2233/4286 [13:53:46<12:55:41, 22.67s/it] 52%|█████▏ | 2234/4286 [13:54:09<12:52:06, 22.58s/it] {'loss': 0.1312, 'grad_norm': 4.991921807437089, 'learning_rate': 4.787680821278582e-07, 'completion_length': 195.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5806547999382019, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5449406504631042, 'reward_std': 0.22835776954889297, 'kl': 3.28125, 'epoch': 0.52} 52%|█████▏ | 2234/4286 [13:54:09<12:52:06, 22.58s/it] 52%|█████▏ | 2235/4286 [13:54:34<13:22:08, 23.47s/it] {'loss': 0.1147, 'grad_norm': 31.51244229751156, 'learning_rate': 4.785347643490434e-07, 'completion_length': 214.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5166667103767395, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4809524416923523, 'reward_std': 0.3022581711411476, 'kl': 2.8671875, 'epoch': 0.52} 52%|█████▏ | 2235/4286 [13:54:34<13:22:08, 23.47s/it] 52%|█████▏ | 2236/4286 [13:54:58<13:21:42, 23.46s/it] {'loss': 0.0487, 'grad_norm': 11.332142757772374, 'learning_rate': 4.783014465702286e-07, 'completion_length': 192.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6488095223903656, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6309524774551392, 'reward_std': 0.08642552047967911, 'kl': 1.216796875, 'epoch': 0.52} 52%|█████▏ | 2236/4286 [13:54:58<13:21:42, 23.46s/it] 52%|█████▏ | 2237/4286 [13:55:20<13:12:10, 
23.20s/it] {'loss': 0.093, 'grad_norm': 9.834234616644572, 'learning_rate': 4.780681287914139e-07, 'completion_length': 206.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5639881193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5461310744285583, 'reward_std': 0.1417226307094097, 'kl': 2.3203125, 'epoch': 0.52} 52%|█████▏ | 2237/4286 [13:55:20<13:12:10, 23.20s/it] 52%|█████▏ | 2238/4286 [13:55:42<12:58:05, 22.80s/it] {'loss': 0.0686, 'grad_norm': 2.2609913153579777, 'learning_rate': 4.778348110125992e-07, 'completion_length': 229.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6495535969734192, 'rewards/format_reward': 1.0, 'reward': 1.6495535969734192, 'reward_std': 0.14061707630753517, 'kl': 1.7109375, 'epoch': 0.52} 52%|█████▏ | 2238/4286 [13:55:42<12:58:05, 22.80s/it] 52%|█████▏ | 2239/4286 [13:56:05<12:58:56, 22.83s/it] {'loss': 0.103, 'grad_norm': 6.557688759663844, 'learning_rate': 4.776014932337844e-07, 'completion_length': 215.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5558035969734192, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4843751192092896, 'reward_std': 0.2612232342362404, 'kl': 2.5703125, 'epoch': 0.52} 52%|█████▏ | 2239/4286 [13:56:05<12:58:56, 22.83s/it] 52%|█████▏ | 2240/4286 [13:56:28<12:55:57, 22.76s/it] {'loss': 0.0883, 'grad_norm': 4.7357134170041375, 'learning_rate': 4.773681754549697e-07, 'completion_length': 238.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.4776786118745804, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4419644474983215, 'reward_std': 0.18896129727363586, 'kl': 2.21484375, 'epoch': 0.52} 52%|█████▏ | 2240/4286 [13:56:28<12:55:57, 22.76s/it] 52%|█████▏ | 2241/4286 [13:56:50<12:50:53, 22.62s/it] {'loss': 0.0869, 'grad_norm': 3.0244671812565214, 'learning_rate': 4.771348576761549e-07, 'completion_length': 222.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.486607164144516, 'rewards/format_reward': 0.9464285969734192, 
'reward': 1.4330357909202576, 'reward_std': 0.21172397211194038, 'kl': 2.1640625, 'epoch': 0.52} 52%|█████▏ | 2241/4286 [13:56:50<12:50:53, 22.62s/it][2025-03-02 19:04:29,873] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. 52%|█████▏ | 2242/4286 [13:57:14<13:06:22, 23.08s/it] {'loss': 0.0476, 'grad_norm': 11.008832297024284, 'learning_rate': 4.769015398973402e-07, 'completion_length': 221.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.5491071939468384, 'rewards/format_reward': 1.0, 'reward': 1.549107313156128, 'reward_std': 0.09332683682441711, 'kl': 1.189453125, 'epoch': 0.52} 52%|█████▏ | 2242/4286 [13:57:14<13:06:22, 23.08s/it] 52%|█████▏ | 2243/4286 [13:57:37<13:07:57, 23.14s/it] {'loss': 0.0207, 'grad_norm': 1.9694668459205107, 'learning_rate': 4.7666822211852543e-07, 'completion_length': 250.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.7514881193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7336309552192688, 'reward_std': 0.09226189740002155, 'kl': 0.51953125, 'epoch': 0.52} 52%|█████▏ | 2243/4286 [13:57:37<13:07:57, 23.14s/it] 52%|█████▏ | 2244/4286 [13:58:02<13:27:07, 23.72s/it] {'loss': 0.038, 'grad_norm': 2.0710568412892956, 'learning_rate': 4.7643490433971065e-07, 'completion_length': 266.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.484747052192688, 'rewards/format_reward': 1.0, 'reward': 1.4847471117973328, 'reward_std': 0.0657635722309351, 'kl': 0.947265625, 'epoch': 0.52} 52%|█████▏ | 2244/4286 [13:58:02<13:27:07, 23.72s/it][2025-03-02 19:05:44,654] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. 52%|█████▏ | 2245/4286 [13:58:29<13:54:37, 24.54s/it] {'loss': 0.0437, 'grad_norm': 1.7736550716795083, 'learning_rate': 4.762015865608959e-07, 'completion_length': 249.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5580358505249023, 'reward_std': 0.18855422735214233, 'kl': 1.091796875, 'epoch': 0.52} 52%|█████▏ | 2245/4286 [13:58:29<13:54:37, 24.54s/it][2025-03-02 19:06:06,452] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. 52%|█████▏ | 2246/4286 [13:58:51<13:26:17, 23.71s/it] {'loss': 0.0517, 'grad_norm': 6.007423273370234, 'learning_rate': 4.759682687820812e-07, 'completion_length': 236.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6793367266654968, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6436224579811096, 'reward_std': 0.1747496798634529, 'kl': 1.29296875, 'epoch': 0.52} 52%|█████▏ | 2246/4286 [13:58:51<13:26:17, 23.71s/it] 52%|█████▏ | 2247/4286 [13:59:13<13:09:32, 23.23s/it] {'loss': 0.0097, 'grad_norm': 1.808400158000285, 'learning_rate': 4.757349510032664e-07, 'completion_length': 231.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6375425457954407, 'rewards/format_reward': 1.0, 'reward': 1.6375426054000854, 'reward_std': 0.07946353405714035, 'kl': 0.2421875, 'epoch': 0.52} 52%|█████▏ | 2247/4286 [13:59:13<13:09:32, 23.23s/it] 52%|█████▏ | 2248/4286 [13:59:35<12:55:08, 22.82s/it] {'loss': 0.0373, 'grad_norm': 1.0626780300533212, 'learning_rate': 4.755016332244517e-07, 'completion_length': 231.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6160715222358704, 'reward_std': 0.1028318889439106, 'kl': 0.93505859375, 'epoch': 0.52} 52%|█████▏ | 2248/4286 [13:59:35<12:55:08, 22.82s/it] 52%|█████▏ | 2249/4286 [13:59:56<12:37:17, 22.31s/it] {'loss': 0.0231, 'grad_norm': 1.2214428126561485, 'learning_rate': 4.7526831544563697e-07, 'completion_length': 202.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.7056548297405243, 'rewards/format_reward': 1.0, 'reward': 1.7056548595428467, 'reward_std': 0.06853229366242886, 'kl': 0.578125, 'epoch': 0.52} 52%|█████▏ | 2249/4286 [13:59:56<12:37:17, 22.31s/it] 52%|█████▏ | 2250/4286 [14:00:17<12:29:14, 22.08s/it] {'loss':
0.0094, 'grad_norm': 1.3083339209321105, 'learning_rate': 4.750349976668222e-07, 'completion_length': 207.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.060457466170191765, 'kl': 0.23583984375, 'epoch': 0.52} 52%|█████▏ | 2250/4286 [14:00:17<12:29:14, 22.08s/it] 53%|█████▎ | 2251/4286 [14:00:37<12:07:01, 21.44s/it] {'loss': 0.022, 'grad_norm': 2.6087826244302903, 'learning_rate': 4.7480167988800747e-07, 'completion_length': 196.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6800595223903656, 'rewards/format_reward': 1.0, 'reward': 1.6800596714019775, 'reward_std': 0.05654762405902147, 'kl': 0.55126953125, 'epoch': 0.53} 53%|█████▎ | 2251/4286 [14:00:37<12:07:01, 21.44s/it] 53%|█████▎ | 2252/4286 [14:00:58<12:03:16, 21.34s/it] {'loss': 0.0111, 'grad_norm': 0.9590011668565418, 'learning_rate': 4.745683621091927e-07, 'completion_length': 209.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.438988134264946, 'rewards/format_reward': 1.0, 'reward': 1.438988208770752, 'reward_std': 0.014880955684930086, 'kl': 0.2783203125, 'epoch': 0.53} 53%|█████▎ | 2252/4286 [14:00:58<12:03:16, 21.34s/it] 53%|█████▎ | 2253/4286 [14:01:18<11:50:17, 20.96s/it] {'loss': 0.0422, 'grad_norm': 1.3586442392261882, 'learning_rate': 4.7433504433037797e-07, 'completion_length': 186.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5163690745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4985119700431824, 'reward_std': 0.0803571492433548, 'kl': 1.0546875, 'epoch': 0.53} 53%|█████▎ | 2253/4286 [14:01:18<11:50:17, 20.96s/it] 53%|█████▎ | 2254/4286 [14:01:41<12:03:50, 21.37s/it] {'loss': 0.0073, 'grad_norm': 0.9756828531310042, 'learning_rate': 4.7410172655156324e-07, 'completion_length': 217.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.5252976566553116, 'rewards/format_reward': 1.0, 'reward': 1.5252977013587952, 'reward_std': 
0.025190782733261585, 'kl': 0.18212890625, 'epoch': 0.53} 53%|█████▎ | 2254/4286 [14:01:41<12:03:50, 21.37s/it] 53%|█████▎ | 2255/4286 [14:02:03<12:17:22, 21.78s/it] {'loss': 0.0082, 'grad_norm': 2.9588690174228227, 'learning_rate': 4.7386840877274847e-07, 'completion_length': 232.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.732738196849823, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7148810625076294, 'reward_std': 0.11288052797317505, 'kl': 0.205078125, 'epoch': 0.53} 53%|█████▎ | 2255/4286 [14:02:03<12:17:22, 21.78s/it] 53%|█████▎ | 2256/4286 [14:02:25<12:17:48, 21.81s/it] {'loss': 0.052, 'grad_norm': 3.1386947878363545, 'learning_rate': 4.7363509099393374e-07, 'completion_length': 227.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.52976194024086, 'rewards/format_reward': 1.0, 'reward': 1.5297620296478271, 'reward_std': 0.12296520546078682, 'kl': 1.30126953125, 'epoch': 0.53} 53%|█████▎ | 2256/4286 [14:02:25<12:17:48, 21.81s/it] 53%|█████▎ | 2257/4286 [14:02:47<12:14:13, 21.71s/it] {'loss': 0.0072, 'grad_norm': 6.817881032112227, 'learning_rate': 4.7340177321511896e-07, 'completion_length': 226.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6413690745830536, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.06296526081860065, 'kl': 0.1806640625, 'epoch': 0.53} 53%|█████▎ | 2257/4286 [14:02:47<12:14:13, 21.71s/it] 53%|█████▎ | 2258/4286 [14:03:09<12:22:21, 21.96s/it] {'loss': 0.0487, 'grad_norm': 1.313773414865326, 'learning_rate': 4.7316845543630424e-07, 'completion_length': 205.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7164683043956757, 'rewards/format_reward': 1.0, 'reward': 1.716468334197998, 'reward_std': 0.10482091084122658, 'kl': 1.21875, 'epoch': 0.53} 53%|█████▎ | 2258/4286 [14:03:09<12:22:21, 21.96s/it] 53%|█████▎ | 2259/4286 [14:03:31<12:18:59, 21.87s/it] {'loss': 0.0339, 'grad_norm': 1.537961676492058, 'learning_rate': 4.729351376574895e-07, 
'completion_length': 199.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5758928954601288, 'rewards/format_reward': 1.0, 'reward': 1.5758929252624512, 'reward_std': 0.07883669435977936, 'kl': 0.845703125, 'epoch': 0.53} 53%|█████▎ | 2259/4286 [14:03:31<12:18:59, 21.87s/it] 53%|█████▎ | 2260/4286 [14:03:52<12:12:44, 21.70s/it] {'loss': 0.0071, 'grad_norm': 1.9308033116004748, 'learning_rate': 4.7270181987867473e-07, 'completion_length': 216.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.7958333790302277, 'rewards/format_reward': 1.0, 'reward': 1.79583340883255, 'reward_std': 0.015476187691092491, 'kl': 0.1787109375, 'epoch': 0.53} 53%|█████▎ | 2260/4286 [14:03:52<12:12:44, 21.70s/it] 53%|█████▎ | 2261/4286 [14:04:15<12:25:33, 22.09s/it] {'loss': 0.0252, 'grad_norm': 1.8344997637977456, 'learning_rate': 4.7246850209986e-07, 'completion_length': 194.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6735119521617889, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.637797772884369, 'reward_std': 0.18231073254719377, 'kl': 0.6279296875, 'epoch': 0.53} 53%|█████▎ | 2261/4286 [14:04:15<12:25:33, 22.09s/it] 53%|█████▎ | 2262/4286 [14:04:37<12:17:07, 21.85s/it] {'loss': 0.0856, 'grad_norm': 2.2824784854898867, 'learning_rate': 4.7223518432104523e-07, 'completion_length': 209.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5282738506793976, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4747024774551392, 'reward_std': 0.2471023052930832, 'kl': 2.13671875, 'epoch': 0.53} 53%|█████▎ | 2262/4286 [14:04:37<12:17:07, 21.85s/it] 53%|█████▎ | 2263/4286 [14:04:59<12:21:03, 21.98s/it] {'loss': 0.0112, 'grad_norm': 5.767256352085359, 'learning_rate': 4.720018665422305e-07, 'completion_length': 200.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.4985119551420212, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4806548357009888, 'reward_std': 0.0480794720351696, 'kl': 0.2802734375, 'epoch': 0.53} 
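Each logged 'reward' above is the sum of the two reward components, 'rewards/only_full_func_accuracy_reward' and 'rewards/format_reward' (agreement is up to float32 rounding in the logger). A small sketch of that composition, using the values logged at step 2260; the function name is hypothetical:

```python
# Sketch: the logged 'reward' column as the sum of the per-component
# reward columns. combine_rewards is a hypothetical name, not the trainer's API.
def combine_rewards(accuracy_reward: float, format_reward: float) -> float:
    """Total reward = accuracy component + format component."""
    return accuracy_reward + format_reward

# Values logged at step 2260 above; the log shows 'reward': 1.79583340883255.
total_2260 = combine_rewards(0.7958333790302277, 1.0)
```

The same relation holds at steps with imperfect formatting, e.g. step 2262: 0.5282738506793976 + 0.9464285969734192 ≈ the logged 1.4747024774551392.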
53%|█████▎ | 2263/4286 [14:04:59<12:21:03, 21.98s/it] 53%|█████▎ | 2264/4286 [14:05:21<12:19:57, 21.96s/it] {'loss': 0.0386, 'grad_norm': 4.046368463185439, 'learning_rate': 4.717685487634158e-07, 'completion_length': 213.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.719345211982727, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6836310029029846, 'reward_std': 0.1664666160941124, 'kl': 0.9609375, 'epoch': 0.53} 53%|█████▎ | 2264/4286 [14:05:21<12:19:57, 21.96s/it] 53%|█████▎ | 2265/4286 [14:05:39<11:43:53, 20.90s/it] {'loss': 0.0077, 'grad_norm': 0.5290904328099038, 'learning_rate': 4.71535230984601e-07, 'completion_length': 177.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.77976194024086, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.013746436685323715, 'kl': 0.19189453125, 'epoch': 0.53} 53%|█████▎ | 2265/4286 [14:05:39<11:43:53, 20.90s/it] 53%|█████▎ | 2266/4286 [14:05:57<11:17:17, 20.12s/it] {'loss': 0.0317, 'grad_norm': 1.3843753735055462, 'learning_rate': 4.713019132057863e-07, 'completion_length': 188.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6461310088634491, 'rewards/format_reward': 1.0, 'reward': 1.6461310386657715, 'reward_std': 0.04621959198266268, 'kl': 0.796875, 'epoch': 0.53} 53%|█████▎ | 2266/4286 [14:05:57<11:17:17, 20.12s/it] 53%|█████▎ | 2267/4286 [14:06:18<11:19:48, 20.20s/it] {'loss': 0.0096, 'grad_norm': 3.0746610439043383, 'learning_rate': 4.710685954269715e-07, 'completion_length': 208.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.5392857640981674, 'rewards/format_reward': 1.0, 'reward': 1.5392857789993286, 'reward_std': 0.04285714775323868, 'kl': 0.24072265625, 'epoch': 0.53} 53%|█████▎ | 2267/4286 [14:06:18<11:19:48, 20.20s/it] 53%|█████▎ | 2268/4286 [14:06:38<11:15:33, 20.09s/it] {'loss': 0.0116, 'grad_norm': 3.449858342328765, 'learning_rate': 4.708352776481568e-07, 'completion_length': 198.8214340209961, 
'rewards/only_full_func_accuracy_reward': 0.6002976894378662, 'rewards/format_reward': 1.0, 'reward': 1.6002976894378662, 'reward_std': 0.08949853479862213, 'kl': 0.28955078125, 'epoch': 0.53} 53%|█████▎ | 2268/4286 [14:06:38<11:15:33, 20.09s/it] 53%|█████▎ | 2269/4286 [14:06:58<11:16:30, 20.12s/it] {'loss': 0.0335, 'grad_norm': 5.639416078301939, 'learning_rate': 4.7060195986934205e-07, 'completion_length': 226.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.614583432674408, 'rewards/format_reward': 1.0, 'reward': 1.614583432674408, 'reward_std': 0.020833336748182774, 'kl': 0.83935546875, 'epoch': 0.53} 53%|█████▎ | 2269/4286 [14:06:58<11:16:30, 20.12s/it] 53%|█████▎ | 2270/4286 [14:07:21<11:47:20, 21.05s/it] {'loss': 0.067, 'grad_norm': 1.2732477562788502, 'learning_rate': 4.703686420905273e-07, 'completion_length': 216.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.4895833879709244, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4538691639900208, 'reward_std': 0.15078158117830753, 'kl': 1.67578125, 'epoch': 0.53} 53%|█████▎ | 2270/4286 [14:07:21<11:47:20, 21.05s/it] 53%|█████▎ | 2271/4286 [14:07:40<11:25:13, 20.40s/it] {'loss': 0.0071, 'grad_norm': 8.301736166304295, 'learning_rate': 4.7013532431171255e-07, 'completion_length': 176.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6636904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6636906266212463, 'reward_std': 0.04602411016821861, 'kl': 0.17822265625, 'epoch': 0.53} 53%|█████▎ | 2271/4286 [14:07:40<11:25:13, 20.40s/it] 53%|█████▎ | 2272/4286 [14:08:01<11:27:50, 20.49s/it] {'loss': 0.0548, 'grad_norm': 7.415943699048671, 'learning_rate': 4.699020065328978e-07, 'completion_length': 219.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.6092262268066406, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5556548833847046, 'reward_std': 0.24749525636434555, 'kl': 1.37109375, 'epoch': 0.53} 53%|█████▎ | 2272/4286 [14:08:01<11:27:50, 20.49s/it] 
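The 'epoch' field tracks the completed fraction of the run: step/4286 rounded to two decimals, which is why it flips from 0.52 to 0.53 between steps 2250 and 2251 (2250/4286 ≈ 0.5250, 2251/4286 ≈ 0.5252). A one-line sketch, assuming the 4286 logged steps make up one epoch (an inference from the log, not a confirmed setting):

```python
# Sketch: 'epoch' as the completed fraction of an assumed 4286-step run,
# rounded to two decimals as in the logged dicts.
def epoch_fraction(step: int, total_steps: int = 4286) -> float:
    """Fraction of the run completed after `step` steps, to two decimals."""
    return round(step / total_steps, 2)
```

This reproduces the 0.52→0.53 transition seen between steps 2250 and 2251 in the log above.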
53%|█████▎ | 2273/4286 [14:08:22<11:37:56, 20.80s/it] {'loss': 0.0628, 'grad_norm': 4.995784711794368, 'learning_rate': 4.6966868875408305e-07, 'completion_length': 200.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.630357176065445, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.558928668498993, 'reward_std': 0.22350114583969116, 'kl': 1.5703125, 'epoch': 0.53} 53%|█████▎ | 2273/4286 [14:08:22<11:37:56, 20.80s/it] 53%|█████▎ | 2274/4286 [14:08:41<11:17:12, 20.20s/it] {'loss': 0.0081, 'grad_norm': 3.1773124128945556, 'learning_rate': 4.694353709752683e-07, 'completion_length': 199.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5014881044626236, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4836310744285583, 'reward_std': 0.03959564492106438, 'kl': 0.20263671875, 'epoch': 0.53} 53%|█████▎ | 2274/4286 [14:08:41<11:17:12, 20.20s/it] 53%|█████▎ | 2275/4286 [14:09:02<11:26:19, 20.48s/it] {'loss': 0.0202, 'grad_norm': 5.470778594591584, 'learning_rate': 4.6920205319645354e-07, 'completion_length': 207.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 1.0, 'reward': 1.5937501192092896, 'reward_std': 0.08311965502798557, 'kl': 0.50439453125, 'epoch': 0.53} 53%|█████▎ | 2275/4286 [14:09:02<11:26:19, 20.48s/it] 53%|█████▎ | 2276/4286 [14:09:23<11:33:09, 20.69s/it] {'loss': 0.0391, 'grad_norm': 5.505598982153795, 'learning_rate': 4.689687354176388e-07, 'completion_length': 199.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5580357611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5401787161827087, 'reward_std': 0.12805915344506502, 'kl': 0.974609375, 'epoch': 0.53} 53%|█████▎ | 2276/4286 [14:09:23<11:33:09, 20.69s/it] 53%|█████▎ | 2277/4286 [14:09:47<12:04:50, 21.65s/it] {'loss': 0.1107, 'grad_norm': 7.447818397377641, 'learning_rate': 4.687354176388241e-07, 'completion_length': 242.82144165039062, 'rewards/only_full_func_accuracy_reward': 
0.41934528946876526, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3657739758491516, 'reward_std': 0.193585604429245, 'kl': 2.76171875, 'epoch': 0.53} 53%|█████▎ | 2277/4286 [14:09:47<12:04:50, 21.65s/it]
53%|█████▎ | 2278/4286 [14:10:08<11:57:20, 21.43s/it] {'loss': 0.0482, 'grad_norm': 10.083963730633728, 'learning_rate': 4.685020998600093e-07, 'completion_length': 199.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.504464328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4866072535514832, 'reward_std': 0.10054944083094597, 'kl': 1.205078125, 'epoch': 0.53}
53%|█████▎ | 2279/4286 [14:10:28<11:43:50, 21.04s/it] {'loss': 0.0884, 'grad_norm': 6.906415416009469, 'learning_rate': 4.682687820811946e-07, 'completion_length': 199.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.6071429252624512, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5714287161827087, 'reward_std': 0.20620840787887573, 'kl': 2.20703125, 'epoch': 0.53}
53%|█████▎ | 2280/4286 [14:10:48<11:27:42, 20.57s/it] {'loss': 0.1136, 'grad_norm': 7.842007543629218, 'learning_rate': 4.680354643023798e-07, 'completion_length': 169.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6220238208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6041668057441711, 'reward_std': 0.22183555364608765, 'kl': 2.83203125, 'epoch': 0.53}
53%|█████▎ | 2281/4286 [14:11:09<11:31:59, 20.71s/it] {'loss': 0.0806, 'grad_norm': 10.415509602806326, 'learning_rate': 4.678021465235651e-07, 'completion_length': 208.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.5238095819950104, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4880953431129456, 'reward_std': 0.2108815833926201, 'kl': 2.015625, 'epoch': 0.53}
53%|█████▎ | 2282/4286 [14:11:29<11:30:15, 20.67s/it] {'loss': 0.0657, 'grad_norm': 10.683334310501717, 'learning_rate': 4.6756882874475036e-07, 'completion_length': 197.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.4500000327825546, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4142858386039734, 'reward_std': 0.17613179981708527, 'kl': 1.64453125, 'epoch': 0.53}
53%|█████▎ | 2283/4286 [14:11:50<11:25:25, 20.53s/it] {'loss': 0.1378, 'grad_norm': 14.977740020437889, 'learning_rate': 4.673355109659356e-07, 'completion_length': 208.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5610119700431824, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4717263579368591, 'reward_std': 0.3396747261285782, 'kl': 3.4453125, 'epoch': 0.53}
53%|█████▎ | 2284/4286 [14:12:12<11:40:06, 20.98s/it] {'loss': 0.064, 'grad_norm': 9.744947474174284, 'learning_rate': 4.6710219318712086e-07, 'completion_length': 197.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.647916704416275, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.612202525138855, 'reward_std': 0.17280982807278633, 'kl': 1.6015625, 'epoch': 0.53}
53%|█████▎ | 2285/4286 [14:12:35<12:00:27, 21.60s/it] {'loss': 0.0834, 'grad_norm': 20.223782746249096, 'learning_rate': 4.668688754083061e-07, 'completion_length': 201.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.49910716712474823, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4455358982086182, 'reward_std': 0.20285295695066452, 'kl': 2.0859375, 'epoch': 0.53}
53%|█████▎ | 2286/4286 [14:12:56<11:59:34, 21.59s/it] {'loss': 0.0653, 'grad_norm': 14.97566043358587, 'learning_rate': 4.6663555762949136e-07, 'completion_length': 209.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.4217262268066406, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4038691520690918, 'reward_std': 0.18556194007396698, 'kl': 1.63671875, 'epoch': 0.53}
[2025-03-02 19:20:37,532] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
53%|█████▎ | 2287/4286 [14:13:22<12:37:58, 22.75s/it] {'loss': 0.1213, 'grad_norm': 8.312165678146325, 'learning_rate': 4.6640223985067663e-07, 'completion_length': 208.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.4976190775632858, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4261905550956726, 'reward_std': 0.20017003268003464, 'kl': 3.0234375, 'epoch': 0.53}
53%|█████▎ | 2288/4286 [14:13:43<12:23:38, 22.33s/it] {'loss': 0.1118, 'grad_norm': 13.81761676095294, 'learning_rate': 4.6616892207186186e-07, 'completion_length': 224.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.4988095313310623, 'rewards/format_reward': 1.0, 'reward': 1.4988096356391907, 'reward_std': 0.18144091963768005, 'kl': 2.79296875, 'epoch': 0.53}
53%|█████▎ | 2289/4286 [14:14:04<12:07:24, 21.86s/it] {'loss': 0.0731, 'grad_norm': 13.024925781555542, 'learning_rate': 4.6593560429304713e-07, 'completion_length': 207.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.4486819952726364, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4308249354362488, 'reward_std': 0.21913501620292664, 'kl': 1.828125, 'epoch': 0.53}
53%|█████▎ | 2290/4286 [14:14:23<11:45:59, 21.22s/it] {'loss': 0.0319, 'grad_norm': 4.067219788876427, 'learning_rate': 4.6570228651423235e-07, 'completion_length': 203.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.7265307009220123, 'rewards/format_reward': 1.0, 'reward': 1.7265307307243347, 'reward_std': 0.07639862969517708, 'kl': 0.7978515625, 'epoch': 0.53}
53%|█████▎ | 2291/4286 [14:14:44<11:36:27, 20.95s/it] {'loss': 0.0188, 'grad_norm': 14.833518362146709, 'learning_rate': 4.6546896873541763e-07, 'completion_length': 178.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.5610119700431824, 'rewards/format_reward': 1.0, 'reward': 1.5610120296478271, 'reward_std': 0.08815119788050652, 'kl': 0.47265625, 'epoch': 0.53}
53%|█████▎ | 2292/4286 [14:15:06<11:45:38, 21.23s/it] {'loss': 0.0288, 'grad_norm': 15.603363053454206, 'learning_rate': 4.652356509566029e-07, 'completion_length': 203.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.45476195216178894, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4369048476219177, 'reward_std': 0.12023810297250748, 'kl': 0.71875, 'epoch': 0.53}
53%|█████▎ | 2293/4286 [14:15:25<11:27:14, 20.69s/it] {'loss': 0.0151, 'grad_norm': 5.303963534888199, 'learning_rate': 4.6500233317778813e-07, 'completion_length': 202.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.616071492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5982143878936768, 'reward_std': 0.09594079107046127, 'kl': 0.376953125, 'epoch': 0.53}
54%|█████▎ | 2294/4286 [14:15:45<11:22:00, 20.54s/it] {'loss': 0.029, 'grad_norm': 2.699223403867936, 'learning_rate': 4.647690153989734e-07, 'completion_length': 194.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.447916716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4122024774551392, 'reward_std': 0.15658654272556305, 'kl': 0.724609375, 'epoch': 0.54}
54%|█████▎ | 2295/4286 [14:16:05<11:16:36, 20.39s/it] {'loss': 0.0105, 'grad_norm': 12.640049064276083, 'learning_rate': 4.645356976201587e-07, 'completion_length': 184.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 1.0, 'reward': 1.6026787161827087, 'reward_std': 0.13258037343621254, 'kl': 0.263671875, 'epoch': 0.54}
54%|█████▎ | 2296/4286 [14:16:27<11:25:56, 20.68s/it] {'loss': 0.0212, 'grad_norm': 3.094721820406922, 'learning_rate': 4.643023798413439e-07, 'completion_length': 215.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.62053582072258, 'rewards/format_reward': 1.0, 'reward': 1.6205358505249023, 'reward_std': 0.08907203748822212, 'kl': 0.53076171875, 'epoch': 0.54}
54%|█████▎ | 2297/4286 [14:16:46<11:07:00, 20.12s/it] {'loss': 0.0109, 'grad_norm': 2.7688817989974974, 'learning_rate': 4.640690620625292e-07, 'completion_length': 182.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.09935792069882154, 'kl': 0.27099609375, 'epoch': 0.54}
54%|█████▎ | 2298/4286 [14:17:06<11:06:51, 20.13s/it] {'loss': 0.0156, 'grad_norm': 15.892266226020517, 'learning_rate': 4.638357442837144e-07, 'completion_length': 214.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.65476194024086, 'rewards/format_reward': 1.0, 'reward': 1.6547620296478271, 'reward_std': 0.09899592027068138, 'kl': 0.3876953125, 'epoch': 0.54}
54%|█████▎ | 2299/4286 [14:17:27<11:16:47, 20.44s/it] {'loss': 0.0238, 'grad_norm': 3.4766722732805633, 'learning_rate': 4.6360242650489967e-07, 'completion_length': 197.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.5345238149166107, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5166667699813843, 'reward_std': 0.057715313509106636, 'kl': 0.595703125, 'epoch': 0.54}
54%|█████▎ | 2300/4286 [14:17:47<11:17:13, 20.46s/it] {'loss': 0.0369, 'grad_norm': 6.4451544165731285, 'learning_rate': 4.6336910872608495e-07, 'completion_length': 199.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.5956845432519913, 'rewards/format_reward': 1.0, 'reward': 1.5956845879554749, 'reward_std': 0.06136698368936777, 'kl': 0.921875, 'epoch': 0.54}
54%|█████▎ | 2301/4286 [14:21:06<40:41:30, 73.80s/it] {'loss': 0.0142, 'grad_norm': 9.307160442928655, 'learning_rate': 4.6313579094727017e-07, 'completion_length': 203.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5282738655805588, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5104167461395264, 'reward_std': 0.13534288480877876, 'kl': 0.3564453125, 'epoch': 0.54}
54%|█████▎ | 2302/4286 [14:21:26<31:48:19, 57.71s/it] {'loss': 0.0096, 'grad_norm': 2.0343188609974208, 'learning_rate': 4.6290247316845544e-07, 'completion_length': 194.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.5901786088943481, 'rewards/format_reward': 1.0, 'reward': 1.5901786088943481, 'reward_std': 0.07127300091087818, 'kl': 0.24072265625, 'epoch': 0.54}
54%|█████▎ | 2303/4286 [14:21:46<25:34:55, 46.44s/it] {'loss': 0.0077, 'grad_norm': 3.0361684609092485, 'learning_rate': 4.6266915538964067e-07, 'completion_length': 211.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5538690984249115, 'rewards/format_reward': 1.0, 'reward': 1.5538691878318787, 'reward_std': 0.05573870614171028, 'kl': 0.19287109375, 'epoch': 0.54}
54%|█████▍ | 2304/4286 [14:22:07<21:18:20, 38.70s/it] {'loss': 0.0145, 'grad_norm': 7.007490897115696, 'learning_rate': 4.6243583761082594e-07, 'completion_length': 194.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.711904764175415, 'rewards/format_reward': 1.0, 'reward': 1.7119048833847046, 'reward_std': 0.11058150976896286, 'kl': 0.36328125, 'epoch': 0.54}
54%|█████▍ | 2305/4286 [14:22:29<18:36:38, 33.82s/it] {'loss': 0.0551, 'grad_norm': 22.1185571910377, 'learning_rate': 4.622025198320112e-07, 'completion_length': 175.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6056548357009888, 'reward_std': 0.07121489383280277, 'kl': 1.3828125, 'epoch': 0.54}
54%|█████▍ | 2306/4286 [14:22:51<16:40:46, 30.33s/it] {'loss': 0.0217, 'grad_norm': 3.833957206287748, 'learning_rate': 4.6196920205319644e-07, 'completion_length': 191.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.45297621190547943, 'rewards/format_reward': 1.0, 'reward': 1.4529762864112854, 'reward_std': 0.04222671687602997, 'kl': 0.5419921875, 'epoch': 0.54}
54%|█████▍ | 2307/4286 [14:23:09<14:40:36, 26.70s/it] {'loss': 0.0374, 'grad_norm': 5.025298608303781, 'learning_rate': 4.617358842743817e-07, 'completion_length': 170.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.6175595223903656, 'rewards/format_reward': 1.0, 'reward': 1.6175596714019775, 'reward_std': 0.12039005383849144, 'kl': 0.93359375, 'epoch': 0.54}
54%|█████▍ | 2308/4286 [14:23:30<13:42:38, 24.95s/it] {'loss': 0.0206, 'grad_norm': 12.354920305597497, 'learning_rate': 4.6150256649556694e-07, 'completion_length': 190.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5814076215028763, 'rewards/format_reward': 1.0, 'reward': 1.5814077258110046, 'reward_std': 0.12083199992775917, 'kl': 0.515625, 'epoch': 0.54}
54%|█████▍ | 2309/4286 [14:23:49<12:38:28, 23.02s/it] {'loss': 0.0149, 'grad_norm': 3.9834419375296592, 'learning_rate': 4.612692487167522e-07, 'completion_length': 185.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6461310088634491, 'rewards/format_reward': 1.0, 'reward': 1.6461310386657715, 'reward_std': 0.066379738971591, 'kl': 0.3720703125, 'epoch': 0.54}
[2025-03-02 19:31:25,106] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
54%|█████▍ | 2310/4286 [14:24:09<12:12:41, 22.25s/it] {'loss': 0.1096, 'grad_norm': 976.2844228578085, 'learning_rate': 4.610359309379375e-07, 'completion_length': 196.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5175595134496689, 'rewards/format_reward': 1.0, 'reward': 1.5175595879554749, 'reward_std': 0.14673244580626488, 'kl': 2.734375, 'epoch': 0.54}
54%|█████▍ | 2311/4286 [14:24:29<11:50:40, 21.59s/it] {'loss': 0.0249, 'grad_norm': 6.060567961619759, 'learning_rate': 4.608026131591227e-07, 'completion_length': 209.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.6220238506793976, 'rewards/format_reward': 1.0, 'reward': 1.62202388048172, 'reward_std': 0.0535714328289032, 'kl': 0.62255859375, 'epoch': 0.54}
54%|█████▍ | 2312/4286 [14:24:49<11:27:31, 20.90s/it] {'loss': 0.0503, 'grad_norm': 2.36217504723315, 'learning_rate': 4.60569295380308e-07, 'completion_length': 178.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6046627163887024, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5868057012557983, 'reward_std': 0.11210318095982075, 'kl': 1.259765625, 'epoch': 0.54}
54%|█████▍ | 2313/4286 [14:25:08<11:10:26, 20.39s/it] {'loss': 0.017, 'grad_norm': 3.2263665449915218, 'learning_rate': 4.603359776014932e-07, 'completion_length': 173.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 1.0, 'reward': 1.672619104385376, 'reward_std': 0.029761905781924725, 'kl': 0.42578125, 'epoch': 0.54}
54%|█████▍ | 2314/4286 [14:25:28<11:11:44, 20.44s/it] {'loss': 0.076, 'grad_norm': 12.673030050421966, 'learning_rate': 4.601026598226785e-07, 'completion_length': 170.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.45892858505249023, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4410715699195862, 'reward_std': 0.1247778832912445, 'kl': 1.8984375, 'epoch': 0.54}
54%|█████▍ | 2315/4286 [14:25:51<11:29:26, 20.99s/it] {'loss': 0.0963, 'grad_norm': 4.897596803096098, 'learning_rate': 4.5986934204386376e-07, 'completion_length': 197.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.4806548058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.46279776096344, 'reward_std': 0.19776205718517303, 'kl': 2.4140625, 'epoch': 0.54}
54%|█████▍ | 2316/4286 [14:26:09<11:00:50, 20.13s/it] {'loss': 0.0818, 'grad_norm': 21.00806238504763, 'learning_rate': 4.59636024265049e-07, 'completion_length': 171.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6904762983322144, 'reward_std': 0.1835101768374443, 'kl': 2.046875, 'epoch': 0.54}
54%|█████▍ | 2317/4286 [14:26:28<10:54:46, 19.95s/it] {'loss': 0.1051, 'grad_norm': 5.599649567434917, 'learning_rate': 4.5940270648623425e-07, 'completion_length': 182.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.44553573429584503, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4098215103149414, 'reward_std': 0.2265925034880638, 'kl': 2.6328125, 'epoch': 0.54}
54%|█████▍ | 2318/4286 [14:26:49<11:00:55, 20.15s/it] {'loss': 0.0487, 'grad_norm': 8.64406679683303, 'learning_rate': 4.5916938870741953e-07, 'completion_length': 180.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.5922619104385376, 'rewards/format_reward': 1.0, 'reward': 1.5922619700431824, 'reward_std': 0.12391316145658493, 'kl': 1.2197265625, 'epoch': 0.54}
54%|█████▍ | 2319/4286 [14:27:10<11:06:14, 20.32s/it] {'loss': 0.0531, 'grad_norm': 4.506001492260461, 'learning_rate': 4.5893607092860475e-07, 'completion_length': 180.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6053571999073029, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5875000953674316, 'reward_std': 0.15030048415064812, 'kl': 1.330078125, 'epoch': 0.54}
54%|█████▍ | 2320/4286 [14:27:28<10:48:47, 19.80s/it] {'loss': 0.0306, 'grad_norm': 1.9939159664030404, 'learning_rate': 4.5870275314979e-07, 'completion_length': 172.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.680059552192688, 'rewards/format_reward': 1.0, 'reward': 1.6800596714019775, 'reward_std': 0.0761425532400608, 'kl': 0.7666015625, 'epoch': 0.54}
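The DeepSpeed warnings in this log suggest adding `get_accelerator().empty_cache()` calls to the training loop so that all ranks flush their allocator caches at the same step. A minimal sketch of that flush cadence follows; the `flush_fn` callable is injected so the logic runs without a GPU (in a real run you would pass `deepspeed.accelerator.get_accelerator().empty_cache`), and `run_steps` and its step body are hypothetical, not taken from the actual training script:

```python
def run_steps(num_steps, flush_every, flush_fn):
    """Run `num_steps` dummy training steps, calling `flush_fn` every
    `flush_every` steps. Returns the step indices at which a flush ran.

    Because every rank counts steps identically, each rank reaches the
    flush at the same point, which is what the DeepSpeed warning asks for.
    """
    flushed = []
    for step in range(1, num_steps + 1):
        # ... forward / backward / optimizer step would go here ...
        if step % flush_every == 0:
            flush_fn()  # e.g. get_accelerator().empty_cache() on every rank
            flushed.append(step)
    return flushed
```

With 10 steps and `flush_every=4`, the cache is flushed at steps 4 and 8 on every rank; tuning `flush_every` trades flush overhead against memory pressure.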
54%|█████▍ | 2321/4286 [14:27:47<10:35:49, 19.41s/it] {'loss': 0.033, 'grad_norm': 27.99154390956552, 'learning_rate': 4.5846943537097525e-07, 'completion_length': 175.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.5238095968961716, 'rewards/format_reward': 1.0, 'reward': 1.5238096714019775, 'reward_std': 0.08118016459047794, 'kl': 0.8251953125, 'epoch': 0.54}
54%|█████▍ | 2322/4286 [14:28:06<10:35:55, 19.43s/it] {'loss': 0.0718, 'grad_norm': 3.9457046290661975, 'learning_rate': 4.582361175921605e-07, 'completion_length': 187.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.624872475862503, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6070154309272766, 'reward_std': 0.11041765473783016, 'kl': 1.79296875, 'epoch': 0.54}
54%|█████▍ | 2323/4286 [14:28:25<10:28:16, 19.20s/it] {'loss': 0.0267, 'grad_norm': 3.8075899785629805, 'learning_rate': 4.580027998133458e-07, 'completion_length': 179.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.741071492433548, 'rewards/format_reward': 1.0, 'reward': 1.7410714626312256, 'reward_std': 0.10771036520600319, 'kl': 0.66796875, 'epoch': 0.54}
[2025-03-02 19:36:01,461] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
54%|█████▍ | 2324/4286 [14:28:46<10:43:10, 19.67s/it] {'loss': 0.0286, 'grad_norm': 6.229187482753214, 'learning_rate': 4.57769482034531e-07, 'completion_length': 168.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.6547619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6369049549102783, 'reward_std': 0.08769077807664871, 'kl': 0.7109375, 'epoch': 0.54}
54%|█████▍ | 2325/4286 [14:29:06<10:50:49, 19.91s/it] {'loss': 0.0318, 'grad_norm': 2.4344125159235417, 'learning_rate': 4.575361642557163e-07, 'completion_length': 180.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6830357313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.665178656578064, 'reward_std': 0.05417593568563461, 'kl': 0.79736328125, 'epoch': 0.54}
54%|█████▍ | 2326/4286 [14:29:25<10:42:34, 19.67s/it] {'loss': 0.054, 'grad_norm': 7.649655394797813, 'learning_rate': 4.573028464769015e-07, 'completion_length': 184.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.5699405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5520834922790527, 'reward_std': 0.16762056201696396, 'kl': 1.349609375, 'epoch': 0.54}
54%|█████▍ | 2327/4286 [14:29:44<10:38:10, 19.55s/it] {'loss': 0.12, 'grad_norm': 2.0943785151527505, 'learning_rate': 4.570695286980868e-07, 'completion_length': 176.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7127976417541504, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6413691639900208, 'reward_std': 0.3011341392993927, 'kl': 2.9921875, 'epoch': 0.54}
54%|█████▍ | 2328/4286 [14:30:07<11:07:10, 20.44s/it] {'loss': 0.0629, 'grad_norm': 17.6691465295328, 'learning_rate': 4.5683621091927207e-07, 'completion_length': 190.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5714287161827087, 'reward_std': 0.14449796080589294, 'kl': 1.5703125, 'epoch': 0.54}
54%|█████▍ | 2329/4286 [14:30:26<10:52:50, 20.02s/it] {'loss': 0.0541, 'grad_norm': 14.141308644019313, 'learning_rate': 4.566028931404573e-07, 'completion_length': 186.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.5803571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5625001192092896, 'reward_std': 0.12280375882983208, 'kl': 1.34765625, 'epoch': 0.54}
54%|█████▍ | 2330/4286 [14:30:45<10:46:22, 19.83s/it] {'loss': 0.0691, 'grad_norm': 17.869288098743894, 'learning_rate': 4.5636957536164257e-07, 'completion_length': 183.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.614583432674408, 'reward_std': 0.12565409019589424, 'kl': 1.72265625, 'epoch': 0.54}
54%|█████▍ | 2331/4286 [14:31:05<10:44:37, 19.78s/it] {'loss': 0.0549, 'grad_norm': 6.595495994617978, 'learning_rate': 4.561362575828278e-07, 'completion_length': 193.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5372024178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5193452835083008, 'reward_std': 0.08787607122212648, 'kl': 1.3759765625, 'epoch': 0.54}
54%|█████▍ | 2332/4286 [14:31:25<10:48:13, 19.90s/it] {'loss': 0.0935, 'grad_norm': 17.954892681323006, 'learning_rate': 4.5590293980401306e-07, 'completion_length': 166.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.486607164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.450892984867096, 'reward_std': 0.26348379254341125, 'kl': 2.341796875, 'epoch': 0.54}
54%|█████▍ | 2333/4286 [14:31:44<10:39:12, 19.64s/it] {'loss': 0.0898, 'grad_norm': 6.094820525765208, 'learning_rate': 4.5566962202519834e-07, 'completion_length': 190.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.5949405431747437, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5235120058059692, 'reward_std': 0.22259490936994553, 'kl': 2.25, 'epoch': 0.54}
54%|█████▍ | 2334/4286 [14:32:02<10:22:17, 19.13s/it] {'loss': 0.0519, 'grad_norm': 1.225397449745998, 'learning_rate': 4.5543630424638356e-07, 'completion_length': 167.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.648809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6309524774551392, 'reward_std': 0.10410194098949432, 'kl': 1.30126953125, 'epoch': 0.54}
54%|█████▍ | 2335/4286 [14:32:22<10:24:35, 19.21s/it] {'loss': 0.0644, 'grad_norm': 10.788359736902217, 'learning_rate': 4.5520298646756884e-07, 'completion_length': 181.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803572535514832, 'reward_std': 0.12801361549645662, 'kl': 1.607421875, 'epoch': 0.54}
55%|█████▍ | 2336/4286 [14:32:41<10:23:29, 19.18s/it] {'loss': 0.075, 'grad_norm': 4.036026436205813, 'learning_rate': 4.5496966868875406e-07, 'completion_length': 193.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5208333879709244, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5029763579368591, 'reward_std': 0.11687319725751877, 'kl': 1.87109375, 'epoch': 0.55}
55%|█████▍ | 2337/4286 [14:33:02<10:41:16, 19.74s/it] {'loss': 0.0926, 'grad_norm': 10.109593780387042, 'learning_rate': 4.5473635090993933e-07, 'completion_length': 198.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6550595760345459, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6193453073501587, 'reward_std': 0.2183581441640854, 'kl': 2.31640625, 'epoch': 0.55}
55%|█████▍ | 2338/4286 [14:33:23<10:54:39, 20.16s/it] {'loss': 0.0728, 'grad_norm': 3.2615551844947905, 'learning_rate': 4.545030331311246e-07, 'completion_length': 186.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.1352643370628357, 'kl': 1.82421875, 'epoch': 0.55}
55%|█████▍ | 2339/4286 [14:33:42<10:43:02, 19.82s/it] {'loss': 0.0494, 'grad_norm': 2.6270706221142857, 'learning_rate': 4.5426971535230983e-07, 'completion_length': 178.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6130952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.08450091443955898, 'kl': 1.234375, 'epoch': 0.55}
55%|█████▍ | 2340/4286 [14:34:01<10:37:43, 19.66s/it] {'loss': 0.0289, 'grad_norm': 3.7138416140832984, 'learning_rate': 4.540363975734951e-07, 'completion_length': 190.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.474702388048172, 'rewards/format_reward': 1.0, 'reward': 1.4747024774551392, 'reward_std': 0.10373929888010025, 'kl': 0.7216796875, 'epoch': 0.55}
55%|█████▍ | 2341/4286 [14:34:22<10:44:41, 19.89s/it] {'loss': 0.0538, 'grad_norm': 0.9101852963902954, 'learning_rate': 4.5380307979468033e-07, 'completion_length': 171.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.5416667312383652, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5238096117973328, 'reward_std': 0.0952381044626236, 'kl': 1.3447265625, 'epoch': 0.55}
55%|█████▍ | 2342/4286 [14:34:39<10:22:00, 19.20s/it] {'loss': 0.0478, 'grad_norm': 1.822686961681555, 'learning_rate': 4.535697620158656e-07, 'completion_length': 174.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.522321492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5044643878936768, 'reward_std': 0.07624643296003342, 'kl': 1.19921875, 'epoch': 0.55}
55%|█████▍ | 2343/4286 [14:34:58<10:13:49, 18.95s/it] {'loss': 0.0446, 'grad_norm': 1.4475899466184081, 'learning_rate': 4.533364442370509e-07, 'completion_length': 186.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5482143610715866, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.530357301235199, 'reward_std': 0.07091019116342068, 'kl': 1.107421875, 'epoch': 0.55}
55%|█████▍ | 2344/4286 [14:35:16<10:05:17, 18.70s/it] {'loss': 0.0747, 'grad_norm': 7.866724399206671, 'learning_rate': 4.531031264582361e-07, 'completion_length': 166.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6145834028720856, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5788691639900208, 'reward_std': 0.10138079524040222, 'kl': 1.86328125, 'epoch': 0.55}
55%|█████▍ | 2345/4286 [14:35:35<10:07:01, 18.76s/it] {'loss': 0.0398, 'grad_norm': 2.3335201227418, 'learning_rate': 4.528698086794214e-07, 'completion_length': 174.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6806547939777374, 'rewards/format_reward': 1.0, 'reward': 1.6806548833847046, 'reward_std': 0.09492106549441814, 'kl': 0.994140625, 'epoch': 0.55}
55%|█████▍ | 2346/4286 [14:35:53<10:02:40, 18.64s/it] {'loss': 0.0254, 'grad_norm': 1.1141534285326213, 'learning_rate': 4.5263649090060665e-07, 'completion_length': 192.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.4255952686071396, 'rewards/format_reward': 1.0, 'reward': 1.4255953431129456, 'reward_std': 0.03411934897303581, 'kl': 0.6357421875, 'epoch': 0.55}
55%|█████▍ | 2347/4286 [14:36:12<10:04:27, 18.70s/it] {'loss': 0.0086, 'grad_norm': 66.41183643500905, 'learning_rate': 4.5240317312179187e-07, 'completion_length': 190.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5357143580913544, 'rewards/format_reward': 1.0, 'reward': 1.5357144474983215, 'reward_std': 0.041666668839752674, 'kl': 0.2138671875, 'epoch': 0.55}
55%|█████▍ | 2348/4286 [14:36:33<10:25:41, 19.37s/it] {'loss': 0.0318, 'grad_norm': 1.1421431567797669, 'learning_rate': 4.5216985534297715e-07, 'completion_length': 184.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5848214626312256, 'reward_std': 0.0565476231276989, 'kl': 0.7958984375, 'epoch': 0.55}
55%|█████▍ | 2349/4286 [14:36:51<10:16:25, 19.09s/it] {'loss': 0.0459, 'grad_norm': 1.0897461353825262, 'learning_rate': 4.5193653756416237e-07, 'completion_length': 192.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6205357313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6026787161827087, 'reward_std': 0.11346285790205002, 'kl': 1.1484375, 'epoch': 0.55}
55%|█████▍ | 2350/4286 [14:37:10<10:17:02, 19.12s/it] {'loss': 0.0091, 'grad_norm': 0.4575028949660048, 'learning_rate': 4.5170321978534765e-07, 'completion_length': 205.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.0, 'kl': 0.2275390625, 'epoch': 0.55}
55%|█████▍ | 2351/4286 [14:37:29<10:16:30, 19.12s/it] {'loss': 0.0625, 'grad_norm': 8.37421253123914, 'learning_rate': 4.514699020065329e-07, 'completion_length': 188.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.6071429252624512, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.571428656578064, 'reward_std': 0.16339794546365738, 'kl': 1.5625, 'epoch': 0.55}
55%|█████▍ | 2352/4286 [14:37:47<10:04:38, 18.76s/it] {'loss': 0.041, 'grad_norm': 3.5388927909752055, 'learning_rate': 4.5123658422771814e-07, 'completion_length': 173.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6517857909202576, 'rewards/format_reward': 1.0, 'reward': 1.6517858505249023, 'reward_std': 0.0942963995039463, 'kl': 1.0234375, 'epoch': 0.55}
55%|█████▍ | 2353/4286 [14:38:07<10:13:54, 19.06s/it] {'loss': 0.0085, 'grad_norm': 8.681993762923437, 'learning_rate': 4.510032664489034e-07, 'completion_length': 194.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5949405133724213, 'rewards/format_reward': 1.0, 'reward': 1.5949406623840332, 'reward_std': 0.04821428842842579, 'kl': 0.21240234375, 'epoch': 0.55}
55%|█████▍ | 2354/4286 [14:38:27<10:25:47, 19.43s/it] {'loss': 0.007, 'grad_norm': 1.5761821861035632, 'learning_rate': 4.5076994867008864e-07, 'completion_length': 193.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.0503815608099103, 'kl': 0.17578125, 'epoch': 0.55}
55%|█████▍ | 2355/4286 [14:38:47<10:24:03, 19.39s/it] {'loss': 0.0068, 'grad_norm': 1.8392918356304424, 'learning_rate': 4.505366308912739e-07, 'completion_length': 203.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.08907203562557697, 'kl': 0.1689453125, 'epoch': 0.55}
55%|█████▍ | 2356/4286 [14:39:06<10:18:24, 19.23s/it] {'loss': 0.0218, 'grad_norm': 54.28199773046512, 'learning_rate': 4.503033131124592e-07, 'completion_length': 168.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.095238097012043, 'kl': 0.5458984375, 'epoch': 0.55}
55%|█████▍ | 2357/4286 [14:39:26<10:27:26, 19.52s/it] {'loss': 0.007, 'grad_norm': 7.649968994979529, 'learning_rate': 4.500699953336444e-07, 'completion_length': 211.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.5699405372142792, 'rewards/format_reward': 1.0, 'reward': 1.5699405670166016, 'reward_std': 0.09353616833686829, 'kl': 0.17529296875, 'epoch': 0.55}
55%|█████▌ | 2358/4286 [14:39:47<10:42:02, 19.98s/it] {'loss': 0.0284, 'grad_norm': 2.9446644431958036, 'learning_rate': 4.498366775548297e-07, 'completion_length': 206.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5892857909202576, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.04166667256504297, 'kl': 0.708984375, 'epoch': 0.55}
55%|█████▌ | 2359/4286 [14:40:08<10:49:32, 20.22s/it] {'loss': 0.0182, 'grad_norm': 5.527245779602249, 'learning_rate': 4.496033597760149e-07, 'completion_length': 203.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.7401786148548126, 'rewards/format_reward': 1.0, 'reward': 1.740178644657135, 'reward_std': 0.06504932418465614, 'kl': 0.453125, 'epoch': 0.55}
55%|█████▌ | 2360/4286 [14:40:27<10:42:19, 20.01s/it] {'loss': 0.0088, 'grad_norm': 0.8262224930889114, 'learning_rate': 4.493700419972002e-07, 'completion_length': 193.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 1.0, 'reward': 1.672619104385376, 'reward_std': 0.03827153518795967, 'kl': 0.21875, 'epoch': 0.55}
55%|█████▌ | 2361/4286 [14:40:47<10:38:00, 19.89s/it] {'loss': 0.0118, 'grad_norm': 14.615744424663378, 'learning_rate': 4.4913672421838546e-07, 'completion_length': 198.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.735119104385376, 'reward_std': 0.01785714365541935, 'kl': 0.29541015625, 'epoch': 0.55}
55%|█████▌ | 2362/4286 [14:41:06<10:29:24, 19.63s/it] {'loss': 0.0066, 'grad_norm': 5.035823713200283, 'learning_rate': 4.489034064395707e-07, 'completion_length': 191.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6342262625694275, 'rewards/format_reward': 1.0, 'reward': 1.6342262625694275, 'reward_std': 0.04542887583374977, 'kl': 0.166015625, 'epoch': 0.55}
55%|█████▌ | 2363/4286 [14:41:24<10:18:19, 19.29s/it] {'loss': 0.0178, 'grad_norm': 1.1597882807762978, 'learning_rate': 4.4867008866075596e-07, 'completion_length': 186.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5922619253396988, 'rewards/format_reward': 1.0, 'reward': 1.5922620296478271, 'reward_std': 0.05541309900581837, 'kl': 0.44189453125, 'epoch': 0.55}
55%|█████▌ | 2364/4286 [14:41:44<10:18:14, 19.30s/it] {'loss': 0.0077, 'grad_norm': 1.1701459518877804, 'learning_rate': 4.484367708819412e-07, 'completion_length': 184.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.5372024178504944, 'rewards/format_reward': 1.0, 'reward':
1.5372024774551392, 'reward_std': 0.019238398410379887, 'kl': 0.193359375, 'epoch': 0.55} 55%|█████▌ | 2364/4286 [14:41:44<10:18:14, 19.30s/it] 55%|█████▌ | 2365/4286 [14:42:04<10:23:52, 19.49s/it] {'loss': 0.007, 'grad_norm': 11.73295248214227, 'learning_rate': 4.4820345310312646e-07, 'completion_length': 186.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5040391534566879, 'rewards/format_reward': 1.0, 'reward': 1.5040392875671387, 'reward_std': 0.008078234503045678, 'kl': 0.1748046875, 'epoch': 0.55} 55%|█████▌ | 2365/4286 [14:42:04<10:23:52, 19.49s/it] 55%|█████▌ | 2366/4286 [14:42:23<10:27:43, 19.62s/it] {'loss': 0.0075, 'grad_norm': 2.4728091538169275, 'learning_rate': 4.4797013532431173e-07, 'completion_length': 191.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.03793024830520153, 'kl': 0.1875, 'epoch': 0.55} 55%|█████▌ | 2366/4286 [14:42:23<10:27:43, 19.62s/it] 55%|█████▌ | 2367/4286 [14:42:43<10:26:15, 19.58s/it] {'loss': 0.0066, 'grad_norm': 1.487725283304915, 'learning_rate': 4.4773681754549695e-07, 'completion_length': 208.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 1.0, 'reward': 1.8244048357009888, 'reward_std': 0.05578738544136286, 'kl': 0.1650390625, 'epoch': 0.55} 55%|█████▌ | 2367/4286 [14:42:43<10:26:15, 19.58s/it] 55%|█████▌ | 2368/4286 [14:43:04<10:41:05, 20.05s/it] {'loss': 0.0119, 'grad_norm': 2.9433675668134334, 'learning_rate': 4.4750349976668223e-07, 'completion_length': 189.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7372711002826691, 'rewards/format_reward': 1.0, 'reward': 1.7372711300849915, 'reward_std': 0.09703013300895691, 'kl': 0.2958984375, 'epoch': 0.55} 55%|█████▌ | 2368/4286 [14:43:04<10:41:05, 20.05s/it] 55%|█████▌ | 2369/4286 [14:43:23<10:26:06, 19.60s/it] {'loss': 0.007, 'grad_norm': 1.6275407246835865, 'learning_rate': 
4.472701819878675e-07, 'completion_length': 181.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.03298483043909073, 'kl': 0.1748046875, 'epoch': 0.55} 55%|█████▌ | 2369/4286 [14:43:23<10:26:06, 19.60s/it] 55%|█████▌ | 2370/4286 [14:43:41<10:11:37, 19.15s/it] {'loss': 0.0074, 'grad_norm': 3.581120291063582, 'learning_rate': 4.470368642090527e-07, 'completion_length': 174.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7425595819950104, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.0386904813349247, 'kl': 0.18505859375, 'epoch': 0.55} 55%|█████▌ | 2370/4286 [14:43:41<10:11:37, 19.15s/it] 55%|█████▌ | 2371/4286 [14:44:01<10:19:26, 19.41s/it] {'loss': 0.0074, 'grad_norm': 0.9853154820417203, 'learning_rate': 4.46803546430238e-07, 'completion_length': 176.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.5892857909202576, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.05222322791814804, 'kl': 0.18505859375, 'epoch': 0.55} 55%|█████▌ | 2371/4286 [14:44:01<10:19:26, 19.41s/it] 55%|█████▌ | 2372/4286 [14:44:20<10:19:51, 19.43s/it] {'loss': 0.0073, 'grad_norm': 1.1614033014487963, 'learning_rate': 4.465702286514232e-07, 'completion_length': 182.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5520833432674408, 'rewards/format_reward': 1.0, 'reward': 1.552083432674408, 'reward_std': 0.036421436816453934, 'kl': 0.1826171875, 'epoch': 0.55} 55%|█████▌ | 2372/4286 [14:44:20<10:19:51, 19.43s/it] 55%|█████▌ | 2373/4286 [14:44:41<10:30:41, 19.78s/it] {'loss': 0.0071, 'grad_norm': 2.219559508083726, 'learning_rate': 4.463369108726085e-07, 'completion_length': 210.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.07578602433204651, 'kl': 0.17724609375, 'epoch': 0.55} 55%|█████▌ | 2373/4286 
[14:44:41<10:30:41, 19.78s/it] 55%|█████▌ | 2374/4286 [14:45:00<10:25:41, 19.63s/it] {'loss': 0.0249, 'grad_norm': 2.5527608951389626, 'learning_rate': 4.4610359309379377e-07, 'completion_length': 206.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6889880895614624, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.03869048459455371, 'kl': 0.62158203125, 'epoch': 0.55} 55%|█████▌ | 2374/4286 [14:45:00<10:25:41, 19.63s/it] 55%|█████▌ | 2375/4286 [14:45:19<10:22:06, 19.53s/it] {'loss': 0.0254, 'grad_norm': 3.999136396332132, 'learning_rate': 4.45870275314979e-07, 'completion_length': 193.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.6755953133106232, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.657738208770752, 'reward_std': 0.09959554392844439, 'kl': 0.63818359375, 'epoch': 0.55} 55%|█████▌ | 2375/4286 [14:45:19<10:22:06, 19.53s/it] 55%|█████▌ | 2376/4286 [14:45:40<10:34:55, 19.95s/it] {'loss': 0.0082, 'grad_norm': 5.196331171085631, 'learning_rate': 4.4563695753616427e-07, 'completion_length': 201.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.7291666865348816, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.03197786211967468, 'kl': 0.2060546875, 'epoch': 0.55} 55%|█████▌ | 2376/4286 [14:45:40<10:34:55, 19.95s/it] 55%|█████▌ | 2377/4286 [14:46:02<10:46:48, 20.33s/it] {'loss': 0.0278, 'grad_norm': 1.641836346082735, 'learning_rate': 4.454036397573495e-07, 'completion_length': 213.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7149659693241119, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6971089243888855, 'reward_std': 0.1224743090569973, 'kl': 0.697265625, 'epoch': 0.55} 55%|█████▌ | 2377/4286 [14:46:02<10:46:48, 20.33s/it] 55%|█████▌ | 2378/4286 [14:46:20<10:31:51, 19.87s/it] {'loss': 0.0207, 'grad_norm': 1.8955037187450046, 'learning_rate': 4.4517032197853477e-07, 'completion_length': 184.3928680419922, 
'rewards/only_full_func_accuracy_reward': 0.4925595819950104, 'rewards/format_reward': 1.0, 'reward': 1.4925596117973328, 'reward_std': 0.06250000465661287, 'kl': 0.51904296875, 'epoch': 0.55} 55%|█████▌ | 2378/4286 [14:46:20<10:31:51, 19.87s/it] 56%|█████▌ | 2379/4286 [14:46:41<10:41:07, 20.17s/it] {'loss': 0.0335, 'grad_norm': 2.6048734810915404, 'learning_rate': 4.4493700419972004e-07, 'completion_length': 205.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5773809850215912, 'rewards/format_reward': 1.0, 'reward': 1.5773810744285583, 'reward_std': 0.06823870353400707, 'kl': 0.83984375, 'epoch': 0.56} 56%|█████▌ | 2379/4286 [14:46:41<10:41:07, 20.17s/it] 56%|█████▌ | 2380/4286 [14:47:05<11:13:51, 21.21s/it] {'loss': 0.0435, 'grad_norm': 2.4877995894286697, 'learning_rate': 4.4470368642090527e-07, 'completion_length': 203.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.6800595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.11662012711167336, 'kl': 1.087890625, 'epoch': 0.56} 56%|█████▌ | 2380/4286 [14:47:05<11:13:51, 21.21s/it] 56%|█████▌ | 2381/4286 [14:47:26<11:13:48, 21.22s/it] {'loss': 0.0283, 'grad_norm': 3.016220372558539, 'learning_rate': 4.4447036864209054e-07, 'completion_length': 228.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6285714507102966, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6107143759727478, 'reward_std': 0.1244646655395627, 'kl': 0.71044921875, 'epoch': 0.56} 56%|█████▌ | 2381/4286 [14:47:26<11:13:48, 21.22s/it] 56%|█████▌ | 2382/4286 [14:47:47<11:09:00, 21.08s/it] {'loss': 0.0429, 'grad_norm': 11.860058562800198, 'learning_rate': 4.4423705086327576e-07, 'completion_length': 212.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5770833492279053, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5592262148857117, 'reward_std': 0.11829685047268867, 'kl': 1.072265625, 'epoch': 0.56} 56%|█████▌ | 2382/4286 [14:47:47<11:09:00, 21.08s/it] 
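Each step above prints a Python-dict-style record (`loss`, `grad_norm`, `learning_rate`, reward components, `kl`, `epoch`) interleaved with tqdm progress redraws. A small sketch for pulling those records back out of a raw log like this one, should post-hoc analysis be needed; the regex and the sample string are illustrative assumptions, not part of the training code:

```python
import ast
import re

# Matches the {'loss': ..., 'epoch': ...} records this trainer prints
# once per step; re.S lets a record span wrapped log lines.
RECORD_RE = re.compile(r"\{'loss':.*?'epoch': [0-9.]+\}", re.S)

def parse_records(log_text: str) -> list[dict]:
    # The records are plain dict literals, so ast.literal_eval is a
    # safe way to turn each match into a dict (no code execution).
    return [ast.literal_eval(m.group(0)) for m in RECORD_RE.finditer(log_text)]

# Hypothetical one-line excerpt in the same shape as the log above.
sample = ("55%|... 2347/4286 [14:36:12<10:04:27, 18.70s/it] "
          "{'loss': 0.0086, 'grad_norm': 66.41, 'reward': 1.5357, "
          "'kl': 0.2139, 'epoch': 0.55}")
records = parse_records(sample)
```

From there, columns such as `reward` or `kl` can be tracked across steps without re-running training.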
56%|█████▌ | 2383/4286 [14:48:09<11:16:49, 21.34s/it] {'loss': 0.0506, 'grad_norm': 1.3773895885582563, 'learning_rate': 4.4400373308446104e-07, 'completion_length': 200.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5776786506175995, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5419644117355347, 'reward_std': 0.1612866334617138, 'kl': 1.26171875, 'epoch': 0.56} 56%|█████▌ | 2383/4286 [14:48:09<11:16:49, 21.34s/it] 56%|█████▌ | 2384/4286 [14:48:32<11:33:32, 21.88s/it] {'loss': 0.0366, 'grad_norm': 1.635437529460724, 'learning_rate': 4.437704153056463e-07, 'completion_length': 214.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.578869104385376, 'rewards/format_reward': 1.0, 'reward': 1.5788691639900208, 'reward_std': 0.04013477172702551, 'kl': 0.91162109375, 'epoch': 0.56} 56%|█████▌ | 2384/4286 [14:48:32<11:33:32, 21.88s/it] 56%|█████▌ | 2385/4286 [14:48:52<11:18:53, 21.43s/it] {'loss': 0.0359, 'grad_norm': 4.651450670058011, 'learning_rate': 4.4353709752683153e-07, 'completion_length': 211.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.605654776096344, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.05059524066746235, 'kl': 0.9013671875, 'epoch': 0.56} 56%|█████▌ | 2385/4286 [14:48:52<11:18:53, 21.43s/it] 56%|█████▌ | 2386/4286 [14:49:14<11:16:16, 21.36s/it] {'loss': 0.0614, 'grad_norm': 3.941902481343568, 'learning_rate': 4.433037797480168e-07, 'completion_length': 204.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.6190476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6011905670166016, 'reward_std': 0.10608284920454025, 'kl': 1.5390625, 'epoch': 0.56} 56%|█████▌ | 2386/4286 [14:49:14<11:16:16, 21.36s/it] 56%|█████▌ | 2387/4286 [14:49:34<11:07:28, 21.09s/it] {'loss': 0.0604, 'grad_norm': 1.972453558240801, 'learning_rate': 4.4307046196920203e-07, 'completion_length': 219.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5163690894842148, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.498512089252472, 'reward_std': 0.18233326822519302, 'kl': 1.51220703125, 'epoch': 0.56} 56%|█████▌ | 2387/4286 [14:49:34<11:07:28, 21.09s/it] 56%|█████▌ | 2388/4286 [14:49:53<10:42:34, 20.31s/it] {'loss': 0.0649, 'grad_norm': 1.2115879459136534, 'learning_rate': 4.428371441903873e-07, 'completion_length': 187.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6741071939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.638392984867096, 'reward_std': 0.1755952462553978, 'kl': 1.6171875, 'epoch': 0.56} 56%|█████▌ | 2388/4286 [14:49:53<10:42:34, 20.31s/it] 56%|█████▌ | 2389/4286 [14:50:13<10:45:04, 20.40s/it] {'loss': 0.0218, 'grad_norm': 1.8602684461649648, 'learning_rate': 4.426038264115726e-07, 'completion_length': 202.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.6949404776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6770834922790527, 'reward_std': 0.11828739196062088, 'kl': 0.54638671875, 'epoch': 0.56} 56%|█████▌ | 2389/4286 [14:50:13<10:45:04, 20.40s/it] 56%|█████▌ | 2390/4286 [14:50:32<10:34:35, 20.08s/it] {'loss': 0.0324, 'grad_norm': 2.0371356295384175, 'learning_rate': 4.423705086327578e-07, 'completion_length': 204.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.626488208770752, 'reward_std': 0.1577381044626236, 'kl': 0.810546875, 'epoch': 0.56} 56%|█████▌ | 2390/4286 [14:50:32<10:34:35, 20.08s/it] 56%|█████▌ | 2391/4286 [14:50:52<10:31:58, 20.01s/it] {'loss': 0.0216, 'grad_norm': 6.5633246136945145, 'learning_rate': 4.421371908539431e-07, 'completion_length': 178.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.5639881193637848, 'rewards/format_reward': 1.0, 'reward': 1.563988208770752, 'reward_std': 0.07584905996918678, 'kl': 0.541015625, 'epoch': 0.56} 56%|█████▌ | 2391/4286 [14:50:52<10:31:58, 20.01s/it] 56%|█████▌ | 2392/4286 
[14:51:12<10:33:03, 20.05s/it] {'loss': 0.0074, 'grad_norm': 10.934102177391528, 'learning_rate': 4.4190387307512836e-07, 'completion_length': 198.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5357143059372902, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.01785714365541935, 'kl': 0.18505859375, 'epoch': 0.56} 56%|█████▌ | 2392/4286 [14:51:12<10:33:03, 20.05s/it] 56%|█████▌ | 2393/4286 [14:51:35<10:56:11, 20.80s/it] {'loss': 0.0758, 'grad_norm': 2.597852539336043, 'learning_rate': 4.416705552963136e-07, 'completion_length': 231.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.4255952686071396, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3898810744285583, 'reward_std': 0.15558281168341637, 'kl': 1.892578125, 'epoch': 0.56} 56%|█████▌ | 2393/4286 [14:51:35<10:56:11, 20.80s/it] 56%|█████▌ | 2394/4286 [14:51:56<11:01:18, 20.97s/it] {'loss': 0.0527, 'grad_norm': 1.786268517425961, 'learning_rate': 4.4143723751749885e-07, 'completion_length': 224.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6468962728977203, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.629039227962494, 'reward_std': 0.14754897356033325, 'kl': 1.3115234375, 'epoch': 0.56} 56%|█████▌ | 2394/4286 [14:51:56<11:01:18, 20.97s/it] 56%|█████▌ | 2395/4286 [14:52:17<11:00:05, 20.94s/it] {'loss': 0.0063, 'grad_norm': 5.8231577758562425, 'learning_rate': 4.412039197386841e-07, 'completion_length': 208.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6651786267757416, 'rewards/format_reward': 1.0, 'reward': 1.6651787161827087, 'reward_std': 0.0267857164144516, 'kl': 0.1572265625, 'epoch': 0.56} 56%|█████▌ | 2395/4286 [14:52:17<11:00:05, 20.94s/it] 56%|█████▌ | 2396/4286 [14:52:38<10:54:33, 20.78s/it] {'loss': 0.089, 'grad_norm': 1.8108882888876996, 'learning_rate': 4.4097060195986935e-07, 'completion_length': 211.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5223214626312256, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.504464328289032, 'reward_std': 0.11447649914771318, 'kl': 2.21875, 'epoch': 0.56} 56%|█████▌ | 2396/4286 [14:52:38<10:54:33, 20.78s/it] 56%|█████▌ | 2397/4286 [14:52:58<10:54:55, 20.80s/it] {'loss': 0.0072, 'grad_norm': 0.4527811900892479, 'learning_rate': 4.407372841810546e-07, 'completion_length': 215.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.4598214477300644, 'rewards/format_reward': 1.0, 'reward': 1.4598214626312256, 'reward_std': 0.026785715483129025, 'kl': 0.17919921875, 'epoch': 0.56} 56%|█████▌ | 2397/4286 [14:52:58<10:54:55, 20.80s/it] 56%|█████▌ | 2398/4286 [14:53:19<10:47:58, 20.59s/it] {'loss': 0.074, 'grad_norm': 1.0458162737274683, 'learning_rate': 4.4050396640223985e-07, 'completion_length': 195.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5803572535514832, 'reward_std': 0.1845238208770752, 'kl': 1.84375, 'epoch': 0.56} 56%|█████▌ | 2398/4286 [14:53:19<10:47:58, 20.59s/it] 56%|█████▌ | 2399/4286 [14:53:39<10:44:22, 20.49s/it] {'loss': 0.0061, 'grad_norm': 0.8796753626494334, 'learning_rate': 4.402706486234251e-07, 'completion_length': 237.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892858505249023, 'reward_std': 0.045576510950922966, 'kl': 0.15087890625, 'epoch': 0.56} 56%|█████▌ | 2399/4286 [14:53:39<10:44:22, 20.49s/it] 56%|█████▌ | 2400/4286 [14:54:00<10:47:18, 20.59s/it] {'loss': 0.098, 'grad_norm': 5.658307113191542, 'learning_rate': 4.4003733084461034e-07, 'completion_length': 207.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6220239400863647, 'reward_std': 0.267857164144516, 'kl': 2.4453125, 'epoch': 0.56} 56%|█████▌ | 2400/4286 [14:54:00<10:47:18, 20.59s/it] 56%|█████▌ | 2401/4286 [14:57:49<43:32:15, 83.15s/it] {'loss': 0.0316, 'grad_norm': 
8.000107088102208, 'learning_rate': 4.398040130657956e-07, 'completion_length': 199.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.7023809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6845239400863647, 'reward_std': 0.10143721103668213, 'kl': 0.787109375, 'epoch': 0.56} 56%|█████▌ | 2401/4286 [14:57:49<43:32:15, 83.15s/it][2025-03-02 20:05:28,023] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 56%|█████▌ | 2402/4286 [14:58:12<34:07:35, 65.21s/it] {'loss': 0.0976, 'grad_norm': 19.86686746960613, 'learning_rate': 4.395706952869809e-07, 'completion_length': 233.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5982142984867096, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5267858505249023, 'reward_std': 0.1845238246023655, 'kl': 2.4501953125, 'epoch': 0.56} 56%|█████▌ | 2402/4286 [14:58:12<34:07:35, 65.21s/it][2025-03-02 20:05:50,514] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 56%|█████▌ | 2403/4286 [14:58:35<27:24:18, 52.39s/it] {'loss': 0.0613, 'grad_norm': 13.782174663198061, 'learning_rate': 4.393373775081661e-07, 'completion_length': 237.96430206298828, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.563988208770752, 'reward_std': 0.1603916436433792, 'kl': 1.53125, 'epoch': 0.56} 56%|█████▌ | 2403/4286 [14:58:35<27:24:18, 52.39s/it] 56%|█████▌ | 2404/4286 [14:58:55<22:23:10, 42.82s/it] {'loss': 0.0358, 'grad_norm': 14.898394579139659, 'learning_rate': 4.391040597293514e-07, 'completion_length': 199.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.8125000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8125001192092896, 'reward_std': 0.045103274285793304, 'kl': 0.8984375, 'epoch': 0.56} 56%|█████▌ | 2404/4286 [14:58:55<22:23:10, 42.82s/it] 56%|█████▌ | 2405/4286 [14:59:17<19:10:03, 36.68s/it] {'loss': 0.0278, 'grad_norm': 3.791777191949155, 'learning_rate': 4.388707419505366e-07, 'completion_length': 216.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5705357491970062, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5526787638664246, 'reward_std': 0.07321429252624512, 'kl': 0.6953125, 'epoch': 0.56} 56%|█████▌ | 2405/4286 [14:59:17<19:10:03, 36.68s/it][2025-03-02 20:06:54,858] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
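The recurring `stage3.py` warning recommends calling `get_accelerator().empty_cache()` at a synchronized point in the training loop so every rank flushes its allocator cache at the same time. A minimal sketch of that remedy, assuming a generic DeepSpeed engine loop; `engine`, `loader`, and the `flush_every` interval are illustrative placeholders, not taken from this run:

```python
def should_flush(step: int, every: int = 100) -> bool:
    # Flush on a fixed step schedule so all ranks flush together,
    # rather than each rank flushing ad hoc under memory pressure.
    return every > 0 and step % every == 0

def train(engine, loader, flush_every: int = 100):
    # `engine` is assumed to be a deepspeed.initialize() engine.
    # Import is local so the sketch stays self-contained.
    from deepspeed.accelerator import get_accelerator
    for step, batch in enumerate(loader):
        loss = engine(batch)        # forward pass (placeholder)
        engine.backward(loss)
        engine.step()
        if should_flush(step, flush_every):
            # Synchronized allocator-cache flush, per the warning above.
            get_accelerator().empty_cache()
```

Flushing is itself expensive, so a coarse interval (or flushing only while the warning keeps appearing) is the usual compromise; the step-time spike around step 2401 above shows the kind of stall the warning is pointing at.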
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 56%|█████▌ | 2406/4286 [14:59:39<16:46:39, 32.13s/it] {'loss': 0.0701, 'grad_norm': 4.541346458052021, 'learning_rate': 4.386374241717219e-07, 'completion_length': 214.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.482142984867096, 'reward_std': 0.08985321596264839, 'kl': 1.75390625, 'epoch': 0.56} 56%|█████▌ | 2406/4286 [14:59:39<16:46:39, 32.13s/it] 56%|█████▌ | 2407/4286 [15:00:02<15:22:00, 29.44s/it] {'loss': 0.011, 'grad_norm': 2.1132893506404167, 'learning_rate': 4.3840410639290716e-07, 'completion_length': 235.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4553572535514832, 'reward_std': 0.11751950159668922, 'kl': 0.2734375, 'epoch': 0.56} 56%|█████▌ | 2407/4286 [15:00:02<15:22:00, 29.44s/it] 56%|█████▌ | 2408/4286 [15:00:25<14:23:11, 27.58s/it] {'loss': 0.0296, 'grad_norm': 23.988980191564597, 'learning_rate': 4.381707886140924e-07, 'completion_length': 221.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.5476190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.529762089252472, 'reward_std': 0.0476190522313118, 'kl': 0.7412109375, 'epoch': 0.56} 56%|█████▌ | 2408/4286 [15:00:25<14:23:11, 27.58s/it] 56%|█████▌ | 2409/4286 [15:00:47<13:23:52, 25.70s/it] {'loss': 0.0577, 'grad_norm': 22.325931857008047, 'learning_rate': 4.3793747083527766e-07, 'completion_length': 249.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 1.0, 'reward': 1.602678656578064, 'reward_std': 0.08352364972233772, 'kl': 1.443359375, 'epoch': 0.56} 56%|█████▌ | 2409/4286 [15:00:47<13:23:52, 25.70s/it] 56%|█████▌ | 2410/4286 [15:01:07<12:33:41, 24.11s/it] {'loss': 
0.0508, 'grad_norm': 2.1661235322473464, 'learning_rate': 4.377041530564629e-07, 'completion_length': 216.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5773810148239136, 'rewards/format_reward': 1.0, 'reward': 1.5773810744285583, 'reward_std': 0.060968104749917984, 'kl': 1.26708984375, 'epoch': 0.56} 56%|█████▌ | 2410/4286 [15:01:07<12:33:41, 24.11s/it] 56%|█████▋ | 2411/4286 [15:01:28<12:00:33, 23.06s/it] {'loss': 0.0088, 'grad_norm': 2.916139079616157, 'learning_rate': 4.374708352776481e-07, 'completion_length': 226.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6050595641136169, 'rewards/format_reward': 1.0, 'reward': 1.6050596237182617, 'reward_std': 0.04821429401636124, 'kl': 0.2197265625, 'epoch': 0.56} 56%|█████▋ | 2411/4286 [15:01:28<12:00:33, 23.06s/it] 56%|█████▋ | 2412/4286 [15:01:49<11:46:14, 22.61s/it] {'loss': 0.0191, 'grad_norm': 160.5998019140178, 'learning_rate': 4.372375174988334e-07, 'completion_length': 242.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6860120296478271, 'reward_std': 0.10638263262808323, 'kl': 0.47900390625, 'epoch': 0.56} 56%|█████▋ | 2412/4286 [15:01:49<11:46:14, 22.61s/it] 56%|█████▋ | 2413/4286 [15:02:10<11:30:12, 22.11s/it] {'loss': 0.0498, 'grad_norm': 4.714180715716639, 'learning_rate': 4.370041997200186e-07, 'completion_length': 217.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.5105655044317245, 'rewards/format_reward': 1.0, 'reward': 1.5105655193328857, 'reward_std': 0.05787799879908562, 'kl': 1.2451171875, 'epoch': 0.56} 56%|█████▋ | 2413/4286 [15:02:10<11:30:12, 22.11s/it][2025-03-02 20:09:46,005] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 56%|█████▋ | 2414/4286 [15:02:30<11:09:19, 21.45s/it] {'loss': 0.0229, 'grad_norm': 30.793221397260716, 'learning_rate': 4.367708819412039e-07, 'completion_length': 187.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.55952388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5416667461395264, 'reward_std': 0.12021518871188164, 'kl': 0.572265625, 'epoch': 0.56} 56%|█████▋ | 2414/4286 [15:02:30<11:09:19, 21.45s/it] 56%|█████▋ | 2415/4286 [15:02:51<11:06:20, 21.37s/it] {'loss': 0.0424, 'grad_norm': 5.0176229529382566, 'learning_rate': 4.365375641623891e-07, 'completion_length': 237.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6145833432674408, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5788691639900208, 'reward_std': 0.196530569344759, 'kl': 1.0634765625, 'epoch': 0.56} 56%|█████▋ | 2415/4286 [15:02:51<11:06:20, 21.37s/it] 56%|█████▋ | 2416/4286 [15:03:15<11:27:18, 22.05s/it] {'loss': 0.0319, 'grad_norm': 28.824158511614254, 'learning_rate': 4.363042463835744e-07, 'completion_length': 227.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.659722238779068, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6418651342391968, 'reward_std': 0.1466829478740692, 'kl': 0.7958984375, 'epoch': 0.56} 56%|█████▋ | 2416/4286 [15:03:15<11:27:18, 22.05s/it] 56%|█████▋ | 2417/4286 [15:03:39<11:46:19, 22.68s/it] {'loss': 0.0166, 'grad_norm': 12.959724431580357, 'learning_rate': 4.3607092860475965e-07, 'completion_length': 241.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.5979166626930237, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5800596475601196, 'reward_std': 0.0582836139947176, 'kl': 0.4150390625, 'epoch': 0.56} 56%|█████▋ | 2417/4286 [15:03:39<11:46:19, 22.68s/it] 56%|█████▋ | 2418/4286 
[15:04:00<11:30:16, 22.17s/it] {'loss': 0.0322, 'grad_norm': 59.9531140279208, 'learning_rate': 4.358376108259449e-07, 'completion_length': 234.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.605654776096344, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.0740121565759182, 'kl': 0.80908203125, 'epoch': 0.56} 56%|█████▋ | 2418/4286 [15:04:00<11:30:16, 22.17s/it][2025-03-02 20:11:40,522] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 56%|█████▋ | 2419/4286 [15:04:25<11:52:19, 22.89s/it] {'loss': 0.021, 'grad_norm': 8.311527881744484, 'learning_rate': 4.3560429304713015e-07, 'completion_length': 230.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.633184552192688, 'rewards/format_reward': 1.0, 'reward': 1.6331846117973328, 'reward_std': 0.0550595261156559, 'kl': 0.525390625, 'epoch': 0.56} 56%|█████▋ | 2419/4286 [15:04:25<11:52:19, 22.89s/it] 56%|█████▋ | 2420/4286 [15:04:46<11:38:47, 22.47s/it] {'loss': 0.0085, 'grad_norm': 3.5263547742052963, 'learning_rate': 4.353709752683154e-07, 'completion_length': 214.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.03847679682075977, 'kl': 0.212890625, 'epoch': 0.56} 56%|█████▋ | 2420/4286 [15:04:46<11:38:47, 22.47s/it] 56%|█████▋ | 2421/4286 [15:05:09<11:42:35, 22.60s/it] {'loss': 0.0184, 'grad_norm': 17.962106934989293, 'learning_rate': 4.3513765748950065e-07, 'completion_length': 248.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 
'rewards/format_reward': 1.0, 'reward': 1.6904762387275696, 'reward_std': 0.07142857648432255, 'kl': 0.4599609375, 'epoch': 0.56} 56%|█████▋ | 2421/4286 [15:05:09<11:42:35, 22.60s/it] 57%|█████▋ | 2422/4286 [15:05:30<11:25:40, 22.07s/it] {'loss': 0.0116, 'grad_norm': 3.317442948394002, 'learning_rate': 4.349043397106859e-07, 'completion_length': 175.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7708333730697632, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.01785714365541935, 'kl': 0.28857421875, 'epoch': 0.57} 57%|█████▋ | 2422/4286 [15:05:30<11:25:40, 22.07s/it] 57%|█████▋ | 2423/4286 [15:05:51<11:17:37, 21.82s/it] {'loss': 0.061, 'grad_norm': 9.999093284551922, 'learning_rate': 4.3467102193187114e-07, 'completion_length': 242.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5352891832590103, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.481717824935913, 'reward_std': 0.15677109360694885, 'kl': 1.53125, 'epoch': 0.57} 57%|█████▋ | 2423/4286 [15:05:51<11:17:37, 21.82s/it] 57%|█████▋ | 2424/4286 [15:06:13<11:18:37, 21.87s/it] {'loss': 0.0513, 'grad_norm': 4.75765294868621, 'learning_rate': 4.344377041530564e-07, 'completion_length': 242.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.6547619700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6369048953056335, 'reward_std': 0.130651094019413, 'kl': 1.28125, 'epoch': 0.57} 57%|█████▋ | 2424/4286 [15:06:13<11:18:37, 21.87s/it] 57%|█████▋ | 2425/4286 [15:06:33<11:04:02, 21.41s/it] {'loss': 0.0213, 'grad_norm': 1.6597837845716052, 'learning_rate': 4.342043863742417e-07, 'completion_length': 222.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.68154776096344, 'reward_std': 0.10352957993745804, 'kl': 0.53125, 'epoch': 0.57} 57%|█████▋ | 2425/4286 [15:06:33<11:04:02, 21.41s/it] 57%|█████▋ | 2426/4286 [15:06:56<11:16:20, 21.82s/it] {'loss': 0.0446, 
'grad_norm': 10.246162829201744, 'learning_rate': 4.339710685954269e-07, 'completion_length': 233.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.447916716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4300596117973328, 'reward_std': 0.06498841592110693, 'kl': 1.1181640625, 'epoch': 0.57} 57%|█████▋ | 2426/4286 [15:06:56<11:16:20, 21.82s/it][2025-03-02 20:14:33,821] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 57%|█████▋ | 2427/4286 [15:07:18<11:15:17, 21.80s/it] {'loss': 0.0123, 'grad_norm': 5.944087621268716, 'learning_rate': 4.337377508166122e-07, 'completion_length': 233.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.6369048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6369048953056335, 'reward_std': 0.06772436667233706, 'kl': 0.3056640625, 'epoch': 0.57} 57%|█████▋ | 2427/4286 [15:07:18<11:15:17, 21.80s/it] 57%|█████▋ | 2428/4286 [15:07:39<11:09:37, 21.62s/it] {'loss': 0.0587, 'grad_norm': 6.074846301903456, 'learning_rate': 4.335044330377974e-07, 'completion_length': 221.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6041668057441711, 'reward_std': 0.13316834717988968, 'kl': 1.46875, 'epoch': 0.57} 57%|█████▋ | 2428/4286 [15:07:39<11:09:37, 21.62s/it] 57%|█████▋ | 2429/4286 [15:08:01<11:10:54, 21.68s/it] {'loss': 0.0267, 'grad_norm': 7.4516321414046205, 'learning_rate': 4.332711152589827e-07, 'completion_length': 241.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6398810148239136, 'rewards/format_reward': 1.0, 'reward': 
1.6398810744285583, 'reward_std': 0.11236806213855743, 'kl': 0.6669921875, 'epoch': 0.57} 57%|█████▋ | 2429/4286 [15:08:01<11:10:54, 21.68s/it] 57%|█████▋ | 2430/4286 [15:08:22<11:04:20, 21.48s/it] {'loss': 0.0114, 'grad_norm': 1.8254083031258612, 'learning_rate': 4.3303779748016796e-07, 'completion_length': 229.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.52976194024086, 'rewards/format_reward': 1.0, 'reward': 1.5297619700431824, 'reward_std': 0.05197649449110031, 'kl': 0.28515625, 'epoch': 0.57} 57%|█████▋ | 2430/4286 [15:08:22<11:04:20, 21.48s/it] 57%|█████▋ | 2431/4286 [15:08:44<11:05:41, 21.53s/it] {'loss': 0.0235, 'grad_norm': 2.4674181712390224, 'learning_rate': 4.328044797013532e-07, 'completion_length': 223.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.674107164144516, 'rewards/format_reward': 1.0, 'reward': 1.6741071939468384, 'reward_std': 0.06336775794625282, 'kl': 0.58740234375, 'epoch': 0.57} 57%|█████▋ | 2431/4286 [15:08:44<11:05:41, 21.53s/it] 57%|█████▋ | 2432/4286 [15:09:05<11:02:48, 21.45s/it] {'loss': 0.0127, 'grad_norm': 2.3479063117774968, 'learning_rate': 4.3257116192253846e-07, 'completion_length': 206.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6529762446880341, 'rewards/format_reward': 1.0, 'reward': 1.6529763340950012, 'reward_std': 0.09796987194567919, 'kl': 0.31689453125, 'epoch': 0.57} 57%|█████▋ | 2432/4286 [15:09:05<11:02:48, 21.45s/it] 57%|█████▋ | 2433/4286 [15:09:28<11:17:18, 21.93s/it] {'loss': 0.0356, 'grad_norm': 8.348319928495428, 'learning_rate': 4.323378441437237e-07, 'completion_length': 218.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.5300595760345459, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.512202501296997, 'reward_std': 0.036113839596509933, 'kl': 0.89013671875, 'epoch': 0.57} 57%|█████▋ | 2433/4286 [15:09:28<11:17:18, 21.93s/it] 57%|█████▋ | 2434/4286 [15:09:50<11:21:22, 22.07s/it] {'loss': 0.0316, 'grad_norm': 5.549296165050428, 
'learning_rate': 4.3210452636490896e-07, 'completion_length': 248.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.5877976417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5699406266212463, 'reward_std': 0.1154559999704361, 'kl': 0.791015625, 'epoch': 0.57} 57%|█████▋ | 2434/4286 [15:09:50<11:21:22, 22.07s/it] 57%|█████▋ | 2435/4286 [15:10:11<11:06:09, 21.59s/it] {'loss': 0.0157, 'grad_norm': 3.2897120917426554, 'learning_rate': 4.3187120858609423e-07, 'completion_length': 198.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7674603462219238, 'rewards/format_reward': 1.0, 'reward': 1.7674603462219238, 'reward_std': 0.07854177802801132, 'kl': 0.39111328125, 'epoch': 0.57} 57%|█████▋ | 2435/4286 [15:10:11<11:06:09, 21.59s/it] 57%|█████▋ | 2436/4286 [15:10:32<11:01:46, 21.46s/it] {'loss': 0.0063, 'grad_norm': 4.32332601407616, 'learning_rate': 4.3163789080727946e-07, 'completion_length': 233.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5252976268529892, 'rewards/format_reward': 1.0, 'reward': 1.5252978205680847, 'reward_std': 0.040532154962420464, 'kl': 0.15625, 'epoch': 0.57} 57%|█████▋ | 2436/4286 [15:10:32<11:01:46, 21.46s/it] 57%|█████▋ | 2437/4286 [15:10:53<10:55:01, 21.26s/it] {'loss': 0.0388, 'grad_norm': 5.3648696560270785, 'learning_rate': 4.3140457302846473e-07, 'completion_length': 218.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.586309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5684524774551392, 'reward_std': 0.07511191815137863, 'kl': 0.97216796875, 'epoch': 0.57} 57%|█████▋ | 2437/4286 [15:10:53<10:55:01, 21.26s/it] 57%|█████▋ | 2438/4286 [15:11:15<11:05:09, 21.60s/it] {'loss': 0.0336, 'grad_norm': 4.435626961619951, 'learning_rate': 4.3117125524964995e-07, 'completion_length': 220.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.4747024327516556, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4568453431129456, 'reward_std': 
0.06456848699599504, 'kl': 0.84130859375, 'epoch': 0.57} 57%|█████▋ | 2438/4286 [15:11:15<11:05:09, 21.60s/it] 57%|█████▋ | 2439/4286 [15:11:36<10:59:00, 21.41s/it] {'loss': 0.0062, 'grad_norm': 1.695230603024131, 'learning_rate': 4.3093793747083523e-07, 'completion_length': 230.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6872024536132812, 'rewards/format_reward': 1.0, 'reward': 1.6872025728225708, 'reward_std': 0.06401878781616688, 'kl': 0.15576171875, 'epoch': 0.57} 57%|█████▋ | 2439/4286 [15:11:36<10:59:00, 21.41s/it] 57%|█████▋ | 2440/4286 [15:12:00<11:21:34, 22.15s/it] {'loss': 0.0474, 'grad_norm': 6.138521742968444, 'learning_rate': 4.307046196920205e-07, 'completion_length': 239.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6947782039642334, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6769211292266846, 'reward_std': 0.13688058406114578, 'kl': 1.181640625, 'epoch': 0.57} 57%|█████▋ | 2440/4286 [15:12:00<11:21:34, 22.15s/it] 57%|█████▋ | 2441/4286 [15:12:20<11:03:44, 21.59s/it] {'loss': 0.0211, 'grad_norm': 2.8241194206627247, 'learning_rate': 4.304713019132057e-07, 'completion_length': 224.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.690476268529892, 'rewards/format_reward': 1.0, 'reward': 1.6904763579368591, 'reward_std': 0.03283697459846735, 'kl': 0.529296875, 'epoch': 0.57} 57%|█████▋ | 2441/4286 [15:12:20<11:03:44, 21.59s/it] 57%|█████▋ | 2442/4286 [15:12:41<10:58:01, 21.41s/it] {'loss': 0.0125, 'grad_norm': 9.234525727819365, 'learning_rate': 4.30237984134391e-07, 'completion_length': 221.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.6577381789684296, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.09150167554616928, 'kl': 0.3125, 'epoch': 0.57} 57%|█████▋ | 2442/4286 [15:12:41<10:58:01, 21.41s/it] 57%|█████▋ | 2443/4286 [15:13:02<10:52:51, 21.25s/it] {'loss': 0.0101, 'grad_norm': 2.323992897086248, 'learning_rate': 4.300046663555763e-07, 
'completion_length': 221.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5654762089252472, 'rewards/format_reward': 1.0, 'reward': 1.5654762983322144, 'reward_std': 0.06721126660704613, 'kl': 0.2529296875, 'epoch': 0.57} 57%|█████▋ | 2443/4286 [15:13:02<10:52:51, 21.25s/it] 57%|█████▋ | 2444/4286 [15:13:23<10:44:15, 20.99s/it] {'loss': 0.0054, 'grad_norm': 0.7609341698294163, 'learning_rate': 4.297713485767615e-07, 'completion_length': 248.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.020619653165340424, 'kl': 0.13427734375, 'epoch': 0.57} 57%|█████▋ | 2444/4286 [15:13:23<10:44:15, 20.99s/it] 57%|█████▋ | 2445/4286 [15:13:44<10:52:13, 21.26s/it] {'loss': 0.0567, 'grad_norm': 7.386900582001966, 'learning_rate': 4.2953803079794677e-07, 'completion_length': 240.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.5204212963581085, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5025641918182373, 'reward_std': 0.14345603063702583, 'kl': 1.419921875, 'epoch': 0.57} 57%|█████▋ | 2445/4286 [15:13:44<10:52:13, 21.26s/it] 57%|█████▋ | 2446/4286 [15:14:05<10:42:53, 20.96s/it] {'loss': 0.0061, 'grad_norm': 3.4210883654640662, 'learning_rate': 4.29304713019132e-07, 'completion_length': 220.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5833333432674408, 'rewards/format_reward': 1.0, 'reward': 1.583333432674408, 'reward_std': 0.03755595535039902, 'kl': 0.15185546875, 'epoch': 0.57} 57%|█████▋ | 2446/4286 [15:14:05<10:42:53, 20.96s/it][2025-03-02 20:21:42,955] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 57%|█████▋ | 2447/4286 [15:14:27<10:55:36, 21.39s/it] {'loss': 0.0551, 'grad_norm': 8.763571230181727, 'learning_rate': 4.2907139524031727e-07, 'completion_length': 215.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7395834922790527, 'reward_std': 0.18445909023284912, 'kl': 1.37890625, 'epoch': 0.57} 57%|█████▋ | 2447/4286 [15:14:27<10:55:36, 21.39s/it] 57%|█████▋ | 2448/4286 [15:14:49<11:00:10, 21.55s/it] {'loss': 0.0091, 'grad_norm': 6.918430928329535, 'learning_rate': 4.2883807746150255e-07, 'completion_length': 250.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5148810148239136, 'rewards/format_reward': 1.0, 'reward': 1.5148810744285583, 'reward_std': 0.07677963841706514, 'kl': 0.22802734375, 'epoch': 0.57} 57%|█████▋ | 2448/4286 [15:14:49<11:00:10, 21.55s/it] 57%|█████▋ | 2449/4286 [15:15:09<10:42:59, 21.00s/it] {'loss': 0.0572, 'grad_norm': 4.267034419395995, 'learning_rate': 4.2860475968268777e-07, 'completion_length': 202.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6941965222358704, 'rewards/format_reward': 1.0, 'reward': 1.6941965818405151, 'reward_std': 0.1567697450518608, 'kl': 1.4296875, 'epoch': 0.57} 57%|█████▋ | 2449/4286 [15:15:09<10:42:59, 21.00s/it] 57%|█████▋ | 2450/4286 [15:15:32<10:59:05, 21.54s/it] {'loss': 0.0669, 'grad_norm': 4.2941682277269955, 'learning_rate': 4.2837144190387304e-07, 'completion_length': 230.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.43958334624767303, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4217262864112854, 'reward_std': 0.1360691711306572, 'kl': 1.673828125, 'epoch': 0.57} 57%|█████▋ | 2450/4286 [15:15:32<10:59:05, 21.54s/it][2025-03-02 20:23:08,776] [WARNING] [stage3.py:2134:step] 2 
pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 57%|█████▋ | 2451/4286 [15:15:53<10:57:17, 21.49s/it] {'loss': 0.0074, 'grad_norm': 63.207557241414875, 'learning_rate': 4.2813812412505827e-07, 'completion_length': 225.21430206298828, 'rewards/only_full_func_accuracy_reward': 0.578869104385376, 'rewards/format_reward': 1.0, 'reward': 1.5788691639900208, 'reward_std': 0.022675009444355965, 'kl': 0.1845703125, 'epoch': 0.57} 57%|█████▋ | 2451/4286 [15:15:53<10:57:17, 21.49s/it] 57%|█████▋ | 2452/4286 [15:16:14<10:51:46, 21.32s/it] {'loss': 0.0307, 'grad_norm': 3.838698303210443, 'learning_rate': 4.2790480634624354e-07, 'completion_length': 207.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.5872024595737457, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5693454146385193, 'reward_std': 0.11193265672773123, 'kl': 0.765625, 'epoch': 0.57} 57%|█████▋ | 2452/4286 [15:16:14<10:51:46, 21.32s/it][2025-03-02 20:23:50,726] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 57%|█████▋ | 2453/4286 [15:16:35<10:48:38, 21.23s/it] {'loss': 0.0192, 'grad_norm': 6.108556927860159, 'learning_rate': 4.276714885674288e-07, 'completion_length': 237.76787567138672, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.069187356159091, 'kl': 0.47802734375, 'epoch': 0.57} 57%|█████▋ | 2453/4286 [15:16:35<10:48:38, 21.23s/it] 57%|█████▋ | 2454/4286 [15:16:56<10:44:24, 21.11s/it] {'loss': 0.0268, 'grad_norm': 11.12489949227929, 'learning_rate': 4.2743817078861404e-07, 'completion_length': 215.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5595238208770752, 'rewards/format_reward': 1.0, 'reward': 1.5595239400863647, 'reward_std': 0.06823870167136192, 'kl': 0.671875, 'epoch': 0.57} 57%|█████▋ | 2454/4286 [15:16:56<10:44:24, 21.11s/it] 57%|█████▋ | 2455/4286 [15:17:18<10:56:17, 21.51s/it] {'loss': 0.1208, 'grad_norm': 8.881961616066851, 'learning_rate': 4.272048530097993e-07, 'completion_length': 231.57144927978516, 'rewards/only_full_func_accuracy_reward': 0.6493236720561981, 'rewards/format_reward': 1.0, 'reward': 1.6493236422538757, 'reward_std': 0.1233675628900528, 'kl': 3.01953125, 'epoch': 0.57} 57%|█████▋ | 2455/4286 [15:17:18<10:56:17, 21.51s/it] 57%|█████▋ | 2456/4286 [15:17:39<10:51:37, 21.36s/it] {'loss': 0.09, 'grad_norm': 1.4012044380526092, 'learning_rate': 4.2697153523098454e-07, 'completion_length': 216.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.5502976179122925, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4967263340950012, 'reward_std': 0.21828395873308182, 'kl': 2.24609375, 'epoch': 0.57} 57%|█████▋ | 2456/4286 [15:17:39<10:51:37, 21.36s/it] 57%|█████▋ | 2457/4286 [15:18:00<10:46:12, 21.20s/it] {'loss': 0.0075, 'grad_norm': 
1.2989364015589788, 'learning_rate': 4.267382174521698e-07, 'completion_length': 234.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.0476190485060215, 'kl': 0.18603515625, 'epoch': 0.57} 57%|█████▋ | 2457/4286 [15:18:00<10:46:12, 21.20s/it] 57%|█████▋ | 2458/4286 [15:18:21<10:44:57, 21.17s/it] {'loss': 0.0555, 'grad_norm': 15.778818464931309, 'learning_rate': 4.265048996733551e-07, 'completion_length': 223.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.6678571701049805, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6500000953674316, 'reward_std': 0.103740643709898, 'kl': 1.3828125, 'epoch': 0.57} 57%|█████▋ | 2458/4286 [15:18:21<10:44:57, 21.17s/it] 57%|█████▋ | 2459/4286 [15:18:42<10:39:09, 20.99s/it] {'loss': 0.0246, 'grad_norm': 1.3718880862529326, 'learning_rate': 4.262715818945403e-07, 'completion_length': 234.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.4851190894842148, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4672620296478271, 'reward_std': 0.16740687936544418, 'kl': 0.61767578125, 'epoch': 0.57} 57%|█████▋ | 2459/4286 [15:18:42<10:39:09, 20.99s/it] 57%|█████▋ | 2460/4286 [15:19:01<10:25:18, 20.55s/it] {'loss': 0.0291, 'grad_norm': 1.915581622677503, 'learning_rate': 4.260382641157256e-07, 'completion_length': 189.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.6130953133106232, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.0, 'kl': 0.7275390625, 'epoch': 0.57} 57%|█████▋ | 2460/4286 [15:19:01<10:25:18, 20.55s/it] 57%|█████▋ | 2461/4286 [15:19:22<10:28:27, 20.66s/it] {'loss': 0.0069, 'grad_norm': 1.136780754425079, 'learning_rate': 4.258049463369108e-07, 'completion_length': 214.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.830357164144516, 'rewards/format_reward': 1.0, 'reward': 1.8303572535514832, 'reward_std': 0.038476794958114624, 'kl': 
0.17333984375, 'epoch': 0.57} 57%|█████▋ | 2461/4286 [15:19:22<10:28:27, 20.66s/it] 57%|█████▋ | 2462/4286 [15:19:42<10:24:29, 20.54s/it] {'loss': 0.0365, 'grad_norm': 76.11344423660304, 'learning_rate': 4.255716285580961e-07, 'completion_length': 175.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.8318452835083008, 'rewards/format_reward': 1.0, 'reward': 1.8318453431129456, 'reward_std': 0.07900893688201904, 'kl': 0.9111328125, 'epoch': 0.57} 57%|█████▋ | 2462/4286 [15:19:42<10:24:29, 20.54s/it] 57%|█████▋ | 2463/4286 [15:20:01<10:10:59, 20.11s/it] {'loss': 0.0139, 'grad_norm': 0.928339506463736, 'learning_rate': 4.2533831077928136e-07, 'completion_length': 194.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.5357144474983215, 'reward_std': 0.03141617402434349, 'kl': 0.34619140625, 'epoch': 0.57} 57%|█████▋ | 2463/4286 [15:20:01<10:10:59, 20.11s/it] 57%|█████▋ | 2464/4286 [15:20:22<10:11:02, 20.12s/it] {'loss': 0.0062, 'grad_norm': 0.41448691232956, 'learning_rate': 4.251049930004666e-07, 'completion_length': 228.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.7008929252624512, 'rewards/format_reward': 1.0, 'reward': 1.700892984867096, 'reward_std': 0.0295482249930501, 'kl': 0.15380859375, 'epoch': 0.57} 57%|█████▋ | 2464/4286 [15:20:22<10:11:02, 20.12s/it] 58%|█████▊ | 2465/4286 [15:20:42<10:13:58, 20.23s/it] {'loss': 0.0065, 'grad_norm': 0.6736338774249172, 'learning_rate': 4.2487167522165185e-07, 'completion_length': 204.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.032524414360523224, 'kl': 0.16357421875, 'epoch': 0.58} 58%|█████▊ | 2465/4286 [15:20:42<10:13:58, 20.23s/it] 58%|█████▊ | 2466/4286 [15:21:03<10:23:07, 20.54s/it] {'loss': 0.0407, 'grad_norm': 4.875695196400431, 'learning_rate': 4.2463835744283713e-07, 'completion_length': 213.2857208251953, 
'rewards/only_full_func_accuracy_reward': 0.6919642984867096, 'rewards/format_reward': 1.0, 'reward': 1.6919643878936768, 'reward_std': 0.07599682360887527, 'kl': 1.013671875, 'epoch': 0.58} 58%|█████▊ | 2466/4286 [15:21:03<10:23:07, 20.54s/it] 58%|█████▊ | 2467/4286 [15:21:24<10:21:38, 20.50s/it] {'loss': 0.0217, 'grad_norm': 0.8525496946520468, 'learning_rate': 4.2440503966402235e-07, 'completion_length': 229.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.7083335518836975, 'reward_std': 0.08333333767950535, 'kl': 0.54296875, 'epoch': 0.58} 58%|█████▊ | 2467/4286 [15:21:24<10:21:38, 20.50s/it] 58%|█████▊ | 2468/4286 [15:21:45<10:29:18, 20.77s/it] {'loss': 0.0096, 'grad_norm': 5.4447174128914915, 'learning_rate': 4.241717218852076e-07, 'completion_length': 245.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.06572292745113373, 'kl': 0.240234375, 'epoch': 0.58} 58%|█████▊ | 2468/4286 [15:21:45<10:29:18, 20.77s/it] 58%|█████▊ | 2469/4286 [15:22:06<10:28:32, 20.76s/it] {'loss': 0.0066, 'grad_norm': 0.42728880081019593, 'learning_rate': 4.2393840410639285e-07, 'completion_length': 211.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6547619104385376, 'rewards/format_reward': 1.0, 'reward': 1.654762089252472, 'reward_std': 0.023809525184333324, 'kl': 0.16357421875, 'epoch': 0.58} 58%|█████▊ | 2469/4286 [15:22:06<10:28:32, 20.76s/it] 58%|█████▊ | 2470/4286 [15:22:24<10:09:00, 20.12s/it] {'loss': 0.0095, 'grad_norm': 1.6265418134085856, 'learning_rate': 4.237050863275781e-07, 'completion_length': 169.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.7455357611179352, 'rewards/format_reward': 1.0, 'reward': 1.7455358505249023, 'reward_std': 0.037095542065799236, 'kl': 0.23779296875, 'epoch': 0.58} 58%|█████▊ | 2470/4286 [15:22:24<10:09:00, 20.12s/it] 58%|█████▊ | 2471/4286 
[15:22:50<10:53:55, 21.62s/it] {'loss': 0.0314, 'grad_norm': 1.6371202581961768, 'learning_rate': 4.234717685487634e-07, 'completion_length': 251.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5104167014360428, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4925596117973328, 'reward_std': 0.13138023391366005, 'kl': 0.787109375, 'epoch': 0.58} 58%|█████▊ | 2471/4286 [15:22:50<10:53:55, 21.62s/it] 58%|█████▊ | 2472/4286 [15:23:08<10:26:51, 20.73s/it] {'loss': 0.0071, 'grad_norm': 9.0830625959605, 'learning_rate': 4.232384507699486e-07, 'completion_length': 182.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.04191340319812298, 'kl': 0.177734375, 'epoch': 0.58} 58%|█████▊ | 2472/4286 [15:23:08<10:26:51, 20.73s/it] 58%|█████▊ | 2473/4286 [15:23:29<10:25:28, 20.70s/it] {'loss': 0.0228, 'grad_norm': 0.8541295841194725, 'learning_rate': 4.230051329911339e-07, 'completion_length': 237.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6026786416769028, 'rewards/format_reward': 1.0, 'reward': 1.6026787161827087, 'reward_std': 0.06250000465661287, 'kl': 0.56982421875, 'epoch': 0.58} 58%|█████▊ | 2473/4286 [15:23:29<10:25:28, 20.70s/it] 58%|█████▊ | 2474/4286 [15:23:49<10:20:15, 20.54s/it] {'loss': 0.0256, 'grad_norm': 1.8341333673089826, 'learning_rate': 4.227718152123191e-07, 'completion_length': 199.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.7125000357627869, 'rewards/format_reward': 1.0, 'reward': 1.7125001549720764, 'reward_std': 0.0660715401172638, 'kl': 0.6396484375, 'epoch': 0.58} 58%|█████▊ | 2474/4286 [15:23:49<10:20:15, 20.54s/it] 58%|█████▊ | 2475/4286 [15:24:10<10:24:18, 20.68s/it] {'loss': 0.0438, 'grad_norm': 1.738410388363816, 'learning_rate': 4.225384974335044e-07, 'completion_length': 235.08930206298828, 'rewards/only_full_func_accuracy_reward': 0.6086310148239136, 'rewards/format_reward': 1.0, 'reward': 
1.6086310744285583, 'reward_std': 0.07230475265532732, 'kl': 1.0986328125, 'epoch': 0.58} 58%|█████▊ | 2475/4286 [15:24:10<10:24:18, 20.68s/it] 58%|█████▊ | 2476/4286 [15:24:31<10:27:32, 20.80s/it] {'loss': 0.057, 'grad_norm': 19.501079317279988, 'learning_rate': 4.2230517965468967e-07, 'completion_length': 188.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5476190745830536, 'rewards/format_reward': 1.0, 'reward': 1.5476192235946655, 'reward_std': 0.1534367799758911, 'kl': 1.421875, 'epoch': 0.58} 58%|█████▊ | 2476/4286 [15:24:31<10:27:32, 20.80s/it] 58%|█████▊ | 2477/4286 [15:24:54<10:49:32, 21.54s/it] {'loss': 0.0447, 'grad_norm': 1.5721431383471864, 'learning_rate': 4.220718618758749e-07, 'completion_length': 216.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.10901585593819618, 'kl': 1.11328125, 'epoch': 0.58} 58%|█████▊ | 2477/4286 [15:24:54<10:49:32, 21.54s/it] 58%|█████▊ | 2478/4286 [15:25:17<10:59:43, 21.89s/it] {'loss': 0.0094, 'grad_norm': 2.0817785883822615, 'learning_rate': 4.2183854409706017e-07, 'completion_length': 247.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6101191639900208, 'reward_std': 0.07419108599424362, 'kl': 0.234375, 'epoch': 0.58} 58%|█████▊ | 2478/4286 [15:25:17<10:59:43, 21.89s/it] 58%|█████▊ | 2479/4286 [15:25:37<10:42:56, 21.35s/it] {'loss': 0.0347, 'grad_norm': 12.498291609280315, 'learning_rate': 4.216052263182454e-07, 'completion_length': 177.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547619700431824, 'reward_std': 0.1140139028429985, 'kl': 0.8701171875, 'epoch': 0.58} 58%|█████▊ | 2479/4286 [15:25:37<10:42:56, 21.35s/it] 58%|█████▊ | 2480/4286 [15:26:00<10:55:13, 21.77s/it] {'loss': 0.0097, 'grad_norm': 1.1545924185065908, 
'learning_rate': 4.2137190853943066e-07, 'completion_length': 200.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.541666716337204, 'rewards/format_reward': 1.0, 'reward': 1.5416668057441711, 'reward_std': 0.025651201605796814, 'kl': 0.2431640625, 'epoch': 0.58} 58%|█████▊ | 2480/4286 [15:26:00<10:55:13, 21.77s/it][2025-03-02 20:33:38,243] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 58%|█████▊ | 2481/4286 [15:26:22<11:00:32, 21.96s/it] {'loss': 0.1194, 'grad_norm': 6.413285828016248, 'learning_rate': 4.2113859076061594e-07, 'completion_length': 226.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5868056118488312, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5510914325714111, 'reward_std': 0.24800530076026917, 'kl': 2.9765625, 'epoch': 0.58} 58%|█████▊ | 2481/4286 [15:26:22<11:00:32, 21.96s/it] 58%|█████▊ | 2482/4286 [15:26:43<10:50:41, 21.64s/it] {'loss': 0.0103, 'grad_norm': 2.4535893921000116, 'learning_rate': 4.2090527298180116e-07, 'completion_length': 209.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068454027175903, 'reward_std': 0.07121489569544792, 'kl': 0.25732421875, 'epoch': 0.58} 58%|█████▊ | 2482/4286 [15:26:43<10:50:41, 21.64s/it] 58%|█████▊ | 2483/4286 [15:27:04<10:42:01, 21.37s/it] {'loss': 0.0768, 'grad_norm': 2.660042275165035, 'learning_rate': 4.2067195520298644e-07, 'completion_length': 179.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.68601194024086, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6502977013587952, 
'reward_std': 0.2127419300377369, 'kl': 1.91796875, 'epoch': 0.58} 58%|█████▊ | 2483/4286 [15:27:04<10:42:01, 21.37s/it] 58%|█████▊ | 2484/4286 [15:27:26<10:50:51, 21.67s/it] {'loss': 0.0242, 'grad_norm': 3.930823575484371, 'learning_rate': 4.2043863742417166e-07, 'completion_length': 181.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.607142984867096, 'reward_std': 0.10927552916109562, 'kl': 0.6044921875, 'epoch': 0.58} 58%|█████▊ | 2484/4286 [15:27:26<10:50:51, 21.67s/it] 58%|█████▊ | 2485/4286 [15:27:50<11:06:47, 22.21s/it] {'loss': 0.0874, 'grad_norm': 17.323034914236768, 'learning_rate': 4.2020531964535693e-07, 'completion_length': 211.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4285715222358704, 'reward_std': 0.1495252028107643, 'kl': 2.1875, 'epoch': 0.58} 58%|█████▊ | 2485/4286 [15:27:50<11:06:47, 22.21s/it] 58%|█████▊ | 2486/4286 [15:28:11<10:55:15, 21.84s/it] {'loss': 0.0996, 'grad_norm': 3.7305407247600852, 'learning_rate': 4.199720018665422e-07, 'completion_length': 222.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6041667461395264, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5684524774551392, 'reward_std': 0.19296880438923836, 'kl': 2.48828125, 'epoch': 0.58} 58%|█████▊ | 2486/4286 [15:28:11<10:55:15, 21.84s/it] 58%|█████▊ | 2487/4286 [15:28:31<10:37:11, 21.25s/it] {'loss': 0.0701, 'grad_norm': 2.0135058186372823, 'learning_rate': 4.1973868408772743e-07, 'completion_length': 202.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.7901785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7723214626312256, 'reward_std': 0.12722014635801315, 'kl': 1.75, 'epoch': 0.58} 58%|█████▊ | 2487/4286 [15:28:31<10:37:11, 21.25s/it] 58%|█████▊ | 2488/4286 [15:28:53<10:43:43, 21.48s/it] {'loss': 0.0407, 'grad_norm': 2.367836440421604, 
'learning_rate': 4.195053663089127e-07, 'completion_length': 193.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.4747024327516556, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4568453431129456, 'reward_std': 0.0931827537715435, 'kl': 1.0146484375, 'epoch': 0.58} 58%|█████▊ | 2488/4286 [15:28:53<10:43:43, 21.48s/it] 58%|█████▊ | 2489/4286 [15:29:13<10:36:59, 21.27s/it] {'loss': 0.0069, 'grad_norm': 2.1798892258035885, 'learning_rate': 4.19272048530098e-07, 'completion_length': 207.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.7038690745830536, 'rewards/format_reward': 1.0, 'reward': 1.7038692235946655, 'reward_std': 0.08805716224014759, 'kl': 0.17138671875, 'epoch': 0.58} 58%|█████▊ | 2489/4286 [15:29:14<10:36:59, 21.27s/it] 58%|█████▊ | 2490/4286 [15:29:34<10:31:00, 21.08s/it] {'loss': 0.0251, 'grad_norm': 2.975591986798252, 'learning_rate': 4.190387307512832e-07, 'completion_length': 212.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.5943452715873718, 'rewards/format_reward': 1.0, 'reward': 1.5943453311920166, 'reward_std': 0.046518079936504364, 'kl': 0.6259765625, 'epoch': 0.58} 58%|█████▊ | 2490/4286 [15:29:34<10:31:00, 21.08s/it] 58%|█████▊ | 2491/4286 [15:29:56<10:39:44, 21.38s/it] {'loss': 0.0277, 'grad_norm': 1.3180400107620562, 'learning_rate': 4.188054129724685e-07, 'completion_length': 237.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755954027175903, 'reward_std': 0.10282689332962036, 'kl': 0.689453125, 'epoch': 0.58} 58%|█████▊ | 2491/4286 [15:29:56<10:39:44, 21.38s/it] 58%|█████▊ | 2492/4286 [15:30:16<10:21:49, 20.80s/it] {'loss': 0.0096, 'grad_norm': 11.557134660270439, 'learning_rate': 4.185720951936537e-07, 'completion_length': 177.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.047619045712053776, 'kl': 0.240234375, 
'epoch': 0.58} 58%|█████▊ | 2492/4286 [15:30:16<10:21:49, 20.80s/it] 58%|█████▊ | 2493/4286 [15:30:37<10:25:37, 20.94s/it] {'loss': 0.0122, 'grad_norm': 6.744279043928518, 'learning_rate': 4.18338777414839e-07, 'completion_length': 225.58930206298828, 'rewards/only_full_func_accuracy_reward': 0.688244104385376, 'rewards/format_reward': 1.0, 'reward': 1.688244104385376, 'reward_std': 0.09331014752388, 'kl': 0.3056640625, 'epoch': 0.58} 58%|█████▊ | 2493/4286 [15:30:37<10:25:37, 20.94s/it] 58%|█████▊ | 2494/4286 [15:30:58<10:24:45, 20.92s/it] {'loss': 0.0506, 'grad_norm': 2.0276408514371584, 'learning_rate': 4.1810545963602425e-07, 'completion_length': 195.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.5870536267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5513393878936768, 'reward_std': 0.14524677768349648, 'kl': 1.263671875, 'epoch': 0.58} 58%|█████▊ | 2494/4286 [15:30:58<10:24:45, 20.92s/it] 58%|█████▊ | 2495/4286 [15:31:21<10:43:18, 21.55s/it] {'loss': 0.0246, 'grad_norm': 5.289723977982446, 'learning_rate': 4.1787214185720947e-07, 'completion_length': 214.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.583333432674408, 'reward_std': 0.11585774272680283, 'kl': 0.61669921875, 'epoch': 0.58} 58%|█████▊ | 2495/4286 [15:31:21<10:43:18, 21.55s/it] 58%|█████▊ | 2496/4286 [15:31:41<10:27:43, 21.04s/it] {'loss': 0.0065, 'grad_norm': 0.5809734585874857, 'learning_rate': 4.1763882407839475e-07, 'completion_length': 195.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306548953056335, 'reward_std': 0.008928571827709675, 'kl': 0.16162109375, 'epoch': 0.58} 58%|█████▊ | 2496/4286 [15:31:41<10:27:43, 21.04s/it] 58%|█████▊ | 2497/4286 [15:32:01<10:21:58, 20.86s/it] {'loss': 0.0619, 'grad_norm': 1.6586549274388533, 'learning_rate': 4.1740550629957997e-07, 'completion_length': 
193.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.6886904835700989, 'rewards/format_reward': 1.0, 'reward': 1.6886905431747437, 'reward_std': 0.11402441933751106, 'kl': 1.55078125, 'epoch': 0.58} 58%|█████▊ | 2497/4286 [15:32:01<10:21:58, 20.86s/it] 58%|█████▊ | 2498/4286 [15:32:21<10:13:44, 20.60s/it] {'loss': 0.0243, 'grad_norm': 1.0422929119976712, 'learning_rate': 4.1717218852076524e-07, 'completion_length': 200.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.4806548207998276, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.46279776096344, 'reward_std': 0.067431116476655, 'kl': 0.60693359375, 'epoch': 0.58} 58%|█████▊ | 2498/4286 [15:32:21<10:13:44, 20.60s/it] 58%|█████▊ | 2499/4286 [15:32:41<10:08:21, 20.43s/it] {'loss': 0.046, 'grad_norm': 3.9632387923002073, 'learning_rate': 4.169388707419505e-07, 'completion_length': 194.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.12021519616246223, 'kl': 1.1484375, 'epoch': 0.58} 58%|█████▊ | 2499/4286 [15:32:41<10:08:21, 20.43s/it] 58%|█████▊ | 2500/4286 [15:33:00<9:52:35, 19.91s/it] {'loss': 0.0066, 'grad_norm': 0.7576020570851187, 'learning_rate': 4.1670555296313574e-07, 'completion_length': 173.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.032524414360523224, 'kl': 0.1650390625, 'epoch': 0.58} 58%|█████▊ | 2500/4286 [15:33:00<9:52:35, 19.91s/it] 58%|█████▊ | 2501/4286 [15:36:47<40:40:52, 82.05s/it] {'loss': 0.026, 'grad_norm': 1.5351559826786463, 'learning_rate': 4.16472235184321e-07, 'completion_length': 211.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.5550595819950104, 'rewards/format_reward': 1.0, 'reward': 1.5550596117973328, 'reward_std': 0.09516850672662258, 'kl': 0.64990234375, 'epoch': 0.58} 58%|█████▊ | 2501/4286 [15:36:47<40:40:52, 82.05s/it] 
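The per-step records in this log are plain Python dict literals embedded in the tqdm output, so they can be recovered from a captured log without special tooling. A minimal sketch, assuming the console output was saved to a text file; the regex and helper below are hypothetical, not part of this run:

```python
# Hypothetical helper: extract the {'loss': ..., 'epoch': ...} metric records
# printed in this log from captured console text. ast.literal_eval parses each
# matched dict literal safely (no evaluation of arbitrary code).
import ast
import re

METRICS_RE = re.compile(r"\{'loss':.*?'epoch': [0-9.]+\}")

def parse_metrics(text: str) -> list[dict]:
    """Return every per-step metrics dict found in the raw log text."""
    return [ast.literal_eval(m) for m in METRICS_RE.findall(text)]

# Example on a record shaped like the ones in this log:
sample = ("58%| | 2490/4286 [15:29:34<10:31:00, 21.08s/it] "
          "{'loss': 0.0251, 'reward': 1.5943453311920166, "
          "'kl': 0.6259765625, 'epoch': 0.58}")
records = parse_metrics(sample)
print(records[0]["reward"])  # -> 1.5943453311920166
```

Plotting `reward`, `kl`, or `loss` over the extracted records gives the training curves that the progress bar above only shows implicitly.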
58%|█████▊ | 2502/4286 [15:37:10<31:54:14, 64.38s/it] {'loss': 0.0206, 'grad_norm': 3.430473160703613, 'learning_rate': 4.1623891740550624e-07, 'completion_length': 220.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5814076364040375, 'rewards/format_reward': 1.0, 'reward': 1.5814076662063599, 'reward_std': 0.09158298373222351, 'kl': 0.5166015625, 'epoch': 0.58} 58%|█████▊ | 2502/4286 [15:37:10<31:54:14, 64.38s/it] 58%|█████▊ | 2503/4286 [15:37:30<25:15:48, 51.01s/it] {'loss': 0.0194, 'grad_norm': 3.1160351696323665, 'learning_rate': 4.160055996266915e-07, 'completion_length': 193.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6383929252624512, 'rewards/format_reward': 1.0, 'reward': 1.638392984867096, 'reward_std': 0.047405367717146873, 'kl': 0.48583984375, 'epoch': 0.58} 58%|█████▊ | 2503/4286 [15:37:30<25:15:48, 51.01s/it] 58%|█████▊ | 2504/4286 [15:37:53<21:05:42, 42.62s/it] {'loss': 0.0465, 'grad_norm': 5.869992957090431, 'learning_rate': 4.157722818478768e-07, 'completion_length': 206.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6443453133106232, 'rewards/format_reward': 1.0, 'reward': 1.6443453431129456, 'reward_std': 0.10261321440339088, 'kl': 1.162109375, 'epoch': 0.58} 58%|█████▊ | 2504/4286 [15:37:53<21:05:42, 42.62s/it] 58%|█████▊ | 2505/4286 [15:38:14<17:50:09, 36.05s/it] {'loss': 0.0577, 'grad_norm': 3.3384989538559364, 'learning_rate': 4.15538964069062e-07, 'completion_length': 228.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.565178632736206, 'rewards/format_reward': 1.0, 'reward': 1.5651786923408508, 'reward_std': 0.09797141700983047, 'kl': 1.44140625, 'epoch': 0.58} 58%|█████▊ | 2505/4286 [15:38:14<17:50:09, 36.05s/it] 58%|█████▊ | 2506/4286 [15:38:36<15:47:22, 31.93s/it] {'loss': 0.0176, 'grad_norm': 2.758890355704561, 'learning_rate': 4.153056462902473e-07, 'completion_length': 214.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6147186756134033, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.5968615412712097, 'reward_std': 0.1127096563577652, 'kl': 0.439453125, 'epoch': 0.58} 58%|█████▊ | 2506/4286 [15:38:36<15:47:22, 31.93s/it] 58%|█████▊ | 2507/4286 [15:38:57<14:11:07, 28.71s/it] {'loss': 0.0413, 'grad_norm': 1.3597473681341417, 'learning_rate': 4.150723285114325e-07, 'completion_length': 202.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7144210338592529, 'rewards/format_reward': 1.0, 'reward': 1.7144210934638977, 'reward_std': 0.05330086871981621, 'kl': 1.033203125, 'epoch': 0.58} 58%|█████▊ | 2507/4286 [15:38:57<14:11:07, 28.71s/it] 59%|█████▊ | 2508/4286 [15:39:18<13:01:57, 26.39s/it] {'loss': 0.0105, 'grad_norm': 3.9790897962394114, 'learning_rate': 4.148390107326178e-07, 'completion_length': 220.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6130952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.09548483416438103, 'kl': 0.2626953125, 'epoch': 0.59} 59%|█████▊ | 2508/4286 [15:39:18<13:01:57, 26.39s/it] 59%|█████▊ | 2509/4286 [15:39:38<12:05:34, 24.50s/it] {'loss': 0.0348, 'grad_norm': 5.988100634039363, 'learning_rate': 4.1460569295380306e-07, 'completion_length': 198.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6250000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.607142984867096, 'reward_std': 0.09757720772176981, 'kl': 0.869140625, 'epoch': 0.59} 59%|█████▊ | 2509/4286 [15:39:38<12:05:34, 24.50s/it] 59%|█████▊ | 2510/4286 [15:39:59<11:29:46, 23.30s/it] {'loss': 0.0525, 'grad_norm': 2.970631200091158, 'learning_rate': 4.143723751749883e-07, 'completion_length': 209.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5619473159313202, 'rewards/format_reward': 1.0, 'reward': 1.5619473457336426, 'reward_std': 0.11269823834300041, 'kl': 1.310546875, 'epoch': 0.59} 59%|█████▊ | 2510/4286 [15:39:59<11:29:46, 23.30s/it][2025-03-02 20:47:37,986] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since 
last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 59%|█████▊ | 2511/4286 [15:40:22<11:30:33, 23.34s/it] {'loss': 0.1108, 'grad_norm': 11.095248843435677, 'learning_rate': 4.1413905739617356e-07, 'completion_length': 233.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5072916895151138, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.471577525138855, 'reward_std': 0.18642686679959297, 'kl': 2.765625, 'epoch': 0.59} 59%|█████▊ | 2511/4286 [15:40:22<11:30:33, 23.34s/it] 59%|█████▊ | 2512/4286 [15:40:43<11:05:30, 22.51s/it] {'loss': 0.1072, 'grad_norm': 3.3494875028627913, 'learning_rate': 4.1390573961735883e-07, 'completion_length': 212.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7879961133003235, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7701390385627747, 'reward_std': 0.13829366117715836, 'kl': 2.6796875, 'epoch': 0.59} 59%|█████▊ | 2512/4286 [15:40:43<11:05:30, 22.51s/it] 59%|█████▊ | 2513/4286 [15:41:01<10:31:40, 21.38s/it] {'loss': 0.032, 'grad_norm': 6.7729577678847, 'learning_rate': 4.1367242183854405e-07, 'completion_length': 167.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.7666667103767395, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.748809576034546, 'reward_std': 0.14077918231487274, 'kl': 0.80029296875, 'epoch': 0.59} 59%|█████▊ | 2513/4286 [15:41:01<10:31:40, 21.38s/it] 59%|█████▊ | 2514/4286 [15:41:24<10:41:20, 21.72s/it] {'loss': 0.1072, 'grad_norm': 7.965837119890324, 'learning_rate': 4.1343910405972933e-07, 'completion_length': 189.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5342262387275696, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.5163692235946655, 'reward_std': 0.21316614001989365, 'kl': 2.671875, 'epoch': 0.59} 59%|█████▊ | 2514/4286 [15:41:24<10:41:20, 21.72s/it] 59%|█████▊ | 2515/4286 [15:41:45<10:34:28, 21.50s/it] {'loss': 0.0408, 'grad_norm': 8.794556895179417, 'learning_rate': 4.1320578628091455e-07, 'completion_length': 165.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.06769214756786823, 'kl': 1.017578125, 'epoch': 0.59} 59%|█████▊ | 2515/4286 [15:41:45<10:34:28, 21.50s/it] 59%|█████▊ | 2516/4286 [15:42:05<10:25:49, 21.21s/it] {'loss': 0.0928, 'grad_norm': 7.691640521468476, 'learning_rate': 4.1297246850209983e-07, 'completion_length': 195.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6299851536750793, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5942708849906921, 'reward_std': 0.22253387421369553, 'kl': 2.32421875, 'epoch': 0.59} 59%|█████▊ | 2516/4286 [15:42:05<10:25:49, 21.21s/it] 59%|█████▊ | 2517/4286 [15:42:25<10:12:04, 20.76s/it] {'loss': 0.066, 'grad_norm': 3.5562648468513287, 'learning_rate': 4.127391507232851e-07, 'completion_length': 190.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.614583432674408, 'reward_std': 0.15802200138568878, 'kl': 1.650390625, 'epoch': 0.59} 59%|█████▊ | 2517/4286 [15:42:25<10:12:04, 20.76s/it] 59%|█████▊ | 2518/4286 [15:42:46<10:08:13, 20.64s/it] {'loss': 0.0803, 'grad_norm': 4.391643223178182, 'learning_rate': 4.125058329444703e-07, 'completion_length': 188.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6005952954292297, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5648810267448425, 'reward_std': 0.17433896660804749, 'kl': 2.00390625, 'epoch': 0.59} 59%|█████▊ | 2518/4286 [15:42:46<10:08:13, 20.64s/it] 59%|█████▉ | 2519/4286 [15:43:08<10:23:59, 21.19s/it] {'loss': 0.1421, 'grad_norm': 
13.270617622657841, 'learning_rate': 4.122725151656556e-07, 'completion_length': 206.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6505952775478363, 'rewards/format_reward': 0.910714328289032, 'reward': 1.561309576034546, 'reward_std': 0.1863527074456215, 'kl': 3.546875, 'epoch': 0.59} 59%|█████▉ | 2519/4286 [15:43:08<10:23:59, 21.19s/it] 59%|█████▉ | 2520/4286 [15:43:30<10:30:14, 21.41s/it] {'loss': 0.0623, 'grad_norm': 8.883286802057464, 'learning_rate': 4.120391973868408e-07, 'completion_length': 210.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.4958333820104599, 'rewards/format_reward': 1.0, 'reward': 1.4958334565162659, 'reward_std': 0.0318370945751667, 'kl': 1.5546875, 'epoch': 0.59} 59%|█████▉ | 2520/4286 [15:43:30<10:30:14, 21.41s/it] 59%|█████▉ | 2521/4286 [15:43:50<10:17:02, 20.98s/it] {'loss': 0.0309, 'grad_norm': 1.015323773158189, 'learning_rate': 4.118058796080261e-07, 'completion_length': 218.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.09150617569684982, 'kl': 0.7724609375, 'epoch': 0.59} 59%|█████▉ | 2521/4286 [15:43:50<10:17:02, 20.98s/it] 59%|█████▉ | 2522/4286 [15:44:11<10:15:03, 20.92s/it] {'loss': 0.0322, 'grad_norm': 15.586562763991425, 'learning_rate': 4.1157256182921137e-07, 'completion_length': 193.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6488095223903656, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.11382801085710526, 'kl': 0.8056640625, 'epoch': 0.59} 59%|█████▉ | 2522/4286 [15:44:11<10:15:03, 20.92s/it] 59%|█████▉ | 2523/4286 [15:44:32<10:21:08, 21.14s/it] {'loss': 0.0417, 'grad_norm': 1.6115720039676042, 'learning_rate': 4.113392440503966e-07, 'completion_length': 230.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5401785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5223214626312256, 'reward_std': 
0.10852411109954119, 'kl': 1.0419921875, 'epoch': 0.59} 59%|█████▉ | 2523/4286 [15:44:32<10:21:08, 21.14s/it] 59%|█████▉ | 2524/4286 [15:44:52<10:09:36, 20.76s/it] {'loss': 0.0273, 'grad_norm': 5.929133603846767, 'learning_rate': 4.1110592627158187e-07, 'completion_length': 207.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.5922619998455048, 'rewards/format_reward': 1.0, 'reward': 1.5922620296478271, 'reward_std': 0.08386899344623089, 'kl': 0.685546875, 'epoch': 0.59} 59%|█████▉ | 2524/4286 [15:44:52<10:09:36, 20.76s/it] 59%|█████▉ | 2525/4286 [15:45:13<10:12:06, 20.86s/it] {'loss': 0.0339, 'grad_norm': 1.9853076240517509, 'learning_rate': 4.108726084927671e-07, 'completion_length': 198.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6607142686843872, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.0476190485060215, 'kl': 0.84912109375, 'epoch': 0.59} 59%|█████▉ | 2525/4286 [15:45:13<10:12:06, 20.86s/it] 59%|█████▉ | 2526/4286 [15:45:33<10:04:41, 20.61s/it] {'loss': 0.0459, 'grad_norm': 1.4516733453930253, 'learning_rate': 4.1063929071395237e-07, 'completion_length': 190.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.8011905252933502, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7833334803581238, 'reward_std': 0.07941640727221966, 'kl': 1.1455078125, 'epoch': 0.59} 59%|█████▉ | 2526/4286 [15:45:33<10:04:41, 20.61s/it] 59%|█████▉ | 2527/4286 [15:45:57<10:30:34, 21.51s/it] {'loss': 0.0442, 'grad_norm': 3.006257842877827, 'learning_rate': 4.1040597293513764e-07, 'completion_length': 230.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5312500298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4955357909202576, 'reward_std': 0.15413516759872437, 'kl': 1.1025390625, 'epoch': 0.59} 59%|█████▉ | 2527/4286 [15:45:57<10:30:34, 21.51s/it] 59%|█████▉ | 2528/4286 [15:46:18<10:23:37, 21.28s/it] {'loss': 0.0099, 'grad_norm': 3.6102317020906964, 'learning_rate': 
4.1017265515632286e-07, 'completion_length': 207.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6175595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6175596117973328, 'reward_std': 0.06090506911277771, 'kl': 0.248046875, 'epoch': 0.59} 59%|█████▉ | 2528/4286 [15:46:18<10:23:37, 21.28s/it] 59%|█████▉ | 2529/4286 [15:46:38<10:16:10, 21.04s/it] {'loss': 0.0279, 'grad_norm': 6.324369441339063, 'learning_rate': 4.0993933737750814e-07, 'completion_length': 210.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5238095819950104, 'rewards/format_reward': 1.0, 'reward': 1.5238096714019775, 'reward_std': 0.17050977796316147, 'kl': 0.6962890625, 'epoch': 0.59} 59%|█████▉ | 2529/4286 [15:46:38<10:16:10, 21.04s/it] 59%|█████▉ | 2530/4286 [15:47:00<10:22:52, 21.28s/it] {'loss': 0.0497, 'grad_norm': 5.892453604535284, 'learning_rate': 4.0970601959869336e-07, 'completion_length': 215.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.49702388048171997, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4791668057441711, 'reward_std': 0.11963991075754166, 'kl': 1.2421875, 'epoch': 0.59} 59%|█████▉ | 2530/4286 [15:47:00<10:22:52, 21.28s/it] 59%|█████▉ | 2531/4286 [15:47:20<10:15:07, 21.03s/it] {'loss': 0.0457, 'grad_norm': 2.3867744234607438, 'learning_rate': 4.0947270181987864e-07, 'completion_length': 222.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5505952537059784, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5148810148239136, 'reward_std': 0.12331760674715042, 'kl': 1.140625, 'epoch': 0.59} 59%|█████▉ | 2531/4286 [15:47:20<10:15:07, 21.03s/it] 59%|█████▉ | 2532/4286 [15:47:40<10:05:15, 20.70s/it] {'loss': 0.0073, 'grad_norm': 6.2700650900620785, 'learning_rate': 4.092393840410639e-07, 'completion_length': 181.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.8065476417541504, 'rewards/format_reward': 1.0, 'reward': 1.8065477013587952, 'reward_std': 0.06345389038324356, 'kl': 0.18359375, 'epoch': 
0.59} 59%|█████▉ | 2532/4286 [15:47:40<10:05:15, 20.70s/it] 59%|█████▉ | 2533/4286 [15:48:00<9:57:17, 20.44s/it] {'loss': 0.0069, 'grad_norm': 1.129360765107899, 'learning_rate': 4.0900606626224913e-07, 'completion_length': 199.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6175595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6175596714019775, 'reward_std': 0.03985805157572031, 'kl': 0.171875, 'epoch': 0.59} 59%|█████▉ | 2533/4286 [15:48:00<9:57:17, 20.44s/it] 59%|█████▉ | 2534/4286 [15:48:21<9:59:28, 20.53s/it] {'loss': 0.0788, 'grad_norm': 4.201304123985718, 'learning_rate': 4.087727484834344e-07, 'completion_length': 222.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.59226194024086, 'rewards/format_reward': 1.0, 'reward': 1.5922619700431824, 'reward_std': 0.1907443404197693, 'kl': 1.96875, 'epoch': 0.59} 59%|█████▉ | 2534/4286 [15:48:21<9:59:28, 20.53s/it] 59%|█████▉ | 2535/4286 [15:48:43<10:15:02, 21.07s/it] {'loss': 0.027, 'grad_norm': 3.152552972932441, 'learning_rate': 4.085394307046197e-07, 'completion_length': 190.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098214626312256, 'reward_std': 0.06526251137256622, 'kl': 0.67529296875, 'epoch': 0.59} 59%|█████▉ | 2535/4286 [15:48:43<10:15:02, 21.07s/it] 59%|█████▉ | 2536/4286 [15:49:03<10:01:40, 20.63s/it] {'loss': 0.0799, 'grad_norm': 3.5971725268877908, 'learning_rate': 4.083061129258049e-07, 'completion_length': 184.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5252976715564728, 'rewards/format_reward': 1.0, 'reward': 1.5252977013587952, 'reward_std': 0.18186883255839348, 'kl': 2.001953125, 'epoch': 0.59} 59%|█████▉ | 2536/4286 [15:49:03<10:01:40, 20.63s/it] 59%|█████▉ | 2537/4286 [15:49:24<10:05:58, 20.79s/it] {'loss': 0.0167, 'grad_norm': 9.136020491819462, 'learning_rate': 4.080727951469902e-07, 'completion_length': 212.26786041259766, 'rewards/only_full_func_accuracy_reward': 
0.6546875238418579, 'rewards/format_reward': 1.0, 'reward': 1.6546875834465027, 'reward_std': 0.05223214812576771, 'kl': 0.41796875, 'epoch': 0.59} 59%|█████▉ | 2537/4286 [15:49:24<10:05:58, 20.79s/it] 59%|█████▉ | 2538/4286 [15:49:47<10:21:36, 21.34s/it] {'loss': 0.0515, 'grad_norm': 1.8661046574718791, 'learning_rate': 4.078394773681754e-07, 'completion_length': 200.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.59226194024086, 'rewards/format_reward': 1.0, 'reward': 1.5922619700431824, 'reward_std': 0.101190485060215, 'kl': 1.28515625, 'epoch': 0.59} 59%|█████▉ | 2538/4286 [15:49:47<10:21:36, 21.34s/it] 59%|█████▉ | 2539/4286 [15:50:07<10:14:38, 21.11s/it] {'loss': 0.0623, 'grad_norm': 1.3948316602279605, 'learning_rate': 4.076061595893607e-07, 'completion_length': 205.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7113095819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.693452537059784, 'reward_std': 0.08470145612955093, 'kl': 1.55859375, 'epoch': 0.59} 59%|█████▉ | 2539/4286 [15:50:07<10:14:38, 21.11s/it] 59%|█████▉ | 2540/4286 [15:50:28<10:12:17, 21.04s/it] {'loss': 0.0074, 'grad_norm': 0.18776200509940244, 'learning_rate': 4.0737284181054595e-07, 'completion_length': 194.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 1.0, 'reward': 1.7098215222358704, 'reward_std': 0.008928571827709675, 'kl': 0.18505859375, 'epoch': 0.59} 59%|█████▉ | 2540/4286 [15:50:28<10:12:17, 21.04s/it] 59%|█████▉ | 2541/4286 [15:50:49<10:12:37, 21.06s/it] {'loss': 0.0425, 'grad_norm': 6.013530524031277, 'learning_rate': 4.071395240317312e-07, 'completion_length': 230.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5744048058986664, 'rewards/format_reward': 1.0, 'reward': 1.5744048953056335, 'reward_std': 0.16903655603528023, 'kl': 1.0625, 'epoch': 0.59} 59%|█████▉ | 2541/4286 [15:50:49<10:12:37, 21.06s/it] 59%|█████▉ | 2542/4286 [15:51:09<10:04:02, 20.78s/it] {'loss': 0.0159, 
'grad_norm': 3.114552778462044, 'learning_rate': 4.0690620625291645e-07, 'completion_length': 188.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.6443453133106232, 'rewards/format_reward': 1.0, 'reward': 1.6443453431129456, 'reward_std': 0.041452983394265175, 'kl': 0.3974609375, 'epoch': 0.59} 59%|█████▉ | 2542/4286 [15:51:09<10:04:02, 20.78s/it] 59%|█████▉ | 2543/4286 [15:51:32<10:22:43, 21.44s/it] {'loss': 0.03, 'grad_norm': 12.330437090816694, 'learning_rate': 4.066728884741017e-07, 'completion_length': 210.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6354166865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6175596714019775, 'reward_std': 0.10257173702120781, 'kl': 0.7490234375, 'epoch': 0.59} 59%|█████▉ | 2543/4286 [15:51:32<10:22:43, 21.44s/it] 59%|█████▉ | 2544/4286 [15:51:52<10:03:34, 20.79s/it] {'loss': 0.0271, 'grad_norm': 4.014881330295182, 'learning_rate': 4.0643957069528695e-07, 'completion_length': 180.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.09296906739473343, 'kl': 0.67626953125, 'epoch': 0.59} 59%|█████▉ | 2544/4286 [15:51:52<10:03:34, 20.79s/it] 59%|█████▉ | 2545/4286 [15:52:13<10:08:39, 20.98s/it] {'loss': 0.0279, 'grad_norm': 2.328775484718378, 'learning_rate': 4.062062529164722e-07, 'completion_length': 224.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6568452715873718, 'rewards/format_reward': 1.0, 'reward': 1.6568452715873718, 'reward_std': 0.07057024165987968, 'kl': 0.69775390625, 'epoch': 0.59} 59%|█████▉ | 2545/4286 [15:52:13<10:08:39, 20.98s/it] 59%|█████▉ | 2546/4286 [15:52:34<10:06:33, 20.92s/it] {'loss': 0.0754, 'grad_norm': 21.290039011512732, 'learning_rate': 4.0597293513765745e-07, 'completion_length': 185.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5669643133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5491072535514832, 'reward_std': 
0.22622353583574295, 'kl': 1.88671875, 'epoch': 0.59} 59%|█████▉ | 2546/4286 [15:52:34<10:06:33, 20.92s/it] 59%|█████▉ | 2547/4286 [15:52:55<10:08:42, 21.00s/it] {'loss': 0.0621, 'grad_norm': 6.545041936637396, 'learning_rate': 4.057396173588427e-07, 'completion_length': 238.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5075757801532745, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.489718735218048, 'reward_std': 0.1625651977956295, 'kl': 1.55859375, 'epoch': 0.59} 59%|█████▉ | 2547/4286 [15:52:55<10:08:42, 21.00s/it] 59%|█████▉ | 2548/4286 [15:53:16<10:10:53, 21.09s/it] {'loss': 0.0511, 'grad_norm': 7.4654509778693185, 'learning_rate': 4.0550629958002794e-07, 'completion_length': 225.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.5625, 'rewards/format_reward': 1.0, 'reward': 1.5625001192092896, 'reward_std': 0.06983364932239056, 'kl': 1.27734375, 'epoch': 0.59} 59%|█████▉ | 2548/4286 [15:53:16<10:10:53, 21.09s/it] 59%|█████▉ | 2549/4286 [15:53:37<10:05:43, 20.92s/it] {'loss': 0.0448, 'grad_norm': 1.7358664145486058, 'learning_rate': 4.052729818012132e-07, 'completion_length': 217.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6190476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6011906266212463, 'reward_std': 0.11148861795663834, 'kl': 1.123046875, 'epoch': 0.59} 59%|█████▉ | 2549/4286 [15:53:37<10:05:43, 20.92s/it] 59%|█████▉ | 2550/4286 [15:53:57<9:55:56, 20.60s/it] {'loss': 0.0777, 'grad_norm': 7.212616465128832, 'learning_rate': 4.050396640223985e-07, 'completion_length': 193.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.635416716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6175596714019775, 'reward_std': 0.14950474351644516, 'kl': 1.94140625, 'epoch': 0.59} 59%|█████▉ | 2550/4286 [15:53:57<9:55:56, 20.60s/it] 60%|█████▉ | 2551/4286 [15:54:18<9:58:18, 20.69s/it] {'loss': 0.0669, 'grad_norm': 10.26962657539019, 'learning_rate': 4.048063462435837e-07, 
'completion_length': 230.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.5918368101119995, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5739797353744507, 'reward_std': 0.13934582844376564, 'kl': 1.671875, 'epoch': 0.6} 60%|█████▉ | 2551/4286 [15:54:18<9:58:18, 20.69s/it] 60%|█████▉ | 2552/4286 [15:54:43<10:37:48, 22.07s/it] {'loss': 0.1154, 'grad_norm': 4.864305910808971, 'learning_rate': 4.04573028464769e-07, 'completion_length': 228.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.6383928656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6026785969734192, 'reward_std': 0.24583741277456284, 'kl': 2.890625, 'epoch': 0.6} 60%|█████▉ | 2552/4286 [15:54:43<10:37:48, 22.07s/it][2025-03-02 21:02:19,872] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 60%|█████▉ | 2553/4286 [15:55:04<10:29:23, 21.79s/it] {'loss': 0.0609, 'grad_norm': 11.549307492858267, 'learning_rate': 4.043397106859542e-07, 'completion_length': 210.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6860119700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6681548953056335, 'reward_std': 0.19621488451957703, 'kl': 1.513671875, 'epoch': 0.6} 60%|█████▉ | 2553/4286 [15:55:04<10:29:23, 21.79s/it] 60%|█████▉ | 2554/4286 [15:55:25<10:19:48, 21.47s/it] {'loss': 0.096, 'grad_norm': 9.021046148664073, 'learning_rate': 4.041063929071395e-07, 'completion_length': 201.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5818452835083008, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5104168057441711, 'reward_std': 0.23745419830083847, 'kl': 2.3984375, 'epoch': 0.6} 60%|█████▉ | 2554/4286 [15:55:25<10:19:48, 21.47s/it] 60%|█████▉ | 2555/4286 [15:55:46<10:21:31, 21.54s/it] {'loss': 0.0675, 'grad_norm': 3.295932897372184, 'learning_rate': 4.0387307512832476e-07, 'completion_length': 217.94644927978516, 'rewards/only_full_func_accuracy_reward': 0.6196428835391998, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5660715699195862, 'reward_std': 0.22741695493459702, 'kl': 1.68359375, 'epoch': 0.6} 60%|█████▉ | 2555/4286 [15:55:46<10:21:31, 21.54s/it] 60%|█████▉ | 2556/4286 [15:56:09<10:25:55, 21.71s/it] {'loss': 0.0715, 'grad_norm': 9.608090383265742, 'learning_rate': 4.0363975734951e-07, 'completion_length': 243.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.6327381432056427, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5970239639282227, 'reward_std': 0.1760680228471756, 'kl': 1.77734375, 'epoch': 0.6} 60%|█████▉ | 2556/4286 [15:56:09<10:25:55, 21.71s/it] 60%|█████▉ | 2557/4286 [15:56:29<10:12:22, 
21.25s/it] {'loss': 0.0694, 'grad_norm': 5.3046599285395395, 'learning_rate': 4.0340643957069526e-07, 'completion_length': 176.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 1.0, 'reward': 1.641369104385376, 'reward_std': 0.048560330644249916, 'kl': 1.736328125, 'epoch': 0.6} 60%|█████▉ | 2557/4286 [15:56:29<10:12:22, 21.25s/it] 60%|█████▉ | 2558/4286 [15:56:49<10:05:45, 21.03s/it] {'loss': 0.029, 'grad_norm': 4.821545575761642, 'learning_rate': 4.0317312179188054e-07, 'completion_length': 200.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6755953431129456, 'reward_std': 0.12917682901024818, 'kl': 0.72119140625, 'epoch': 0.6} 60%|█████▉ | 2558/4286 [15:56:49<10:05:45, 21.03s/it] 60%|█████▉ | 2559/4286 [15:57:10<10:01:27, 20.90s/it] {'loss': 0.0286, 'grad_norm': 8.013850596451796, 'learning_rate': 4.0293980401306576e-07, 'completion_length': 181.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7452381551265717, 'rewards/format_reward': 1.0, 'reward': 1.745238184928894, 'reward_std': 0.06270713359117508, 'kl': 0.7177734375, 'epoch': 0.6} 60%|█████▉ | 2559/4286 [15:57:10<10:01:27, 20.90s/it] 60%|█████▉ | 2560/4286 [15:57:32<10:09:44, 21.20s/it] {'loss': 0.0142, 'grad_norm': 3.8035266077250296, 'learning_rate': 4.0270648623425103e-07, 'completion_length': 217.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.4895833879709244, 'rewards/format_reward': 1.0, 'reward': 1.489583432674408, 'reward_std': 0.03644564375281334, 'kl': 0.35595703125, 'epoch': 0.6} 60%|█████▉ | 2560/4286 [15:57:32<10:09:44, 21.20s/it][2025-03-02 21:05:09,420] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
60%|█████▉ | 2561/4286 [15:57:54<10:14:54, 21.39s/it] {'loss': 0.0283, 'grad_norm': 4.166917323047994, 'learning_rate': 4.0247316845543626e-07, 'completion_length': 201.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.6590136587619781, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.641156554222107, 'reward_std': 0.1313733197748661, 'kl': 0.70703125, 'epoch': 0.6}
60%|█████▉ | 2562/4286 [15:58:17<10:34:14, 22.07s/it] {'loss': 0.083, 'grad_norm': 6.259734666809434, 'learning_rate': 4.0223985067662153e-07, 'completion_length': 225.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.4613095670938492, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4255953431129456, 'reward_std': 0.14285714365541935, 'kl': 2.07421875, 'epoch': 0.6}
60%|█████▉ | 2563/4286 [15:58:39<10:34:37, 22.10s/it] {'loss': 0.0404, 'grad_norm': 3.654288136076247, 'learning_rate': 4.020065328978068e-07, 'completion_length': 208.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.07729321904480457, 'kl': 1.009765625, 'epoch': 0.6}
[2025-03-02 21:06:19,481] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
60%|█████▉ | 2564/4286 [15:59:04<10:52:35, 22.74s/it] {'loss': 0.033, 'grad_norm': 3.3264613521834416, 'learning_rate': 4.0177321511899203e-07, 'completion_length': 234.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7455357909202576, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6919644474983215, 'reward_std': 0.1875000074505806, 'kl': 0.8271484375, 'epoch': 0.6}
[2025-03-02 21:06:43,089] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
60%|█████▉ | 2565/4286 [15:59:27<10:59:41, 23.00s/it] {'loss': 0.0609, 'grad_norm': 10.375270062108326, 'learning_rate': 4.015398973401773e-07, 'completion_length': 225.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.5595238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5416667461395264, 'reward_std': 0.09219718724489212, 'kl': 1.5234375, 'epoch': 0.6}
[2025-03-02 21:07:03,954] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
60%|█████▉ | 2566/4286 [15:59:48<10:40:57, 22.36s/it] {'loss': 0.0601, 'grad_norm': 7.002912783854806, 'learning_rate': 4.013065795613625e-07, 'completion_length': 229.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6650298535823822, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6293155550956726, 'reward_std': 0.08777855802327394, 'kl': 1.5, 'epoch': 0.6}
[2025-03-02 21:07:24,509] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
60%|█████▉ | 2567/4286 [16:00:09<10:25:04, 21.82s/it] {'loss': 0.0279, 'grad_norm': 7.09658291938383, 'learning_rate': 4.010732617825478e-07, 'completion_length': 192.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.672619104385376, 'reward_std': 0.12471206858754158, 'kl': 0.69775390625, 'epoch': 0.6}
60%|█████▉ | 2568/4286 [16:00:30<10:24:07, 21.80s/it] {'loss': 0.0194, 'grad_norm': 1.8259251888751855, 'learning_rate': 4.008399440037331e-07, 'completion_length': 213.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.6315476298332214, 'rewards/format_reward': 1.0, 'reward': 1.631547749042511, 'reward_std': 0.07100121676921844, 'kl': 0.4873046875, 'epoch': 0.6}
60%|█████▉ | 2569/4286 [16:00:53<10:32:51, 22.12s/it] {'loss': 0.0466, 'grad_norm': 12.10032939883476, 'learning_rate': 4.006066262249183e-07, 'completion_length': 242.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5452380925416946, 'rewards/format_reward': 1.0, 'reward': 1.545238196849823, 'reward_std': 0.11951855942606926, 'kl': 1.162109375, 'epoch': 0.6}
60%|█████▉ | 2570/4286 [16:01:13<10:12:33, 21.42s/it] {'loss': 0.0543, 'grad_norm': 17.51535313166506, 'learning_rate': 4.0037330844610357e-07, 'completion_length': 194.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5937500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5758929252624512, 'reward_std': 0.17327457293868065, 'kl': 1.3583984375, 'epoch': 0.6}
60%|█████▉ | 2571/4286 [16:01:34<10:11:23, 21.39s/it] {'loss': 0.0375, 'grad_norm': 9.307567324589705, 'learning_rate': 4.001399906672888e-07, 'completion_length': 208.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.574404776096344, 'rewards/format_reward': 1.0, 'reward': 1.574404776096344, 'reward_std': 0.07753365486860275, 'kl': 0.93701171875, 'epoch': 0.6}
60%|██████ | 2572/4286 [16:01:56<10:09:21, 21.33s/it] {'loss': 0.0359, 'grad_norm': 7.705247114019191, 'learning_rate': 3.9990667288847407e-07, 'completion_length': 203.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.504464328289032, 'rewards/format_reward': 1.0, 'reward': 1.5044644474983215, 'reward_std': 0.12535592541098595, 'kl': 0.9013671875, 'epoch': 0.6}
60%|██████ | 2573/4286 [16:02:16<10:01:03, 21.05s/it] {'loss': 0.0388, 'grad_norm': 8.909316090653023, 'learning_rate': 3.9967335510965935e-07, 'completion_length': 215.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5363095849752426, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5184524655342102, 'reward_std': 0.11669468879699707, 'kl': 0.9697265625, 'epoch': 0.6}
60%|██████ | 2574/4286 [16:02:40<10:26:12, 21.95s/it] {'loss': 0.0568, 'grad_norm': 6.214500612520857, 'learning_rate': 3.9944003733084457e-07, 'completion_length': 233.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6250000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.07477238774299622, 'kl': 1.41796875, 'epoch': 0.6}
60%|██████ | 2575/4286 [16:03:02<10:26:58, 21.99s/it] {'loss': 0.0268, 'grad_norm': 2.3333669424387127, 'learning_rate': 3.9920671955202984e-07, 'completion_length': 227.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.724702388048172, 'rewards/format_reward': 1.0, 'reward': 1.7247024774551392, 'reward_std': 0.04464286006987095, 'kl': 0.6669921875, 'epoch': 0.6}
60%|██████ | 2576/4286 [16:03:24<10:25:09, 21.94s/it] {'loss': 0.0404, 'grad_norm': 4.450871772178446, 'learning_rate': 3.9897340177321507e-07, 'completion_length': 222.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5678571909666061, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5500001311302185, 'reward_std': 0.09043434448540211, 'kl': 1.0078125, 'epoch': 0.6}
60%|██████ | 2577/4286 [16:03:48<10:45:22, 22.66s/it] {'loss': 0.0616, 'grad_norm': 13.304208662882546, 'learning_rate': 3.9874008399440034e-07, 'completion_length': 193.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5877976566553116, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5699405670166016, 'reward_std': 0.16645298898220062, 'kl': 1.5390625, 'epoch': 0.6}
60%|██████ | 2578/4286 [16:04:09<10:26:27, 22.01s/it] {'loss': 0.007, 'grad_norm': 2.3713422469949603, 'learning_rate': 3.985067662155856e-07, 'completion_length': 209.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.761904776096344, 'rewards/format_reward': 1.0, 'reward': 1.7619048357009888, 'reward_std': 0.051976494025439024, 'kl': 0.17578125, 'epoch': 0.6}
60%|██████ | 2579/4286 [16:04:32<10:34:01, 22.29s/it] {'loss': 0.0109, 'grad_norm': 27.069688435994586, 'learning_rate': 3.9827344843677084e-07, 'completion_length': 214.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.053571430034935474, 'kl': 0.271484375, 'epoch': 0.6}
60%|██████ | 2580/4286 [16:04:53<10:28:39, 22.11s/it] {'loss': 0.0308, 'grad_norm': 6.567032177810918, 'learning_rate': 3.980401306579561e-07, 'completion_length': 217.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.6437500566244125, 'rewards/format_reward': 1.0, 'reward': 1.6437500715255737, 'reward_std': 0.053341024555265903, 'kl': 0.77197265625, 'epoch': 0.6}
60%|██████ | 2581/4286 [16:05:15<10:21:32, 21.87s/it] {'loss': 0.0461, 'grad_norm': 2.0111340694161712, 'learning_rate': 3.978068128791414e-07, 'completion_length': 207.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.5849567651748657, 'rewards/format_reward': 1.0, 'reward': 1.5849567651748657, 'reward_std': 0.11987622082233429, 'kl': 1.15625, 'epoch': 0.6}
60%|██████ | 2582/4286 [16:05:36<10:18:35, 21.78s/it] {'loss': 0.0456, 'grad_norm': 5.436192887215133, 'learning_rate': 3.975734951003266e-07, 'completion_length': 224.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5041667371988297, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4863096475601196, 'reward_std': 0.09624949470162392, 'kl': 1.13671875, 'epoch': 0.6}
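The stage3 warnings above recommend adding synchronized `get_accelerator().empty_cache()` calls so all ranks flush their allocator caches at the same point. A minimal sketch of that mitigation, assuming a DeepSpeed training loop; the helper name and the `flush_every` cadence are illustrative, not from this run:

```python
def flush_caches_periodically(step, flush_every, flush_fn):
    """Invoke flush_fn every `flush_every` optimizer steps so that every
    rank releases its allocator cache at the same point in training.
    Returns True when a flush was performed."""
    if flush_every > 0 and step % flush_every == 0:
        flush_fn()
        return True
    return False

# In an actual DeepSpeed loop, flush_fn would be the accelerator's cache
# flush, e.g.:
#   from deepspeed.accelerator import get_accelerator
#   flush_fn = get_accelerator().empty_cache
```

Because every rank calls the helper with the same step counter, the flushes line up across ranks instead of each rank stalling at a different time.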
60%|██████ | 2583/4286 [16:05:59<10:24:53, 22.02s/it] {'loss': 0.0483, 'grad_norm': 3.6551944242473073, 'learning_rate': 3.973401773215119e-07, 'completion_length': 200.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6642857491970062, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.646428644657135, 'reward_std': 0.2050309181213379, 'kl': 1.2109375, 'epoch': 0.6}
60%|██████ | 2584/4286 [16:06:19<10:10:46, 21.53s/it] {'loss': 0.0611, 'grad_norm': 6.7976510026182035, 'learning_rate': 3.971068595426971e-07, 'completion_length': 220.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5907739400863647, 'reward_std': 0.10852411389350891, 'kl': 1.529296875, 'epoch': 0.6}
60%|██████ | 2585/4286 [16:06:45<10:44:13, 22.72s/it] {'loss': 0.035, 'grad_norm': 5.8557681534381025, 'learning_rate': 3.968735417638824e-07, 'completion_length': 227.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.5973214656114578, 'rewards/format_reward': 1.0, 'reward': 1.5973215103149414, 'reward_std': 0.09960654750466347, 'kl': 0.876953125, 'epoch': 0.6}
60%|██████ | 2586/4286 [16:07:06<10:35:22, 22.42s/it] {'loss': 0.0394, 'grad_norm': 13.426449680974564, 'learning_rate': 3.9664022398506766e-07, 'completion_length': 233.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5282738655805588, 'rewards/format_reward': 1.0, 'reward': 1.52827388048172, 'reward_std': 0.09133677184581757, 'kl': 0.984375, 'epoch': 0.6}
60%|██████ | 2587/4286 [16:07:28<10:29:36, 22.23s/it] {'loss': 0.0113, 'grad_norm': 1.5766121619673625, 'learning_rate': 3.964069062062529e-07, 'completion_length': 251.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5485119670629501, 'rewards/format_reward': 1.0, 'reward': 1.5485119819641113, 'reward_std': 0.05286813899874687, 'kl': 0.2822265625, 'epoch': 0.6}
60%|██████ | 2588/4286 [16:07:49<10:16:41, 21.79s/it] {'loss': 0.0318, 'grad_norm': 1.5200191702240988, 'learning_rate': 3.9617358842743816e-07, 'completion_length': 174.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.7490080296993256, 'rewards/format_reward': 1.0, 'reward': 1.749008059501648, 'reward_std': 0.08358996734023094, 'kl': 0.79296875, 'epoch': 0.6}
60%|██████ | 2589/4286 [16:08:12<10:28:35, 22.22s/it] {'loss': 0.0209, 'grad_norm': 3.992233581229417, 'learning_rate': 3.959402706486234e-07, 'completion_length': 250.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.1011904738843441, 'kl': 0.521484375, 'epoch': 0.6}
60%|██████ | 2590/4286 [16:08:36<10:37:45, 22.56s/it] {'loss': 0.0202, 'grad_norm': 3.6556271936349396, 'learning_rate': 3.9570695286980865e-07, 'completion_length': 227.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5104167014360428, 'rewards/format_reward': 1.0, 'reward': 1.5104168057441711, 'reward_std': 0.0625000037252903, 'kl': 0.505859375, 'epoch': 0.6}
60%|██████ | 2591/4286 [16:08:57<10:31:25, 22.35s/it] {'loss': 0.0224, 'grad_norm': 4.556958105948353, 'learning_rate': 3.9547363509099393e-07, 'completion_length': 225.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.516369104385376, 'rewards/format_reward': 1.0, 'reward': 1.516369104385376, 'reward_std': 0.07992978021502495, 'kl': 0.560546875, 'epoch': 0.6}
60%|██████ | 2592/4286 [16:09:22<10:50:01, 23.02s/it] {'loss': 0.0647, 'grad_norm': 5.7303599324108, 'learning_rate': 3.9524031731217915e-07, 'completion_length': 271.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5848214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5669644474983215, 'reward_std': 0.14809807389974594, 'kl': 1.6171875, 'epoch': 0.6}
60%|██████ | 2593/4286 [16:09:42<10:25:05, 22.15s/it] {'loss': 0.0315, 'grad_norm': 4.3012310912441984, 'learning_rate': 3.950069995333644e-07, 'completion_length': 186.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5431548207998276, 'rewards/format_reward': 1.0, 'reward': 1.5431548953056335, 'reward_std': 0.0922619104385376, 'kl': 0.7861328125, 'epoch': 0.6}
61%|██████ | 2594/4286 [16:10:05<10:26:33, 22.22s/it] {'loss': 0.0434, 'grad_norm': 17.802668937077065, 'learning_rate': 3.9477368175454965e-07, 'completion_length': 229.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6964285671710968, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.10260479152202606, 'kl': 1.0859375, 'epoch': 0.61}
61%|██████ | 2595/4286 [16:10:30<10:51:03, 23.10s/it] {'loss': 0.0363, 'grad_norm': 11.37891882884413, 'learning_rate': 3.945403639757349e-07, 'completion_length': 235.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7110806405544281, 'rewards/format_reward': 1.0, 'reward': 1.71108078956604, 'reward_std': 0.06263979524374008, 'kl': 0.90625, 'epoch': 0.61}
61%|██████ | 2596/4286 [16:10:53<10:56:17, 23.30s/it] {'loss': 0.0403, 'grad_norm': 9.871121302866644, 'learning_rate': 3.943070461969202e-07, 'completion_length': 245.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6092261970043182, 'rewards/format_reward': 1.0, 'reward': 1.6092262864112854, 'reward_std': 0.10940021276473999, 'kl': 1.009765625, 'epoch': 0.61}
61%|██████ | 2597/4286 [16:11:18<11:09:34, 23.79s/it] {'loss': 0.0525, 'grad_norm': 9.23816044092305, 'learning_rate': 3.940737284181054e-07, 'completion_length': 269.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.4727891534566879, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4549319744110107, 'reward_std': 0.15401731431484222, 'kl': 1.314453125, 'epoch': 0.61}
61%|██████ | 2598/4286 [16:11:40<10:54:10, 23.25s/it] {'loss': 0.0871, 'grad_norm': 8.457139338022031, 'learning_rate': 3.938404106392907e-07, 'completion_length': 206.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6056548357009888, 'reward_std': 0.23196285963058472, 'kl': 2.1796875, 'epoch': 0.61}
61%|██████ | 2599/4286 [16:12:04<10:59:30, 23.46s/it] {'loss': 0.13, 'grad_norm': 9.58647994270979, 'learning_rate': 3.936070928604759e-07, 'completion_length': 215.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.6187500059604645, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5651786923408508, 'reward_std': 0.23988685011863708, 'kl': 3.25, 'epoch': 0.61}
61%|██████ | 2600/4286 [16:12:28<11:01:40, 23.55s/it] {'loss': 0.019, 'grad_norm': 6.233024825119754, 'learning_rate': 3.933737750816612e-07, 'completion_length': 247.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6041666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6041667461395264, 'reward_std': 0.05952380783855915, 'kl': 0.474609375, 'epoch': 0.61}
61%|██████ | 2601/4286 [16:16:52<44:43:58, 95.57s/it] {'loss': 0.0518, 'grad_norm': 14.281038682893902, 'learning_rate': 3.9314045730284647e-07, 'completion_length': 238.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5312500596046448, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4955358505249023, 'reward_std': 0.1967640146613121, 'kl': 1.2900390625, 'epoch': 0.61}
61%|██████ | 2602/4286 [16:17:13<34:21:15, 73.44s/it] {'loss': 0.0644, 'grad_norm': 8.06403245483706, 'learning_rate': 3.929071395240317e-07, 'completion_length': 252.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6473214328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6116072535514832, 'reward_std': 0.22001109272241592, 'kl': 1.611328125, 'epoch': 0.61}
61%|██████ | 2603/4286 [16:17:34<26:58:50, 57.71s/it] {'loss': 0.0396, 'grad_norm': 8.71987535571611, 'learning_rate': 3.9267382174521697e-07, 'completion_length': 197.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.6458333432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6279762983322144, 'reward_std': 0.14444325678050518, 'kl': 0.98828125, 'epoch': 0.61}
61%|██████ | 2604/4286 [16:17:58<22:12:46, 47.54s/it] {'loss': 0.0723, 'grad_norm': 7.193298255304491, 'learning_rate': 3.9244050396640224e-07, 'completion_length': 187.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5456932932138443, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.492121934890747, 'reward_std': 0.13110870495438576, 'kl': 1.810546875, 'epoch': 0.61}
61%|██████ | 2605/4286 [16:18:20<18:32:07, 39.70s/it] {'loss': 0.1141, 'grad_norm': 5.776210354532907, 'learning_rate': 3.9220718618758746e-07, 'completion_length': 202.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.4407738447189331, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3872024416923523, 'reward_std': 0.22050917893648148, 'kl': 2.8515625, 'epoch': 0.61}
61%|██████ | 2606/4286 [16:18:42<16:07:35, 34.56s/it] {'loss': 0.0301, 'grad_norm': 4.888650517698246, 'learning_rate': 3.9197386840877274e-07, 'completion_length': 247.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.4724702686071396, 'rewards/format_reward': 1.0, 'reward': 1.4724703431129456, 'reward_std': 0.06298866309225559, 'kl': 0.751953125, 'epoch': 0.61}
[2025-03-02 21:26:20,895] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
61%|██████ | 2607/4286 [16:19:05<14:27:56, 31.02s/it] {'loss': 0.0353, 'grad_norm': 2.2243065702276, 'learning_rate': 3.9174055062995796e-07, 'completion_length': 235.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.41624152660369873, 'rewards/format_reward': 1.0, 'reward': 1.4162416458129883, 'reward_std': 0.12501439079642296, 'kl': 0.8779296875, 'epoch': 0.61}
61%|██████ | 2608/4286 [16:19:27<13:10:21, 28.26s/it] {'loss': 0.0632, 'grad_norm': 1.7572010038591817, 'learning_rate': 3.9150723285114324e-07, 'completion_length': 234.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.630952388048172, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.12756164371967316, 'kl': 1.57421875, 'epoch': 0.61}
61%|██████ | 2609/4286 [16:19:49<12:17:15, 26.38s/it] {'loss': 0.0849, 'grad_norm': 3.441928884045685, 'learning_rate': 3.912739150723285e-07, 'completion_length': 211.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6113095879554749, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5934524536132812, 'reward_std': 0.18625233322381973, 'kl': 2.1171875, 'epoch': 0.61}
61%|██████ | 2610/4286 [16:20:11<11:40:38, 25.08s/it] {'loss': 0.067, 'grad_norm': 1.358407421409627, 'learning_rate': 3.9104059729351373e-07, 'completion_length': 227.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6220238506793976, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5863096714019775, 'reward_std': 0.16515469551086426, 'kl': 1.66796875, 'epoch': 0.61}
[2025-03-02 21:27:51,491] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
61%|██████ | 2611/4286 [16:20:36<11:37:12, 24.97s/it] {'loss': 0.0159, 'grad_norm': 5.570097574316074, 'learning_rate': 3.90807279514699e-07, 'completion_length': 272.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5401786267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5223215222358704, 'reward_std': 0.1160714365541935, 'kl': 0.39794921875, 'epoch': 0.61}
61%|██████ | 2612/4286 [16:20:59<11:23:07, 24.48s/it] {'loss': 0.0106, 'grad_norm': 1.9901609779241416, 'learning_rate': 3.9057396173588423e-07, 'completion_length': 230.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.05952381435781717, 'kl': 0.26416015625, 'epoch': 0.61}
61%|██████ | 2613/4286 [16:21:25<11:38:44, 25.06s/it] {'loss': 0.0595, 'grad_norm': 10.955074985991622, 'learning_rate': 3.903406439570695e-07, 'completion_length': 275.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5824404954910278, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.564583420753479, 'reward_std': 0.20529257506132126, 'kl': 1.484375, 'epoch': 0.61}
[2025-03-02 21:29:07,329] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
61%|██████ | 2614/4286 [16:21:51<11:46:59, 25.37s/it] {'loss': 0.095, 'grad_norm': 8.903041800924942, 'learning_rate': 3.901073261782548e-07, 'completion_length': 244.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.5925595164299011, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5568453073501587, 'reward_std': 0.19485201686620712, 'kl': 2.37890625, 'epoch': 0.61}
[2025-03-02 21:29:34,192] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
61%|██████ | 2615/4286 [16:22:18<11:59:01, 25.82s/it] {'loss': 0.0112, 'grad_norm': 2.89864192809852, 'learning_rate': 3.8987400839944e-07, 'completion_length': 283.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 1.0, 'reward': 1.6235119104385376, 'reward_std': 0.08026434481143951, 'kl': 0.27783203125, 'epoch': 0.61}
61%|██████ | 2616/4286 [16:22:44<11:53:29, 25.63s/it] {'loss': 0.0329, 'grad_norm': 5.63051703484302, 'learning_rate': 3.896406906206253e-07, 'completion_length': 233.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.7485119998455048, 'rewards/format_reward': 1.0, 'reward': 1.748512089252472, 'reward_std': 0.09183453768491745, 'kl': 0.8212890625, 'epoch': 0.61}
[2025-03-02 21:30:24,678] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
61%|██████ | 2617/4286 [16:23:09<11:50:06, 25.53s/it] {'loss': 0.0528, 'grad_norm': 2.7147863762428424, 'learning_rate': 3.894073728418105e-07, 'completion_length': 259.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5550595670938492, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.501488208770752, 'reward_std': 0.1915576383471489, 'kl': 1.31689453125, 'epoch': 0.61}
61%|██████ | 2618/4286 [16:23:31<11:23:52, 24.60s/it] {'loss': 0.0096, 'grad_norm': 1.4295147149770213, 'learning_rate': 3.891740550629958e-07, 'completion_length': 209.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.7458333969116211, 'rewards/format_reward': 1.0, 'reward': 1.7458334565162659, 'reward_std': 0.07983230799436569, 'kl': 0.240234375, 'epoch': 0.61}
61%|██████ | 2619/4286 [16:23:52<10:49:48, 23.39s/it] {'loss': 0.0118, 'grad_norm': 11.032712279273897, 'learning_rate': 3.8894073728418105e-07, 'completion_length': 210.14287567138672, 'rewards/only_full_func_accuracy_reward': 0.7675595879554749, 'rewards/format_reward': 1.0, 'reward': 1.7675595879554749, 'reward_std': 0.04529522359371185, 'kl': 0.29443359375, 'epoch': 0.61}
[2025-03-02 21:31:31,491] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
61%|██████ | 2620/4286 [16:24:16<10:52:59, 23.52s/it] {'loss': 0.0327, 'grad_norm': 2.1642857668033924, 'learning_rate': 3.8870741950536627e-07, 'completion_length': 248.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7011905312538147, 'rewards/format_reward': 1.0, 'reward': 1.7011905908584595, 'reward_std': 0.08531337231397629, 'kl': 0.818359375, 'epoch': 0.61}
61%|██████ | 2621/4286 [16:24:40<10:56:38, 23.66s/it] {'loss': 0.046, 'grad_norm': 4.63130312508756, 'learning_rate': 3.8847410172655155e-07, 'completion_length': 241.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.6622024774551392, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.08303267788141966, 'kl': 1.154296875, 'epoch': 0.61}
61%|██████ | 2622/4286 [16:25:06<11:18:08, 24.45s/it] {'loss': 0.0336, 'grad_norm': 2.727161435570232, 'learning_rate': 3.8824078394773677e-07, 'completion_length': 296.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5395834147930145, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5217263102531433, 'reward_std': 0.1410353146493435, 'kl': 0.84375, 'epoch': 0.61}
61%|██████ | 2623/4286 [16:25:33<11:37:28, 25.16s/it] {'loss': 0.0373, 'grad_norm': 4.185280799757442, 'learning_rate': 3.8800746616892204e-07, 'completion_length': 350.00001525878906, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.20064513385295868, 'kl': 0.93359375, 'epoch': 0.61}
61%|██████ | 2624/4286 [16:25:59<11:46:38, 25.51s/it] {'loss': 0.0393, 'grad_norm': 1.5065188210088762, 'learning_rate': 3.877741483901073e-07, 'completion_length': 245.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5886905193328857, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.535119116306305, 'reward_std': 0.19300784170627594, 'kl': 0.986328125, 'epoch': 0.61}
61%|██████ | 2625/4286 [16:26:25<11:45:44, 25.49s/it] {'loss': 0.0286, 'grad_norm': 1.9785126092039371, 'learning_rate': 3.8754083061129254e-07, 'completion_length': 278.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6860120296478271, 'reward_std': 0.08257311163470149, 'kl': 0.716796875, 'epoch': 0.61}
61%|██████▏ | 2626/4286 [16:26:45<11:03:02, 23.97s/it] {'loss': 0.0737, 'grad_norm': 2.792394350913351, 'learning_rate': 3.873075128324778e-07, 'completion_length': 199.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.07738096080720425, 'kl': 1.84765625, 'epoch': 0.61}
61%|██████▏ | 2627/4286 [16:27:10<11:13:28, 24.36s/it] {'loss': 0.0341, 'grad_norm': 1.1334448566423776, 'learning_rate': 3.870741950536631e-07, 'completion_length': 279.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.682043731212616, 'rewards/format_reward': 1.0, 'reward': 1.6820437908172607, 'reward_std': 0.04335014149546623, 'kl': 0.85546875, 'epoch': 0.61}
[2025-03-02 21:34:51,905] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
61%|██████▏ | 2628/4286 [16:27:36<11:25:26, 24.80s/it] {'loss': 0.0105, 'grad_norm': 4.236015479325412, 'learning_rate': 3.868408772748483e-07, 'completion_length': 263.19644927978516, 'rewards/only_full_func_accuracy_reward': 0.68601194024086, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6502977013587952, 'reward_std': 0.10798206739127636, 'kl': 0.2626953125, 'epoch': 0.61}
61%|██████▏ | 2629/4286 [16:28:02<11:33:43, 25.12s/it] {'loss': 0.0135, 'grad_norm': 0.9980997016695381, 'learning_rate': 3.866075594960336e-07, 'completion_length': 296.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.6455357670783997, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.627678632736206, 'reward_std': 0.09859585203230381, 'kl': 0.33642578125, 'epoch': 0.61}
61%|██████▏ | 2630/4286 [16:28:24<11:12:26, 24.36s/it] {'loss': 0.0288, 'grad_norm': 0.9799267518277024, 'learning_rate': 3.863742417172188e-07, 'completion_length': 224.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.05541309341788292, 'kl': 0.72021484375, 'epoch': 0.61}
61%|██████▏ | 2631/4286 [16:28:51<11:31:19, 25.06s/it] {'loss': 0.0089, 'grad_norm': 3.123284211200971, 'learning_rate': 3.861409239384041e-07, 'completion_length': 245.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7440477013587952, 'reward_std': 0.11868223827332258, 'kl': 0.220703125, 'epoch': 0.61}
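The learning_rate column falls by a constant ~2.333e-10 per step, which is consistent with a linear decay from a peak of about 1e-6 to zero over the 4286 total steps shown in the progress bar. A sketch that reproduces the logged values under that assumption; the peak LR and the schedule type are inferred from the numbers, not read from the run's config:

```python
PEAK_LR = 1e-6       # inferred from the logged values, not from the config
TOTAL_STEPS = 4286   # total optimizer steps shown in the progress bar

def linear_decay_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS):
    """Linearly anneal the learning rate from peak_lr down to zero
    at total_steps."""
    return peak_lr * (1 - step / total_steps)
```

For example, `linear_decay_lr(2561)` matches the 4.0247e-07 logged at step 2561, supporting the linear-to-zero reading.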
61%|██████▏ | 2631/4286 [16:28:51<11:31:19, 25.06s/it][2025-03-02 21:36:34,269] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 61%|██████▏ | 2632/4286 [16:29:18<11:48:41, 25.71s/it] {'loss': 0.0087, 'grad_norm': 9.552107826816373, 'learning_rate': 3.8590760615958936e-07, 'completion_length': 278.85716247558594, 'rewards/only_full_func_accuracy_reward': 0.760416716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7425596117973328, 'reward_std': 0.12310440465807915, 'kl': 0.21728515625, 'epoch': 0.61} 61%|██████▏ | 2632/4286 [16:29:18<11:48:41, 25.71s/it] 61%|██████▏ | 2633/4286 [16:29:44<11:51:35, 25.83s/it] {'loss': 0.0279, 'grad_norm': 3.3570013768056186, 'learning_rate': 3.856742883807746e-07, 'completion_length': 323.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6250000447034836, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.1369047686457634, 'kl': 0.69775390625, 'epoch': 0.61} 61%|██████▏ | 2633/4286 [16:29:45<11:51:35, 25.83s/it] 61%|██████▏ | 2634/4286 [16:30:09<11:41:44, 25.49s/it] {'loss': 0.0064, 'grad_norm': 0.4451950566934433, 'learning_rate': 3.8544097060195986e-07, 'completion_length': 286.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6172619163990021, 'rewards/format_reward': 1.0, 'reward': 1.6172620058059692, 'reward_std': 0.02660532109439373, 'kl': 0.15869140625, 'epoch': 0.61} 61%|██████▏ | 2634/4286 [16:30:09<11:41:44, 25.49s/it] 61%|██████▏ | 2635/4286 [16:30:34<11:35:56, 25.29s/it] {'loss': 0.022, 'grad_norm': 1.1531292416125658, 'learning_rate': 3.852076528231451e-07, 
'completion_length': 295.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5133928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4955358505249023, 'reward_std': 0.0922619104385376, 'kl': 0.55126953125, 'epoch': 0.61} 61%|██████▏ | 2635/4286 [16:30:34<11:35:56, 25.29s/it] 62%|██████▏ | 2636/4286 [16:30:59<11:33:16, 25.21s/it] {'loss': 0.0082, 'grad_norm': 0.876691657291841, 'learning_rate': 3.8497433504433036e-07, 'completion_length': 305.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5833333730697632, 'reward_std': 0.11266787722706795, 'kl': 0.20458984375, 'epoch': 0.62} 62%|██████▏ | 2636/4286 [16:30:59<11:33:16, 25.21s/it][2025-03-02 21:38:39,144] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 62%|██████▏ | 2637/4286 [16:31:23<11:24:41, 24.91s/it] {'loss': 0.0061, 'grad_norm': 1.3814771218041026, 'learning_rate': 3.8474101726551563e-07, 'completion_length': 267.1964416503906, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.6785715818405151, 'reward_std': 0.011904764920473099, 'kl': 0.1533203125, 'epoch': 0.62} 62%|██████▏ | 2637/4286 [16:31:23<11:24:41, 24.91s/it][2025-03-02 21:39:04,455] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 62%|██████▏ | 2638/4286 [16:31:49<11:27:33, 25.03s/it] {'loss': 0.0606, 'grad_norm': 10.871446855193163, 'learning_rate': 3.8450769948670085e-07, 'completion_length': 291.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.586309552192688, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5505953431129456, 'reward_std': 0.22599736601114273, 'kl': 1.51171875, 'epoch': 0.62} 62%|██████▏ | 2638/4286 [16:31:49<11:27:33, 25.03s/it][2025-03-02 21:39:31,717] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 62%|██████▏ | 2639/4286 [16:32:16<11:45:30, 25.70s/it] {'loss': 0.0399, 'grad_norm': 1.641750451689967, 'learning_rate': 3.8427438170788613e-07, 'completion_length': 283.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6092262268066406, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5735120177268982, 'reward_std': 0.20983455330133438, 'kl': 0.99609375, 'epoch': 0.62} 62%|██████▏ | 2639/4286 [16:32:16<11:45:30, 25.70s/it] 62%|██████▏ | 2640/4286 [16:32:40<11:32:39, 25.25s/it] {'loss': 0.0068, 'grad_norm': 4.831215633993264, 'learning_rate': 3.8404106392907135e-07, 'completion_length': 283.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.6711310148239136, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.0327380932867527, 'kl': 0.17138671875, 'epoch': 0.62} 62%|██████▏ | 2640/4286 [16:32:40<11:32:39, 25.25s/it] 62%|██████▏ | 
2641/4286 [16:33:04<11:23:14, 24.92s/it] {'loss': 0.0099, 'grad_norm': 1.2515159658860944, 'learning_rate': 3.8380774615025663e-07, 'completion_length': 268.44644927978516, 'rewards/only_full_func_accuracy_reward': 0.4761905074119568, 'rewards/format_reward': 1.0, 'reward': 1.4761906862258911, 'reward_std': 0.02380952797830105, 'kl': 0.24755859375, 'epoch': 0.62} 62%|██████▏ | 2641/4286 [16:33:04<11:23:14, 24.92s/it] 62%|██████▏ | 2642/4286 [16:33:28<11:10:24, 24.47s/it] {'loss': 0.0117, 'grad_norm': 1.7684355947748613, 'learning_rate': 3.835744283714419e-07, 'completion_length': 248.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.8440476655960083, 'rewards/format_reward': 1.0, 'reward': 1.844047725200653, 'reward_std': 0.059438424184918404, 'kl': 0.29150390625, 'epoch': 0.62} 62%|██████▏ | 2642/4286 [16:33:28<11:10:24, 24.47s/it] 62%|██████▏ | 2643/4286 [16:33:53<11:18:59, 24.80s/it] {'loss': 0.0406, 'grad_norm': 3.4179389106117006, 'learning_rate': 3.833411105926271e-07, 'completion_length': 261.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.5818452835083008, 'rewards/format_reward': 1.0, 'reward': 1.5818453431129456, 'reward_std': 0.11723900958895683, 'kl': 1.0166015625, 'epoch': 0.62} 62%|██████▏ | 2643/4286 [16:33:53<11:18:59, 24.80s/it] 62%|██████▏ | 2644/4286 [16:34:17<11:13:26, 24.61s/it] {'loss': 0.0106, 'grad_norm': 1.1600200912886607, 'learning_rate': 3.831077928138124e-07, 'completion_length': 269.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6741072535514832, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.05281119421124458, 'kl': 0.2646484375, 'epoch': 0.62} 62%|██████▏ | 2644/4286 [16:34:17<11:13:26, 24.61s/it] 62%|██████▏ | 2645/4286 [16:34:43<11:18:37, 24.81s/it] {'loss': 0.1294, 'grad_norm': 3284.3570008629913, 'learning_rate': 3.828744750349976e-07, 'completion_length': 331.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.39001625776290894, 'rewards/format_reward': 1.0, 
'reward': 1.3900163173675537, 'reward_std': 0.05283702630549669, 'kl': 3.234375, 'epoch': 0.62} 62%|██████▏ | 2645/4286 [16:34:43<11:18:37, 24.81s/it] 62%|██████▏ | 2646/4286 [16:35:07<11:14:17, 24.67s/it] {'loss': 0.026, 'grad_norm': 23.020006795130517, 'learning_rate': 3.826411572561829e-07, 'completion_length': 250.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6848214566707611, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.66696435213089, 'reward_std': 0.15025083720684052, 'kl': 0.6513671875, 'epoch': 0.62} 62%|██████▏ | 2646/4286 [16:35:07<11:14:17, 24.67s/it] 62%|██████▏ | 2647/4286 [16:35:34<11:33:34, 25.39s/it] {'loss': 0.0259, 'grad_norm': 2.277791428885578, 'learning_rate': 3.8240783947736817e-07, 'completion_length': 281.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6235120296478271, 'reward_std': 0.24869729578495026, 'kl': 0.6474609375, 'epoch': 0.62} 62%|██████▏ | 2647/4286 [16:35:34<11:33:34, 25.39s/it] 62%|██████▏ | 2648/4286 [16:36:00<11:39:11, 25.61s/it] {'loss': 0.0227, 'grad_norm': 4.955800621829054, 'learning_rate': 3.821745216985534e-07, 'completion_length': 311.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.633928656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6160715818405151, 'reward_std': 0.15324145182967186, 'kl': 0.5673828125, 'epoch': 0.62} 62%|██████▏ | 2648/4286 [16:36:00<11:39:11, 25.61s/it] 62%|██████▏ | 2649/4286 [16:36:26<11:38:18, 25.59s/it] {'loss': 0.066, 'grad_norm': 11.283128226083043, 'learning_rate': 3.8194120391973867e-07, 'completion_length': 278.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5208333432674408, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4672619700431824, 'reward_std': 0.2603013291954994, 'kl': 1.6484375, 'epoch': 0.62} 62%|██████▏ | 2649/4286 [16:36:26<11:38:18, 25.59s/it] 62%|██████▏ | 2650/4286 [16:36:50<11:27:30, 25.21s/it] {'loss': 
0.0174, 'grad_norm': 2.0478089323807653, 'learning_rate': 3.8170788614092394e-07, 'completion_length': 254.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994048953056335, 'reward_std': 0.0829059649258852, 'kl': 0.4345703125, 'epoch': 0.62} 62%|██████▏ | 2650/4286 [16:36:50<11:27:30, 25.21s/it][2025-03-02 21:44:32,250] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 62%|██████▏ | 2651/4286 [16:37:16<11:36:15, 25.55s/it] {'loss': 0.1101, 'grad_norm': 3.207925290642492, 'learning_rate': 3.8147456836210917e-07, 'completion_length': 238.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6324405670166016, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.578869104385376, 'reward_std': 0.2931371107697487, 'kl': 2.75, 'epoch': 0.62} 62%|██████▏ | 2651/4286 [16:37:16<11:36:15, 25.55s/it] 62%|██████▏ | 2652/4286 [16:37:42<11:38:00, 25.63s/it] {'loss': 0.1002, 'grad_norm': 9.133709415691609, 'learning_rate': 3.8124125058329444e-07, 'completion_length': 295.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5305059850215912, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4590774774551392, 'reward_std': 0.23020180314779282, 'kl': 2.51171875, 'epoch': 0.62} 62%|██████▏ | 2652/4286 [16:37:42<11:38:00, 25.63s/it][2025-03-02 21:45:23,594] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. 
if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 62%|██████▏ | 2653/4286 [16:38:08<11:36:44, 25.60s/it] {'loss': 0.0293, 'grad_norm': 1.7023455902680078, 'learning_rate': 3.8100793280447966e-07, 'completion_length': 260.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.6038691103458405, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5860120058059692, 'reward_std': 0.21989690512418747, 'kl': 0.732421875, 'epoch': 0.62} 62%|██████▏ | 2653/4286 [16:38:08<11:36:44, 25.60s/it] 62%|██████▏ | 2654/4286 [16:38:32<11:25:19, 25.20s/it] {'loss': 0.0132, 'grad_norm': 3.8405020305611246, 'learning_rate': 3.8077461502566494e-07, 'completion_length': 215.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6171343922615051, 'rewards/format_reward': 1.0, 'reward': 1.61713445186615, 'reward_std': 0.052699193358421326, 'kl': 0.3291015625, 'epoch': 0.62} 62%|██████▏ | 2654/4286 [16:38:32<11:25:19, 25.20s/it] 62%|██████▏ | 2655/4286 [16:38:55<11:06:34, 24.52s/it] {'loss': 0.0937, 'grad_norm': 539.0915719566719, 'learning_rate': 3.805412972468502e-07, 'completion_length': 265.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.05029458552598953, 'kl': 2.33837890625, 'epoch': 0.62} 62%|██████▏ | 2655/4286 [16:38:55<11:06:34, 24.52s/it][2025-03-02 21:46:38,521] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 62%|██████▏ | 2656/4286 [16:39:23<11:32:17, 25.48s/it] {'loss': 0.0262, 'grad_norm': 4.550898746630911, 'learning_rate': 3.8030797946803544e-07, 'completion_length': 317.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.4931548088788986, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4574406147003174, 'reward_std': 0.09727714583277702, 'kl': 0.654296875, 'epoch': 0.62} 62%|██████▏ | 2656/4286 [16:39:23<11:32:17, 25.48s/it] 62%|██████▏ | 2657/4286 [16:39:47<11:19:11, 25.02s/it] {'loss': 0.0112, 'grad_norm': 1.1670447568653748, 'learning_rate': 3.800746616892207e-07, 'completion_length': 267.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.7726190984249115, 'rewards/format_reward': 1.0, 'reward': 1.7726190686225891, 'reward_std': 0.03844234719872475, 'kl': 0.2802734375, 'epoch': 0.62} 62%|██████▏ | 2657/4286 [16:39:47<11:19:11, 25.02s/it] 62%|██████▏ | 2658/4286 [16:40:12<11:25:20, 25.26s/it] {'loss': 0.0209, 'grad_norm': 18.84434745395372, 'learning_rate': 3.7984134391040593e-07, 'completion_length': 230.76787567138672, 'rewards/only_full_func_accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.473214328289032, 'reward_std': 0.15706335101276636, 'kl': 0.5234375, 'epoch': 0.62} 62%|██████▏ | 2658/4286 [16:40:12<11:25:20, 25.26s/it] 62%|██████▏ | 2659/4286 [16:40:37<11:17:51, 25.00s/it] {'loss': 0.0271, 'grad_norm': 2.5852696854091652, 'learning_rate': 3.796080261315912e-07, 'completion_length': 263.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6770833432674408, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.06685744412243366, 'kl': 0.6748046875, 'epoch': 0.62} 62%|██████▏ | 2659/4286 [16:40:37<11:17:51, 25.00s/it] 62%|██████▏ | 2660/4286 [16:41:01<11:13:17, 24.84s/it] 
{'loss': 0.0078, 'grad_norm': 3.1224054666277277, 'learning_rate': 3.793747083527765e-07, 'completion_length': 270.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.5431547909975052, 'rewards/format_reward': 1.0, 'reward': 1.5431548357009888, 'reward_std': 0.09945633634924889, 'kl': 0.1962890625, 'epoch': 0.62} 62%|██████▏ | 2660/4286 [16:41:01<11:13:17, 24.84s/it] 62%|██████▏ | 2661/4286 [16:41:26<11:15:14, 24.93s/it] {'loss': 0.0154, 'grad_norm': 2.441051615920666, 'learning_rate': 3.791413905739617e-07, 'completion_length': 271.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.4779762178659439, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4601191878318787, 'reward_std': 0.08874643314629793, 'kl': 0.38671875, 'epoch': 0.62} 62%|██████▏ | 2661/4286 [16:41:26<11:15:14, 24.93s/it] 62%|██████▏ | 2662/4286 [16:41:50<11:07:11, 24.65s/it] {'loss': 0.0272, 'grad_norm': 1.4444135664935327, 'learning_rate': 3.78908072795147e-07, 'completion_length': 271.0178756713867, 'rewards/only_full_func_accuracy_reward': 0.7023809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.68452388048172, 'reward_std': 0.12413783743977547, 'kl': 0.6796875, 'epoch': 0.62} 62%|██████▏ | 2662/4286 [16:41:50<11:07:11, 24.65s/it] 62%|██████▏ | 2663/4286 [16:42:17<11:21:04, 25.18s/it] {'loss': 0.0292, 'grad_norm': 0.5002429816474626, 'learning_rate': 3.786747550163322e-07, 'completion_length': 313.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5705357491970062, 'rewards/format_reward': 1.0, 'reward': 1.5705358386039734, 'reward_std': 0.06538487412035465, 'kl': 0.72998046875, 'epoch': 0.62} 62%|██████▏ | 2663/4286 [16:42:17<11:21:04, 25.18s/it][2025-03-02 21:49:59,847] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 62%|██████▏ | 2664/4286 [16:42:44<11:36:42, 25.77s/it] {'loss': 0.0147, 'grad_norm': 4.674694310937824, 'learning_rate': 3.784414372375175e-07, 'completion_length': 279.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.639881044626236, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6220239400863647, 'reward_std': 0.09447120130062103, 'kl': 0.365234375, 'epoch': 0.62} 62%|██████▏ | 2664/4286 [16:42:44<11:36:42, 25.77s/it] 62%|██████▏ | 2665/4286 [16:43:12<11:50:40, 26.31s/it] {'loss': 0.0325, 'grad_norm': 4.073205500522329, 'learning_rate': 3.7820811945870275e-07, 'completion_length': 258.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5004960745573044, 'rewards/format_reward': 1.0, 'reward': 1.5004961490631104, 'reward_std': 0.12039069458842278, 'kl': 0.81494140625, 'epoch': 0.62} 62%|██████▏ | 2665/4286 [16:43:12<11:50:40, 26.31s/it] 62%|██████▏ | 2666/4286 [16:43:37<11:45:59, 26.15s/it] {'loss': 0.0103, 'grad_norm': 1.8179610331220188, 'learning_rate': 3.77974801679888e-07, 'completion_length': 272.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6696428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.07100120931863785, 'kl': 0.2568359375, 'epoch': 0.62} 62%|██████▏ | 2666/4286 [16:43:37<11:45:59, 26.15s/it] 62%|██████▏ | 2667/4286 [16:44:02<11:32:26, 25.66s/it] {'loss': 0.0099, 'grad_norm': 2.689869764403202, 'learning_rate': 3.7774148390107325e-07, 'completion_length': 261.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7693453431129456, 'rewards/format_reward': 1.0, 'reward': 1.7693454027175903, 'reward_std': 0.058389291167259216, 'kl': 0.248046875, 'epoch': 0.62} 62%|██████▏ | 2667/4286 [16:44:02<11:32:26, 25.66s/it] 62%|██████▏ | 2668/4286 [16:44:27<11:24:10, 25.37s/it] {'loss': 
0.0444, 'grad_norm': 3.373654969102417, 'learning_rate': 3.775081661222585e-07, 'completion_length': 301.6250228881836, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068453431129456, 'reward_std': 0.13006461039185524, 'kl': 1.10498046875, 'epoch': 0.62} 62%|██████▏ | 2668/4286 [16:44:27<11:24:10, 25.37s/it][2025-03-02 21:52:06,249] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 62%|██████▏ | 2669/4286 [16:44:50<11:11:27, 24.92s/it] {'loss': 0.0142, 'grad_norm': 2.3237738556310172, 'learning_rate': 3.7727484834344375e-07, 'completion_length': 253.75, 'rewards/only_full_func_accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5267857313156128, 'reward_std': 0.102507000323385, 'kl': 0.35546875, 'epoch': 0.62} 62%|██████▏ | 2669/4286 [16:44:50<11:11:27, 24.92s/it] 62%|██████▏ | 2670/4286 [16:45:14<11:01:58, 24.58s/it] {'loss': 0.0067, 'grad_norm': 1.7949669954847183, 'learning_rate': 3.77041530564629e-07, 'completion_length': 259.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5952381193637848, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.025651196017861366, 'kl': 0.16845703125, 'epoch': 0.62} 62%|██████▏ | 2670/4286 [16:45:14<11:01:58, 24.58s/it] 62%|██████▏ | 2671/4286 [16:45:38<10:59:07, 24.49s/it] {'loss': 0.0673, 'grad_norm': 2.4031588657768865, 'learning_rate': 3.7680821278581425e-07, 'completion_length': 264.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.6235119104385376, 'rewards/format_reward': 0.9642857313156128, 
'reward': 1.5877978205680847, 'reward_std': 0.18812313303351402, 'kl': 1.6796875, 'epoch': 0.62} 62%|██████▏ | 2671/4286 [16:45:38<10:59:07, 24.49s/it] 62%|██████▏ | 2672/4286 [16:46:04<11:04:55, 24.72s/it] {'loss': 0.0983, 'grad_norm': 10.51623928895558, 'learning_rate': 3.765748950069995e-07, 'completion_length': 260.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5788691192865372, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5252978205680847, 'reward_std': 0.21253982931375504, 'kl': 2.4609375, 'epoch': 0.62} 62%|██████▏ | 2672/4286 [16:46:04<11:04:55, 24.72s/it] 62%|██████▏ | 2673/4286 [16:46:28<11:00:19, 24.56s/it] {'loss': 0.1056, 'grad_norm': 3.267308142203359, 'learning_rate': 3.763415772281848e-07, 'completion_length': 254.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6622024774551392, 'reward_std': 0.21262606233358383, 'kl': 2.63671875, 'epoch': 0.62} 62%|██████▏ | 2673/4286 [16:46:28<11:00:19, 24.56s/it] 62%|██████▏ | 2674/4286 [16:46:51<10:51:59, 24.27s/it] {'loss': 0.0153, 'grad_norm': 3.341255565995585, 'learning_rate': 3.7610825944937e-07, 'completion_length': 274.0178756713867, 'rewards/only_full_func_accuracy_reward': 0.5372024476528168, 'rewards/format_reward': 1.0, 'reward': 1.5372024774551392, 'reward_std': 0.032247669994831085, 'kl': 0.3818359375, 'epoch': 0.62} 62%|██████▏ | 2674/4286 [16:46:51<10:51:59, 24.27s/it] 62%|██████▏ | 2675/4286 [16:47:16<10:55:25, 24.41s/it] {'loss': 0.0715, 'grad_norm': 5.214874842428174, 'learning_rate': 3.758749416705553e-07, 'completion_length': 261.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6764881014823914, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6586309671401978, 'reward_std': 0.1769280657172203, 'kl': 1.783203125, 'epoch': 0.62} 62%|██████▏ | 2675/4286 [16:47:16<10:55:25, 24.41s/it] 62%|██████▏ | 2676/4286 [16:47:42<11:04:00, 24.75s/it] {'loss': 0.0861, 'grad_norm': 
6.003961414419792, 'learning_rate': 3.756416238917405e-07, 'completion_length': 291.14288330078125, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.592262089252472, 'reward_std': 0.1622791513800621, 'kl': 2.1484375, 'epoch': 0.62} 62%|██████▏ | 2676/4286 [16:47:42<11:04:00, 24.75s/it] 62%|██████▏ | 2677/4286 [16:48:09<11:23:56, 25.50s/it] {'loss': 0.0824, 'grad_norm': 3.447452910211554, 'learning_rate': 3.754083061129258e-07, 'completion_length': 272.6071472167969, 'rewards/only_full_func_accuracy_reward': 0.6383929550647736, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6026787161827087, 'reward_std': 0.2677866816520691, 'kl': 2.0546875, 'epoch': 0.62} 62%|██████▏ | 2677/4286 [16:48:09<11:23:56, 25.50s/it] 62%|██████▏ | 2678/4286 [16:48:35<11:24:15, 25.53s/it] {'loss': 0.0105, 'grad_norm': 2.3935823789814443, 'learning_rate': 3.7517498833411107e-07, 'completion_length': 306.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5982144474983215, 'reward_std': 0.07419107854366302, 'kl': 0.26318359375, 'epoch': 0.62} 62%|██████▏ | 2678/4286 [16:48:35<11:24:15, 25.53s/it][2025-03-02 21:56:16,051] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 63%|██████▎ | 2679/4286 [16:49:00<11:24:01, 25.54s/it] {'loss': 0.089, 'grad_norm': 7.12591944493848, 'learning_rate': 3.749416705552963e-07, 'completion_length': 263.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5007440894842148, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4650299549102783, 'reward_std': 0.17170890793204308, 'kl': 2.2265625, 'epoch': 0.63} 63%|██████▎ | 2679/4286 [16:49:00<11:24:01, 25.54s/it] 63%|██████▎ | 2680/4286 [16:49:25<11:17:40, 25.32s/it] {'loss': 0.1129, 'grad_norm': 3.3119700699070767, 'learning_rate': 3.7470835277648156e-07, 'completion_length': 256.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5348640084266663, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.499149739742279, 'reward_std': 0.19580747932195663, 'kl': 2.828125, 'epoch': 0.63} 63%|██████▎ | 2680/4286 [16:49:25<11:17:40, 25.32s/it][2025-03-02 21:57:05,016] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 63%|██████▎ | 2681/4286 [16:49:49<11:07:59, 24.97s/it] {'loss': 0.0447, 'grad_norm': 3.224588523443276, 'learning_rate': 3.744750349976668e-07, 'completion_length': 273.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7351190745830536, 'rewards/format_reward': 1.0, 'reward': 1.7351192235946655, 'reward_std': 0.07854853197932243, 'kl': 1.11328125, 'epoch': 0.63} 63%|██████▎ | 2681/4286 [16:49:49<11:07:59, 24.97s/it] 63%|██████▎ | 2682/4286 [16:50:13<11:02:04, 24.77s/it] {'loss': 0.0674, 'grad_norm': 10.458565015501037, 'learning_rate': 3.7424171721885206e-07, 'completion_length': 269.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.5684524327516556, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5505953431129456, 'reward_std': 0.0892857201397419, 'kl': 1.6875, 'epoch': 0.63} 63%|██████▎ | 2682/4286 [16:50:13<11:02:04, 24.77s/it][2025-03-02 21:57:52,022] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 63%|██████▎ | 2683/4286 [16:50:36<10:45:15, 24.15s/it] {'loss': 0.0367, 'grad_norm': 6.693022033076482, 'learning_rate': 3.7400839944003734e-07, 'completion_length': 215.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.10569422878324986, 'kl': 0.916015625, 'epoch': 0.63} 63%|██████▎ | 2683/4286 [16:50:36<10:45:15, 24.15s/it] 63%|██████▎ | 2684/4286 [16:51:01<10:48:07, 24.27s/it] {'loss': 0.0377, 'grad_norm': 5.852096768948789, 'learning_rate': 3.7377508166122256e-07, 'completion_length': 252.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.7366072237491608, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.09661934711039066, 'kl': 0.9453125, 'epoch': 0.63} 63%|██████▎ | 2684/4286 [16:51:01<10:48:07, 24.27s/it] 63%|██████▎ | 2685/4286 [16:51:26<10:54:25, 24.53s/it] {'loss': 0.0698, 'grad_norm': 8.313192329636818, 'learning_rate': 3.7354176388240783e-07, 'completion_length': 296.8571472167969, 'rewards/only_full_func_accuracy_reward': 0.491071492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.473214328289032, 'reward_std': 0.1705967541784048, 'kl': 1.74609375, 'epoch': 0.63} 63%|██████▎ | 2685/4286 [16:51:26<10:54:25, 24.53s/it][2025-03-02 21:59:07,287] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
63%|██████▎ | 2686/4286 [16:51:51<11:02:33, 24.85s/it] {'loss': 0.0248, 'grad_norm': 2.8937440566821717, 'learning_rate': 3.7330844610359306e-07, 'completion_length': 252.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6348214745521545, 'rewards/format_reward': 1.0, 'reward': 1.6348214745521545, 'reward_std': 0.06185945123434067, 'kl': 0.6201171875, 'epoch': 0.63}
63%|██████▎ | 2687/4286 [16:52:16<11:02:38, 24.86s/it] {'loss': 0.03, 'grad_norm': 8.374655037301865, 'learning_rate': 3.7307512832477833e-07, 'completion_length': 271.8035888671875, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592262983322144, 'reward_std': 0.07975583150982857, 'kl': 0.75, 'epoch': 0.63}
[2025-03-02 21:59:57,371] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
63%|██████▎ | 2688/4286 [16:52:41<11:04:43, 24.96s/it] {'loss': 0.0424, 'grad_norm': 3.4250196606132435, 'learning_rate': 3.728418105459636e-07, 'completion_length': 255.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6220238208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6041667461395264, 'reward_std': 0.07854852639138699, 'kl': 1.06201171875, 'epoch': 0.63}
63%|██████▎ | 2689/4286 [16:53:07<11:11:21, 25.22s/it] {'loss': 0.0256, 'grad_norm': 2.446822492985621, 'learning_rate': 3.7260849276714883e-07, 'completion_length': 265.6964416503906, 'rewards/only_full_func_accuracy_reward': 0.7782738208770752, 'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.0744047611951828, 'kl': 0.642578125, 'epoch': 0.63}
63%|██████▎ | 2690/4286 [16:53:32<11:09:19, 25.16s/it] {'loss': 0.0462, 'grad_norm': 9.833995343577394, 'learning_rate': 3.723751749883341e-07, 'completion_length': 293.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.1308993138372898, 'kl': 1.16015625, 'epoch': 0.63}
63%|██████▎ | 2691/4286 [16:53:58<11:12:03, 25.28s/it] {'loss': 0.0323, 'grad_norm': 4.737364975912904, 'learning_rate': 3.721418572095193e-07, 'completion_length': 341.7321472167969, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.1099053667858243, 'kl': 0.80859375, 'epoch': 0.63}
63%|██████▎ | 2692/4286 [16:54:24<11:15:03, 25.41s/it] {'loss': 0.0392, 'grad_norm': 2.3199975519719325, 'learning_rate': 3.719085394307046e-07, 'completion_length': 259.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7217262089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7038691639900208, 'reward_std': 0.08508739341050386, 'kl': 0.97509765625, 'epoch': 0.63}
63%|██████▎ | 2693/4286 [16:54:48<11:04:31, 25.03s/it] {'loss': 0.0382, 'grad_norm': 6.621100691470975, 'learning_rate': 3.716752216518899e-07, 'completion_length': 256.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.6213010847568512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.60344398021698, 'reward_std': 0.09391829557716846, 'kl': 0.9521484375, 'epoch': 0.63}
63%|██████▎ | 2694/4286 [16:55:13<11:07:25, 25.15s/it] {'loss': 0.0076, 'grad_norm': 2.6146720933808973, 'learning_rate': 3.714419038730751e-07, 'completion_length': 267.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6086310148239136, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.06588353775441647, 'kl': 0.1884765625, 'epoch': 0.63}
63%|██████▎ | 2695/4286 [16:55:40<11:22:00, 25.72s/it] {'loss': 0.0346, 'grad_norm': 8.094400770251791, 'learning_rate': 3.7120858609426037e-07, 'completion_length': 281.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6544643044471741, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.63660728931427, 'reward_std': 0.1586776115000248, 'kl': 0.865234375, 'epoch': 0.63}
63%|██████▎ | 2696/4286 [16:56:05<11:10:35, 25.31s/it] {'loss': 0.0177, 'grad_norm': 1.1846612877488114, 'learning_rate': 3.709752683154456e-07, 'completion_length': 262.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.03495405800640583, 'kl': 0.44140625, 'epoch': 0.63}
63%|██████▎ | 2697/4286 [16:56:30<11:07:23, 25.20s/it] {'loss': 0.0077, 'grad_norm': 7.646206898158005, 'learning_rate': 3.7074195053663087e-07, 'completion_length': 237.58928680419922, 'rewards/only_full_func_accuracy_reward': 0.6418651342391968, 'rewards/format_reward': 1.0, 'reward': 1.6418651938438416, 'reward_std': 0.07341270335018635, 'kl': 0.19189453125, 'epoch': 0.63}
63%|██████▎ | 2698/4286 [16:56:54<11:04:37, 25.11s/it] {'loss': 0.0316, 'grad_norm': 4.345114237333805, 'learning_rate': 3.7050863275781615e-07, 'completion_length': 287.5893020629883, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.6770833730697632, 'reward_std': 0.06845238525420427, 'kl': 0.79296875, 'epoch': 0.63}
63%|██████▎ | 2699/4286 [16:57:19<11:03:24, 25.08s/it] {'loss': 0.0389, 'grad_norm': 4.267218296000316, 'learning_rate': 3.7027531497900137e-07, 'completion_length': 300.89288330078125, 'rewards/only_full_func_accuracy_reward': 0.508035734295845, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.490178644657135, 'reward_std': 0.10481787472963333, 'kl': 0.97216796875, 'epoch': 0.63}
63%|██████▎ | 2700/4286 [16:57:45<11:04:53, 25.15s/it] {'loss': 0.0382, 'grad_norm': 2.7492571403584973, 'learning_rate': 3.7004199720018664e-07, 'completion_length': 307.0714416503906, 'rewards/only_full_func_accuracy_reward': 0.572916716337204, 'rewards/format_reward': 1.0, 'reward': 1.5729168057441711, 'reward_std': 0.1263812556862831, 'kl': 0.953125, 'epoch': 0.63}
[2025-03-02 22:08:30,976] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
63%|██████▎ | 2701/4286 [17:01:15<35:31:53, 80.70s/it] {'loss': 0.0054, 'grad_norm': 4.193984356130755, 'learning_rate': 3.698086794213719e-07, 'completion_length': 356.98216247558594, 'rewards/only_full_func_accuracy_reward': 0.5610119104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5431548357009888, 'reward_std': 0.06845238525420427, 'kl': 0.1357421875, 'epoch': 0.63}
63%|██████▎ | 2702/4286 [17:01:43<28:32:26, 64.87s/it] {'loss': 0.0586, 'grad_norm': 9.408289083997762, 'learning_rate': 3.6957536164255714e-07, 'completion_length': 288.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.5565476566553116, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5386906266212463, 'reward_std': 0.13096587732434273, 'kl': 1.462890625, 'epoch': 0.63}
[2025-03-02 22:09:24,096] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
63%|██████▎ | 2703/4286 [17:02:08<23:17:28, 52.97s/it] {'loss': 0.0074, 'grad_norm': 5.149380777275965, 'learning_rate': 3.693420438637424e-07, 'completion_length': 250.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7217262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.11027831584215164, 'kl': 0.1845703125, 'epoch': 0.63}
63%|██████▎ | 2704/4286 [17:02:34<19:44:33, 44.93s/it] {'loss': 0.0237, 'grad_norm': 3.3203365216131107, 'learning_rate': 3.6910872608492764e-07, 'completion_length': 276.6785888671875, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.04602411761879921, 'kl': 0.591796875, 'epoch': 0.63}
63%|██████▎ | 2705/4286 [17:03:00<17:07:30, 38.99s/it] {'loss': 0.0384, 'grad_norm': 11.265332848474541, 'learning_rate': 3.688754083061129e-07, 'completion_length': 275.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.555059552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5372024774551392, 'reward_std': 0.19116458296775818, 'kl': 0.9609375, 'epoch': 0.63}
63%|██████▎ | 2706/4286 [17:03:24<15:14:11, 34.72s/it] {'loss': 0.0239, 'grad_norm': 6.903219592999813, 'learning_rate': 3.686420905272982e-07, 'completion_length': 281.9285888671875, 'rewards/only_full_func_accuracy_reward': 0.52827388048172, 'rewards/format_reward': 1.0, 'reward': 1.52827388048172, 'reward_std': 0.1220238134264946, 'kl': 0.595703125, 'epoch': 0.63}
[2025-03-02 22:11:06,119] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
63%|██████▎ | 2707/4286 [17:03:50<14:04:35, 32.09s/it] {'loss': 0.0533, 'grad_norm': 8.034880494786098, 'learning_rate': 3.684087727484834e-07, 'completion_length': 293.9821472167969, 'rewards/only_full_func_accuracy_reward': 0.560119092464447, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.506547749042511, 'reward_std': 0.17801427468657494, 'kl': 1.33203125, 'epoch': 0.63}
63%|██████▎ | 2708/4286 [17:04:15<13:06:40, 29.91s/it] {'loss': 0.0305, 'grad_norm': 4.159990997867697, 'learning_rate': 3.681754549696687e-07, 'completion_length': 256.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 1.0, 'reward': 1.6443453431129456, 'reward_std': 0.09914423525333405, 'kl': 0.763671875, 'epoch': 0.63}
63%|██████▎ | 2709/4286 [17:04:40<12:30:35, 28.56s/it] {'loss': 0.0306, 'grad_norm': 1.9362594684693872, 'learning_rate': 3.679421371908539e-07, 'completion_length': 304.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.5818452537059784, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.563988208770752, 'reward_std': 0.08630953170359135, 'kl': 0.7685546875, 'epoch': 0.63}
63%|██████▎ | 2710/4286 [17:05:07<12:17:06, 28.06s/it] {'loss': 0.0609, 'grad_norm': 12.041809872693106, 'learning_rate': 3.677088194120392e-07, 'completion_length': 286.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5089285969734192, 'reward_std': 0.22772061079740524, 'kl': 1.5234375, 'epoch': 0.63}
63%|██████▎ | 2711/4286 [17:05:33<11:58:11, 27.36s/it] {'loss': 0.0217, 'grad_norm': 6.381507767245402, 'learning_rate': 3.6747550163322446e-07, 'completion_length': 284.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.7291667461395264, 'rewards/format_reward': 1.0, 'reward': 1.7291667461395264, 'reward_std': 0.09986777603626251, 'kl': 0.54150390625, 'epoch': 0.63}
63%|██████▎ | 2712/4286 [17:05:59<11:49:26, 27.04s/it] {'loss': 0.0229, 'grad_norm': 1.995212122792243, 'learning_rate': 3.672421838544097e-07, 'completion_length': 345.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.5312500596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5133929252624512, 'reward_std': 0.12406324595212936, 'kl': 0.5751953125, 'epoch': 0.63}
63%|██████▎ | 2713/4286 [17:06:25<11:40:33, 26.72s/it] {'loss': 0.0368, 'grad_norm': 19.905150835854048, 'learning_rate': 3.6700886607559496e-07, 'completion_length': 265.7857360839844, 'rewards/only_full_func_accuracy_reward': 0.49851198494434357, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4627977013587952, 'reward_std': 0.13240529596805573, 'kl': 0.916015625, 'epoch': 0.63}
63%|██████▎ | 2714/4286 [17:06:52<11:40:58, 26.75s/it] {'loss': 0.0563, 'grad_norm': 7.553877145445055, 'learning_rate': 3.667755482967802e-07, 'completion_length': 297.3393096923828, 'rewards/only_full_func_accuracy_reward': 0.51488097012043, 'rewards/format_reward': 1.0, 'reward': 1.5148810744285583, 'reward_std': 0.14032969623804092, 'kl': 1.41015625, 'epoch': 0.63}
63%|██████▎ | 2715/4286 [17:07:20<11:49:00, 27.08s/it] {'loss': 0.0551, 'grad_norm': 14.428274949045772, 'learning_rate': 3.6654223051796545e-07, 'completion_length': 272.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5568452626466751, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5211310386657715, 'reward_std': 0.2461429089307785, 'kl': 1.37890625, 'epoch': 0.63}
63%|██████▎ | 2716/4286 [17:07:49<12:00:26, 27.53s/it] {'loss': 0.0646, 'grad_norm': 28.13786118719233, 'learning_rate': 3.6630891273915073e-07, 'completion_length': 342.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5068452656269073, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4354167580604553, 'reward_std': 0.28376366198062897, 'kl': 1.61328125, 'epoch': 0.63}
63%|██████▎ | 2717/4286 [17:08:16<11:59:34, 27.52s/it] {'loss': 0.054, 'grad_norm': 7.0443620144716315, 'learning_rate': 3.6607559496033595e-07, 'completion_length': 278.2857360839844, 'rewards/only_full_func_accuracy_reward': 0.6227679252624512, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5691965818405151, 'reward_std': 0.22674181312322617, 'kl': 1.349609375, 'epoch': 0.63}
63%|██████▎ | 2718/4286 [17:08:42<11:49:58, 27.17s/it] {'loss': 0.0269, 'grad_norm': 4.206164703696229, 'learning_rate': 3.658422771815212e-07, 'completion_length': 265.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.7247024476528168, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.688988208770752, 'reward_std': 0.147675022482872, 'kl': 0.6748046875, 'epoch': 0.63}
63%|██████▎ | 2719/4286 [17:09:08<11:40:32, 26.82s/it] {'loss': 0.0302, 'grad_norm': 12.769281293379205, 'learning_rate': 3.6560895940270645e-07, 'completion_length': 302.5893096923828, 'rewards/only_full_func_accuracy_reward': 0.6097470819950104, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5740328431129456, 'reward_std': 0.14874504134058952, 'kl': 0.7587890625, 'epoch': 0.63}
63%|██████▎ | 2720/4286 [17:09:35<11:39:28, 26.80s/it] {'loss': 0.0561, 'grad_norm': 4.030441260543305, 'learning_rate': 3.653756416238917e-07, 'completion_length': 276.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5922619700431824, 'reward_std': 0.0892857201397419, 'kl': 1.40576171875, 'epoch': 0.63}
63%|██████▎ | 2721/4286 [17:10:01<11:34:11, 26.61s/it] {'loss': 0.1028, 'grad_norm': 4.978198883961804, 'learning_rate': 3.65142323845077e-07, 'completion_length': 294.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.7143282890319824, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6786140203475952, 'reward_std': 0.1534740999341011, 'kl': 2.5703125, 'epoch': 0.63}
64%|██████▎ | 2722/4286 [17:10:28<11:31:21, 26.52s/it] {'loss': 0.0931, 'grad_norm': 5.718999848146938, 'learning_rate': 3.649090060662622e-07, 'completion_length': 284.4821472167969, 'rewards/only_full_func_accuracy_reward': 0.625, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5714287161827087, 'reward_std': 0.22521868720650673, 'kl': 2.33203125, 'epoch': 0.64}
64%|██████▎ | 2723/4286 [17:10:55<11:37:31, 26.78s/it] {'loss': 0.0725, 'grad_norm': 5.440304887670313, 'learning_rate': 3.646756882874475e-07, 'completion_length': 287.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.6897321939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.654017984867096, 'reward_std': 0.1238575242459774, 'kl': 1.8125, 'epoch': 0.64}
64%|██████▎ | 2724/4286 [17:11:23<11:45:28, 27.10s/it] {'loss': 0.0993, 'grad_norm': 17.296341624593573, 'learning_rate': 3.6444237050863277e-07, 'completion_length': 307.60716247558594, 'rewards/only_full_func_accuracy_reward': 0.5297619551420212, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4761906266212463, 'reward_std': 0.24512824416160583, 'kl': 2.484375, 'epoch': 0.64}
[2025-03-02 22:19:08,504] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
64%|██████▎ | 2725/4286 [17:11:53<12:05:17, 27.88s/it] {'loss': 0.0714, 'grad_norm': 12.519033830318483, 'learning_rate': 3.64209052729818e-07, 'completion_length': 360.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.1551724411547184, 'kl': 1.78515625, 'epoch': 0.64}
[2025-03-02 22:19:37,203] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
64%|██████▎ | 2726/4286 [17:12:21<12:11:13, 28.12s/it] {'loss': 0.0666, 'grad_norm': 11.31028403930007, 'learning_rate': 3.6397573495100327e-07, 'completion_length': 423.0357360839844, 'rewards/only_full_func_accuracy_reward': 0.5678571909666061, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.496428668498993, 'reward_std': 0.20103291422128677, 'kl': 1.66015625, 'epoch': 0.64}
[2025-03-02 22:20:06,702] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
64%|██████▎ | 2727/4286 [17:12:51<12:21:29, 28.54s/it] {'loss': 0.0497, 'grad_norm': 0.8097403177676707, 'learning_rate': 3.637424171721885e-07, 'completion_length': 362.4464416503906, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6488096714019775, 'reward_std': 0.10714286658912897, 'kl': 1.23876953125, 'epoch': 0.64}
[2025-03-02 22:20:36,229] [WARNING] [stage3.py:2134:step] 3 pytorch allocator cache flushes since last step.
64%|██████▎ | 2728/4286 [17:13:20<12:28:42, 28.83s/it] {'loss': 0.0768, 'grad_norm': 5.400213123468968, 'learning_rate': 3.6350909939337377e-07, 'completion_length': 349.3035888671875, 'rewards/only_full_func_accuracy_reward': 0.5169643312692642, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4812501072883606, 'reward_std': 0.21071644872426987, 'kl': 1.921875, 'epoch': 0.64}
[2025-03-02 22:21:03,753] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
64%|██████▎ | 2729/4286 [17:13:48<12:18:01, 28.44s/it] {'loss': 0.0477, 'grad_norm': 2.9958128672254487, 'learning_rate': 3.6327578161455904e-07, 'completion_length': 281.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.6220237910747528, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5684524178504944, 'reward_std': 0.2102186605334282, 'kl': 1.1953125, 'epoch': 0.64}
[2025-03-02 22:21:31,084] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
64%|██████▎ | 2730/4286 [17:14:15<12:08:55, 28.11s/it] {'loss': 0.0422, 'grad_norm': 10.608739985098069, 'learning_rate': 3.6304246383574426e-07, 'completion_length': 380.5714416503906, 'rewards/only_full_func_accuracy_reward': 0.6547619104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6190477013587952, 'reward_std': 0.15627354383468628, 'kl': 1.056640625, 'epoch': 0.64}
[2025-03-02 22:22:00,933] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
64%|██████▎ | 2731/4286 [17:14:45<12:22:00, 28.63s/it] {'loss': 0.0491, 'grad_norm': 3.8289719617194256, 'learning_rate': 3.6280914605692954e-07, 'completion_length': 379.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.54464291036129, 'rewards/format_reward': 0.892857164144516, 'reward': 1.4375001192092896, 'reward_std': 0.1642170064151287, 'kl': 1.23046875, 'epoch': 0.64}
[2025-03-02 22:22:27,798] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
64%|██████▎ | 2732/4286 [17:15:12<12:07:48, 28.10s/it] {'loss': 0.09, 'grad_norm': 3.6064129036404973, 'learning_rate': 3.6257582827811476e-07, 'completion_length': 255.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5595238506793976, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5238096714019775, 'reward_std': 0.13979335129261017, 'kl': 2.25, 'epoch': 0.64}
[2025-03-02 22:22:56,392] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
64%|██████▍ | 2733/4286 [17:15:41<12:11:09, 28.25s/it] {'loss': 0.0463, 'grad_norm': 2.571088564350233, 'learning_rate': 3.6234251049930004e-07, 'completion_length': 339.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.5955357551574707, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5598215460777283, 'reward_std': 0.16523220390081406, 'kl': 1.16015625, 'epoch': 0.64}
[2025-03-02 22:23:24,690] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
64%|██████▍ | 2734/4286 [17:16:09<12:11:04, 28.26s/it] {'loss': 0.0417, 'grad_norm': 1.0062162345541605, 'learning_rate': 3.621091927204853e-07, 'completion_length': 395.51788330078125, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6086310744285583, 'reward_std': 0.12540945783257484, 'kl': 1.04443359375, 'epoch': 0.64}
64%|██████▍ | 2735/4286 [17:16:36<12:03:01, 27.97s/it] {'loss': 0.0278, 'grad_norm': 1.2068283536979802, 'learning_rate': 3.6187587494167053e-07, 'completion_length': 296.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.6815477013587952, 'rewards/format_reward': 1.0, 'reward': 1.6815477013587952, 'reward_std': 0.04602411016821861, 'kl': 0.69482421875, 'epoch': 0.64}
[2025-03-02 22:24:20,941] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
64%|██████▍ | 2736/4286 [17:17:05<12:10:16, 28.27s/it] {'loss': 0.0068, 'grad_norm': 1.6106694392347332, 'learning_rate': 3.616425571628558e-07, 'completion_length': 346.1785888671875, 'rewards/only_full_func_accuracy_reward': 0.6979167461395264, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6800596714019775, 'reward_std': 0.07441825047135353, 'kl': 0.1708984375, 'epoch': 0.64}
[2025-03-02 22:24:50,681] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
64%|██████▍ | 2737/4286 [17:17:35<12:21:11, 28.71s/it] {'loss': 0.0436, 'grad_norm': 2.232778329227768, 'learning_rate': 3.6140923938404103e-07, 'completion_length': 337.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6949405670166016, 'reward_std': 0.11947546526789665, 'kl': 1.0869140625, 'epoch': 0.64}
64%|██████▍ | 2738/4286 [17:18:03<12:20:34, 28.70s/it] {'loss': 0.0337, 'grad_norm': 1.8733856858575497, 'learning_rate': 3.611759216052263e-07, 'completion_length': 356.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.5833333730697632, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5119048953056335, 'reward_std': 0.1353098303079605, 'kl': 0.841796875, 'epoch': 0.64}
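The stage3.py warnings above recommend synchronized `get_accelerator().empty_cache()` calls so that all ranks flush at the same time. A minimal sketch of what that might look like in the training loop; the flush interval, `should_flush` helper, and `train_step` shape are assumptions for illustration, not taken from this run:

```python
# Sketch: flush the allocator cache on every rank at the same step, per the
# stage3.py warning. FLUSH_INTERVAL is an arbitrary assumption; tune it to
# how often the cache-flush warning actually fires.
FLUSH_INTERVAL = 50

def should_flush(step: int, interval: int = FLUSH_INTERVAL) -> bool:
    """True on steps where all ranks should empty their caches together."""
    return step > 0 and step % interval == 0

def train_step(step, batch, model_engine):
    # `model_engine` stands in for a deepspeed.initialize() engine (assumed).
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()
    if should_flush(step):
        # Every rank hits this branch at the same step, so the flushes are
        # synchronized across the world instead of staggered per rank.
        from deepspeed.accelerator import get_accelerator
        get_accelerator().empty_cache()
```

Because the flush is keyed on the step counter rather than local memory pressure, no rank stalls waiting on another rank's unilateral cache flush.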
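Each step above embeds its metrics as a Python dict literal inside the tqdm progress line. A small stdlib-only sketch (regex and function names are my own) for pulling those dicts back out of a captured log, e.g. to plot reward or KL over steps:

```python
import ast
import re

# Matches the per-step metrics dict embedded in each tqdm progress line,
# from the leading {'loss' key through the trailing 'epoch' value.
METRICS_RE = re.compile(r"\{'loss'.*?'epoch': [0-9.]+\}")

def parse_metrics(log_text: str) -> list[dict]:
    """Extract every per-step metrics dict literal from a training-log string."""
    return [ast.literal_eval(m.group(0)) for m in METRICS_RE.finditer(log_text)]

sample = ("63%|######3 | 2686/4286 [16:51:51<11:02:33, 24.85s/it] "
          "{'loss': 0.0248, 'grad_norm': 2.8937440566821717, "
          "'kl': 0.6201171875, 'epoch': 0.63}")
records = parse_metrics(sample)
```

`ast.literal_eval` is used instead of `eval` so arbitrary code in a log line cannot execute; lines whose dict was split across a capture boundary simply will not match.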
64%|██████▍ | 2739/4286 [17:18:31<12:12:23, 28.41s/it] {'loss': 0.0058, 'grad_norm': 1.4299239712054845, 'learning_rate': 3.609426038264116e-07, 'completion_length': 310.8214416503906, 'rewards/only_full_func_accuracy_reward': 0.4883928745985031, 'rewards/format_reward': 1.0, 'reward': 1.4883930087089539, 'reward_std': 0.02066834270954132, 'kl': 0.14404296875, 'epoch': 0.64}
64%|██████▍ | 2740/4286 [17:18:59<12:05:10, 28.14s/it] {'loss': 0.0294, 'grad_norm': 0.9918505485583632, 'learning_rate': 3.607092860475968e-07, 'completion_length': 313.6607360839844, 'rewards/only_full_func_accuracy_reward': 0.5854167342185974, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5675595998764038, 'reward_std': 0.07413512282073498, 'kl': 0.736328125, 'epoch': 0.64}
64%|██████▍ | 2741/4286 [17:19:27<12:02:12, 28.05s/it] {'loss': 0.0248, 'grad_norm': 2.4547273592451613, 'learning_rate': 3.604759682687821e-07, 'completion_length': 274.0893096923828, 'rewards/only_full_func_accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.037555960938334465, 'kl': 0.61669921875, 'epoch': 0.64}
64%|██████▍ | 2742/4286 [17:19:54<11:54:01, 27.75s/it] {'loss': 0.0364, 'grad_norm': 1.8597669266297248, 'learning_rate': 3.602426504899673e-07, 'completion_length': 369.5535888671875, 'rewards/only_full_func_accuracy_reward': 0.7113096117973328, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.06547619588673115, 'kl': 0.91015625, 'epoch': 0.64}
64%|██████▍ | 2743/4286 [17:20:19<11:33:55, 26.98s/it] {'loss': 0.0294, 'grad_norm': 7.178271312345784, 'learning_rate': 3.600093327111526e-07, 'completion_length': 276.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5639881491661072, 'rewards/format_reward': 1.0, 'reward': 1.563988208770752, 'reward_std': 0.019238397479057312, 'kl': 0.73388671875, 'epoch': 0.64}
64%|██████▍ | 2744/4286 [17:20:45<11:24:15, 26.62s/it] {'loss': 0.0503, 'grad_norm': 37.93072426755742, 'learning_rate': 3.5977601493233785e-07, 'completion_length': 287.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.6187500357627869, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5830357670783997, 'reward_std': 0.18496133759617805, 'kl': 1.26171875, 'epoch': 0.64}
64%|██████▍ | 2745/4286 [17:21:12<11:33:36, 27.01s/it] {'loss': 0.0055, 'grad_norm': 2.451891784245181, 'learning_rate': 3.5954269715352307e-07, 'completion_length': 310.23216247558594, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.06618334911763668, 'kl': 0.1376953125, 'epoch': 0.64}
64%|██████▍ | 2746/4286 [17:21:41<11:41:25, 27.33s/it] {'loss': 0.0284, 'grad_norm': 1.592660008782567, 'learning_rate': 3.5930937937470835e-07, 'completion_length': 280.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6488095223903656, 'rewards/format_reward': 1.0, 'reward': 1.6488096714019775, 'reward_std': 0.06938603520393372, 'kl': 0.7109375, 'epoch': 0.64}
64%|██████▍ | 2747/4286 [17:22:06<11:26:33, 26.77s/it] {'loss': 0.0663, 'grad_norm': 1.5749041252283411, 'learning_rate': 3.590760615958936e-07, 'completion_length': 262.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6250000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.1277625085785985, 'kl': 1.654296875, 'epoch': 0.64}
64%|██████▍ | 2748/4286 [17:22:33<11:24:12, 26.69s/it] {'loss': 0.0207, 'grad_norm': 60.56115615620161, 'learning_rate': 3.5884274381707884e-07, 'completion_length': 291.73216247558594, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.05502715799957514, 'kl': 0.51708984375, 'epoch': 0.64}
64%|██████▍ | 2749/4286 [17:23:01<11:40:05, 27.33s/it] {'loss': 0.0269, 'grad_norm': 0.6252115113777056, 'learning_rate': 3.586094260382641e-07, 'completion_length': 336.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7008929252624512, 'rewards/format_reward': 1.0, 'reward': 1.700892984867096, 'reward_std': 0.03869047574698925, 'kl': 0.67138671875, 'epoch': 0.64}
64%|██████▍ | 2750/4286 [17:23:29<11:45:51, 27.57s/it] {'loss': 0.0475, 'grad_norm': 2.3771714292552075, 'learning_rate': 3.5837610825944934e-07, 'completion_length': 321.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6517857611179352, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.06990811601281166, 'kl': 1.1904296875, 'epoch': 0.64}
64%|██████▍ | 2751/4286 [17:23:59<12:00:34, 28.17s/it] {'loss': 0.0266, 'grad_norm': 2.0720293310309086, 'learning_rate': 3.581427904806346e-07, 'completion_length': 307.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.6800595819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.662202537059784, 'reward_std': 0.08999287709593773, 'kl': 0.666015625, 'epoch': 0.64}
[2025-03-02 22:31:38,240] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
64%|██████▍ | 2752/4286 [17:24:22<11:22:52, 26.71s/it] {'loss': 0.0247, 'grad_norm': 3.071914947961557, 'learning_rate': 3.579094727018199e-07, 'completion_length': 215.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.7514881789684296, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.044642859138548374, 'kl': 0.61572265625, 'epoch': 0.64}
64%|██████▍ | 2753/4286 [17:24:50<11:26:19, 26.86s/it] {'loss': 0.0324, 'grad_norm': 1.3015324975796014, 'learning_rate': 3.576761549230051e-07, 'completion_length': 230.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261906266212463, 'reward_std': 0.0682387026026845, 'kl': 0.80712890625, 'epoch': 0.64}
64%|██████▍ | 2754/4286 [17:25:18<11:40:31, 27.44s/it] {'loss': 0.0068, 'grad_norm': 1.1713026348555906, 'learning_rate': 3.574428371441904e-07, 'completion_length': 330.4285888671875, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6785715222358704, 'reward_std': 0.1032646931707859, 'kl': 0.17041015625, 'epoch': 0.64}
[2025-03-02 22:33:02,643] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
64%|██████▍ | 2755/4286 [17:25:47<11:47:32, 27.73s/it] {'loss': 0.0329, 'grad_norm': 6.329841587829239, 'learning_rate': 3.572095193653756e-07, 'completion_length': 295.5357360839844, 'rewards/only_full_func_accuracy_reward': 0.579464316368103, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5080358386039734, 'reward_std': 0.15531133860349655, 'kl': 0.8203125, 'epoch': 0.64}
64%|██████▍ | 2756/4286 [17:26:14<11:42:17, 27.54s/it] {'loss': 0.0243, 'grad_norm': 8.94667729296219, 'learning_rate': 3.569762015865609e-07, 'completion_length': 265.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6324406266212463, 'reward_std': 0.0803571492433548, 'kl': 0.607421875, 'epoch': 0.64}
64%|██████▍ | 2757/4286 [17:26:43<11:51:44, 27.93s/it] {'loss': 0.0065, 'grad_norm': 1.3397131037227759, 'learning_rate': 3.5674288380774616e-07, 'completion_length': 264.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6175596714019775, 'reward_std': 0.18859335780143738, 'kl': 0.16162109375, 'epoch': 0.64}
[2025-03-02 22:34:26,376] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 64%|██████▍ | 2758/4286 [17:27:10<11:50:13, 27.89s/it] {'loss': 0.0152, 'grad_norm': 3.4526698249536323, 'learning_rate': 3.565095660289314e-07, 'completion_length': 333.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.5982143133878708, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.4732143878936768, 'reward_std': 0.16015753149986267, 'kl': 0.3798828125, 'epoch': 0.64} 64%|██████▍ | 2758/4286 [17:27:10<11:50:13, 27.89s/it][2025-03-02 22:34:55,926] [WARNING] [stage3.py:2134:step] 3 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 64%|██████▍ | 2759/4286 [17:27:40<12:02:27, 28.39s/it] {'loss': 0.0176, 'grad_norm': 9.190435001156862, 'learning_rate': 3.5627624825011666e-07, 'completion_length': 326.9643096923828, 'rewards/only_full_func_accuracy_reward': 0.4404762238264084, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.2797619700431824, 'reward_std': 0.2132476195693016, 'kl': 0.4375, 'epoch': 0.64} 64%|██████▍ | 2759/4286 [17:27:40<12:02:27, 28.39s/it] 64%|██████▍ | 2760/4286 [17:28:09<12:04:58, 28.50s/it] {'loss': 0.0071, 'grad_norm': 1.9003334466366062, 'learning_rate': 3.560429304713019e-07, 'completion_length': 300.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.595238208770752, 'reward_std': 0.23868754506111145, 'kl': 0.17724609375, 'epoch': 0.64} 64%|██████▍ | 2760/4286 [17:28:09<12:04:58, 28.50s/it] 
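The stage3.py warnings above suggest a concrete remedy: call `get_accelerator().empty_cache()` inside the training loop so that all ranks flush their allocator caches at the same step. A minimal sketch of that suggestion follows; `FLUSH_EVERY`, `flush_cache`, and `train` are illustrative names (not part of any API), and the DeepSpeed import is guarded so the sketch also runs where DeepSpeed is not installed.

```python
try:
    # DeepSpeed's accelerator abstraction; empty_cache() maps to
    # torch.cuda.empty_cache() on CUDA devices.
    from deepspeed.accelerator import get_accelerator

    def flush_cache():
        get_accelerator().empty_cache()
except ImportError:
    def flush_cache():
        pass  # no-op fallback so the sketch stays self-contained

FLUSH_EVERY = 50  # hypothetical interval; tune to how often the warning fires

def train(num_steps, step_fn):
    """Run step_fn for each step, flushing caches on a shared schedule."""
    for step in range(1, num_steps + 1):
        step_fn(step)
        if step % FLUSH_EVERY == 0:
            # every rank reaches this at the same step, so flushes line up
            flush_cache()
```

Flushing on a fixed step count (rather than reactively, when one rank feels memory pressure) is what keeps the ranks synchronized, which is the point of the warning's advice.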
 64%|██████▍ | 2761/4286 [17:28:36<11:55:12, 28.14s/it] {'loss': 0.0263, 'grad_norm': 2.7932337020091866, 'learning_rate': 3.5580961269248716e-07, 'completion_length': 350.01788330078125, 'rewards/only_full_func_accuracy_reward': 0.5446428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5267858505249023, 'reward_std': 0.14108158648014069, 'kl': 0.6572265625, 'epoch': 0.64}
[2025-03-02 22:36:18,099] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 64%|██████▍ | 2762/4286 [17:29:02<11:39:15, 27.53s/it] {'loss': 0.0101, 'grad_norm': 2.3042086482964246, 'learning_rate': 3.5557629491367243e-07, 'completion_length': 220.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142858505249023, 'reward_std': 0.059523810632526875, 'kl': 0.251953125, 'epoch': 0.64}
 64%|██████▍ | 2763/4286 [17:29:29<11:35:37, 27.40s/it] {'loss': 0.0581, 'grad_norm': 71.40586558362966, 'learning_rate': 3.5534297713485765e-07, 'completion_length': 218.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.7291667461395264, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.657738208770752, 'reward_std': 0.2534503936767578, 'kl': 1.453125, 'epoch': 0.64}
 64%|██████▍ | 2764/4286 [17:29:55<11:18:12, 26.74s/it] {'loss': 0.0845, 'grad_norm': 3.959979394973537, 'learning_rate': 3.5510965935604293e-07, 'completion_length': 193.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5327381491661072, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4791668057441711, 'reward_std': 0.12641431856900454, 'kl': 2.11328125, 'epoch': 0.64}
[2025-03-02 22:37:37,916] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▍ | 2765/4286 [17:30:22<11:23:46, 26.97s/it] {'loss': 0.0132, 'grad_norm': 5.081561858331928, 'learning_rate': 3.5487634157722815e-07, 'completion_length': 236.96430206298828, 'rewards/only_full_func_accuracy_reward': 0.679166704416275, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6613095998764038, 'reward_std': 0.08298469334840775, 'kl': 0.3310546875, 'epoch': 0.65}
 65%|██████▍ | 2766/4286 [17:30:46<11:04:09, 26.22s/it] {'loss': 0.0275, 'grad_norm': 1.1762324546521838, 'learning_rate': 3.5464302379841343e-07, 'completion_length': 166.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.06388125941157341, 'kl': 0.6875, 'epoch': 0.65}
[2025-03-02 22:38:28,530] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▍ | 2767/4286 [17:31:13<11:03:18, 26.20s/it] {'loss': 0.0301, 'grad_norm': 1.5938715136226642, 'learning_rate': 3.544097060195987e-07, 'completion_length': 247.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5803571939468384, 'reward_std': 0.11493691802024841, 'kl': 0.75537109375, 'epoch': 0.65}
[2025-03-02 22:38:55,703] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▍ | 2768/4286 [17:31:40<11:10:15, 26.49s/it] {'loss': 0.0194, 'grad_norm': 2.154655693994259, 'learning_rate': 3.541763882407839e-07, 'completion_length': 259.35716247558594, 'rewards/only_full_func_accuracy_reward': 0.6366071403026581, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.600892961025238, 'reward_std': 0.1646805703639984, 'kl': 0.484375, 'epoch': 0.65}
 65%|██████▍ | 2769/4286 [17:32:05<11:01:44, 26.17s/it] {'loss': 0.0107, 'grad_norm': 1.7956232137207229, 'learning_rate': 3.539430704619692e-07, 'completion_length': 219.10716247558594, 'rewards/only_full_func_accuracy_reward': 0.5261904746294022, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5083333849906921, 'reward_std': 0.07397790253162384, 'kl': 0.2685546875, 'epoch': 0.65}
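In these records the logged 'reward' appears to be the sum of the two reward components, 'rewards/only_full_func_accuracy_reward' and 'rewards/format_reward' (e.g. at step 2744, 0.6187500357627869 + 0.9642857313156128 ≈ 1.5830357670783997). A small illustrative check, not part of the training code, with values copied from step 2744:

```python
# Metrics dict values copied from the step 2744 log record above.
step_2744 = {
    'rewards/only_full_func_accuracy_reward': 0.6187500357627869,
    'rewards/format_reward': 0.9642857313156128,
    'reward': 1.5830357670783997,
}

def reward_components_sum(record):
    """Sum every 'rewards/*' component in a logged metrics dict."""
    return sum(v for k, v in record.items() if k.startswith('rewards/'))

# Total reward matches the component sum to within floating-point error.
assert abs(reward_components_sum(step_2744) - step_2744['reward']) < 1e-6
```

This is consistent with a reward function that simply adds the per-criterion rewards; whether that holds for every step of this run is an inference from the logged values, not something the log states explicitly.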
 65%|██████▍ | 2770/4286 [17:32:30<10:50:04, 25.73s/it] {'loss': 0.0074, 'grad_norm': 0.2850765022183356, 'learning_rate': 3.537097526831545e-07, 'completion_length': 220.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.0, 'kl': 0.1845703125, 'epoch': 0.65}
 65%|██████▍ | 2771/4286 [17:32:53<10:32:28, 25.05s/it] {'loss': 0.0128, 'grad_norm': 1.6775914422608513, 'learning_rate': 3.534764349043397e-07, 'completion_length': 159.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.052017971873283386, 'kl': 0.32080078125, 'epoch': 0.65}
 65%|██████▍ | 2772/4286 [17:33:19<10:32:26, 25.06s/it] {'loss': 0.01, 'grad_norm': 2.5174266820933173, 'learning_rate': 3.5324311712552497e-07, 'completion_length': 198.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.6458333432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6279763579368591, 'reward_std': 0.09354664385318756, 'kl': 0.25, 'epoch': 0.65}
 65%|██████▍ | 2773/4286 [17:33:45<10:40:08, 25.39s/it] {'loss': 0.0176, 'grad_norm': 2.371483968268831, 'learning_rate': 3.530097993467102e-07, 'completion_length': 253.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4895833432674408, 'rewards/format_reward': 1.0, 'reward': 1.4895834922790527, 'reward_std': 0.028368969447910786, 'kl': 0.4404296875, 'epoch': 0.65}
[2025-03-02 22:41:25,880] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▍ | 2774/4286 [17:34:10<10:39:31, 25.38s/it] {'loss': 0.0294, 'grad_norm': 5.447065634471346, 'learning_rate': 3.5277648156789547e-07, 'completion_length': 212.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6470238566398621, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6291667222976685, 'reward_std': 0.1380261890590191, 'kl': 0.7333984375, 'epoch': 0.65}
 65%|██████▍ | 2775/4286 [17:34:31<10:06:09, 24.07s/it] {'loss': 0.0074, 'grad_norm': 5.9810240153058265, 'learning_rate': 3.5254316378908074e-07, 'completion_length': 169.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.7366071343421936, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.06215710937976837, 'kl': 0.185546875, 'epoch': 0.65}
 65%|██████▍ | 2776/4286 [17:34:57<10:23:44, 24.78s/it] {'loss': 0.0137, 'grad_norm': 7.34736718391529, 'learning_rate': 3.5230984601026597e-07, 'completion_length': 212.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.4523809999227524, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4345239400863647, 'reward_std': 0.04761904617771506, 'kl': 0.34130859375, 'epoch': 0.65}
[2025-03-02 22:42:40,132] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▍ | 2777/4286 [17:35:24<10:38:23, 25.38s/it] {'loss': 0.0184, 'grad_norm': 4.318813712062957, 'learning_rate': 3.5207652823145124e-07, 'completion_length': 232.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.5848214626312256, 'rewards/format_reward': 1.0, 'reward': 1.5848215222358704, 'reward_std': 0.09183454886078835, 'kl': 0.45947265625, 'epoch': 0.65}
 65%|██████▍ | 2778/4286 [17:35:50<10:38:02, 25.39s/it] {'loss': 0.0093, 'grad_norm': 1.7847709002371124, 'learning_rate': 3.5184321045263646e-07, 'completion_length': 197.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.06356493197381496, 'kl': 0.23193359375, 'epoch': 0.65}
[2025-03-02 22:43:27,489] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▍ | 2779/4286 [17:36:12<10:11:50, 24.36s/it] {'loss': 0.0167, 'grad_norm': 5.665241094538621, 'learning_rate': 3.5160989267382174e-07, 'completion_length': 177.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.636904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6369048357009888, 'reward_std': 0.07250916957855225, 'kl': 0.4189453125, 'epoch': 0.65}
 65%|██████▍ | 2780/4286 [17:36:33<9:45:39, 23.33s/it] {'loss': 0.0084, 'grad_norm': 5.2782290371324745, 'learning_rate': 3.51376574895007e-07, 'completion_length': 167.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.5848214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.566964328289032, 'reward_std': 0.07774734310805798, 'kl': 0.208984375, 'epoch': 0.65}
 65%|██████▍ | 2781/4286 [17:36:58<10:02:52, 24.03s/it] {'loss': 0.0552, 'grad_norm': 37.66655220326712, 'learning_rate': 3.5114325711619224e-07, 'completion_length': 219.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.7113096117973328, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.693452537059784, 'reward_std': 0.148748978972435, 'kl': 1.37890625, 'epoch': 0.65}
[2025-03-02 22:44:37,472] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▍ | 2782/4286 [17:37:22<9:57:29, 23.84s/it] {'loss': 0.0074, 'grad_norm': 0.8386665986493521, 'learning_rate': 3.509099393373775e-07, 'completion_length': 180.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.633928656578064, 'rewards/format_reward': 1.0, 'reward': 1.6339287161827087, 'reward_std': 0.029761902987957, 'kl': 0.18408203125, 'epoch': 0.65}
 65%|██████▍ | 2783/4286 [17:37:43<9:36:46, 23.02s/it] {'loss': 0.0072, 'grad_norm': 1.8769442210977014, 'learning_rate': 3.5067662155856273e-07, 'completion_length': 155.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.7366072237491608, 'rewards/format_reward': 1.0, 'reward': 1.736607313156128, 'reward_std': 0.06274673715233803, 'kl': 0.17919921875, 'epoch': 0.65}
[2025-03-02 22:45:24,561] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▍ | 2784/4286 [17:38:09<9:58:24, 23.90s/it] {'loss': 0.0133, 'grad_norm': 7.7769716848350505, 'learning_rate': 3.50443303779748e-07, 'completion_length': 205.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.6187500357627869, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.600892961025238, 'reward_std': 0.07280982658267021, 'kl': 0.33203125, 'epoch': 0.65}
[2025-03-02 22:45:50,467] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▍ | 2785/4286 [17:38:35<10:13:02, 24.51s/it] {'loss': 0.0333, 'grad_norm': 2.805008965080555, 'learning_rate': 3.502099860009333e-07, 'completion_length': 178.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6785715222358704, 'reward_std': 0.12670884653925896, 'kl': 0.833984375, 'epoch': 0.65}
 65%|██████▌ | 2786/4286 [17:38:57<9:55:47, 23.83s/it] {'loss': 0.0512, 'grad_norm': 3.944504126298846, 'learning_rate': 3.499766682221185e-07, 'completion_length': 195.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.4272959381341934, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3558674454689026, 'reward_std': 0.2555602863430977, 'kl': 1.279296875, 'epoch': 0.65}
 65%|██████▌ | 2787/4286 [17:39:18<9:32:48, 22.93s/it] {'loss': 0.0222, 'grad_norm': 3.001746164113817, 'learning_rate': 3.497433504433038e-07, 'completion_length': 189.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.08600887097418308, 'kl': 0.5537109375, 'epoch': 0.65}
 65%|██████▌ | 2788/4286 [17:39:43<9:47:30, 23.53s/it] {'loss': 0.0242, 'grad_norm': 7.2008062754841395, 'learning_rate': 3.49510032664489e-07, 'completion_length': 194.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6279763579368591, 'reward_std': 0.11317819729447365, 'kl': 0.6044921875, 'epoch': 0.65}
[2025-03-02 22:47:23,550] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▌ | 2789/4286 [17:40:08<9:58:34, 23.99s/it] {'loss': 0.064, 'grad_norm': 8.99968610745933, 'learning_rate': 3.492767148856743e-07, 'completion_length': 189.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6101191639900208, 'reward_std': 0.1428571492433548, 'kl': 1.59765625, 'epoch': 0.65}
 65%|██████▌ | 2790/4286 [17:40:35<10:20:04, 24.87s/it] {'loss': 0.026, 'grad_norm': 5.966669242141235, 'learning_rate': 3.4904339710685955e-07, 'completion_length': 208.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.4627976268529892, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.427083432674408, 'reward_std': 0.13027676939964294, 'kl': 0.650390625, 'epoch': 0.65}
[2025-03-02 22:48:16,588] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▌ | 2791/4286 [17:41:01<10:29:00, 25.24s/it] {'loss': 0.118, 'grad_norm': 5.056917416005415, 'learning_rate': 3.488100793280448e-07, 'completion_length': 210.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5744048058986664, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5208334922790527, 'reward_std': 0.2616955265402794, 'kl': 2.94921875, 'epoch': 0.65}
 65%|██████▌ | 2792/4286 [17:41:21<9:53:38, 23.84s/it] {'loss': 0.1124, 'grad_norm': 9.025558943216183, 'learning_rate': 3.4857676154923005e-07, 'completion_length': 161.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.7282738387584686, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.692559540271759, 'reward_std': 0.1699349209666252, 'kl': 2.8125, 'epoch': 0.65}
 65%|██████▌ | 2793/4286 [17:41:45<9:53:38, 23.86s/it] {'loss': 0.0505, 'grad_norm': 1.5691003909088685, 'learning_rate': 3.4834344377041533e-07, 'completion_length': 181.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7232143878936768, 'reward_std': 0.1130952462553978, 'kl': 1.265625, 'epoch': 0.65}
 65%|██████▌ | 2794/4286 [17:42:06<9:32:37, 23.03s/it] {'loss': 0.0303, 'grad_norm': 3.970511168145346, 'learning_rate': 3.4811012599160055e-07, 'completion_length': 155.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6904762983322144, 'reward_std': 0.1428571566939354, 'kl': 0.7607421875, 'epoch': 0.65}
 65%|██████▌ | 2795/4286 [17:42:26<9:10:51, 22.17s/it] {'loss': 0.0696, 'grad_norm': 6.2556021142865825, 'learning_rate': 3.478768082127858e-07, 'completion_length': 135.39286422729492, 'rewards/only_full_func_accuracy_reward': 0.7693453431129456, 'rewards/format_reward': 1.0, 'reward': 1.7693454027175903, 'reward_std': 0.07932847365736961, 'kl': 1.73828125, 'epoch': 0.65}
[2025-03-02 22:50:07,920] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▌ | 2796/4286 [17:42:52<9:36:12, 23.20s/it] {'loss': 0.1037, 'grad_norm': 14.11674593071802, 'learning_rate': 3.4764349043397105e-07, 'completion_length': 215.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.4255952686071396, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3898810744285583, 'reward_std': 0.12681032344698906, 'kl': 2.58984375, 'epoch': 0.65}
 65%|██████▌ | 2797/4286 [17:43:17<9:47:43, 23.68s/it] {'loss': 0.0526, 'grad_norm': 6.736669705436984, 'learning_rate': 3.474101726551563e-07, 'completion_length': 188.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.6363095939159393, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6184524297714233, 'reward_std': 0.11855126544833183, 'kl': 1.314453125, 'epoch': 0.65}
[2025-03-02 22:50:59,348] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▌ | 2798/4286 [17:43:43<10:09:13, 24.57s/it] {'loss': 0.0625, 'grad_norm': 8.90413317880665, 'learning_rate': 3.471768548763416e-07, 'completion_length': 234.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5468750298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5111608505249023, 'reward_std': 0.22448870539665222, 'kl': 1.560546875, 'epoch': 0.65}
 65%|██████▌ | 2799/4286 [17:44:01<9:13:28, 22.33s/it] {'loss': 0.0192, 'grad_norm': 12.931000768986081, 'learning_rate': 3.469435370975268e-07, 'completion_length': 159.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.639881044626236, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.12088930234313011, 'kl': 0.47998046875, 'epoch': 0.65}
 65%|██████▌ | 2800/4286 [17:44:21<8:59:10, 21.77s/it] {'loss': 0.0462, 'grad_norm': 5.294889855373473, 'learning_rate': 3.467102193187121e-07, 'completion_length': 165.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7276786267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7098215818405151, 'reward_std': 0.1561431623995304, 'kl': 1.15576171875, 'epoch': 0.65}
 65%|██████▌ | 2801/4286 [17:47:46<31:40:02, 76.77s/it] {'loss': 0.0629, 'grad_norm': 7.3117651353971285, 'learning_rate': 3.464769015398973e-07, 'completion_length': 165.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715818405151, 'reward_std': 0.10100726038217545, 'kl': 1.57421875, 'epoch': 0.65}
 65%|██████▌ | 2802/4286 [17:48:07<24:42:09, 59.93s/it] {'loss': 0.0781, 'grad_norm': 12.800151044326604, 'learning_rate': 3.462435837610826e-07, 'completion_length': 151.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5848215818405151, 'reward_std': 0.20344803482294083, 'kl': 1.955078125, 'epoch': 0.65}
[2025-03-02 22:55:47,504] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▌ | 2803/4286 [17:48:32<20:21:05, 49.40s/it] {'loss': 0.0591, 'grad_norm': 9.04214524631342, 'learning_rate': 3.4601026598226787e-07, 'completion_length': 175.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.596428632736206, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5785715579986572, 'reward_std': 0.15921103581786156, 'kl': 1.4765625, 'epoch': 0.65}
 65%|██████▌ | 2804/4286 [17:48:56<17:13:45, 41.85s/it] {'loss': 0.0395, 'grad_norm': 11.84963178310883, 'learning_rate': 3.457769482034531e-07, 'completion_length': 185.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5595238506793976, 'rewards/format_reward': 1.0, 'reward': 1.55952388048172, 'reward_std': 0.03939763177186251, 'kl': 0.990234375, 'epoch': 0.65}
 65%|██████▌ | 2805/4286 [17:49:20<15:04:08, 36.63s/it] {'loss': 0.0506, 'grad_norm': 52.71278895706046, 'learning_rate': 3.4554363042463836e-07, 'completion_length': 170.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.6145834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5967262983322144, 'reward_std': 0.1041666641831398, 'kl': 1.26171875, 'epoch': 0.65}
[2025-03-02 22:57:01,411] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
 65%|██████▌ | 2806/4286 [17:49:46<13:39:10, 33.21s/it] {'loss': 0.0574, 'grad_norm': 5.216745760235185, 'learning_rate': 3.453103126458236e-07, 'completion_length': 188.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.47827382385730743, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4425596594810486, 'reward_std': 0.23886732012033463, 'kl': 1.4375, 'epoch': 0.65}
 65%|██████▌ | 2807/4286 [17:50:11<12:41:35, 30.90s/it] {'loss': 0.0414, 'grad_norm': 11.091028411574156, 'learning_rate': 3.4507699486700886e-07, 'completion_length': 186.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.5818452537059784, 'rewards/format_reward': 1.0, 'reward': 1.5818453431129456, 'reward_std': 0.07431196607649326, 'kl': 1.033203125, 'epoch': 0.65}
 66%|██████▌ | 2808/4286 [17:50:35<11:53:24, 28.96s/it] {'loss': 0.0611, 'grad_norm': 8.405496419611895, 'learning_rate': 3.4484367708819414e-07, 'completion_length': 167.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.5849567353725433, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.567099690437317, 'reward_std': 0.1251050103455782, 'kl': 1.53076171875,
'epoch': 0.66} 66%|██████▌ | 2808/4286 [17:50:35<11:53:24, 28.96s/it] 66%|██████▌ | 2809/4286 [17:51:00<11:22:24, 27.72s/it] {'loss': 0.0565, 'grad_norm': 2.938131608310126, 'learning_rate': 3.4461035930937936e-07, 'completion_length': 162.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.648809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6309524774551392, 'reward_std': 0.17553052306175232, 'kl': 1.41015625, 'epoch': 0.66} 66%|██████▌ | 2809/4286 [17:51:00<11:22:24, 27.72s/it] 66%|██████▌ | 2810/4286 [17:51:25<11:03:07, 26.96s/it] {'loss': 0.1001, 'grad_norm': 9.28224665263861, 'learning_rate': 3.4437704153056463e-07, 'completion_length': 175.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6056548357009888, 'reward_std': 0.14809077978134155, 'kl': 2.5, 'epoch': 0.66} 66%|██████▌ | 2810/4286 [17:51:25<11:03:07, 26.96s/it] 66%|██████▌ | 2811/4286 [17:51:49<10:35:03, 25.83s/it] {'loss': 0.0303, 'grad_norm': 3.0269787656172094, 'learning_rate': 3.4414372375174986e-07, 'completion_length': 162.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6339285969734192, 'reward_std': 0.08012712467461824, 'kl': 0.7578125, 'epoch': 0.66} 66%|██████▌ | 2811/4286 [17:51:49<10:35:03, 25.83s/it] 66%|██████▌ | 2812/4286 [17:52:10<10:01:00, 24.46s/it] {'loss': 0.0649, 'grad_norm': 8.189176295088311, 'learning_rate': 3.4391040597293513e-07, 'completion_length': 169.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 1.0, 'reward': 1.6443454027175903, 'reward_std': 0.13579164817929268, 'kl': 1.62109375, 'epoch': 0.66} 66%|██████▌ | 2812/4286 [17:52:10<10:01:00, 24.46s/it] 66%|██████▌ | 2813/4286 [17:52:35<10:02:45, 24.55s/it] {'loss': 0.0264, 'grad_norm': 40.781016219975385, 'learning_rate': 3.436770881941204e-07, 'completion_length': 
165.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5818452835083008, 'rewards/format_reward': 1.0, 'reward': 1.5818453431129456, 'reward_std': 0.07213572785258293, 'kl': 0.66015625, 'epoch': 0.66} 66%|██████▌ | 2813/4286 [17:52:35<10:02:45, 24.55s/it] 66%|██████▌ | 2814/4286 [17:52:54<9:24:15, 23.00s/it] {'loss': 0.0411, 'grad_norm': 7.187317230024141, 'learning_rate': 3.4344377041530563e-07, 'completion_length': 132.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.1003692727535963, 'kl': 1.029296875, 'epoch': 0.66} 66%|██████▌ | 2814/4286 [17:52:54<9:24:15, 23.00s/it] 66%|██████▌ | 2815/4286 [17:53:18<9:29:59, 23.25s/it] {'loss': 0.0151, 'grad_norm': 7.134835821447789, 'learning_rate': 3.432104526364909e-07, 'completion_length': 167.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6342262327671051, 'rewards/format_reward': 1.0, 'reward': 1.6342262625694275, 'reward_std': 0.0573536679148674, 'kl': 0.376953125, 'epoch': 0.66} 66%|██████▌ | 2815/4286 [17:53:18<9:29:59, 23.25s/it] 66%|██████▌ | 2816/4286 [17:53:43<9:41:51, 23.75s/it] {'loss': 0.0485, 'grad_norm': 9.579961694285274, 'learning_rate': 3.429771348576762e-07, 'completion_length': 179.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6297619938850403, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5761905908584595, 'reward_std': 0.19696014374494553, 'kl': 1.21484375, 'epoch': 0.66} 66%|██████▌ | 2816/4286 [17:53:43<9:41:51, 23.75s/it] 66%|██████▌ | 2817/4286 [17:54:07<9:40:59, 23.73s/it] {'loss': 0.0355, 'grad_norm': 10.352532451190026, 'learning_rate': 3.427438170788614e-07, 'completion_length': 163.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6130953431129456, 'reward_std': 0.1190476231276989, 'kl': 0.8876953125, 'epoch': 0.66} 66%|██████▌ | 2817/4286 [17:54:07<9:40:59, 
23.73s/it] 66%|██████▌ | 2818/4286 [17:54:31<9:45:55, 23.95s/it] {'loss': 0.0364, 'grad_norm': 23.487215026785375, 'learning_rate': 3.425104993000467e-07, 'completion_length': 178.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.5625000447034836, 'rewards/format_reward': 1.0, 'reward': 1.5625001192092896, 'reward_std': 0.06266060471534729, 'kl': 0.908203125, 'epoch': 0.66} 66%|██████▌ | 2818/4286 [17:54:31<9:45:55, 23.95s/it] 66%|██████▌ | 2819/4286 [17:54:55<9:46:45, 24.00s/it] {'loss': 0.0361, 'grad_norm': 11.442643932677543, 'learning_rate': 3.422771815212319e-07, 'completion_length': 161.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.4866071939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4687501192092896, 'reward_std': 0.075064099393785, 'kl': 0.90234375, 'epoch': 0.66} 66%|██████▌ | 2819/4286 [17:54:55<9:46:45, 24.00s/it][2025-03-02 23:02:32,725] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 66%|██████▌ | 2820/4286 [17:55:17<9:29:51, 23.32s/it] {'loss': 0.0449, 'grad_norm': 81.9808075575668, 'learning_rate': 3.4204386374241717e-07, 'completion_length': 179.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.6300595998764038, 'rewards/format_reward': 1.0, 'reward': 1.6300596594810486, 'reward_std': 0.08014271967113018, 'kl': 1.125, 'epoch': 0.66} 66%|██████▌ | 2820/4286 [17:55:17<9:29:51, 23.32s/it][2025-03-02 23:02:58,970] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. 
2821/4286 [17:55:43<9:50:52, 24.20s/it] {'loss': 0.0505, 'grad_norm': 10.779829085842442, 'learning_rate': 3.4181054596360245e-07, 'completion_length': 191.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.3919643312692642, 'rewards/format_reward': 1.0, 'reward': 1.3919643759727478, 'reward_std': 0.0967034325003624, 'kl': 1.263671875, 'epoch': 0.66}
2822/4286 [17:55:59<8:53:20, 21.86s/it] {'loss': 0.0087, 'grad_norm': 6.952165800113175, 'learning_rate': 3.4157722818478767e-07, 'completion_length': 133.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.8854167461395264, 'rewards/format_reward': 1.0, 'reward': 1.8854168057441711, 'reward_std': 0.04900030046701431, 'kl': 0.2177734375, 'epoch': 0.66}
2823/4286 [17:56:23<9:03:29, 22.29s/it] {'loss': 0.0286, 'grad_norm': 2.2313792903528658, 'learning_rate': 3.4134391040597295e-07, 'completion_length': 193.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.6056547611951828, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.022469747811555862, 'kl': 0.7109375, 'epoch': 0.66}
2824/4286 [17:56:47<9:15:07, 22.78s/it] {'loss': 0.008, 'grad_norm': 0.09855392824446374, 'learning_rate': 3.4111059262715817e-07, 'completion_length': 154.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.8333334028720856, 'rewards/format_reward': 1.0, 'reward': 1.833333432674408, 'reward_std': 0.0, 'kl': 0.20068359375, 'epoch': 0.66}
2825/4286 [17:57:11<9:29:18, 23.38s/it] {'loss': 0.0146, 'grad_norm': 9.349742074685034, 'learning_rate': 3.4087727484834344e-07, 'completion_length': 174.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6354167461395264, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6175596714019775, 'reward_std': 0.06483910605311394, 'kl': 0.36376953125, 'epoch': 0.66}
2826/4286 [17:57:32<9:09:58, 22.60s/it] {'loss': 0.0182, 'grad_norm': 5.103930231117074, 'learning_rate': 3.406439570695287e-07, 'completion_length': 155.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262387275696, 'reward_std': 0.07066833972930908, 'kl': 0.4560546875, 'epoch': 0.66}
2827/4286 [17:57:49<8:29:57, 20.97s/it] {'loss': 0.0147, 'grad_norm': 2.256772008635404, 'learning_rate': 3.4041063929071394e-07, 'completion_length': 140.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127978205680847, 'reward_std': 0.0301282936707139, 'kl': 0.369140625, 'epoch': 0.66}
2828/4286 [17:58:10<8:25:09, 20.79s/it] {'loss': 0.0247, 'grad_norm': 2.303827666520595, 'learning_rate': 3.401773215118992e-07, 'completion_length': 149.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.630952388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6130953431129456, 'reward_std': 0.11585774272680283, 'kl': 0.6181640625, 'epoch': 0.66}
2829/4286 [17:58:34<8:52:12, 21.92s/it] {'loss': 0.0139, 'grad_norm': 1.8001844064314176, 'learning_rate': 3.3994400373308444e-07, 'completion_length': 183.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 1.0, 'reward': 1.633928656578064, 'reward_std': 0.06143151968717575, 'kl': 0.34765625, 'epoch': 0.66}
2830/4286 [17:58:56<8:50:42, 21.87s/it] {'loss': 0.0126, 'grad_norm': 4.068320650298024, 'learning_rate': 3.397106859542697e-07, 'completion_length': 184.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6488096714019775, 'reward_std': 0.07695358991622925, 'kl': 0.3154296875, 'epoch': 0.66}
[2025-03-02 23:06:32,831] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance.
2831/4286 [17:59:17<8:42:50, 21.56s/it] {'loss': 0.0243, 'grad_norm': 10.279781718214593, 'learning_rate': 3.39477368175455e-07, 'completion_length': 149.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.5833333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5654762387275696, 'reward_std': 0.0897791888564825, 'kl': 0.6083984375, 'epoch': 0.66}
2832/4286 [17:59:39<8:43:16, 21.59s/it] {'loss': 0.0308, 'grad_norm': 2.852764821309884, 'learning_rate': 3.392440503966402e-07, 'completion_length': 173.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7157739400863647, 'reward_std': 0.1298178769648075, 'kl': 0.76953125, 'epoch': 0.66}
2833/4286 [17:59:59<8:36:36, 21.33s/it] {'loss': 0.011, 'grad_norm': 9.004080579039018, 'learning_rate': 3.390107326178255e-07, 'completion_length': 132.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.7184524238109589, 'rewards/format_reward': 1.0, 'reward': 1.718452513217926, 'reward_std': 0.05119048058986664, 'kl': 0.27587890625, 'epoch': 0.66}
2834/4286 [18:00:20<8:33:11, 21.21s/it] {'loss': 0.0079, 'grad_norm': 5.389263674495979, 'learning_rate': 3.387774148390107e-07, 'completion_length': 170.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.04602411016821861, 'kl': 0.19775390625, 'epoch': 0.66}
[2025-03-02 23:08:01,434] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance.
2835/4286 [18:00:46<9:02:31, 22.43s/it] {'loss': 0.008, 'grad_norm': 1.0733776104032142, 'learning_rate': 3.38544097060196e-07, 'completion_length': 213.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.5535715073347092, 'rewards/format_reward': 1.0, 'reward': 1.5535715818405151, 'reward_std': 0.011904764920473099, 'kl': 0.20068359375, 'epoch': 0.66}
2836/4286 [18:01:11<9:24:56, 23.38s/it] {'loss': 0.0258, 'grad_norm': 4.288446434410665, 'learning_rate': 3.3831077928138126e-07, 'completion_length': 195.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5982143878936768, 'reward_std': 0.10303214937448502, 'kl': 0.6435546875, 'epoch': 0.66}
[2025-03-02 23:08:50,952] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance.
2837/4286 [18:01:35<9:28:38, 23.55s/it] {'loss': 0.0617, 'grad_norm': 9.981644254732945, 'learning_rate': 3.380774615025665e-07, 'completion_length': 158.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6235119700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6056548953056335, 'reward_std': 0.09066697582602501, 'kl': 1.541015625, 'epoch': 0.66}
[2025-03-02 23:09:15,421] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
2838/4286 [18:02:00<9:34:55, 23.82s/it] {'loss': 0.0175, 'grad_norm': 14.43141031375295, 'learning_rate': 3.3784414372375176e-07, 'completion_length': 180.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.095238097012043, 'kl': 0.4384765625, 'epoch': 0.66}
2839/4286 [18:02:21<9:18:47, 23.17s/it] {'loss': 0.0429, 'grad_norm': 4.2876793733225895, 'learning_rate': 3.3761082594493703e-07, 'completion_length': 163.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.5773809552192688, 'rewards/format_reward': 1.0, 'reward': 1.5773810744285583, 'reward_std': 0.04007173329591751, 'kl': 1.07421875, 'epoch': 0.66}
2840/4286 [18:02:38<8:32:22, 21.26s/it] {'loss': 0.0273, 'grad_norm': 182.91311730049958, 'learning_rate': 3.3737750816612225e-07, 'completion_length': 146.375, 'rewards/only_full_func_accuracy_reward': 0.5946428775787354, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5767857432365417, 'reward_std': 0.12033125944435596, 'kl': 0.68212890625, 'epoch': 0.66}
2841/4286 [18:02:56<8:05:46, 20.17s/it] {'loss': 0.06, 'grad_norm': 1.981113208368167, 'learning_rate': 3.3714419038730753e-07, 'completion_length': 139.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.05952381528913975, 'kl': 1.50390625, 'epoch': 0.66}
2842/4286 [18:03:12<7:40:59, 19.15s/it] {'loss': 0.0332, 'grad_norm': 2.6592133261816664, 'learning_rate': 3.3691087260849275e-07, 'completion_length': 132.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.7434523701667786, 'rewards/format_reward': 1.0, 'reward': 1.743452548980713, 'reward_std': 0.06314694881439209, 'kl': 0.830078125, 'epoch': 0.66}
2843/4286 [18:03:33<7:51:00, 19.58s/it] {'loss': 0.042, 'grad_norm': 22.509462682575865, 'learning_rate': 3.36677554829678e-07, 'completion_length': 142.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547619700431824, 'reward_std': 0.07251131162047386, 'kl': 1.0498046875, 'epoch': 0.66}
2844/4286 [18:03:49<7:27:02, 18.60s/it] {'loss': 0.0566, 'grad_norm': 4.087839701958252, 'learning_rate': 3.364442370508633e-07, 'completion_length': 135.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.07327024638652802, 'kl': 1.42529296875, 'epoch': 0.66}
2845/4286 [18:04:14<8:09:13, 20.37s/it] {'loss': 0.0129, 'grad_norm': 4.628521508588163, 'learning_rate': 3.362109192720485e-07, 'completion_length': 184.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.44166670739650726, 'rewards/format_reward': 1.0, 'reward': 1.4416667819023132, 'reward_std': 0.024043038487434387, 'kl': 0.3232421875, 'epoch': 0.66}
2846/4286 [18:04:35<8:14:19, 20.60s/it] {'loss': 0.0801, 'grad_norm': 3.617227936433464, 'learning_rate': 3.359776014932338e-07, 'completion_length': 135.96428680419922, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547619700431824, 'reward_std': 0.10898453742265701, 'kl': 2.0, 'epoch': 0.66}
2847/4286 [18:04:55<8:13:19, 20.57s/it] {'loss': 0.0723, 'grad_norm': 68.19464196320543, 'learning_rate': 3.35744283714419e-07, 'completion_length': 146.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.5520833432674408, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5163691639900208, 'reward_std': 0.15118440054357052, 'kl': 1.8125, 'epoch': 0.66}
2848/4286 [18:05:21<8:47:34, 22.01s/it] {'loss': 0.0344, 'grad_norm': 13.960100131923708, 'learning_rate': 3.355109659356043e-07, 'completion_length': 163.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5446428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.508928656578064, 'reward_std': 0.1412622146308422, 'kl': 0.859375, 'epoch': 0.66}
2849/4286 [18:05:38<8:11:03, 20.50s/it] {'loss': 0.0398, 'grad_norm': 4.749591029065893, 'learning_rate': 3.3527764815678957e-07, 'completion_length': 141.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5452381074428558, 'rewards/format_reward': 1.0, 'reward': 1.545238196849823, 'reward_std': 0.090311199426651, 'kl': 0.994140625, 'epoch': 0.66}
2850/4286 [18:05:56<7:57:16, 19.94s/it] {'loss': 0.0547, 'grad_norm': 8.316093688141487, 'learning_rate': 3.350443303779748e-07, 'completion_length': 134.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5386904776096344, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5029763579368591, 'reward_std': 0.1845238208770752, 'kl': 1.3671875, 'epoch': 0.66}
2851/4286 [18:06:19<8:14:57, 20.70s/it] {'loss': 0.0891, 'grad_norm': 21.724893668406313, 'learning_rate': 3.3481101259916007e-07, 'completion_length': 171.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.5598214864730835, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5419644713401794, 'reward_std': 0.12624644115567207, 'kl': 2.23046875, 'epoch': 0.67}
2852/4286 [18:06:35<7:41:31, 19.31s/it] {'loss': 0.0134, 'grad_norm': 4.32764510492127, 'learning_rate': 3.345776948203453e-07, 'completion_length': 127.69643783569336, 'rewards/only_full_func_accuracy_reward': 0.6770834028720856, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.03869047574698925, 'kl': 0.33544921875, 'epoch': 0.67}
2853/4286 [18:06:55<7:43:46, 19.42s/it] {'loss': 0.013, 'grad_norm': 3.5271622895877446, 'learning_rate': 3.3434437704153057e-07, 'completion_length': 131.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.7565476596355438, 'rewards/format_reward': 1.0, 'reward': 1.7565476894378662, 'reward_std': 0.060175007209181786, 'kl': 0.32421875, 'epoch': 0.67}
2854/4286 [18:07:14<7:40:16, 19.29s/it] {'loss': 0.0251, 'grad_norm': 1.7260675701427408, 'learning_rate': 3.3411105926271584e-07, 'completion_length': 115.87500381469727, 'rewards/only_full_func_accuracy_reward': 0.7797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7797620296478271, 'reward_std': 0.0, 'kl': 0.626953125, 'epoch': 0.67}
2855/4286 [18:07:31<7:29:08, 18.83s/it] {'loss': 0.0271, 'grad_norm': 1.236738571673023, 'learning_rate': 3.3387774148390106e-07, 'completion_length': 129.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.8258929252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8080358505249023, 'reward_std': 0.0744047611951828, 'kl': 0.6748046875, 'epoch': 0.67}
2856/4286 [18:07:49<7:18:50, 18.41s/it] {'loss': 0.052, 'grad_norm': 9.355336867709546, 'learning_rate': 3.3364442370508634e-07, 'completion_length': 139.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6342262327671051, 'rewards/format_reward': 1.0, 'reward': 1.6342262625694275, 'reward_std': 0.0655425377190113, 'kl': 1.30078125, 'epoch': 0.67}
2857/4286 [18:08:14<8:08:16, 20.50s/it] {'loss': 0.0448, 'grad_norm': 3.4524163257082363, 'learning_rate': 3.3341110592627156e-07, 'completion_length': 167.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5267857909202576, 'reward_std': 0.1130952425301075, 'kl': 1.1171875, 'epoch': 0.67}
[2025-03-02 23:15:50,902] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance.
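A minimal sketch of the remedy the stage3.py warnings in this log suggest: call get_accelerator().empty_cache() on a fixed step cadence so every rank flushes at the same time. The EMPTY_CACHE_EVERY cadence and the maybe_empty_cache helper are illustrative assumptions, not part of the script that produced this run; the ImportError fallback only lets the sketch run where DeepSpeed is absent.

```python
# Sketch of the warning's suggested fix: periodic, rank-synchronized
# allocator cache flushes inside the training loop.
try:
    from deepspeed.accelerator import get_accelerator  # DeepSpeed's accelerator API
except ImportError:
    # Fallback so this sketch runs without DeepSpeed installed: a no-op accelerator.
    class _NoopAccelerator:
        def empty_cache(self):
            pass

    def get_accelerator():
        return _NoopAccelerator()

# Flush cadence in optimizer steps (assumption; tune against the warning frequency).
EMPTY_CACHE_EVERY = 50

def maybe_empty_cache(step: int) -> bool:
    """Flush the allocator cache on a fixed schedule; returns True when it flushed."""
    if step % EMPTY_CACHE_EVERY == 0:
        # Keyed to the global step, so all ranks flush on the same iteration.
        get_accelerator().empty_cache()
        return True
    return False

# Example cadence over 100 steps: only steps 50 and 100 flush.
flushed_steps = [s for s in range(1, 101) if maybe_empty_cache(s)]
```

Keying the flush to the shared global step (rather than per-rank memory pressure) is what satisfies the warning's "all ranks flush their caches at the same time" guidance, since uncoordinated flushes leave ranks stalled at different points.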
2858/4286 [18:08:35<8:10:21, 20.60s/it] {'loss': 0.0119, 'grad_norm': 6.170250078215621, 'learning_rate': 3.3317778814745684e-07, 'completion_length': 146.39286422729492, 'rewards/only_full_func_accuracy_reward': 0.6041667312383652, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5863096714019775, 'reward_std': 0.0833333358168602, 'kl': 0.296875, 'epoch': 0.67}
2859/4286 [18:08:55<8:03:48, 20.34s/it] {'loss': 0.0179, 'grad_norm': 2.505620792289482, 'learning_rate': 3.329444703686421e-07, 'completion_length': 148.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5372024327516556, 'rewards/format_reward': 1.0, 'reward': 1.5372024774551392, 'reward_std': 0.02267500851303339, 'kl': 0.44775390625, 'epoch': 0.67}
2860/4286 [18:09:11<7:36:48, 19.22s/it] {'loss': 0.0083, 'grad_norm': 2.3108407359872256, 'learning_rate': 3.3271115258982733e-07, 'completion_length': 127.96428680419922, 'rewards/only_full_func_accuracy_reward': 0.7276785969734192, 'rewards/format_reward': 1.0, 'reward': 1.727678656578064, 'reward_std': 0.08496132493019104, 'kl': 0.20703125, 'epoch': 0.67}
2861/4286 [18:09:33<7:54:51, 19.99s/it] {'loss': 0.0374, 'grad_norm': 4.520428428573425, 'learning_rate': 3.324778348110126e-07, 'completion_length': 170.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.07742243632674217, 'kl': 0.9296875, 'epoch': 0.67}
2862/4286 [18:09:57<8:21:24, 21.13s/it] {'loss': 0.0122, 'grad_norm': 1.942874394598692, 'learning_rate': 3.322445170321979e-07, 'completion_length': 156.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607144474983215, 'reward_std': 0.0476190485060215, 'kl': 0.3056640625, 'epoch': 0.67}
2863/4286 [18:10:15<8:01:13, 20.29s/it] {'loss': 0.054, 'grad_norm': 5.253389427087594, 'learning_rate': 3.320111992533831e-07, 'completion_length': 131.5535774230957, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.11157478764653206, 'kl': 1.34765625, 'epoch': 0.67}
2864/4286 [18:10:36<8:01:40, 20.32s/it] {'loss': 0.0251, 'grad_norm': 9.779066550161055, 'learning_rate': 3.317778814745684e-07, 'completion_length': 147.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5119047909975052, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.49404776096344, 'reward_std': 0.07142857648432255, 'kl': 0.6240234375, 'epoch': 0.67}
2865/4286 [18:10:52<7:31:21, 19.06s/it] {'loss': 0.0337, 'grad_norm': 3.8811879723103933, 'learning_rate': 3.315445636957536e-07, 'completion_length': 127.87500381469727, 'rewards/only_full_func_accuracy_reward': 0.6755952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.04602411389350891, 'kl': 0.84375, 'epoch': 0.67}
2866/4286 [18:11:11<7:32:07, 19.10s/it] {'loss': 0.0117, 'grad_norm': 3.429395778487627, 'learning_rate': 3.313112459169389e-07, 'completion_length': 134.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.6889881193637848, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.0386904738843441, 'kl': 0.29296875, 'epoch': 0.67}
2867/4286 [18:11:31<7:39:01, 19.41s/it] {'loss': 0.0093, 'grad_norm': 36.766218685666914, 'learning_rate': 3.3107792813812415e-07, 'completion_length': 158.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7089286148548126, 'rewards/format_reward': 1.0, 'reward': 1.708928644657135, 'reward_std': 0.04977596737444401, 'kl': 0.23193359375, 'epoch': 0.67}
2868/4286 [18:11:51<7:44:28, 19.65s/it] {'loss': 0.0449, 'grad_norm': 2.3568843369616137, 'learning_rate': 3.308446103593094e-07, 'completion_length': 140.26786422729492, 'rewards/only_full_func_accuracy_reward': 0.7916667461395264, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7738096714019775, 'reward_std': 0.15157204493880272, 'kl': 1.123046875, 'epoch': 0.67}
2869/4286 [18:12:12<7:47:51, 19.81s/it] {'loss': 0.0554, 'grad_norm': 3.135376563958485, 'learning_rate': 3.3061129258049465e-07, 'completion_length': 149.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5267857909202576, 'reward_std': 0.21385834366083145, 'kl': 1.384765625, 'epoch': 0.67}
2870/4286 [18:12:28<7:20:33, 18.67s/it] {'loss': 0.0269, 'grad_norm': 2.6211985428567894, 'learning_rate': 3.3037797480167987e-07, 'completion_length': 122.19643783569336, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.05977054685354233, 'kl': 0.6767578125, 'epoch': 0.67}
2871/4286 [18:12:48<7:31:32, 19.15s/it] {'loss': 0.01, 'grad_norm': 2.4172813035962197, 'learning_rate': 3.3014465702286515e-07, 'completion_length': 145.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5565476417541504, 'rewards/format_reward': 1.0, 'reward': 1.5565477013587952, 'reward_std': 0.02816697023808956, 'kl': 0.2509765625, 'epoch': 0.67}
2872/4286 [18:13:05<7:20:22, 18.69s/it] {'loss': 0.0636, 'grad_norm': 3.707967001187922, 'learning_rate': 3.299113392440504e-07, 'completion_length': 133.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7163691222667694, 'rewards/format_reward': 1.0, 'reward': 1.7163691520690918, 'reward_std': 0.11157074570655823, 'kl': 1.59375, 'epoch': 0.67}
2873/4286 [18:13:25<7:29:55, 19.11s/it] {'loss': 0.0153, 'grad_norm': 4.862828012409671, 'learning_rate': 3.2967802146523564e-07, 'completion_length': 160.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6270833015441895, 'rewards/format_reward': 1.0, 'reward': 1.627083420753479, 'reward_std': 0.047192178666591644, 'kl': 0.3837890625, 'epoch': 0.67}
2874/4286 [18:13:44<7:24:58, 18.91s/it] {'loss': 0.0518, 'grad_norm': 2.6499632800953634, 'learning_rate': 3.294447036864209e-07, 'completion_length': 142.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7842262983322144, 'reward_std': 0.0803571492433548, 'kl': 1.296875, 'epoch': 0.67}
2875/4286 [18:14:01<7:12:10, 18.38s/it] {'loss': 0.0812, 'grad_norm': 5.330512513624737, 'learning_rate': 3.2921138590760614e-07, 'completion_length': 141.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7008929252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6830358505249023, 'reward_std':
0.1031530313193798, 'kl': 2.02734375, 'epoch': 0.67} 67%|██████▋ | 2875/4286 [18:14:01<7:12:10, 18.38s/it] 67%|██████▋ | 2876/4286 [18:14:21<7:24:55, 18.93s/it] {'loss': 0.0248, 'grad_norm': 3.7732596377113388, 'learning_rate': 3.289780681287914e-07, 'completion_length': 157.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6205357909202576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6026787161827087, 'reward_std': 0.1150501649826765, 'kl': 0.62255859375, 'epoch': 0.67} 67%|██████▋ | 2876/4286 [18:14:21<7:24:55, 18.93s/it] 67%|██████▋ | 2877/4286 [18:14:41<7:32:29, 19.27s/it] {'loss': 0.0288, 'grad_norm': 7.235389702768072, 'learning_rate': 3.287447503499767e-07, 'completion_length': 167.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.645535796880722, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.627678632736206, 'reward_std': 0.08243607729673386, 'kl': 0.720703125, 'epoch': 0.67} 67%|██████▋ | 2877/4286 [18:14:41<7:32:29, 19.27s/it][2025-03-02 23:22:22,375] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 67%|██████▋ | 2878/4286 [18:15:06<8:13:38, 21.04s/it] {'loss': 0.0693, 'grad_norm': 5.033123082247897, 'learning_rate': 3.285114325711619e-07, 'completion_length': 187.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.4151785969734192, 'rewards/format_reward': 1.0, 'reward': 1.4151787161827087, 'reward_std': 0.09067623876035213, 'kl': 1.734375, 'epoch': 0.67} 67%|██████▋ | 2878/4286 [18:15:06<8:13:38, 21.04s/it] 67%|██████▋ | 2879/4286 [18:15:24<7:46:34, 19.90s/it] {'loss': 0.0726, 'grad_norm': 6.983665437594273, 'learning_rate': 3.282781147923472e-07, 'completion_length': 137.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5848215818405151, 'reward_std': 0.0953197069466114, 'kl': 1.81640625, 'epoch': 0.67} 67%|██████▋ | 2879/4286 [18:15:24<7:46:34, 19.90s/it] 67%|██████▋ | 2880/4286 [18:15:44<7:49:49, 20.05s/it] {'loss': 0.0118, 'grad_norm': 3.2934096039688545, 'learning_rate': 3.280447970135324e-07, 'completion_length': 138.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.7529762089252472, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.0416666679084301, 'kl': 0.294921875, 'epoch': 0.67} 67%|██████▋ | 2880/4286 [18:15:44<7:49:49, 20.05s/it] 67%|██████▋ | 2881/4286 [18:16:01<7:29:00, 19.17s/it] {'loss': 0.0302, 'grad_norm': 34.22895479193605, 'learning_rate': 3.278114792347177e-07, 'completion_length': 135.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.04761905549094081, 'kl': 0.7529296875, 'epoch': 0.67} 67%|██████▋ | 2881/4286 [18:16:01<7:29:00, 19.17s/it] 67%|██████▋ | 2882/4286 [18:16:22<7:40:13, 19.67s/it] {'loss': 0.0724, 'grad_norm': 
4.455080713945093, 'learning_rate': 3.2757816145590296e-07, 'completion_length': 137.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.7069806158542633, 'rewards/format_reward': 1.0, 'reward': 1.706980586051941, 'reward_std': 0.1336580142378807, 'kl': 1.80859375, 'epoch': 0.67} 67%|██████▋ | 2882/4286 [18:16:22<7:40:13, 19.67s/it] 67%|██████▋ | 2883/4286 [18:16:40<7:27:16, 19.13s/it] {'loss': 0.0483, 'grad_norm': 3.2100394637483305, 'learning_rate': 3.273448436770882e-07, 'completion_length': 157.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6473214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6294644474983215, 'reward_std': 0.11011905316263437, 'kl': 1.20703125, 'epoch': 0.67} 67%|██████▋ | 2883/4286 [18:16:40<7:27:16, 19.13s/it] 67%|██████▋ | 2884/4286 [18:17:00<7:31:10, 19.31s/it] {'loss': 0.0154, 'grad_norm': 0.927247438000851, 'learning_rate': 3.2711152589827346e-07, 'completion_length': 151.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5133928954601288, 'rewards/format_reward': 1.0, 'reward': 1.513392984867096, 'reward_std': 0.03766920417547226, 'kl': 0.3857421875, 'epoch': 0.67} 67%|██████▋ | 2884/4286 [18:17:00<7:31:10, 19.31s/it][2025-03-02 23:24:38,651] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 67%|██████▋ | 2885/4286 [18:17:23<7:57:17, 20.44s/it] {'loss': 0.0586, 'grad_norm': 3.449283072931545, 'learning_rate': 3.2687820811945873e-07, 'completion_length': 174.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5476190745830536, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4940477013587952, 'reward_std': 0.16656116768717766, 'kl': 1.46826171875, 'epoch': 0.67} 67%|██████▋ | 2885/4286 [18:17:23<7:57:17, 20.44s/it] 67%|██████▋ | 2886/4286 [18:17:42<7:45:55, 19.97s/it] {'loss': 0.0341, 'grad_norm': 6.321119199820031, 'learning_rate': 3.2664489034064396e-07, 'completion_length': 151.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6889881193637848, 'rewards/format_reward': 1.0, 'reward': 1.6889882683753967, 'reward_std': 0.11295603960752487, 'kl': 0.853515625, 'epoch': 0.67} 67%|██████▋ | 2886/4286 [18:17:42<7:45:55, 19.97s/it] 67%|██████▋ | 2887/4286 [18:18:04<8:05:22, 20.82s/it] {'loss': 0.0607, 'grad_norm': 4.447947446068411, 'learning_rate': 3.2641157256182923e-07, 'completion_length': 191.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.6026786416769028, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5669643878936768, 'reward_std': 0.11571384966373444, 'kl': 1.513671875, 'epoch': 0.67} 67%|██████▋ | 2887/4286 [18:18:04<8:05:22, 20.82s/it] 67%|██████▋ | 2888/4286 [18:18:22<7:44:05, 19.92s/it] {'loss': 0.0462, 'grad_norm': 0.8935126545786366, 'learning_rate': 3.2617825478301445e-07, 'completion_length': 140.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.821428656578064, 'reward_std': 0.10858714953064919, 'kl': 1.15625, 'epoch': 0.67} 67%|██████▋ | 2888/4286 [18:18:22<7:44:05, 19.92s/it] 67%|██████▋ | 2889/4286 [18:18:43<7:47:44, 
20.09s/it] {'loss': 0.0306, 'grad_norm': 2.1128569925746064, 'learning_rate': 3.2594493700419973e-07, 'completion_length': 144.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.605654776096344, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.0744047611951828, 'kl': 0.76416015625, 'epoch': 0.67} 67%|██████▋ | 2889/4286 [18:18:43<7:47:44, 20.09s/it] 67%|██████▋ | 2890/4286 [18:19:04<7:54:35, 20.40s/it] {'loss': 0.1429, 'grad_norm': 18.837260790526205, 'learning_rate': 3.25711619225385e-07, 'completion_length': 154.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5208333432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5029762983322144, 'reward_std': 0.2322368025779724, 'kl': 3.5625, 'epoch': 0.67} 67%|██████▋ | 2890/4286 [18:19:04<7:54:35, 20.40s/it] 67%|██████▋ | 2891/4286 [18:19:23<7:47:33, 20.11s/it] {'loss': 0.0581, 'grad_norm': 9.325298965105066, 'learning_rate': 3.2547830144657023e-07, 'completion_length': 156.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.7172619104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6815477013587952, 'reward_std': 0.17757174000144005, 'kl': 1.453125, 'epoch': 0.67} 67%|██████▋ | 2891/4286 [18:19:23<7:47:33, 20.11s/it] 67%|██████▋ | 2892/4286 [18:19:46<8:06:24, 20.94s/it] {'loss': 0.0386, 'grad_norm': 17.1867304985682, 'learning_rate': 3.252449836677555e-07, 'completion_length': 174.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5401786118745804, 'rewards/format_reward': 1.0, 'reward': 1.5401787161827087, 'reward_std': 0.056547620333731174, 'kl': 0.966796875, 'epoch': 0.67} 67%|██████▋ | 2892/4286 [18:19:46<8:06:24, 20.94s/it] 67%|██████▋ | 2893/4286 [18:20:03<7:40:07, 19.82s/it] {'loss': 0.0223, 'grad_norm': 1.1003163055887157, 'learning_rate': 3.250116658889407e-07, 'completion_length': 166.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 1.0, 'reward': 
1.7142858505249023, 'reward_std': 0.07008037343621254, 'kl': 0.5576171875, 'epoch': 0.67} 67%|██████▋ | 2893/4286 [18:20:03<7:40:07, 19.82s/it][2025-03-02 23:27:44,418] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 68%|██████▊ | 2894/4286 [18:20:29<8:16:59, 21.42s/it] {'loss': 0.0256, 'grad_norm': 5.4746351885651565, 'learning_rate': 3.24778348110126e-07, 'completion_length': 174.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.610119104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5744048357009888, 'reward_std': 0.11493692174553871, 'kl': 0.64111328125, 'epoch': 0.68} 68%|██████▊ | 2894/4286 [18:20:29<8:16:59, 21.42s/it] 68%|██████▊ | 2895/4286 [18:20:49<8:13:16, 21.28s/it] {'loss': 0.0147, 'grad_norm': 12.09856231848506, 'learning_rate': 3.245450303313113e-07, 'completion_length': 163.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7184524536132812, 'rewards/format_reward': 1.0, 'reward': 1.7184524536132812, 'reward_std': 0.07262998074293137, 'kl': 0.3671875, 'epoch': 0.68} 68%|██████▊ | 2895/4286 [18:20:49<8:13:16, 21.28s/it] 68%|██████▊ | 2896/4286 [18:21:10<8:07:51, 21.06s/it] {'loss': 0.0986, 'grad_norm': 100.89127966170388, 'learning_rate': 3.243117125524965e-07, 'completion_length': 165.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5988095700740814, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5809524655342102, 'reward_std': 0.06476517766714096, 'kl': 2.4697265625, 'epoch': 0.68} 68%|██████▊ | 2896/4286 [18:21:10<8:07:51, 21.06s/it] 68%|██████▊ | 2897/4286 [18:21:36<8:39:55, 22.46s/it] {'loss': 
0.0382, 'grad_norm': 4.6071242276272155, 'learning_rate': 3.2407839477368177e-07, 'completion_length': 217.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.566964328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5491072535514832, 'reward_std': 0.0922619067132473, 'kl': 0.953125, 'epoch': 0.68} 68%|██████▊ | 2897/4286 [18:21:36<8:39:55, 22.46s/it] 68%|██████▊ | 2898/4286 [18:21:57<8:28:31, 21.98s/it] {'loss': 0.0833, 'grad_norm': 7.409024095806655, 'learning_rate': 3.23845076994867e-07, 'completion_length': 169.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.5610119700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5431549549102783, 'reward_std': 0.15583965182304382, 'kl': 2.08203125, 'epoch': 0.68} 68%|██████▊ | 2898/4286 [18:21:57<8:28:31, 21.98s/it][2025-03-02 23:29:29,633] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 68%|██████▊ | 2899/4286 [18:22:14<7:54:30, 20.53s/it] {'loss': 0.093, 'grad_norm': 2.6710664317842143, 'learning_rate': 3.2361175921605227e-07, 'completion_length': 147.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.11082620546221733, 'kl': 2.31640625, 'epoch': 0.68} 68%|██████▊ | 2899/4286 [18:22:14<7:54:30, 20.53s/it] 68%|██████▊ | 2900/4286 [18:22:33<7:47:28, 20.24s/it] {'loss': 0.1267, 'grad_norm': 10.431149126755672, 'learning_rate': 3.2337844143723754e-07, 'completion_length': 185.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5854166746139526, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5318453311920166, 'reward_std': 0.20257563889026642, 'kl': 3.16796875, 'epoch': 0.68} 68%|██████▊ | 2900/4286 [18:22:33<7:47:28, 20.24s/it] 68%|██████▊ | 2901/4286 [18:27:05<36:49:15, 95.71s/it] {'loss': 0.0123, 'grad_norm': 2.726356323594601, 'learning_rate': 3.2314512365842277e-07, 'completion_length': 165.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.645833432674408, 'reward_std': 0.05357143096625805, 'kl': 0.30859375, 'epoch': 0.68} 68%|██████▊ | 2901/4286 [18:27:05<36:49:15, 95.71s/it] 68%|██████▊ | 2902/4286 [18:27:22<27:43:35, 72.12s/it] {'loss': 0.0756, 'grad_norm': 2.490688947924863, 'learning_rate': 3.2291180587960804e-07, 'completion_length': 155.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6936224699020386, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6400511860847473, 'reward_std': 0.17635178565979004, 'kl': 1.890625, 'epoch': 0.68} 68%|██████▊ | 2902/4286 [18:27:22<27:43:35, 72.12s/it][2025-03-02 23:35:01,812] [WARNING] 
[stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 68%|██████▊ | 2903/4286 [18:27:46<22:07:43, 57.60s/it] {'loss': 0.027, 'grad_norm': 11.160579326509785, 'learning_rate': 3.2267848810079326e-07, 'completion_length': 195.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.4613095670938492, 'rewards/format_reward': 1.0, 'reward': 1.4613096117973328, 'reward_std': 0.011904765153303742, 'kl': 0.6767578125, 'epoch': 0.68} 68%|██████▊ | 2903/4286 [18:27:46<22:07:43, 57.60s/it] 68%|██████▊ | 2904/4286 [18:28:09<18:10:40, 47.35s/it] {'loss': 0.0206, 'grad_norm': 4.698491868418192, 'learning_rate': 3.2244517032197854e-07, 'completion_length': 188.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5511905550956726, 'rewards/format_reward': 1.0, 'reward': 1.5511906147003174, 'reward_std': 0.07002165261656046, 'kl': 0.51708984375, 'epoch': 0.68} 68%|██████▊ | 2904/4286 [18:28:09<18:10:40, 47.35s/it] 68%|██████▊ | 2905/4286 [18:28:31<15:13:52, 39.71s/it] {'loss': 0.0363, 'grad_norm': 4.2964078413103435, 'learning_rate': 3.222118525431638e-07, 'completion_length': 199.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.6261905133724213, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5726191401481628, 'reward_std': 0.18998393416404724, 'kl': 0.90625, 'epoch': 0.68} 68%|██████▊ | 2905/4286 [18:28:31<15:13:52, 39.71s/it] 68%|██████▊ | 2906/4286 [18:28:58<13:40:50, 35.69s/it] {'loss': 0.0299, 'grad_norm': 2.01275150287899, 'learning_rate': 3.2197853476434904e-07, 'completion_length': 190.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.51488097012043, 
'rewards/format_reward': 1.0, 'reward': 1.5148810744285583, 'reward_std': 0.050381556153297424, 'kl': 0.7490234375, 'epoch': 0.68} 68%|██████▊ | 2906/4286 [18:28:58<13:40:50, 35.69s/it] 68%|██████▊ | 2907/4286 [18:29:18<11:52:02, 30.98s/it] {'loss': 0.0723, 'grad_norm': 5.706803998300713, 'learning_rate': 3.217452169855343e-07, 'completion_length': 174.96428680419922, 'rewards/only_full_func_accuracy_reward': 0.571428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715818405151, 'reward_std': 0.1401340626180172, 'kl': 1.8037109375, 'epoch': 0.68} 68%|██████▊ | 2907/4286 [18:29:18<11:52:02, 30.98s/it] 68%|██████▊ | 2908/4286 [18:29:41<10:59:45, 28.73s/it] {'loss': 0.0313, 'grad_norm': 3.501437976742093, 'learning_rate': 3.215118992067196e-07, 'completion_length': 185.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.5934524238109589, 'rewards/format_reward': 1.0, 'reward': 1.593452513217926, 'reward_std': 0.005952383857220411, 'kl': 0.78271484375, 'epoch': 0.68} 68%|██████▊ | 2908/4286 [18:29:41<10:59:45, 28.73s/it] 68%|██████▊ | 2909/4286 [18:30:06<10:36:09, 27.72s/it] {'loss': 0.0093, 'grad_norm': 2.52706186996887, 'learning_rate': 3.212785814279048e-07, 'completion_length': 208.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.06069137901067734, 'kl': 0.23193359375, 'epoch': 0.68} 68%|██████▊ | 2909/4286 [18:30:06<10:36:09, 27.72s/it] 68%|██████▊ | 2910/4286 [18:30:27<9:48:35, 25.67s/it] {'loss': 0.0254, 'grad_norm': 2.480838924076265, 'learning_rate': 3.210452636490901e-07, 'completion_length': 189.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7041667103767395, 'rewards/format_reward': 1.0, 'reward': 1.7041667699813843, 'reward_std': 0.06785715091973543, 'kl': 0.6328125, 'epoch': 0.68} 68%|██████▊ | 2910/4286 [18:30:27<9:48:35, 25.67s/it] 68%|██████▊ | 2911/4286 [18:30:52<9:38:37, 25.25s/it] {'loss': 0.0116, 
'grad_norm': 2.9835334196580217, 'learning_rate': 3.208119458702753e-07, 'completion_length': 222.21430206298828, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 1.0, 'reward': 1.6517857909202576, 'reward_std': 0.04031847417354584, 'kl': 0.2890625, 'epoch': 0.68} 68%|██████▊ | 2911/4286 [18:30:52<9:38:37, 25.25s/it] 68%|██████▊ | 2912/4286 [18:31:14<9:18:26, 24.39s/it] {'loss': 0.0124, 'grad_norm': 2.1041772785937125, 'learning_rate': 3.205786280914606e-07, 'completion_length': 173.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.7720925807952881, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7542355060577393, 'reward_std': 0.09131344594061375, 'kl': 0.31005859375, 'epoch': 0.68} 68%|██████▊ | 2912/4286 [18:31:14<9:18:26, 24.39s/it][2025-03-02 23:38:53,309] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 68%|██████▊ | 2913/4286 [18:31:37<9:12:08, 24.13s/it] {'loss': 0.0467, 'grad_norm': 4.147458627003793, 'learning_rate': 3.2034531031264586e-07, 'completion_length': 178.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6026785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5848215818405151, 'reward_std': 0.10257173888385296, 'kl': 1.16796875, 'epoch': 0.68} 68%|██████▊ | 2913/4286 [18:31:37<9:12:08, 24.13s/it] 68%|██████▊ | 2914/4286 [18:32:00<8:57:48, 23.52s/it] {'loss': 0.0104, 'grad_norm': 11.219169794509291, 'learning_rate': 3.201119925338311e-07, 'completion_length': 192.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5907739400863647, 'reward_std': 0.11404913384467363, 'kl': 0.259765625, 'epoch': 0.68} 68%|██████▊ | 2914/4286 [18:32:00<8:57:48, 23.52s/it] 68%|██████▊ | 2915/4286 [18:32:23<8:54:43, 23.40s/it] {'loss': 0.0253, 'grad_norm': 1.511692567782257, 'learning_rate': 3.1987867475501635e-07, 'completion_length': 154.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.0267857164144516, 'kl': 0.6328125, 'epoch': 0.68} 68%|██████▊ | 2915/4286 [18:32:23<8:54:43, 23.40s/it] 68%|██████▊ | 2916/4286 [18:32:49<9:16:02, 24.35s/it] {'loss': 0.0391, 'grad_norm': 9.582130572696327, 'learning_rate': 3.196453569762016e-07, 'completion_length': 240.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.569047600030899, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.533333420753479, 'reward_std': 0.23878053575754166, 'kl': 0.978515625, 'epoch': 0.68} 68%|██████▊ | 2916/4286 [18:32:49<9:16:02, 24.35s/it] 68%|██████▊ | 2917/4286 [18:33:13<9:15:02, 24.33s/it] 
{'loss': 0.035, 'grad_norm': 14.237430828948327, 'learning_rate': 3.1941203919738685e-07, 'completion_length': 211.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.6238095760345459, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.605952501296997, 'reward_std': 0.1588207446038723, 'kl': 0.875, 'epoch': 0.68} 68%|██████▊ | 2917/4286 [18:33:13<9:15:02, 24.33s/it][2025-03-02 23:40:55,841] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 68%|██████▊ | 2918/4286 [18:33:40<9:29:19, 24.97s/it] {'loss': 0.0792, 'grad_norm': 8.330143170350738, 'learning_rate': 3.1917872141857213e-07, 'completion_length': 184.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6247023940086365, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5889882445335388, 'reward_std': 0.21017464250326157, 'kl': 1.98046875, 'epoch': 0.68} 68%|██████▊ | 2918/4286 [18:33:40<9:29:19, 24.97s/it] 68%|██████▊ | 2919/4286 [18:33:58<8:44:29, 23.02s/it] {'loss': 0.0105, 'grad_norm': 4.468686990808265, 'learning_rate': 3.1894540363975735e-07, 'completion_length': 159.5, 'rewards/only_full_func_accuracy_reward': 0.7380953431129456, 'rewards/format_reward': 1.0, 'reward': 1.7380954027175903, 'reward_std': 0.0595238134264946, 'kl': 0.26318359375, 'epoch': 0.68} 68%|██████▊ | 2919/4286 [18:33:58<8:44:29, 23.02s/it][2025-03-02 23:41:37,451] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 68%|██████▊ | 2920/4286 [18:34:22<8:44:54, 23.06s/it] {'loss': 0.0431, 'grad_norm': 7.739867917613155, 'learning_rate': 3.187120858609426e-07, 'completion_length': 187.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5367772728204727, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5189201831817627, 'reward_std': 0.13164425641298294, 'kl': 1.078125, 'epoch': 0.68} 68%|██████▊ | 2920/4286 [18:34:22<8:44:54, 23.06s/it] 68%|██████▊ | 2921/4286 [18:34:44<8:42:56, 22.99s/it] {'loss': 0.0763, 'grad_norm': 8.010246205178278, 'learning_rate': 3.1847876808212785e-07, 'completion_length': 176.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5669642984867096, 'rewards/format_reward': 1.0, 'reward': 1.5669643878936768, 'reward_std': 0.10389559343457222, 'kl': 1.90625, 'epoch': 0.68} 68%|██████▊ | 2921/4286 [18:34:44<8:42:56, 22.99s/it] 68%|██████▊ | 2922/4286 [18:35:05<8:27:17, 22.32s/it] {'loss': 0.0428, 'grad_norm': 3.0803743043138287, 'learning_rate': 3.182454503033131e-07, 'completion_length': 173.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5014881640672684, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4836310744285583, 'reward_std': 0.09226190485060215, 'kl': 1.0732421875, 'epoch': 0.68} 68%|██████▊ | 2922/4286 [18:35:05<8:27:17, 22.32s/it][2025-03-02 23:42:43,759] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 68%|██████▊ | 2923/4286 [18:35:28<8:29:47, 22.44s/it] {'loss': 0.0754, 'grad_norm': 4.799875255021824, 'learning_rate': 3.180121325244984e-07, 'completion_length': 179.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6279762983322144, 'reward_std': 0.22192221879959106, 'kl': 1.8828125, 'epoch': 0.68} 68%|██████▊ | 2923/4286 [18:35:28<8:29:47, 22.44s/it] 68%|██████▊ | 2924/4286 [18:35:48<8:10:28, 21.61s/it] {'loss': 0.0527, 'grad_norm': 20.93872003316018, 'learning_rate': 3.177788147456836e-07, 'completion_length': 179.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6532739102840424, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6175596117973328, 'reward_std': 0.18289445340633392, 'kl': 1.31640625, 'epoch': 0.68} 68%|██████▊ | 2924/4286 [18:35:48<8:10:28, 21.61s/it] 68%|██████▊ | 2925/4286 [18:36:07<7:57:00, 21.03s/it] {'loss': 0.0362, 'grad_norm': 5.6364780835497, 'learning_rate': 3.175454969668689e-07, 'completion_length': 170.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6306547522544861, 'rewards/format_reward': 1.0, 'reward': 1.6306548118591309, 'reward_std': 0.11645298451185226, 'kl': 0.90625, 'epoch': 0.68} 68%|██████▊ | 2925/4286 [18:36:07<7:57:00, 21.03s/it] 68%|██████▊ | 2926/4286 [18:36:28<7:55:01, 20.96s/it] {'loss': 0.0739, 'grad_norm': 3.9277391745761596, 'learning_rate': 3.173121791880541e-07, 'completion_length': 177.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5961309671401978, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5604166984558105, 'reward_std': 0.18182791024446487, 'kl': 1.845703125, 'epoch': 0.68} 68%|██████▊ | 2926/4286 [18:36:28<7:55:01, 20.96s/it] 68%|██████▊ | 2927/4286 [18:36:48<7:46:59, 20.62s/it] 
{'loss': 0.0493, 'grad_norm': 3.840503915515197, 'learning_rate': 3.170788614092394e-07, 'completion_length': 170.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.5967262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5788691639900208, 'reward_std': 0.14161648601293564, 'kl': 1.23388671875, 'epoch': 0.68} 68%|██████▊ | 2927/4286 [18:36:48<7:46:59, 20.62s/it] 68%|██████▊ | 2928/4286 [18:37:10<8:00:10, 21.22s/it] {'loss': 0.0469, 'grad_norm': 5.151140609918838, 'learning_rate': 3.1684554363042467e-07, 'completion_length': 180.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6457589566707611, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6279017925262451, 'reward_std': 0.1111062541604042, 'kl': 1.171875, 'epoch': 0.68} 68%|██████▊ | 2928/4286 [18:37:10<8:00:10, 21.22s/it] 68%|██████▊ | 2929/4286 [18:37:28<7:35:02, 20.12s/it] {'loss': 0.0226, 'grad_norm': 3.327652969991734, 'learning_rate': 3.166122258516099e-07, 'completion_length': 170.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6669643223285675, 'rewards/format_reward': 1.0, 'reward': 1.6669644117355347, 'reward_std': 0.08243520930409431, 'kl': 0.5654296875, 'epoch': 0.68} 68%|██████▊ | 2929/4286 [18:37:28<7:35:02, 20.12s/it] 68%|██████▊ | 2930/4286 [18:37:51<7:52:08, 20.89s/it] {'loss': 0.0279, 'grad_norm': 13.553503282141238, 'learning_rate': 3.1637890807279516e-07, 'completion_length': 203.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5040922909975052, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.450520932674408, 'reward_std': 0.09257859457284212, 'kl': 0.69775390625, 'epoch': 0.68} 68%|██████▊ | 2930/4286 [18:37:51<7:52:08, 20.89s/it] 68%|██████▊ | 2931/4286 [18:38:15<8:13:47, 21.87s/it] {'loss': 0.0659, 'grad_norm': 9.723048688600308, 'learning_rate': 3.1614559029398044e-07, 'completion_length': 182.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.6183036267757416, 'rewards/format_reward': 0.9642857313156128, 
'reward': 1.5825893878936768, 'reward_std': 0.27551741898059845, 'kl': 1.6484375, 'epoch': 0.68} 68%|██████▊ | 2931/4286 [18:38:15<8:13:47, 21.87s/it] 68%|██████▊ | 2932/4286 [18:38:33<7:47:59, 20.74s/it] {'loss': 0.0262, 'grad_norm': 3.9419290817975523, 'learning_rate': 3.1591227251516566e-07, 'completion_length': 170.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.7619048058986664, 'rewards/format_reward': 1.0, 'reward': 1.7619049549102783, 'reward_std': 0.08773225918412209, 'kl': 0.65673828125, 'epoch': 0.68} 68%|██████▊ | 2932/4286 [18:38:33<7:47:59, 20.74s/it] 68%|██████▊ | 2933/4286 [18:38:58<8:14:37, 21.93s/it] {'loss': 0.1604, 'grad_norm': 10.21916107566806, 'learning_rate': 3.1567895473635094e-07, 'completion_length': 202.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.4017857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3660714626312256, 'reward_std': 0.14538798108696938, 'kl': 4.015625, 'epoch': 0.68} 68%|██████▊ | 2933/4286 [18:38:58<8:14:37, 21.93s/it] 68%|██████▊ | 2934/4286 [18:39:16<7:52:12, 20.96s/it] {'loss': 0.0625, 'grad_norm': 13.830818269583418, 'learning_rate': 3.1544563695753616e-07, 'completion_length': 157.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5352891385555267, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.517432153224945, 'reward_std': 0.10044588893651962, 'kl': 1.5625, 'epoch': 0.68} 68%|██████▊ | 2934/4286 [18:39:16<7:52:12, 20.96s/it] 68%|██████▊ | 2935/4286 [18:39:36<7:42:53, 20.56s/it] {'loss': 0.0473, 'grad_norm': 7.274192943650406, 'learning_rate': 3.1521231917872143e-07, 'completion_length': 175.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6377976536750793, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6199406385421753, 'reward_std': 0.16096476092934608, 'kl': 1.1796875, 'epoch': 0.68} 68%|██████▊ | 2935/4286 [18:39:36<7:42:53, 20.56s/it] 69%|██████▊ | 2936/4286 [18:39:56<7:41:10, 20.50s/it] {'loss': 0.0701, 'grad_norm': 
3.4416252404888144, 'learning_rate': 3.149790013999067e-07, 'completion_length': 178.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5907738655805588, 'rewards/format_reward': 1.0, 'reward': 1.5907739400863647, 'reward_std': 0.08938459306955338, 'kl': 1.75390625, 'epoch': 0.69} 69%|██████▊ | 2936/4286 [18:39:56<7:41:10, 20.50s/it] 69%|██████▊ | 2937/4286 [18:40:17<7:40:20, 20.48s/it] {'loss': 0.0595, 'grad_norm': 8.13496786559377, 'learning_rate': 3.1474568362109193e-07, 'completion_length': 173.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5550596117973328, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.537202537059784, 'reward_std': 0.13065173104405403, 'kl': 1.486328125, 'epoch': 0.69} 69%|██████▊ | 2937/4286 [18:40:17<7:40:20, 20.48s/it] 69%|██████▊ | 2938/4286 [18:40:35<7:26:37, 19.88s/it] {'loss': 0.0124, 'grad_norm': 11.537166747864964, 'learning_rate': 3.145123658422772e-07, 'completion_length': 162.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.5750000476837158, 'rewards/format_reward': 1.0, 'reward': 1.5750000476837158, 'reward_std': 0.08963518217206001, 'kl': 0.31005859375, 'epoch': 0.69} 69%|██████▊ | 2938/4286 [18:40:35<7:26:37, 19.88s/it][2025-03-02 23:48:14,635] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 69%|██████▊ | 2939/4286 [18:40:59<7:50:45, 20.97s/it] {'loss': 0.0453, 'grad_norm': 9.365706173428674, 'learning_rate': 3.1427904806346243e-07, 'completion_length': 202.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.46071429550647736, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4250000715255737, 'reward_std': 0.18374374508857727, 'kl': 1.130859375, 'epoch': 0.69} 69%|██████▊ | 2939/4286 [18:40:59<7:50:45, 20.97s/it] 69%|██████▊ | 2940/4286 [18:41:18<7:37:21, 20.39s/it] {'loss': 0.033, 'grad_norm': 44.745478791760945, 'learning_rate': 3.140457302846477e-07, 'completion_length': 168.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.6297619342803955, 'rewards/format_reward': 1.0, 'reward': 1.6297619938850403, 'reward_std': 0.10900448635220528, 'kl': 0.82421875, 'epoch': 0.69} 69%|██████▊ | 2940/4286 [18:41:18<7:37:21, 20.39s/it] 69%|██████▊ | 2941/4286 [18:41:36<7:22:53, 19.76s/it] {'loss': 0.0153, 'grad_norm': 1.5451518269143765, 'learning_rate': 3.13812412505833e-07, 'completion_length': 164.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.062484078109264374, 'kl': 0.3818359375, 'epoch': 0.69} 69%|██████▊ | 2941/4286 [18:41:36<7:22:53, 19.76s/it][2025-03-02 23:49:14,786] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 69%|██████▊ | 2942/4286 [18:41:59<7:43:14, 20.68s/it] {'loss': 0.0192, 'grad_norm': 7.937639494462676, 'learning_rate': 3.135790947270182e-07, 'completion_length': 187.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6711310148239136, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6532739400863647, 'reward_std': 0.0863095298409462, 'kl': 0.47998046875, 'epoch': 0.69} 69%|██████▊ | 2942/4286 [18:41:59<7:43:14, 20.68s/it] 69%|██████▊ | 2943/4286 [18:42:21<7:54:25, 21.20s/it] {'loss': 0.0189, 'grad_norm': 3.622299373937668, 'learning_rate': 3.133457769482035e-07, 'completion_length': 177.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.05929713882505894, 'kl': 0.47265625, 'epoch': 0.69} 69%|██████▊ | 2943/4286 [18:42:21<7:54:25, 21.20s/it] 69%|██████▊ | 2944/4286 [18:42:45<8:08:38, 21.85s/it] {'loss': 0.0426, 'grad_norm': 8.725961237830699, 'learning_rate': 3.131124591693887e-07, 'completion_length': 185.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.4975198358297348, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.479662835597992, 'reward_std': 0.09497880190610886, 'kl': 1.0634765625, 'epoch': 0.69} 69%|██████▊ | 2944/4286 [18:42:45<8:08:38, 21.85s/it] 69%|██████▊ | 2945/4286 [18:43:05<7:57:22, 21.36s/it] {'loss': 0.0368, 'grad_norm': 3.5527321868934556, 'learning_rate': 3.1287914139057397e-07, 'completion_length': 177.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6101191639900208, 'reward_std': 0.10671550035476685, 'kl': 0.919921875, 'epoch': 0.69} 69%|██████▊ | 2945/4286 [18:43:05<7:57:22, 21.36s/it] 69%|██████▊ | 2946/4286 [18:43:24<7:44:52, 
20.81s/it] {'loss': 0.0245, 'grad_norm': 4.499716634212593, 'learning_rate': 3.1264582361175925e-07, 'completion_length': 177.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.617559552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5997024774551392, 'reward_std': 0.11369924806058407, 'kl': 0.615234375, 'epoch': 0.69} 69%|██████▊ | 2946/4286 [18:43:24<7:44:52, 20.81s/it] 69%|██████▉ | 2947/4286 [18:43:43<7:28:33, 20.10s/it] {'loss': 0.0264, 'grad_norm': 5.545513916496268, 'learning_rate': 3.124125058329444e-07, 'completion_length': 163.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.7022534608840942, 'rewards/format_reward': 1.0, 'reward': 1.7022535800933838, 'reward_std': 0.08255954459309578, 'kl': 0.66015625, 'epoch': 0.69} 69%|██████▉ | 2947/4286 [18:43:43<7:28:33, 20.10s/it][2025-03-02 23:51:21,626] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 69%|██████▉ | 2948/4286 [18:44:06<7:46:48, 20.93s/it] {'loss': 0.0246, 'grad_norm': 43.87901390175217, 'learning_rate': 3.121791880541297e-07, 'completion_length': 171.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6586309671401978, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6229166984558105, 'reward_std': 0.06891651265323162, 'kl': 0.6162109375, 'epoch': 0.69} 69%|██████▉ | 2948/4286 [18:44:06<7:46:48, 20.93s/it] 69%|██████▉ | 2949/4286 [18:44:23<7:20:54, 19.79s/it] {'loss': 0.0078, 'grad_norm': 2.735499743792965, 'learning_rate': 3.119458702753149e-07, 'completion_length': 154.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.7282738089561462, 'rewards/format_reward': 1.0, 'reward': 1.728273868560791, 'reward_std': 0.07337586395442486, 'kl': 0.1962890625, 'epoch': 0.69} 69%|██████▉ | 2949/4286 [18:44:23<7:20:54, 19.79s/it] 69%|██████▉ | 2950/4286 [18:44:44<7:27:27, 20.10s/it] {'loss': 0.0674, 'grad_norm': 10.273507514447056, 'learning_rate': 3.117125524965002e-07, 'completion_length': 168.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5818453431129456, 'reward_std': 0.0917550902813673, 'kl': 1.6953125, 'epoch': 0.69} 69%|██████▉ | 2950/4286 [18:44:44<7:27:27, 20.10s/it] 69%|██████▉ | 2951/4286 [18:45:04<7:25:45, 20.03s/it] {'loss': 0.0228, 'grad_norm': 23.76868235153523, 'learning_rate': 3.1147923471768547e-07, 'completion_length': 191.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.5059524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4880953431129456, 'reward_std': 0.07504807412624359, 'kl': 0.5693359375, 'epoch': 0.69} 69%|██████▉ | 2951/4286 [18:45:04<7:25:45, 20.03s/it] 69%|██████▉ | 2952/4286 [18:45:24<7:29:43, 
20.23s/it] {'loss': 0.0583, 'grad_norm': 8.181575462191748, 'learning_rate': 3.112459169388707e-07, 'completion_length': 184.83928680419922, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.05028875544667244, 'kl': 1.458984375, 'epoch': 0.69} 69%|██████▉ | 2952/4286 [18:45:24<7:29:43, 20.23s/it][2025-03-02 23:52:59,765] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 69%|██████▉ | 2953/4286 [18:45:44<7:25:29, 20.05s/it] {'loss': 0.0172, 'grad_norm': 9.259230194141876, 'learning_rate': 3.1101259916005596e-07, 'completion_length': 164.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.5500000417232513, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5321428775787354, 'reward_std': 0.09026157855987549, 'kl': 0.4296875, 'epoch': 0.69} 69%|██████▉ | 2953/4286 [18:45:44<7:25:29, 20.05s/it] 69%|██████▉ | 2954/4286 [18:46:00<7:02:05, 19.01s/it] {'loss': 0.0083, 'grad_norm': 3.787659915763744, 'learning_rate': 3.107792813812412e-07, 'completion_length': 159.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.816964328289032, 'rewards/format_reward': 1.0, 'reward': 1.8169643878936768, 'reward_std': 0.03418238554149866, 'kl': 0.2080078125, 'epoch': 0.69} 69%|██████▉ | 2954/4286 [18:46:00<7:02:05, 19.01s/it] 69%|██████▉ | 2955/4286 [18:46:27<7:49:52, 21.18s/it] {'loss': 0.069, 'grad_norm': 52.16814396263955, 'learning_rate': 3.1054596360242646e-07, 'completion_length': 213.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5098684579133987, 'rewards/format_reward': 
0.8571428954601288, 'reward': 1.3670113682746887, 'reward_std': 0.2826315760612488, 'kl': 1.7265625, 'epoch': 0.69} 69%|██████▉ | 2955/4286 [18:46:27<7:49:52, 21.18s/it] 69%|██████▉ | 2956/4286 [18:46:45<7:30:49, 20.34s/it] {'loss': 0.0232, 'grad_norm': 5.60080908305767, 'learning_rate': 3.1031264582361174e-07, 'completion_length': 165.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.7565476894378662, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7386906147003174, 'reward_std': 0.08759628236293793, 'kl': 0.5791015625, 'epoch': 0.69} 69%|██████▉ | 2956/4286 [18:46:45<7:30:49, 20.34s/it] 69%|██████▉ | 2957/4286 [18:47:03<7:17:41, 19.76s/it] {'loss': 0.0174, 'grad_norm': 6.108395886013299, 'learning_rate': 3.1007932804479696e-07, 'completion_length': 180.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.7187500596046448, 'rewards/format_reward': 1.0, 'reward': 1.7187501192092896, 'reward_std': 0.07305656746029854, 'kl': 0.435546875, 'epoch': 0.69} 69%|██████▉ | 2957/4286 [18:47:04<7:17:41, 19.76s/it] 69%|██████▉ | 2958/4286 [18:47:23<7:12:48, 19.55s/it] {'loss': 0.0406, 'grad_norm': 3.518942483526161, 'learning_rate': 3.0984601026598223e-07, 'completion_length': 170.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7302296459674835, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.712372601032257, 'reward_std': 0.14660073071718216, 'kl': 1.0166015625, 'epoch': 0.69} 69%|██████▉ | 2958/4286 [18:47:23<7:12:48, 19.55s/it] 69%|██████▉ | 2959/4286 [18:47:44<7:24:34, 20.10s/it] {'loss': 0.0672, 'grad_norm': 5.465851513065315, 'learning_rate': 3.096126924871675e-07, 'completion_length': 184.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.5982143431901932, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803572535514832, 'reward_std': 0.11851610988378525, 'kl': 1.6796875, 'epoch': 0.69} 69%|██████▉ | 2959/4286 [18:47:44<7:24:34, 20.10s/it] 69%|██████▉ | 2960/4286 [18:48:02<7:11:32, 19.53s/it] {'loss': 
0.066, 'grad_norm': 13.366726689767654, 'learning_rate': 3.0937937470835273e-07, 'completion_length': 180.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.5744048058986664, 'rewards/format_reward': 1.0, 'reward': 1.5744048357009888, 'reward_std': 0.12246337160468102, 'kl': 1.65234375, 'epoch': 0.69} 69%|██████▉ | 2960/4286 [18:48:02<7:11:32, 19.53s/it] 69%|██████▉ | 2961/4286 [18:48:20<7:03:20, 19.17s/it] {'loss': 0.0736, 'grad_norm': 22.066477754919013, 'learning_rate': 3.09146056929538e-07, 'completion_length': 162.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6250000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.607142984867096, 'reward_std': 0.10259868483990431, 'kl': 1.8369140625, 'epoch': 0.69} 69%|██████▉ | 2961/4286 [18:48:20<7:03:20, 19.17s/it] 69%|██████▉ | 2962/4286 [18:48:42<7:21:49, 20.02s/it] {'loss': 0.0358, 'grad_norm': 3.833260849062481, 'learning_rate': 3.0891273915072323e-07, 'completion_length': 189.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5825893580913544, 'rewards/format_reward': 1.0, 'reward': 1.5825893878936768, 'reward_std': 0.09323276206851006, 'kl': 0.89453125, 'epoch': 0.69} 69%|██████▉ | 2962/4286 [18:48:42<7:21:49, 20.02s/it] 69%|██████▉ | 2963/4286 [18:49:02<7:16:16, 19.79s/it] {'loss': 0.0172, 'grad_norm': 2.261702421761464, 'learning_rate': 3.086794213719085e-07, 'completion_length': 182.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.736607164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7187500596046448, 'reward_std': 0.0803571417927742, 'kl': 0.4296875, 'epoch': 0.69} 69%|██████▉ | 2963/4286 [18:49:02<7:16:16, 19.79s/it] 69%|██████▉ | 2964/4286 [18:49:20<7:06:39, 19.36s/it] {'loss': 0.0169, 'grad_norm': 3.0843244226088533, 'learning_rate': 3.084461035930938e-07, 'completion_length': 178.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6696429252624512, 
'reward_std': 0.1069922149181366, 'kl': 0.42138671875, 'epoch': 0.69} 69%|██████▉ | 2964/4286 [18:49:20<7:06:39, 19.36s/it] 69%|██████▉ | 2965/4286 [18:49:41<7:19:14, 19.95s/it] {'loss': 0.0129, 'grad_norm': 2.442745626314623, 'learning_rate': 3.08212785814279e-07, 'completion_length': 187.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.6550595462322235, 'rewards/format_reward': 1.0, 'reward': 1.655059576034546, 'reward_std': 0.008438955061137676, 'kl': 0.3212890625, 'epoch': 0.69} 69%|██████▉ | 2965/4286 [18:49:41<7:19:14, 19.95s/it] 69%|██████▉ | 2966/4286 [18:50:00<7:08:38, 19.48s/it] {'loss': 0.0522, 'grad_norm': 16.630838594184862, 'learning_rate': 3.079794680354643e-07, 'completion_length': 191.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.6958333551883698, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6779763102531433, 'reward_std': 0.16063217446208, 'kl': 1.30859375, 'epoch': 0.69} 69%|██████▉ | 2966/4286 [18:50:00<7:08:38, 19.48s/it] 69%|██████▉ | 2967/4286 [18:50:19<7:03:09, 19.25s/it] {'loss': 0.0384, 'grad_norm': 25.651091418653678, 'learning_rate': 3.077461502566495e-07, 'completion_length': 190.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.08978978171944618, 'kl': 0.9619140625, 'epoch': 0.69} 69%|██████▉ | 2967/4286 [18:50:19<7:03:09, 19.25s/it] 69%|██████▉ | 2968/4286 [18:50:38<7:01:47, 19.20s/it] {'loss': 0.0418, 'grad_norm': 53.17003275895316, 'learning_rate': 3.0751283247783477e-07, 'completion_length': 183.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 1.0, 'reward': 1.691964328289032, 'reward_std': 0.08930627629160881, 'kl': 1.0458984375, 'epoch': 0.69} 69%|██████▉ | 2968/4286 [18:50:38<7:01:47, 19.20s/it] 69%|██████▉ | 2969/4286 [18:50:57<7:04:20, 19.33s/it] {'loss': 0.0133, 'grad_norm': 0.9428194705198207, 'learning_rate': 3.0727951469902005e-07, 
'completion_length': 180.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7500000596046448, 'reward_std': 0.06185895949602127, 'kl': 0.33154296875, 'epoch': 0.69} 69%|██████▉ | 2969/4286 [18:50:57<7:04:20, 19.33s/it] 69%|██████▉ | 2970/4286 [18:51:14<6:49:45, 18.68s/it] {'loss': 0.0442, 'grad_norm': 3.1577293787159864, 'learning_rate': 3.0704619692020527e-07, 'completion_length': 159.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949406266212463, 'reward_std': 0.09460101276636124, 'kl': 1.103515625, 'epoch': 0.69} 69%|██████▉ | 2970/4286 [18:51:14<6:49:45, 18.68s/it] 69%|██████▉ | 2971/4286 [18:51:34<6:52:20, 18.81s/it] {'loss': 0.0424, 'grad_norm': 6.546702497862986, 'learning_rate': 3.0681287914139054e-07, 'completion_length': 186.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6627976596355438, 'rewards/format_reward': 1.0, 'reward': 1.662797749042511, 'reward_std': 0.09566834568977356, 'kl': 1.060546875, 'epoch': 0.69} 69%|██████▉ | 2971/4286 [18:51:34<6:52:20, 18.81s/it][2025-03-02 23:59:09,621] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 69%|██████▉ | 2972/4286 [18:51:54<7:01:14, 19.24s/it] {'loss': 0.0291, 'grad_norm': 7.252074837207819, 'learning_rate': 3.0657956136257577e-07, 'completion_length': 164.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6279762387275696, 'reward_std': 0.10990536771714687, 'kl': 0.7275390625, 'epoch': 0.69} 69%|██████▉ | 2972/4286 [18:51:54<7:01:14, 19.24s/it] 69%|██████▉ | 2973/4286 [18:52:11<6:50:40, 18.77s/it] {'loss': 0.0317, 'grad_norm': 4.771832121252948, 'learning_rate': 3.0634624358376104e-07, 'completion_length': 153.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7357143461704254, 'rewards/format_reward': 1.0, 'reward': 1.7357143759727478, 'reward_std': 0.05454729590564966, 'kl': 0.79248046875, 'epoch': 0.69} 69%|██████▉ | 2973/4286 [18:52:11<6:50:40, 18.77s/it] 69%|██████▉ | 2974/4286 [18:52:36<7:26:48, 20.43s/it] {'loss': 0.0656, 'grad_norm': 18.34208861324344, 'learning_rate': 3.061129258049463e-07, 'completion_length': 194.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6132440865039825, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5596727132797241, 'reward_std': 0.19527338445186615, 'kl': 1.640625, 'epoch': 0.69} 69%|██████▉ | 2974/4286 [18:52:36<7:26:48, 20.43s/it] 69%|██████▉ | 2975/4286 [18:52:55<7:18:39, 20.08s/it] {'loss': 0.0304, 'grad_norm': 3.820049282456782, 'learning_rate': 3.0587960802613154e-07, 'completion_length': 181.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.7360119223594666, 'rewards/format_reward': 1.0, 'reward': 1.7360119819641113, 'reward_std': 0.06369047984480858, 'kl': 0.7626953125, 'epoch': 0.69} 69%|██████▉ | 2975/4286 [18:52:55<7:18:39, 20.08s/it] 69%|██████▉ | 2976/4286 [18:53:16<7:23:26, 20.31s/it] {'loss': 
0.028, 'grad_norm': 7.282976400370073, 'learning_rate': 3.056462902473168e-07, 'completion_length': 177.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.6086310148239136, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.1408017985522747, 'kl': 0.7001953125, 'epoch': 0.69} 69%|██████▉ | 2976/4286 [18:53:16<7:23:26, 20.31s/it] 69%|██████▉ | 2977/4286 [18:53:35<7:16:40, 20.02s/it] {'loss': 0.0284, 'grad_norm': 7.873721855349452, 'learning_rate': 3.0541297246850204e-07, 'completion_length': 178.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.5967262089252472, 'rewards/format_reward': 1.0, 'reward': 1.5967262983322144, 'reward_std': 0.06471596751362085, 'kl': 0.70947265625, 'epoch': 0.69} 69%|██████▉ | 2977/4286 [18:53:35<7:16:40, 20.02s/it] 69%|██████▉ | 2978/4286 [18:53:58<7:31:35, 20.72s/it] {'loss': 0.0376, 'grad_norm': 6.751680191998175, 'learning_rate': 3.051796546896873e-07, 'completion_length': 205.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6130952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.595238208770752, 'reward_std': 0.1145835816860199, 'kl': 0.9404296875, 'epoch': 0.69} 69%|██████▉ | 2978/4286 [18:53:58<7:31:35, 20.72s/it] 70%|██████▉ | 2979/4286 [18:54:18<7:31:17, 20.72s/it] {'loss': 0.0351, 'grad_norm': 15.04132524726594, 'learning_rate': 3.049463369108726e-07, 'completion_length': 194.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6862245202064514, 'rewards/format_reward': 1.0, 'reward': 1.6862245798110962, 'reward_std': 0.1021465603262186, 'kl': 0.87890625, 'epoch': 0.7} 70%|██████▉ | 2979/4286 [18:54:18<7:31:17, 20.72s/it] 70%|██████▉ | 2980/4286 [18:54:36<7:12:58, 19.89s/it] {'loss': 0.0086, 'grad_norm': 11.476752056440686, 'learning_rate': 3.047130191320578e-07, 'completion_length': 184.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 1.0, 'reward': 1.6116072535514832, 'reward_std': 
0.038690478540956974, 'kl': 0.21435546875, 'epoch': 0.7} 70%|██████▉ | 2980/4286 [18:54:36<7:12:58, 19.89s/it] 70%|██████▉ | 2981/4286 [18:54:55<7:03:10, 19.46s/it] {'loss': 0.0205, 'grad_norm': 1.649002391932106, 'learning_rate': 3.044797013532431e-07, 'completion_length': 176.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6130952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.02380952052772045, 'kl': 0.51171875, 'epoch': 0.7} 70%|██████▉ | 2981/4286 [18:54:55<7:03:10, 19.46s/it] 70%|██████▉ | 2982/4286 [18:55:12<6:49:14, 18.83s/it] {'loss': 0.0687, 'grad_norm': 5.297390325810126, 'learning_rate': 3.0424638357442836e-07, 'completion_length': 164.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.6050595343112946, 'rewards/format_reward': 1.0, 'reward': 1.6050595045089722, 'reward_std': 0.06700291484594345, 'kl': 1.71484375, 'epoch': 0.7} 70%|██████▉ | 2982/4286 [18:55:12<6:49:14, 18.83s/it] 70%|██████▉ | 2983/4286 [18:55:33<7:01:06, 19.39s/it] {'loss': 0.0433, 'grad_norm': 2.2363993409276666, 'learning_rate': 3.040130657956136e-07, 'completion_length': 190.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5412415266036987, 'rewards/format_reward': 1.0, 'reward': 1.5412415862083435, 'reward_std': 0.06908904016017914, 'kl': 1.083984375, 'epoch': 0.7} 70%|██████▉ | 2983/4286 [18:55:33<7:01:06, 19.39s/it] 70%|██████▉ | 2984/4286 [18:55:52<7:00:18, 19.37s/it] {'loss': 0.0253, 'grad_norm': 6.045645691343047, 'learning_rate': 3.0377974801679886e-07, 'completion_length': 196.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7617560029029846, 'rewards/format_reward': 1.0, 'reward': 1.7617561221122742, 'reward_std': 0.09008084796369076, 'kl': 0.6337890625, 'epoch': 0.7} 70%|██████▉ | 2984/4286 [18:55:52<7:00:18, 19.37s/it] 70%|██████▉ | 2985/4286 [18:56:13<7:12:17, 19.94s/it] {'loss': 0.0515, 'grad_norm': 8.326891737670433, 'learning_rate': 3.035464302379841e-07, 'completion_length': 
180.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.6525298058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.63467276096344, 'reward_std': 0.15479976311326027, 'kl': 1.29296875, 'epoch': 0.7} 70%|██████▉ | 2985/4286 [18:56:13<7:12:17, 19.94s/it] 70%|██████▉ | 2986/4286 [18:56:32<7:00:49, 19.42s/it] {'loss': 0.0689, 'grad_norm': 2.027278652429842, 'learning_rate': 3.0331311245916935e-07, 'completion_length': 175.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7062500417232513, 'rewards/format_reward': 1.0, 'reward': 1.7062501311302185, 'reward_std': 0.11460258811712265, 'kl': 1.72021484375, 'epoch': 0.7} 70%|██████▉ | 2986/4286 [18:56:32<7:00:49, 19.42s/it] 70%|██████▉ | 2987/4286 [18:56:54<7:21:04, 20.37s/it] {'loss': 0.0158, 'grad_norm': 6.6811456212997085, 'learning_rate': 3.0307979468035463e-07, 'completion_length': 182.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6431548297405243, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6252976655960083, 'reward_std': 0.06342826131731272, 'kl': 0.3955078125, 'epoch': 0.7} 70%|██████▉ | 2987/4286 [18:56:54<7:21:04, 20.37s/it] 70%|██████▉ | 2988/4286 [18:57:13<7:14:21, 20.08s/it] {'loss': 0.0324, 'grad_norm': 4.572516545074906, 'learning_rate': 3.0284647690153985e-07, 'completion_length': 175.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6056548953056335, 'reward_std': 0.07818738743662834, 'kl': 0.81005859375, 'epoch': 0.7} 70%|██████▉ | 2988/4286 [18:57:13<7:14:21, 20.08s/it] 70%|██████▉ | 2989/4286 [18:57:32<7:05:53, 19.70s/it] {'loss': 0.0396, 'grad_norm': 6.818833994587191, 'learning_rate': 3.0261315912272513e-07, 'completion_length': 181.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.65327388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6354167461395264, 'reward_std': 0.07934350613504648, 'kl': 0.9892578125, 'epoch': 0.7} 70%|██████▉ | 2989/4286 
[18:57:32<7:05:53, 19.70s/it] 70%|██████▉ | 2990/4286 [18:57:52<7:02:36, 19.56s/it] {'loss': 0.0358, 'grad_norm': 6.682888437937837, 'learning_rate': 3.0237984134391035e-07, 'completion_length': 191.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5758929252624512, 'rewards/format_reward': 1.0, 'reward': 1.5758929252624512, 'reward_std': 0.09144651144742966, 'kl': 0.892578125, 'epoch': 0.7} 70%|██████▉ | 2990/4286 [18:57:52<7:02:36, 19.56s/it] 70%|██████▉ | 2991/4286 [18:58:12<7:07:39, 19.81s/it] {'loss': 0.066, 'grad_norm': 2.188399792148935, 'learning_rate': 3.021465235650956e-07, 'completion_length': 194.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6130952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.11329300329089165, 'kl': 1.654296875, 'epoch': 0.7} 70%|██████▉ | 2991/4286 [18:58:12<7:07:39, 19.81s/it] 70%|██████▉ | 2992/4286 [18:58:31<7:02:51, 19.61s/it] {'loss': 0.1152, 'grad_norm': 5.414338068024859, 'learning_rate': 3.019132057862809e-07, 'completion_length': 166.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.6622024476528168, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6086310744285583, 'reward_std': 0.25212543457746506, 'kl': 2.8828125, 'epoch': 0.7} 70%|██████▉ | 2992/4286 [18:58:31<7:02:51, 19.61s/it] 70%|██████▉ | 2993/4286 [18:58:50<6:57:54, 19.39s/it] {'loss': 0.0132, 'grad_norm': 7.946495170689279, 'learning_rate': 3.016798880074661e-07, 'completion_length': 188.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6279762387275696, 'reward_std': 0.1130952425301075, 'kl': 0.330078125, 'epoch': 0.7} 70%|██████▉ | 2993/4286 [18:58:50<6:57:54, 19.39s/it] 70%|██████▉ | 2994/4286 [18:59:08<6:48:41, 18.98s/it] {'loss': 0.03, 'grad_norm': 12.666623932101231, 'learning_rate': 3.014465702286514e-07, 'completion_length': 170.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.8250000476837158, 
'rewards/format_reward': 1.0, 'reward': 1.8250001072883606, 'reward_std': 0.08983943983912468, 'kl': 0.7509765625, 'epoch': 0.7} 70%|██████▉ | 2994/4286 [18:59:08<6:48:41, 18.98s/it][2025-03-03 00:06:43,933] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 70%|██████▉ | 2995/4286 [18:59:28<6:55:21, 19.30s/it] {'loss': 0.0364, 'grad_norm': 2.854825550029518, 'learning_rate': 3.012132524498366e-07, 'completion_length': 174.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6815477013587952, 'reward_std': 0.17189332097768784, 'kl': 0.90869140625, 'epoch': 0.7} 70%|██████▉ | 2995/4286 [18:59:28<6:55:21, 19.30s/it] 70%|██████▉ | 2996/4286 [18:59:47<6:51:14, 19.13s/it] {'loss': 0.0849, 'grad_norm': 5.119941886283461, 'learning_rate': 3.009799346710219e-07, 'completion_length': 172.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6279762089252472, 'rewards/format_reward': 1.0, 'reward': 1.6279762983322144, 'reward_std': 0.09740681573748589, 'kl': 2.125, 'epoch': 0.7} 70%|██████▉ | 2996/4286 [18:59:47<6:51:14, 19.13s/it] 70%|██████▉ | 2997/4286 [19:00:06<6:51:15, 19.14s/it] {'loss': 0.0109, 'grad_norm': 2.0950754994702265, 'learning_rate': 3.0074661689220717e-07, 'completion_length': 187.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.7619048655033112, 'rewards/format_reward': 1.0, 'reward': 1.7619048953056335, 'reward_std': 0.04761904664337635, 'kl': 0.2724609375, 'epoch': 0.7} 70%|██████▉ | 2997/4286 [19:00:06<6:51:15, 19.14s/it] 70%|██████▉ | 2998/4286 [19:00:25<6:52:00, 
19.19s/it] {'loss': 0.0767, 'grad_norm': 63.21283687336204, 'learning_rate': 3.005132991133924e-07, 'completion_length': 180.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.605654776096344, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5699405670166016, 'reward_std': 0.12667373567819595, 'kl': 1.91796875, 'epoch': 0.7} 70%|██████▉ | 2998/4286 [19:00:25<6:52:00, 19.19s/it] 70%|██████▉ | 2999/4286 [19:00:45<6:55:40, 19.38s/it] {'loss': 0.0457, 'grad_norm': 4.044775588527603, 'learning_rate': 3.0027998133457767e-07, 'completion_length': 176.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.7500708699226379, 'rewards/format_reward': 1.0, 'reward': 1.7500709295272827, 'reward_std': 0.08458881266415119, 'kl': 1.1484375, 'epoch': 0.7} 70%|██████▉ | 2999/4286 [19:00:45<6:55:40, 19.38s/it] 70%|██████▉ | 3000/4286 [19:01:05<6:58:03, 19.50s/it] {'loss': 0.065, 'grad_norm': 11.288063874965898, 'learning_rate': 3.000466635557629e-07, 'completion_length': 188.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6458334028720856, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.11493691802024841, 'kl': 1.625, 'epoch': 0.7} 70%|██████▉ | 3000/4286 [19:01:05<6:58:03, 19.50s/it] 70%|███████ | 3001/4286 [19:05:42<34:34:33, 96.87s/it] {'loss': 0.1023, 'grad_norm': 5.7043073536520605, 'learning_rate': 2.9981334577694816e-07, 'completion_length': 171.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038691639900208, 'reward_std': 0.06122583523392677, 'kl': 2.5546875, 'epoch': 0.7} 70%|███████ | 3001/4286 [19:05:42<34:34:33, 96.87s/it] 70%|███████ | 3002/4286 [19:06:02<26:20:27, 73.85s/it] {'loss': 0.0656, 'grad_norm': 7.534843756818717, 'learning_rate': 2.9958002799813344e-07, 'completion_length': 190.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5773810148239136, 
'reward_std': 0.1316695250570774, 'kl': 1.640625, 'epoch': 0.7} 70%|███████ | 3002/4286 [19:06:02<26:20:27, 73.85s/it]
70%|███████ | 3003/4286 [19:06:21<20:25:14, 57.30s/it] {'loss': 0.0461, 'grad_norm': 7.0410302350211245, 'learning_rate': 2.9934671021931866e-07, 'completion_length': 172.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.5913690626621246, 'rewards/format_reward': 1.0, 'reward': 1.5913691520690918, 'reward_std': 0.15450509265065193, 'kl': 1.15234375, 'epoch': 0.7}
70%|███████ | 3004/4286 [19:06:39<16:12:40, 45.52s/it] {'loss': 0.0537, 'grad_norm': 3.4156363794134035, 'learning_rate': 2.9911339244050394e-07, 'completion_length': 179.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7336309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.12535497918725014, 'kl': 1.34375, 'epoch': 0.7}
70%|███████ | 3005/4286 [19:07:00<13:31:59, 38.03s/it] {'loss': 0.0467, 'grad_norm': 2.6459921508862108, 'learning_rate': 2.988800746616892e-07, 'completion_length': 202.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.07267062738537788, 'kl': 1.1640625, 'epoch': 0.7}
70%|███████ | 3006/4286 [19:07:21<11:42:25, 32.93s/it] {'loss': 0.0338, 'grad_norm': 12.883024940369923, 'learning_rate': 2.9864675688287443e-07, 'completion_length': 175.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.7419643104076385, 'rewards/format_reward': 1.0, 'reward': 1.741964340209961, 'reward_std': 0.028357749804854393, 'kl': 0.84375, 'epoch': 0.7}
70%|███████ | 3007/4286 [19:07:38<10:03:39, 28.32s/it] {'loss': 0.0285, 'grad_norm': 2.3523915238495023, 'learning_rate': 2.984134391040597e-07, 'completion_length': 163.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.6175595074892044, 'rewards/format_reward': 1.0, 'reward': 1.6175596714019775, 'reward_std': 0.08404048532247543, 'kl': 0.712890625, 'epoch': 0.7}
70%|███████ | 3008/4286 [19:07:56<8:56:47, 25.20s/it] {'loss': 0.0349, 'grad_norm': 6.200405715750194, 'learning_rate': 2.9818012132524493e-07, 'completion_length': 178.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8175595700740814, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7997025847434998, 'reward_std': 0.11065995320677757, 'kl': 0.87255859375, 'epoch': 0.7}
70%|███████ | 3009/4286 [19:08:15<8:14:07, 23.22s/it] {'loss': 0.0236, 'grad_norm': 1.6949341031327267, 'learning_rate': 2.979468035464302e-07, 'completion_length': 178.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 1.0, 'reward': 1.6502977013587952, 'reward_std': 0.05133940279483795, 'kl': 0.5927734375, 'epoch': 0.7}
70%|███████ | 3010/4286 [19:08:33<7:41:11, 21.69s/it] {'loss': 0.0536, 'grad_norm': 5.494835984450247, 'learning_rate': 2.977134857676155e-07, 'completion_length': 183.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7619047462940216, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7440477013587952, 'reward_std': 0.10831043869256973, 'kl': 1.33984375, 'epoch': 0.7}
70%|███████ | 3011/4286 [19:08:53<7:31:00, 21.22s/it] {'loss': 0.0277, 'grad_norm': 13.841743567902677, 'learning_rate': 2.974801679888007e-07, 'completion_length': 203.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7029762268066406, 'rewards/format_reward': 1.0, 'reward': 1.7029762864112854, 'reward_std': 0.10325397178530693, 'kl': 0.6904296875, 'epoch': 0.7}
70%|███████ | 3012/4286 [19:09:11<7:12:00, 20.35s/it] {'loss': 0.0347, 'grad_norm': 2.9175957820084237, 'learning_rate': 2.97246850209986e-07, 'completion_length': 159.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.7336309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7157739400863647, 'reward_std': 0.09447787329554558, 'kl': 0.865234375, 'epoch': 0.7}
70%|███████ | 3013/4286 [19:09:30<7:00:06, 19.80s/it] {'loss': 0.0089, 'grad_norm': 12.062700473053974, 'learning_rate': 2.970135324311712e-07, 'completion_length': 170.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.743452399969101, 'rewards/format_reward': 1.0, 'reward': 1.7434524297714233, 'reward_std': 0.06608702428638935, 'kl': 0.22216796875, 'epoch': 0.7}
70%|███████ | 3014/4286 [19:09:52<7:17:04, 20.62s/it] {'loss': 0.051, 'grad_norm': 2.717389984876956, 'learning_rate': 2.967802146523565e-07, 'completion_length': 196.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6116071939468384, 'rewards/format_reward': 1.0, 'reward': 1.6116072535514832, 'reward_std': 0.10969169065356255, 'kl': 1.27734375, 'epoch': 0.7}
[2025-03-03 00:17:31,001] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
70%|███████ | 3015/4286 [19:10:15<7:30:15, 21.26s/it] {'loss': 0.0433, 'grad_norm': 64.24205274246694, 'learning_rate': 2.9654689687354175e-07, 'completion_length': 197.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.605124831199646, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5336963534355164, 'reward_std': 0.24472012370824814, 'kl': 1.08203125, 'epoch': 0.7}
70%|███████ | 3016/4286 [19:10:34<7:16:27, 20.62s/it] {'loss': 0.016, 'grad_norm': 3.369733940614389, 'learning_rate': 2.96313579094727e-07, 'completion_length': 169.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6809523701667786, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6630953550338745, 'reward_std': 0.11462146788835526, 'kl': 0.400390625, 'epoch': 0.7}
70%|███████ | 3017/4286 [19:10:55<7:14:50, 20.56s/it] {'loss': 0.0255, 'grad_norm': 8.246272065201312, 'learning_rate': 2.9608026131591225e-07, 'completion_length': 194.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.415178582072258, 'rewards/format_reward': 1.0, 'reward': 1.415178656578064, 'reward_std': 0.1091451458632946, 'kl': 0.63671875, 'epoch': 0.7}
70%|███████ | 3018/4286 [19:11:13<7:03:08, 20.02s/it] {'loss': 0.0102, 'grad_norm': 12.106870054151148, 'learning_rate': 2.9584694353709747e-07, 'completion_length': 183.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.591369092464447, 'rewards/format_reward': 1.0, 'reward': 1.5913691520690918, 'reward_std': 0.055021190084517, 'kl': 0.25390625, 'epoch': 0.7}
70%|███████ | 3019/4286 [19:11:32<6:50:25, 19.44s/it] {'loss': 0.0074, 'grad_norm': 3.85525849337572, 'learning_rate': 2.9561362575828275e-07, 'completion_length': 181.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.026572031434625387, 'kl': 0.18359375, 'epoch': 0.7}
[2025-03-03 00:19:08,189] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
70%|███████ | 3020/4286 [19:11:52<6:58:42, 19.84s/it] {'loss': 0.0136, 'grad_norm': 1.2914107831988366, 'learning_rate': 2.95380307979468e-07, 'completion_length': 181.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6372024118900299, 'rewards/format_reward': 1.0, 'reward': 1.6372024416923523, 'reward_std': 0.04937081038951874, 'kl': 0.33935546875, 'epoch': 0.7}
70%|███████ | 3021/4286 [19:12:10<6:42:17, 19.08s/it] {'loss': 0.0132, 'grad_norm': 10.04856965443028, 'learning_rate': 2.9514699020065324e-07, 'completion_length': 170.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5714286267757416, 'rewards/format_reward': 1.0, 'reward': 1.5714287161827087, 'reward_std': 0.016835877671837807, 'kl': 0.330078125, 'epoch': 0.7}
71%|███████ | 3022/4286 [19:12:28<6:34:39, 18.73s/it] {'loss': 0.0105, 'grad_norm': 1.734448621807149, 'learning_rate': 2.949136724218385e-07, 'completion_length': 180.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.03755595348775387, 'kl': 0.26318359375, 'epoch': 0.71}
71%|███████ | 3023/4286 [19:12:45<6:28:28, 18.45s/it] {'loss': 0.0068, 'grad_norm': 0.1734830993506459, 'learning_rate': 2.9468035464302374e-07, 'completion_length': 163.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.80952388048172, 'rewards/format_reward': 1.0, 'reward': 1.8095239400863647, 'reward_std': 0.0, 'kl': 0.16943359375, 'epoch': 0.71}
71%|███████ | 3024/4286 [19:13:03<6:26:17, 18.37s/it] {'loss': 0.0111, 'grad_norm': 1.3444117997510054, 'learning_rate': 2.94447036864209e-07, 'completion_length': 190.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.395833358168602, 'rewards/format_reward': 1.0, 'reward': 1.3958334922790527, 'reward_std': 0.07419108971953392, 'kl': 0.2783203125, 'epoch': 0.71}
71%|███████ | 3025/4286 [19:13:23<6:34:06, 18.75s/it] {'loss': 0.0113, 'grad_norm': 2.0449520066563704, 'learning_rate': 2.942137190853943e-07, 'completion_length': 193.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.5562500506639481, 'rewards/format_reward': 1.0, 'reward': 1.5562500953674316, 'reward_std': 0.06935148127377033, 'kl': 0.28271484375, 'epoch': 0.71}
71%|███████ | 3026/4286 [19:13:44<6:47:33, 19.41s/it] {'loss': 0.0145, 'grad_norm': 92.08913396568134, 'learning_rate': 2.939804013065795e-07, 'completion_length': 193.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.664806604385376, 'rewards/format_reward': 1.0, 'reward': 1.6648066639900208, 'reward_std': 0.05541962757706642, 'kl': 0.36279296875, 'epoch': 0.71}
71%|███████ | 3027/4286 [19:14:04<6:52:25, 19.65s/it] {'loss': 0.0166, 'grad_norm': 1.9973360888974288, 'learning_rate': 2.937470835277648e-07, 'completion_length': 185.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5788690745830536, 'rewards/format_reward': 1.0, 'reward': 1.578869104385376, 'reward_std': 0.07992978021502495, 'kl': 0.41455078125, 'epoch': 0.71}
71%|███████ | 3028/4286 [19:14:23<6:47:33, 19.44s/it] {'loss': 0.0267, 'grad_norm': 4.1727798395116515, 'learning_rate': 2.9351376574895e-07, 'completion_length': 170.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410714626312256, 'reward_std': 0.09364316612482071, 'kl': 0.66552734375, 'epoch': 0.71}
71%|███████ | 3029/4286 [19:14:42<6:41:12, 19.15s/it] {'loss': 0.0091, 'grad_norm': 7.241758716431217, 'learning_rate': 2.932804479701353e-07, 'completion_length': 177.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6907738745212555, 'rewards/format_reward': 1.0, 'reward': 1.6907739639282227, 'reward_std': 0.10150355100631714, 'kl': 0.228515625, 'epoch': 0.71}
71%|███████ | 3030/4286 [19:14:59<6:31:12, 18.69s/it] {'loss': 0.0081, 'grad_norm': 5.7704128656922204, 'learning_rate': 2.9304713019132056e-07, 'completion_length': 165.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.8020833730697632, 'rewards/format_reward': 1.0, 'reward': 1.802083432674408, 'reward_std': 0.07003564015030861, 'kl': 0.20361328125, 'epoch': 0.71}
71%|███████ | 3031/4286 [19:15:18<6:28:18, 18.56s/it] {'loss': 0.0169, 'grad_norm': 5.919392857119172, 'learning_rate': 2.928138124125058e-07, 'completion_length': 185.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.04013476986438036, 'kl': 0.4228515625, 'epoch': 0.71}
71%|███████ | 3032/4286 [19:15:39<6:44:28, 19.35s/it] {'loss': 0.0236, 'grad_norm': 3.1142885213566824, 'learning_rate': 2.9258049463369106e-07, 'completion_length': 187.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.6785715818405151, 'reward_std': 0.08165093511343002, 'kl': 0.58984375, 'epoch': 0.71}
71%|███████ | 3033/4286 [19:16:03<7:13:33, 20.76s/it] {'loss': 0.0546, 'grad_norm': 9.304603110834647, 'learning_rate': 2.9234717685487633e-07, 'completion_length': 201.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5855106562376022, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5676535964012146, 'reward_std': 0.12144243717193604, 'kl': 1.36328125, 'epoch': 0.71}
71%|███████ | 3034/4286 [19:16:21<6:55:47, 19.93s/it] {'loss': 0.03, 'grad_norm': 11.92156266096359, 'learning_rate': 2.9211385907606156e-07, 'completion_length': 173.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5833333879709244, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5654763579368591, 'reward_std': 0.11904761660844088, 'kl': 0.748046875, 'epoch': 0.71}
71%|███████ | 3035/4286 [19:16:40<6:47:48, 19.56s/it] {'loss': 0.0626, 'grad_norm': 3.696371768297159, 'learning_rate': 2.9188054129724683e-07, 'completion_length': 178.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.46760205924510956, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4497449398040771, 'reward_std': 0.12964027374982834, 'kl': 1.56640625, 'epoch': 0.71}
71%|███████ | 3036/4286 [19:16:57<6:33:15, 18.88s/it] {'loss': 0.0229, 'grad_norm': 5.763581487337523, 'learning_rate': 2.9164722351843205e-07, 'completion_length': 172.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.8008929491043091, 'rewards/format_reward': 1.0, 'reward': 1.800892949104309, 'reward_std': 0.05297619290649891, 'kl': 0.57275390625, 'epoch': 0.71}
71%|███████ | 3037/4286 [19:17:16<6:35:41, 19.01s/it] {'loss': 0.0504, 'grad_norm': 17.25353366355797, 'learning_rate': 2.9141390573961733e-07, 'completion_length': 172.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.7178571820259094, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7000001072883606, 'reward_std': 0.15805364027619362, 'kl': 1.26171875, 'epoch': 0.71}
71%|███████ | 3038/4286 [19:17:36<6:40:13, 19.24s/it] {'loss': 0.0254, 'grad_norm': 7.260148431840627, 'learning_rate': 2.911805879608026e-07, 'completion_length': 163.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5726191401481628, 'rewards/format_reward': 1.0, 'reward': 1.5726191997528076, 'reward_std': 0.07558627892285585, 'kl': 0.634765625, 'epoch': 0.71}
71%|███████ | 3039/4286 [19:17:54<6:35:41, 19.04s/it] {'loss': 0.0141, 'grad_norm': 17.46601835278647, 'learning_rate': 2.909472701819878e-07, 'completion_length': 176.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7910715043544769, 'rewards/format_reward': 1.0, 'reward': 1.7910714745521545, 'reward_std': 0.11547619476914406, 'kl': 0.35302734375, 'epoch': 0.71}
71%|███████ | 3040/4286 [19:18:14<6:36:31, 19.09s/it] {'loss': 0.0348, 'grad_norm': 7.432110062703004, 'learning_rate': 2.907139524031731e-07, 'completion_length': 167.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6175595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6175596117973328, 'reward_std': 0.122023805975914, 'kl': 0.8671875, 'epoch': 0.71}
71%|███████ | 3041/4286 [19:18:35<6:48:26, 19.68s/it] {'loss': 0.1013, 'grad_norm': 31.62526814284709, 'learning_rate': 2.904806346243583e-07, 'completion_length': 187.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5089285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.473214328289032, 'reward_std': 0.23555521667003632, 'kl': 2.53125, 'epoch': 0.71}
71%|███████ | 3042/4286 [19:18:56<6:58:05, 20.17s/it] {'loss': 0.0548, 'grad_norm': 14.886260961280895, 'learning_rate': 2.902473168455436e-07, 'completion_length': 184.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5351190567016602, 'rewards/format_reward': 1.0, 'reward': 1.5351191759109497, 'reward_std': 0.1748717837035656, 'kl': 1.37109375, 'epoch': 0.71}
71%|███████ | 3043/4286 [19:19:16<6:54:22, 20.00s/it] {'loss': 0.0434, 'grad_norm': 9.281420475263616, 'learning_rate': 2.900139990667289e-07, 'completion_length': 191.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6244260668754578, 'rewards/format_reward': 1.0, 'reward': 1.6244261264801025, 'reward_std': 0.12106333300471306, 'kl': 1.0859375, 'epoch': 0.71}
71%|███████ | 3044/4286 [19:19:36<6:56:43, 20.13s/it] {'loss': 0.0479, 'grad_norm': 3.22950458150865, 'learning_rate': 2.897806812879141e-07, 'completion_length': 184.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5308248698711395, 'rewards/format_reward': 1.0, 'reward': 1.530824899673462, 'reward_std': 0.10071776062250137, 'kl': 1.19873046875, 'epoch': 0.71}
71%|███████ | 3045/4286 [19:19:59<7:12:19, 20.90s/it] {'loss': 0.1419, 'grad_norm': 13.835210554270652, 'learning_rate': 2.8954736350909937e-07, 'completion_length': 208.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5967262387275696, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.543154776096344, 'reward_std': 0.33924926817417145, 'kl': 3.546875, 'epoch': 0.71}
71%|███████ | 3046/4286 [19:20:19<7:09:22, 20.78s/it] {'loss': 0.1538, 'grad_norm': 13.664891530507852, 'learning_rate': 2.893140457302846e-07, 'completion_length': 188.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5582454204559326, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.522531270980835, 'reward_std': 0.34186573326587677, 'kl': 3.8359375, 'epoch': 0.71}
71%|███████ | 3047/4286 [19:20:38<6:55:48, 20.14s/it] {'loss': 0.1726, 'grad_norm': 19.213054334686582, 'learning_rate': 2.8908072795146987e-07, 'completion_length': 164.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.4709821790456772, 'rewards/format_reward': 1.0, 'reward': 1.470982313156128, 'reward_std': 0.22951283305883408, 'kl': 4.3125, 'epoch': 0.71}
71%|███████ | 3048/4286 [19:20:56<6:42:46, 19.52s/it] {'loss': 0.0602, 'grad_norm': 6.7401940118992565, 'learning_rate': 2.8884741017265514e-07, 'completion_length': 143.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6324405074119568, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.07784926891326904, 'kl': 1.50390625, 'epoch': 0.71}
71%|███████ | 3049/4286 [19:21:17<6:51:03, 19.94s/it] {'loss': 0.0934, 'grad_norm': 47.106359070869594, 'learning_rate': 2.8861409239384037e-07, 'completion_length': 190.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6306548416614532, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5949405431747437, 'reward_std': 0.3341507390141487, 'kl': 2.3359375, 'epoch': 0.71}
71%|███████ | 3050/4286 [19:21:38<6:55:20, 20.16s/it] {'loss': 0.102, 'grad_norm': 7.659363646879131, 'learning_rate': 2.8838077461502564e-07, 'completion_length': 180.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6014881432056427, 'rewards/format_reward': 1.0, 'reward': 1.6014882922172546, 'reward_std': 0.11864285916090012, 'kl': 2.546875, 'epoch': 0.71}
71%|███████ | 3051/4286 [19:21:57<6:49:08, 19.88s/it] {'loss': 0.0941, 'grad_norm': 34.14855590419106, 'learning_rate': 2.8814745683621086e-07, 'completion_length': 185.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5452806055545807, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5095664262771606, 'reward_std': 0.24346762150526047, 'kl': 2.34765625, 'epoch': 0.71}
71%|███████ | 3052/4286 [19:22:17<6:47:43, 19.82s/it] {'loss': 0.0825, 'grad_norm': 17.77725829652796, 'learning_rate': 2.8791413905739614e-07, 'completion_length': 172.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6101191639900208, 'reward_std': 0.08600304275751114, 'kl': 2.0546875, 'epoch': 0.71}
71%|███████ | 3053/4286 [19:22:37<6:49:28, 19.93s/it] {'loss': 0.086, 'grad_norm': 11.671344603923824, 'learning_rate': 2.876808212785814e-07, 'completion_length': 186.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5008928775787354, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4830358028411865, 'reward_std': 0.1293214000761509, 'kl': 2.15234375, 'epoch': 0.71}
71%|███████▏ | 3054/4286 [19:22:56<6:46:48, 19.81s/it] {'loss': 0.0976, 'grad_norm': 7.469277121181759, 'learning_rate': 2.8744750349976664e-07, 'completion_length': 176.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.4645833373069763, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4467263221740723, 'reward_std': 0.11221642419695854, 'kl': 2.4453125, 'epoch': 0.71}
71%|███████▏ | 3055/4286 [19:23:17<6:53:05, 20.13s/it] {'loss': 0.0521, 'grad_norm': 3.93844641640188, 'learning_rate': 2.872141857209519e-07, 'completion_length': 183.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6486820578575134, 'rewards/format_reward': 1.0, 'reward': 1.6486821174621582, 'reward_std': 0.09202000871300697, 'kl': 1.30078125, 'epoch': 0.71}
71%|███████▏ | 3056/4286 [19:23:35<6:41:44, 19.60s/it] {'loss': 0.01, 'grad_norm': 2.2067808252998504, 'learning_rate': 2.869808679421372e-07, 'completion_length': 163.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7294643521308899, 'rewards/format_reward': 1.0, 'reward': 1.72946435213089, 'reward_std': 0.01767674833536148, 'kl': 0.24853515625, 'epoch': 0.71}
71%|███████▏ | 3057/4286 [19:23:54<6:35:20, 19.30s/it] {'loss': 0.0509, 'grad_norm': 52.31730225692402, 'learning_rate': 2.867475501633224e-07, 'completion_length': 166.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.5282738655805588, 'rewards/format_reward': 1.0, 'reward': 1.5282739400863647, 'reward_std': 0.1271214820444584, 'kl': 1.2705078125, 'epoch': 0.71}
71%|███████▏ | 3058/4286 [19:24:12<6:24:02, 18.76s/it] {'loss': 0.0088, 'grad_norm': 4.2824588839746625, 'learning_rate': 2.865142323845077e-07, 'completion_length': 174.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7232143580913544, 'rewards/format_reward': 1.0, 'reward': 1.723214328289032, 'reward_std': 0.059523806907236576, 'kl': 0.2197265625, 'epoch': 0.71}
71%|███████▏ | 3059/4286 [19:24:29<6:17:53, 18.48s/it] {'loss': 0.0187, 'grad_norm': 2.327430867109042, 'learning_rate': 2.862809146056929e-07, 'completion_length': 159.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.03348226100206375, 'kl': 0.4677734375, 'epoch': 0.71}
71%|███████▏ | 3060/4286 [19:24:48<6:17:13, 18.46s/it] {'loss': 0.013, 'grad_norm': 4.567000211807632, 'learning_rate': 2.860475968268782e-07, 'completion_length': 177.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.06395573727786541, 'kl': 0.32568359375, 'epoch': 0.71}
71%|███████▏ | 3061/4286 [19:25:07<6:20:29, 18.64s/it] {'loss': 0.0138, 'grad_norm': 3.681003852889635, 'learning_rate': 2.8581427904806346e-07, 'completion_length': 177.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 1.0, 'reward': 1.74404776096344, 'reward_std': 0.02380952751263976, 'kl': 0.34619140625, 'epoch': 0.71}
71%|███████▏ | 3062/4286 [19:25:26<6:25:11, 18.88s/it] {'loss': 0.0417, 'grad_norm': 3.7031174594824905, 'learning_rate': 2.855809612692487e-07, 'completion_length': 176.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6366071701049805, 'rewards/format_reward': 1.0, 'reward': 1.6366072297096252, 'reward_std': 0.04591536708176136, 'kl': 1.044921875, 'epoch': 0.71}
71%|███████▏ | 3063/4286 [19:25:45<6:22:37, 18.77s/it] {'loss': 0.0091, 'grad_norm': 3.3801038270186003, 'learning_rate': 2.8534764349043395e-07, 'completion_length': 178.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.46547622978687286, 'rewards/format_reward': 1.0, 'reward': 1.4654762744903564, 'reward_std': 0.041144710034132004, 'kl': 0.2294921875, 'epoch': 0.71}
71%|███████▏ | 3064/4286 [19:26:04<6:23:15, 18.82s/it] {'loss': 0.0091, 'grad_norm': 1.949528880216129, 'learning_rate': 2.851143257116192e-07, 'completion_length': 176.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.6190476715564728, 'rewards/format_reward': 1.0, 'reward': 1.6190477013587952, 'reward_std': 0.028166969306766987, 'kl': 0.2265625, 'epoch': 0.71}
72%|███████▏ | 3065/4286 [19:26:23<6:27:49, 19.06s/it] {'loss': 0.0341, 'grad_norm': 2.1574380254994043, 'learning_rate': 2.8488100793280445e-07, 'completion_length': 171.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.8095238208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7916668057441711, 'reward_std': 0.11904762778431177, 'kl': 0.85302734375, 'epoch': 0.72}
72%|███████▏ | 3066/4286 [19:26:44<6:34:51, 19.42s/it] {'loss': 0.0544, 'grad_norm': 7.692643854687898, 'learning_rate': 2.846476901539897e-07, 'completion_length': 187.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.5173363536596298, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4816222190856934, 'reward_std': 0.1881227269768715, 'kl': 1.36328125, 'epoch': 0.72}
72%|███████▏ | 3067/4286 [19:27:02<6:28:12, 19.11s/it] {'loss': 0.0328, 'grad_norm': 6.804439174570813, 'learning_rate': 2.8441437237517495e-07, 'completion_length': 175.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7038690447807312, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6860119700431824, 'reward_std': 0.07673493027687073, 'kl': 0.81884765625, 'epoch': 0.72}
72%|███████▏ | 3068/4286 [19:27:23<6:40:49, 19.75s/it] {'loss': 0.0595, 'grad_norm': 14.419336820213731, 'learning_rate': 2.841810545963602e-07, 'completion_length': 194.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6035714745521545, 'rewards/format_reward': 1.0, 'reward': 1.6035714745521545, 'reward_std': 0.10391178354620934, 'kl': 1.486328125, 'epoch': 0.72}
72%|███████▏ | 3069/4286 [19:27:41<6:29:23, 19.20s/it] {'loss': 0.03, 'grad_norm': 2.3990231797695976, 'learning_rate': 2.8394773681754545e-07, 'completion_length': 172.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.7008928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7008930444717407, 'reward_std': 0.0625000037252903, 'kl': 0.748046875, 'epoch': 0.72}
[2025-03-03 00:35:18,195] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
72%|███████▏ | 3070/4286 [19:28:02<6:40:50, 19.78s/it] {'loss': 0.0632, 'grad_norm': 5.614583608560202, 'learning_rate': 2.837144190387307e-07, 'completion_length': 181.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.7517857849597931, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7339287400245667, 'reward_std': 0.14226406812667847, 'kl': 1.58203125, 'epoch': 0.72}
72%|███████▏ | 3071/4286 [19:28:21<6:35:42, 19.54s/it] {'loss': 0.0176, 'grad_norm': 2.7368208055725756, 'learning_rate': 2.83481101259916e-07, 'completion_length': 171.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.7034439146518707, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6855868697166443, 'reward_std': 0.08858015388250351, 'kl': 0.4423828125, 'epoch': 0.72}
72%|███████▏ | 3072/4286 [19:28:40<6:31:03, 19.33s/it] {'loss': 0.0452, 'grad_norm': 5.581548601195189, 'learning_rate': 2.832477834811012e-07, 'completion_length': 164.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5651786029338837, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5473214983940125, 'reward_std': 0.08516562730073929, 'kl': 1.130859375, 'epoch': 0.72}
72%|███████▏ | 3073/4286 [19:29:01<6:42:24, 19.90s/it] {'loss': 0.0441, 'grad_norm': 4.670581350632523, 'learning_rate': 2.830144657022865e-07, 'completion_length': 167.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.4285714775323868, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4107143878936768, 'reward_std': 0.05633394047617912, 'kl': 1.1015625, 'epoch': 0.72}
[2025-03-03 00:36:39,250] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
72%|███████▏ | 3074/4286 [19:29:23<6:54:42, 20.53s/it] {'loss': 0.0335, 'grad_norm': 7.463599460161232, 'learning_rate': 2.827811479234717e-07, 'completion_length': 171.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.5758928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5580358505249023, 'reward_std': 0.14057645946741104, 'kl': 0.837890625, 'epoch': 0.72}
72%|███████▏ | 3075/4286 [19:29:43<6:47:10, 20.17s/it] {'loss': 0.0158, 'grad_norm': 5.273642496634566, 'learning_rate': 2.82547830144657e-07, 'completion_length': 181.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.01785714365541935, 'kl': 0.39453125, 'epoch': 0.72}
72%|███████▏ | 3076/4286 [19:30:00<6:31:49, 19.43s/it] {'loss': 0.0186, 'grad_norm': 2.012675339878192, 'learning_rate': 2.8231451236584227e-07, 'completion_length': 152.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6279762983322144, 'reward_std': 0.05357143096625805, 'kl': 0.4658203125, 'epoch': 0.72}
72%|███████▏ | 3077/4286 [19:30:17<6:13:32, 18.54s/it] {'loss': 0.0903, 'grad_norm': 3.820214500776194, 'learning_rate': 2.820811945870275e-07, 'completion_length': 156.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.7961309850215912, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7425596117973328, 'reward_std': 0.22986217215657234, 'kl': 2.25390625, 'epoch': 0.72}
72%|███████▏ | 3078/4286 [19:30:39<6:33:34, 19.55s/it] {'loss': 0.0509, 'grad_norm': 2.40539943764561, 'learning_rate': 2.8184787680821276e-07, 'completion_length': 166.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5821428745985031, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5464286804199219, 'reward_std': 0.11279664188623428, 'kl': 1.27490234375, 'epoch': 0.72}
72%|███████▏ | 3079/4286 [19:30:55<6:14:54, 18.64s/it] {'loss': 0.0534, 'grad_norm': 25.57469525123979, 'learning_rate': 2.8161455902939804e-07, 'completion_length': 151.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6339287161827087, 'reward_std': 0.07862301170825958, 'kl': 1.3310546875, 'epoch': 0.72}
72%|███████▏ | 3080/4286 [19:31:17<6:30:53, 19.45s/it] {'loss': 0.105, 'grad_norm': 11.71999007312255, 'learning_rate': 2.8138124125058326e-07, 'completion_length': 183.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5744048058986664, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5386905670166016, 'reward_std': 0.2000962495803833, 'kl': 2.625, 'epoch': 0.72}
72%|███████▏ | 3081/4286 [19:31:34<6:20:57, 18.97s/it] {'loss': 0.0835, 'grad_norm': 6.374962950392978, 'learning_rate': 2.8114792347176854e-07, 'completion_length': 162.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547619700431824, 'reward_std': 0.19779008626937866, 'kl': 2.0859375, 'epoch': 0.72}
72%|███████▏ | 3082/4286 [19:31:57<6:43:54, 20.13s/it] {'loss': 0.0294, 'grad_norm': 10.714107262319787, 'learning_rate': 2.8091460569295376e-07, 'completion_length': 203.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.7327381670475006, 'rewards/format_reward': 1.0, 'reward': 1.7327382564544678, 'reward_std': 0.08201512321829796, 'kl': 0.732421875, 'epoch': 0.72}
72%|███████▏ | 3083/4286 [19:32:16<6:36:45, 19.79s/it] {'loss': 0.0941, 'grad_norm': 5.074341577826875, 'learning_rate': 2.8068128791413903e-07, 'completion_length': 178.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5476190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5297620296478271, 'reward_std': 0.11473165825009346, 'kl': 2.35546875, 'epoch': 0.72}
72%|███████▏ | 3084/4286 [19:32:36<6:34:04, 19.67s/it] {'loss': 0.0699, 'grad_norm': 4.121235918527424, 'learning_rate': 2.804479701353243e-07, 'completion_length': 168.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6619048118591309, 'rewards/format_reward': 1.0, 'reward': 1.6619049310684204, 'reward_std': 0.13355717062950134, 'kl': 1.75, 'epoch': 0.72}
72%|███████▏ | 3085/4286 [19:32:53<6:16:59, 18.83s/it] {'loss': 0.0404, 'grad_norm': 3.8418180021739516, 'learning_rate': 2.8021465235650953e-07, 'completion_length': 151.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.7961309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7782739400863647, 'reward_std': 0.11450555361807346, 'kl': 1.0126953125, 'epoch': 0.72}
72%|███████▏ | 3086/4286 [19:33:11<6:14:00, 18.70s/it] {'loss': 0.0673, 'grad_norm': 3.129091822111209, 'learning_rate': 2.799813345776948e-07, 'completion_length': 177.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.6497024297714233, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6318453550338745, 'reward_std': 0.12121091783046722, 'kl': 1.6796875, 'epoch': 0.72}
72%|███████▏ | 3087/4286 [19:33:29<6:12:19, 18.63s/it] {'loss': 0.0144, 'grad_norm':
3.8117648284226684, 'learning_rate': 2.7974801679888003e-07, 'completion_length': 169.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098214626312256, 'reward_std': 0.09051632881164551, 'kl': 0.3603515625, 'epoch': 0.72} 72%|███████▏ | 3087/4286 [19:33:29<6:12:19, 18.63s/it] 72%|███████▏ | 3088/4286 [19:33:48<6:11:49, 18.62s/it] {'loss': 0.0096, 'grad_norm': 2.909568781400658, 'learning_rate': 2.795146990200653e-07, 'completion_length': 185.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.5952381193637848, 'rewards/format_reward': 1.0, 'reward': 1.595238208770752, 'reward_std': 0.020619653165340424, 'kl': 0.2392578125, 'epoch': 0.72} 72%|███████▏ | 3088/4286 [19:33:48<6:11:49, 18.62s/it] 72%|███████▏ | 3089/4286 [19:34:06<6:04:40, 18.28s/it] {'loss': 0.0273, 'grad_norm': 0.6980998738386935, 'learning_rate': 2.792813812412506e-07, 'completion_length': 171.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.7366071939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7187501788139343, 'reward_std': 0.0625, 'kl': 0.6845703125, 'epoch': 0.72} 72%|███████▏ | 3089/4286 [19:34:06<6:04:40, 18.28s/it] 72%|███████▏ | 3090/4286 [19:34:24<6:02:40, 18.19s/it] {'loss': 0.0353, 'grad_norm': 4.727094519597487, 'learning_rate': 2.790480634624358e-07, 'completion_length': 181.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.581845298409462, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.563988208770752, 'reward_std': 0.08311965316534042, 'kl': 0.8828125, 'epoch': 0.72} 72%|███████▏ | 3090/4286 [19:34:24<6:02:40, 18.19s/it] 72%|███████▏ | 3091/4286 [19:34:42<6:02:14, 18.19s/it] {'loss': 0.0085, 'grad_norm': 7.617037832267506, 'learning_rate': 2.788147456836211e-07, 'completion_length': 177.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.8202381730079651, 'rewards/format_reward': 1.0, 'reward': 1.820238173007965, 'reward_std': 0.03427334874868393, 
'kl': 0.2138671875, 'epoch': 0.72} 72%|███████▏ | 3091/4286 [19:34:42<6:02:14, 18.19s/it] 72%|███████▏ | 3092/4286 [19:34:59<5:56:57, 17.94s/it] {'loss': 0.0073, 'grad_norm': 0.9735691156448009, 'learning_rate': 2.785814279048063e-07, 'completion_length': 160.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.57589291036129, 'rewards/format_reward': 1.0, 'reward': 1.575892984867096, 'reward_std': 0.019238397479057312, 'kl': 0.18115234375, 'epoch': 0.72} 72%|███████▏ | 3092/4286 [19:34:59<5:56:57, 17.94s/it] 72%|███████▏ | 3093/4286 [19:35:18<6:02:34, 18.23s/it] {'loss': 0.065, 'grad_norm': 1.044227519413758, 'learning_rate': 2.7834811012599157e-07, 'completion_length': 172.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6130952835083008, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5773810744285583, 'reward_std': 0.1785714365541935, 'kl': 1.62890625, 'epoch': 0.72} 72%|███████▏ | 3093/4286 [19:35:18<6:02:34, 18.23s/it] 72%|███████▏ | 3094/4286 [19:35:37<6:08:22, 18.54s/it] {'loss': 0.0313, 'grad_norm': 2.9821339760858203, 'learning_rate': 2.7811479234717685e-07, 'completion_length': 190.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.5166667252779007, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4988096356391907, 'reward_std': 0.1007008645683527, 'kl': 0.787109375, 'epoch': 0.72} 72%|███████▏ | 3094/4286 [19:35:37<6:08:22, 18.54s/it] 72%|███████▏ | 3095/4286 [19:35:56<6:08:23, 18.56s/it] {'loss': 0.0191, 'grad_norm': 5.550132894334436, 'learning_rate': 2.7788147456836207e-07, 'completion_length': 179.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.6964285969734192, 'reward_std': 0.051976488903164864, 'kl': 0.47802734375, 'epoch': 0.72} 72%|███████▏ | 3095/4286 [19:35:56<6:08:23, 18.56s/it] 72%|███████▏ | 3096/4286 [19:36:14<6:08:29, 18.58s/it] {'loss': 0.0544, 'grad_norm': 3.357860244668124, 'learning_rate': 2.7764815678954734e-07, 
'completion_length': 171.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6145834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5967262983322144, 'reward_std': 0.1101190522313118, 'kl': 1.35888671875, 'epoch': 0.72} 72%|███████▏ | 3096/4286 [19:36:14<6:08:29, 18.58s/it] 72%|███████▏ | 3097/4286 [19:36:32<5:59:10, 18.12s/it] {'loss': 0.0074, 'grad_norm': 1.0103759047397136, 'learning_rate': 2.7741483901073257e-07, 'completion_length': 162.08928680419922, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.04946071840822697, 'kl': 0.18505859375, 'epoch': 0.72} 72%|███████▏ | 3097/4286 [19:36:32<5:59:10, 18.12s/it] 72%|███████▏ | 3098/4286 [19:36:50<5:59:50, 18.17s/it] {'loss': 0.0253, 'grad_norm': 1.342227876037114, 'learning_rate': 2.7718152123191784e-07, 'completion_length': 175.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6693452596664429, 'rewards/format_reward': 1.0, 'reward': 1.6693453788757324, 'reward_std': 0.046378858387470245, 'kl': 0.63134765625, 'epoch': 0.72} 72%|███████▏ | 3098/4286 [19:36:50<5:59:50, 18.17s/it] 72%|███████▏ | 3099/4286 [19:37:07<5:54:08, 17.90s/it] {'loss': 0.0505, 'grad_norm': 0.9919277617930479, 'learning_rate': 2.769482034531031e-07, 'completion_length': 173.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7261905670166016, 'reward_std': 0.1547619104385376, 'kl': 1.2578125, 'epoch': 0.72} 72%|███████▏ | 3099/4286 [19:37:07<5:54:08, 17.90s/it] 72%|███████▏ | 3100/4286 [19:37:29<6:17:33, 19.10s/it] {'loss': 0.0283, 'grad_norm': 3.557084705820236, 'learning_rate': 2.7671488567428834e-07, 'completion_length': 197.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6547619700431824, 'reward_std': 0.08407166227698326, 'kl': 0.70751953125, 
'epoch': 0.72} 72%|███████▏ | 3100/4286 [19:37:29<6:17:33, 19.10s/it] 72%|███████▏ | 3101/4286 [19:41:26<27:46:48, 84.40s/it] {'loss': 0.0321, 'grad_norm': 2.356623679914438, 'learning_rate': 2.764815678954736e-07, 'completion_length': 180.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.636904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6369048953056335, 'reward_std': 0.022795887663960457, 'kl': 0.80126953125, 'epoch': 0.72} 72%|███████▏ | 3101/4286 [19:41:26<27:46:48, 84.40s/it] 72%|███████▏ | 3102/4286 [19:41:45<21:22:32, 64.99s/it] {'loss': 0.0142, 'grad_norm': 1.1367302675565167, 'learning_rate': 2.762482501166589e-07, 'completion_length': 189.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.538690522313118, 'rewards/format_reward': 1.0, 'reward': 1.5386905670166016, 'reward_std': 0.010309826582670212, 'kl': 0.35595703125, 'epoch': 0.72} 72%|███████▏ | 3102/4286 [19:41:45<21:22:32, 64.99s/it] 72%|███████▏ | 3103/4286 [19:42:04<16:44:23, 50.94s/it] {'loss': 0.0583, 'grad_norm': 2.8395516645134933, 'learning_rate': 2.760149323378441e-07, 'completion_length': 174.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.512797623872757, 'rewards/format_reward': 1.0, 'reward': 1.5127977132797241, 'reward_std': 0.1167588010430336, 'kl': 1.453125, 'epoch': 0.72} 72%|███████▏ | 3103/4286 [19:42:04<16:44:23, 50.94s/it] 72%|███████▏ | 3104/4286 [19:42:23<13:35:38, 41.40s/it] {'loss': 0.007, 'grad_norm': 1.084462241924785, 'learning_rate': 2.757816145590294e-07, 'completion_length': 189.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7392858266830444, 'rewards/format_reward': 1.0, 'reward': 1.7392858266830444, 'reward_std': 0.02142857387661934, 'kl': 0.1748046875, 'epoch': 0.72} 72%|███████▏ | 3104/4286 [19:42:23<13:35:38, 41.40s/it] 72%|███████▏ | 3105/4286 [19:42:42<11:21:14, 34.61s/it] {'loss': 0.0074, 'grad_norm': 2.2458823232696186, 'learning_rate': 2.755482967802146e-07, 'completion_length': 189.42858123779297, 
'rewards/only_full_func_accuracy_reward': 0.6400510668754578, 'rewards/format_reward': 1.0, 'reward': 1.6400511264801025, 'reward_std': 0.01822388218715787, 'kl': 0.185546875, 'epoch': 0.72} 72%|███████▏ | 3105/4286 [19:42:42<11:21:14, 34.61s/it] 72%|███████▏ | 3106/4286 [19:43:00<9:43:16, 29.66s/it] {'loss': 0.0069, 'grad_norm': 0.7360219338245427, 'learning_rate': 2.753149790013999e-07, 'completion_length': 170.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.7857143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.011904759332537651, 'kl': 0.1728515625, 'epoch': 0.72} 72%|███████▏ | 3106/4286 [19:43:00<9:43:16, 29.66s/it] 72%|███████▏ | 3107/4286 [19:43:19<8:40:32, 26.49s/it] {'loss': 0.0324, 'grad_norm': 1.7609892765616437, 'learning_rate': 2.7508166122258516e-07, 'completion_length': 189.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.575892984867096, 'reward_std': 0.08219882473349571, 'kl': 0.81005859375, 'epoch': 0.72} 72%|███████▏ | 3107/4286 [19:43:19<8:40:32, 26.49s/it] 73%|███████▎ | 3108/4286 [19:43:38<7:58:33, 24.37s/it] {'loss': 0.0069, 'grad_norm': 4.430487483145217, 'learning_rate': 2.748483434437704e-07, 'completion_length': 195.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6282738745212555, 'rewards/format_reward': 1.0, 'reward': 1.6282739639282227, 'reward_std': 0.022809826768934727, 'kl': 0.1728515625, 'epoch': 0.73} 73%|███████▎ | 3108/4286 [19:43:38<7:58:33, 24.37s/it] 73%|███████▎ | 3109/4286 [19:43:58<7:31:19, 23.01s/it] {'loss': 0.0118, 'grad_norm': 5.591120647756166, 'learning_rate': 2.7461502566495566e-07, 'completion_length': 197.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.59375, 'rewards/format_reward': 1.0, 'reward': 1.5937501192092896, 'reward_std': 0.03951851278543472, 'kl': 0.29443359375, 'epoch': 0.73} 73%|███████▎ | 3109/4286 [19:43:58<7:31:19, 23.01s/it] 73%|███████▎ | 
3110/4286 [19:44:17<7:06:21, 21.75s/it] {'loss': 0.0313, 'grad_norm': 2.166290150676721, 'learning_rate': 2.743817078861409e-07, 'completion_length': 164.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.766369104385376, 'rewards/format_reward': 1.0, 'reward': 1.766369104385376, 'reward_std': 0.05495268665254116, 'kl': 0.783203125, 'epoch': 0.73} 73%|███████▎ | 3110/4286 [19:44:17<7:06:21, 21.75s/it] 73%|███████▎ | 3111/4286 [19:44:35<6:47:58, 20.83s/it] {'loss': 0.0083, 'grad_norm': 1.443564291718492, 'learning_rate': 2.7414839010732615e-07, 'completion_length': 197.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.026785715483129025, 'kl': 0.2080078125, 'epoch': 0.73} 73%|███████▎ | 3111/4286 [19:44:35<6:47:58, 20.83s/it] 73%|███████▎ | 3112/4286 [19:44:55<6:40:01, 20.44s/it] {'loss': 0.0078, 'grad_norm': 1.050433804937154, 'learning_rate': 2.7391507232851143e-07, 'completion_length': 189.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6717262268066406, 'rewards/format_reward': 1.0, 'reward': 1.6717263460159302, 'reward_std': 0.048214289359748363, 'kl': 0.19384765625, 'epoch': 0.73} 73%|███████▎ | 3112/4286 [19:44:55<6:40:01, 20.44s/it] 73%|███████▎ | 3113/4286 [19:45:17<6:48:47, 20.91s/it] {'loss': 0.034, 'grad_norm': 3.391007160429153, 'learning_rate': 2.7368175454969665e-07, 'completion_length': 208.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6294642984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6116072535514832, 'reward_std': 0.09447786957025528, 'kl': 0.849609375, 'epoch': 0.73} 73%|███████▎ | 3113/4286 [19:45:17<6:48:47, 20.91s/it] 73%|███████▎ | 3114/4286 [19:45:37<6:44:42, 20.72s/it] {'loss': 0.0148, 'grad_norm': 12.544133586835843, 'learning_rate': 2.7344843677088193e-07, 'completion_length': 183.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 
1.0, 'reward': 1.7202381491661072, 'reward_std': 0.07738095708191395, 'kl': 0.37109375, 'epoch': 0.73} 73%|███████▎ | 3114/4286 [19:45:37<6:44:42, 20.72s/it] 73%|███████▎ | 3115/4286 [19:45:58<6:42:34, 20.63s/it] {'loss': 0.0233, 'grad_norm': 4.396495131391478, 'learning_rate': 2.7321511899206715e-07, 'completion_length': 197.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5163690894842148, 'rewards/format_reward': 1.0, 'reward': 1.516369104385376, 'reward_std': 0.049000305123627186, 'kl': 0.5830078125, 'epoch': 0.73} 73%|███████▎ | 3115/4286 [19:45:58<6:42:34, 20.63s/it] 73%|███████▎ | 3116/4286 [19:46:18<6:39:31, 20.49s/it] {'loss': 0.0087, 'grad_norm': 2.4322221723985074, 'learning_rate': 2.729818012132524e-07, 'completion_length': 198.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5580357760190964, 'rewards/format_reward': 1.0, 'reward': 1.5580358505249023, 'reward_std': 0.06274673342704773, 'kl': 0.21630859375, 'epoch': 0.73} 73%|███████▎ | 3116/4286 [19:46:18<6:39:31, 20.49s/it] 73%|███████▎ | 3117/4286 [19:46:39<6:43:01, 20.69s/it] {'loss': 0.0556, 'grad_norm': 3.1192885823571843, 'learning_rate': 2.727484834344377e-07, 'completion_length': 199.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.668154776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6502977013587952, 'reward_std': 0.10475895926356316, 'kl': 1.390625, 'epoch': 0.73} 73%|███████▎ | 3117/4286 [19:46:39<6:43:01, 20.69s/it] 73%|███████▎ | 3118/4286 [19:47:03<7:00:14, 21.59s/it] {'loss': 0.0556, 'grad_norm': 1.9743489752429944, 'learning_rate': 2.725151656556229e-07, 'completion_length': 196.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6732143461704254, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6375001072883606, 'reward_std': 0.1772100143134594, 'kl': 1.38671875, 'epoch': 0.73} 73%|███████▎ | 3118/4286 [19:47:03<7:00:14, 21.59s/it] 73%|███████▎ | 3119/4286 [19:47:25<7:05:24, 21.87s/it] {'loss': 0.037, 'grad_norm': 
6.781523201076222, 'learning_rate': 2.722818478768082e-07, 'completion_length': 213.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6406127214431763, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6048985719680786, 'reward_std': 0.12526641972362995, 'kl': 0.9267578125, 'epoch': 0.73} 73%|███████▎ | 3119/4286 [19:47:25<7:05:24, 21.87s/it] 73%|███████▎ | 3120/4286 [19:47:47<7:04:58, 21.87s/it] {'loss': 0.03, 'grad_norm': 4.701170790861183, 'learning_rate': 2.720485300979934e-07, 'completion_length': 211.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.5424745380878448, 'rewards/format_reward': 1.0, 'reward': 1.5424745678901672, 'reward_std': 0.05849968222901225, 'kl': 0.7509765625, 'epoch': 0.73} 73%|███████▎ | 3120/4286 [19:47:47<7:04:58, 21.87s/it] 73%|███████▎ | 3121/4286 [19:48:11<7:18:42, 22.59s/it] {'loss': 0.0376, 'grad_norm': 7.352383043057576, 'learning_rate': 2.718152123191787e-07, 'completion_length': 183.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.4300595670938492, 'rewards/format_reward': 1.0, 'reward': 1.4300596714019775, 'reward_std': 0.11153335869312286, 'kl': 0.939453125, 'epoch': 0.73} 73%|███████▎ | 3121/4286 [19:48:11<7:18:42, 22.59s/it] 73%|███████▎ | 3122/4286 [19:48:33<7:13:07, 22.33s/it] {'loss': 0.0601, 'grad_norm': 2.2235507820054203, 'learning_rate': 2.7158189454036397e-07, 'completion_length': 198.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.4988095760345459, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4630953073501587, 'reward_std': 0.2003367468714714, 'kl': 1.498046875, 'epoch': 0.73} 73%|███████▎ | 3122/4286 [19:48:33<7:13:07, 22.33s/it] 73%|███████▎ | 3123/4286 [19:48:53<6:58:01, 21.57s/it] {'loss': 0.0514, 'grad_norm': 1.225343985709246, 'learning_rate': 2.713485767615492e-07, 'completion_length': 189.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5282738655805588, 'rewards/format_reward': 1.0, 'reward': 1.5282739400863647, 'reward_std': 
0.07573210448026657, 'kl': 1.28125, 'epoch': 0.73} 73%|███████▎ | 3123/4286 [19:48:53<6:58:01, 21.57s/it] 73%|███████▎ | 3124/4286 [19:49:15<7:00:59, 21.74s/it] {'loss': 0.0114, 'grad_norm': 2.3056516042478807, 'learning_rate': 2.7111525898273447e-07, 'completion_length': 187.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.548894613981247, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.531037449836731, 'reward_std': 0.07482994068413973, 'kl': 0.28515625, 'epoch': 0.73} 73%|███████▎ | 3124/4286 [19:49:15<7:00:59, 21.74s/it] 73%|███████▎ | 3125/4286 [19:49:37<7:03:35, 21.89s/it] {'loss': 0.0171, 'grad_norm': 2.183270927094988, 'learning_rate': 2.7088194120391974e-07, 'completion_length': 198.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7752976417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7574405670166016, 'reward_std': 0.0744047649204731, 'kl': 0.42919921875, 'epoch': 0.73} 73%|███████▎ | 3125/4286 [19:49:37<7:03:35, 21.89s/it] 73%|███████▎ | 3126/4286 [19:50:01<7:15:36, 22.53s/it] {'loss': 0.0163, 'grad_norm': 1.7422305471385782, 'learning_rate': 2.7064862342510496e-07, 'completion_length': 206.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.68452388048172, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.030682737939059734, 'kl': 0.408203125, 'epoch': 0.73} 73%|███████▎ | 3126/4286 [19:50:01<7:15:36, 22.53s/it] 73%|███████▎ | 3127/4286 [19:50:23<7:11:32, 22.34s/it] {'loss': 0.0185, 'grad_norm': 4.969012482612814, 'learning_rate': 2.7041530564629024e-07, 'completion_length': 209.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6526786386966705, 'rewards/format_reward': 1.0, 'reward': 1.6526786088943481, 'reward_std': 0.05609964672476053, 'kl': 0.4619140625, 'epoch': 0.73} 73%|███████▎ | 3127/4286 [19:50:23<7:11:32, 22.34s/it] 73%|███████▎ | 3128/4286 [19:50:44<7:04:51, 22.01s/it] {'loss': 0.0356, 'grad_norm': 36.20453331591098, 'learning_rate': 
2.7018198786747546e-07, 'completion_length': 217.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.5568182021379471, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5211040377616882, 'reward_std': 0.13022628473117948, 'kl': 0.890625, 'epoch': 0.73} 73%|███████▎ | 3128/4286 [19:50:44<7:04:51, 22.01s/it] 73%|███████▎ | 3129/4286 [19:51:04<6:52:12, 21.38s/it] {'loss': 0.0094, 'grad_norm': 0.9341054973742554, 'learning_rate': 2.6994867008866074e-07, 'completion_length': 193.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.6151786148548126, 'rewards/format_reward': 1.0, 'reward': 1.615178644657135, 'reward_std': 0.01709691435098648, 'kl': 0.236328125, 'epoch': 0.73} 73%|███████▎ | 3129/4286 [19:51:04<6:52:12, 21.38s/it] 73%|███████▎ | 3130/4286 [19:51:27<6:58:42, 21.73s/it] {'loss': 0.0366, 'grad_norm': 133.76135472421615, 'learning_rate': 2.69715352309846e-07, 'completion_length': 218.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7979167103767395, 'rewards/format_reward': 1.0, 'reward': 1.7979167699813843, 'reward_std': 0.0639858078211546, 'kl': 0.9140625, 'epoch': 0.73} 73%|███████▎ | 3130/4286 [19:51:27<6:58:42, 21.73s/it] 73%|███████▎ | 3131/4286 [19:51:50<7:05:05, 22.08s/it] {'loss': 0.0183, 'grad_norm': 2.358048178548839, 'learning_rate': 2.6948203453103123e-07, 'completion_length': 217.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6586310267448425, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.640773892402649, 'reward_std': 0.10734344646334648, 'kl': 0.45751953125, 'epoch': 0.73} 73%|███████▎ | 3131/4286 [19:51:50<7:05:05, 22.08s/it] 73%|███████▎ | 3132/4286 [19:52:09<6:47:09, 21.17s/it] {'loss': 0.0446, 'grad_norm': 6.391168536928628, 'learning_rate': 2.692487167522165e-07, 'completion_length': 194.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.596726268529892, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5788691639900208, 'reward_std': 0.08898506313562393, 'kl': 
1.1123046875, 'epoch': 0.73} 73%|███████▎ | 3132/4286 [19:52:09<6:47:09, 21.17s/it] 73%|███████▎ | 3133/4286 [19:52:29<6:38:14, 20.72s/it] {'loss': 0.0107, 'grad_norm': 0.8402099279673372, 'learning_rate': 2.6901539897340173e-07, 'completion_length': 199.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 1.0, 'reward': 1.6235119700431824, 'reward_std': 0.008928571827709675, 'kl': 0.26708984375, 'epoch': 0.73} 73%|███████▎ | 3133/4286 [19:52:29<6:38:14, 20.72s/it] 73%|███████▎ | 3134/4286 [19:52:48<6:29:48, 20.30s/it] {'loss': 0.03, 'grad_norm': 6.656467163897978, 'learning_rate': 2.68782081194587e-07, 'completion_length': 175.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.0535714365541935, 'kl': 0.7509765625, 'epoch': 0.73} 73%|███████▎ | 3134/4286 [19:52:48<6:29:48, 20.30s/it] 73%|███████▎ | 3135/4286 [19:53:07<6:21:22, 19.88s/it] {'loss': 0.0072, 'grad_norm': 9.089237096385217, 'learning_rate': 2.685487634157723e-07, 'completion_length': 195.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6366071999073029, 'rewards/format_reward': 1.0, 'reward': 1.63660728931427, 'reward_std': 0.008793754503130913, 'kl': 0.18017578125, 'epoch': 0.73} 73%|███████▎ | 3135/4286 [19:53:07<6:21:22, 19.88s/it] 73%|███████▎ | 3136/4286 [19:53:32<6:50:18, 21.41s/it] {'loss': 0.0227, 'grad_norm': 57.14684284532239, 'learning_rate': 2.683154456369575e-07, 'completion_length': 224.50001525878906, 'rewards/only_full_func_accuracy_reward': 0.6017857789993286, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.583928644657135, 'reward_std': 0.16774092987179756, 'kl': 0.5673828125, 'epoch': 0.73} 73%|███████▎ | 3136/4286 [19:53:32<6:50:18, 21.41s/it] 73%|███████▎ | 3137/4286 [19:53:51<6:40:10, 20.90s/it] {'loss': 0.0462, 'grad_norm': 0.7791987653485993, 'learning_rate': 2.680821278581428e-07, 'completion_length': 
204.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.662202537059784, 'reward_std': 0.12697650492191315, 'kl': 1.16015625, 'epoch': 0.73} 73%|███████▎ | 3137/4286 [19:53:51<6:40:10, 20.90s/it] 73%|███████▎ | 3138/4286 [19:54:11<6:32:18, 20.50s/it] {'loss': 0.0085, 'grad_norm': 1.5721792429981523, 'learning_rate': 2.67848810079328e-07, 'completion_length': 214.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6877976059913635, 'rewards/format_reward': 1.0, 'reward': 1.687797725200653, 'reward_std': 0.005357143934816122, 'kl': 0.21142578125, 'epoch': 0.73} 73%|███████▎ | 3138/4286 [19:54:11<6:32:18, 20.50s/it][2025-03-03 01:01:50,207] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 73%|███████▎ | 3139/4286 [19:54:34<6:48:15, 21.36s/it] {'loss': 0.0333, 'grad_norm': 5.512536597355054, 'learning_rate': 2.676154923005133e-07, 'completion_length': 245.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5142006874084473, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4784865379333496, 'reward_std': 0.11167037487030029, 'kl': 0.8330078125, 'epoch': 0.73} 73%|███████▎ | 3139/4286 [19:54:34<6:48:15, 21.36s/it][2025-03-03 01:02:12,149] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
3140/4286 [19:54:56<6:51:15, 21.53s/it] {'loss': 0.0418, 'grad_norm': 4.664824215544329, 'learning_rate': 2.6738217452169855e-07, 'completion_length': 212.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6770834922790527, 'reward_std': 0.1669583022594452, 'kl': 1.0478515625, 'epoch': 0.73}
3141/4286 [19:55:19<6:55:11, 21.76s/it] {'loss': 0.074, 'grad_norm': 3.979995081754287, 'learning_rate': 2.671488567428838e-07, 'completion_length': 218.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.4895833730697632, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4538692235946655, 'reward_std': 0.17357773706316948, 'kl': 1.84375, 'epoch': 0.73}
3142/4286 [19:55:43<7:09:42, 22.54s/it] {'loss': 0.1024, 'grad_norm': 2.9481022453760786, 'learning_rate': 2.6691553896406905e-07, 'completion_length': 232.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5223214626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4866071939468384, 'reward_std': 0.21407204121351242, 'kl': 2.546875, 'epoch': 0.73}
3143/4286 [19:56:09<7:26:53, 23.46s/it] {'loss': 0.0443, 'grad_norm': 3.5746105102723775, 'learning_rate': 2.6668222118525427e-07, 'completion_length': 223.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7029761970043182, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6494048833847046, 'reward_std': 0.16152720153331757, 'kl': 1.109375, 'epoch': 0.73}
3144/4286 [19:56:31<7:18:17, 23.03s/it] {'loss': 0.0297, 'grad_norm': 63.172791199070346, 'learning_rate': 2.6644890340643955e-07, 'completion_length': 206.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 1.0, 'reward': 1.611607313156128, 'reward_std': 0.07272235490381718, 'kl': 0.7431640625, 'epoch': 0.73}
[2025-03-03 01:04:06,904] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
3145/4286 [19:56:51<7:03:24, 22.26s/it] {'loss': 0.0529, 'grad_norm': 6.334302789798192, 'learning_rate': 2.662155856276248e-07, 'completion_length': 208.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7431547939777374, 'rewards/format_reward': 1.0, 'reward': 1.7431548237800598, 'reward_std': 0.12859654799103737, 'kl': 1.328125, 'epoch': 0.73}
3146/4286 [19:57:10<6:46:58, 21.42s/it] {'loss': 0.0337, 'grad_norm': 4.1636634293748225, 'learning_rate': 2.6598226784881004e-07, 'completion_length': 207.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.7255952656269073, 'rewards/format_reward': 1.0, 'reward': 1.7255953550338745, 'reward_std': 0.02767553413286805, 'kl': 0.83984375, 'epoch': 0.73}
3147/4286 [19:57:33<6:50:35, 21.63s/it] {'loss': 0.0403, 'grad_norm': 2.1557903476828404, 'learning_rate': 2.657489500699953e-07, 'completion_length': 190.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5922619700431824, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5565477013587952, 'reward_std': 0.125, 'kl': 1.00439453125, 'epoch': 0.73}
3148/4286 [19:57:55<6:54:14, 21.84s/it] {'loss': 0.033, 'grad_norm': 6.6580492881194, 'learning_rate': 2.655156322911806e-07, 'completion_length': 225.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7138736546039581, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6960166096687317, 'reward_std': 0.17157085239887238, 'kl': 0.828125, 'epoch': 0.73}
3149/4286 [19:58:18<7:00:05, 22.17s/it] {'loss': 0.051, 'grad_norm': 6.1167078667915495, 'learning_rate': 2.652823145123658e-07, 'completion_length': 217.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.5994047969579697, 'rewards/format_reward': 1.0, 'reward': 1.5994048714637756, 'reward_std': 0.08766740374267101, 'kl': 1.2744140625, 'epoch': 0.73}
3150/4286 [19:58:40<6:59:03, 22.13s/it] {'loss': 0.0623, 'grad_norm': 2.516468363092672, 'learning_rate': 2.650489967335511e-07, 'completion_length': 207.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.7232143878936768, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.669642984867096, 'reward_std': 0.2083333432674408, 'kl': 1.5546875, 'epoch': 0.73}
3151/4286 [19:59:01<6:54:22, 21.91s/it] {'loss': 0.0344, 'grad_norm': 1.9590846079650872, 'learning_rate': 2.648156789547363e-07, 'completion_length': 187.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6616071164608002, 'rewards/format_reward': 1.0, 'reward': 1.661607265472412, 'reward_std': 0.1048775352537632, 'kl': 0.85986328125, 'epoch': 0.74}
3152/4286 [19:59:22<6:47:23, 21.55s/it] {'loss': 0.0902, 'grad_norm': 4.169316695999267, 'learning_rate': 2.645823611759216e-07, 'completion_length': 212.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.5863095223903656, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.532738208770752, 'reward_std': 0.18611687421798706, 'kl': 2.2578125, 'epoch': 0.74}
3153/4286 [19:59:42<6:39:41, 21.17s/it] {'loss': 0.0382, 'grad_norm': 1.9482875308645535, 'learning_rate': 2.6434904339710686e-07, 'completion_length': 190.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6636906266212463, 'reward_std': 0.08827208820730448, 'kl': 0.95751953125, 'epoch': 0.74}
3154/4286 [20:00:03<6:36:32, 21.02s/it] {'loss': 0.088, 'grad_norm': 8.34310113429403, 'learning_rate': 2.641157256182921e-07, 'completion_length': 193.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6090368032455444, 'rewards/format_reward': 1.0, 'reward': 1.609036922454834, 'reward_std': 0.10455026477575302, 'kl': 2.1953125, 'epoch': 0.74}
[2025-03-03 01:07:41,977] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 74%|███████▎ | 3155/4286 [20:00:26<6:48:13, 21.66s/it] {'loss': 0.1103, 'grad_norm': 3.279865550821806, 'learning_rate': 2.6388240783947736e-07, 'completion_length': 203.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.706250011920929, 'rewards/format_reward': 0.910714328289032, 'reward': 1.6169643998146057, 'reward_std': 0.24521122127771378, 'kl': 2.75390625, 'epoch': 0.74} 74%|███████▎ | 3155/4286 [20:00:26<6:48:13, 21.66s/it] 74%|███████▎ | 3156/4286 [20:00:50<6:58:01, 22.20s/it] {'loss': 0.0706, 'grad_norm': 5.535397865447362, 'learning_rate': 2.636490900606626e-07, 'completion_length': 220.71430206298828, 'rewards/only_full_func_accuracy_reward': 0.6526786088943481, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6348215341567993, 'reward_std': 0.13948357105255127, 'kl': 1.76171875, 'epoch': 0.74} 74%|███████▎ | 3156/4286 [20:00:50<6:58:01, 22.20s/it][2025-03-03 01:08:30,033] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 74%|███████▎ | 3157/4286 [20:01:14<7:11:14, 22.92s/it] {'loss': 0.1154, 'grad_norm': 6.645933101143994, 'learning_rate': 2.6341577228184786e-07, 'completion_length': 207.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5312501192092896, 'reward_std': 0.24512096494436264, 'kl': 2.890625, 'epoch': 0.74} 74%|███████▎ | 3157/4286 [20:01:14<7:11:14, 22.92s/it] 74%|███████▎ | 3158/4286 [20:01:37<7:09:40, 22.85s/it] {'loss': 0.0155, 'grad_norm': 7.158748298608476, 'learning_rate': 2.6318245450303313e-07, 'completion_length': 205.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.5535714775323868, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.05498574301600456, 'kl': 0.38671875, 'epoch': 0.74} 74%|███████▎ | 3158/4286 [20:01:37<7:09:40, 22.85s/it] 74%|███████▎ | 3159/4286 [20:01:58<6:57:08, 22.21s/it] {'loss': 0.0685, 'grad_norm': 2.155147771991007, 'learning_rate': 2.6294913672421836e-07, 'completion_length': 196.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.59226194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5744048953056335, 'reward_std': 0.12990923691540956, 'kl': 1.703125, 'epoch': 0.74} 74%|███████▎ | 3159/4286 [20:01:58<6:57:08, 22.21s/it] 74%|███████▎ | 3160/4286 [20:02:18<6:47:16, 21.70s/it] {'loss': 0.0283, 'grad_norm': 1.8641626395733861, 'learning_rate': 2.6271581894540363e-07, 'completion_length': 202.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6375000476837158, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6196429133415222, 'reward_std': 0.08928572107106447, 'kl': 0.70751953125, 'epoch': 0.74} 74%|███████▎ | 3160/4286 [20:02:18<6:47:16, 21.70s/it] 74%|███████▍ | 3161/4286 
[20:02:38<6:38:04, 21.23s/it] {'loss': 0.0102, 'grad_norm': 3.2591216564607817, 'learning_rate': 2.6248250116658885e-07, 'completion_length': 197.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6608496308326721, 'rewards/format_reward': 1.0, 'reward': 1.6608496308326721, 'reward_std': 0.04734848625957966, 'kl': 0.2548828125, 'epoch': 0.74} 74%|███████▍ | 3161/4286 [20:02:38<6:38:04, 21.23s/it] 74%|███████▍ | 3162/4286 [20:03:00<6:39:35, 21.33s/it] {'loss': 0.0343, 'grad_norm': 7.742051008609584, 'learning_rate': 2.6224918338777413e-07, 'completion_length': 189.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6470238864421844, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6113095879554749, 'reward_std': 0.14117566868662834, 'kl': 0.85546875, 'epoch': 0.74} 74%|███████▍ | 3162/4286 [20:03:00<6:39:35, 21.33s/it] 74%|███████▍ | 3163/4286 [20:03:22<6:42:00, 21.48s/it] {'loss': 0.0545, 'grad_norm': 3.022531398677834, 'learning_rate': 2.620158656089594e-07, 'completion_length': 221.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.55952388048172, 'rewards/format_reward': 1.0, 'reward': 1.5595239400863647, 'reward_std': 0.05197649821639061, 'kl': 1.35546875, 'epoch': 0.74} 74%|███████▍ | 3163/4286 [20:03:22<6:42:00, 21.48s/it] 74%|███████▍ | 3164/4286 [20:03:41<6:30:27, 20.88s/it] {'loss': 0.0202, 'grad_norm': 4.48078708271375, 'learning_rate': 2.617825478301446e-07, 'completion_length': 176.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.7244048118591309, 'rewards/format_reward': 1.0, 'reward': 1.7244048118591309, 'reward_std': 0.047224972397089005, 'kl': 0.50341796875, 'epoch': 0.74} 74%|███████▍ | 3164/4286 [20:03:41<6:30:27, 20.88s/it] 74%|███████▍ | 3165/4286 [20:04:01<6:22:15, 20.46s/it] {'loss': 0.0255, 'grad_norm': 0.6899760977140196, 'learning_rate': 2.615492300513299e-07, 'completion_length': 197.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7434524297714233, 'rewards/format_reward': 1.0, 'reward': 
1.7434524893760681, 'reward_std': 0.0385858528316021, 'kl': 0.63916015625, 'epoch': 0.74} 74%|███████▍ | 3165/4286 [20:04:01<6:22:15, 20.46s/it] 74%|███████▍ | 3166/4286 [20:04:19<6:12:09, 19.94s/it] {'loss': 0.0279, 'grad_norm': 3.213907855910422, 'learning_rate': 2.613159122725151e-07, 'completion_length': 183.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6357143223285675, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.617857277393341, 'reward_std': 0.09379715472459793, 'kl': 0.701171875, 'epoch': 0.74} 74%|███████▍ | 3166/4286 [20:04:19<6:12:09, 19.94s/it] 74%|███████▍ | 3167/4286 [20:04:41<6:19:24, 20.34s/it] {'loss': 0.0584, 'grad_norm': 3.563821842628993, 'learning_rate': 2.610825944937004e-07, 'completion_length': 201.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6333333849906921, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.597619116306305, 'reward_std': 0.1549631953239441, 'kl': 1.4609375, 'epoch': 0.74} 74%|███████▍ | 3167/4286 [20:04:41<6:19:24, 20.34s/it] 74%|███████▍ | 3168/4286 [20:05:01<6:19:08, 20.35s/it] {'loss': 0.0375, 'grad_norm': 6.35904794596957, 'learning_rate': 2.6084927671488567e-07, 'completion_length': 199.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6270833611488342, 'rewards/format_reward': 1.0, 'reward': 1.627083420753479, 'reward_std': 0.09007730334997177, 'kl': 0.939453125, 'epoch': 0.74} 74%|███████▍ | 3168/4286 [20:05:01<6:19:08, 20.35s/it] 74%|███████▍ | 3169/4286 [20:05:23<6:30:56, 21.00s/it] {'loss': 0.024, 'grad_norm': 6.789250962623268, 'learning_rate': 2.606159589360709e-07, 'completion_length': 210.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5758928954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5401785969734192, 'reward_std': 0.12834678776562214, 'kl': 0.59716796875, 'epoch': 0.74} 74%|███████▍ | 3169/4286 [20:05:23<6:30:56, 21.00s/it] 74%|███████▍ | 3170/4286 [20:05:43<6:23:39, 20.63s/it] {'loss': 0.0069, 'grad_norm': 
2.7171925249032127, 'learning_rate': 2.6038264115725617e-07, 'completion_length': 199.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7979167401790619, 'rewards/format_reward': 1.0, 'reward': 1.797916829586029, 'reward_std': 0.02296610688790679, 'kl': 0.17138671875, 'epoch': 0.74} 74%|███████▍ | 3170/4286 [20:05:43<6:23:39, 20.63s/it] 74%|███████▍ | 3171/4286 [20:06:03<6:16:22, 20.25s/it] {'loss': 0.0328, 'grad_norm': 3.7509457594650595, 'learning_rate': 2.6014932337844145e-07, 'completion_length': 190.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5208334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5029762983322144, 'reward_std': 0.10362444445490837, 'kl': 0.8193359375, 'epoch': 0.74} 74%|███████▍ | 3171/4286 [20:06:03<6:16:22, 20.25s/it] 74%|███████▍ | 3172/4286 [20:06:24<6:23:36, 20.66s/it] {'loss': 0.0338, 'grad_norm': 33.1325201295606, 'learning_rate': 2.5991600559962667e-07, 'completion_length': 181.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.6294642984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6116071939468384, 'reward_std': 0.1329149305820465, 'kl': 0.845703125, 'epoch': 0.74} 74%|███████▍ | 3172/4286 [20:06:24<6:23:36, 20.66s/it] 74%|███████▍ | 3173/4286 [20:06:42<6:07:07, 19.79s/it] {'loss': 0.0261, 'grad_norm': 1.8797028872256567, 'learning_rate': 2.5968268782081194e-07, 'completion_length': 164.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.6219094097614288, 'rewards/format_reward': 1.0, 'reward': 1.6219094395637512, 'reward_std': 0.054047103971242905, 'kl': 0.65234375, 'epoch': 0.74} 74%|███████▍ | 3173/4286 [20:06:42<6:07:07, 19.79s/it] 74%|███████▍ | 3174/4286 [20:07:01<6:02:17, 19.55s/it] {'loss': 0.0315, 'grad_norm': 1.6332898643303568, 'learning_rate': 2.5944937004199717e-07, 'completion_length': 184.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5833333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5654763579368591, 
'reward_std': 0.056678008288145065, 'kl': 0.7880859375, 'epoch': 0.74} 74%|███████▍ | 3174/4286 [20:07:01<6:02:17, 19.55s/it] 74%|███████▍ | 3175/4286 [20:07:24<6:19:41, 20.51s/it] {'loss': 0.0299, 'grad_norm': 27.8798833792285, 'learning_rate': 2.5921605226318244e-07, 'completion_length': 201.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.6687500774860382, 'rewards/format_reward': 1.0, 'reward': 1.6687501072883606, 'reward_std': 0.055357142351567745, 'kl': 0.7470703125, 'epoch': 0.74} 74%|███████▍ | 3175/4286 [20:07:24<6:19:41, 20.51s/it] 74%|███████▍ | 3176/4286 [20:07:42<6:06:49, 19.83s/it] {'loss': 0.0237, 'grad_norm': 11.995979311412501, 'learning_rate': 2.589827344843677e-07, 'completion_length': 185.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7842262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.05495268478989601, 'kl': 0.59375, 'epoch': 0.74} 74%|███████▍ | 3176/4286 [20:07:42<6:06:49, 19.83s/it] 74%|███████▍ | 3177/4286 [20:08:01<6:01:12, 19.54s/it] {'loss': 0.0238, 'grad_norm': 1.2872857149720427, 'learning_rate': 2.5874941670555294e-07, 'completion_length': 188.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098214626312256, 'reward_std': 0.05495268199592829, 'kl': 0.5947265625, 'epoch': 0.74} 74%|███████▍ | 3177/4286 [20:08:01<6:01:12, 19.54s/it] 74%|███████▍ | 3178/4286 [20:08:22<6:09:56, 20.03s/it] {'loss': 0.0585, 'grad_norm': 7.708648460159838, 'learning_rate': 2.585160989267382e-07, 'completion_length': 199.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5889881253242493, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5711310505867004, 'reward_std': 0.13166452944278717, 'kl': 1.46484375, 'epoch': 0.74} 74%|███████▍ | 3178/4286 [20:08:22<6:09:56, 20.03s/it] 74%|███████▍ | 3179/4286 [20:08:41<6:03:48, 19.72s/it] {'loss': 0.0126, 'grad_norm': 8.176775115804011, 'learning_rate': 
2.5828278114792344e-07, 'completion_length': 196.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.07327024824917316, 'kl': 0.31396484375, 'epoch': 0.74} 74%|███████▍ | 3179/4286 [20:08:41<6:03:48, 19.72s/it] 74%|███████▍ | 3180/4286 [20:09:01<6:04:20, 19.77s/it] {'loss': 0.068, 'grad_norm': 9.888848703861592, 'learning_rate': 2.580494633691087e-07, 'completion_length': 184.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6748512089252472, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6391370296478271, 'reward_std': 0.11850002035498619, 'kl': 1.6953125, 'epoch': 0.74} 74%|███████▍ | 3180/4286 [20:09:01<6:04:20, 19.77s/it] 74%|███████▍ | 3181/4286 [20:09:20<5:59:32, 19.52s/it] {'loss': 0.0084, 'grad_norm': 3.5186454885663987, 'learning_rate': 2.57816145590294e-07, 'completion_length': 199.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.08667591214179993, 'kl': 0.20849609375, 'epoch': 0.74} 74%|███████▍ | 3181/4286 [20:09:20<5:59:32, 19.52s/it] 74%|███████▍ | 3182/4286 [20:09:39<5:55:04, 19.30s/it] {'loss': 0.0114, 'grad_norm': 1.4121106986708707, 'learning_rate': 2.575828278114792e-07, 'completion_length': 185.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.044642859138548374, 'kl': 0.28466796875, 'epoch': 0.74} 74%|███████▍ | 3182/4286 [20:09:39<5:55:04, 19.30s/it] 74%|███████▍ | 3183/4286 [20:09:58<5:55:43, 19.35s/it] {'loss': 0.0112, 'grad_norm': 19.29765011809258, 'learning_rate': 2.573495100326645e-07, 'completion_length': 200.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6761904954910278, 'rewards/format_reward': 1.0, 'reward': 1.6761905550956726, 'reward_std': 0.04308671108447015, 'kl': 0.28125, 'epoch': 0.74} 
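The DeepSpeed warnings interleaved in this log recommend adding `get_accelerator().empty_cache()` calls to the training loop so that all ranks flush their allocator caches at the same time. A minimal, hedged sketch of that advice is below; the helper name, the 50-step interval, and the `accelerator` override are illustration-only assumptions, not from the log, and the right interval should be tuned against how often the warning actually fires (flushing too often hurts throughput).

```python
# Hedged sketch of the DeepSpeed warning's advice: periodically call
# get_accelerator().empty_cache() so every rank flushes its allocator cache
# at the same step. Interval and helper name are assumptions for illustration.
try:
    from deepspeed.accelerator import get_accelerator  # DeepSpeed device abstraction
except ImportError:  # let the sketch run without DeepSpeed installed
    get_accelerator = None

def maybe_empty_cache(step, every=50, accelerator=None):
    """Flush the allocator cache every `every` optimizer steps.

    Returns True when a flush was due at this step. Every rank must call this
    with the same `step` value so the flushes stay synchronized across ranks.
    """
    if every <= 0 or step % every != 0:
        return False
    if accelerator is None and get_accelerator is not None:
        accelerator = get_accelerator()
    if accelerator is not None:
        accelerator.empty_cache()  # e.g. torch.cuda.empty_cache() on CUDA
    return True
```

In a real loop this would be called once per step right after `optimizer.step()`; keying it off the shared step counter (rather than local memory pressure) is what keeps the flushes aligned across ranks, as the warning suggests.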
74%|███████▍ | 3184/4286 [20:10:17<5:52:43, 19.20s/it] {'loss': 0.0069, 'grad_norm': 7.721360781496995, 'learning_rate': 2.571161922538497e-07, 'completion_length': 190.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.7428571879863739, 'rewards/format_reward': 1.0, 'reward': 1.7428572177886963, 'reward_std': 0.028032148256897926, 'kl': 0.17236328125, 'epoch': 0.74}
74%|███████▍ | 3185/4286 [20:10:36<5:51:41, 19.17s/it] {'loss': 0.0216, 'grad_norm': 18.989162219744777, 'learning_rate': 2.56882874475035e-07, 'completion_length': 199.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6514880955219269, 'rewards/format_reward': 1.0, 'reward': 1.6514882445335388, 'reward_std': 0.10270530730485916, 'kl': 0.5400390625, 'epoch': 0.74}
74%|███████▍ | 3186/4286 [20:10:56<5:57:15, 19.49s/it] {'loss': 0.027, 'grad_norm': 4.61250577869753, 'learning_rate': 2.5664955669622026e-07, 'completion_length': 210.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5398809909820557, 'rewards/format_reward': 1.0, 'reward': 1.5398810505867004, 'reward_std': 0.0971916951239109, 'kl': 0.67041015625, 'epoch': 0.74}
74%|███████▍ | 3187/4286 [20:11:16<5:55:49, 19.43s/it] {'loss': 0.0072, 'grad_norm': 1.0241288721359674, 'learning_rate': 2.564162389174055e-07, 'completion_length': 201.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.01236517820507288, 'kl': 0.18017578125, 'epoch': 0.74}
74%|███████▍ | 3188/4286 [20:11:38<6:13:01, 20.38s/it] {'loss': 0.0142, 'grad_norm': 3.7928567738496333, 'learning_rate': 2.5618292113859075e-07, 'completion_length': 205.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6553571820259094, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6375001072883606, 'reward_std': 0.09240813180804253, 'kl': 0.35546875, 'epoch': 0.74}
74%|███████▍ | 3189/4286 [20:11:58<6:07:50, 20.12s/it] {'loss': 0.0107, 'grad_norm': 6.839776319934126, 'learning_rate': 2.55949603359776e-07, 'completion_length': 218.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6559524238109589, 'rewards/format_reward': 1.0, 'reward': 1.6559524536132812, 'reward_std': 0.03333333251066506, 'kl': 0.26708984375, 'epoch': 0.74}
74%|███████▍ | 3190/4286 [20:12:19<6:12:50, 20.41s/it] {'loss': 0.0116, 'grad_norm': 2.0691192096828637, 'learning_rate': 2.5571628558096125e-07, 'completion_length': 210.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6610119640827179, 'rewards/format_reward': 1.0, 'reward': 1.6610119342803955, 'reward_std': 0.07083333283662796, 'kl': 0.291015625, 'epoch': 0.74}
74%|███████▍ | 3191/4286 [20:12:41<6:22:27, 20.96s/it] {'loss': 0.0153, 'grad_norm': 5.521398936647675, 'learning_rate': 2.554829678021465e-07, 'completion_length': 212.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6883929073810577, 'rewards/format_reward': 1.0, 'reward': 1.6883928775787354, 'reward_std': 0.04143310710787773, 'kl': 0.38134765625, 'epoch': 0.74}
74%|███████▍ | 3192/4286 [20:13:04<6:31:49, 21.49s/it] {'loss': 0.0322, 'grad_norm': 2.395515882293336, 'learning_rate': 2.5524965002333175e-07, 'completion_length': 210.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5822916924953461, 'rewards/format_reward': 1.0, 'reward': 1.5822916626930237, 'reward_std': 0.07498600520193577, 'kl': 0.8037109375, 'epoch': 0.74}
74%|███████▍ | 3193/4286 [20:13:24<6:24:32, 21.11s/it] {'loss': 0.0141, 'grad_norm': 6.064480083958274, 'learning_rate': 2.55016332244517e-07, 'completion_length': 193.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6491497159004211, 'rewards/format_reward': 1.0, 'reward': 1.6491497159004211, 'reward_std': 0.07713404670357704, 'kl': 0.35302734375, 'epoch': 0.74}
75%|███████▍ | 3194/4286 [20:13:44<6:19:45, 20.87s/it] {'loss': 0.0345, 'grad_norm': 9.270107428678681, 'learning_rate': 2.547830144657023e-07, 'completion_length': 206.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.514328271150589, 'rewards/format_reward': 1.0, 'reward': 1.5143283009529114, 'reward_std': 0.1467251032590866, 'kl': 0.86328125, 'epoch': 0.75}
75%|███████▍ | 3195/4286 [20:14:04<6:14:21, 20.59s/it] {'loss': 0.0077, 'grad_norm': 2.307298116093934, 'learning_rate': 2.545496966868875e-07, 'completion_length': 187.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.644345223903656, 'rewards/format_reward': 1.0, 'reward': 1.6443454027175903, 'reward_std': 0.05838928930461407, 'kl': 0.19189453125, 'epoch': 0.75}
75%|███████▍ | 3196/4286 [20:14:23<6:06:55, 20.20s/it] {'loss': 0.0212, 'grad_norm': 1.0786778214082864, 'learning_rate': 2.543163789080728e-07, 'completion_length': 180.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6973214745521545, 'rewards/format_reward': 1.0, 'reward': 1.6973215341567993, 'reward_std': 0.033928575459867716, 'kl': 0.529296875, 'epoch': 0.75}
75%|███████▍ | 3197/4286 [20:14:42<5:59:20, 19.80s/it] {'loss': 0.0086, 'grad_norm': 1.51269855238371, 'learning_rate': 2.54083061129258e-07, 'completion_length': 183.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.5491072088479996, 'rewards/format_reward': 1.0, 'reward': 1.5491071939468384, 'reward_std': 0.015801792964339256, 'kl': 0.21484375, 'epoch': 0.75}
75%|███████▍ | 3198/4286 [20:15:02<6:00:30, 19.88s/it] {'loss': 0.007, 'grad_norm': 2.839586337349847, 'learning_rate': 2.538497433504433e-07, 'completion_length': 206.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6952381432056427, 'rewards/format_reward': 1.0, 'reward': 1.695238173007965, 'reward_std': 0.04626357927918434, 'kl': 0.17626953125, 'epoch': 0.75}
[2025-03-03 01:22:41,984] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
75%|███████▍ | 3199/4286 [20:15:26<6:21:08, 21.04s/it] {'loss': 0.012, 'grad_norm': 1.20720158893258, 'learning_rate': 2.5361642557162857e-07, 'completion_length': 229.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6406250298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6227680444717407, 'reward_std': 0.05275922268629074, 'kl': 0.30126953125, 'epoch': 0.75}
[2025-03-03 01:23:01,039] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
75%|███████▍ | 3200/4286 [20:15:45<6:10:01, 20.44s/it] {'loss': 0.0065, 'grad_norm': 0.8050719301586662, 'learning_rate': 2.533831077928138e-07, 'completion_length': 209.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.01565450057387352, 'kl': 0.162109375, 'epoch': 0.75}
75%|███████▍ | 3201/4286 [20:20:51<31:55:56, 105.95s/it] {'loss': 0.0073, 'grad_norm': 4.5024359040401345, 'learning_rate': 2.5314979001399907e-07, 'completion_length': 199.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.7113096117973328, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.043375805020332336, 'kl': 0.181640625, 'epoch': 0.75}
75%|███████▍ | 3202/4286 [20:21:09<23:58:24, 79.62s/it] {'loss': 0.0193, 'grad_norm': 6.987682683380655, 'learning_rate': 2.529164722351843e-07, 'completion_length': 188.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5944940745830536, 'rewards/format_reward': 1.0, 'reward': 1.5944941639900208, 'reward_std': 0.022321430034935474, 'kl': 0.4814453125, 'epoch': 0.75}
75%|███████▍ | 3203/4286 [20:21:27<18:25:10, 61.23s/it] {'loss': 0.0068, 'grad_norm': 1.074832061294645, 'learning_rate': 2.5268315445636956e-07, 'completion_length': 192.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7154762744903564, 'rewards/format_reward': 1.0, 'reward': 1.7154763340950012, 'reward_std': 0.03333957400172949, 'kl': 0.16943359375, 'epoch': 0.75}
75%|███████▍ | 3204/4286 [20:21:45<14:31:28, 48.33s/it] {'loss': 0.0078, 'grad_norm': 2.9155802653280944, 'learning_rate': 2.5244983667755484e-07, 'completion_length': 179.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715818405151, 'reward_std': 0.0535714291036129, 'kl': 0.19482421875, 'epoch': 0.75}
75%|███████▍ | 3205/4286 [20:22:06<12:00:06, 39.97s/it] {'loss': 0.0067, 'grad_norm': 8.618084904880117, 'learning_rate': 2.5221651889874006e-07, 'completion_length': 217.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7053571343421936, 'rewards/format_reward': 1.0, 'reward': 1.705357313156128, 'reward_std': 0.06664376333355904, 'kl': 0.16845703125, 'epoch': 0.75}
75%|███████▍ | 3206/4286 [20:22:25<10:07:35, 33.75s/it] {'loss': 0.0066, 'grad_norm': 0.40893203853628085, 'learning_rate': 2.5198320111992534e-07, 'completion_length': 200.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.011904762126505375, 'kl': 0.16455078125, 'epoch': 0.75}
75%|███████▍ | 3207/4286 [20:22:44<8:47:55, 29.36s/it] {'loss': 0.0217, 'grad_norm': 4.386924092282056, 'learning_rate': 2.5174988334111056e-07, 'completion_length': 192.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7011905014514923, 'rewards/format_reward': 1.0, 'reward': 1.7011905312538147, 'reward_std': 0.14135389029979706, 'kl': 0.5419921875, 'epoch': 0.75}
75%|███████▍ | 3208/4286 [20:23:04<7:54:49, 26.43s/it] {'loss': 0.0125, 'grad_norm': 3.3567492001966843, 'learning_rate': 2.5151656556229583e-07, 'completion_length': 189.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.6145833730697632, 'rewards/format_reward': 1.0, 'reward': 1.614583432674408, 'reward_std': 0.09728010464459658, 'kl': 0.31298828125, 'epoch': 0.75}
75%|███████▍ | 3209/4286 [20:23:21<7:06:18, 23.75s/it] {'loss': 0.0275, 'grad_norm': 4.6157697159215445, 'learning_rate': 2.512832477834811e-07, 'completion_length': 159.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.8005953133106232, 'rewards/format_reward': 1.0, 'reward': 1.8005954027175903, 'reward_std': 0.1011904813349247, 'kl': 0.68994140625, 'epoch': 0.75}
75%|███████▍ | 3210/4286 [20:23:41<6:42:23, 22.44s/it] {'loss': 0.0262, 'grad_norm': 2.826025254601149, 'learning_rate': 2.5104993000466633e-07, 'completion_length': 197.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.07530966773629189, 'kl': 0.6552734375, 'epoch': 0.75}
75%|███████▍ | 3211/4286 [20:24:00<6:23:24, 21.40s/it] {'loss': 0.1537, 'grad_norm': 2007.869689155412, 'learning_rate': 2.508166122258516e-07, 'completion_length': 189.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6916666626930237, 'rewards/format_reward': 1.0, 'reward': 1.6916667819023132, 'reward_std': 0.05821817368268967, 'kl': 3.859375, 'epoch': 0.75}
75%|███████▍ | 3212/4286 [20:24:19<6:11:57, 20.78s/it] {'loss': 0.0281, 'grad_norm': 1.100568325026445, 'learning_rate': 2.5058329444703683e-07, 'completion_length': 190.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6547619700431824, 'rewards/format_reward': 1.0, 'reward': 1.6547619700431824, 'reward_std': 0.02749287337064743, 'kl': 0.7041015625, 'epoch': 0.75}
75%|███████▍ | 3213/4286 [20:24:42<6:24:01, 21.47s/it] {'loss': 0.0701, 'grad_norm': 1.3002004267197502, 'learning_rate': 2.503499766682221e-07, 'completion_length': 197.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.627678632736206, 'rewards/format_reward': 1.0, 'reward': 1.6276786923408508, 'reward_std': 0.09968218952417374, 'kl': 1.7578125, 'epoch': 0.75}
75%|███████▍ | 3214/4286 [20:25:01<6:08:20, 20.62s/it] {'loss': 0.0295, 'grad_norm': 1.469649734898864, 'learning_rate': 2.501166588894074e-07, 'completion_length': 200.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.854166716337204, 'rewards/format_reward': 1.0, 'reward': 1.8541668057441711, 'reward_std': 0.11072053387761116, 'kl': 0.7353515625, 'epoch': 0.75}
[2025-03-03 01:32:35,451] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
75%|███████▌ | 3215/4286 [20:25:20<5:58:55, 20.11s/it] {'loss': 0.0962, 'grad_norm': 5.505965176766646, 'learning_rate': 2.498833411105926e-07, 'completion_length': 181.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.2369118332862854, 'kl': 2.40234375, 'epoch': 0.75}
75%|███████▌ | 3216/4286 [20:25:39<5:55:03, 19.91s/it] {'loss': 0.0814, 'grad_norm': 3.8936847466221587, 'learning_rate': 2.496500233317779e-07, 'completion_length': 198.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5514881312847137, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.533631145954132, 'reward_std': 0.13096310384571552, 'kl': 2.04052734375, 'epoch': 0.75}
75%|███████▌ | 3217/4286 [20:26:00<6:00:06, 20.21s/it] {'loss': 0.0116, 'grad_norm': 1.8367748237702715, 'learning_rate': 2.4941670555296315e-07, 'completion_length': 193.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.6979166865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6800596714019775, 'reward_std': 0.044642859138548374, 'kl': 0.2900390625, 'epoch': 0.75}
75%|███████▌ | 3218/4286 [20:26:21<6:03:41, 20.43s/it] {'loss': 0.0624, 'grad_norm': 6.170984983409384, 'learning_rate': 2.4918338777414837e-07, 'completion_length': 212.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.7002976536750793, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.664583444595337, 'reward_std': 0.18433920666575432, 'kl': 1.5546875, 'epoch': 0.75}
75%|███████▌ | 3219/4286 [20:26:41<5:59:47, 20.23s/it] {'loss': 0.0273, 'grad_norm': 5.817456867171438, 'learning_rate': 2.4895006999533365e-07, 'completion_length': 206.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6598214507102966, 'rewards/format_reward': 1.0, 'reward': 1.6598215103149414, 'reward_std': 0.0625, 'kl': 0.68408203125, 'epoch': 0.75}
[2025-03-03 01:34:17,161] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
75%|███████▌ | 3220/4286 [20:27:01<6:01:36, 20.35s/it] {'loss': 0.0189, 'grad_norm': 61.3017769444752, 'learning_rate': 2.4871675221651887e-07, 'completion_length': 203.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6133928894996643, 'rewards/format_reward': 1.0, 'reward': 1.6133928894996643, 'reward_std': 0.07379360310733318, 'kl': 0.470703125, 'epoch': 0.75}
75%|███████▌ | 3221/4286 [20:27:22<6:01:48, 20.38s/it] {'loss': 0.0653, 'grad_norm': 7.160622870313956, 'learning_rate': 2.4848343443770414e-07, 'completion_length': 183.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.8157738149166107, 'rewards/format_reward': 1.0, 'reward': 1.8157739043235779, 'reward_std': 0.0801902487874031, 'kl': 1.6328125, 'epoch': 0.75}
75%|███████▌ | 3222/4286 [20:27:41<5:55:42, 20.06s/it] {'loss': 0.0392, 'grad_norm': 5.712666853384862, 'learning_rate': 2.482501166588894e-07, 'completion_length': 191.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5717262327671051, 'rewards/format_reward': 1.0, 'reward': 1.5717262625694275, 'reward_std': 0.05138125643134117, 'kl': 0.97998046875, 'epoch': 0.75}
75%|███████▌ | 3223/4286 [20:28:03<6:07:40, 20.75s/it] {'loss': 0.0451, 'grad_norm': 2.8468986770325078, 'learning_rate': 2.4801679888007464e-07, 'completion_length': 196.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.4655754268169403, 'rewards/format_reward': 1.0, 'reward': 1.4655755162239075, 'reward_std': 0.04997079586610198, 'kl': 1.12109375, 'epoch': 0.75}
75%|███████▌ | 3224/4286 [20:28:25<6:13:29, 21.10s/it] {'loss': 0.0928, 'grad_norm': 1.5037735237447019, 'learning_rate': 2.477834811012599e-07, 'completion_length': 212.58930206298828, 'rewards/only_full_func_accuracy_reward': 0.6918983459472656, 'rewards/format_reward': 1.0, 'reward': 1.6918984055519104, 'reward_std': 0.15030189231038094, 'kl': 2.3203125, 'epoch': 0.75}
75%|███████▌ | 3225/4286 [20:28:45<6:03:01, 20.53s/it] {'loss': 0.0238, 'grad_norm': 1.088071525638265, 'learning_rate': 2.4755016332244514e-07, 'completion_length': 200.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562501192092896, 'reward_std': 0.044642859138548374, 'kl': 0.59228515625, 'epoch': 0.75}
75%|███████▌ | 3226/4286 [20:29:03<5:54:15, 20.05s/it] {'loss': 0.0296, 'grad_norm': 8.680322890132583, 'learning_rate': 2.473168455436304e-07, 'completion_length': 183.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.708333432674408, 'reward_std': 0.12099630758166313, 'kl': 0.740234375, 'epoch': 0.75}
75%|███████▌ | 3227/4286 [20:29:24<5:56:11, 20.18s/it] {'loss': 0.0605, 'grad_norm': 2.151979493371242, 'learning_rate': 2.470835277648157e-07, 'completion_length': 204.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 1.0, 'reward': 1.6473215818405151, 'reward_std': 0.15250568836927414, 'kl': 1.515625, 'epoch': 0.75}
75%|███████▌ | 3228/4286 [20:29:43<5:48:37, 19.77s/it] {'loss': 0.0278, 'grad_norm': 5.173829235580123, 'learning_rate': 2.468502099860009e-07, 'completion_length': 190.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.71726194024086, 'rewards/format_reward': 1.0, 'reward': 1.7172619700431824, 'reward_std': 0.049202305264770985, 'kl':
0.69189453125, 'epoch': 0.75} 75%|███████▌ | 3228/4286 [20:29:43<5:48:37, 19.77s/it] 75%|███████▌ | 3229/4286 [20:30:02<5:44:28, 19.55s/it] {'loss': 0.0281, 'grad_norm': 3.183793373489132, 'learning_rate': 2.466168922071862e-07, 'completion_length': 202.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.0720495954155922, 'kl': 0.7021484375, 'epoch': 0.75} 75%|███████▌ | 3229/4286 [20:30:02<5:44:28, 19.55s/it] 75%|███████▌ | 3230/4286 [20:30:22<5:50:03, 19.89s/it] {'loss': 0.0279, 'grad_norm': 2.7177350691821265, 'learning_rate': 2.463835744283714e-07, 'completion_length': 216.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.666666716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6488096714019775, 'reward_std': 0.11334197595715523, 'kl': 0.6982421875, 'epoch': 0.75} 75%|███████▌ | 3230/4286 [20:30:22<5:50:03, 19.89s/it] 75%|███████▌ | 3231/4286 [20:30:41<5:42:10, 19.46s/it] {'loss': 0.0072, 'grad_norm': 3.0519146339899823, 'learning_rate': 2.461502566495567e-07, 'completion_length': 186.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.7791666984558105, 'rewards/format_reward': 1.0, 'reward': 1.7791667580604553, 'reward_std': 0.020406460389494896, 'kl': 0.1796875, 'epoch': 0.75} 75%|███████▌ | 3231/4286 [20:30:41<5:42:10, 19.46s/it] 75%|███████▌ | 3232/4286 [20:30:59<5:35:58, 19.13s/it] {'loss': 0.0186, 'grad_norm': 4.221185906370853, 'learning_rate': 2.4591693887074196e-07, 'completion_length': 178.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6399235278367996, 'rewards/format_reward': 1.0, 'reward': 1.6399235725402832, 'reward_std': 0.06280326284468174, 'kl': 0.46484375, 'epoch': 0.75} 75%|███████▌ | 3232/4286 [20:30:59<5:35:58, 19.13s/it] 75%|███████▌ | 3233/4286 [20:31:19<5:40:02, 19.38s/it] {'loss': 0.0633, 'grad_norm': 2.9238019503924506, 'learning_rate': 2.456836210919272e-07, 'completion_length': 
187.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5238095819950104, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4880953431129456, 'reward_std': 0.166666679084301, 'kl': 1.58203125, 'epoch': 0.75} 75%|███████▌ | 3233/4286 [20:31:19<5:40:02, 19.38s/it] 75%|███████▌ | 3234/4286 [20:31:39<5:42:21, 19.53s/it] {'loss': 0.0612, 'grad_norm': 3.292181346810491, 'learning_rate': 2.4545030331311246e-07, 'completion_length': 201.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6026786118745804, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5669644474983215, 'reward_std': 0.12710600532591343, 'kl': 1.53515625, 'epoch': 0.75} 75%|███████▌ | 3234/4286 [20:31:39<5:42:21, 19.53s/it] 75%|███████▌ | 3235/4286 [20:31:59<5:45:27, 19.72s/it] {'loss': 0.0106, 'grad_norm': 13.762114560319722, 'learning_rate': 2.452169855342977e-07, 'completion_length': 199.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.616666704416275, 'rewards/format_reward': 1.0, 'reward': 1.6166667938232422, 'reward_std': 0.06769214570522308, 'kl': 0.26318359375, 'epoch': 0.75} 75%|███████▌ | 3235/4286 [20:31:59<5:45:27, 19.72s/it] 76%|███████▌ | 3236/4286 [20:32:18<5:38:27, 19.34s/it] {'loss': 0.119, 'grad_norm': 8.569597296641955, 'learning_rate': 2.4498366775548295e-07, 'completion_length': 186.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.6833333671092987, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6119049191474915, 'reward_std': 0.28668713569641113, 'kl': 2.98046875, 'epoch': 0.76} 76%|███████▌ | 3236/4286 [20:32:18<5:38:27, 19.34s/it] 76%|███████▌ | 3237/4286 [20:32:37<5:36:26, 19.24s/it] {'loss': 0.0419, 'grad_norm': 4.586396994108635, 'learning_rate': 2.4475034997666823e-07, 'completion_length': 192.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.818452388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8005953431129456, 'reward_std': 0.10901819542050362, 'kl': 1.04443359375, 'epoch': 0.76} 
76%|███████▌ | 3238/4286 [20:32:56<5:38:09, 19.36s/it] {'loss': 0.0473, 'grad_norm': 12.365463026973988, 'learning_rate': 2.4451703219785345e-07, 'completion_length': 192.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7258928716182709, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7080358266830444, 'reward_std': 0.14352985471487045, 'kl': 1.1796875, 'epoch': 0.76}
76%|███████▌ | 3239/4286 [20:33:17<5:42:08, 19.61s/it] {'loss': 0.0515, 'grad_norm': 2.7118352524321545, 'learning_rate': 2.4428371441903873e-07, 'completion_length': 205.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.5669643580913544, 'rewards/format_reward': 1.0, 'reward': 1.5669644474983215, 'reward_std': 0.044642859138548374, 'kl': 1.28515625, 'epoch': 0.76}
76%|███████▌ | 3240/4286 [20:33:36<5:39:12, 19.46s/it] {'loss': 0.0308, 'grad_norm': 2.1983000187266915, 'learning_rate': 2.44050396640224e-07, 'completion_length': 186.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.6660714447498322, 'rewards/format_reward': 1.0, 'reward': 1.6660714745521545, 'reward_std': 0.0759628601372242, 'kl': 0.7724609375, 'epoch': 0.76}
76%|███████▌ | 3241/4286 [20:33:55<5:39:36, 19.50s/it] {'loss': 0.0713, 'grad_norm': 10.643081962472946, 'learning_rate': 2.438170788614092e-07, 'completion_length': 195.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6815477013587952, 'reward_std': 0.2083333507180214, 'kl': 1.77734375, 'epoch': 0.76}
76%|███████▌ | 3242/4286 [20:34:15<5:40:00, 19.54s/it] {'loss': 0.0619, 'grad_norm': 2.0204468937296176, 'learning_rate': 2.435837610825945e-07, 'completion_length': 204.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7130952775478363, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.695238173007965, 'reward_std': 0.18115617334842682, 'kl': 1.546875, 'epoch': 0.76}
76%|███████▌ | 3243/4286 [20:34:34<5:39:24, 19.53s/it] {'loss': 0.0405, 'grad_norm': 16.257886667385165, 'learning_rate': 2.433504433037797e-07, 'completion_length': 200.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7559524476528168, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.030682743526995182, 'kl': 1.009765625, 'epoch': 0.76}
76%|███████▌ | 3244/4286 [20:34:56<5:48:48, 20.09s/it] {'loss': 0.0736, 'grad_norm': 8.771810809949207, 'learning_rate': 2.43117125524965e-07, 'completion_length': 199.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.4836309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.46577388048172, 'reward_std': 0.1041666679084301, 'kl': 1.84375, 'epoch': 0.76}
76%|███████▌ | 3245/4286 [20:35:15<5:44:41, 19.87s/it] {'loss': 0.0134, 'grad_norm': 1.6911637141247473, 'learning_rate': 2.4288380774615027e-07, 'completion_length': 200.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.6163690686225891, 'rewards/format_reward': 1.0, 'reward': 1.6163691878318787, 'reward_std': 0.02479754202067852, 'kl': 0.33447265625, 'epoch': 0.76}
76%|███████▌ | 3246/4286 [20:35:36<5:49:11, 20.15s/it] {'loss': 0.0414, 'grad_norm': 3.472500021828113, 'learning_rate': 2.426504899673355e-07, 'completion_length': 203.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.5241071730852127, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.506250023841858, 'reward_std': 0.14104429632425308, 'kl': 1.03466796875, 'epoch': 0.76}
76%|███████▌ | 3247/4286 [20:35:56<5:49:20, 20.17s/it] {'loss': 0.0366, 'grad_norm': 11.150139132125762, 'learning_rate': 2.4241717218852077e-07, 'completion_length': 202.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.5413691103458405, 'rewards/format_reward': 1.0, 'reward': 1.5413691401481628, 'reward_std': 0.07582303136587143, 'kl': 0.916015625, 'epoch': 0.76}
76%|███████▌ | 3248/4286 [20:36:16<5:45:39, 19.98s/it] {'loss': 0.0272, 'grad_norm': 3.8211180082545715, 'learning_rate': 2.42183854409706e-07, 'completion_length': 189.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.0892857164144516, 'kl': 0.68212890625, 'epoch': 0.76}
76%|███████▌ | 3249/4286 [20:36:36<5:44:45, 19.95s/it] {'loss': 0.0624, 'grad_norm': 4.363156549750729, 'learning_rate': 2.4195053663089127e-07, 'completion_length': 193.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6607143878936768, 'reward_std': 0.19642858672887087, 'kl': 1.5615234375, 'epoch': 0.76}
76%|███████▌ | 3250/4286 [20:36:58<5:55:27, 20.59s/it] {'loss': 0.0514, 'grad_norm': 6.883204736566161, 'learning_rate': 2.4171721885207654e-07, 'completion_length': 200.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.5565476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5386905670166016, 'reward_std': 0.11752717196941376, 'kl': 1.2890625, 'epoch': 0.76}
76%|███████▌ | 3251/4286 [20:37:17<5:50:22, 20.31s/it] {'loss': 0.0442, 'grad_norm': 7.88066764282415, 'learning_rate': 2.4148390107326176e-07, 'completion_length': 198.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.614583432674408, 'reward_std': 0.13325447589159012, 'kl': 1.1083984375, 'epoch': 0.76}
76%|███████▌ | 3252/4286 [20:37:37<5:44:16, 19.98s/it] {'loss': 0.0454, 'grad_norm': 4.705419594807557, 'learning_rate': 2.4125058329444704e-07, 'completion_length': 195.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.5622024536132812, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5443453192710876, 'reward_std': 0.09658701717853546, 'kl': 1.1328125, 'epoch': 0.76}
76%|███████▌ | 3253/4286 [20:37:57<5:46:12, 20.11s/it] {'loss': 0.0587, 'grad_norm': 1.3370826585848163, 'learning_rate': 2.4101726551563226e-07, 'completion_length': 213.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6330782473087311, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5973640084266663, 'reward_std': 0.19019722938537598, 'kl': 1.4716796875, 'epoch': 0.76}
76%|███████▌ | 3254/4286 [20:38:17<5:47:43, 20.22s/it] {'loss': 0.0998, 'grad_norm': 8.436912308809935, 'learning_rate': 2.4078394773681754e-07, 'completion_length': 170.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6407738327980042, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6050596237182617, 'reward_std': 0.17924967408180237, 'kl': 2.5, 'epoch': 0.76}
76%|███████▌ | 3255/4286 [20:38:36<5:40:26, 19.81s/it] {'loss': 0.0089, 'grad_norm': 3.955843806925577, 'learning_rate': 2.405506299580028e-07, 'completion_length': 196.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6708333492279053, 'rewards/format_reward': 1.0, 'reward': 1.67083340883255, 'reward_std': 0.034162976779043674, 'kl': 0.22265625, 'epoch': 0.76}
76%|███████▌ | 3256/4286 [20:38:54<5:31:47, 19.33s/it] {'loss': 0.0317, 'grad_norm': 2.420987094299373, 'learning_rate': 2.4031731217918803e-07, 'completion_length': 179.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6449404954910278, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.627083420753479, 'reward_std': 0.050828754901885986, 'kl': 0.7919921875, 'epoch': 0.76}
[2025-03-03 01:46:31,383] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
76%|███████▌ | 3257/4286 [20:39:16<5:40:10, 19.84s/it] {'loss': 0.0471, 'grad_norm': 2.0154026302519368, 'learning_rate': 2.400839944003733e-07, 'completion_length': 211.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6294643878936768, 'reward_std': 0.09990235418081284, 'kl': 1.17578125, 'epoch': 0.76}
76%|███████▌ | 3258/4286 [20:39:34<5:33:40, 19.48s/it] {'loss': 0.0499, 'grad_norm': 3.385374791545983, 'learning_rate': 2.3985067662155853e-07, 'completion_length': 186.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6092262268066406, 'rewards/format_reward': 1.0, 'reward': 1.6092262864112854, 'reward_std': 0.06872855499386787, 'kl': 1.25, 'epoch': 0.76}
76%|███████▌ | 3259/4286 [20:39:52<5:26:04, 19.05s/it] {'loss': 0.0286, 'grad_norm': 3.251904497848813, 'learning_rate': 2.396173588427438e-07, 'completion_length': 170.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6869048178195953, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6690477132797241, 'reward_std': 0.1200803704559803, 'kl': 0.7119140625, 'epoch': 0.76}
76%|███████▌ | 3260/4286 [20:40:10<5:19:35, 18.69s/it] {'loss': 0.055, 'grad_norm': 1.733371097352923, 'learning_rate': 2.393840410639291e-07, 'completion_length': 177.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.7812500298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7633929252624512, 'reward_std': 0.16485805436968803, 'kl': 1.376953125, 'epoch': 0.76}
76%|███████▌ | 3261/4286 [20:40:29<5:22:03, 18.85s/it] {'loss': 0.0266, 'grad_norm': 1.2223114589616189, 'learning_rate': 2.391507232851143e-07, 'completion_length': 191.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6383928656578064, 'rewards/format_reward': 1.0, 'reward': 1.638392984867096, 'reward_std': 0.050444590859115124, 'kl': 0.66796875, 'epoch': 0.76}
76%|███████▌ | 3262/4286 [20:40:48<5:19:14, 18.71s/it] {'loss': 0.0283, 'grad_norm': 3.1389937232695124, 'learning_rate': 2.389174055062996e-07, 'completion_length': 190.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6279762983322144, 'reward_std': 0.10554792359471321, 'kl': 0.70361328125, 'epoch': 0.76}
76%|███████▌ | 3263/4286 [20:41:06<5:18:30, 18.68s/it] {'loss': 0.0265, 'grad_norm': 5.14871288774955, 'learning_rate': 2.3868408772748485e-07, 'completion_length': 200.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.5476190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5297620296478271, 'reward_std': 0.09239229559898376, 'kl': 0.6630859375, 'epoch': 0.76}
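The DeepSpeed stage3 warnings repeated through this log recommend adding get_accelerator().empty_cache() calls to the training loop so that all ranks flush their allocator caches at the same time. A minimal sketch of that mitigation follows; the maybe_empty_cache helper and its every knob are hypothetical, and only get_accelerator is actual DeepSpeed API:

```python
# Sketch of the mitigation suggested by the stage3 warning: flush the
# accelerator allocator cache at the same point in every rank's loop.
try:
    from deepspeed.accelerator import get_accelerator  # DeepSpeed API
except ImportError:  # allow the sketch to run where DeepSpeed is absent
    get_accelerator = None

def maybe_empty_cache(step: int, every: int = 1) -> bool:
    """Flush the allocator cache every `every` optimizer steps.

    Both this helper and `every` are hypothetical conveniences; the key
    point is that the flush happens at the same step on every rank, so
    ranks stall together instead of desynchronizing under memory pressure.
    Returns True when a flush was issued.
    """
    if step % every != 0:
        return False
    if get_accelerator is not None:
        get_accelerator().empty_cache()
    return True
```

In a DeepSpeed training loop this would typically be called right after engine.step(), e.g. maybe_empty_cache(global_step, every=1), matching the warning's advice that all ranks flush simultaneously.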
76%|███████▌ | 3264/4286 [20:41:26<5:21:38, 18.88s/it] {'loss': 0.007, 'grad_norm': 0.709621380837282, 'learning_rate': 2.384507699486701e-07, 'completion_length': 203.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.8720238506793976, 'rewards/format_reward': 1.0, 'reward': 1.8720239400863647, 'reward_std': 0.01969881495460868, 'kl': 0.17578125, 'epoch': 0.76}
76%|███████▌ | 3265/4286 [20:41:48<5:41:15, 20.05s/it] {'loss': 0.0292, 'grad_norm': 2.5777415422271663, 'learning_rate': 2.3821745216985533e-07, 'completion_length': 209.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5818453431129456, 'reward_std': 0.14690252393484116, 'kl': 0.7294921875, 'epoch': 0.76}
76%|███████▌ | 3266/4286 [20:42:08<5:38:54, 19.94s/it] {'loss': 0.0217, 'grad_norm': 2.3743382984786296, 'learning_rate': 2.379841343910406e-07, 'completion_length': 204.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5937500298023224, 'rewards/format_reward': 1.0, 'reward': 1.5937501192092896, 'reward_std': 0.0733834970742464, 'kl': 0.54248046875, 'epoch': 0.76}
[2025-03-03 01:49:43,936] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step.
76%|███████▌ | 3267/4286 [20:42:28<5:38:52, 19.95s/it] {'loss': 0.0678, 'grad_norm': 8.35118022041719, 'learning_rate': 2.3775081661222585e-07, 'completion_length': 180.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.6086310148239136, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5729167461395264, 'reward_std': 0.17539439350366592, 'kl': 1.69921875, 'epoch': 0.76}
76%|███████▌ | 3268/4286 [20:42:47<5:33:13, 19.64s/it] {'loss': 0.015, 'grad_norm': 10.230943040957673, 'learning_rate': 2.375174988334111e-07, 'completion_length': 172.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.62202388048172, 'rewards/format_reward': 1.0, 'reward': 1.62202388048172, 'reward_std': 0.01785714365541935, 'kl': 0.3740234375, 'epoch': 0.76}
76%|███████▋ | 3269/4286 [20:43:07<5:36:22, 19.84s/it] {'loss': 0.0384, 'grad_norm': 10.420510238033359, 'learning_rate': 2.3728418105459635e-07, 'completion_length': 190.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6250000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.607142984867096, 'reward_std': 0.07795529067516327, 'kl': 0.962890625, 'epoch': 0.76}
76%|███████▋ | 3270/4286 [20:43:30<5:51:24, 20.75s/it] {'loss': 0.0391, 'grad_norm': 3.4651141996845847, 'learning_rate': 2.3705086327578162e-07, 'completion_length': 195.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6417942643165588, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6239371299743652, 'reward_std': 0.06578351650387049, 'kl': 0.9833984375, 'epoch': 0.76}
76%|███████▋ | 3271/4286 [20:43:50<5:46:07, 20.46s/it] {'loss': 0.0753, 'grad_norm': 5.2813437274459325, 'learning_rate': 2.3681754549696687e-07, 'completion_length': 191.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6000000238418579, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.582142949104309, 'reward_std': 0.08883299678564072, 'kl': 1.88720703125, 'epoch': 0.76}
76%|███████▋ | 3272/4286 [20:44:09<5:36:39, 19.92s/it] {'loss': 0.0336, 'grad_norm': 4.989767564772055, 'learning_rate': 2.3658422771815212e-07, 'completion_length': 186.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.675000011920929, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6571429371833801, 'reward_std': 0.10760699212551117, 'kl': 0.8388671875, 'epoch': 0.76}
76%|███████▋ | 3273/4286 [20:44:28<5:33:23, 19.75s/it] {'loss': 0.0366, 'grad_norm': 23.02340456812781, 'learning_rate': 2.3635090993933737e-07, 'completion_length': 204.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.7461309731006622, 'rewards/format_reward': 1.0, 'reward': 1.7461311221122742, 'reward_std': 0.08510801196098328, 'kl': 0.91796875, 'epoch': 0.76}
76%|███████▋ | 3274/4286 [20:44:50<5:44:07, 20.40s/it] {'loss': 0.0442, 'grad_norm': 4.8770561198802085, 'learning_rate': 2.3611759216052262e-07, 'completion_length': 216.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5642857551574707, 'rewards/format_reward': 1.0, 'reward': 1.5642858147621155, 'reward_std': 0.13025234267115593, 'kl': 1.1064453125, 'epoch': 0.76}
[2025-03-03 01:52:26,129] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
76%|███████▋ | 3275/4286 [20:45:10<5:43:39, 20.39s/it] {'loss': 0.0842, 'grad_norm': 25.54056927696396, 'learning_rate': 2.358842743817079e-07, 'completion_length': 215.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6590638756752014, 'rewards/format_reward': 1.0, 'reward': 1.659063994884491, 'reward_std': 0.06108368746936321, 'kl': 2.109375, 'epoch': 0.76}
76%|███████▋ | 3276/4286 [20:45:29<5:35:10, 19.91s/it] {'loss': 0.0519, 'grad_norm': 6.810353104709157, 'learning_rate': 2.3565095660289314e-07, 'completion_length': 189.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142858505249023, 'reward_std': 0.11368753388524055, 'kl': 1.30078125, 'epoch': 0.76}
76%|███████▋ | 3277/4286 [20:45:48<5:31:45, 19.73s/it] {'loss': 0.0606, 'grad_norm': 19.112339943884812, 'learning_rate': 2.354176388240784e-07, 'completion_length': 210.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6940476596355438, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6583334803581238, 'reward_std': 0.148695919662714, 'kl': 1.5078125, 'epoch': 0.76}
76%|███████▋ | 3278/4286 [20:46:08<5:30:04, 19.65s/it] {'loss': 0.0326, 'grad_norm': 12.224916989316002, 'learning_rate': 2.3518432104526364e-07, 'completion_length': 185.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.700297623872757, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6824405789375305, 'reward_std': 0.07452842127531767, 'kl': 0.8154296875, 'epoch': 0.76}
77%|███████▋ | 3279/4286 [20:46:29<5:36:58, 20.08s/it] {'loss': 0.0109, 'grad_norm': 9.609478022690071, 'learning_rate': 2.349510032664489e-07, 'completion_length': 185.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6354166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6354168057441711, 'reward_std': 0.050595239736139774, 'kl': 0.2724609375, 'epoch': 0.77}
77%|███████▋ | 3280/4286 [20:46:47<5:28:48, 19.61s/it] {'loss': 0.0066, 'grad_norm': 0.8334345341246774, 'learning_rate': 2.3471768548763416e-07, 'completion_length': 192.41072845458984, 'rewards/only_full_func_accuracy_reward': 0.8437500596046448, 'rewards/format_reward': 1.0, 'reward': 1.8437500596046448, 'reward_std': 0.014880955684930086, 'kl': 0.166015625, 'epoch': 0.77}
77%|███████▋ | 3281/4286 [20:47:09<5:39:02, 20.24s/it] {'loss': 0.0463, 'grad_norm': 4.037689012093267, 'learning_rate': 2.344843677088194e-07, 'completion_length': 194.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6428571343421936, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.12562406063079834, 'kl': 1.158203125, 'epoch': 0.77}
77%|███████▋ | 3282/4286 [20:47:32<5:49:48, 20.90s/it] {'loss': 0.0652, 'grad_norm': 2.810970973302108, 'learning_rate': 2.3425104993000466e-07, 'completion_length': 186.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5851190984249115, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.531547725200653, 'reward_std': 0.14219717681407928, 'kl': 1.6328125, 'epoch': 0.77}
77%|███████▋ | 3283/4286 [20:47:51<5:40:24, 20.36s/it] {'loss': 0.0318, 'grad_norm': 1.194588193210068, 'learning_rate': 2.340177321511899e-07, 'completion_length': 188.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7538690567016602, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7360119819641113, 'reward_std': 0.08082127571105957, 'kl': 0.79638671875, 'epoch': 0.77}
77%|███████▋ | 3284/4286 [20:48:10<5:32:49, 19.93s/it] {'loss': 0.0555, 'grad_norm': 13.589603625347605, 'learning_rate': 2.3378441437237518e-07, 'completion_length': 196.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6297619193792343, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6119048595428467, 'reward_std': 0.13014347106218338, 'kl': 1.38671875, 'epoch': 0.77}
77%|███████▋ | 3285/4286 [20:48:29<5:27:41, 19.64s/it] {'loss': 0.0687, 'grad_norm': 3.581750084922303, 'learning_rate': 2.3355109659356043e-07, 'completion_length': 204.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.47678573429584503, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.458928644657135, 'reward_std': 0.11785715073347092, 'kl': 1.71484375, 'epoch': 0.77}
77%|███████▋ | 3286/4286 [20:48:47<5:22:47, 19.37s/it] {'loss': 0.0694, 'grad_norm': 5.517671457045466, 'learning_rate': 2.3331777881474568e-07, 'completion_length': 180.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.7470238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7291668057441711, 'reward_std': 0.11368753388524055, 'kl': 1.734375, 'epoch': 0.77}
[2025-03-03 01:56:24,588] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
77%|███████▋ | 3287/4286 [20:49:09<5:32:46, 19.99s/it] {'loss': 0.0321, 'grad_norm': 5.356695200309438, 'learning_rate': 2.3308446103593093e-07, 'completion_length': 204.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5181548148393631, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.500297725200653, 'reward_std': 0.10340345650911331, 'kl': 0.8037109375, 'epoch': 0.77}
77%|███████▋ | 3288/4286 [20:49:32<5:47:58, 20.92s/it] {'loss': 0.0842, 'grad_norm': 3.2221918325181322, 'learning_rate': 2.3285114325711618e-07, 'completion_length': 186.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.7827381193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7648810744285583, 'reward_std': 0.1163812056183815, 'kl': 2.109375, 'epoch': 0.77}
77%|███████▋ | 3289/4286 [20:49:51<5:38:03, 20.34s/it] {'loss': 0.074, 'grad_norm': 5.029933600349937, 'learning_rate': 2.3261782547830145e-07, 'completion_length': 177.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.627976194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6101191639900208, 'reward_std': 0.1293574469164014, 'kl': 1.8515625, 'epoch': 0.77}
77%|███████▋ | 3290/4286 [20:50:11<5:34:30, 20.15s/it] {'loss': 0.0348, 'grad_norm': 7.307395557999298, 'learning_rate': 2.323845076994867e-07, 'completion_length': 194.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6290391683578491, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6111820936203003, 'reward_std': 0.10596409067511559, 'kl': 0.86865234375, 'epoch': 0.77}
77%|███████▋ | 3291/4286 [20:50:30<5:28:47, 19.83s/it] {'loss': 0.0416, 'grad_norm': 7.629053118901304, 'learning_rate': 2.3215118992067195e-07, 'completion_length': 183.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.06526251137256622, 'kl': 1.04296875, 'epoch': 0.77}
77%|███████▋ | 3292/4286 [20:50:49<5:28:13, 19.81s/it] {'loss': 0.0201, 'grad_norm': 11.724136796871422, 'learning_rate': 2.319178721418572e-07, 'completion_length': 195.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6639881134033203, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6461310386657715, 'reward_std': 0.11257379874587059, 'kl': 0.5009765625, 'epoch': 0.77}
77%|███████▋ | 3293/4286 [20:51:08<5:23:26, 19.54s/it] {'loss': 0.02, 'grad_norm': 6.323257568326354, 'learning_rate': 2.3168455436304247e-07, 'completion_length': 188.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6011905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6011906266212463, 'reward_std': 0.06322525069117546, 'kl': 0.4990234375, 'epoch': 0.77}
77%|███████▋ | 3294/4286 [20:51:27<5:20:42, 19.40s/it] {'loss': 0.0366, 'grad_norm': 7.203110756699037, 'learning_rate': 2.3145123658422772e-07, 'completion_length': 183.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5654762387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5476192235946655, 'reward_std': 0.11780716478824615, 'kl': 0.916015625, 'epoch': 0.77}
77%|███████▋ | 3295/4286 [20:51:45<5:12:55, 18.95s/it] {'loss': 0.0484, 'grad_norm': 3.153431548436954, 'learning_rate': 2.3121791880541297e-07, 'completion_length': 171.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.7380953133106232, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.011904764920473099, 'kl': 1.20654296875, 'epoch': 0.77}
77%|███████▋ | 3296/4286 [20:52:03<5:08:20, 18.69s/it] {'loss': 0.0329, 'grad_norm': 4.10257639349103, 'learning_rate': 2.3098460102659822e-07, 'completion_length': 170.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.65327388048172, 'rewards/format_reward': 1.0, 'reward': 1.65327388048172, 'reward_std': 0.04032689053565264, 'kl': 0.81982421875, 'epoch': 0.77}
77%|███████▋ | 3297/4286 [20:52:24<5:16:01, 19.17s/it] {'loss': 0.1004, 'grad_norm': 44.910473506248834, 'learning_rate': 2.3075128324778347e-07, 'completion_length': 186.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6660714447498322, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.612500011920929, 'reward_std': 0.22136690467596054, 'kl': 2.51171875, 'epoch': 0.77}
77%|███████▋ | 3298/4286 [20:52:42<5:11:18, 18.91s/it] {'loss': 0.056, 'grad_norm': 6.04161360196679, 'learning_rate': 2.3051796546896874e-07, 'completion_length': 183.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994048953056335, 'reward_std': 0.07261905260384083, 'kl': 1.3984375, 'epoch': 0.77}
77%|███████▋ | 3299/4286 [20:53:02<5:15:56, 19.21s/it] {'loss': 0.0469, 'grad_norm': 1.4380120267523324, 'learning_rate': 2.30284647690154e-07, 'completion_length': 200.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 1.0, 'reward': 1.6116071939468384, 'reward_std': 0.08260531723499298, 'kl': 1.1708984375, 'epoch': 0.77}
77%|███████▋ | 3300/4286 [20:53:21<5:17:30, 19.32s/it] {'loss': 0.0433,
'grad_norm': 5.832220496308031, 'learning_rate': 2.3005132991133924e-07, 'completion_length': 203.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.569940522313118, 'rewards/format_reward': 1.0, 'reward': 1.5699405670166016, 'reward_std': 0.0625000037252903, 'kl': 1.083984375, 'epoch': 0.77} 77%|███████▋ | 3300/4286 [20:53:21<5:17:30, 19.32s/it] 77%|███████▋ | 3301/4286 [20:56:48<20:40:38, 75.57s/it] {'loss': 0.0071, 'grad_norm': 6.008992627301985, 'learning_rate': 2.298180121325245e-07, 'completion_length': 185.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.0597705515101552, 'kl': 0.17626953125, 'epoch': 0.77} 77%|███████▋ | 3301/4286 [20:56:48<20:40:38, 75.57s/it] 77%|███████▋ | 3302/4286 [20:57:11<16:20:11, 59.77s/it] {'loss': 0.0254, 'grad_norm': 3.5862471369849085, 'learning_rate': 2.2958469435370976e-07, 'completion_length': 185.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6279762387275696, 'reward_std': 0.03243744093924761, 'kl': 0.63671875, 'epoch': 0.77} 77%|███████▋ | 3302/4286 [20:57:11<16:20:11, 59.77s/it] 77%|███████▋ | 3303/4286 [20:57:32<13:10:15, 48.24s/it] {'loss': 0.04, 'grad_norm': 5.761831336894612, 'learning_rate': 2.29351376574895e-07, 'completion_length': 171.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.587797686457634, 'rewards/format_reward': 1.0, 'reward': 1.58779776096344, 'reward_std': 0.052039530128240585, 'kl': 1.0, 'epoch': 0.77} 77%|███████▋ | 3303/4286 [20:57:32<13:10:15, 48.24s/it] 77%|███████▋ | 3304/4286 [20:57:51<10:42:59, 39.29s/it] {'loss': 0.0105, 'grad_norm': 9.710489600855325, 'learning_rate': 2.2911805879608026e-07, 'completion_length': 187.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.6532738208770752, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.06434167176485062, 'kl': 
0.26171875, 'epoch': 0.77} 77%|███████▋ | 3304/4286 [20:57:51<10:42:59, 39.29s/it] 77%|███████▋ | 3305/4286 [20:58:10<9:05:44, 33.38s/it] {'loss': 0.0269, 'grad_norm': 7.672814917981272, 'learning_rate': 2.288847410172655e-07, 'completion_length': 192.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.4449405074119568, 'rewards/format_reward': 1.0, 'reward': 1.4449405670166016, 'reward_std': 0.07228299230337143, 'kl': 0.6748046875, 'epoch': 0.77} 77%|███████▋ | 3305/4286 [20:58:10<9:05:44, 33.38s/it] 77%|███████▋ | 3306/4286 [20:58:29<7:52:25, 28.92s/it] {'loss': 0.0272, 'grad_norm': 13.062585958917106, 'learning_rate': 2.2865142323845076e-07, 'completion_length': 180.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7797619700431824, 'rewards/format_reward': 1.0, 'reward': 1.779762089252472, 'reward_std': 0.0714285671710968, 'kl': 0.67919921875, 'epoch': 0.77} 77%|███████▋ | 3306/4286 [20:58:29<7:52:25, 28.92s/it] 77%|███████▋ | 3307/4286 [20:58:49<7:07:56, 26.23s/it] {'loss': 0.0247, 'grad_norm': 6.119734226776936, 'learning_rate': 2.2841810545963603e-07, 'completion_length': 191.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.7235544323921204, 'rewards/format_reward': 1.0, 'reward': 1.72355455160141, 'reward_std': 0.07019967958331108, 'kl': 0.615234375, 'epoch': 0.77} 77%|███████▋ | 3307/4286 [20:58:49<7:07:56, 26.23s/it] 77%|███████▋ | 3308/4286 [20:59:08<6:31:40, 24.03s/it] {'loss': 0.0517, 'grad_norm': 1.7836541861645132, 'learning_rate': 2.2818478768082128e-07, 'completion_length': 176.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.8244048058986664, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8065477013587952, 'reward_std': 0.1293574459850788, 'kl': 1.29296875, 'epoch': 0.77} 77%|███████▋ | 3308/4286 [20:59:08<6:31:40, 24.03s/it] 77%|███████▋ | 3309/4286 [20:59:26<6:05:06, 22.42s/it] {'loss': 0.0733, 'grad_norm': 4.715551627766157, 'learning_rate': 2.2795146990200653e-07, 'completion_length': 
184.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6050595641136169, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5514881610870361, 'reward_std': 0.20587609708309174, 'kl': 1.83203125, 'epoch': 0.77} 77%|███████▋ | 3309/4286 [20:59:26<6:05:06, 22.42s/it] 77%|███████▋ | 3310/4286 [20:59:51<6:14:26, 23.02s/it] {'loss': 0.0546, 'grad_norm': 10.231578882265774, 'learning_rate': 2.2771815212319178e-07, 'completion_length': 193.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.5710034668445587, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5531463623046875, 'reward_std': 0.08592060953378677, 'kl': 1.3671875, 'epoch': 0.77} 77%|███████▋ | 3310/4286 [20:59:51<6:14:26, 23.02s/it] 77%|███████▋ | 3311/4286 [21:00:11<6:00:42, 22.20s/it] {'loss': 0.0207, 'grad_norm': 15.549844950205712, 'learning_rate': 2.2748483434437703e-07, 'completion_length': 183.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098214626312256, 'reward_std': 0.08655625954270363, 'kl': 0.517578125, 'epoch': 0.77} 77%|███████▋ | 3311/4286 [21:00:11<6:00:42, 22.20s/it] 77%|███████▋ | 3312/4286 [21:00:29<5:39:56, 20.94s/it] {'loss': 0.0253, 'grad_norm': 23.085445090089138, 'learning_rate': 2.272515165655623e-07, 'completion_length': 175.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.5907738208770752, 'rewards/format_reward': 1.0, 'reward': 1.59077388048172, 'reward_std': 0.03951088711619377, 'kl': 0.63330078125, 'epoch': 0.77} 77%|███████▋ | 3312/4286 [21:00:29<5:39:56, 20.94s/it] 77%|███████▋ | 3313/4286 [21:00:48<5:30:26, 20.38s/it] {'loss': 0.0143, 'grad_norm': 3.9844183026309303, 'learning_rate': 2.2701819878674755e-07, 'completion_length': 186.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6205357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6205357909202576, 'reward_std': 0.05059524439275265, 'kl': 0.35888671875, 'epoch': 0.77} 77%|███████▋ | 3313/4286 
[21:00:48<5:30:26, 20.38s/it] 77%|███████▋ | 3314/4286 [21:01:07<5:20:33, 19.79s/it] {'loss': 0.0105, 'grad_norm': 5.14075176270693, 'learning_rate': 2.267848810079328e-07, 'completion_length': 184.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.7613095641136169, 'rewards/format_reward': 1.0, 'reward': 1.7613096237182617, 'reward_std': 0.038020084612071514, 'kl': 0.2626953125, 'epoch': 0.77} 77%|███████▋ | 3314/4286 [21:01:07<5:20:33, 19.79s/it] 77%|███████▋ | 3315/4286 [21:01:24<5:10:41, 19.20s/it] {'loss': 0.0159, 'grad_norm': 81.14921254673975, 'learning_rate': 2.2655156322911805e-07, 'completion_length': 174.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5982143878936768, 'reward_std': 0.01785714365541935, 'kl': 0.3994140625, 'epoch': 0.77} 77%|███████▋ | 3315/4286 [21:01:24<5:10:41, 19.20s/it] 77%|███████▋ | 3316/4286 [21:01:43<5:09:34, 19.15s/it] {'loss': 0.0483, 'grad_norm': 47.003281042630334, 'learning_rate': 2.2631824545030333e-07, 'completion_length': 191.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.7244047522544861, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7065476775169373, 'reward_std': 0.18766788393259048, 'kl': 1.2109375, 'epoch': 0.77} 77%|███████▋ | 3316/4286 [21:01:44<5:09:34, 19.15s/it] 77%|███████▋ | 3317/4286 [21:02:02<5:07:17, 19.03s/it] {'loss': 0.036, 'grad_norm': 8.035182099499915, 'learning_rate': 2.2608492767148857e-07, 'completion_length': 170.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217262983322144, 'reward_std': 0.07029405236244202, 'kl': 0.90234375, 'epoch': 0.77} 77%|███████▋ | 3317/4286 [21:02:02<5:07:17, 19.03s/it] 77%|███████▋ | 3318/4286 [21:02:21<5:03:22, 18.80s/it] {'loss': 0.0361, 'grad_norm': 5.32825050363385, 'learning_rate': 2.2585160989267382e-07, 'completion_length': 197.2321548461914, 
'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.10686444863677025, 'kl': 0.90283203125, 'epoch': 0.77} 77%|███████▋ | 3318/4286 [21:02:21<5:03:22, 18.80s/it] 77%|███████▋ | 3319/4286 [21:02:39<4:59:55, 18.61s/it] {'loss': 0.048, 'grad_norm': 7.4498510506014295, 'learning_rate': 2.2561829211385907e-07, 'completion_length': 169.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.7571429014205933, 'rewards/format_reward': 1.0, 'reward': 1.757142961025238, 'reward_std': 0.0943107083439827, 'kl': 1.20361328125, 'epoch': 0.77} 77%|███████▋ | 3319/4286 [21:02:39<4:59:55, 18.61s/it] 77%|███████▋ | 3320/4286 [21:02:57<4:57:50, 18.50s/it] {'loss': 0.007, 'grad_norm': 0.4778856729314867, 'learning_rate': 2.2538497433504432e-07, 'completion_length': 180.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.5267857760190964, 'rewards/format_reward': 1.0, 'reward': 1.5267858505249023, 'reward_std': 0.01785714365541935, 'kl': 0.17529296875, 'epoch': 0.77} 77%|███████▋ | 3320/4286 [21:02:57<4:57:50, 18.50s/it] 77%|███████▋ | 3321/4286 [21:03:18<5:10:43, 19.32s/it] {'loss': 0.0757, 'grad_norm': 86.92959914480001, 'learning_rate': 2.251516565562296e-07, 'completion_length': 173.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5610119700431824, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5252978205680847, 'reward_std': 0.10885171592235565, 'kl': 1.890625, 'epoch': 0.77} 77%|███████▋ | 3321/4286 [21:03:18<5:10:43, 19.32s/it] 78%|███████▊ | 3322/4286 [21:03:38<5:11:23, 19.38s/it] {'loss': 0.0373, 'grad_norm': 14.07585451511841, 'learning_rate': 2.2491833877741484e-07, 'completion_length': 197.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.7660714983940125, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7482143640518188, 'reward_std': 0.12964674830436707, 'kl': 0.9296875, 'epoch': 0.78} 78%|███████▊ | 3322/4286 
[21:03:38<5:11:23, 19.38s/it] 78%|███████▊ | 3323/4286 [21:03:56<5:07:47, 19.18s/it] {'loss': 0.0159, 'grad_norm': 4.605574400963648, 'learning_rate': 2.246850209986001e-07, 'completion_length': 180.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.5068452954292297, 'rewards/format_reward': 1.0, 'reward': 1.5068453550338745, 'reward_std': 0.019414008129388094, 'kl': 0.3984375, 'epoch': 0.78} 78%|███████▊ | 3323/4286 [21:03:56<5:07:47, 19.18s/it] 78%|███████▊ | 3324/4286 [21:04:14<5:02:21, 18.86s/it] {'loss': 0.0137, 'grad_norm': 11.843911908731704, 'learning_rate': 2.2445170321978534e-07, 'completion_length': 182.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.7217262983322144, 'rewards/format_reward': 1.0, 'reward': 1.7217262983322144, 'reward_std': 0.06705520860850811, 'kl': 0.34130859375, 'epoch': 0.78} 78%|███████▊ | 3324/4286 [21:04:14<5:02:21, 18.86s/it] 78%|███████▊ | 3325/4286 [21:04:33<5:00:05, 18.74s/it] {'loss': 0.0799, 'grad_norm': 9.467114185795662, 'learning_rate': 2.242183854409706e-07, 'completion_length': 158.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.65327388048172, 'reward_std': 0.10579753294587135, 'kl': 2.0, 'epoch': 0.78} 78%|███████▊ | 3325/4286 [21:04:33<5:00:05, 18.74s/it] 78%|███████▊ | 3326/4286 [21:04:52<4:59:17, 18.71s/it] {'loss': 0.0495, 'grad_norm': 13.245314939669187, 'learning_rate': 2.2398506766215587e-07, 'completion_length': 188.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6001984179019928, 'rewards/format_reward': 1.0, 'reward': 1.6001984477043152, 'reward_std': 0.07121410593390465, 'kl': 1.23291015625, 'epoch': 0.78} 78%|███████▊ | 3326/4286 [21:04:52<4:59:17, 18.71s/it] 78%|███████▊ | 3327/4286 [21:05:12<5:07:45, 19.26s/it] {'loss': 0.0526, 'grad_norm': 12.03277967794935, 'learning_rate': 2.2375174988334111e-07, 'completion_length': 177.00000762939453, 'rewards/only_full_func_accuracy_reward': 
0.6354166865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6175596117973328, 'reward_std': 0.10751047730445862, 'kl': 1.3125, 'epoch': 0.78} 78%|███████▊ | 3327/4286 [21:05:12<5:07:45, 19.26s/it] 78%|███████▊ | 3328/4286 [21:05:30<4:59:30, 18.76s/it] {'loss': 0.0284, 'grad_norm': 1.3004275974912205, 'learning_rate': 2.2351843210452636e-07, 'completion_length': 170.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.09959554672241211, 'kl': 0.71044921875, 'epoch': 0.78} 78%|███████▊ | 3328/4286 [21:05:30<4:59:30, 18.76s/it] 78%|███████▊ | 3329/4286 [21:05:47<4:51:37, 18.28s/it] {'loss': 0.0086, 'grad_norm': 3.898706281497104, 'learning_rate': 2.232851143257116e-07, 'completion_length': 159.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6369048357009888, 'rewards/format_reward': 1.0, 'reward': 1.6369049549102783, 'reward_std': 0.011904762126505375, 'kl': 0.2138671875, 'epoch': 0.78} 78%|███████▊ | 3329/4286 [21:05:47<4:51:37, 18.28s/it][2025-03-03 02:13:23,868] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 78%|███████▊ | 3330/4286 [21:06:08<5:04:44, 19.13s/it] {'loss': 0.0967, 'grad_norm': 8.666100200099262, 'learning_rate': 2.2305179654689689e-07, 'completion_length': 191.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6949405670166016, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.677083432674408, 'reward_std': 0.11875972151756287, 'kl': 2.41796875, 'epoch': 0.78} 78%|███████▊ | 3330/4286 [21:06:08<5:04:44, 19.13s/it] 78%|███████▊ | 3331/4286 [21:06:27<5:03:07, 19.04s/it] {'loss': 0.0546, 'grad_norm': 3.234834187121048, 'learning_rate': 2.2281847876808214e-07, 'completion_length': 170.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6157738268375397, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5800595879554749, 'reward_std': 0.13458094373345375, 'kl': 1.3671875, 'epoch': 0.78} 78%|███████▊ | 3331/4286 [21:06:27<5:03:07, 19.04s/it] 78%|███████▊ | 3332/4286 [21:06:46<5:03:26, 19.08s/it] {'loss': 0.0221, 'grad_norm': 2.2673233267267374, 'learning_rate': 2.2258516098926738e-07, 'completion_length': 192.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5684524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5505954027175903, 'reward_std': 0.1011904776096344, 'kl': 0.55224609375, 'epoch': 0.78} 78%|███████▊ | 3332/4286 [21:06:46<5:03:26, 19.08s/it] 78%|███████▊ | 3333/4286 [21:07:07<5:11:23, 19.61s/it] {'loss': 0.0367, 'grad_norm': 2.744220478976642, 'learning_rate': 2.2235184321045263e-07, 'completion_length': 178.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5071429014205933, 'rewards/format_reward': 1.0, 'reward': 1.507142961025238, 'reward_std': 0.040203677490353584, 'kl': 0.916015625, 'epoch': 0.78} 78%|███████▊ | 3333/4286 [21:07:07<5:11:23, 19.61s/it] 78%|███████▊ | 3334/4286 
[21:07:25<5:06:27, 19.31s/it] {'loss': 0.0071, 'grad_norm': 4.178056593936613, 'learning_rate': 2.2211852543163788e-07, 'completion_length': 190.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.022214587777853012, 'kl': 0.17626953125, 'epoch': 0.78} 78%|███████▊ | 3334/4286 [21:07:25<5:06:27, 19.31s/it][2025-03-03 02:14:59,575] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 78%|███████▊ | 3335/4286 [21:07:44<5:00:55, 18.99s/it] {'loss': 0.0383, 'grad_norm': 3.6197572538048703, 'learning_rate': 2.2188520765282316e-07, 'completion_length': 164.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.07535864785313606, 'kl': 0.953125, 'epoch': 0.78} 78%|███████▊ | 3335/4286 [21:07:44<5:00:55, 18.99s/it] 78%|███████▊ | 3336/4286 [21:08:03<5:03:17, 19.16s/it] {'loss': 0.0125, 'grad_norm': 3.901616081289545, 'learning_rate': 2.216518898740084e-07, 'completion_length': 184.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6592262387275696, 'rewards/format_reward': 1.0, 'reward': 1.6592262983322144, 'reward_std': 0.019238398410379887, 'kl': 0.3125, 'epoch': 0.78} 78%|███████▊ | 3336/4286 [21:08:03<5:03:17, 19.16s/it] 78%|███████▊ | 3337/4286 [21:08:22<5:01:25, 19.06s/it] {'loss': 0.0382, 'grad_norm': 2.015982153909689, 'learning_rate': 2.2141857209519365e-07, 'completion_length': 190.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 
'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.05952381528913975, 'kl': 0.95703125, 'epoch': 0.78} 78%|███████▊ | 3337/4286 [21:08:22<5:01:25, 19.06s/it] 78%|███████▊ | 3338/4286 [21:08:44<5:13:16, 19.83s/it] {'loss': 0.034, 'grad_norm': 36.57765353337896, 'learning_rate': 2.211852543163789e-07, 'completion_length': 207.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.5732143223285675, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.555357277393341, 'reward_std': 0.15200193971395493, 'kl': 0.8515625, 'epoch': 0.78} 78%|███████▊ | 3338/4286 [21:08:44<5:13:16, 19.83s/it] 78%|███████▊ | 3339/4286 [21:09:02<5:03:42, 19.24s/it] {'loss': 0.0181, 'grad_norm': 3.124452307893009, 'learning_rate': 2.2095193653756418e-07, 'completion_length': 190.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.5119048058986664, 'rewards/format_reward': 1.0, 'reward': 1.5119048953056335, 'reward_std': 0.0476190522313118, 'kl': 0.4501953125, 'epoch': 0.78} 78%|███████▊ | 3339/4286 [21:09:02<5:03:42, 19.24s/it] 78%|███████▊ | 3340/4286 [21:09:21<5:05:13, 19.36s/it] {'loss': 0.051, 'grad_norm': 1.6360284951311528, 'learning_rate': 2.2071861875874943e-07, 'completion_length': 183.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.7190476655960083, 'rewards/format_reward': 1.0, 'reward': 1.719047725200653, 'reward_std': 0.09194338787347078, 'kl': 1.27490234375, 'epoch': 0.78} 78%|███████▊ | 3340/4286 [21:09:21<5:05:13, 19.36s/it] 78%|███████▊ | 3341/4286 [21:09:39<4:56:10, 18.80s/it] {'loss': 0.0364, 'grad_norm': 6.995410893589413, 'learning_rate': 2.2048530097993467e-07, 'completion_length': 171.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.5238095968961716, 'rewards/format_reward': 1.0, 'reward': 1.5238096714019775, 'reward_std': 0.0714285746216774, 'kl': 0.91015625, 'epoch': 0.78} 78%|███████▊ | 3341/4286 [21:09:39<4:56:10, 18.80s/it] 78%|███████▊ | 3342/4286 [21:09:56<4:50:41, 18.48s/it] {'loss': 0.007, 
'grad_norm': 0.3386284683642424, 'learning_rate': 2.2025198320111992e-07, 'completion_length': 188.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.7842262089252472, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.008928571827709675, 'kl': 0.1748046875, 'epoch': 0.78} 78%|███████▊ | 3342/4286 [21:09:56<4:50:41, 18.48s/it] 78%|███████▊ | 3343/4286 [21:10:15<4:50:02, 18.45s/it] {'loss': 0.0282, 'grad_norm': 2.0831649705222617, 'learning_rate': 2.2001866542230517e-07, 'completion_length': 161.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5952381640672684, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5773810744285583, 'reward_std': 0.09996834769845009, 'kl': 0.7041015625, 'epoch': 0.78} 78%|███████▊ | 3343/4286 [21:10:15<4:50:02, 18.45s/it] 78%|███████▊ | 3344/4286 [21:10:36<5:04:30, 19.40s/it] {'loss': 0.0414, 'grad_norm': 10.094270433609246, 'learning_rate': 2.1978534764349045e-07, 'completion_length': 203.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4509921073913574, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4331350326538086, 'reward_std': 0.1074690893292427, 'kl': 1.03515625, 'epoch': 0.78} 78%|███████▊ | 3344/4286 [21:10:36<5:04:30, 19.40s/it] 78%|███████▊ | 3345/4286 [21:10:55<4:59:40, 19.11s/it] {'loss': 0.021, 'grad_norm': 4.246896093419553, 'learning_rate': 2.195520298646757e-07, 'completion_length': 194.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.558035746216774, 'rewards/format_reward': 1.0, 'reward': 1.5580358505249023, 'reward_std': 0.06182590126991272, 'kl': 0.52392578125, 'epoch': 0.78} 78%|███████▊ | 3345/4286 [21:10:55<4:59:40, 19.11s/it] 78%|███████▊ | 3346/4286 [21:11:16<5:08:55, 19.72s/it] {'loss': 0.0371, 'grad_norm': 8.014036975792823, 'learning_rate': 2.1931871208586094e-07, 'completion_length': 190.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6124008595943451, 'rewards/format_reward': 1.0, 'reward': 1.6124008893966675, 
'reward_std': 0.08075397834181786, 'kl': 0.923828125, 'epoch': 0.78} 78%|███████▊ | 3346/4286 [21:11:16<5:08:55, 19.72s/it] 78%|███████▊ | 3347/4286 [21:11:34<4:58:45, 19.09s/it] {'loss': 0.0075, 'grad_norm': 0.9435407818150983, 'learning_rate': 2.190853943070462e-07, 'completion_length': 181.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.68601194024086, 'rewards/format_reward': 1.0, 'reward': 1.6860120296478271, 'reward_std': 0.008928571827709675, 'kl': 0.1865234375, 'epoch': 0.78} 78%|███████▊ | 3347/4286 [21:11:34<4:58:45, 19.09s/it] 78%|███████▊ | 3348/4286 [21:11:52<4:57:18, 19.02s/it] {'loss': 0.0278, 'grad_norm': 2.9026667526166046, 'learning_rate': 2.1885207652823144e-07, 'completion_length': 198.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.549107164144516, 'rewards/format_reward': 1.0, 'reward': 1.5491071939468384, 'reward_std': 0.12961260601878166, 'kl': 0.6953125, 'epoch': 0.78} 78%|███████▊ | 3348/4286 [21:11:52<4:57:18, 19.02s/it][2025-03-03 02:19:31,590] [WARNING] [stage3.py:2134:step] 3 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 78%|███████▊ | 3349/4286 [21:12:16<5:16:45, 20.28s/it] {'loss': 0.0485, 'grad_norm': 2.4778888486651796, 'learning_rate': 2.186187587494167e-07, 'completion_length': 196.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.049460720270872116, 'kl': 1.21484375, 'epoch': 0.78} 78%|███████▊ | 3349/4286 [21:12:16<5:16:45, 20.28s/it] 78%|███████▊ | 3350/4286 [21:12:36<5:15:48, 20.24s/it] {'loss': 0.0464, 'grad_norm': 3.0642730368851927, 'learning_rate': 2.1838544097060194e-07, 'completion_length': 191.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.0892857201397419, 'kl': 1.16015625, 'epoch': 0.78} 78%|███████▊ | 3350/4286 [21:12:36<5:15:48, 20.24s/it] 78%|███████▊ | 3351/4286 [21:12:58<5:24:43, 20.84s/it] {'loss': 0.0681, 'grad_norm': 8.645164897997626, 'learning_rate': 2.181521231917872e-07, 'completion_length': 191.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7169643342494965, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6991072297096252, 'reward_std': 0.13179437071084976, 'kl': 1.703125, 'epoch': 0.78} 78%|███████▊ | 3351/4286 [21:12:58<5:24:43, 20.84s/it] 78%|███████▊ | 3352/4286 [21:13:17<5:13:46, 20.16s/it] {'loss': 0.0556, 'grad_norm': 3.0600759987423527, 'learning_rate': 2.1791880541297244e-07, 'completion_length': 195.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.458333358168602, 'rewards/format_reward': 1.0, 'reward': 1.4583334922790527, 'reward_std': 0.11450954899191856, 'kl': 1.39453125, 'epoch': 0.78} 78%|███████▊ | 3352/4286 [21:13:17<5:13:46, 20.16s/it] 78%|███████▊ | 3353/4286 [21:13:35<5:02:52, 19.48s/it] {'loss': 0.0381, 
'grad_norm': 1.892491064503898, 'learning_rate': 2.176854876341577e-07, 'completion_length': 184.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6264881193637848, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6086310148239136, 'reward_std': 0.09661935828626156, 'kl': 0.951171875, 'epoch': 0.78} 78%|███████▊ | 3353/4286 [21:13:35<5:02:52, 19.48s/it] 78%|███████▊ | 3354/4286 [21:13:52<4:55:04, 19.00s/it] {'loss': 0.0434, 'grad_norm': 1.741497585135943, 'learning_rate': 2.1745216985534296e-07, 'completion_length': 182.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.5758929252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5580358505249023, 'reward_std': 0.09585829172283411, 'kl': 1.087890625, 'epoch': 0.78} 78%|███████▊ | 3354/4286 [21:13:52<4:55:04, 19.00s/it][2025-03-03 02:21:31,425] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
78%| 3355/4286 [21:14:16<5:13:58, 20.24s/it] {'loss': 0.0424, 'grad_norm': 3.0418710359705754, 'learning_rate': 2.172188520765282e-07, 'completion_length': 207.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5520834028720856, 'rewards/format_reward': 1.0, 'reward': 1.5520834922790527, 'reward_std': 0.06434167176485062, 'kl': 1.05859375, 'epoch': 0.78}
78%| 3356/4286 [21:14:33<5:00:28, 19.39s/it] {'loss': 0.0243, 'grad_norm': 6.596438968026243, 'learning_rate': 2.1698553429771346e-07, 'completion_length': 160.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098214626312256, 'reward_std': 0.04648452810943127, 'kl': 0.6103515625, 'epoch': 0.78}
78%| 3357/4286 [21:14:51<4:53:40, 18.97s/it] {'loss': 0.0117, 'grad_norm': 7.924593822500854, 'learning_rate': 2.167522165188987e-07, 'completion_length': 174.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6696429252624512, 'rewards/format_reward': 1.0, 'reward': 1.669642984867096, 'reward_std': 0.01785714365541935, 'kl': 0.29150390625, 'epoch': 0.78}
78%| 3358/4286 [21:15:12<5:01:59, 19.53s/it] {'loss': 0.0534, 'grad_norm': 2.6489247021128475, 'learning_rate': 2.1651889874008398e-07, 'completion_length': 193.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.49226196110248566, 'rewards/format_reward': 1.0, 'reward': 1.4922619462013245, 'reward_std': 0.14921827614307404, 'kl': 1.3359375, 'epoch': 0.78}
78%| 3359/4286 [21:15:30<4:55:02, 19.10s/it] {'loss': 0.0248, 'grad_norm': 0.9137270141474809, 'learning_rate': 2.1628558096126923e-07, 'completion_length': 194.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.59077388048172, 'rewards/format_reward': 1.0, 'reward': 1.59077388048172, 'reward_std': 0.05495268478989601, 'kl': 0.61865234375, 'epoch': 0.78}
78%| 3360/4286 [21:15:48<4:52:23, 18.95s/it] {'loss': 0.0278, 'grad_norm': 3.1718982958313555, 'learning_rate': 2.1605226318245448e-07, 'completion_length': 194.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7767857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.758928656578064, 'reward_std': 0.09795446321368217, 'kl': 0.6923828125, 'epoch': 0.78}
78%| 3361/4286 [21:16:10<5:04:25, 19.75s/it] {'loss': 0.0267, 'grad_norm': 1.069033729221252, 'learning_rate': 2.1581894540363973e-07, 'completion_length': 194.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5491071939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5312501192092896, 'reward_std': 0.07731622457504272, 'kl': 0.66748046875, 'epoch': 0.78}
78%| 3362/4286 [21:16:32<5:14:25, 20.42s/it] {'loss': 0.0296, 'grad_norm': 1.6961125619981816, 'learning_rate': 2.1558562762482498e-07, 'completion_length': 180.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6517857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6160715222358704, 'reward_std': 0.10990537330508232, 'kl': 0.74169921875, 'epoch': 0.78}
78%| 3363/4286 [21:16:50<5:02:45, 19.68s/it] {'loss': 0.0431, 'grad_norm': 3.9324269671776597, 'learning_rate': 2.1535230984601025e-07, 'completion_length': 164.92857360839844, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.05952381435781717, 'kl': 1.078125, 'epoch': 0.78}
78%| 3364/4286 [21:17:09<4:57:42, 19.37s/it] {'loss': 0.0664, 'grad_norm': 3.508207689484305, 'learning_rate': 2.151189920671955e-07, 'completion_length': 182.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6741072535514832, 'reward_std': 0.11303051188588142, 'kl': 1.66015625, 'epoch': 0.78}
79%| 3365/4286 [21:17:27<4:52:25, 19.05s/it] {'loss': 0.0402, 'grad_norm': 2.183162880820233, 'learning_rate': 2.1488567428838075e-07, 'completion_length': 181.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.5148809999227524, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.49702388048172, 'reward_std': 0.08928571827709675, 'kl': 1.0048828125, 'epoch': 0.79}
79%| 3366/4286 [21:17:47<4:56:47, 19.36s/it] {'loss': 0.035, 'grad_norm': 3.6942704788810965, 'learning_rate': 2.14652356509566e-07, 'completion_length': 174.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6651785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6473215818405151, 'reward_std': 0.08815120905637741, 'kl': 0.873046875, 'epoch': 0.79}
79%| 3367/4286 [21:18:08<5:03:21, 19.81s/it] {'loss': 0.0081, 'grad_norm': 1.2802153243197736, 'learning_rate': 2.1441903873075127e-07, 'completion_length': 197.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7529762387275696, 'rewards/format_reward': 1.0, 'reward': 1.7529762983322144, 'reward_std': 0.056333938613533974, 'kl': 0.20166015625, 'epoch': 0.79}
79%| 3368/4286 [21:18:26<4:56:13, 19.36s/it] {'loss': 0.0354, 'grad_norm': 1.1367490005447856, 'learning_rate': 2.1418572095193652e-07, 'completion_length': 185.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7306548953056335, 'reward_std': 0.0744047649204731, 'kl': 0.88818359375, 'epoch': 0.79}
79%| 3369/4286 [21:18:47<5:03:33, 19.86s/it] {'loss': 0.0961, 'grad_norm': 6.011574884115354, 'learning_rate': 2.1395240317312177e-07, 'completion_length': 194.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.5119048058986664, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4761906266212463, 'reward_std': 0.20011192560195923, 'kl': 2.3984375, 'epoch': 0.79}
79%| 3370/4286 [21:19:11<5:21:37, 21.07s/it] {'loss': 0.0235, 'grad_norm': 5.8029945423054805, 'learning_rate': 2.1371908539430702e-07, 'completion_length': 206.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.542261928319931, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5244048833847046, 'reward_std': 0.0519288033246994, 'kl': 0.587890625, 'epoch': 0.79}
79%| 3371/4286 [21:19:28<5:03:51, 19.92s/it] {'loss': 0.0695, 'grad_norm': 1.1937283098358824, 'learning_rate': 2.1348576761549227e-07, 'completion_length': 166.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7961309552192688, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7604167461395264, 'reward_std': 0.1636904776096344, 'kl': 1.7421875, 'epoch': 0.79}
79%| 3372/4286 [21:19:46<4:55:03, 19.37s/it] {'loss': 0.0068, 'grad_norm': 0.7402264701407306, 'learning_rate': 2.1325244983667754e-07, 'completion_length': 180.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.8065476715564728, 'rewards/format_reward': 1.0, 'reward': 1.8065477013587952, 'reward_std': 0.01785714365541935, 'kl': 0.1689453125, 'epoch': 0.79}
79%| 3373/4286 [21:20:04<4:47:18, 18.88s/it] {'loss': 0.0279, 'grad_norm': 2.1572346791641497, 'learning_rate': 2.130191320578628e-07, 'completion_length': 180.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.7395834028720856, 'rewards/format_reward': 1.0, 'reward': 1.739583432674408, 'reward_std': 0.07280982099473476, 'kl': 0.6962890625, 'epoch': 0.79}
79%| 3374/4286 [21:20:23<4:44:50, 18.74s/it] {'loss': 0.0319, 'grad_norm': 0.46035539982213336, 'learning_rate': 2.1278581427904804e-07, 'completion_length': 173.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250001192092896, 'reward_std': 0.0714285746216774, 'kl': 0.79833984375, 'epoch': 0.79}
[2025-03-03 02:27:58,980] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
79%| 3375/4286 [21:20:43<4:52:29, 19.26s/it] {'loss': 0.0907, 'grad_norm': 5.344675586131671, 'learning_rate': 2.125524965002333e-07, 'completion_length': 182.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6205357313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.602678656578064, 'reward_std': 0.0922619067132473, 'kl': 2.265625, 'epoch': 0.79}
79%| 3376/4286 [21:21:02<4:50:46, 19.17s/it] {'loss': 0.0272, 'grad_norm': 4.356138890294353, 'learning_rate': 2.1231917872141856e-07, 'completion_length': 192.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6220238506793976, 'rewards/format_reward': 1.0, 'reward': 1.62202388048172, 'reward_std': 0.01785714365541935, 'kl': 0.6787109375, 'epoch': 0.79}
[2025-03-03 02:28:41,798] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. (same cache-flush warning as above)
79%| 3377/4286 [21:21:26<5:11:45, 20.58s/it] {'loss': 0.0445, 'grad_norm': 2.393699593900641, 'learning_rate': 2.120858609426038e-07, 'completion_length': 213.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6175595819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.599702537059784, 'reward_std': 0.10692918300628662, 'kl': 1.11279296875, 'epoch': 0.79}
79%| 3378/4286 [21:21:44<5:01:59, 19.95s/it] {'loss': 0.0275, 'grad_norm': 3.04284476983197, 'learning_rate': 2.1185254316378906e-07, 'completion_length': 188.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7187500298023224, 'rewards/format_reward': 1.0, 'reward': 1.7187500596046448, 'reward_std': 0.0734308548271656, 'kl': 0.68701171875, 'epoch': 0.79}
[2025-03-03 02:29:21,643] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. (same cache-flush warning as above)
79%| 3379/4286 [21:22:06<5:07:57, 20.37s/it] {'loss': 0.0454, 'grad_norm': 1.2515229875584775, 'learning_rate': 2.116192253849743e-07, 'completion_length': 193.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5699405372142792, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5342262983322144, 'reward_std': 0.1517857238650322, 'kl': 1.13671875, 'epoch': 0.79}
79%| 3380/4286 [21:22:26<5:07:28, 20.36s/it] {'loss': 0.0295, 'grad_norm': 2.9618539939952764, 'learning_rate': 2.1138590760615956e-07, 'completion_length': 193.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.44464291632175446, 'rewards/format_reward': 1.0, 'reward': 1.444642961025238, 'reward_std': 0.05084558296948671, 'kl': 0.73974609375, 'epoch': 0.79}
79%| 3381/4286 [21:22:46<5:07:06, 20.36s/it] {'loss': 0.0396, 'grad_norm': 6.542262878911277, 'learning_rate': 2.1115258982734483e-07, 'completion_length': 182.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7125425636768341, 'rewards/format_reward': 1.0, 'reward': 1.7125425934791565, 'reward_std': 0.1440448984503746, 'kl': 0.98828125, 'epoch': 0.79}
79%| 3382/4286 [21:23:05<4:57:51, 19.77s/it] {'loss': 0.044, 'grad_norm': 6.80354291587277, 'learning_rate': 2.1091927204853008e-07, 'completion_length': 174.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.07662072405219078, 'kl': 1.1015625, 'epoch': 0.79}
79%| 3383/4286 [21:23:25<4:59:28, 19.90s/it] {'loss': 0.0351, 'grad_norm': 2.312035049417529, 'learning_rate': 2.1068595426971533e-07, 'completion_length': 210.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5639881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5461310744285583, 'reward_std': 0.07950930669903755, 'kl': 0.87646484375, 'epoch': 0.79}
79%| 3384/4286 [21:23:45<5:00:43, 20.00s/it] {'loss': 0.0295, 'grad_norm': 2.5333494378115127, 'learning_rate': 2.1045263649090058e-07, 'completion_length': 178.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.645833432674408, 'reward_std': 0.09196125715970993, 'kl': 0.73974609375, 'epoch': 0.79}
79%| 3385/4286 [21:24:04<4:55:48, 19.70s/it] {'loss': 0.0637, 'grad_norm': 25.917377313952816, 'learning_rate': 2.1021931871208583e-07, 'completion_length': 193.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6440476775169373, 'rewards/format_reward': 1.0, 'reward': 1.6440476775169373, 'reward_std': 0.13167474418878555, 'kl': 1.59375, 'epoch': 0.79}
79%| 3386/4286 [21:24:22<4:47:39, 19.18s/it] {'loss': 0.045, 'grad_norm': 3.960806188293557, 'learning_rate': 2.099860009332711e-07, 'completion_length': 177.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.636904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6369048953056335, 'reward_std': 0.11769941821694374, 'kl': 1.12255859375, 'epoch': 0.79}
79%| 3387/4286 [21:24:41<4:45:32, 19.06s/it] {'loss': 0.0523, 'grad_norm': 0.8663538665854384, 'learning_rate': 2.0975268315445635e-07, 'completion_length': 193.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.642857313156128, 'reward_std': 0.05633394047617912, 'kl': 1.30517578125, 'epoch': 0.79}
79%| 3388/4286 [21:25:01<4:51:35, 19.48s/it] {'loss': 0.0568, 'grad_norm': 6.928653303890392, 'learning_rate': 2.095193653756416e-07, 'completion_length': 198.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.7002976536750793, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6824406385421753, 'reward_std': 0.1207944005727768, 'kl': 1.4189453125, 'epoch': 0.79}
79%| 3389/4286 [21:25:21<4:49:30, 19.36s/it] {'loss': 0.0079, 'grad_norm': 1.707694060038137, 'learning_rate': 2.0928604759682685e-07, 'completion_length': 190.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6250000894069672, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.028166969306766987, 'kl': 0.19775390625, 'epoch': 0.79}
79%| 3390/4286 [21:25:41<4:52:12, 19.57s/it] {'loss': 0.0305, 'grad_norm': 2.279255310073763, 'learning_rate': 2.0905272981801213e-07, 'completion_length': 192.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.688988208770752, 'reward_std': 0.12863484397530556, 'kl': 0.763671875, 'epoch': 0.79}
79%| 3391/4286 [21:25:59<4:45:58, 19.17s/it] {'loss': 0.052, 'grad_norm': 5.976321810976935, 'learning_rate': 2.0881941203919737e-07, 'completion_length': 164.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.4970238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4791667461395264, 'reward_std': 0.08114088699221611, 'kl': 1.298828125, 'epoch': 0.79}
79%| 3392/4286 [21:26:21<4:57:05, 19.94s/it] {'loss': 0.0882, 'grad_norm': 4.433387554450364, 'learning_rate': 2.0858609426038262e-07, 'completion_length': 177.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.09025165438652039, 'kl': 2.20703125, 'epoch': 0.79}
79%| 3393/4286 [21:26:41<4:58:00, 20.02s/it] {'loss': 0.0667, 'grad_norm': 17.109779922213182, 'learning_rate': 2.0835277648156787e-07, 'completion_length': 196.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.51488097012043, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4791668057441711, 'reward_std': 0.23212020099163055, 'kl': 1.66796875, 'epoch': 0.79}
79%| 3394/4286 [21:27:03<5:06:42, 20.63s/it] {'loss': 0.0692, 'grad_norm': 10.880216147210923, 'learning_rate': 2.0811945870275312e-07, 'completion_length': 190.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.46853743493556976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4506803750991821, 'reward_std': 0.10600616037845612, 'kl': 1.732421875, 'epoch': 0.79}
79%| 3395/4286 [21:27:22<4:57:43, 20.05s/it] {'loss': 0.0354, 'grad_norm': 4.261817388302278, 'learning_rate': 2.078861409239384e-07, 'completion_length': 172.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7336309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.07678571809083223, 'kl': 0.8876953125, 'epoch': 0.79}
79%| 3396/4286 [21:27:40<4:51:43, 19.67s/it] {'loss': 0.0084, 'grad_norm': 5.391654322274604, 'learning_rate': 2.0765282314512364e-07, 'completion_length': 190.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.6705358028411865, 'rewards/format_reward': 1.0, 'reward': 1.6705358624458313, 'reward_std': 0.07309230882674456, 'kl': 0.2109375, 'epoch': 0.79}
79%| 3397/4286 [21:27:58<4:41:09, 18.98s/it] {'loss': 0.0261, 'grad_norm': 9.432837861023723, 'learning_rate': 2.074195053663089e-07, 'completion_length': 170.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.7127976417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6949406266212463, 'reward_std': 0.10257173050194979, 'kl': 0.65185546875, 'epoch': 0.79}
79%| 3398/4286 [21:28:16<4:38:39, 18.83s/it] {'loss': 0.0459, 'grad_norm': 1.5638813407545165, 'learning_rate': 2.0718618758749414e-07, 'completion_length': 184.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5625000596046448, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5267857909202576, 'reward_std': 0.1369047686457634, 'kl': 1.15234375, 'epoch': 0.79}
79%| 3399/4286 [21:28:34<4:32:47, 18.45s/it] {'loss': 0.0081, 'grad_norm': 0.7334174430330984, 'learning_rate': 2.0695286980867942e-07, 'completion_length': 175.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.0357142873108387, 'kl': 0.20166015625, 'epoch': 0.79}
79%| 3400/4286 [21:28:53<4:34:34, 18.59s/it] {'loss': 0.0706, 'grad_norm': 9.762410433698674, 'learning_rate': 2.0671955202986466e-07, 'completion_length': 186.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.7247024774551392, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6711310744285583, 'reward_std': 0.2817244604229927, 'kl': 1.765625, 'epoch': 0.79}
79%| 3401/4286 [21:32:34<19:32:43, 79.51s/it] {'loss': 0.0384, 'grad_norm': 2.578768162480371, 'learning_rate': 2.0648623425104991e-07, 'completion_length': 184.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.7113095819950104, 'rewards/format_reward': 1.0, 'reward': 1.7113096117973328, 'reward_std': 0.07236604392528534, 'kl': 0.955078125, 'epoch': 0.79}
79%| 3402/4286 [21:32:53<15:02:32, 61.26s/it] {'loss': 0.0156, 'grad_norm': 8.196783955926385, 'learning_rate': 2.0625291647223516e-07, 'completion_length': 198.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.8086310029029846, 'rewards/format_reward': 1.0, 'reward': 1.8086310625076294, 'reward_std': 0.04307790193706751, 'kl': 0.388671875, 'epoch': 0.79}
79%| 3403/4286 [21:33:13<11:58:14, 48.81s/it] {'loss': 0.0434, 'grad_norm': 1.7836880509019037, 'learning_rate': 2.060195986934204e-07, 'completion_length': 196.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.5492772608995438, 'rewards/format_reward': 1.0, 'reward': 1.5492772459983826, 'reward_std': 0.06811224296689034, 'kl': 1.087890625, 'epoch': 0.79}
79%| 3404/4286 [21:33:32<9:46:14, 39.88s/it] {'loss': 0.0605, 'grad_norm': 37.69080805428405, 'learning_rate': 2.0578628091460569e-07, 'completion_length': 192.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5962798297405243, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5784226655960083, 'reward_std': 0.13030626811087132, 'kl': 1.51416015625, 'epoch': 0.79}
[2025-03-03 02:41:07,507] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. (same cache-flush warning as above)
79%| 3405/4286 [21:33:52<8:17:13, 33.86s/it] {'loss': 0.0121, 'grad_norm': 1.5394682609336645, 'learning_rate': 2.0555296313579093e-07, 'completion_length': 191.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.697916716337204, 'rewards/format_reward': 1.0, 'reward': 1.6979167461395264, 'reward_std': 0.04312239959836006, 'kl': 0.30078125, 'epoch': 0.79}
79%| 3406/4286 [21:34:15<7:29:24, 30.64s/it] {'loss': 0.031, 'grad_norm': 4.099649558258738, 'learning_rate': 2.0531964535697618e-07, 'completion_length': 194.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.647321492433548, 'rewards/format_reward': 1.0, 'reward': 1.6473215818405151, 'reward_std': 0.037716567516326904, 'kl': 0.7783203125, 'epoch': 0.79}
79%| 3407/4286 [21:34:33<6:33:15, 26.84s/it] {'loss': 0.0089, 'grad_norm': 0.4582100463802814, 'learning_rate': 2.0508632757816143e-07, 'completion_length': 193.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.6488095223903656, 'rewards/format_reward': 1.0, 'reward': 1.6488096714019775, 'reward_std': 0.0, 'kl': 0.22216796875, 'epoch': 0.79}
80%| 3408/4286 [21:34:54<6:07:19, 25.10s/it] {'loss': 0.0475, 'grad_norm': 11.115136378431057, 'learning_rate': 2.0485300979934668e-07, 'completion_length': 178.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.05029458459466696, 'kl': 1.189453125, 'epoch': 0.8}
80%| 3409/4286 [21:35:17<5:56:53, 24.42s/it] {'loss': 0.0389, 'grad_norm': 12.174348102153306, 'learning_rate': 2.0461969202053196e-07, 'completion_length': 211.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5791667103767395, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.561309576034546, 'reward_std': 0.12738095596432686, 'kl': 0.97509765625, 'epoch': 0.8}
80%| 3410/4286 [21:35:35<5:28:01, 22.47s/it] {'loss': 0.0252, 'grad_norm': 2.8637349869758975, 'learning_rate': 2.043863742417172e-07, 'completion_length': 179.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.5285715162754059, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5107144117355347, 'reward_std': 0.06551574263721704, 'kl': 0.630859375, 'epoch': 0.8}
80%| 3411/4286 [21:35:55<5:18:48, 21.86s/it] {'loss': 0.0678, 'grad_norm': 10.890035177398484, 'learning_rate': 2.0415305646290245e-07, 'completion_length': 179.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.7053571939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6696429252624512, 'reward_std': 0.23121538013219833, 'kl': 1.69921875, 'epoch': 0.8}
80%| 3412/4286 [21:36:15<5:10:12, 21.30s/it] {'loss': 0.018, 'grad_norm': 4.164776639384627, 'learning_rate': 2.039197386840877e-07, 'completion_length': 194.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261906266212463, 'reward_std': 0.037515184842050076, 'kl': 0.4482421875, 'epoch': 0.8}
80%| 3413/4286 [21:36:35<5:03:22, 20.85s/it] {'loss': 0.035, 'grad_norm': 9.74940175378794, 'learning_rate': 2.0368642090527298e-07, 'completion_length': 186.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6111820042133331, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.593324899673462, 'reward_std': 0.07110941782593727, 'kl': 0.875, 'epoch': 0.8}
80%| 3414/4286 [21:36:52<4:48:47, 19.87s/it] {'loss': 0.03, 'grad_norm': 5.337097858948257, 'learning_rate': 2.0345310312645823e-07, 'completion_length': 167.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.7464286088943481, 'rewards/format_reward': 1.0, 'reward': 1.7464287281036377, 'reward_std': 0.0921705923974514, 'kl': 0.751953125, 'epoch': 0.8}
80%| 3415/4286 [21:37:10<4:36:41, 19.06s/it] {'loss': 0.0616, 'grad_norm': 2.8895331317735327, 'learning_rate': 2.0321978534764347e-07, 'completion_length': 164.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.709821492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6919643878936768, 'reward_std': 0.11770784854888916, 'kl': 1.5458984375, 'epoch': 0.8}
80%| 3416/4286 [21:37:28<4:33:49, 18.88s/it] {'loss': 0.0119, 'grad_norm': 0.8118471804280992, 'learning_rate': 2.0298646756882872e-07, 'completion_length': 186.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6250000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6250001788139343, 'reward_std': 0.0357142873108387, 'kl': 0.296875, 'epoch': 0.8}
80%| 3417/4286 [21:37:46<4:31:23, 18.74s/it] {'loss': 0.026, 'grad_norm': 4.625935344286648, 'learning_rate': 2.0275314979001397e-07, 'completion_length': 193.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.587797611951828, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5699405670166016, 'reward_std': 0.14245203137397766, 'kl': 0.65087890625, 'epoch': 0.8}
80%| 3418/4286 [21:38:05<4:32:41, 18.85s/it] {'loss': 0.0459, 'grad_norm': 1.33684514111687, 'learning_rate': 2.0251983201119925e-07, 'completion_length': 189.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6949405670166016, 'reward_std': 0.133928582072258, 'kl': 1.146484375, 'epoch': 0.8}
80%| 3419/4286 [21:38:29<4:53:08, 20.29s/it] {'loss': 0.0208, 'grad_norm': 4.663946537205493, 'learning_rate': 2.022865142323845e-07, 'completion_length': 210.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6651786267757416, 'rewards/format_reward': 1.0, 'reward': 1.665178656578064, 'reward_std': 0.02083333395421505, 'kl': 0.517578125, 'epoch': 0.8}
80%| 3420/4286 [21:38:47<4:43:22, 19.63s/it] {'loss': 0.007, 'grad_norm': 7.630129336367707, 'learning_rate': 2.0205319645356974e-07, 'completion_length': 196.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5699405074119568, 'rewards/format_reward': 1.0, 'reward': 1.5699405670166016, 'reward_std': 0.04698196332901716, 'kl': 0.17529296875, 'epoch': 0.8}
80%| 3421/4286 [21:39:06<4:38:54, 19.35s/it] {'loss': 0.007, 'grad_norm': 8.022631689075482, 'learning_rate': 2.01819878674755e-07, 'completion_length': 187.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7023810148239136, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.07435034960508347, 'kl': 0.17431640625, 'epoch': 0.8}
80%| 3422/4286 [21:39:24<4:34:33, 19.07s/it] {'loss': 0.0657, 'grad_norm': 2.8807479069221213, 'learning_rate': 2.0158656089594027e-07, 'completion_length': 178.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7500001192092896, 'reward_std': 0.15639832615852356, 'kl': 1.6455078125, 'epoch': 0.8}
80%| 3423/4286 [21:39:44<4:34:44, 19.10s/it] {'loss': 0.0072, 'grad_norm': 11.907994080517774, 'learning_rate': 2.0135324311712552e-07, 'completion_length': 188.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.05351835861802101, 'kl': 0.18115234375, 'epoch': 0.8}
80%| 3424/4286 [21:40:03<4:34:43, 19.12s/it] {'loss': 0.0509, 'grad_norm': 9.428875775272981, 'learning_rate': 2.0111992533831077e-07, 'completion_length': 178.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6383928656578064, 'rewards/format_reward': 1.0, 'reward': 1.6383929252624512, 'reward_std': 0.05876358598470688, 'kl': 1.271484375, 'epoch': 0.8}
80%| 3425/4286 [21:40:21<4:32:09, 18.97s/it] {'loss': 0.0074, 'grad_norm': 3.532505963630359, 'learning_rate': 2.0088660755949601e-07, 'completion_length': 196.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.8342262208461761, 'rewards/format_reward': 1.0, 'reward': 1.8342262506484985, 'reward_std': 0.06485805660486221, 'kl': 0.18408203125, 'epoch': 0.8}
80%| 3426/4286 [21:40:43<4:41:58, 19.67s/it] {'loss': 0.044, 'grad_norm': 3.8763085037501805, 'learning_rate': 2.0065328978068126e-07, 'completion_length': 208.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.457589328289032, 'rewards/format_reward': 1.0, 'reward': 1.4575893878936768, 'reward_std': 0.04583030007779598, 'kl': 1.09765625, 'epoch': 0.8}
80%| 3427/4286 [21:41:01<4:36:43, 19.33s/it] {'loss': 0.074, 'grad_norm': 13.2737297449274, 'learning_rate': 2.0041997200186654e-07, 'completion_length': 185.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6592262387275696, 'reward_std': 0.1398809589445591, 'kl': 1.8466796875, 'epoch': 0.8}
80%| 3428/4286 [21:41:20<4:35:55, 19.29s/it] {'loss': 0.0318, 'grad_norm': 3.0127997891770457, 'learning_rate': 2.0018665422305179e-07, 'completion_length': 185.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 1.0, 'reward': 1.7619049549102783, 'reward_std': 0.03905685991048813, 'kl': 0.79833984375, 'epoch': 0.8}
80%| 3429/4286 [21:41:42<4:43:56, 19.88s/it] {'loss': 0.0427, 'grad_norm': 1.948441068521201, 'learning_rate': 1.9995333644423704e-07, 'completion_length': 185.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.6568452715873718, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6211310625076294, 'reward_std': 0.14953920245170593, 'kl': 1.06689453125, 'epoch': 0.8}
80%| 3430/4286 [21:42:02<4:44:06, 19.91s/it] {'loss': 0.0106, 'grad_norm': 7.971807547141913, 'learning_rate': 1.9972001866542228e-07, 'completion_length': 177.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.5497024357318878, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5318453907966614, 'reward_std': 0.10468304390087724, 'kl': 0.26513671875, 'epoch': 0.8}
80%| 3431/4286 [21:42:21<4:43:08, 19.87s/it] {'loss': 0.1005, 'grad_norm': 2.7932434715175862, 'learning_rate': 1.9948670088660753e-07, 'completion_length': 203.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6190476417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5833334922790527, 'reward_std': 0.14395049959421158, 'kl': 2.515625, 'epoch': 0.8}
80%| 3432/4286 [21:42:39<4:32:50, 19.17s/it] {'loss': 0.0695, 'grad_norm': 1.487583600538547, 'learning_rate': 1.992533831077928e-07, 'completion_length': 174.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6145834028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5967262983322144, 'reward_std': 0.1160714328289032, 'kl': 1.7421875, 'epoch': 0.8}
80%| 3433/4286 [21:42:58<4:33:20, 19.23s/it] {'loss': 0.0261, 'grad_norm': 5.2643192458665515, 'learning_rate': 1.9902006532897806e-07, 'completion_length': 185.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.650297611951828, 'rewards/format_reward': 1.0, 'reward': 1.65029776096344, 'reward_std': 0.038089167326688766, 'kl': 0.65283203125, 'epoch': 0.8}
80%| 3434/4286 [21:43:16<4:28:26, 18.90s/it] {'loss': 0.0077, 'grad_norm': 4.682786471880883, 'learning_rate': 1.987867475501633e-07, 'completion_length': 180.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.5967262089252472, 'rewards/format_reward': 1.0, 'reward': 1.5967262387275696, 'reward_std': 0.04027373343706131, 'kl': 0.19189453125, 'epoch': 0.8}
80%| 3435/4286 [21:43:37<4:34:07, 19.33s/it] {'loss': 0.0382, 'grad_norm': 5.02623316333522, 'learning_rate': 1.9855342977134855e-07, 'completion_length': 189.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6160714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5982143878936768, 'reward_std': 0.12615589797496796, 'kl': 0.95703125, 'epoch': 0.8}
80%| 3436/4286 [21:43:54<4:26:21, 18.80s/it] {'loss': 0.0743, 'grad_norm': 2.3786249164205016, 'learning_rate': 1.9832011199253383e-07, 'completion_length': 175.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.723214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7053572535514832, 'reward_std': 0.153166975826025, 'kl': 1.859375, 'epoch': 0.8}
80%| 3437/4286 [21:44:13<4:25:43, 18.78s/it] {'loss': 0.0544, 'grad_norm': 1.0857737616118757, 'learning_rate': 1.9808679421371908e-07, 'completion_length': 188.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5937500894069672, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5580358505249023, 'reward_std': 0.1220238134264946, 'kl': 1.36328125, 'epoch': 0.8}
80%| 3438/4286 [21:44:33<4:28:27, 19.00s/it] {'loss': 0.0366, 'grad_norm': 9.819051078614955, 'learning_rate': 1.9785347643490433e-07, 'completion_length': 192.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.49494050443172455, 'rewards/format_reward': 1.0, 'reward': 1.4949405193328857, 'reward_std': 0.06726190820336342, 'kl': 0.9140625, 'epoch': 0.8}
80%| 3439/4286 [21:44:51<4:24:19, 18.72s/it] {'loss': 0.1195, 'grad_norm': 702.6815944098712, 'learning_rate': 1.9762015865608958e-07, 'completion_length': 174.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.584821492433548, 'rewards/format_reward': 1.0, 'reward': 1.5848215818405151, 'reward_std': 0.03273809980601072, 'kl': 2.982421875, 'epoch': 0.8}
[2025-03-03 02:52:26,865] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. (same cache-flush warning as above)
80%| 3440/4286 [21:45:11<4:31:03, 19.22s/it] {'loss': 0.0786, 'grad_norm': 4.1235939285436825, 'learning_rate': 1.9738684087727482e-07, 'completion_length': 195.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.645833432674408, 'reward_std': 0.123106699436903, 'kl': 1.9658203125, 'epoch': 0.8}
[2025-03-03 02:52:46,221] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. (same cache-flush warning as above)
80%| 3441/4286 [21:45:30<4:31:17, 19.26s/it] {'loss': 0.0331, 'grad_norm': 3.318496619700854, 'learning_rate': 1.971535230984601e-07, 'completion_length': 186.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6086309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5907739400863647, 'reward_std': 0.09661935735493898, 'kl': 0.830078125, 'epoch': 0.8}
80%| 3442/4286 [21:45:49<4:26:40, 18.96s/it] {'loss': 0.0498, 'grad_norm': 22.207709931150728, 'learning_rate': 1.9692020531964535e-07, 'completion_length': 164.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 1.0, 'reward': 1.5937501192092896, 'reward_std': 0.0566922165453434, 'kl': 1.240234375, 'epoch': 0.8}
3443/4286 [21:46:08<4:28:35, 19.12s/it] {'loss': 0.0511, 'grad_norm': 1.0612290863264071, 'learning_rate': 1.966868875408306e-07, 'completion_length': 180.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6160714626312256, 'reward_std': 0.13690476305782795, 'kl': 1.28173828125, 'epoch': 0.8} 80%|████████ | 3443/4286 [21:46:08<4:28:35, 19.12s/it] 80%|████████ | 3444/4286 [21:46:27<4:29:26, 19.20s/it] {'loss': 0.0335, 'grad_norm': 21.53679547917792, 'learning_rate': 1.9645356976201585e-07, 'completion_length': 174.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6738095879554749, 'rewards/format_reward': 1.0, 'reward': 1.6738096475601196, 'reward_std': 0.11368954926729202, 'kl': 0.84228515625, 'epoch': 0.8} 80%|████████ | 3444/4286 [21:46:27<4:29:26, 19.20s/it] 80%|████████ | 3445/4286 [21:46:52<4:51:21, 20.79s/it] {'loss': 0.034, 'grad_norm': 1.8740810333536086, 'learning_rate': 1.9622025198320112e-07, 'completion_length': 191.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.7175595760345459, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6997025609016418, 'reward_std': 0.08944508992135525, 'kl': 0.8525390625, 'epoch': 0.8} 80%|████████ | 3445/4286 [21:46:52<4:51:21, 20.79s/it] 80%|████████ | 3446/4286 [21:47:11<4:44:40, 20.33s/it] {'loss': 0.0731, 'grad_norm': 14.367639878138522, 'learning_rate': 1.9598693420438637e-07, 'completion_length': 179.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.848214328289032, 'rewards/format_reward': 1.0, 'reward': 1.848214328289032, 'reward_std': 0.05541309714317322, 'kl': 1.83154296875, 'epoch': 0.8} 80%|████████ | 3446/4286 [21:47:11<4:44:40, 20.33s/it] 80%|████████ | 3447/4286 [21:47:28<4:31:06, 19.39s/it] {'loss': 0.0072, 'grad_norm': 2.9573523574792384, 'learning_rate': 1.9575361642557162e-07, 'completion_length': 172.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.7782738506793976, 
'rewards/format_reward': 1.0, 'reward': 1.7782739400863647, 'reward_std': 0.06618335098028183, 'kl': 0.18017578125, 'epoch': 0.8} 80%|████████ | 3447/4286 [21:47:28<4:31:06, 19.39s/it] 80%|████████ | 3448/4286 [21:47:46<4:21:52, 18.75s/it] {'loss': 0.0248, 'grad_norm': 0.9100367173292989, 'learning_rate': 1.9552029864675687e-07, 'completion_length': 160.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607144474983215, 'reward_std': 0.08173839747905731, 'kl': 0.6181640625, 'epoch': 0.8} 80%|████████ | 3448/4286 [21:47:46<4:21:52, 18.75s/it] 80%|████████ | 3449/4286 [21:48:05<4:25:01, 19.00s/it] {'loss': 0.0543, 'grad_norm': 4.336655971945384, 'learning_rate': 1.9528698086794211e-07, 'completion_length': 183.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6157738566398621, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5979167222976685, 'reward_std': 0.084793571382761, 'kl': 1.35546875, 'epoch': 0.8} 80%|████████ | 3449/4286 [21:48:05<4:25:01, 19.00s/it] 80%|████████ | 3450/4286 [21:48:24<4:21:36, 18.78s/it] {'loss': 0.0073, 'grad_norm': 3.3392696601152125, 'learning_rate': 1.950536630891274e-07, 'completion_length': 176.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.028166969306766987, 'kl': 0.18115234375, 'epoch': 0.8} 80%|████████ | 3450/4286 [21:48:24<4:21:36, 18.78s/it] 81%|████████ | 3451/4286 [21:48:42<4:18:05, 18.55s/it] {'loss': 0.0243, 'grad_norm': 1.6870451733732552, 'learning_rate': 1.9482034531031264e-07, 'completion_length': 164.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.5940476953983307, 'rewards/format_reward': 1.0, 'reward': 1.5940476655960083, 'reward_std': 0.05914039257913828, 'kl': 0.60791015625, 'epoch': 0.81} 81%|████████ | 3451/4286 [21:48:42<4:18:05, 18.55s/it] 81%|████████ | 3452/4286 [21:48:59<4:13:25, 18.23s/it] {'loss': 0.0281, 
'grad_norm': 2.117999556642384, 'learning_rate': 1.945870275314979e-07, 'completion_length': 176.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.783928632736206, 'rewards/format_reward': 1.0, 'reward': 1.783928632736206, 'reward_std': 0.04606983345001936, 'kl': 0.70068359375, 'epoch': 0.81} 81%|████████ | 3452/4286 [21:48:59<4:13:25, 18.23s/it] 81%|████████ | 3453/4286 [21:49:22<4:32:44, 19.64s/it] {'loss': 0.0189, 'grad_norm': 3.137540427544899, 'learning_rate': 1.9435370975268314e-07, 'completion_length': 199.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 1.0, 'reward': 1.6473215222358704, 'reward_std': 0.07229908183217049, 'kl': 0.47265625, 'epoch': 0.81} 81%|████████ | 3453/4286 [21:49:22<4:32:44, 19.64s/it] 81%|████████ | 3454/4286 [21:49:42<4:32:11, 19.63s/it] {'loss': 0.0461, 'grad_norm': 1.718063579634057, 'learning_rate': 1.9412039197386838e-07, 'completion_length': 178.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6458334028720856, 'rewards/format_reward': 1.0, 'reward': 1.645833432674408, 'reward_std': 0.09548484347760677, 'kl': 1.1484375, 'epoch': 0.81} 81%|████████ | 3454/4286 [21:49:42<4:32:11, 19.63s/it] 81%|████████ | 3455/4286 [21:50:00<4:27:28, 19.31s/it] {'loss': 0.0294, 'grad_norm': 1.0607515139934935, 'learning_rate': 1.9388707419505366e-07, 'completion_length': 179.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.7785714864730835, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7607144117355347, 'reward_std': 0.08474471140652895, 'kl': 0.73681640625, 'epoch': 0.81} 81%|████████ | 3455/4286 [21:50:00<4:27:28, 19.31s/it] 81%|████████ | 3456/4286 [21:50:19<4:23:39, 19.06s/it] {'loss': 0.023, 'grad_norm': 1.218263935210323, 'learning_rate': 1.936537564162389e-07, 'completion_length': 195.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6979166865348816, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 
0.0659366063773632, 'kl': 0.57275390625, 'epoch': 0.81} 81%|████████ | 3456/4286 [21:50:19<4:23:39, 19.06s/it] 81%|████████ | 3457/4286 [21:50:37<4:20:46, 18.87s/it] {'loss': 0.0315, 'grad_norm': 3.4793150064239042, 'learning_rate': 1.9342043863742416e-07, 'completion_length': 177.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6327381432056427, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6148810386657715, 'reward_std': 0.11838765814900398, 'kl': 0.7890625, 'epoch': 0.81} 81%|████████ | 3457/4286 [21:50:37<4:20:46, 18.87s/it] 81%|████████ | 3458/4286 [21:50:58<4:29:51, 19.55s/it] {'loss': 0.0371, 'grad_norm': 10.694815446951067, 'learning_rate': 1.931871208586094e-07, 'completion_length': 196.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.52976194024086, 'rewards/format_reward': 1.0, 'reward': 1.5297620296478271, 'reward_std': 0.06472779344767332, 'kl': 0.9287109375, 'epoch': 0.81} 81%|████████ | 3458/4286 [21:50:58<4:29:51, 19.55s/it] 81%|████████ | 3459/4286 [21:51:18<4:32:07, 19.74s/it] {'loss': 0.0242, 'grad_norm': 5.457980222166459, 'learning_rate': 1.9295380307979468e-07, 'completion_length': 209.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.7693452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7693453431129456, 'reward_std': 0.05943367816507816, 'kl': 0.60546875, 'epoch': 0.81} 81%|████████ | 3459/4286 [21:51:18<4:32:07, 19.74s/it] 81%|████████ | 3460/4286 [21:51:37<4:28:15, 19.49s/it] {'loss': 0.0274, 'grad_norm': 7.578116769994676, 'learning_rate': 1.9272048530097993e-07, 'completion_length': 180.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7348214685916901, 'rewards/format_reward': 1.0, 'reward': 1.7348215579986572, 'reward_std': 0.06980404630303383, 'kl': 0.6875, 'epoch': 0.81} 81%|████████ | 3460/4286 [21:51:37<4:28:15, 19.49s/it] 81%|████████ | 3461/4286 [21:51:55<4:21:15, 19.00s/it] {'loss': 0.007, 'grad_norm': 8.68802058289269, 'learning_rate': 1.9248716752216518e-07, 
'completion_length': 171.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6830357313156128, 'rewards/format_reward': 1.0, 'reward': 1.6830357909202576, 'reward_std': 0.026785715483129025, 'kl': 0.17529296875, 'epoch': 0.81} 81%|████████ | 3461/4286 [21:51:55<4:21:15, 19.00s/it] 81%|████████ | 3462/4286 [21:52:17<4:32:52, 19.87s/it] {'loss': 0.032, 'grad_norm': 3.0801814194227792, 'learning_rate': 1.9225384974335043e-07, 'completion_length': 206.12501525878906, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6592262983322144, 'reward_std': 0.07584905717521906, 'kl': 0.7998046875, 'epoch': 0.81} 81%|████████ | 3462/4286 [21:52:17<4:32:52, 19.87s/it] 81%|████████ | 3463/4286 [21:52:38<4:35:10, 20.06s/it] {'loss': 0.033, 'grad_norm': 1.8582984824445576, 'learning_rate': 1.9202053196453568e-07, 'completion_length': 177.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6413691341876984, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6235120296478271, 'reward_std': 0.0922619067132473, 'kl': 0.82763671875, 'epoch': 0.81} 81%|████████ | 3463/4286 [21:52:38<4:35:10, 20.06s/it][2025-03-03 03:00:14,587] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 81%|████████ | 3464/4286 [21:52:59<4:39:26, 20.40s/it] {'loss': 0.064, 'grad_norm': 12.870779299417135, 'learning_rate': 1.9178721418572095e-07, 'completion_length': 182.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.68452388048172, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6488096714019775, 'reward_std': 0.13920534402132034, 'kl': 1.6015625, 'epoch': 0.81} 81%|████████ | 3464/4286 [21:52:59<4:39:26, 20.40s/it] 81%|████████ | 3465/4286 [21:53:17<4:29:16, 19.68s/it] {'loss': 0.0167, 'grad_norm': 5.147305802522027, 'learning_rate': 1.915538964069062e-07, 'completion_length': 176.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178572535514832, 'reward_std': 0.12453466467559338, 'kl': 0.41796875, 'epoch': 0.81} 81%|████████ | 3465/4286 [21:53:17<4:29:16, 19.68s/it] 81%|████████ | 3466/4286 [21:53:34<4:20:40, 19.07s/it] {'loss': 0.0124, 'grad_norm': 7.6817184113411505, 'learning_rate': 1.9132057862809145e-07, 'completion_length': 170.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6294643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6294643878936768, 'reward_std': 0.059310125187039375, 'kl': 0.31005859375, 'epoch': 0.81} 81%|████████ | 3466/4286 [21:53:34<4:20:40, 19.07s/it] 81%|████████ | 3467/4286 [21:53:52<4:12:59, 18.53s/it] {'loss': 0.0334, 'grad_norm': 2.930506597347632, 'learning_rate': 1.910872608492767e-07, 'completion_length': 158.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.617559552192688, 'rewards/format_reward': 1.0, 'reward': 1.6175596714019775, 'reward_std': 0.056547620333731174, 'kl': 0.83447265625, 'epoch': 0.81} 81%|████████ | 3467/4286 [21:53:52<4:12:59, 18.53s/it] 81%|████████ | 3468/4286 [21:54:10<4:10:36, 18.38s/it] 
{'loss': 0.0391, 'grad_norm': 1.0582469223039634, 'learning_rate': 1.9085394307046197e-07, 'completion_length': 180.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.8130952715873718, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7773810625076294, 'reward_std': 0.13571429252624512, 'kl': 0.982421875, 'epoch': 0.81} 81%|████████ | 3468/4286 [21:54:10<4:10:36, 18.38s/it][2025-03-03 03:01:43,536] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 81%|████████ | 3469/4286 [21:54:28<4:08:40, 18.26s/it] {'loss': 0.0326, 'grad_norm': 6.41585069210176, 'learning_rate': 1.9062062529164722e-07, 'completion_length': 169.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.6190476715564728, 'rewards/format_reward': 1.0, 'reward': 1.61904776096344, 'reward_std': 0.05094085447490215, 'kl': 0.81298828125, 'epoch': 0.81} 81%|████████ | 3469/4286 [21:54:28<4:08:40, 18.26s/it] 81%|████████ | 3470/4286 [21:54:47<4:11:25, 18.49s/it] {'loss': 0.0516, 'grad_norm': 42.70442197105591, 'learning_rate': 1.9038730751283247e-07, 'completion_length': 193.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.6555272042751312, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.619813084602356, 'reward_std': 0.17722046375274658, 'kl': 1.28515625, 'epoch': 0.81} 81%|████████ | 3470/4286 [21:54:47<4:11:25, 18.49s/it] 81%|████████ | 3471/4286 [21:55:06<4:14:57, 18.77s/it] {'loss': 0.0837, 'grad_norm': 30.696587163721354, 'learning_rate': 1.9015398973401772e-07, 'completion_length': 198.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.7247024774551392, 
'rewards/format_reward': 0.9642857313156128, 'reward': 1.688988208770752, 'reward_std': 0.1681198887526989, 'kl': 2.09765625, 'epoch': 0.81} 81%|████████ | 3471/4286 [21:55:06<4:14:57, 18.77s/it] 81%|████████ | 3472/4286 [21:55:29<4:31:31, 20.01s/it] {'loss': 0.0853, 'grad_norm': 8.323915054901802, 'learning_rate': 1.8992067195520297e-07, 'completion_length': 184.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.616071492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5982143878936768, 'reward_std': 0.11465832777321339, 'kl': 2.13671875, 'epoch': 0.81} 81%|████████ | 3472/4286 [21:55:29<4:31:31, 20.01s/it] 81%|████████ | 3473/4286 [21:55:49<4:31:51, 20.06s/it] {'loss': 0.0339, 'grad_norm': 8.121978058107354, 'learning_rate': 1.8968735417638824e-07, 'completion_length': 187.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.5553571581840515, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.537500023841858, 'reward_std': 0.150860495865345, 'kl': 0.8515625, 'epoch': 0.81} 81%|████████ | 3473/4286 [21:55:49<4:31:51, 20.06s/it] 81%|████████ | 3474/4286 [21:56:07<4:23:36, 19.48s/it] {'loss': 0.0387, 'grad_norm': 6.9507748664052, 'learning_rate': 1.894540363975735e-07, 'completion_length': 160.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.5907738208770752, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5729167461395264, 'reward_std': 0.1316809356212616, 'kl': 0.96728515625, 'epoch': 0.81} 81%|████████ | 3474/4286 [21:56:07<4:23:36, 19.48s/it] 81%|████████ | 3475/4286 [21:56:25<4:15:49, 18.93s/it] {'loss': 0.0069, 'grad_norm': 0.6624752390609482, 'learning_rate': 1.8922071861875874e-07, 'completion_length': 176.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.7961309850215912, 'rewards/format_reward': 1.0, 'reward': 1.7961310744285583, 'reward_std': 0.03495405614376068, 'kl': 0.17236328125, 'epoch': 0.81} 81%|████████ | 3475/4286 [21:56:25<4:15:49, 18.93s/it] 81%|████████ | 3476/4286 
[21:56:44<4:15:22, 18.92s/it] {'loss': 0.0362, 'grad_norm': 1.566438417061157, 'learning_rate': 1.88987400839944e-07, 'completion_length': 193.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5627976357936859, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5449405312538147, 'reward_std': 0.14560046046972275, 'kl': 0.904296875, 'epoch': 0.81} 81%|████████ | 3476/4286 [21:56:44<4:15:22, 18.92s/it] 81%|████████ | 3477/4286 [21:57:03<4:14:22, 18.87s/it] {'loss': 0.0281, 'grad_norm': 3.296955512604131, 'learning_rate': 1.8875408306112924e-07, 'completion_length': 194.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7529762387275696, 'reward_std': 0.07738095847889781, 'kl': 0.70458984375, 'epoch': 0.81} 81%|████████ | 3477/4286 [21:57:03<4:14:22, 18.87s/it] 81%|████████ | 3478/4286 [21:57:20<4:08:43, 18.47s/it] {'loss': 0.0189, 'grad_norm': 5.451800173894736, 'learning_rate': 1.885207652823145e-07, 'completion_length': 179.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.05169975198805332, 'kl': 0.47265625, 'epoch': 0.81} 81%|████████ | 3478/4286 [21:57:20<4:08:43, 18.47s/it] 81%|████████ | 3479/4286 [21:57:39<4:09:09, 18.53s/it] {'loss': 0.0349, 'grad_norm': 1.9209306547649547, 'learning_rate': 1.8828744750349976e-07, 'completion_length': 193.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7172619700431824, 'reward_std': 0.07738095708191395, 'kl': 0.87109375, 'epoch': 0.81} 81%|████████ | 3479/4286 [21:57:39<4:09:09, 18.53s/it] 81%|████████ | 3480/4286 [21:57:57<4:06:42, 18.37s/it] {'loss': 0.0075, 'grad_norm': 1.1745643889858, 'learning_rate': 1.88054129724685e-07, 'completion_length': 189.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.6086310148239136, 'rewards/format_reward': 
1.0, 'reward': 1.6086310744285583, 'reward_std': 0.0474053667858243, 'kl': 0.18798828125, 'epoch': 0.81} 81%|████████ | 3480/4286 [21:57:57<4:06:42, 18.37s/it] 81%|████████ | 3481/4286 [21:58:16<4:08:56, 18.55s/it] {'loss': 0.0081, 'grad_norm': 2.7555901974682095, 'learning_rate': 1.8782081194587026e-07, 'completion_length': 184.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.7172619998455048, 'rewards/format_reward': 1.0, 'reward': 1.7172620296478271, 'reward_std': 0.040653031319379807, 'kl': 0.20263671875, 'epoch': 0.81} 81%|████████ | 3481/4286 [21:58:16<4:08:56, 18.55s/it] 81%|████████ | 3482/4286 [21:58:35<4:12:50, 18.87s/it] {'loss': 0.0564, 'grad_norm': 1.8024613910698781, 'learning_rate': 1.8758749416705553e-07, 'completion_length': 183.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6892857551574707, 'rewards/format_reward': 1.0, 'reward': 1.6892858147621155, 'reward_std': 0.059624457731842995, 'kl': 1.41015625, 'epoch': 0.81} 81%|████████ | 3482/4286 [21:58:35<4:12:50, 18.87s/it] 81%|████████▏ | 3483/4286 [21:58:54<4:11:54, 18.82s/it] {'loss': 0.0441, 'grad_norm': 4.7887547409840305, 'learning_rate': 1.8735417638824078e-07, 'completion_length': 170.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.740178644657135, 'rewards/format_reward': 1.0, 'reward': 1.7401787042617798, 'reward_std': 0.07657129876315594, 'kl': 1.10302734375, 'epoch': 0.81} 81%|████████▏ | 3483/4286 [21:58:54<4:11:54, 18.82s/it] 81%|████████▏ | 3484/4286 [21:59:12<4:08:42, 18.61s/it] {'loss': 0.037, 'grad_norm': 11.00294098153873, 'learning_rate': 1.8712085860942603e-07, 'completion_length': 189.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.6949405074119568, 'rewards/format_reward': 1.0, 'reward': 1.6949406266212463, 'reward_std': 0.09861730504781008, 'kl': 0.92431640625, 'epoch': 0.81} 81%|████████▏ | 3484/4286 [21:59:12<4:08:42, 18.61s/it] 81%|████████▏ | 3485/4286 [21:59:31<4:07:36, 18.55s/it] {'loss': 0.0342, 'grad_norm': 
6.772002299742048, 'learning_rate': 1.8688754083061128e-07, 'completion_length': 202.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 1.0, 'reward': 1.7351191639900208, 'reward_std': 0.06781134335324168, 'kl': 0.85205078125, 'epoch': 0.81} 81%|████████▏ | 3485/4286 [21:59:31<4:07:36, 18.55s/it] 81%|████████▏ | 3486/4286 [21:59:50<4:10:35, 18.79s/it] {'loss': 0.0072, 'grad_norm': 1.0720386059887264, 'learning_rate': 1.8665422305179653e-07, 'completion_length': 198.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.799107164144516, 'rewards/format_reward': 1.0, 'reward': 1.7991072535514832, 'reward_std': 0.008928571827709675, 'kl': 0.1796875, 'epoch': 0.81} 81%|████████▏ | 3486/4286 [21:59:50<4:10:35, 18.79s/it] 81%|████████▏ | 3487/4286 [22:00:08<4:07:04, 18.55s/it] {'loss': 0.0338, 'grad_norm': 14.220906920371593, 'learning_rate': 1.864209052729818e-07, 'completion_length': 184.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6857143044471741, 'rewards/format_reward': 1.0, 'reward': 1.6857143640518188, 'reward_std': 0.06935982033610344, 'kl': 0.84130859375, 'epoch': 0.81} 81%|████████▏ | 3487/4286 [22:00:08<4:07:04, 18.55s/it] 81%|████████▏ | 3488/4286 [22:00:29<4:18:21, 19.43s/it] {'loss': 0.0098, 'grad_norm': 1.2706144015729903, 'learning_rate': 1.8618758749416705e-07, 'completion_length': 182.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6711309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6711310744285583, 'reward_std': 0.019238398410379887, 'kl': 0.24365234375, 'epoch': 0.81} 81%|████████▏ | 3488/4286 [22:00:29<4:18:21, 19.43s/it] 81%|████████▏ | 3489/4286 [22:00:48<4:16:12, 19.29s/it] {'loss': 0.079, 'grad_norm': 14.006542521944015, 'learning_rate': 1.859542697153523e-07, 'completion_length': 179.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.616071492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.580357313156128, 'reward_std': 
0.16942918300628662, 'kl': 1.9765625, 'epoch': 0.81} 81%|████████▏ | 3489/4286 [22:00:48<4:16:12, 19.29s/it] 81%|████████▏ | 3490/4286 [22:01:10<4:23:15, 19.84s/it] {'loss': 0.0602, 'grad_norm': 2.186275510982187, 'learning_rate': 1.8572095193653755e-07, 'completion_length': 197.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 1.0, 'reward': 1.6473215818405151, 'reward_std': 0.052916526794433594, 'kl': 1.50390625, 'epoch': 0.81} 81%|████████▏ | 3490/4286 [22:01:10<4:23:15, 19.84s/it] 81%|████████▏ | 3491/4286 [22:01:28<4:19:12, 19.56s/it] {'loss': 0.0408, 'grad_norm': 2.7975804692000654, 'learning_rate': 1.854876341577228e-07, 'completion_length': 194.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7436225414276123, 'rewards/format_reward': 1.0, 'reward': 1.743622601032257, 'reward_std': 0.07007416151463985, 'kl': 1.01953125, 'epoch': 0.81} 81%|████████▏ | 3491/4286 [22:01:28<4:19:12, 19.56s/it] 81%|████████▏ | 3492/4286 [22:01:49<4:23:22, 19.90s/it] {'loss': 0.016, 'grad_norm': 8.289187835689441, 'learning_rate': 1.8525431637890807e-07, 'completion_length': 202.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.7202381789684296, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.0476190485060215, 'kl': 0.4013671875, 'epoch': 0.81} 81%|████████▏ | 3492/4286 [22:01:49<4:23:22, 19.90s/it] 81%|████████▏ | 3493/4286 [22:02:10<4:25:11, 20.07s/it] {'loss': 0.0077, 'grad_norm': 2.132492366528292, 'learning_rate': 1.8502099860009332e-07, 'completion_length': 193.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.699404776096344, 'rewards/format_reward': 1.0, 'reward': 1.6994048953056335, 'reward_std': 0.04602411389350891, 'kl': 0.19384765625, 'epoch': 0.81} 81%|████████▏ | 3493/4286 [22:02:10<4:25:11, 20.07s/it] 82%|████████▏ | 3494/4286 [22:02:32<4:34:59, 20.83s/it] {'loss': 0.0383, 'grad_norm': 7.921572960327529, 'learning_rate': 1.8478768082127857e-07, 
'completion_length': 212.55357360839844, 'rewards/only_full_func_accuracy_reward': 0.5416666716337204, 'rewards/format_reward': 1.0, 'reward': 1.5416668057441711, 'reward_std': 0.07715156301856041, 'kl': 0.95849609375, 'epoch': 0.82} 82%|████████▏ | 3494/4286 [22:02:32<4:34:59, 20.83s/it] 82%|████████▏ | 3495/4286 [22:02:52<4:30:14, 20.50s/it] {'loss': 0.0198, 'grad_norm': 52.175175025493644, 'learning_rate': 1.8455436304246382e-07, 'completion_length': 169.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.8139881193637848, 'rewards/format_reward': 1.0, 'reward': 1.8139881491661072, 'reward_std': 0.04053215682506561, 'kl': 0.4931640625, 'epoch': 0.82} 82%|████████▏ | 3495/4286 [22:02:52<4:30:14, 20.50s/it] 82%|████████▏ | 3496/4286 [22:03:13<4:30:24, 20.54s/it] {'loss': 0.0173, 'grad_norm': 2.744758529958222, 'learning_rate': 1.843210452636491e-07, 'completion_length': 193.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7026785910129547, 'rewards/format_reward': 1.0, 'reward': 1.702678620815277, 'reward_std': 0.01920507848262787, 'kl': 0.431640625, 'epoch': 0.82} 82%|████████▏ | 3496/4286 [22:03:13<4:30:24, 20.54s/it][2025-03-03 03:10:50,467] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2025-03-03 03:11:09,486] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. This happens when there is high memory pressure and is detrimental to performance; if this is happening frequently, consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away, consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time.

Training metrics, steps 3497-3585 of 4286 (epoch 0.82-0.84). Columns: s/it = tqdm seconds per iteration, lr = learning_rate, compl_len = completion_length, acc = rewards/only_full_func_accuracy_reward, fmt = rewards/format_reward. Steps marked * triggered the allocator cache-flush warning above (2 flushes after step 3525, 1 after each of the others).

step   s/it    loss    grad_norm  lr          compl_len  acc     fmt     reward  rew_std  kl     epoch
3497*  20.99   0.0109  1.16       1.8409e-07  184.7      0.6935  1.0000  1.6935  0.0864   0.273  0.82
3498   20.40   0.0327  2.64       1.8385e-07  174.0      0.8188  1.0000  1.8188  0.0530   0.816  0.82
3499   20.36   0.0384  64.53      1.8362e-07  199.8      0.6253  1.0000  1.6253  0.0598   0.957  0.82
3500   20.67   0.0234  2.54       1.8339e-07  199.4      0.6920  1.0000  1.6920  0.0327   0.584  0.82
3501   109.19  0.0476  1.94       1.8315e-07  202.8      0.7000  1.0000  1.7000  0.0464   1.188  0.82
3502   82.50   0.0391  5.18       1.8292e-07  189.4      0.6524  0.9821  1.6345  0.1249   0.977  0.82
3503   63.46   0.0286  1.34       1.8269e-07  186.8      0.7679  1.0000  1.7679  0.0476   0.712  0.82
3504   52.16   0.0352  4.83       1.8245e-07  223.6      0.5227  0.9464  1.4692  0.1686   0.881  0.82
3505   42.18   0.0273  1.91       1.8222e-07  197.2      0.6420  1.0000  1.6420  0.0590   0.682  0.82
3506   35.73   0.0381  1.86       1.8199e-07  214.7      0.7038  0.9643  1.6681  0.1570   0.949  0.82
3507   30.97   0.0122  6.77       1.8175e-07  179.6      0.5545  1.0000  1.5545  0.0309   0.306  0.82
3508   28.06   0.0639  7.60       1.8152e-07  199.1      0.5229  0.9821  1.5051  0.1978   1.604  0.82
3509   25.16   0.0555  8.77       1.8129e-07  196.6      0.7024  1.0000  1.7024  0.0673   1.387  0.82
3510   23.56   0.0625  2.50       1.8105e-07  192.4      0.7396  0.9643  1.7039  0.1458   1.563  0.82
3511   22.73   0.0902  22.96      1.8082e-07  196.8      0.5253  1.0000  1.5253  0.1280   2.258  0.82
3512   22.17   0.0870  6.18       1.8059e-07  196.0      0.5451  0.9643  1.5094  0.1591   2.172  0.82
3513   21.13   0.0082  4.66       1.8035e-07  183.6      0.8333  1.0000  1.8333  0.0179   0.204  0.82
3514   20.36   0.0397  17.05      1.8012e-07  180.5      0.7091  1.0000  1.7091  0.0814   0.988  0.82
3515   19.90   0.0459  8.53       1.7989e-07  171.4      0.8139  1.0000  1.8139  0.1176   1.148  0.82
3516   19.82   0.0279  18.82      1.7965e-07  178.9      0.7795  0.9821  1.7616  0.0856   0.699  0.82
3517   20.13   0.0194  7.45       1.7942e-07  195.5      0.5089  1.0000  1.5089  0.0472   0.486  0.82
3518   19.81   0.0270  5.04       1.7919e-07  168.5      0.7113  1.0000  1.7113  0.0675   0.676  0.82
3519   19.34   0.0357  6.77       1.7895e-07  195.3      0.6190  0.9821  1.6012  0.0792   0.892  0.82
3520   20.87   0.0463  6.91       1.7872e-07  198.2      0.7143  0.9643  1.6786  0.1548   1.155  0.82
3521   21.17   0.0151  7.19       1.7849e-07  200.7      0.6741  1.0000  1.6741  0.0327   0.375  0.82
3522   20.71   0.0731  29.05      1.7825e-07  199.7      0.7295  0.9821  1.7116  0.1176   1.824  0.82
3523   21.55   0.0755  16.72      1.7802e-07  221.1      0.6000  0.9643  1.5643  0.1907   1.883  0.82
3524*  20.56   0.0357  8.70       1.7779e-07  162.9      0.7455  1.0000  1.7455  0.0405   0.893  0.82
3525*  21.07   0.0317  2.58       1.7755e-07  198.7      0.6360  1.0000  1.6360  0.0372   0.792  0.82
3526   22.23   0.0978  14.36      1.7732e-07  206.1      0.5193  0.9821  1.5015  0.1347   2.445  0.82
3527   21.16   0.0489  1.17       1.7709e-07  186.3      0.7574  0.9643  1.7217  0.1339   1.224  0.82
3528*  20.30   0.0550  1.63       1.7685e-07  169.9      0.6376  1.0000  1.6376  0.0853   1.378  0.82
3529   20.35   0.0293  4.02       1.7662e-07  198.8      0.6265  0.9821  1.6086  0.1160   0.728  0.82
3530   19.75   0.0099  9.68       1.7639e-07  194.7      0.6429  1.0000  1.6429  0.0892   0.247  0.82
3531   20.44   0.0160  6.20       1.7615e-07  192.5      0.6756  0.9821  1.6577  0.0792   0.402  0.82
3532   20.07   0.0827  8.77       1.7592e-07  174.1      0.6622  0.9464  1.6086  0.2813   2.063  0.82
3533   19.98   0.0074  4.60       1.7569e-07  178.6      0.7024  1.0000  1.7024  0.0238   0.185  0.82
3534   19.91   0.0691  6.64       1.7545e-07  189.7      0.6071  1.0000  1.6071  0.1786   1.732  0.82
3535   19.71   0.0297  5.83       1.7522e-07  192.8      0.4750  1.0000  1.4750  0.0793   0.742  0.82
3536   19.59   0.0301  8.26       1.7499e-07  165.3      0.6533  0.9821  1.6354  0.0923   0.752  0.83
3537   19.12   0.0292  5.39       1.7476e-07  177.7      0.6711  0.9821  1.6533  0.0826   0.728  0.83
3538   18.97   0.0574  12.01      1.7452e-07  193.2      0.6979  1.0000  1.6979  0.0917   1.438  0.83
3539   19.70   0.0701  15.55      1.7429e-07  185.7      0.4905  0.9821  1.4726  0.0996   1.748  0.83
3540   19.28   0.0073  0.87       1.7406e-07  184.4      0.6473  1.0000  1.6473  0.0089   0.183  0.83
3541   19.11   0.0654  17.40      1.7382e-07  187.8      0.5369  0.9643  1.5012  0.1625   1.637  0.83
3542   19.63   0.0590  4.54       1.7359e-07  199.6      0.6966  0.9643  1.6609  0.1395   1.473  0.83
3543   20.17   0.0305  0.41       1.7336e-07  182.7      0.7083  0.9821  1.6905  0.0714   0.765  0.83
3544   20.78   0.0516  7.15       1.7312e-07  196.3      0.5557  0.9821  1.5378  0.1517   1.293  0.83
3545   19.82   0.0253  1.38       1.7289e-07  179.2      0.7054  1.0000  1.7054  0.0536   0.630  0.83
3546   20.22   0.0987  19.56      1.7266e-07  189.8      0.4946  1.0000  1.4946  0.1175   2.473  0.83
3547   19.61   0.0098  1.00       1.7242e-07  181.7      0.5729  1.0000  1.5729  0.0446   0.247  0.83
3548   19.87   0.0251  3.14       1.7219e-07  183.3      0.8080  1.0000  1.8080  0.0734   0.624  0.83
3549   20.58   0.0269  8.53       1.7196e-07  185.4      0.6440  1.0000  1.6440  0.0457   0.674  0.83
3550   19.95   0.0299  2.75       1.7172e-07  184.7      0.5744  1.0000  1.5744  0.0745   0.746  0.83
3551   19.38   0.0097  2.18       1.7149e-07  186.9      0.6369  1.0000  1.6369  0.0563   0.243  0.83
3552   19.39   0.0306  6.18       1.7126e-07  200.5      0.5964  0.9821  1.5786  0.1026   0.767  0.83
3553   19.33   0.0192  4.50       1.7102e-07  198.9      0.7173  1.0000  1.7173  0.0634   0.479  0.83
3554   19.25   0.0279  14.07      1.7079e-07  177.9      0.5500  1.0000  1.5500  0.0994   0.698  0.83
3555   19.09   0.0473  3.37       1.7056e-07  189.9      0.6012  1.0000  1.6012  0.0334   1.181  0.83
3556   18.86   0.0167  4.74       1.7032e-07  179.3      0.8369  1.0000  1.8369  0.0250   0.417  0.83
3557   18.95   0.0370  8.74       1.7009e-07  192.8      0.7292  1.0000  1.7292  0.0911   0.924  0.83
3558   19.12   0.1183  28.16      1.6986e-07  181.5      0.5134  0.9821  1.4955  0.1469   2.953  0.83
3559   19.29   0.0628  11.25      1.6962e-07  187.9      0.6265  0.9821  1.6086  0.0962   1.572  0.83
3560   19.49   0.0702  13.22      1.6939e-07  193.2      0.6198  1.0000  1.6198  0.1431   1.756  0.83
3561   20.03   0.0471  2.46       1.6916e-07  165.6      0.5357  0.9643  1.5000  0.0991   1.180  0.83
3562   19.54   0.0726  5.02       1.6892e-07  175.5      0.7426  1.0000  1.7426  0.0669   1.816  0.83
3563   19.04   0.0088  2.96       1.6869e-07  176.2      0.6664  1.0000  1.6664  0.0567   0.219  0.83
3564   18.88   0.0126  2.75       1.6846e-07  189.5      0.5238  1.0000  1.5238  0.0520   0.315  0.83
3565   19.38   0.0283  2.44       1.6822e-07  172.4      0.6875  1.0000  1.6875  0.0247   0.709  0.83
3566   19.65   0.0423  8.48       1.6799e-07  184.6      0.6071  1.0000  1.6071  0.1221   1.060  0.83
3567   19.43   0.0092  2.29       1.6776e-07  187.4      0.7060  1.0000  1.7060  0.0422   0.229  0.83
3568   19.05   0.0300  1.74       1.6752e-07  180.5      0.6250  1.0000  1.6250  0.0920   0.750  0.83
3569   18.69   0.0344  5.65       1.6729e-07  183.3      0.7188  1.0000  1.7188  0.0268   0.857  0.83
3570   18.83   0.0550  24.40      1.6706e-07  192.8      0.6176  0.9821  1.5997  0.1094   1.375  0.83
3571   18.58   0.0173  1.42       1.6682e-07  178.4      0.5854  1.0000  1.5854  0.0542   0.435  0.83
3572   19.66   0.1364  4.80       1.6659e-07  187.4      0.5699  0.9821  1.5521  0.1369   3.406  0.83
3573   20.19   0.0172  2.05       1.6636e-07  194.0      0.6097  1.0000  1.6097  0.0188   0.428  0.83
3574   20.18   0.0522  3.52       1.6612e-07  182.8      0.6708  0.9464  1.6173  0.1654   1.305  0.83
3575   20.52   0.0537  12.27      1.6589e-07  216.6      0.5863  0.9643  1.5506  0.1534   1.341  0.83
3576   19.72   0.0074  9.34       1.6566e-07  186.6      0.6295  1.0000  1.6295  0.0192   0.185  0.83
3577   19.06   0.0094  0.89       1.6542e-07  162.4      0.7421  1.0000  1.7421  0.0218   0.234  0.83
3578   18.90   0.0116  1.54       1.6519e-07  185.0      0.6568  1.0000  1.6568  0.0684   0.291  0.83
3579   19.58   0.0348  4.20       1.6496e-07  189.3      0.4725  0.9821  1.4546  0.1012   0.868  0.84
3580   18.90   0.0210  4.48       1.6472e-07  151.7      0.6804  0.9821  1.6625  0.0732   0.526  0.84
3581   19.22   0.0543  3.46       1.6449e-07  185.2      0.6438  0.9821  1.6259  0.1308   1.355  0.84
3582   19.55   0.0238  6.52       1.6426e-07  173.9      0.7387  1.0000  1.7387  0.0341   0.594  0.84
3583   19.31   0.0174  1.07       1.6402e-07  186.9      0.8408  1.0000  1.8408  0.0338   0.436  0.84
3584   18.93   0.0486  8.93       1.6379e-07  184.9      0.8214  0.9821  1.8036  0.1754   1.215  0.84

84%|████████▎ | 3585/4286 [22:37:32<3:43:40, 19.14s/it] {'loss': 0.0484, 'grad_norm': 62.224964853549686, 'learning_rate': 1.6355576294913673e-07, 'completion_length': 192.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.7571429014205933, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7392858266830444, 'reward_std': 0.11512117460370064,
'kl': 1.2080078125, 'epoch': 0.84}
84%|████████▎ | 3586/4286 [22:37:50<3:38:57, 18.77s/it] {'loss': 0.007, 'grad_norm': 1.0958145754966333, 'learning_rate': 1.6332244517032198e-07, 'completion_length': 185.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.8303571939468384, 'rewards/format_reward': 1.0, 'reward': 1.8303572535514832, 'reward_std': 0.01785714365541935, 'kl': 0.17431640625, 'epoch': 0.84}
84%|████████▎ | 3587/4286 [22:38:12<3:48:52, 19.65s/it] {'loss': 0.0124, 'grad_norm': 3.924233010392168, 'learning_rate': 1.6308912739150723e-07, 'completion_length': 192.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.6607143580913544, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.0357142873108387, 'kl': 0.3095703125, 'epoch': 0.84}
84%|████████▎ | 3588/4286 [22:38:30<3:43:44, 19.23s/it] {'loss': 0.0075, 'grad_norm': 4.3992555022125925, 'learning_rate': 1.628558096126925e-07, 'completion_length': 182.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7291666865348816, 'rewards/format_reward': 1.0, 'reward': 1.7291668057441711, 'reward_std': 0.04983501508831978, 'kl': 0.1875, 'epoch': 0.84}
84%|████████▎ | 3589/4286 [22:38:50<3:46:53, 19.53s/it] {'loss': 0.0075, 'grad_norm': 16.04371807556762, 'learning_rate': 1.6262249183387775e-07, 'completion_length': 186.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 1.0, 'reward': 1.5937501788139343, 'reward_std': 0.061151803471148014, 'kl': 0.18798828125, 'epoch': 0.84}
84%|████████▍ | 3590/4286 [22:39:11<3:48:23, 19.69s/it] {'loss': 0.0321, 'grad_norm': 3.5190856728808595, 'learning_rate': 1.62389174055063e-07, 'completion_length': 172.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7232142686843872, 'rewards/format_reward': 1.0, 'reward': 1.7232144474983215, 'reward_std': 0.024056263267993927, 'kl': 0.80078125, 'epoch': 0.84}
84%|████████▍ | 3591/4286 [22:39:29<3:43:15, 19.27s/it] {'loss': 0.0458, 'grad_norm': 11.32898170300681, 'learning_rate': 1.6215585627624825e-07, 'completion_length': 181.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.6651785969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.629464328289032, 'reward_std': 0.12792384065687656, 'kl': 1.14990234375, 'epoch': 0.84}
84%|████████▍ | 3592/4286 [22:39:48<3:41:42, 19.17s/it] {'loss': 0.043, 'grad_norm': 2.309893323906639, 'learning_rate': 1.619225384974335e-07, 'completion_length': 170.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7383929193019867, 'rewards/format_reward': 1.0, 'reward': 1.7383930087089539, 'reward_std': 0.11845237948000431, 'kl': 1.07421875, 'epoch': 0.84}
84%|████████▍ | 3593/4286 [22:40:06<3:36:53, 18.78s/it] {'loss': 0.0652, 'grad_norm': 5.098869475119053, 'learning_rate': 1.6168922071861877e-07, 'completion_length': 167.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.7065476477146149, 'rewards/format_reward': 1.0, 'reward': 1.7065476775169373, 'reward_std': 0.1019965298473835, 'kl': 1.630859375, 'epoch': 0.84}
84%|████████▍ | 3594/4286 [22:40:24<3:33:54, 18.55s/it] {'loss': 0.0218, 'grad_norm': 9.919873301803456, 'learning_rate': 1.6145590293980402e-07, 'completion_length': 165.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6458334028720856, 'rewards/format_reward': 1.0, 'reward': 1.6458334922790527, 'reward_std': 0.034119348507374525, 'kl': 0.5458984375, 'epoch': 0.84}
84%|████████▍ | 3595/4286 [22:40:42<3:34:30, 18.63s/it] {'loss': 0.0277, 'grad_norm': 8.836801854487232, 'learning_rate': 1.6122258516098927e-07, 'completion_length': 188.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.7752976715564728, 'rewards/format_reward': 1.0, 'reward': 1.77529776096344, 'reward_std': 0.08009707182645798, 'kl': 0.69384765625, 'epoch': 0.84}
84%|████████▍ | 3596/4286 [22:41:01<3:34:19, 18.64s/it] {'loss': 0.0078, 'grad_norm': 0.9158067764118676, 'learning_rate': 1.6098926738217452e-07, 'completion_length': 184.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.680059552192688, 'rewards/format_reward': 1.0, 'reward': 1.6800596714019775, 'reward_std': 0.008928571827709675, 'kl': 0.19482421875, 'epoch': 0.84}
84%|████████▍ | 3597/4286 [22:41:20<3:34:47, 18.70s/it] {'loss': 0.0757, 'grad_norm': 2.1123788922584352, 'learning_rate': 1.607559496033598e-07, 'completion_length': 193.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.7008928954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.665178656578064, 'reward_std': 0.1820865347981453, 'kl': 1.88671875, 'epoch': 0.84}
84%|████████▍ | 3598/4286 [22:41:42<3:45:25, 19.66s/it] {'loss': 0.0474, 'grad_norm': 28.27972038661357, 'learning_rate': 1.6052263182454504e-07, 'completion_length': 190.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.417628213763237, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3819139003753662, 'reward_std': 0.14440375799313188, 'kl': 1.181640625, 'epoch': 0.84}
84%|████████▍ | 3599/4286 [22:42:01<3:42:13, 19.41s/it] {'loss': 0.0395, 'grad_norm': 5.509896402018508, 'learning_rate': 1.602893140457303e-07, 'completion_length': 188.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.700297623872757, 'rewards/format_reward': 1.0, 'reward': 1.7002976536750793, 'reward_std': 0.060244192369282246, 'kl': 0.984375, 'epoch': 0.84}
84%|████████▍ | 3600/4286 [22:42:20<3:40:57, 19.33s/it] {'loss': 0.0279, 'grad_norm': 3.2813966809219974, 'learning_rate': 1.6005599626691554e-07, 'completion_length': 195.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6074405610561371, 'rewards/format_reward': 1.0, 'reward': 1.6074405908584595, 'reward_std': 0.0469798156991601, 'kl': 0.69970703125, 'epoch': 0.84}
84%|████████▍ | 3601/4286 [22:45:35<13:44:24, 72.21s/it] {'loss': 0.0489, 'grad_norm': 438.68994510975233, 'learning_rate': 1.598226784881008e-07, 'completion_length': 200.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.702381044626236, 'rewards/format_reward': 1.0, 'reward': 1.7023810744285583, 'reward_std': 0.02380952797830105, 'kl': 1.2255859375, 'epoch': 0.84}
84%|████████▍ | 3602/4286 [22:45:55<10:43:40, 56.46s/it] {'loss': 0.0699, 'grad_norm': 2.701800468208635, 'learning_rate': 1.5958936070928606e-07, 'completion_length': 192.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5773809850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5595239400863647, 'reward_std': 0.18477055430412292, 'kl': 1.74609375, 'epoch': 0.84}
[2025-03-03 03:53:30,170] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
84%|████████▍ | 3603/4286 [22:46:14<8:35:21, 45.27s/it] {'loss': 0.0182, 'grad_norm': 5.935053671863181, 'learning_rate': 1.593560429304713e-07, 'completion_length': 171.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6880952417850494, 'rewards/format_reward': 1.0, 'reward': 1.6880953907966614, 'reward_std': 0.026190475560724735, 'kl': 0.45654296875, 'epoch': 0.84}
84%|████████▍ | 3604/4286 [22:46:33<7:02:47, 37.20s/it] {'loss': 0.0269, 'grad_norm': 7.321536234381193, 'learning_rate': 1.5912272515165656e-07, 'completion_length': 172.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.742559552192688, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.06250000465661287, 'kl': 0.67138671875, 'epoch': 0.84}
84%|████████▍ | 3605/4286 [22:46:53<6:06:11, 32.26s/it] {'loss': 0.0322, 'grad_norm': 1.2261288480964694, 'learning_rate': 1.588894073728418e-07, 'completion_length': 193.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.7202381491661072, 'rewards/format_reward': 1.0, 'reward': 1.720238208770752, 'reward_std': 0.04166667116805911, 'kl': 0.80517578125, 'epoch': 0.84}
84%|████████▍ | 3606/4286 [22:47:13<5:20:58, 28.32s/it] {'loss': 0.0531, 'grad_norm': 5.23138095431636, 'learning_rate': 1.5865608959402706e-07, 'completion_length': 190.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.7946429252624512, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7589287757873535, 'reward_std': 0.16418456006795168, 'kl': 1.32763671875, 'epoch': 0.84}
84%|████████▍ | 3607/4286 [22:47:31<4:48:33, 25.50s/it]
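The allocator warning above recommends calling `get_accelerator().empty_cache()` at the same point in the training loop on every rank. A minimal sketch of that pattern follows; `FLUSH_EVERY` and `maybe_flush_cache` are illustrative names, not part of this run's code, and the stub fallback exists only so the snippet runs without DeepSpeed installed.

```python
# Hedged sketch: synchronized allocator-cache flushes, as the DeepSpeed
# stage3 warning suggests. `get_accelerator` is DeepSpeed's real accessor;
# everything else here is an assumed helper for illustration.
try:
    from deepspeed.accelerator import get_accelerator  # real DeepSpeed API
except Exception:  # fallback stub so the sketch runs without DeepSpeed/GPU
    class _StubAccelerator:
        def empty_cache(self):
            pass  # no-op stand-in for torch.cuda.empty_cache()

    def get_accelerator():
        return _StubAccelerator()

FLUSH_EVERY = 50  # hypothetical interval; tune to observed memory pressure


def maybe_flush_cache(step: int, every: int = FLUSH_EVERY) -> bool:
    """Flush the accelerator allocator cache every `every` steps.

    Calling this at the same place in the loop on all ranks keeps the
    flushes aligned across ranks, which is what the warning asks for.
    Returns True when a flush happened.
    """
    if step > 0 and step % every == 0:
        get_accelerator().empty_cache()
        return True
    return False
```

Flushing costs time and forces the allocator to re-grow its cache, so a fixed interval is a trade-off: frequent enough to relieve memory pressure, rare enough not to dominate step time.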
{'loss': 0.0336, 'grad_norm': 1.012700281436174, 'learning_rate': 1.5842277181521233e-07, 'completion_length': 193.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5625000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5625000596046448, 'reward_std': 0.08667591959238052, 'kl': 0.8408203125, 'epoch': 0.84}
84%|████████▍ | 3608/4286 [22:47:51<4:26:23, 23.57s/it] {'loss': 0.0241, 'grad_norm': 1.093953701520814, 'learning_rate': 1.5818945403639758e-07, 'completion_length': 192.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6848214566707611, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6669644713401794, 'reward_std': 0.08751039579510689, 'kl': 0.60107421875, 'epoch': 0.84}
84%|████████▍ | 3609/4286 [22:48:10<4:12:14, 22.35s/it] {'loss': 0.0636, 'grad_norm': 7.554599890549626, 'learning_rate': 1.5795613625758283e-07, 'completion_length': 196.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.595238134264946, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5773810744285583, 'reward_std': 0.06212059408426285, 'kl': 1.587890625, 'epoch': 0.84}
84%|████████▍ | 3610/4286 [22:48:28<3:56:34, 21.00s/it] {'loss': 0.0115, 'grad_norm': 0.6900184407615958, 'learning_rate': 1.5772281847876808e-07, 'completion_length': 182.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.866071492433548, 'rewards/format_reward': 1.0, 'reward': 1.8660715222358704, 'reward_std': 0.06664376705884933, 'kl': 0.28857421875, 'epoch': 0.84}
84%|████████▍ | 3611/4286 [22:48:49<3:56:03, 20.98s/it] {'loss': 0.094, 'grad_norm': 2.5489027886695466, 'learning_rate': 1.5748950069995335e-07, 'completion_length': 189.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5916666984558105, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.573809564113617, 'reward_std': 0.1337794065475464, 'kl': 2.3515625, 'epoch': 0.84}
84%|████████▍ | 3612/4286 [22:49:08<3:48:26, 20.34s/it] {'loss': 0.0304, 'grad_norm': 8.159380737795082, 'learning_rate': 1.572561829211386e-07, 'completion_length': 193.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5565476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5386905670166016, 'reward_std': 0.06353989243507385, 'kl': 0.75830078125, 'epoch': 0.84}
84%|████████▍ | 3613/4286 [22:49:27<3:46:22, 20.18s/it] {'loss': 0.0411, 'grad_norm': 4.441346080049437, 'learning_rate': 1.5702286514232385e-07, 'completion_length': 184.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.5967262089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.578869104385376, 'reward_std': 0.11036578938364983, 'kl': 1.02880859375, 'epoch': 0.84}
84%|████████▍ | 3614/4286 [22:49:46<3:41:29, 19.78s/it] {'loss': 0.0137, 'grad_norm': 5.79135405619429, 'learning_rate': 1.567895473635091e-07, 'completion_length': 194.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7428571581840515, 'rewards/format_reward': 1.0, 'reward': 1.742857277393341, 'reward_std': 0.04404762480407953, 'kl': 0.34326171875, 'epoch': 0.84}
84%|████████▍ | 3615/4286 [22:50:07<3:43:02, 19.94s/it] {'loss': 0.0163, 'grad_norm': 0.6273236593819475, 'learning_rate': 1.5655622958469435e-07, 'completion_length': 182.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.020619653165340424, 'kl': 0.40869140625, 'epoch': 0.84}
84%|████████▍ | 3616/4286 [22:50:27<3:42:40, 19.94s/it] {'loss': 0.0286, 'grad_norm': 4.350978366228716, 'learning_rate': 1.5632291180587962e-07, 'completion_length': 193.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.5806548297405243, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.562797725200653, 'reward_std': 0.08355020079761744, 'kl': 0.71435546875, 'epoch': 0.84}
84%|████████▍ | 3617/4286 [22:50:45<3:38:05, 19.56s/it] {'loss': 0.0377, 'grad_norm': 4.345454564047846, 'learning_rate': 1.5608959402706485e-07, 'completion_length': 183.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.523809552192688, 'rewards/format_reward': 1.0, 'reward': 1.5238096117973328, 'reward_std': 0.06823870167136192, 'kl': 0.943359375, 'epoch': 0.84}
84%|████████▍ | 3618/4286 [22:51:06<3:41:20, 19.88s/it] {'loss': 0.0184, 'grad_norm': 18.198643160721588, 'learning_rate': 1.558562762482501e-07, 'completion_length': 211.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6767857670783997, 'rewards/format_reward': 1.0, 'reward': 1.6767857670783997, 'reward_std': 0.07654710486531258, 'kl': 0.45703125, 'epoch': 0.84}
84%|████████▍ | 3619/4286 [22:51:24<3:35:13, 19.36s/it] {'loss': 0.0288, 'grad_norm': 3.9592858501503407, 'learning_rate': 1.5562295846943534e-07, 'completion_length': 180.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7842262387275696, 'rewards/format_reward': 1.0, 'reward': 1.7842262983322144, 'reward_std': 0.08199355937540531, 'kl': 0.7177734375, 'epoch': 0.84}
84%|████████▍ | 3620/4286 [22:51:44<3:37:48, 19.62s/it] {'loss': 0.0158, 'grad_norm': 2.542628015048059, 'learning_rate': 1.553896406906206e-07, 'completion_length': 196.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7110119163990021, 'rewards/format_reward': 1.0, 'reward': 1.7110119462013245, 'reward_std': 0.03876052796840668, 'kl': 0.39453125, 'epoch': 0.84}
84%|████████▍ | 3621/4286 [22:52:06<3:43:26, 20.16s/it] {'loss': 0.0298, 'grad_norm': 2.6728867999492554, 'learning_rate': 1.5515632291180587e-07, 'completion_length': 184.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.6370400786399841, 'rewards/format_reward': 1.0, 'reward': 1.637040138244629, 'reward_std': 0.05605795420706272, 'kl': 0.7451171875, 'epoch': 0.84}
85%|████████▍ | 3622/4286 [22:52:24<3:36:22, 19.55s/it] {'loss': 0.0291, 'grad_norm': 7.628070456538402, 'learning_rate': 1.5492300513299112e-07, 'completion_length': 194.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 1.0, 'reward': 1.7574406862258911, 'reward_std': 0.08728558383882046, 'kl': 0.72900390625, 'epoch': 0.85}
85%|████████▍ | 3623/4286 [22:52:42<3:31:59, 19.19s/it] {'loss': 0.0214, 'grad_norm': 5.163810679111133, 'learning_rate': 1.5468968735417636e-07, 'completion_length': 176.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.717857152223587, 'rewards/format_reward': 1.0, 'reward': 1.7178572416305542, 'reward_std': 0.02408570423722267, 'kl': 0.533203125, 'epoch': 0.85}
85%|████████▍ | 3624/4286 [22:53:00<3:27:39, 18.82s/it] {'loss': 0.0069, 'grad_norm': 1.0846892804535906, 'learning_rate': 1.5445636957536161e-07, 'completion_length': 164.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7544642984867096, 'rewards/format_reward': 1.0, 'reward': 1.7544643878936768, 'reward_std': 0.0267857164144516, 'kl': 0.17236328125, 'epoch': 0.85}
85%|████████▍ | 3625/4286 [22:53:18<3:25:55, 18.69s/it] {'loss': 0.0109, 'grad_norm': 1.8346228916249188,
'learning_rate': 1.542230517965469e-07, 'completion_length': 190.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.7166667282581329, 'rewards/format_reward': 1.0, 'reward': 1.7166667580604553, 'reward_std': 0.030952386558055878, 'kl': 0.27294921875, 'epoch': 0.85}
85%|████████▍ | 3626/4286 [22:53:37<3:25:11, 18.65s/it] {'loss': 0.0415, 'grad_norm': 4.293730273473241, 'learning_rate': 1.5398973401773214e-07, 'completion_length': 182.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7336309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7336310744285583, 'reward_std': 0.07213573902845383, 'kl': 1.03515625, 'epoch': 0.85}
85%|████████▍ | 3627/4286 [22:53:57<3:29:03, 19.03s/it] {'loss': 0.0691, 'grad_norm': 1.4193334421139023, 'learning_rate': 1.5375641623891739e-07, 'completion_length': 172.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0888583529740572, 'kl': 1.7265625, 'epoch': 0.85}
85%|████████▍ | 3628/4286 [22:54:16<3:28:22, 19.00s/it] {'loss': 0.0077, 'grad_norm': 1.937082230985162, 'learning_rate': 1.5352309846010263e-07, 'completion_length': 193.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6482143402099609, 'rewards/format_reward': 1.0, 'reward': 1.648214340209961, 'reward_std': 0.04340266715735197, 'kl': 0.19384765625, 'epoch': 0.85}
85%|████████▍ | 3629/4286 [22:54:35<3:26:56, 18.90s/it] {'loss': 0.0416, 'grad_norm': 7.912725148253593, 'learning_rate': 1.5328978068128788e-07, 'completion_length': 170.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.609226256608963, 'rewards/format_reward': 1.0, 'reward': 1.6092262864112854, 'reward_std': 0.05769624840468168, 'kl': 1.041015625, 'epoch': 0.85}
85%|████████▍ | 3630/4286 [22:54:57<3:38:41, 20.00s/it] {'loss': 0.0461, 'grad_norm': 10.792058514485076, 'learning_rate': 1.5305646290247316e-07, 'completion_length': 200.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.4494047909975052, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4315477013587952, 'reward_std': 0.1250000074505806, 'kl': 1.150390625, 'epoch': 0.85}
85%|████████▍ | 3631/4286 [22:55:18<3:40:33, 20.20s/it] {'loss': 0.0262, 'grad_norm': 25.72727597096587, 'learning_rate': 1.528231451236584e-07, 'completion_length': 198.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6195115745067596, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6016545295715332, 'reward_std': 0.10306521505117416, 'kl': 0.6552734375, 'epoch': 0.85}
85%|████████▍ | 3632/4286 [22:55:37<3:37:48, 19.98s/it] {'loss': 0.0241, 'grad_norm': 4.624564737269642, 'learning_rate': 1.5258982734484366e-07, 'completion_length': 178.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.566964328289032, 'rewards/format_reward': 1.0, 'reward': 1.5669644474983215, 'reward_std': 0.044642859138548374, 'kl': 0.6005859375, 'epoch': 0.85}
85%|████████▍ | 3633/4286 [22:55:56<3:34:02, 19.67s/it] {'loss': 0.0259, 'grad_norm': 2.431320474098631, 'learning_rate': 1.523565095660289e-07, 'completion_length': 183.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.7916666865348816, 'rewards/format_reward': 1.0, 'reward': 1.7916668057441711, 'reward_std': 0.04925545770674944, 'kl': 0.6494140625, 'epoch': 0.85}
85%|████████▍ | 3634/4286 [22:56:14<3:28:17, 19.17s/it] {'loss': 0.0074, 'grad_norm': 7.521065311240578, 'learning_rate': 1.5212319178721418e-07, 'completion_length': 165.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.7306548357009888, 'rewards/format_reward': 1.0, 'reward': 1.7306548357009888, 'reward_std': 0.0446428544819355, 'kl': 0.1845703125, 'epoch': 0.85}
85%|████████▍ | 3635/4286 [22:56:35<3:34:49, 19.80s/it] {'loss': 0.0143, 'grad_norm': 1.6922475767931848, 'learning_rate': 1.5188987400839943e-07, 'completion_length': 189.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6196429133415222, 'rewards/format_reward': 1.0, 'reward': 1.6196429133415222, 'reward_std': 0.02261904627084732, 'kl': 0.3564453125, 'epoch': 0.85}
85%|████████▍ | 3636/4286 [22:56:56<3:36:22, 19.97s/it] {'loss': 0.03, 'grad_norm': 2.400507131158313, 'learning_rate': 1.5165655622958468e-07, 'completion_length': 189.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7800596058368683, 'rewards/format_reward': 1.0, 'reward': 1.7800596356391907, 'reward_std': 0.06823740154504776, 'kl': 0.751953125, 'epoch': 0.85}
85%|████████▍ | 3637/4286 [22:57:17<3:38:42, 20.22s/it] {'loss': 0.0564, 'grad_norm': 2.015019727636749, 'learning_rate': 1.5142323845076993e-07, 'completion_length': 192.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6038690805435181, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.586012065410614, 'reward_std': 0.09952829033136368, 'kl': 1.41015625, 'epoch': 0.85}
85%|████████▍ | 3638/4286 [22:57:35<3:31:31, 19.59s/it] {'loss': 0.0731, 'grad_norm': 3.776769798604738, 'learning_rate': 1.5118992067195517e-07, 'completion_length': 181.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.45744049549102783, 'rewards/format_reward': 1.0, 'reward': 1.4574405550956726, 'reward_std': 0.07489877752959728, 'kl': 1.83203125, 'epoch': 0.85}
85%|████████▍ | 3639/4286 [22:57:57<3:39:33, 20.36s/it] {'loss': 0.0259, 'grad_norm': 7.22045123695456, 'learning_rate': 1.5095660289314045e-07, 'completion_length': 203.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7198129594326019, 'rewards/format_reward': 1.0, 'reward': 1.7198131084442139, 'reward_std': 0.09466557949781418, 'kl': 0.646484375, 'epoch': 0.85}
85%|████████▍ | 3640/4286 [22:58:18<3:40:48, 20.51s/it] {'loss': 0.0259, 'grad_norm': 1.6247921618684642, 'learning_rate': 1.507232851143257e-07, 'completion_length': 188.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7258929014205933, 'rewards/format_reward': 1.0, 'reward': 1.725892961025238, 'reward_std': 0.034238445572555065, 'kl': 0.6484375, 'epoch': 0.85}
85%|████████▍ | 3641/4286 [22:58:36<3:31:39, 19.69s/it] {'loss': 0.0082, 'grad_norm': 6.620701464898905, 'learning_rate': 1.5048996733551095e-07, 'completion_length': 174.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.011904759332537651, 'kl': 0.2060546875, 'epoch': 0.85}
85%|████████▍ | 3642/4286 [22:58:56<3:34:06, 19.95s/it] {'loss': 0.0281, 'grad_norm': 7.894085600904354, 'learning_rate': 1.502566495566962e-07, 'completion_length': 188.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5848215818405151, 'reward_std': 0.044642859138548374, 'kl': 0.70166015625, 'epoch': 0.85}
85%|████████▍ | 3643/4286 [22:59:17<3:37:06, 20.26s/it] {'loss': 0.0296, 'grad_norm': 34.822927160508186, 'learning_rate': 1.5002333177788144e-07, 'completion_length': 197.9821548461914,
'rewards/only_full_func_accuracy_reward': 0.7172618806362152, 'rewards/format_reward': 1.0, 'reward': 1.7172620296478271, 'reward_std': 0.06473047845065594, 'kl': 0.7412109375, 'epoch': 0.85} 85%|████████▍ | 3643/4286 [22:59:17<3:37:06, 20.26s/it] 85%|████████▌ | 3644/4286 [22:59:35<3:28:06, 19.45s/it] {'loss': 0.02, 'grad_norm': 1.5042296881508395, 'learning_rate': 1.4979001399906672e-07, 'completion_length': 169.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.9151785969734192, 'rewards/format_reward': 1.0, 'reward': 1.9151787757873535, 'reward_std': 0.020833336748182774, 'kl': 0.50146484375, 'epoch': 0.85} 85%|████████▌ | 3644/4286 [22:59:35<3:28:06, 19.45s/it] 85%|████████▌ | 3645/4286 [22:59:57<3:36:53, 20.30s/it] {'loss': 0.0646, 'grad_norm': 2.685667758803198, 'learning_rate': 1.4955669622025197e-07, 'completion_length': 199.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6369048357009888, 'reward_std': 0.17980830371379852, 'kl': 1.61328125, 'epoch': 0.85} 85%|████████▌ | 3645/4286 [22:59:57<3:36:53, 20.30s/it] 85%|████████▌ | 3646/4286 [23:00:17<3:34:19, 20.09s/it] {'loss': 0.0101, 'grad_norm': 6.745878303159059, 'learning_rate': 1.4932337844143722e-07, 'completion_length': 174.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6830357313156128, 'rewards/format_reward': 1.0, 'reward': 1.6830357909202576, 'reward_std': 0.019238397479057312, 'kl': 0.2529296875, 'epoch': 0.85} 85%|████████▌ | 3646/4286 [23:00:17<3:34:19, 20.09s/it] 85%|████████▌ | 3647/4286 [23:00:35<3:28:30, 19.58s/it] {'loss': 0.0173, 'grad_norm': 5.8442740716067645, 'learning_rate': 1.4909006066262247e-07, 'completion_length': 177.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.6309524476528168, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.023809521459043026, 'kl': 0.43359375, 'epoch': 0.85} 85%|████████▌ | 3647/4286 [23:00:35<3:28:30, 19.58s/it] 
85%|████████▌ | 3648/4286 [23:00:57<3:37:12, 20.43s/it] {'loss': 0.0214, 'grad_norm': 2.90365771372125, 'learning_rate': 1.4885674288380774e-07, 'completion_length': 188.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.657738208770752, 'reward_std': 0.10530742816627026, 'kl': 0.53515625, 'epoch': 0.85}
85%|████████▌ | 3649/4286 [23:01:17<3:34:33, 20.21s/it] {'loss': 0.0357, 'grad_norm': 3.3916058058719667, 'learning_rate': 1.48623425104993e-07, 'completion_length': 187.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 1.0, 'reward': 1.610119104385376, 'reward_std': 0.05501317232847214, 'kl': 0.89208984375, 'epoch': 0.85}
85%|████████▌ | 3650/4286 [23:01:36<3:31:00, 19.91s/it] {'loss': 0.0274, 'grad_norm': 4.366404869933353, 'learning_rate': 1.4839010732617824e-07, 'completion_length': 190.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.724702388048172, 'rewards/format_reward': 1.0, 'reward': 1.724702537059784, 'reward_std': 0.06391431391239166, 'kl': 0.6875, 'epoch': 0.85}
85%|████████▌ | 3651/4286 [23:01:56<3:30:32, 19.89s/it] {'loss': 0.0082, 'grad_norm': 9.160747805390788, 'learning_rate': 1.481567895473635e-07, 'completion_length': 199.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.764881044626236, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.035714288242161274, 'kl': 0.20654296875, 'epoch': 0.85}
85%|████████▌ | 3652/4286 [23:02:15<3:27:54, 19.68s/it] {'loss': 0.0233, 'grad_norm': 9.785796700891868, 'learning_rate': 1.4792347176854874e-07, 'completion_length': 183.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.595833420753479, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5779762864112854, 'reward_std': 0.12023810669779778, 'kl': 0.5810546875, 'epoch': 0.85}
85%|████████▌ | 3653/4286 [23:02:34<3:24:06, 19.35s/it] {'loss': 0.0155, 'grad_norm': 2.178183088740904, 'learning_rate': 1.47690153989734e-07, 'completion_length': 202.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6056548357009888, 'rewards/format_reward': 1.0, 'reward': 1.6056549549102783, 'reward_std': 0.08219881728291512, 'kl': 0.38623046875, 'epoch': 0.85}
85%|████████▌ | 3654/4286 [23:02:53<3:24:08, 19.38s/it] {'loss': 0.0217, 'grad_norm': 3.060760916713776, 'learning_rate': 1.4745683621091926e-07, 'completion_length': 197.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.6339286267757416, 'rewards/format_reward': 1.0, 'reward': 1.633928656578064, 'reward_std': 0.05792888253927231, 'kl': 0.5419921875, 'epoch': 0.85}
85%|████████▌ | 3655/4286 [23:03:14<3:27:49, 19.76s/it] {'loss': 0.009, 'grad_norm': 2.4326685346490975, 'learning_rate': 1.472235184321045e-07, 'completion_length': 194.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.6130952537059784, 'rewards/format_reward': 1.0, 'reward': 1.6130953431129456, 'reward_std': 0.06388125568628311, 'kl': 0.2255859375, 'epoch': 0.85}
85%|████████▌ | 3656/4286 [23:03:34<3:28:09, 19.82s/it] {'loss': 0.0297, 'grad_norm': 5.659251669599736, 'learning_rate': 1.4699020065328976e-07, 'completion_length': 189.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.7019983232021332, 'rewards/format_reward': 1.0, 'reward': 1.7019984722137451, 'reward_std': 0.06352294981479645, 'kl': 0.74267578125, 'epoch': 0.85}
85%|████████▌ | 3657/4286 [23:03:56<3:36:19, 20.64s/it]
{'loss': 0.0164, 'grad_norm': 2.131427261164405, 'learning_rate': 1.46756882874475e-07, 'completion_length': 197.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.669642984867096, 'reward_std': 0.05038155987858772, 'kl': 0.4111328125, 'epoch': 0.85}
85%|████████▌ | 3658/4286 [23:04:20<3:45:28, 21.54s/it] {'loss': 0.0415, 'grad_norm': 10.05231206179916, 'learning_rate': 1.4652356509566028e-07, 'completion_length': 212.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6930697858333588, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6752126812934875, 'reward_std': 0.17333360016345978, 'kl': 1.0390625, 'epoch': 0.85}
85%|████████▌ | 3659/4286 [23:04:44<3:53:39, 22.36s/it] {'loss': 0.0714, 'grad_norm': 3.2615228996211094, 'learning_rate': 1.4629024731684553e-07, 'completion_length': 208.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6214286088943481, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6035714745521545, 'reward_std': 0.1511385478079319, 'kl': 1.7890625, 'epoch': 0.85}
85%|████████▌ | 3660/4286 [23:05:03<3:40:46, 21.16s/it] {'loss': 0.0072, 'grad_norm': 5.497489681444469, 'learning_rate': 1.4605692953803078e-07, 'completion_length': 193.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.05531906709074974, 'kl': 0.1806640625, 'epoch': 0.85}
85%|████████▌ | 3661/4286 [23:05:23<3:36:36, 20.79s/it] {'loss': 0.0122, 'grad_norm': 0.7073891831967907, 'learning_rate': 1.4582361175921603e-07, 'completion_length': 195.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5923469960689545, 'rewards/format_reward': 1.0, 'reward': 1.5923470854759216, 'reward_std': 0.03632080461829901, 'kl': 0.30615234375, 'epoch': 0.85}
85%|████████▌ | 3662/4286 [23:05:45<3:42:11, 21.36s/it] {'loss': 0.0472, 'grad_norm': 24.71583459744043, 'learning_rate': 1.455902939804013e-07, 'completion_length': 195.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.572916716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.537202537059784, 'reward_std': 0.1677335724234581, 'kl': 1.17578125, 'epoch': 0.85}
85%|████████▌ | 3663/4286 [23:06:04<3:34:11, 20.63s/it] {'loss': 0.0119, 'grad_norm': 3.93862688490394, 'learning_rate': 1.4535697620158655e-07, 'completion_length': 178.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.02976190857589245, 'kl': 0.2978515625, 'epoch': 0.85}
85%|████████▌ | 3664/4286 [23:06:22<3:25:59, 19.87s/it] {'loss': 0.0086, 'grad_norm': 2.1971390407715687, 'learning_rate': 1.451236584227718e-07, 'completion_length': 173.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.7901786267757416, 'rewards/format_reward': 1.0, 'reward': 1.7901787161827087, 'reward_std': 0.02417590841650963, 'kl': 0.2158203125, 'epoch': 0.85}
86%|████████▌ | 3665/4286 [23:06:42<3:23:23, 19.65s/it] {'loss': 0.0091, 'grad_norm': 2.5036969055353353, 'learning_rate': 1.4489034064395705e-07, 'completion_length': 203.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904763579368591, 'reward_std': 0.026485057547688484, 'kl': 0.22802734375, 'epoch': 0.86}
86%|████████▌ | 3666/4286 [23:07:01<3:21:02, 19.46s/it] {'loss': 0.0444, 'grad_norm': 44.170690404111106, 'learning_rate': 1.446570228651423e-07, 'completion_length': 186.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.4627976566553116, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4449405670166016, 'reward_std': 0.06593661103397608, 'kl': 1.11328125, 'epoch': 0.86}
86%|████████▌ | 3667/4286 [23:07:19<3:18:57, 19.29s/it] {'loss': 0.0123, 'grad_norm': 3.8386526499356077, 'learning_rate': 1.4442370508632757e-07, 'completion_length': 198.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.6379677057266235, 'rewards/format_reward': 1.0, 'reward': 1.6379677653312683, 'reward_std': 0.044354356825351715, 'kl': 0.30859375, 'epoch': 0.86}
86%|████████▌ | 3668/4286 [23:07:38<3:15:04, 18.94s/it] {'loss': 0.0368, 'grad_norm': 3.893284762667728, 'learning_rate': 1.4419038730751282e-07, 'completion_length': 169.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6666668057441711, 'reward_std': 0.12347954884171486, 'kl': 0.921875, 'epoch': 0.86}
86%|████████▌ | 3669/4286 [23:07:57<3:15:38, 19.03s/it] {'loss': 0.035, 'grad_norm': 8.365739780967642, 'learning_rate': 1.4395706952869807e-07, 'completion_length': 185.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.5377976447343826, 'rewards/format_reward': 1.0, 'reward': 1.5377976894378662, 'reward_std': 0.04345238232053816, 'kl': 0.873046875, 'epoch': 0.86}
86%|████████▌ | 3670/4286 [23:08:16<3:15:28, 19.04s/it] {'loss': 0.0389, 'grad_norm': 1.7711951938929038, 'learning_rate': 1.4372375174988332e-07, 'completion_length': 191.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.6979167461395264, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.058765457943081856, 'kl': 0.97216796875, 'epoch': 0.86}
86%|████████▌ | 3671/4286 [23:08:37<3:20:49, 19.59s/it] {'loss': 0.045, 'grad_norm': 4.090790757892972, 'learning_rate': 1.434904339710686e-07, 'completion_length': 189.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.5312500149011612, 'rewards/format_reward': 1.0, 'reward': 1.5312501192092896, 'reward_std': 0.11659271642565727, 'kl': 1.125, 'epoch': 0.86}
86%|████████▌ | 3672/4286 [23:08:58<3:24:45, 20.01s/it] {'loss': 0.038, 'grad_norm': 45.53283860031231, 'learning_rate': 1.4325711619225384e-07, 'completion_length': 195.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.6532738506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6354168057441711, 'reward_std': 0.15843574702739716, 'kl': 0.951171875, 'epoch': 0.86}
86%|████████▌ | 3673/4286 [23:09:18<3:24:23, 20.01s/it] {'loss': 0.0348, 'grad_norm': 36.58964950840717, 'learning_rate': 1.430237984134391e-07, 'completion_length': 197.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6327381432056427, 'rewards/format_reward': 1.0, 'reward': 1.6327382326126099, 'reward_std': 0.09442061930894852, 'kl': 0.873046875, 'epoch': 0.86}
86%|████████▌ | 3674/4286 [23:09:39<3:27:03, 20.30s/it] {'loss': 0.0171, 'grad_norm': 10.103287237532859, 'learning_rate': 1.4279048063462434e-07, 'completion_length': 197.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.6776786148548126, 'rewards/format_reward': 1.0, 'reward': 1.677678644657135, 'reward_std': 0.0621761716902256, 'kl': 0.427734375, 'epoch': 0.86}
86%|████████▌ | 3675/4286 [23:10:01<3:32:43, 20.89s/it] {'loss': 0.0566, 'grad_norm': 3.206995856358041, 'learning_rate': 1.425571628558096e-07, 'completion_length': 201.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.583333432674408, 'reward_std': 0.10182805359363556, 'kl': 1.41796875, 'epoch': 0.86}
86%|████████▌ | 3676/4286 [23:10:20<3:26:53, 20.35s/it] {'loss': 0.0451, 'grad_norm': 4.939137120592649, 'learning_rate': 1.4232384507699486e-07, 'completion_length': 197.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.6008929014205933, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5830358862876892, 'reward_std': 0.1395740918815136, 'kl': 1.130859375, 'epoch': 0.86}
86%|████████▌ | 3677/4286 [23:10:40<3:24:47, 20.18s/it] {'loss': 0.0413, 'grad_norm': 8.449791471321927, 'learning_rate': 1.420905272981801e-07, 'completion_length': 185.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5776786208152771, 'rewards/format_reward': 1.0, 'reward': 1.5776787400245667, 'reward_std': 0.09694609045982361, 'kl': 1.029296875, 'epoch': 0.86}
86%|████████▌ | 3678/4286 [23:10:59<3:22:02, 19.94s/it] {'loss': 0.0209, 'grad_norm': 3.2020374204915516, 'learning_rate': 1.4185720951936536e-07, 'completion_length': 189.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6458333730697632, 'rewards/format_reward': 1.0, 'reward': 1.6458334922790527, 'reward_std': 0.034119345247745514, 'kl': 0.52294921875, 'epoch': 0.86}
86%|████████▌ | 3679/4286 [23:11:19<3:20:31, 19.82s/it] {'loss': 0.0656, 'grad_norm': 4.530148990423179, 'learning_rate': 1.416238917405506e-07, 'completion_length': 181.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.580527275800705, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5448129177093506, 'reward_std': 0.14384952187538147, 'kl': 1.640625, 'epoch': 0.86}
86%|████████▌ | 3680/4286 [23:11:38<3:17:04, 19.51s/it] {'loss': 0.0206, 'grad_norm': 6.602764812582468, 'learning_rate': 1.4139057396173586e-07, 'completion_length': 176.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.01923840120434761, 'kl': 0.51416015625, 'epoch': 0.86}
86%|████████▌ | 3681/4286 [23:11:57<3:15:30, 19.39s/it] {'loss': 0.0157, 'grad_norm': 22.11701073731636, 'learning_rate': 1.4115725618292113e-07, 'completion_length': 182.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.4761905074119568, 'rewards/format_reward': 1.0, 'reward': 1.4761905670166016, 'reward_std': 0.07142857648432255, 'kl': 0.392578125, 'epoch': 0.86}
86%|████████▌ | 3682/4286 [23:12:15<3:13:32, 19.23s/it] {'loss': 0.0462, 'grad_norm': 1.471012016056756, 'learning_rate': 1.4092393840410638e-07, 'completion_length': 184.21428680419922, 'rewards/only_full_func_accuracy_reward': 0.7104592621326447, 'rewards/format_reward': 1.0, 'reward': 1.710459291934967, 'reward_std': 0.07908163592219353, 'kl': 1.15625, 'epoch': 0.86}
86%|████████▌ | 3683/4286 [23:12:34<3:11:38, 19.07s/it] {'loss': 0.021, 'grad_norm': 25.986198848054073, 'learning_rate': 1.4069062062529163e-07, 'completion_length': 184.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6336309909820557, 'rewards/format_reward': 1.0, 'reward': 1.6336309909820557, 'reward_std': 0.07080172561109066, 'kl': 0.52685546875, 'epoch': 0.86}
86%|████████▌ | 3684/4286 [23:12:54<3:13:58, 19.33s/it] {'loss': 0.0587, 'grad_norm': 43.38822763487179, 'learning_rate': 1.4045730284647688e-07, 'completion_length': 187.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.4901786148548126, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4723215103149414, 'reward_std': 0.15896422043442726, 'kl': 1.470703125, 'epoch': 0.86}
86%|████████▌ | 3685/4286 [23:13:19<3:29:18, 20.90s/it] {'loss': 0.0148, 'grad_norm': 6.671189321828207, 'learning_rate': 1.4022398506766215e-07, 'completion_length': 204.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.6785715818405151, 'reward_std': 0.02816697023808956, 'kl': 0.37109375, 'epoch': 0.86}
86%|████████▌ | 3686/4286 [23:13:37<3:21:52, 20.19s/it] {'loss': 0.0171, 'grad_norm': 6.253807211962064, 'learning_rate': 1.399906672888474e-07, 'completion_length': 189.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 1.0, 'reward': 1.662202537059784, 'reward_std': 0.05494099296629429, 'kl': 0.42578125, 'epoch': 0.86}
86%|████████▌ | 3687/4286 [23:13:56<3:18:23, 19.87s/it] {'loss': 0.0446, 'grad_norm': 15.815930356895723, 'learning_rate': 1.3975734951003265e-07, 'completion_length': 180.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.7336309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7157739400863647, 'reward_std': 0.08678307756781578, 'kl': 1.119140625, 'epoch': 0.86}
86%|████████▌ | 3688/4286 [23:14:16<3:18:46, 19.94s/it] {'loss': 0.0339, 'grad_norm': 9.163524120330237, 'learning_rate': 1.395240317312179e-07, 'completion_length': 201.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5961309969425201, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.578273892402649, 'reward_std': 0.15324430912733078, 'kl': 0.84765625, 'epoch': 0.86}
86%|████████▌ | 3689/4286 [23:14:35<3:13:19, 19.43s/it] {'loss': 0.0217, 'grad_norm': 11.860481351947167, 'learning_rate': 1.3929071395240315e-07, 'completion_length': 171.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.612946480512619, 'rewards/format_reward': 1.0, 'reward': 1.6129465103149414, 'reward_std': 0.04174866899847984, 'kl': 0.54150390625, 'epoch': 0.86}
86%|████████▌ | 3690/4286 [23:14:55<3:15:14, 19.66s/it] {'loss': 0.0346, 'grad_norm': 1.2648762099176714, 'learning_rate': 1.3905739617358842e-07, 'completion_length': 196.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6369047462940216, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6190477013587952, 'reward_std': 0.10714286379516125, 'kl': 0.86328125, 'epoch': 0.86}
86%|████████▌ | 3691/4286 [23:15:14<3:13:20, 19.50s/it] {'loss': 0.1093, 'grad_norm': 16.20821199349898, 'learning_rate': 1.3882407839477367e-07, 'completion_length': 170.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6054564118385315, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5697421431541443, 'reward_std': 0.12595517188310623, 'kl': 2.734375, 'epoch': 0.86}
86%|████████▌ | 3692/4286 [23:15:32<3:08:57, 19.09s/it] {'loss': 0.0266, 'grad_norm': 11.18601150093328, 'learning_rate': 1.3859076061595892e-07, 'completion_length': 171.46428680419922, 'rewards/only_full_func_accuracy_reward': 0.7425595819950104, 'rewards/format_reward': 1.0, 'reward': 1.7425596714019775, 'reward_std': 0.05495268478989601, 'kl': 0.6630859375, 'epoch': 0.86}
86%|████████▌ | 3693/4286 [23:15:51<3:09:14, 19.15s/it] {'loss': 0.0232, 'grad_norm': 0.8424031957482351, 'learning_rate': 1.3835744283714417e-07, 'completion_length': 177.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7276786267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7098215222358704, 'reward_std': 0.08035714644938707, 'kl': 0.578125, 'epoch': 0.86}
86%|████████▌ | 3694/4286 [23:16:10<3:06:42, 18.92s/it] {'loss': 0.0372, 'grad_norm': 13.383571422683122, 'learning_rate': 1.3812412505832944e-07, 'completion_length': 177.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5500000566244125, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5321429371833801, 'reward_std': 0.10884096659719944, 'kl': 0.927734375, 'epoch': 0.86}
86%|████████▌ | 3695/4286 [23:16:28<3:05:36, 18.84s/it] {'loss': 0.03, 'grad_norm': 5.552959251834418, 'learning_rate': 1.378908072795147e-07, 'completion_length': 174.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.8592262268066406, 'rewards/format_reward': 1.0, 'reward': 1.8592262864112854, 'reward_std': 0.060671549290418625, 'kl': 0.748046875, 'epoch': 0.86}
86%|████████▌ | 3696/4286 [23:16:50<3:14:05, 19.74s/it] {'loss': 0.0532, 'grad_norm': 3.495329905429037, 'learning_rate': 1.3765748950069994e-07, 'completion_length': 176.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 1.0, 'reward': 1.6339287161827087, 'reward_std': 0.09463678859174252, 'kl': 1.3330078125, 'epoch': 0.86}
86%|████████▋ | 3697/4286 [23:17:12<3:19:52, 20.36s/it] {'loss': 0.036, 'grad_norm': 9.475525698407473, 'learning_rate': 1.374241717218852e-07, 'completion_length': 198.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6014881432056427, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5836310386657715, 'reward_std': 0.13630952686071396, 'kl': 0.90087890625, 'epoch': 0.86}
86%|████████▋ | 3698/4286 [23:17:32<3:17:51, 20.19s/it] {'loss': 0.0229, 'grad_norm': 6.754687672482556, 'learning_rate': 1.3719085394307044e-07, 'completion_length': 182.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6919644474983215, 'reward_std': 0.07534223981201649, 'kl': 0.5732421875, 'epoch': 0.86}
86%|████████▋ | 3699/4286 [23:17:52<3:16:14, 20.06s/it] {'loss': 0.0312, 'grad_norm': 4.832266924736737, 'learning_rate': 1.3695753616425571e-07, 'completion_length': 171.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.6264881789684296, 'rewards/format_reward': 1.0, 'reward': 1.626488208770752, 'reward_std': 0.042115405201911926, 'kl': 0.78125, 'epoch': 0.86}
86%|████████▋ | 3700/4286 [23:18:11<3:13:33, 19.82s/it] {'loss': 0.0283, 'grad_norm': 5.530858425089393, 'learning_rate': 1.3672421838544096e-07, 'completion_length': 203.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.6175595819950104, 'rewards/format_reward': 1.0, 'reward': 1.6175596714019775, 'reward_std': 0.06526251696050167, 'kl': 0.70703125, 'epoch': 0.86}
86%|████████▋ | 3701/4286 [23:21:44<12:37:20, 77.68s/it] {'loss': 0.0648, 'grad_norm': 6.733572267076992, 'learning_rate': 1.364909006066262e-07, 'completion_length': 198.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.7304258942604065, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6947116255760193, 'reward_std': 0.19846609234809875, 'kl': 1.6171875, 'epoch': 0.86}
86%|████████▋ | 3702/4286 [23:22:02<9:41:52, 59.78s/it] {'loss': 0.0386, 'grad_norm': 3.8186868208696017, 'learning_rate': 1.3625758282781146e-07, 'completion_length': 179.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6735119521617889, 'rewards/format_reward': 1.0, 'reward': 1.6735119819641113, 'reward_std': 0.09344663843512535, 'kl': 0.9638671875, 'epoch': 0.86}
86%|████████▋ | 3703/4286 [23:22:19<7:36:29, 46.98s/it] {'loss': 0.052, 'grad_norm': 5.187946263825581, 'learning_rate': 1.360242650489967e-07, 'completion_length': 155.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6056548953056335, 'reward_std': 0.1255265362560749, 'kl': 1.296875, 'epoch': 0.86}
86%|████████▋ | 3704/4286 [23:22:38<6:14:56, 38.65s/it] {'loss': 0.0198, 'grad_norm': 10.20538994275178, 'learning_rate': 1.3579094727018198e-07, 'completion_length': 184.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6562500596046448, 'rewards/format_reward': 1.0, 'reward': 1.6562501788139343, 'reward_std': 0.09342947602272034, 'kl': 0.49169921875, 'epoch': 0.86}
86%|████████▋ | 3705/4286 [23:22:56<5:15:53, 32.62s/it] {'loss': 0.0408, 'grad_norm': 2.5304809230071235, 'learning_rate': 1.3555762949136723e-07, 'completion_length': 188.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.611607164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5937500596046448, 'reward_std': 0.1134050190448761, 'kl': 1.021484375, 'epoch': 0.86}
86%|████████▋ | 3706/4286 [23:23:16<4:38:36, 28.82s/it] {'loss': 0.0267, 'grad_norm': 7.840171699833573, 'learning_rate': 1.3532431171255248e-07, 'completion_length': 179.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6306548416614532, 'rewards/format_reward': 1.0, 'reward': 1.6306548714637756, 'reward_std': 0.038151202723383904, 'kl': 0.66796875, 'epoch': 0.86}
86%|████████▋ | 3707/4286 [23:23:36<4:10:00, 25.91s/it] {'loss': 0.0123, 'grad_norm': 3.894537253195001, 'learning_rate': 1.3509099393373773e-07, 'completion_length': 183.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6327381432056427, 'rewards/format_reward': 1.0, 'reward': 1.632738173007965, 'reward_std': 0.04625473078340292, 'kl': 0.3076171875, 'epoch': 0.86}
87%|████████▋ | 3708/4286 [23:23:54<3:48:58, 23.77s/it] {'loss': 0.0091, 'grad_norm': 21.85811182738261, 'learning_rate': 1.34857676154923e-07, 'completion_length': 188.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6892857551574707, 'rewards/format_reward': 1.0, 'reward': 1.689285933971405, 'reward_std': 0.06053862348198891, 'kl': 0.2275390625, 'epoch': 0.87}
87%|████████▋ | 3709/4286 [23:24:12<3:29:51, 21.82s/it] {'loss': 0.0158, 'grad_norm': 3.8113036323004943, 'learning_rate': 1.3462435837610825e-07, 'completion_length': 166.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.05777822621166706, 'kl': 0.39453125, 'epoch': 0.87}
87%|████████▋ | 3710/4286 [23:24:30<3:20:32, 20.89s/it] {'loss': 0.0284, 'grad_norm': 0.5401864056002565, 'learning_rate': 1.343910405972935e-07, 'completion_length': 186.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.05357143096625805, 'kl': 0.71044921875, 'epoch': 0.87}
87%|████████▋ | 3711/4286 [23:24:49<3:12:50, 20.12s/it] {'loss': 0.022, 'grad_norm': 4.81613575775746, 'learning_rate': 1.3415772281847875e-07, 'completion_length': 191.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.722321480512619, 'rewards/format_reward': 1.0, 'reward': 1.7223215103149414, 'reward_std': 0.05862582102417946, 'kl': 0.5537109375, 'epoch': 0.87}
87%|████████▋ | 3712/4286 [23:25:08<3:10:52, 19.95s/it] {'loss': 0.031, 'grad_norm': 4.667834419093401, 'learning_rate': 1.33924405039664e-07, 'completion_length': 189.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6252976655960083, 'rewards/format_reward': 1.0, 'reward': 1.6252976655960083, 'reward_std': 0.08093245513737202, 'kl': 0.7763671875, 'epoch': 0.87}
87%|████████▋ | 3713/4286 [23:25:28<3:09:04, 19.80s/it] {'loss': 0.0782, 'grad_norm': 2.578693095934125, 'learning_rate': 1.3369108726084928e-07, 'completion_length': 189.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.6508929133415222, 'rewards/format_reward': 1.0, 'reward': 1.6508929133415222, 'reward_std': 0.14556369185447693, 'kl': 1.953125, 'epoch': 0.87}
87%|████████▋ | 3714/4286 [23:25:47<3:06:23, 19.55s/it] {'loss': 0.0243, 'grad_norm': 6.76995656192441, 'learning_rate': 1.3345776948203452e-07, 'completion_length': 189.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.49315477907657623, 'rewards/format_reward': 1.0, 'reward': 1.4931548833847046, 'reward_std': 0.11447710543870926, 'kl': 0.60498046875, 'epoch': 0.87}
87%|████████▋ | 3715/4286 [23:26:09<3:13:01, 20.28s/it] {'loss': 0.0122, 'grad_norm': 2.7978576184262534, 'learning_rate': 1.3322445170321977e-07, 'completion_length': 202.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.5232143253087997, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.505357265472412, 'reward_std': 0.06434538215398788, 'kl': 0.3056640625, 'epoch': 0.87}
[2025-03-03 04:33:46,964] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
87%|████████▋ | 3716/4286 [23:26:31<3:18:53, 20.94s/it] {'loss': 0.0364, 'grad_norm': 14.26900903997592, 'learning_rate': 1.3299113392440502e-07, 'completion_length': 195.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6726190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.654762089252472, 'reward_std': 0.08769077807664871, 'kl': 0.9091796875, 'epoch': 0.87}
87%|████████▋ | 3717/4286 [23:26:53<3:21:44, 21.27s/it] {'loss': 0.055, 'grad_norm': 7.028771806330798, 'learning_rate': 1.327578161455903e-07, 'completion_length': 187.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6041668057441711, 'reward_std': 0.1369047686457634, 'kl': 1.37890625, 'epoch': 0.87}
87%|████████▋ | 3718/4286 [23:27:15<3:24:09, 21.57s/it] {'loss': 0.0222, 'grad_norm': 1.4911180390340302, 'learning_rate': 1.3252449836677555e-07, 'completion_length': 175.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.6622024178504944, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.0327381007373333, 'kl': 0.55517578125, 'epoch': 0.87}
87%|████████▋ | 3719/4286 [23:27:41<3:34:51, 22.74s/it] {'loss': 0.1172, 'grad_norm': 11.10094276761149, 'learning_rate': 1.322911805879608e-07, 'completion_length': 192.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.5227183103561401, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5048611164093018, 'reward_std': 0.17031780257821083, 'kl': 2.9296875, 'epoch': 0.87}
87%|████████▋ | 3720/4286 [23:28:03<3:31:54, 22.46s/it] {'loss': 0.0674, 'grad_norm': 7.654726043743404, 'learning_rate': 1.3205786280914604e-07, 'completion_length': 208.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.5461309850215912, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5282739400863647, 'reward_std': 0.11915471032261848, 'kl': 1.68359375, 'epoch': 0.87}
87%|████████▋ | 3721/4286 [23:28:21<3:19:28, 21.18s/it] {'loss': 0.0247, 'grad_norm': 6.472543578127691, 'learning_rate': 1.318245450303313e-07, 'completion_length': 174.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 1.0, 'reward': 1.6026787161827087, 'reward_std': 0.0933420117944479, 'kl': 0.62109375, 'epoch': 0.87}
87%|████████▋ | 3722/4286 [23:28:40<3:12:10, 20.44s/it] {'loss': 0.0147, 'grad_norm': 12.7211978908467, 'learning_rate': 1.3159122725151657e-07, 'completion_length': 193.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.648809552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6309524774551392, 'reward_std': 0.05952381435781717, 'kl': 0.3671875, 'epoch': 0.87}
87%|████████▋ | 3723/4286 [23:28:59<3:08:36, 20.10s/it] {'loss': 0.0786, 'grad_norm': 4.539000151164177, 'learning_rate': 1.3135790947270182e-07, 'completion_length': 179.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6889881491661072, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.08330587577074766, 'kl': 1.962890625, 'epoch': 0.87}
87%|████████▋ | 3724/4286 [23:29:21<3:14:17, 20.74s/it] {'loss': 0.051, 'grad_norm': 13.492130222212177, 'learning_rate': 1.3112459169388706e-07, 'completion_length': 174.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.5131696909666061, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4595982432365417, 'reward_std': 0.1206362396478653, 'kl': 1.2705078125, 'epoch': 0.87}
87%|████████▋ | 3725/4286 [23:29:40<3:09:15, 20.24s/it] {'loss': 0.0746, 'grad_norm': 2.8802733970675614, 'learning_rate': 1.308912739150723e-07, 'completion_length': 189.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.8232143223285675, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.787500023841858, 'reward_std': 0.20517675951123238, 'kl': 1.859375, 'epoch': 0.87}
87%|████████▋ | 3726/4286 [23:29:58<3:02:43, 19.58s/it] {'loss': 0.0554, 'grad_norm': 6.26100677208434, 'learning_rate': 1.3065795613625756e-07, 'completion_length': 173.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5208334028720856, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5029762983322144, 'reward_std': 0.1080636978149414, 'kl': 1.38671875, 'epoch': 0.87}
87%|████████▋ | 3727/4286 [23:30:18<3:03:42, 19.72s/it] {'loss': 0.0337, 'grad_norm': 2.4331329789182266, 'learning_rate': 1.3042463835744284e-07, 'completion_length': 166.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.7157738208770752, 'rewards/format_reward': 1.0, 'reward': 1.7157739400863647, 'reward_std': 0.03172445576637983, 'kl': 0.841796875, 'epoch': 0.87}
87%|████████▋ | 3728/4286 [23:30:40<3:10:00, 20.43s/it] {'loss': 0.0582, 'grad_norm': 13.253862408468791, 'learning_rate': 1.3019132057862809e-07, 'completion_length': 202.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5178572535514832, 'reward_std': 0.22023407369852066, 'kl': 1.4482421875, 'epoch': 0.87}
87%|████████▋ | 3729/4286 [23:30:59<3:04:59, 19.93s/it] {'loss': 0.0117, 'grad_norm': 7.291257992363128, 'learning_rate': 1.2995800279981333e-07, 'completion_length': 172.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.6964287161827087, 'reward_std': 0.06793888658285141, 'kl': 0.29248046875, 'epoch': 0.87}
87%|████████▋ | 3730/4286 [23:31:22<3:14:12, 20.96s/it] {'loss': 0.03, 'grad_norm': 3.0917350550152096, 'learning_rate': 1.2972468502099858e-07, 'completion_length': 185.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.6044643670320511, 'rewards/format_reward': 1.0, 'reward': 1.6044644117355347, 'reward_std': 0.07570716179907322, 'kl': 0.751953125, 'epoch': 0.87}
87%|████████▋ | 3731/4286 [23:31:40<3:05:03, 20.01s/it] {'loss': 0.0231, 'grad_norm': 7.593229322371554, 'learning_rate': 1.2949136724218386e-07, 'completion_length': 177.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7785714566707611, 'rewards/format_reward': 1.0, 'reward': 1.7785714864730835, 'reward_std': 0.07088642567396164, 'kl': 0.57861328125, 'epoch': 0.87}
87%|████████▋ | 3732/4286 [23:32:03<3:11:22, 20.73s/it] {'loss': 0.0451, 'grad_norm': 13.270134164015841, 'learning_rate': 1.292580494633691e-07, 'completion_length': 198.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.5773809850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5416667461395264, 'reward_std': 0.10940204933285713, 'kl': 1.12890625, 'epoch': 0.87}
87%|████████▋ | 3733/4286 [23:32:23<3:10:47, 20.70s/it] {'loss': 0.0415, 'grad_norm': 1.363064201318179, 'learning_rate': 1.2902473168455436e-07, 'completion_length': 184.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.5089285671710968, 'rewards/format_reward': 1.0, 'reward': 1.5089287161827087, 'reward_std': 0.08928571827709675, 'kl': 1.04052734375, 'epoch': 0.87}
87%|████████▋ | 3734/4286 [23:32:42<3:04:37, 20.07s/it] {'loss': 0.0313, 'grad_norm': 3.893114482720033, 'learning_rate': 1.287914139057396e-07, 'completion_length': 190.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6559524238109589, 'rewards/format_reward': 1.0, 'reward': 1.6559524536132812, 'reward_std': 0.11301729083061218, 'kl': 0.783203125, 'epoch': 0.87}
87%|████████▋ | 3735/4286 [23:33:02<3:04:41, 20.11s/it] {'loss': 0.023, 'grad_norm': 17.922648700862247, 'learning_rate': 1.2855809612692485e-07, 'completion_length': 192.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.575892984867096, 'reward_std': 0.09066697210073471, 'kl': 0.576171875, 'epoch': 0.87}
87%|████████▋ | 3736/4286 [23:33:21<3:00:27, 19.69s/it] {'loss': 0.0601, 'grad_norm': 2.643357903458924, 'learning_rate': 1.2832477834811013e-07, 'completion_length': 161.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.5505952686071396, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.532738208770752, 'reward_std': 0.15148507058620453, 'kl': 1.50390625, 'epoch': 0.87}
87%|████████▋ | 3737/4286 [23:33:41<3:00:38, 19.74s/it] {'loss': 0.0386, 'grad_norm': 8.601878252608723, 'learning_rate': 1.2809146056929538e-07, 'completion_length': 196.30357360839844, 'rewards/only_full_func_accuracy_reward': 0.664881020784378, 'rewards/format_reward':
0.9642857313156128, 'reward': 1.6291667819023132, 'reward_std': 0.20645516365766525, 'kl': 0.9677734375, 'epoch': 0.87} 87%|████████▋ | 3737/4286 [23:33:41<3:00:38, 19.74s/it] 87%|████████▋ | 3738/4286 [23:34:03<3:06:18, 20.40s/it] {'loss': 0.0605, 'grad_norm': 22.285946258859155, 'learning_rate': 1.2785814279048063e-07, 'completion_length': 192.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5163690596818924, 'rewards/format_reward': 1.0, 'reward': 1.5163691639900208, 'reward_std': 0.10034418571740389, 'kl': 1.517578125, 'epoch': 0.87} 87%|████████▋ | 3738/4286 [23:34:03<3:06:18, 20.40s/it] 87%|████████▋ | 3739/4286 [23:34:20<2:58:33, 19.59s/it] {'loss': 0.0625, 'grad_norm': 7.666708368765205, 'learning_rate': 1.2762482501166587e-07, 'completion_length': 181.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.5973215103149414, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.579464316368103, 'reward_std': 0.12636392936110497, 'kl': 1.5625, 'epoch': 0.87} 87%|████████▋ | 3739/4286 [23:34:20<2:58:33, 19.59s/it] 87%|████████▋ | 3740/4286 [23:34:41<3:01:06, 19.90s/it] {'loss': 0.0732, 'grad_norm': 14.225097940174786, 'learning_rate': 1.2739150723285115e-07, 'completion_length': 204.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.6333333849906921, 'rewards/format_reward': 1.0, 'reward': 1.633333444595337, 'reward_std': 0.15368958935141563, 'kl': 1.83203125, 'epoch': 0.87} 87%|████████▋ | 3740/4286 [23:34:41<3:01:06, 19.90s/it] 87%|████████▋ | 3741/4286 [23:35:06<3:14:54, 21.46s/it] {'loss': 0.0436, 'grad_norm': 1.8715474368523233, 'learning_rate': 1.271581894540364e-07, 'completion_length': 204.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.635416716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5997024774551392, 'reward_std': 0.1101190485060215, 'kl': 1.091796875, 'epoch': 0.87} 87%|████████▋ | 3741/4286 [23:35:06<3:14:54, 21.46s/it] 87%|████████▋ | 3742/4286 [23:35:27<3:13:45, 21.37s/it] {'loss': 
0.0542, 'grad_norm': 32.636874312445435, 'learning_rate': 1.2692487167522165e-07, 'completion_length': 180.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.7075758278369904, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6897186636924744, 'reward_std': 0.10459177382290363, 'kl': 1.35546875, 'epoch': 0.87} 87%|████████▋ | 3742/4286 [23:35:27<3:13:45, 21.37s/it] 87%|████████▋ | 3743/4286 [23:35:47<3:09:12, 20.91s/it] {'loss': 0.0538, 'grad_norm': 28.202405323241408, 'learning_rate': 1.266915538964069e-07, 'completion_length': 208.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.5860119462013245, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5324405431747437, 'reward_std': 0.174908809363842, 'kl': 1.3408203125, 'epoch': 0.87} 87%|████████▋ | 3743/4286 [23:35:47<3:09:12, 20.91s/it] 87%|████████▋ | 3744/4286 [23:36:06<3:03:13, 20.28s/it] {'loss': 0.0156, 'grad_norm': 2.1184223423964137, 'learning_rate': 1.2645823611759214e-07, 'completion_length': 178.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7232143580913544, 'rewards/format_reward': 1.0, 'reward': 1.7232144474983215, 'reward_std': 0.0384767958894372, 'kl': 0.38916015625, 'epoch': 0.87} 87%|████████▋ | 3744/4286 [23:36:06<3:03:13, 20.28s/it] 87%|████████▋ | 3745/4286 [23:36:25<2:58:56, 19.85s/it] {'loss': 0.0381, 'grad_norm': 6.064411535598246, 'learning_rate': 1.2622491833877742e-07, 'completion_length': 168.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.5937500447034836, 'rewards/format_reward': 1.0, 'reward': 1.5937501192092896, 'reward_std': 0.0744047686457634, 'kl': 0.9501953125, 'epoch': 0.87} 87%|████████▋ | 3745/4286 [23:36:25<2:58:56, 19.85s/it] 87%|████████▋ | 3746/4286 [23:36:45<3:00:17, 20.03s/it] {'loss': 0.0336, 'grad_norm': 7.6888333198694365, 'learning_rate': 1.2599160055996267e-07, 'completion_length': 184.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.7428571879863739, 'rewards/format_reward': 1.0, 'reward': 
1.7428571581840515, 'reward_std': 0.07132146507501602, 'kl': 0.8427734375, 'epoch': 0.87} 87%|████████▋ | 3746/4286 [23:36:45<3:00:17, 20.03s/it] 87%|████████▋ | 3747/4286 [23:37:08<3:08:48, 21.02s/it] {'loss': 0.0391, 'grad_norm': 28.16826635325579, 'learning_rate': 1.2575828278114792e-07, 'completion_length': 186.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6190476417541504, 'rewards/format_reward': 1.0, 'reward': 1.6190476417541504, 'reward_std': 0.1271628886461258, 'kl': 0.9775390625, 'epoch': 0.87} 87%|████████▋ | 3747/4286 [23:37:08<3:08:48, 21.02s/it] 87%|████████▋ | 3748/4286 [23:37:28<3:04:12, 20.54s/it] {'loss': 0.0444, 'grad_norm': 6.158420123756181, 'learning_rate': 1.2552496500233316e-07, 'completion_length': 192.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.7126831710338593, 'rewards/format_reward': 1.0, 'reward': 1.712683379650116, 'reward_std': 0.0882328562438488, 'kl': 1.111328125, 'epoch': 0.87} 87%|████████▋ | 3748/4286 [23:37:28<3:04:12, 20.54s/it] 87%|████████▋ | 3749/4286 [23:37:46<2:57:56, 19.88s/it] {'loss': 0.0515, 'grad_norm': 1.0766881467672136, 'learning_rate': 1.2529164722351841e-07, 'completion_length': 179.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6607143878936768, 'reward_std': 0.1358487457036972, 'kl': 1.28173828125, 'epoch': 0.87} 87%|████████▋ | 3749/4286 [23:37:46<2:57:56, 19.88s/it] 87%|████████▋ | 3750/4286 [23:38:04<2:52:44, 19.34s/it] {'loss': 0.0088, 'grad_norm': 4.43502796614499, 'learning_rate': 1.250583294447037e-07, 'completion_length': 199.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.7410714626312256, 'rewards/format_reward': 1.0, 'reward': 1.7410715222358704, 'reward_std': 0.029761905781924725, 'kl': 0.2216796875, 'epoch': 0.87} 87%|████████▋ | 3750/4286 [23:38:04<2:52:44, 19.34s/it] 88%|████████▊ | 3751/4286 [23:38:23<2:51:19, 19.21s/it] {'loss': 0.062, 'grad_norm': 
5.695317191610093, 'learning_rate': 1.2482501166588894e-07, 'completion_length': 185.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.630952537059784, 'reward_std': 0.09535404108464718, 'kl': 1.55224609375, 'epoch': 0.88} 88%|████████▊ | 3751/4286 [23:38:23<2:51:19, 19.21s/it] 88%|████████▊ | 3752/4286 [23:38:45<2:56:59, 19.89s/it] {'loss': 0.0196, 'grad_norm': 6.194361718080416, 'learning_rate': 1.2459169388707419e-07, 'completion_length': 179.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.7648809850215912, 'rewards/format_reward': 1.0, 'reward': 1.7648810744285583, 'reward_std': 0.04602411389350891, 'kl': 0.4873046875, 'epoch': 0.88} 88%|████████▊ | 3752/4286 [23:38:45<2:56:59, 19.89s/it] 88%|████████▊ | 3753/4286 [23:39:07<3:02:44, 20.57s/it] {'loss': 0.0257, 'grad_norm': 9.776361091318059, 'learning_rate': 1.2435837610825943e-07, 'completion_length': 203.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6755952537059784, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.657738208770752, 'reward_std': 0.13425380364060402, 'kl': 0.640625, 'epoch': 0.88} 88%|████████▊ | 3753/4286 [23:39:07<3:02:44, 20.57s/it] 88%|████████▊ | 3754/4286 [23:39:30<3:10:16, 21.46s/it] {'loss': 0.0299, 'grad_norm': 1.9867826783759959, 'learning_rate': 1.241250583294447e-07, 'completion_length': 193.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7380952537059784, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7023810744285583, 'reward_std': 0.1602869164198637, 'kl': 0.744140625, 'epoch': 0.88} 88%|████████▊ | 3754/4286 [23:39:30<3:10:16, 21.46s/it] 88%|████████▊ | 3755/4286 [23:39:51<3:07:13, 21.16s/it] {'loss': 0.0169, 'grad_norm': 13.292810363376322, 'learning_rate': 1.2389174055062996e-07, 'completion_length': 186.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6949405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6949405670166016, 'reward_std': 
0.06250000186264515, 'kl': 0.421875, 'epoch': 0.88} 88%|████████▊ | 3755/4286 [23:39:51<3:07:13, 21.16s/it] 88%|████████▊ | 3756/4286 [23:40:10<3:01:08, 20.51s/it] {'loss': 0.0393, 'grad_norm': 42.55072075970688, 'learning_rate': 1.236584227718152e-07, 'completion_length': 168.08928680419922, 'rewards/only_full_func_accuracy_reward': 0.6547619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6547619700431824, 'reward_std': 0.06388125754892826, 'kl': 0.9814453125, 'epoch': 0.88} 88%|████████▊ | 3756/4286 [23:40:10<3:01:08, 20.51s/it] 88%|████████▊ | 3757/4286 [23:40:28<2:54:45, 19.82s/it] {'loss': 0.0288, 'grad_norm': 4.89117291786235, 'learning_rate': 1.2342510499300046e-07, 'completion_length': 179.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7261905670166016, 'reward_std': 0.11886699870228767, 'kl': 0.7177734375, 'epoch': 0.88} 88%|████████▊ | 3757/4286 [23:40:28<2:54:45, 19.82s/it] 88%|████████▊ | 3758/4286 [23:40:49<2:56:40, 20.08s/it] {'loss': 0.0447, 'grad_norm': 9.497940730132932, 'learning_rate': 1.231917872141857e-07, 'completion_length': 194.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5297619551420212, 'rewards/format_reward': 1.0, 'reward': 1.5297620296478271, 'reward_std': 0.08822483941912651, 'kl': 1.1171875, 'epoch': 0.88} 88%|████████▊ | 3758/4286 [23:40:49<2:56:40, 20.08s/it] 88%|████████▊ | 3759/4286 [23:41:08<2:54:34, 19.88s/it] {'loss': 0.0179, 'grad_norm': 3.6757859414205596, 'learning_rate': 1.2295846943537098e-07, 'completion_length': 184.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.6098214983940125, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5919644236564636, 'reward_std': 0.14689559116959572, 'kl': 0.4462890625, 'epoch': 0.88} 88%|████████▊ | 3759/4286 [23:41:08<2:54:34, 19.88s/it] 88%|████████▊ | 3760/4286 [23:41:27<2:51:33, 19.57s/it] {'loss': 0.0091, 'grad_norm': 9.99192198490632, 'learning_rate': 
1.2272515165655623e-07, 'completion_length': 183.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.8092262744903564, 'rewards/format_reward': 1.0, 'reward': 1.8092263340950012, 'reward_std': 0.03988095559179783, 'kl': 0.22705078125, 'epoch': 0.88} 88%|████████▊ | 3760/4286 [23:41:27<2:51:33, 19.57s/it] 88%|████████▊ | 3761/4286 [23:41:46<2:50:24, 19.48s/it] {'loss': 0.0707, 'grad_norm': 4.833608449629828, 'learning_rate': 1.2249183387774148e-07, 'completion_length': 202.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6367312073707581, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6188741326332092, 'reward_std': 0.06911402940750122, 'kl': 1.76953125, 'epoch': 0.88} 88%|████████▊ | 3761/4286 [23:41:46<2:50:24, 19.48s/it] 88%|████████▊ | 3762/4286 [23:42:06<2:50:41, 19.55s/it] {'loss': 0.0315, 'grad_norm': 8.350222200536052, 'learning_rate': 1.2225851609892673e-07, 'completion_length': 202.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6223214566707611, 'rewards/format_reward': 1.0, 'reward': 1.6223215460777283, 'reward_std': 0.06803308241069317, 'kl': 0.78759765625, 'epoch': 0.88} 88%|████████▊ | 3762/4286 [23:42:06<2:50:41, 19.55s/it] 88%|████████▊ | 3763/4286 [23:42:23<2:43:47, 18.79s/it] {'loss': 0.0251, 'grad_norm': 9.516934288046231, 'learning_rate': 1.22025198320112e-07, 'completion_length': 154.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7886905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7886906266212463, 'reward_std': 0.0535714328289032, 'kl': 0.63037109375, 'epoch': 0.88} 88%|████████▊ | 3763/4286 [23:42:23<2:43:47, 18.79s/it] 88%|████████▊ | 3764/4286 [23:42:46<2:55:26, 20.17s/it] {'loss': 0.0961, 'grad_norm': 7.344889215497266, 'learning_rate': 1.2179188054129725e-07, 'completion_length': 205.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5104167014360428, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4747024774551392, 'reward_std': 0.17298542708158493, 'kl': 
2.40625, 'epoch': 0.88} 88%|████████▊ | 3764/4286 [23:42:46<2:55:26, 20.17s/it] 88%|████████▊ | 3765/4286 [23:43:05<2:51:43, 19.78s/it] {'loss': 0.0398, 'grad_norm': 6.0252931631508915, 'learning_rate': 1.215585627624825e-07, 'completion_length': 176.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.681547686457634, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6636906266212463, 'reward_std': 0.10990536957979202, 'kl': 0.9951171875, 'epoch': 0.88} 88%|████████▊ | 3765/4286 [23:43:05<2:51:43, 19.78s/it] 88%|████████▊ | 3766/4286 [23:43:23<2:46:34, 19.22s/it] {'loss': 0.0078, 'grad_norm': 0.34753497941477446, 'learning_rate': 1.2132524498366775e-07, 'completion_length': 170.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0, 'kl': 0.19482421875, 'epoch': 0.88} 88%|████████▊ | 3766/4286 [23:43:23<2:46:34, 19.22s/it] 88%|████████▊ | 3767/4286 [23:43:42<2:46:20, 19.23s/it] {'loss': 0.0188, 'grad_norm': 4.141736934098093, 'learning_rate': 1.21091927204853e-07, 'completion_length': 182.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.08014346286654472, 'kl': 0.4716796875, 'epoch': 0.88} 88%|████████▊ | 3767/4286 [23:43:42<2:46:20, 19.23s/it] 88%|████████▊ | 3768/4286 [23:44:01<2:44:27, 19.05s/it] {'loss': 0.0354, 'grad_norm': 2.440622840409482, 'learning_rate': 1.2085860942603827e-07, 'completion_length': 184.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.614583432674408, 'reward_std': 0.127976194024086, 'kl': 0.8828125, 'epoch': 0.88} 88%|████████▊ | 3768/4286 [23:44:01<2:44:27, 19.05s/it] 88%|████████▊ | 3769/4286 [23:44:19<2:42:29, 18.86s/it] {'loss': 0.0308, 'grad_norm': 9.185586855392055, 'learning_rate': 1.2062529164722352e-07, 
'completion_length': 194.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.663194477558136, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6453372836112976, 'reward_std': 0.1283600702881813, 'kl': 0.76953125, 'epoch': 0.88} 88%|████████▊ | 3769/4286 [23:44:19<2:42:29, 18.86s/it] 88%|████████▊ | 3770/4286 [23:44:39<2:43:12, 18.98s/it] {'loss': 0.018, 'grad_norm': 2.29138669261474, 'learning_rate': 1.2039197386840877e-07, 'completion_length': 188.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.8306548297405243, 'rewards/format_reward': 1.0, 'reward': 1.8306548595428467, 'reward_std': 0.0656273690983653, 'kl': 0.4501953125, 'epoch': 0.88} 88%|████████▊ | 3770/4286 [23:44:39<2:43:12, 18.98s/it] 88%|████████▊ | 3771/4286 [23:44:58<2:43:02, 19.00s/it] {'loss': 0.0223, 'grad_norm': 4.198568315988709, 'learning_rate': 1.2015865608959402e-07, 'completion_length': 187.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.7315476536750793, 'rewards/format_reward': 1.0, 'reward': 1.7315477132797241, 'reward_std': 0.05168629437685013, 'kl': 0.560546875, 'epoch': 0.88} 88%|████████▊ | 3771/4286 [23:44:58<2:43:02, 19.00s/it] 88%|████████▊ | 3772/4286 [23:45:17<2:42:33, 18.97s/it] {'loss': 0.0189, 'grad_norm': 32.58179423835497, 'learning_rate': 1.1992533831077927e-07, 'completion_length': 191.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.49375003576278687, 'rewards/format_reward': 1.0, 'reward': 1.4937500953674316, 'reward_std': 0.0733458586037159, 'kl': 0.47314453125, 'epoch': 0.88} 88%|████████▊ | 3772/4286 [23:45:17<2:42:33, 18.97s/it] 88%|████████▊ | 3773/4286 [23:45:34<2:38:00, 18.48s/it] {'loss': 0.033, 'grad_norm': 3.8744340771348713, 'learning_rate': 1.1969202053196454e-07, 'completion_length': 164.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7261906862258911, 'reward_std': 0.08538143523037434, 'kl': 0.82421875, 'epoch': 0.88} 
88%|████████▊ | 3773/4286 [23:45:34<2:38:00, 18.48s/it] 88%|████████▊ | 3774/4286 [23:45:53<2:38:45, 18.60s/it] {'loss': 0.0231, 'grad_norm': 1.5409449359388203, 'learning_rate': 1.194587027531498e-07, 'completion_length': 187.37501525878906, 'rewards/only_full_func_accuracy_reward': 0.6761905252933502, 'rewards/format_reward': 1.0, 'reward': 1.6761906147003174, 'reward_std': 0.031871598213911057, 'kl': 0.57666015625, 'epoch': 0.88} 88%|████████▊ | 3774/4286 [23:45:53<2:38:45, 18.60s/it] 88%|████████▊ | 3775/4286 [23:46:11<2:37:59, 18.55s/it] {'loss': 0.0733, 'grad_norm': 26.702726314788627, 'learning_rate': 1.1922538497433504e-07, 'completion_length': 185.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6339285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6160715818405151, 'reward_std': 0.172619067132473, 'kl': 1.83203125, 'epoch': 0.88} 88%|████████▊ | 3775/4286 [23:46:11<2:37:59, 18.55s/it] 88%|████████▊ | 3776/4286 [23:46:30<2:38:41, 18.67s/it] {'loss': 0.0399, 'grad_norm': 5.012484582944247, 'learning_rate': 1.189920671955203e-07, 'completion_length': 174.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5806547999382019, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5627977848052979, 'reward_std': 0.08869048301130533, 'kl': 0.99609375, 'epoch': 0.88} 88%|████████▊ | 3776/4286 [23:46:30<2:38:41, 18.67s/it] 88%|████████▊ | 3777/4286 [23:46:48<2:36:53, 18.49s/it] {'loss': 0.0343, 'grad_norm': 11.69687356623695, 'learning_rate': 1.1875874941670555e-07, 'completion_length': 190.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6056548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6056548357009888, 'reward_std': 0.035413628444075584, 'kl': 0.859375, 'epoch': 0.88} 88%|████████▊ | 3777/4286 [23:46:48<2:36:53, 18.49s/it] 88%|████████▊ | 3778/4286 [23:47:07<2:36:35, 18.49s/it] {'loss': 0.0742, 'grad_norm': 4.780130527770195, 'learning_rate': 1.1852543163789081e-07, 'completion_length': 
173.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6916666924953461, 'rewards/format_reward': 1.0, 'reward': 1.6916667222976685, 'reward_std': 0.09024734422564507, 'kl': 1.8515625, 'epoch': 0.88} 88%|████████▊ | 3778/4286 [23:47:07<2:36:35, 18.49s/it] 88%|████████▊ | 3779/4286 [23:47:26<2:38:26, 18.75s/it] {'loss': 0.0296, 'grad_norm': 1.446160796647337, 'learning_rate': 1.1829211385907606e-07, 'completion_length': 195.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5446428656578064, 'rewards/format_reward': 1.0, 'reward': 1.5446429252624512, 'reward_std': 0.059431010857224464, 'kl': 0.7421875, 'epoch': 0.88} 88%|████████▊ | 3779/4286 [23:47:26<2:38:26, 18.75s/it] 88%|████████▊ | 3780/4286 [23:47:46<2:41:53, 19.20s/it] {'loss': 0.0459, 'grad_norm': 6.547929185971823, 'learning_rate': 1.1805879608026131e-07, 'completion_length': 185.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.7148809731006622, 'rewards/format_reward': 1.0, 'reward': 1.7148810625076294, 'reward_std': 0.06785714626312256, 'kl': 1.14453125, 'epoch': 0.88} 88%|████████▊ | 3780/4286 [23:47:46<2:41:53, 19.20s/it] 88%|████████▊ | 3781/4286 [23:48:04<2:37:33, 18.72s/it] {'loss': 0.046, 'grad_norm': 13.30700697276144, 'learning_rate': 1.1782547830144657e-07, 'completion_length': 156.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.755952537059784, 'reward_std': 0.09204822219908237, 'kl': 1.15234375, 'epoch': 0.88} 88%|████████▊ | 3781/4286 [23:48:04<2:37:33, 18.72s/it] 88%|████████▊ | 3782/4286 [23:48:24<2:39:11, 18.95s/it] {'loss': 0.0431, 'grad_norm': 7.379180910026845, 'learning_rate': 1.1759216052263182e-07, 'completion_length': 180.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.12088929582387209, 'kl': 1.078125, 'epoch': 0.88} 88%|████████▊ | 3782/4286 [23:48:24<2:39:11, 
18.95s/it] 88%|████████▊ | 3783/4286 [23:48:43<2:40:03, 19.09s/it] {'loss': 0.0388, 'grad_norm': 7.88391038213434, 'learning_rate': 1.1735884274381708e-07, 'completion_length': 171.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.6116071939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5937501192092896, 'reward_std': 0.1458333395421505, 'kl': 0.96875, 'epoch': 0.88} 88%|████████▊ | 3783/4286 [23:48:43<2:40:03, 19.09s/it] 88%|████████▊ | 3784/4286 [23:49:01<2:38:18, 18.92s/it] {'loss': 0.0393, 'grad_norm': 13.352936509947163, 'learning_rate': 1.1712552496500233e-07, 'completion_length': 178.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5699405372142792, 'rewards/format_reward': 1.0, 'reward': 1.5699405670166016, 'reward_std': 0.09346423484385014, 'kl': 0.98046875, 'epoch': 0.88} 88%|████████▊ | 3784/4286 [23:49:01<2:38:18, 18.92s/it] 88%|████████▊ | 3785/4286 [23:49:20<2:35:56, 18.68s/it] {'loss': 0.0288, 'grad_norm': 17.342567237396974, 'learning_rate': 1.1689220718618759e-07, 'completion_length': 179.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.7645833492279053, 'rewards/format_reward': 1.0, 'reward': 1.7645834684371948, 'reward_std': 0.06842948496341705, 'kl': 0.71826171875, 'epoch': 0.88} 88%|████████▊ | 3785/4286 [23:49:20<2:35:56, 18.68s/it] 88%|████████▊ | 3786/4286 [23:49:40<2:40:31, 19.26s/it] {'loss': 0.0472, 'grad_norm': 2.7920593522336112, 'learning_rate': 1.1665888940737284e-07, 'completion_length': 179.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6160714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5982143878936768, 'reward_std': 0.13807234540581703, 'kl': 1.1796875, 'epoch': 0.88} 88%|████████▊ | 3786/4286 [23:49:40<2:40:31, 19.26s/it] 88%|████████▊ | 3787/4286 [23:49:59<2:38:11, 19.02s/it] {'loss': 0.0665, 'grad_norm': 5.968315426573075, 'learning_rate': 1.1642557162855809e-07, 'completion_length': 163.51786041259766, 'rewards/only_full_func_accuracy_reward': 
0.729166716337204, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7113096714019775, 'reward_std': 0.11182790715247393, 'kl': 1.6640625, 'epoch': 0.88} 88%|████████▊ | 3787/4286 [23:49:59<2:38:11, 19.02s/it]
88%|████████▊ | 3788/4286 [23:50:17<2:37:17, 18.95s/it] {'loss': 0.026, 'grad_norm': 6.23672912992248, 'learning_rate': 1.1619225384974335e-07, 'completion_length': 191.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.6839286386966705, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6660714745521545, 'reward_std': 0.07619047537446022, 'kl': 0.6494140625, 'epoch': 0.88}
88%|████████▊ | 3789/4286 [23:50:35<2:34:23, 18.64s/it] {'loss': 0.0111, 'grad_norm': 6.091805995358802, 'learning_rate': 1.159589360709286e-07, 'completion_length': 185.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.6666666865348816, 'rewards/format_reward': 1.0, 'reward': 1.6666668057441711, 'reward_std': 0.061365483328700066, 'kl': 0.27734375, 'epoch': 0.88}
[2025-03-03 04:58:13,222] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
88%|████████▊ | 3790/4286 [23:50:57<2:42:21, 19.64s/it] {'loss': 0.0404, 'grad_norm': 3.1827410713927904, 'learning_rate': 1.1572561829211386e-07, 'completion_length': 180.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.6488096117973328, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6309524774551392, 'reward_std': 0.12322444841265678, 'kl': 1.01171875, 'epoch': 0.88}
88%|████████▊ | 3791/4286 [23:51:15<2:37:02, 19.04s/it] {'loss': 0.0485, 'grad_norm': 5.514143425638623, 'learning_rate': 1.1549230051329911e-07, 'completion_length': 174.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7113096117973328, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6934524774551392, 'reward_std': 0.13371490687131882, 'kl': 1.2109375, 'epoch': 0.88}
88%|████████▊ | 3792/4286 [23:51:35<2:39:02, 19.32s/it] {'loss': 0.0281, 'grad_norm': 8.286396072351735, 'learning_rate': 1.1525898273448437e-07, 'completion_length': 190.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7342262268066406, 'rewards/format_reward': 1.0, 'reward': 1.7342262864112854, 'reward_std': 0.05782270058989525, 'kl': 0.703125, 'epoch': 0.88}
88%|████████▊ | 3793/4286 [23:51:52<2:34:12, 18.77s/it] {'loss': 0.0266, 'grad_norm': 7.323454567526076, 'learning_rate': 1.1502566495566962e-07, 'completion_length': 176.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6919643878936768, 'reward_std': 0.08898505941033363, 'kl': 0.66357421875, 'epoch': 0.88}
89%|████████▊ | 3794/4286 [23:52:13<2:38:41,
{'loss': 0.037, 'grad_norm': 3.3168507651397188, 'learning_rate': 1.1479234717685488e-07, 'completion_length': 182.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.4702381491661072, 'rewards/format_reward': 1.0, 'reward': 1.470238208770752, 'reward_std': 0.08333333674818277, 'kl': 0.92578125, 'epoch': 0.89} 89%|████████▊ | 3794/4286 [23:52:13<2:38:41, 19.35s/it]
[steps 3795-3885 condensed; epoch 0.89 through step 3835, 0.90 from step 3836, 0.91 from step 3879]
  loss 0.0067-0.1085; grad_norm mostly 1.0-29, with spikes of 67.0 (step 3844), 65.9 (step 3884) and 112.1 (step 3885); learning_rate decaying linearly 1.148e-07 -> 9.333e-08; completion_length 156.6-213.1; rewards/only_full_func_accuracy_reward 0.452-0.798; rewards/format_reward 1.0 on most steps, minimum 0.9286 (step 3879); reward 1.434-1.798; reward_std 0.009-0.217; kl 0.167-2.707; step time 18-21 s/it throughout, except step 3801 at 87.37 s/it
[2025-03-03 05:29:22,629] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2025-03-03 05:32:06,588] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.0713, 'grad_norm': 4.258602789193722, 'learning_rate': 9.332711152589826e-08, 'completion_length': 193.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.6053571701049805, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5875000953674316, 'reward_std': 0.1627693884074688, 'kl': 1.77734375, 'epoch': 0.91} 91%|█████████ | 3886/4286 [24:25:50<2:12:59, 19.95s/it]
91%|█████████ | 3887/4286 [24:26:09<2:10:14,
19.58s/it] {'loss': 0.0364, 'grad_norm': 8.825600509273135, 'learning_rate': 9.309379374708353e-08, 'completion_length': 196.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6145833432674408, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5967263579368591, 'reward_std': 0.15301262587308884, 'kl': 0.91015625, 'epoch': 0.91} 91%|█████████ | 3887/4286 [24:26:09<2:10:14, 19.58s/it] 91%|█████████ | 3888/4286 [24:26:27<2:06:46, 19.11s/it] {'loss': 0.0593, 'grad_norm': 9.178186882995513, 'learning_rate': 9.286047596826877e-08, 'completion_length': 177.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.800637811422348, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7827807068824768, 'reward_std': 0.20727132260799408, 'kl': 1.48046875, 'epoch': 0.91} 91%|█████████ | 3888/4286 [24:26:27<2:06:46, 19.11s/it] 91%|█████████ | 3889/4286 [24:26:47<2:08:02, 19.35s/it] {'loss': 0.0309, 'grad_norm': 12.231592395235836, 'learning_rate': 9.262715818945404e-08, 'completion_length': 178.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.66952845454216, 'rewards/format_reward': 1.0, 'reward': 1.6695284843444824, 'reward_std': 0.06811443716287613, 'kl': 0.771484375, 'epoch': 0.91} 91%|█████████ | 3889/4286 [24:26:47<2:08:02, 19.35s/it] 91%|█████████ | 3890/4286 [24:27:09<2:13:17, 20.20s/it] {'loss': 0.0733, 'grad_norm': 1.5463137242693354, 'learning_rate': 9.239384041063929e-08, 'completion_length': 194.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.558035746216774, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5223215818405151, 'reward_std': 0.1527065671980381, 'kl': 1.83203125, 'epoch': 0.91} 91%|█████████ | 3890/4286 [24:27:09<2:13:17, 20.20s/it] 91%|█████████ | 3891/4286 [24:27:28<2:10:49, 19.87s/it] {'loss': 0.0316, 'grad_norm': 2.890947011994019, 'learning_rate': 9.216052263182455e-08, 'completion_length': 188.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.739583432674408, 'rewards/format_reward': 
1.0, 'reward': 1.739583432674408, 'reward_std': 0.06616456620395184, 'kl': 0.794921875, 'epoch': 0.91} 91%|█████████ | 3891/4286 [24:27:28<2:10:49, 19.87s/it] 91%|█████████ | 3892/4286 [24:27:46<2:07:26, 19.41s/it] {'loss': 0.0095, 'grad_norm': 1.4704134906181592, 'learning_rate': 9.19272048530098e-08, 'completion_length': 192.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.6279762089252472, 'rewards/format_reward': 1.0, 'reward': 1.6279762983322144, 'reward_std': 0.0535714328289032, 'kl': 0.23779296875, 'epoch': 0.91} 91%|█████████ | 3892/4286 [24:27:46<2:07:26, 19.41s/it] 91%|█████████ | 3893/4286 [24:28:05<2:05:51, 19.22s/it] {'loss': 0.0203, 'grad_norm': 5.2609775920087145, 'learning_rate': 9.169388707419504e-08, 'completion_length': 191.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.7303571701049805, 'rewards/format_reward': 1.0, 'reward': 1.7303572297096252, 'reward_std': 0.02738095773383975, 'kl': 0.5068359375, 'epoch': 0.91} 91%|█████████ | 3893/4286 [24:28:05<2:05:51, 19.22s/it] 91%|█████████ | 3894/4286 [24:28:24<2:05:04, 19.14s/it] {'loss': 0.0071, 'grad_norm': 22.53728847669538, 'learning_rate': 9.14605692953803e-08, 'completion_length': 185.53572845458984, 'rewards/only_full_func_accuracy_reward': 0.7827381491661072, 'rewards/format_reward': 1.0, 'reward': 1.782738208770752, 'reward_std': 0.05541310831904411, 'kl': 0.17724609375, 'epoch': 0.91} 91%|█████████ | 3894/4286 [24:28:24<2:05:04, 19.14s/it] 91%|█████████ | 3895/4286 [24:28:49<2:16:03, 20.88s/it] {'loss': 0.0213, 'grad_norm': 3.5400976972783536, 'learning_rate': 9.122725151656556e-08, 'completion_length': 204.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.565476268529892, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5476191639900208, 'reward_std': 0.07511192560195923, 'kl': 0.5341796875, 'epoch': 0.91} 91%|█████████ | 3895/4286 [24:28:49<2:16:03, 20.88s/it] 91%|█████████ | 3896/4286 [24:29:08<2:11:50, 20.28s/it] {'loss': 0.0344, 'grad_norm': 
8.655946260328188, 'learning_rate': 9.099393373775082e-08, 'completion_length': 172.33928680419922, 'rewards/only_full_func_accuracy_reward': 0.553571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.09687451273202896, 'kl': 0.861328125, 'epoch': 0.91} 91%|█████████ | 3896/4286 [24:29:08<2:11:50, 20.28s/it] 91%|█████████ | 3897/4286 [24:29:27<2:08:31, 19.82s/it] {'loss': 0.0354, 'grad_norm': 2.916629792519084, 'learning_rate': 9.076061595893607e-08, 'completion_length': 174.85714721679688, 'rewards/only_full_func_accuracy_reward': 0.6220238208770752, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5863096117973328, 'reward_std': 0.13749704882502556, 'kl': 0.8857421875, 'epoch': 0.91} 91%|█████████ | 3897/4286 [24:29:27<2:08:31, 19.82s/it] 91%|█████████ | 3898/4286 [24:29:45<2:05:37, 19.43s/it] {'loss': 0.0213, 'grad_norm': 2.1407500402235744, 'learning_rate': 9.052729818012133e-08, 'completion_length': 179.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6273809969425201, 'rewards/format_reward': 1.0, 'reward': 1.6273810267448425, 'reward_std': 0.0522100105881691, 'kl': 0.5322265625, 'epoch': 0.91} 91%|█████████ | 3898/4286 [24:29:45<2:05:37, 19.43s/it] 91%|█████████ | 3899/4286 [24:30:05<2:05:23, 19.44s/it] {'loss': 0.0117, 'grad_norm': 6.359887163023536, 'learning_rate': 9.029398040130658e-08, 'completion_length': 185.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.7589285969734192, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.01785714365541935, 'kl': 0.2919921875, 'epoch': 0.91} 91%|█████████ | 3899/4286 [24:30:05<2:05:23, 19.44s/it] 91%|█████████ | 3900/4286 [24:30:25<2:07:00, 19.74s/it] {'loss': 0.0344, 'grad_norm': 2.112956564522627, 'learning_rate': 9.006066262249182e-08, 'completion_length': 184.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.7455357313156128, 'rewards/format_reward': 1.0, 'reward': 1.7455358505249023, 'reward_std': 
0.05818403512239456, 'kl': 0.859375, 'epoch': 0.91} 91%|█████████ | 3900/4286 [24:30:25<2:07:00, 19.74s/it] 91%|█████████ | 3901/4286 [24:33:59<8:21:11, 78.11s/it] {'loss': 0.0398, 'grad_norm': 2.679535756596411, 'learning_rate': 8.982734484367709e-08, 'completion_length': 186.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.08776525594294071, 'kl': 0.994140625, 'epoch': 0.91} 91%|█████████ | 3901/4286 [24:33:59<8:21:11, 78.11s/it] 91%|█████████ | 3902/4286 [24:34:18<6:25:29, 60.23s/it] {'loss': 0.0375, 'grad_norm': 5.619273927634611, 'learning_rate': 8.959402706486234e-08, 'completion_length': 176.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.7886905074119568, 'rewards/format_reward': 1.0, 'reward': 1.7886905670166016, 'reward_std': 0.07513973861932755, 'kl': 0.9384765625, 'epoch': 0.91} 91%|█████████ | 3902/4286 [24:34:18<6:25:29, 60.23s/it] 91%|█████████ | 3903/4286 [24:34:38<5:07:28, 48.17s/it] {'loss': 0.127, 'grad_norm': 6.251325109568017, 'learning_rate': 8.93607092860476e-08, 'completion_length': 206.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6208333969116211, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5851191282272339, 'reward_std': 0.2341710776090622, 'kl': 3.1796875, 'epoch': 0.91} 91%|█████████ | 3903/4286 [24:34:38<5:07:28, 48.17s/it] 91%|█████████ | 3904/4286 [24:34:59<4:14:32, 39.98s/it] {'loss': 0.0457, 'grad_norm': 4.201518030299883, 'learning_rate': 8.912739150723285e-08, 'completion_length': 188.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6056548953056335, 'reward_std': 0.09736541984602809, 'kl': 1.14697265625, 'epoch': 0.91} 91%|█████████ | 3904/4286 [24:34:59<4:14:32, 39.98s/it] 91%|█████████ | 3905/4286 [24:35:16<3:30:57, 33.22s/it] {'loss': 0.0076, 'grad_norm': 0.12592334235654362, 'learning_rate': 
8.889407372841811e-08, 'completion_length': 158.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.0, 'kl': 0.1904296875, 'epoch': 0.91} 91%|█████████ | 3905/4286 [24:35:16<3:30:57, 33.22s/it] 91%|█████████ | 3906/4286 [24:35:35<3:02:05, 28.75s/it] {'loss': 0.0461, 'grad_norm': 9.819191172263292, 'learning_rate': 8.866075594960336e-08, 'completion_length': 179.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6580357849597931, 'rewards/format_reward': 1.0, 'reward': 1.6580358743667603, 'reward_std': 0.06859046686440706, 'kl': 1.1513671875, 'epoch': 0.91} 91%|█████████ | 3906/4286 [24:35:35<3:02:05, 28.75s/it] 91%|█████████ | 3907/4286 [24:35:53<2:42:10, 25.67s/it] {'loss': 0.0093, 'grad_norm': 1.7373526187174884, 'learning_rate': 8.842743817078862e-08, 'completion_length': 191.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6101190745830536, 'rewards/format_reward': 1.0, 'reward': 1.6101191639900208, 'reward_std': 0.01785714365541935, 'kl': 0.2333984375, 'epoch': 0.91} 91%|█████████ | 3907/4286 [24:35:53<2:42:10, 25.67s/it] 91%|█████████ | 3908/4286 [24:36:13<2:31:10, 24.00s/it] {'loss': 0.0284, 'grad_norm': 3.6324624765770106, 'learning_rate': 8.819412039197387e-08, 'completion_length': 202.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.7395833730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7217263579368591, 'reward_std': 0.08723037410527468, 'kl': 0.708984375, 'epoch': 0.91} 91%|█████████ | 3908/4286 [24:36:13<2:31:10, 24.00s/it] 91%|█████████ | 3909/4286 [24:36:35<2:26:24, 23.30s/it] {'loss': 0.0618, 'grad_norm': 5.393304950079612, 'learning_rate': 8.796080261315912e-08, 'completion_length': 187.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.574404776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.55654776096344, 'reward_std': 0.11167282424867153, 'kl': 1.541015625, 'epoch': 
0.91} 91%|█████████ | 3909/4286 [24:36:35<2:26:24, 23.30s/it] 91%|█████████ | 3910/4286 [24:36:55<2:20:46, 22.46s/it] {'loss': 0.0285, 'grad_norm': 25.707801950302205, 'learning_rate': 8.772748483434438e-08, 'completion_length': 189.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.5835034251213074, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5477891564369202, 'reward_std': 0.0812925212085247, 'kl': 0.7099609375, 'epoch': 0.91} 91%|█████████ | 3910/4286 [24:36:55<2:20:46, 22.46s/it] 91%|█████████▏| 3911/4286 [24:37:16<2:17:29, 22.00s/it] {'loss': 0.0534, 'grad_norm': 824.8692365061303, 'learning_rate': 8.749416705552963e-08, 'completion_length': 186.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7034438848495483, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6855868101119995, 'reward_std': 0.15189924836158752, 'kl': 1.3359375, 'epoch': 0.91} 91%|█████████▏| 3911/4286 [24:37:16<2:17:29, 22.00s/it] 91%|█████████▏| 3912/4286 [24:37:35<2:11:14, 21.05s/it] {'loss': 0.0306, 'grad_norm': 1.4324965535256966, 'learning_rate': 8.726084927671489e-08, 'completion_length': 194.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.49851194024086, 'rewards/format_reward': 1.0, 'reward': 1.4985119700431824, 'reward_std': 0.0267857164144516, 'kl': 0.765625, 'epoch': 0.91} 91%|█████████▏| 3912/4286 [24:37:35<2:11:14, 21.05s/it][2025-03-03 05:45:10,737] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 91%|█████████▏| 3913/4286 [24:37:55<2:08:34, 20.68s/it] {'loss': 0.0213, 'grad_norm': 2.702591899159047, 'learning_rate': 8.702753149790014e-08, 'completion_length': 170.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.688873678445816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6710165739059448, 'reward_std': 0.0754755362868309, 'kl': 0.53369140625, 'epoch': 0.91} 91%|█████████▏| 3913/4286 [24:37:55<2:08:34, 20.68s/it] 91%|█████████▏| 3914/4286 [24:38:13<2:04:16, 20.04s/it] {'loss': 0.0083, 'grad_norm': 0.9083066342185975, 'learning_rate': 8.67942137190854e-08, 'completion_length': 193.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.8351190984249115, 'rewards/format_reward': 1.0, 'reward': 1.8351192474365234, 'reward_std': 0.03046244941651821, 'kl': 0.20751953125, 'epoch': 0.91} 91%|█████████▏| 3914/4286 [24:38:13<2:04:16, 20.04s/it] 91%|█████████▏| 3915/4286 [24:38:32<2:00:48, 19.54s/it] {'loss': 0.0301, 'grad_norm': 14.927222165894873, 'learning_rate': 8.656089594027065e-08, 'completion_length': 187.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.05255779065191746, 'kl': 0.755859375, 'epoch': 0.91} 91%|█████████▏| 3915/4286 [24:38:32<2:00:48, 19.54s/it] 91%|█████████▏| 3916/4286 [24:38:51<1:59:26, 19.37s/it] {'loss': 0.0065, 'grad_norm': 1.6390370533562424, 'learning_rate': 8.63275781614559e-08, 'completion_length': 188.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.653869092464447, 'rewards/format_reward': 1.0, 'reward': 1.6538691520690918, 'reward_std': 0.06090506911277771, 'kl': 0.16357421875, 'epoch': 0.91} 91%|█████████▏| 3916/4286 [24:38:51<1:59:26, 19.37s/it] 91%|█████████▏| 3917/4286 [24:39:10<1:58:03, 19.20s/it] 
{'loss': 0.0072, 'grad_norm': 24.465413361225316, 'learning_rate': 8.609426038264116e-08, 'completion_length': 195.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.05059523694217205, 'kl': 0.18017578125, 'epoch': 0.91} 91%|█████████▏| 3917/4286 [24:39:10<1:58:03, 19.20s/it] 91%|█████████▏| 3918/4286 [24:39:29<1:58:07, 19.26s/it] {'loss': 0.0157, 'grad_norm': 1.2725818509497484, 'learning_rate': 8.586094260382641e-08, 'completion_length': 180.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.684736430644989, 'rewards/format_reward': 1.0, 'reward': 1.684736430644989, 'reward_std': 0.06536483392119408, 'kl': 0.39013671875, 'epoch': 0.91} 91%|█████████▏| 3918/4286 [24:39:29<1:58:07, 19.26s/it] 91%|█████████▏| 3919/4286 [24:39:47<1:56:21, 19.02s/it] {'loss': 0.0084, 'grad_norm': 6.349475300437603, 'learning_rate': 8.562762482501167e-08, 'completion_length': 173.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6383929252624512, 'rewards/format_reward': 1.0, 'reward': 1.638392984867096, 'reward_std': 0.08311965689063072, 'kl': 0.208984375, 'epoch': 0.91} 91%|█████████▏| 3919/4286 [24:39:47<1:56:21, 19.02s/it] 91%|█████████▏| 3920/4286 [24:40:06<1:54:53, 18.83s/it] {'loss': 0.0144, 'grad_norm': 2.1232767097587164, 'learning_rate': 8.539430704619692e-08, 'completion_length': 178.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.7172619700431824, 'rewards/format_reward': 1.0, 'reward': 1.7172619700431824, 'reward_std': 0.043508341535925865, 'kl': 0.36083984375, 'epoch': 0.91} 91%|█████████▏| 3920/4286 [24:40:06<1:54:53, 18.83s/it] 91%|█████████▏| 3921/4286 [24:40:27<1:58:35, 19.49s/it] {'loss': 0.0454, 'grad_norm': 3.4185147375444305, 'learning_rate': 8.516098926738218e-08, 'completion_length': 189.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.51488097012043, 'rewards/format_reward': 0.9642857313156128, 'reward': 
1.4791668057441711, 'reward_std': 0.14880953170359135, 'kl': 1.138671875, 'epoch': 0.91} 91%|█████████▏| 3921/4286 [24:40:27<1:58:35, 19.49s/it] 92%|█████████▏| 3922/4286 [24:40:45<1:55:11, 18.99s/it] {'loss': 0.0785, 'grad_norm': 1.8217302182526138, 'learning_rate': 8.492767148856743e-08, 'completion_length': 178.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.7157738506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6979168057441711, 'reward_std': 0.1755952425301075, 'kl': 1.9609375, 'epoch': 0.92} 92%|█████████▏| 3922/4286 [24:40:45<1:55:11, 18.99s/it] 92%|█████████▏| 3923/4286 [24:41:03<1:52:54, 18.66s/it] {'loss': 0.0267, 'grad_norm': 1.0105177170995305, 'learning_rate': 8.469435370975268e-08, 'completion_length': 160.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7261905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7083334922790527, 'reward_std': 0.09707977250218391, 'kl': 0.666015625, 'epoch': 0.92} 92%|█████████▏| 3923/4286 [24:41:03<1:52:54, 18.66s/it] 92%|█████████▏| 3924/4286 [24:41:21<1:51:46, 18.53s/it] {'loss': 0.0308, 'grad_norm': 5.158995836381322, 'learning_rate': 8.446103593093794e-08, 'completion_length': 174.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.712670087814331, 'rewards/format_reward': 1.0, 'reward': 1.712670087814331, 'reward_std': 0.022746493108570576, 'kl': 0.76953125, 'epoch': 0.92} 92%|█████████▏| 3924/4286 [24:41:21<1:51:46, 18.53s/it] 92%|█████████▏| 3925/4286 [24:41:41<1:54:06, 18.97s/it] {'loss': 0.0217, 'grad_norm': 4.660383782466829, 'learning_rate': 8.422771815212319e-08, 'completion_length': 199.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5818453431129456, 'reward_std': 0.10005595907568932, 'kl': 0.541015625, 'epoch': 0.92} 92%|█████████▏| 3925/4286 [24:41:41<1:54:06, 18.97s/it] 92%|█████████▏| 3926/4286 [24:41:58<1:51:24, 18.57s/it] {'loss': 0.0459, 
'grad_norm': 0.9877967328392877, 'learning_rate': 8.399440037330845e-08, 'completion_length': 174.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.68154776096344, 'reward_std': 0.08928571827709675, 'kl': 1.14306640625, 'epoch': 0.92} 92%|█████████▏| 3926/4286 [24:41:58<1:51:24, 18.57s/it] 92%|█████████▏| 3927/4286 [24:42:19<1:55:03, 19.23s/it] {'loss': 0.0427, 'grad_norm': 2.994980185613234, 'learning_rate': 8.37610825944937e-08, 'completion_length': 186.25001525878906, 'rewards/only_full_func_accuracy_reward': 0.6428572535514832, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.607142984867096, 'reward_std': 0.1606309935450554, 'kl': 1.0673828125, 'epoch': 0.92} 92%|█████████▏| 3927/4286 [24:42:19<1:55:03, 19.23s/it] 92%|█████████▏| 3928/4286 [24:42:37<1:51:58, 18.77s/it] {'loss': 0.0174, 'grad_norm': 1.3652244309672839, 'learning_rate': 8.352776481567896e-08, 'completion_length': 173.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.8053571283817291, 'rewards/format_reward': 1.0, 'reward': 1.8053572177886963, 'reward_std': 0.05427197366952896, 'kl': 0.4345703125, 'epoch': 0.92} 92%|█████████▏| 3928/4286 [24:42:37<1:51:58, 18.77s/it] 92%|█████████▏| 3929/4286 [24:43:00<1:58:40, 19.95s/it] {'loss': 0.0258, 'grad_norm': 24.861625635391555, 'learning_rate': 8.329444703686421e-08, 'completion_length': 187.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.6468962728977203, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6290391683578491, 'reward_std': 0.12070680782198906, 'kl': 0.6435546875, 'epoch': 0.92} 92%|█████████▏| 3929/4286 [24:43:00<1:58:40, 19.95s/it] 92%|█████████▏| 3930/4286 [24:43:18<1:55:48, 19.52s/it] {'loss': 0.0211, 'grad_norm': 6.730482007769558, 'learning_rate': 8.306112925804947e-08, 'completion_length': 181.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6943452656269073, 'rewards/format_reward': 1.0, 'reward': 1.6943452954292297, 
'reward_std': 0.06234565004706383, 'kl': 0.5283203125, 'epoch': 0.92} 92%|█████████▏| 3930/4286 [24:43:18<1:55:48, 19.52s/it] 92%|█████████▏| 3931/4286 [24:43:36<1:52:14, 18.97s/it] {'loss': 0.0491, 'grad_norm': 4.582536055635514, 'learning_rate': 8.282781147923472e-08, 'completion_length': 171.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.4761904627084732, 'rewards/format_reward': 1.0, 'reward': 1.4761906862258911, 'reward_std': 0.0650488380342722, 'kl': 1.2265625, 'epoch': 0.92} 92%|█████████▏| 3931/4286 [24:43:36<1:52:14, 18.97s/it] 92%|█████████▏| 3932/4286 [24:43:57<1:55:31, 19.58s/it] {'loss': 0.0308, 'grad_norm': 3.669897501534379, 'learning_rate': 8.259449370041997e-08, 'completion_length': 181.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.7002976536750793, 'rewards/format_reward': 1.0, 'reward': 1.7002977132797241, 'reward_std': 0.10659919679164886, 'kl': 0.76953125, 'epoch': 0.92} 92%|█████████▏| 3932/4286 [24:43:57<1:55:31, 19.58s/it] 92%|█████████▏| 3933/4286 [24:44:15<1:52:52, 19.19s/it] {'loss': 0.0151, 'grad_norm': 10.674309998547365, 'learning_rate': 8.236117592160523e-08, 'completion_length': 182.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.7071428894996643, 'rewards/format_reward': 1.0, 'reward': 1.707142949104309, 'reward_std': 0.05660358443856239, 'kl': 0.376953125, 'epoch': 0.92} 92%|█████████▏| 3933/4286 [24:44:15<1:52:52, 19.19s/it] 92%|█████████▏| 3934/4286 [24:44:33<1:51:12, 18.96s/it] {'loss': 0.0093, 'grad_norm': 8.568639392241474, 'learning_rate': 8.212785814279048e-08, 'completion_length': 191.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.5044642984867096, 'rewards/format_reward': 1.0, 'reward': 1.5044643878936768, 'reward_std': 0.015801788307726383, 'kl': 0.2333984375, 'epoch': 0.92} 92%|█████████▏| 3934/4286 [24:44:33<1:51:12, 18.96s/it] 92%|█████████▏| 3935/4286 [24:44:52<1:49:54, 18.79s/it] {'loss': 0.0269, 'grad_norm': 1.5055994738526093, 'learning_rate': 8.189454036397574e-08, 
'completion_length': 151.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.7708334028720856, 'rewards/format_reward': 1.0, 'reward': 1.770833432674408, 'reward_std': 0.0654761902987957, 'kl': 0.6728515625, 'epoch': 0.92} 92%|█████████▏| 3935/4286 [24:44:52<1:49:54, 18.79s/it] 92%|█████████▏| 3936/4286 [24:45:10<1:48:51, 18.66s/it] {'loss': 0.0468, 'grad_norm': 14.313437832150184, 'learning_rate': 8.166122258516099e-08, 'completion_length': 163.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.4940476417541504, 'rewards/format_reward': 1.0, 'reward': 1.49404776096344, 'reward_std': 0.0654761977493763, 'kl': 1.171875, 'epoch': 0.92} 92%|█████████▏| 3936/4286 [24:45:10<1:48:51, 18.66s/it] 92%|█████████▏| 3937/4286 [24:45:30<1:49:57, 18.90s/it] {'loss': 0.0409, 'grad_norm': 1.5735395026727657, 'learning_rate': 8.142790480634625e-08, 'completion_length': 190.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6290391385555267, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6111820936203003, 'reward_std': 0.10774057544767857, 'kl': 1.0234375, 'epoch': 0.92} 92%|█████████▏| 3937/4286 [24:45:30<1:49:57, 18.90s/it][2025-03-03 05:53:05,617] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 92%|█████████▏| 3938/4286 [24:45:50<1:51:39, 19.25s/it] {'loss': 0.0185, 'grad_norm': 1.8001212915077622, 'learning_rate': 8.11945870275315e-08, 'completion_length': 165.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.6413690745830536, 'rewards/format_reward': 1.0, 'reward': 1.6413691639900208, 'reward_std': 0.10692918300628662, 'kl': 0.462890625, 'epoch': 0.92} 92%|█████████▏| 3938/4286 [24:45:50<1:51:39, 19.25s/it] 92%|█████████▏| 3939/4286 [24:46:11<1:55:39, 20.00s/it] {'loss': 0.0391, 'grad_norm': 1.9688140614464498, 'learning_rate': 8.096126924871675e-08, 'completion_length': 192.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 1.0, 'reward': 1.6235120296478271, 'reward_std': 0.06434167549014091, 'kl': 0.974609375, 'epoch': 0.92} 92%|█████████▏| 3939/4286 [24:46:11<1:55:39, 20.00s/it] 92%|█████████▏| 3940/4286 [24:46:30<1:52:46, 19.56s/it] {'loss': 0.0177, 'grad_norm': 2.3000892619196103, 'learning_rate': 8.072795146990201e-08, 'completion_length': 181.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6875000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6875000596046448, 'reward_std': 0.024890122935175896, 'kl': 0.44140625, 'epoch': 0.92} 92%|█████████▏| 3940/4286 [24:46:30<1:52:46, 19.56s/it] 92%|█████████▏| 3941/4286 [24:46:48<1:50:06, 19.15s/it] {'loss': 0.0143, 'grad_norm': 3.359443953597997, 'learning_rate': 8.049463369108726e-08, 'completion_length': 191.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.647321492433548, 'rewards/format_reward': 1.0, 'reward': 1.6473215818405151, 'reward_std': 0.023049292620271444, 'kl': 0.35791015625, 'epoch': 0.92} 92%|█████████▏| 3941/4286 [24:46:48<1:50:06, 19.15s/it] 92%|█████████▏| 3942/4286 [24:47:07<1:48:29, 18.92s/it] {'loss': 0.0387, 
'grad_norm': 3.76868318542232, 'learning_rate': 8.026131591227252e-08, 'completion_length': 182.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.6502976417541504, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.614583432674408, 'reward_std': 0.15498051792383194, 'kl': 0.9638671875, 'epoch': 0.92} 92%|█████████▏| 3942/4286 [24:47:07<1:48:29, 18.92s/it]
92%|█████████▏| 3943/4286 [24:47:27<1:50:28, 19.33s/it] {'loss': 0.076, 'grad_norm': 2.638343913515316, 'learning_rate': 8.002799813345777e-08, 'completion_length': 173.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7038691639900208, 'reward_std': 0.17270240560173988, 'kl': 1.9013671875, 'epoch': 0.92}
92%|█████████▏| 3944/4286 [24:47:45<1:48:14, 18.99s/it] {'loss': 0.0246, 'grad_norm': 144.17065117075657, 'learning_rate': 7.979468035464303e-08, 'completion_length': 185.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.780952513217926, 'rewards/format_reward': 1.0, 'reward': 1.780952513217926, 'reward_std': 0.03917090967297554, 'kl': 0.6162109375, 'epoch': 0.92}
92%|█████████▏| 3945/4286 [24:48:03<1:46:34, 18.75s/it] {'loss': 0.0583, 'grad_norm': 5.387912472811602, 'learning_rate': 7.956136257582828e-08, 'completion_length': 183.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904763579368591, 'reward_std': 0.06136547960340977, 'kl': 1.45703125, 'epoch': 0.92}
92%|█████████▏| 3946/4286 [24:48:22<1:46:26, 18.78s/it] {'loss': 0.0604, 'grad_norm': 10.006719382790676, 'learning_rate': 7.932804479701353e-08, 'completion_length': 174.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.5937500596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.575892984867096, 'reward_std': 0.14511757344007492, 'kl': 1.51171875, 'epoch': 0.92}
92%|█████████▏| 3947/4286 [24:48:41<1:46:14, 18.80s/it] {'loss': 0.0824, 'grad_norm': 5.509051561173859, 'learning_rate': 7.909472701819879e-08, 'completion_length': 180.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7160714566707611, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6982142925262451, 'reward_std': 0.19514330197125673, 'kl': 2.0595703125, 'epoch': 0.92}
92%|█████████▏| 3948/4286 [24:49:01<1:48:49, 19.32s/it] {'loss': 0.0239, 'grad_norm': 6.908183540600342, 'learning_rate': 7.886140923938404e-08, 'completion_length': 188.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6465774476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6287203431129456, 'reward_std': 0.05275922268629074, 'kl': 0.59814453125, 'epoch': 0.92}
92%|█████████▏| 3949/4286 [24:49:22<1:50:25, 19.66s/it] {'loss': 0.0233, 'grad_norm': 2.796539328081548, 'learning_rate': 7.86280914605693e-08, 'completion_length': 187.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6770833730697632, 'rewards/format_reward': 1.0, 'reward': 1.677083432674408, 'reward_std': 0.0625, 'kl': 0.583984375, 'epoch': 0.92}
92%|█████████▏| 3950/4286 [24:49:41<1:49:45, 19.60s/it] {'loss': 0.0315, 'grad_norm': 5.121252238955014, 'learning_rate': 7.839477368175455e-08, 'completion_length': 186.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6324405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6145834922790527, 'reward_std': 0.10926829092204571, 'kl': 0.78515625, 'epoch': 0.92}
92%|█████████▏| 3951/4286 [24:50:02<1:50:50, 19.85s/it] {'loss': 0.0842, 'grad_norm': 6.918625766284662, 'learning_rate': 7.816145590293981e-08, 'completion_length': 180.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.6520833671092987, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5985119342803955, 'reward_std': 0.1996680200099945, 'kl': 2.109375, 'epoch': 0.92}
92%|█████████▏| 3952/4286 [24:50:20<1:46:52, 19.20s/it] {'loss': 0.044, 'grad_norm': 3.1278692633702287, 'learning_rate': 7.792813812412505e-08, 'completion_length': 178.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6398809850215912, 'rewards/format_reward': 1.0, 'reward': 1.6398810744285583, 'reward_std': 0.17526131123304367, 'kl': 1.09521484375, 'epoch': 0.92}
92%|█████████▏| 3953/4286 [24:50:39<1:47:49, 19.43s/it] {'loss': 0.0441, 'grad_norm': 4.501765978818936, 'learning_rate': 7.76948203453103e-08, 'completion_length': 197.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.7142857909202576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6964287161827087, 'reward_std': 0.1607142947614193, 'kl': 1.09765625, 'epoch': 0.92}
92%|█████████▏| 3954/4286 [24:50:59<1:47:02, 19.34s/it] {'loss': 0.033, 'grad_norm': 8.965312548870498, 'learning_rate': 7.746150256649556e-08, 'completion_length': 167.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6011904776096344, 'rewards/format_reward': 1.0, 'reward': 1.6011906266212463, 'reward_std': 0.09162086248397827, 'kl': 0.822265625, 'epoch': 0.92}
92%|█████████▏| 3955/4286 [24:51:17<1:45:28, 19.12s/it] {'loss': 0.0374, 'grad_norm': 4.997200094184899, 'learning_rate': 7.722818478768081e-08, 'completion_length': 189.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.674702376127243, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6568453311920166, 'reward_std': 0.11471000500023365, 'kl': 0.9365234375, 'epoch': 0.92}
92%|█████████▏| 3956/4286 [24:51:36<1:45:09, 19.12s/it] {'loss': 0.0281, 'grad_norm': 7.78368446970251, 'learning_rate': 7.699486700886607e-08, 'completion_length': 177.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6086309850215912, 'rewards/format_reward': 1.0, 'reward': 1.6086310744285583, 'reward_std': 0.043225754983723164, 'kl': 0.703125, 'epoch': 0.92}
92%|█████████▏| 3957/4286 [24:51:58<1:48:24, 19.77s/it] {'loss': 0.0235, 'grad_norm': 4.127360410850334, 'learning_rate': 7.676154923005132e-08, 'completion_length': 193.87501525878906, 'rewards/only_full_func_accuracy_reward': 0.4995748698711395, 'rewards/format_reward': 1.0, 'reward': 1.499574899673462, 'reward_std': 0.03979505971074104, 'kl': 0.587890625, 'epoch': 0.92}
92%|█████████▏| 3958/4286 [24:52:17<1:47:59, 19.75s/it] {'loss': 0.0096, 'grad_norm': 1.57147631327067, 'learning_rate': 7.652823145123658e-08, 'completion_length': 177.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5818453431129456, 'reward_std': 0.06685744412243366, 'kl': 0.2392578125, 'epoch': 0.92}
92%|█████████▏| 3959/4286 [24:52:38<1:49:27, 20.08s/it] {'loss': 0.0803, 'grad_norm': 4.149534920955699, 'learning_rate': 7.629491367242183e-08, 'completion_length': 200.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6407738327980042, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6229168176651, 'reward_std': 0.14400488883256912, 'kl': 2.00390625, 'epoch': 0.92}
92%|█████████▏| 3960/4286 [24:53:01<1:52:50, 20.77s/it] {'loss': 0.0613, 'grad_norm': 3.977655371413928, 'learning_rate': 7.606159589360709e-08, 'completion_length': 189.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.5684524178504944, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5148810744285583, 'reward_std': 0.15795032680034637, 'kl': 1.53125, 'epoch': 0.92}
92%|█████████▏| 3961/4286 [24:53:19<1:49:02, 20.13s/it] {'loss': 0.0302, 'grad_norm': 17.133865575745794, 'learning_rate': 7.582827811479234e-08, 'completion_length': 184.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6622024476528168, 'rewards/format_reward': 1.0, 'reward': 1.6622024774551392, 'reward_std': 0.0565476231276989, 'kl': 0.75830078125, 'epoch': 0.92}
92%|█████████▏| 3962/4286 [24:53:41<1:52:00, 20.74s/it] {'loss': 0.0587, 'grad_norm': 5.319104598130357, 'learning_rate': 7.559496033597759e-08, 'completion_length': 196.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.6324405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6145834922790527, 'reward_std': 0.14540597796440125, 'kl': 1.46484375, 'epoch': 0.92}
92%|█████████▏| 3963/4286 [24:54:03<1:53:29, 21.08s/it] {'loss': 0.0376, 'grad_norm': 4.223083769503613, 'learning_rate': 7.536164255716285e-08, 'completion_length': 192.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6517857909202576, 'reward_std': 0.10386601462960243, 'kl': 0.935546875, 'epoch': 0.92}
92%|█████████▏| 3964/4286 [24:54:21<1:47:34, 20.05s/it] {'loss': 0.0335, 'grad_norm': 2.835872006204027, 'learning_rate': 7.51283247783481e-08, 'completion_length': 178.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.8392857909202576, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.12776251137256622, 'kl': 0.83984375, 'epoch': 0.92}
[2025-03-03 06:01:56,272] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
93%|█████████▎| 3965/4286 [24:54:40<1:46:22, 19.88s/it] {'loss': 0.0565, 'grad_norm': 8.500047012942742, 'learning_rate': 7.489500699953336e-08, 'completion_length': 185.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.09800061583518982, 'kl': 1.4169921875, 'epoch': 0.93}
93%|█████████▎| 3966/4286 [24:54:58<1:42:34, 19.23s/it] {'loss': 0.0514, 'grad_norm': 10.474982222459287, 'learning_rate': 7.466168922071861e-08, 'completion_length': 172.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.617559552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5997024774551392, 'reward_std': 0.12202381156384945, 'kl': 1.27880859375, 'epoch': 0.93}
93%|█████████▎| 3967/4286 [24:55:17<1:42:14, 19.23s/it] {'loss': 0.0182, 'grad_norm': 2.54141457288931, 'learning_rate': 7.442837144190387e-08, 'completion_length': 205.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5074405074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4895833730697632, 'reward_std': 0.044642859138548374, 'kl': 0.45458984375, 'epoch': 0.93}
93%|█████████▎| 3968/4286 [24:55:37<1:41:59, 19.24s/it] {'loss': 0.0419, 'grad_norm': 2.273583469653889, 'learning_rate': 7.419505366308912e-08, 'completion_length': 191.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.746471107006073, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.728614091873169, 'reward_std': 0.10014117695391178, 'kl': 1.048828125, 'epoch': 0.93}
93%|█████████▎| 3969/4286 [24:55:55<1:40:25, 19.01s/it] {'loss': 0.0071, 'grad_norm': 0.10784320962093727, 'learning_rate': 7.396173588427437e-08, 'completion_length': 187.80357360839844, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.6785715818405151, 'reward_std': 0.0, 'kl': 0.17822265625, 'epoch': 0.93}
93%|█████████▎| 3970/4286 [24:56:14<1:40:00, 18.99s/it] {'loss': 0.0075, 'grad_norm': 0.48391535436884103, 'learning_rate': 7.372841810545963e-08, 'completion_length': 191.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.680059552192688, 'rewards/format_reward': 1.0, 'reward': 1.6800596117973328, 'reward_std': 0.008928571827709675, 'kl': 0.1884765625, 'epoch': 0.93}
93%|█████████▎| 3971/4286 [24:56:32<1:38:40, 18.80s/it] {'loss': 0.008, 'grad_norm': 3.2936145536310844, 'learning_rate': 7.349510032664488e-08, 'completion_length': 191.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 1.0, 'reward': 1.6755953431129456, 'reward_std': 0.01785714365541935, 'kl': 0.20068359375, 'epoch': 0.93}
93%|█████████▎| 3972/4286 [24:56:54<1:42:13, 19.53s/it] {'loss': 0.0255, 'grad_norm': 2.8411932139594, 'learning_rate': 7.326178254783014e-08, 'completion_length': 194.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.7758928835391998, 'rewards/format_reward': 1.0, 'reward': 1.775892972946167, 'reward_std': 0.025806485675275326, 'kl': 0.63671875, 'epoch': 0.93}
93%|█████████▎| 3973/4286 [24:57:14<1:42:43, 19.69s/it] {'loss': 0.0089, 'grad_norm': 1.5354856471934832, 'learning_rate': 7.302846476901539e-08, 'completion_length': 179.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6383928954601288, 'rewards/format_reward': 1.0, 'reward': 1.6383929252624512, 'reward_std': 0.032738097012043, 'kl': 0.22216796875, 'epoch': 0.93}
[2025-03-03 06:04:50,119] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
93%|█████████▎| 3974/4286 [24:57:34<1:43:45, 19.95s/it] {'loss': 0.0526, 'grad_norm': 12.564936727951796, 'learning_rate': 7.279514699020065e-08, 'completion_length': 194.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.622619092464447, 'rewards/format_reward': 1.0, 'reward': 1.6226191520690918, 'reward_std': 0.10000000149011612, 'kl': 1.322265625, 'epoch': 0.93}
93%|█████████▎| 3975/4286 [24:57:52<1:39:55, 19.28s/it] {'loss': 0.0071, 'grad_norm': 0.3786063145345534, 'learning_rate': 7.25618292113859e-08, 'completion_length': 159.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.8452381491661072, 'rewards/format_reward': 1.0, 'reward': 1.845238208770752, 'reward_std': 0.011904764920473099, 'kl': 0.17724609375, 'epoch': 0.93}
93%|█████████▎| 3976/4286 [24:58:10<1:37:17, 18.83s/it] {'loss': 0.0133, 'grad_norm': 1.065301087375595, 'learning_rate': 7.232851143257115e-08, 'completion_length': 185.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.797619104385376, 'rewards/format_reward': 1.0, 'reward': 1.7976191639900208, 'reward_std': 0.04170814249664545, 'kl': 0.33154296875, 'epoch': 0.93}
93%|█████████▎| 3977/4286 [24:58:30<1:38:38, 19.16s/it] {'loss': 0.0231, 'grad_norm': 1.1793363343774577, 'learning_rate': 7.209519365375641e-08, 'completion_length': 177.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.783730149269104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7658731341362, 'reward_std': 0.07723850011825562, 'kl': 0.57958984375, 'epoch': 0.93}
93%|█████████▎| 3978/4286 [24:58:48<1:36:42, 18.84s/it] {'loss': 0.0117, 'grad_norm': 8.725240653890742, 'learning_rate': 7.186187587494166e-08, 'completion_length': 177.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.7008928656578064, 'rewards/format_reward': 1.0, 'reward': 1.7008929252624512, 'reward_std': 0.03365893755108118, 'kl': 0.29345703125, 'epoch': 0.93}
93%|█████████▎| 3979/4286 [24:59:09<1:40:13, 19.59s/it] {'loss': 0.047, 'grad_norm': 1.6581327068599967, 'learning_rate': 7.162855809612692e-08, 'completion_length': 202.78572845458984, 'rewards/only_full_func_accuracy_reward': 0.6406746208667755, 'rewards/format_reward': 1.0, 'reward': 1.6406747102737427, 'reward_std': 0.05734399892389774, 'kl': 1.173828125, 'epoch': 0.93}
93%|█████████▎| 3980/4286 [24:59:29<1:39:55, 19.59s/it] {'loss': 0.036, 'grad_norm': 1.7234282003088368, 'learning_rate': 7.139524031731217e-08, 'completion_length': 201.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5869047939777374, 'rewards/format_reward': 1.0, 'reward': 1.5869049429893494, 'reward_std': 0.05238096043467522, 'kl': 0.900390625, 'epoch': 0.93}
93%|█████████▎| 3981/4286 [24:59:47<1:37:59, 19.28s/it] {'loss': 0.0091, 'grad_norm': 6.346470856544942, 'learning_rate': 7.116192253849743e-08, 'completion_length': 177.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.857142984867096, 'reward_std': 0.049460723996162415, 'kl': 0.2265625, 'epoch': 0.93}
93%|█████████▎| 3982/4286 [25:00:06<1:36:59, 19.14s/it] {'loss': 0.0262, 'grad_norm': 1.381650139635416, 'learning_rate': 7.092860475968268e-08, 'completion_length': 176.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6476190984249115, 'rewards/format_reward': 1.0, 'reward': 1.6476191878318787, 'reward_std': 0.050970324780792, 'kl': 0.6533203125, 'epoch': 0.93}
93%|█████████▎| 3983/4286 [25:00:25<1:36:19, 19.07s/it] {'loss': 0.0387, 'grad_norm': 3.0224220587092203, 'learning_rate': 7.069528698086793e-08, 'completion_length': 192.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.6592262089252472, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6413692235946655, 'reward_std': 0.1517857275903225, 'kl': 0.970703125, 'epoch': 0.93}
93%|█████████▎| 3984/4286 [25:00:44<1:35:11, 18.91s/it] {'loss': 0.0131, 'grad_norm': 4.18409694684351, 'learning_rate': 7.046196920205319e-08, 'completion_length': 189.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6994048058986664, 'rewards/format_reward': 1.0, 'reward': 1.6994048357009888, 'reward_std': 0.02221459336578846, 'kl': 0.3271484375, 'epoch': 0.93}
93%|█████████▎| 3985/4286 [25:01:02<1:34:19, 18.80s/it] {'loss': 0.0323, 'grad_norm': 3.709171484381635, 'learning_rate': 7.022865142323844e-08, 'completion_length': 159.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7321428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142858505249023, 'reward_std': 0.06547619495540857, 'kl': 0.80810546875, 'epoch': 0.93}
93%|█████████▎| 3986/4286 [25:01:20<1:32:45, 18.55s/it] {'loss': 0.0105, 'grad_norm': 8.00154617264836, 'learning_rate': 6.99953336444237e-08, 'completion_length': 186.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6389881372451782, 'rewards/format_reward': 1.0, 'reward': 1.6389882564544678, 'reward_std': 0.03971596248447895, 'kl': 0.26318359375, 'epoch': 0.93}
93%|█████████▎| 3987/4286 [25:01:43<1:38:43, 19.81s/it] {'loss': 0.0165, 'grad_norm': 3.1645116655112893, 'learning_rate': 6.976201586560895e-08, 'completion_length': 211.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.5848214328289032, 'rewards/format_reward': 1.0, 'reward': 1.5848215222358704, 'reward_std': 0.10097679868340492, 'kl': 0.4140625, 'epoch': 0.93}
93%|█████████▎| 3988/4286 [25:02:01<1:35:28, 19.22s/it] {'loss': 0.037, 'grad_norm': 6.034979718418171, 'learning_rate': 6.952869808679421e-08, 'completion_length': 171.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.81101194024086, 'rewards/format_reward': 1.0, 'reward': 1.8110120296478271, 'reward_std': 0.06557507533580065, 'kl': 0.9248046875, 'epoch': 0.93}
93%|█████████▎| 3989/4286 [25:02:18<1:32:50, 18.76s/it] {'loss': 0.0078, 'grad_norm': 0.7539196905806368, 'learning_rate': 6.929538030797946e-08, 'completion_length': 159.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.693452388048172, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.01785714365541935, 'kl': 0.1943359375, 'epoch': 0.93}
93%|█████████▎| 3990/4286 [25:02:39<1:35:49, 19.42s/it] {'loss': 0.0585, 'grad_norm': 1.9850462060538754, 'learning_rate': 6.906206252916472e-08, 'completion_length': 194.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.7068452835083008, 'rewards/format_reward': 1.0, 'reward': 1.7068454027175903, 'reward_std': 0.08588216453790665, 'kl': 1.46875, 'epoch': 0.93}
93%|█████████▎| 3991/4286 [25:02:58<1:34:07, 19.14s/it] {'loss': 0.0277, 'grad_norm': 3.697018863297885, 'learning_rate': 6.882874475034997e-08, 'completion_length': 176.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.6488095819950104, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.04151602275669575, 'kl': 0.6923828125, 'epoch': 0.93}
93%|█████████▎| 3992/4286 [25:03:17<1:33:13, 19.03s/it] {'loss': 0.0352, 'grad_norm': 1.5993768410853886, 'learning_rate': 6.859542697153522e-08, 'completion_length': 197.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562501192092896, 'reward_std': 0.0431259706383571, 'kl': 0.87939453125, 'epoch': 0.93}
93%|█████████▎| 3993/4286 [25:03:39<1:38:26, 20.16s/it] {'loss': 0.0664, 'grad_norm': 2.9819275843806374, 'learning_rate': 6.836210919272048e-08, 'completion_length': 190.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6815477311611176, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6279762983322144, 'reward_std': 0.2085652481764555, 'kl': 1.66015625, 'epoch': 0.93}
93%|█████████▎| 3994/4286 [25:03:57<1:35:08, 19.55s/it] {'loss': 0.0202, 'grad_norm': 3.142199195615031, 'learning_rate': 6.812879141390573e-08, 'completion_length': 170.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6916667222976685, 'rewards/format_reward': 1.0, 'reward': 1.691666841506958, 'reward_std': 0.07439033314585686, 'kl': 0.505859375, 'epoch': 0.93}
93%|█████████▎| 3995/4286 [25:04:15<1:32:27, 19.06s/it] {'loss': 0.0341, 'grad_norm': 1.1428976804827196, 'learning_rate': 6.789547363509099e-08, 'completion_length': 175.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.566964328289032, 'rewards/format_reward': 1.0, 'reward': 1.5669644474983215, 'reward_std': 0.04464286006987095, 'kl': 0.85498046875, 'epoch': 0.93}
93%|█████████▎| 3996/4286 [25:04:34<1:32:11, 19.08s/it] {'loss': 0.0273, 'grad_norm': 3.3209042628770624, 'learning_rate': 6.766215585627624e-08, 'completion_length': 193.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.6434524357318878, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6255953907966614, 'reward_std': 0.05613857880234718, 'kl': 0.68603515625, 'epoch': 0.93}
93%|█████████▎| 3997/4286 [25:04:53<1:30:43, 18.84s/it] {'loss': 0.0351, 'grad_norm': 3.823212957240784, 'learning_rate': 6.74288380774615e-08, 'completion_length': 173.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.65327388048172, 'rewards/format_reward': 1.0, 'reward': 1.6532739400863647, 'reward_std': 0.04464286006987095, 'kl': 0.87890625, 'epoch': 0.93}
93%|█████████▎| 3998/4286 [25:05:12<1:30:17, 18.81s/it] {'loss': 0.0702, 'grad_norm': 2.3785918174773544, 'learning_rate': 6.719552029864675e-08, 'completion_length': 190.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6934524774551392, 'reward_std': 0.17541061714291573, 'kl': 1.75390625, 'epoch': 0.93}
93%|█████████▎| 3999/4286 [25:05:29<1:28:36, 18.52s/it] {'loss': 0.0784, 'grad_norm': 6.120375534459591, 'learning_rate': 6.6962202519832e-08, 'completion_length': 173.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.6205357909202576, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.602678656578064, 'reward_std': 0.17437347024679184, 'kl': 1.9609375, 'epoch': 0.93}
93%|█████████▎| 4000/4286 [25:05:48<1:28:00, 18.46s/it] {'loss': 0.007, 'grad_norm': 0.6186068384497159, 'learning_rate': 6.672888474101726e-08, 'completion_length': 179.05357360839844, 'rewards/only_full_func_accuracy_reward': 0.6220238208770752, 'rewards/format_reward': 1.0, 'reward': 1.62202388048172, 'reward_std': 0.010309826582670212, 'kl': 0.17431640625, 'epoch': 0.93}
93%|█████████▎| 4001/4286 [25:09:39<6:30:20, 82.18s/it] {'loss': 0.0317, 'grad_norm': 1.9495251582863622, 'learning_rate': 6.649556696220251e-08, 'completion_length': 176.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.8690477013587952, 'rewards/format_reward': 1.0, 'reward': 1.86904776096344, 'reward_std': 0.03715857304632664, 'kl': 0.79052734375, 'epoch': 0.93}
93%|█████████▎| 4002/4286 [25:09:59<5:01:25, 63.68s/it] {'loss': 0.0142, 'grad_norm': 7.245004763258995, 'learning_rate': 6.626224918338777e-08, 'completion_length': 190.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6758928894996643, 'rewards/format_reward': 1.0, 'reward': 1.6758930683135986, 'reward_std': 0.03863080684095621, 'kl': 0.35546875, 'epoch': 0.93}
93%|█████████▎| 4003/4286 [25:10:17<3:55:39, 49.96s/it] {'loss': 0.0343, 'grad_norm': 7.540624871606068, 'learning_rate': 6.602893140457302e-08, 'completion_length': 181.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6815476715564728, 'rewards/format_reward': 1.0, 'reward': 1.68154776096344, 'reward_std': 0.05952381528913975, 'kl': 0.857421875, 'epoch': 0.93}
93%|█████████▎| 4004/4286 [25:10:35<3:10:25, 40.52s/it] {'loss': 0.0382, 'grad_norm': 2.6665359461872975, 'learning_rate': 6.579561362575828e-08, 'completion_length': 196.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.7514881491661072, 'rewards/format_reward': 1.0, 'reward': 1.751488208770752, 'reward_std': 0.10281847417354584, 'kl': 0.953125, 'epoch': 0.93}
93%|█████████▎| 4005/4286 [25:10:55<2:40:20, 34.24s/it] {'loss': 0.0215, 'grad_norm': 29.59323339437053, 'learning_rate': 6.556229584694353e-08, 'completion_length': 202.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6394841969013214, 'rewards/format_reward': 1.0, 'reward': 1.6394842267036438, 'reward_std': 0.11137013882398605, 'kl': 0.537109375, 'epoch': 0.93}
93%|█████████▎| 4006/4286 [25:11:15<2:20:06, 30.02s/it] {'loss': 0.0784, 'grad_norm': 4.1683814614187, 'learning_rate': 6.532897806812878e-08, 'completion_length': 195.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.7217262387275696, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7038691639900208, 'reward_std': 0.14438674971461296, 'kl': 1.95703125, 'epoch': 0.93}
93%|█████████▎| 4007/4286 [25:11:34<2:03:22, 26.53s/it] {'loss': 0.0631, 'grad_norm': 3.564014883181247, 'learning_rate': 6.509566028931404e-08, 'completion_length': 184.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.6425595581531525, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6247025728225708, 'reward_std': 0.13980301097035408, 'kl': 1.57421875, 'epoch': 0.93}
94%|█████████▎| 4008/4286 [25:11:53<1:53:37, 24.52s/it] {'loss': 0.0784, 'grad_norm': 30.386592217239087, 'learning_rate': 6.486234251049929e-08, 'completion_length': 179.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5383928716182709, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.502678632736206, 'reward_std': 0.2056625634431839, 'kl': 1.9609375, 'epoch': 0.94}
94%|█████████▎| 4009/4286 [25:12:14<1:47:03, 23.19s/it] {'loss': 0.0253, 'grad_norm': 4.244886713696892, 'learning_rate': 6.462902473168455e-08, 'completion_length': 196.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.03571428172290325, 'kl': 0.6328125, 'epoch': 0.94}
94%|█████████▎| 4010/4286 [25:12:32<1:39:58, 21.73s/it] {'loss': 0.0363, 'grad_norm': 7.172355695224142, 'learning_rate': 6.43957069528698e-08, 'completion_length': 184.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.74851194024086, 'rewards/format_reward': 1.0, 'reward': 1.7485120296478271, 'reward_std': 0.07121489569544792, 'kl': 0.90673828125, 'epoch': 0.94}
94%|█████████▎| 4011/4286 [25:12:49<1:33:13, 20.34s/it] {'loss': 0.0328, 'grad_norm': 5.127644356638356, 'learning_rate': 6.416238917405506e-08, 'completion_length': 159.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.672619104385376, 'rewards/format_reward': 1.0, 'reward': 1.6726191639900208, 'reward_std': 0.0357142873108387, 'kl': 0.8203125, 'epoch': 0.94}
94%|█████████▎| 4012/4286 [25:13:09<1:32:07, 20.17s/it] {'loss': 0.0462, 'grad_norm': 2.1551482083137437, 'learning_rate': 6.392907139524031e-08, 'completion_length': 181.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.6636905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6636905670166016, 'reward_std': 0.0892857238650322, 'kl': 1.158203125, 'epoch': 0.94}
94%|█████████▎| 4013/4286 [25:13:28<1:30:24, 19.87s/it] {'loss': 0.0503, 'grad_norm': 4.742011761880992, 'learning_rate': 6.369575361642557e-08, 'completion_length': 177.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.6041666865348816, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5863096714019775, 'reward_std': 0.06983364000916481, 'kl': 1.26123046875, 'epoch': 0.94}
94%|█████████▎| 4014/4286 [25:13:47<1:28:58, 19.63s/it] {'loss': 0.0913, 'grad_norm': 2.362650816936511, 'learning_rate': 6.346243583761082e-08, 'completion_length': 177.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.5639880895614624, 'rewards/format_reward': 1.0, 'reward': 1.563988208770752, 'reward_std': 0.07908296957612038, 'kl': 2.2734375, 'epoch': 0.94}
94%|█████████▎| 4015/4286 [25:14:06<1:27:23, 19.35s/it] {'loss': 0.0486, 'grad_norm': 2.600578647845461, 'learning_rate': 6.322911805879607e-08, 'completion_length': 188.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7940477132797241, 'rewards/format_reward': 1.0, 'reward': 1.794047772884369, 'reward_std': 0.09030840918421745, 'kl': 1.2177734375, 'epoch': 0.94}
94%|█████████▎| 4016/4286 [25:14:25<1:27:08, 19.36s/it] {'loss': 0.037, 'grad_norm': 3.423394410422884, 'learning_rate': 6.299580027998133e-08, 'completion_length': 196.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6895833909511566, 'rewards/format_reward': 1.0, 'reward': 1.6895835399627686, 'reward_std': 0.07559525035321712, 'kl': 0.92578125, 'epoch': 0.94}
94%|█████████▎| 4017/4286 [25:14:46<1:29:03, 19.87s/it] {'loss': 0.0724, 'grad_norm': 18.286680439234345, 'learning_rate': 6.276248250116658e-08, 'completion_length': 188.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.641335517168045, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6056212782859802, 'reward_std': 0.13934937492012978, 'kl': 1.81640625, 'epoch': 0.94}
94%|█████████▎| 4018/4286 [25:15:06<1:28:26, 19.80s/it] {'loss': 0.0464, 'grad_norm': 3.9579828072828027, 'learning_rate': 6.252916472235184e-08, 'completion_length': 188.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6021825671195984, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5843254327774048, 'reward_std': 0.12064610794186592, 'kl': 1.158203125, 'epoch': 0.94}
94%|█████████▍| 4019/4286 [25:15:24<1:25:55, 19.31s/it] {'loss': 0.0393, 'grad_norm': 6.217500443074824, 'learning_rate': 6.229584694353709e-08, 'completion_length': 184.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678571939468384, 'reward_std': 0.09959554672241211, 'kl': 0.98046875, 'epoch': 0.94}
94%|█████████▍| 4020/4286 [25:15:43<1:25:19, 19.25s/it] {'loss': 0.0411, 'grad_norm': 6.833769308430548, 'learning_rate': 6.206252916472235e-08, 'completion_length': 204.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6717262268066406, 'rewards/format_reward': 1.0, 'reward': 1.6717262864112854, 'reward_std': 0.10237788781523705, 'kl': 1.029296875, 'epoch': 0.94}
94%|█████████▍| 4021/4286 [25:16:03<1:26:11, 19.51s/it] {'loss': 0.0317, 'grad_norm': 2.3131852974117453, 'learning_rate': 6.18292113859076e-08, 'completion_length': 208.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7440476715564728, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7261905670166016, 'reward_std': 0.10694201290607452, 'kl': 0.79541015625, 'epoch': 0.94}
[2025-03-03 06:23:40,423] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
94%|█████████▍| 4022/4286 [25:16:25<1:28:19, 20.07s/it] {'loss': 0.0218, 'grad_norm': 2.889659487271142, 'learning_rate': 6.159589360709285e-08, 'completion_length': 197.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.7485119700431824, 'rewards/format_reward': 1.0, 'reward': 1.7485119700431824, 'reward_std': 0.08529588580131531, 'kl': 0.5439453125, 'epoch': 0.94}
94%|█████████▍| 4023/4286 [25:16:44<1:27:33, 19.98s/it] {'loss': 0.0311, 'grad_norm': 2.163522632924305, 'learning_rate': 6.136257582827811e-08, 'completion_length': 195.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.68601194024086, 'rewards/format_reward': 1.0, 'reward': 1.6860120296478271, 'reward_std': 0.07008879259228706, 'kl': 0.77734375, 'epoch': 0.94}
94%|█████████▍| 4024/4286 [25:17:05<1:28:00, 20.15s/it] {'loss': 0.0642, 'grad_norm': 5.893140085790803, 'learning_rate': 6.112925804946336e-08, 'completion_length': 191.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.641964316368103, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6241071224212646, 'reward_std': 0.1458333320915699, 'kl': 1.60546875, 'epoch': 0.94}
94%|█████████▍| 4025/4286 [25:17:23<1:25:02, 19.55s/it] {'loss': 0.0282, 'grad_norm': 7.690103984037387, 'learning_rate': 6.089594027064862e-08, 'completion_length': 170.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6636905372142792, 'rewards/format_reward': 1.0, 'reward': 1.6636906266212463, 'reward_std': 0.02976190857589245, 'kl': 0.7060546875, 'epoch': 0.94}
94%|█████████▍| 4026/4286 [25:17:42<1:24:10, 19.43s/it] {'loss': 0.0273, 'grad_norm': 5.379075269465751, 'learning_rate': 6.066262249183387e-08, 'completion_length': 174.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.6889881193637848, 'rewards/format_reward': 1.0, 'reward': 1.688988208770752, 'reward_std': 0.07876221463084221, 'kl': 0.681640625, 'epoch': 0.94}
94%|█████████▍| 4027/4286 [25:18:01<1:23:17, 19.30s/it] {'loss': 0.0372, 'grad_norm': 5.61246888041665, 'learning_rate': 6.042930471301914e-08, 'completion_length': 191.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.6430272459983826, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6251701712608337, 'reward_std': 0.09170809015631676, 'kl': 0.931640625, 'epoch': 0.94}
94%|█████████▍| 4028/4286 [25:18:19<1:21:45, 19.01s/it] {'loss': 0.0302, 'grad_norm': 7.1493277764774374, 'learning_rate': 6.019598693420438e-08, 'completion_length': 183.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6130952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.595238208770752, 'reward_std': 0.09182289429008961, 'kl': 0.7548828125, 'epoch': 0.94}
94%|█████████▍| 4029/4286 [25:18:39<1:22:11, 19.19s/it] {'loss': 0.0197, 'grad_norm': 0.9669196247258665, 'learning_rate': 5.996266915538963e-08, 'completion_length': 184.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.578869104385376, 'rewards/format_reward': 1.0, 'reward': 1.5788691639900208, 'reward_std': 0.04464286006987095, 'kl': 0.49169921875, 'epoch': 0.94}
94%|█████████▍| 4029/4286 
[25:18:39<1:22:11, 19.19s/it] 94%|█████████▍| 4030/4286 [25:19:00<1:23:40, 19.61s/it] {'loss': 0.0269, 'grad_norm': 2.7216947924933677, 'learning_rate': 5.97293513765749e-08, 'completion_length': 193.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8155612647533417, 'rewards/format_reward': 1.0, 'reward': 1.8155613541603088, 'reward_std': 0.1009349413216114, 'kl': 0.67041015625, 'epoch': 0.94} 94%|█████████▍| 4030/4286 [25:19:00<1:23:40, 19.61s/it] 94%|█████████▍| 4031/4286 [25:19:22<1:27:13, 20.52s/it] {'loss': 0.0338, 'grad_norm': 6.152010262665815, 'learning_rate': 5.949603359776015e-08, 'completion_length': 205.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5461310148239136, 'reward_std': 0.1714528724551201, 'kl': 0.84375, 'epoch': 0.94} 94%|█████████▍| 4031/4286 [25:19:22<1:27:13, 20.52s/it] 94%|█████████▍| 4032/4286 [25:19:42<1:26:07, 20.34s/it] {'loss': 0.0202, 'grad_norm': 8.579943161776075, 'learning_rate': 5.9262715818945405e-08, 'completion_length': 198.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6681548058986664, 'rewards/format_reward': 1.0, 'reward': 1.6681548953056335, 'reward_std': 0.050841973163187504, 'kl': 0.5048828125, 'epoch': 0.94} 94%|█████████▍| 4032/4286 [25:19:42<1:26:07, 20.34s/it] 94%|█████████▍| 4033/4286 [25:20:02<1:25:06, 20.18s/it] {'loss': 0.0491, 'grad_norm': 1.3128798275099671, 'learning_rate': 5.9029398040130654e-08, 'completion_length': 185.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.5758928954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5580357909202576, 'reward_std': 0.1050875075161457, 'kl': 1.2236328125, 'epoch': 0.94} 94%|█████████▍| 4033/4286 [25:20:02<1:25:06, 20.18s/it] 94%|█████████▍| 4034/4286 [25:20:21<1:22:49, 19.72s/it] {'loss': 0.027, 'grad_norm': 4.834644856382076, 'learning_rate': 5.879608026131591e-08, 'completion_length': 182.9821548461914, 
'rewards/only_full_func_accuracy_reward': 0.7422619462013245, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7244048714637756, 'reward_std': 0.08539529610425234, 'kl': 0.6767578125, 'epoch': 0.94} 94%|█████████▍| 4034/4286 [25:20:21<1:22:49, 19.72s/it][2025-03-03 06:27:58,931] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 94%|█████████▍| 4035/4286 [25:20:43<1:25:47, 20.51s/it] {'loss': 0.0562, 'grad_norm': 4.681225869766169, 'learning_rate': 5.8562762482501165e-08, 'completion_length': 201.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.4434524029493332, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.407738208770752, 'reward_std': 0.15169811248779297, 'kl': 1.408203125, 'epoch': 0.94} 94%|█████████▍| 4035/4286 [25:20:43<1:25:47, 20.51s/it] 94%|█████████▍| 4036/4286 [25:21:06<1:27:52, 21.09s/it] {'loss': 0.0489, 'grad_norm': 2.749489209329532, 'learning_rate': 5.832944470368642e-08, 'completion_length': 190.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7938617169857025, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.776004672050476, 'reward_std': 0.08606545627117157, 'kl': 1.216796875, 'epoch': 0.94} 94%|█████████▍| 4036/4286 [25:21:06<1:27:52, 21.09s/it] 94%|█████████▍| 4037/4286 [25:21:25<1:24:58, 20.48s/it] {'loss': 0.012, 'grad_norm': 2.853417836231927, 'learning_rate': 5.8096126924871675e-08, 'completion_length': 193.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.8556548357009888, 'rewards/format_reward': 1.0, 'reward': 1.8556548357009888, 'reward_std': 0.05016787722706795, 'kl': 0.2998046875, 'epoch': 
0.94} 94%|█████████▍| 4037/4286 [25:21:25<1:24:58, 20.48s/it] 94%|█████████▍| 4038/4286 [25:21:45<1:24:17, 20.39s/it] {'loss': 0.0109, 'grad_norm': 1.994455522777303, 'learning_rate': 5.786280914605693e-08, 'completion_length': 168.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.6002976894378662, 'rewards/format_reward': 1.0, 'reward': 1.600297749042511, 'reward_std': 0.049404763616621494, 'kl': 0.2724609375, 'epoch': 0.94} 94%|█████████▍| 4038/4286 [25:21:45<1:24:17, 20.39s/it][2025-03-03 06:29:20,598] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 94%|█████████▍| 4039/4286 [25:22:05<1:23:25, 20.27s/it] {'loss': 0.0643, 'grad_norm': 1.9048107454339034, 'learning_rate': 5.7629491367242186e-08, 'completion_length': 200.91072845458984, 'rewards/only_full_func_accuracy_reward': 0.6354166567325592, 'rewards/format_reward': 1.0, 'reward': 1.6354167461395264, 'reward_std': 0.12366022542119026, 'kl': 1.6103515625, 'epoch': 0.94} 94%|█████████▍| 4039/4286 [25:22:05<1:23:25, 20.27s/it] 94%|█████████▍| 4040/4286 [25:22:24<1:21:34, 19.90s/it] {'loss': 0.0487, 'grad_norm': 2.123845298347471, 'learning_rate': 5.739617358842744e-08, 'completion_length': 172.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.06185896322131157, 'kl': 1.21826171875, 'epoch': 0.94} 94%|█████████▍| 4040/4286 [25:22:24<1:21:34, 19.90s/it] 94%|█████████▍| 4041/4286 [25:22:44<1:21:37, 19.99s/it] {'loss': 0.0454, 'grad_norm': 3.474197522365975, 'learning_rate': 5.716285580961269e-08, 
'completion_length': 190.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.7544644474983215, 'reward_std': 0.0533577473834157, 'kl': 1.1318359375, 'epoch': 0.94} 94%|█████████▍| 4041/4286 [25:22:44<1:21:37, 19.99s/it] 94%|█████████▍| 4042/4286 [25:23:03<1:19:41, 19.60s/it] {'loss': 0.0079, 'grad_norm': 0.7630795231449181, 'learning_rate': 5.6929538030797945e-08, 'completion_length': 174.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 1.0, 'reward': 1.6309524774551392, 'reward_std': 0.0, 'kl': 0.1962890625, 'epoch': 0.94} 94%|█████████▍| 4042/4286 [25:23:03<1:19:41, 19.60s/it] 94%|█████████▍| 4043/4286 [25:23:22<1:19:30, 19.63s/it] {'loss': 0.0458, 'grad_norm': 3.461756026013886, 'learning_rate': 5.66962202519832e-08, 'completion_length': 205.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.5613095760345459, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5434524416923523, 'reward_std': 0.12074320763349533, 'kl': 1.146484375, 'epoch': 0.94} 94%|█████████▍| 4043/4286 [25:23:22<1:19:30, 19.63s/it] 94%|█████████▍| 4044/4286 [25:23:43<1:20:04, 19.85s/it] {'loss': 0.0768, 'grad_norm': 30.46261205772882, 'learning_rate': 5.6462902473168456e-08, 'completion_length': 186.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.5922619700431824, 'rewards/format_reward': 1.0, 'reward': 1.5922620296478271, 'reward_std': 0.07993820495903492, 'kl': 1.916015625, 'epoch': 0.94} 94%|█████████▍| 4044/4286 [25:23:43<1:20:04, 19.85s/it] 94%|█████████▍| 4045/4286 [25:24:01<1:17:36, 19.32s/it] {'loss': 0.0071, 'grad_norm': 2.0049164616653927, 'learning_rate': 5.622958469435371e-08, 'completion_length': 190.7321548461914, 'rewards/only_full_func_accuracy_reward': 0.6577381491661072, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.01785714365541935, 'kl': 0.1767578125, 'epoch': 0.94} 94%|█████████▍| 4045/4286 
[25:24:01<1:17:36, 19.32s/it] 94%|█████████▍| 4046/4286 [25:24:19<1:16:11, 19.05s/it] {'loss': 0.0267, 'grad_norm': 0.8810709125821021, 'learning_rate': 5.5996266915538966e-08, 'completion_length': 196.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.642857313156128, 'reward_std': 0.07142857182770967, 'kl': 0.66845703125, 'epoch': 0.94} 94%|█████████▍| 4046/4286 [25:24:19<1:16:11, 19.05s/it] 94%|█████████▍| 4047/4286 [25:24:40<1:17:28, 19.45s/it] {'loss': 0.0278, 'grad_norm': 3.9237684995826387, 'learning_rate': 5.576294913672422e-08, 'completion_length': 180.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.0357142873108387, 'kl': 0.6953125, 'epoch': 0.94} 94%|█████████▍| 4047/4286 [25:24:40<1:17:28, 19.45s/it] 94%|█████████▍| 4048/4286 [25:24:57<1:15:07, 18.94s/it] {'loss': 0.029, 'grad_norm': 11.653547793931757, 'learning_rate': 5.552963135790947e-08, 'completion_length': 176.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.648809552192688, 'rewards/format_reward': 1.0, 'reward': 1.6488096117973328, 'reward_std': 0.10314170084893703, 'kl': 0.72265625, 'epoch': 0.94} 94%|█████████▍| 4048/4286 [25:24:57<1:15:07, 18.94s/it] 94%|█████████▍| 4049/4286 [25:25:19<1:18:08, 19.78s/it] {'loss': 0.0407, 'grad_norm': 5.518582125806356, 'learning_rate': 5.5296313579094726e-08, 'completion_length': 209.75001525878906, 'rewards/only_full_func_accuracy_reward': 0.5213293731212616, 'rewards/format_reward': 1.0, 'reward': 1.5213294625282288, 'reward_std': 0.08069736137986183, 'kl': 1.017578125, 'epoch': 0.94} 94%|█████████▍| 4049/4286 [25:25:19<1:18:08, 19.78s/it] 94%|█████████▍| 4050/4286 [25:25:38<1:17:04, 19.59s/it] {'loss': 0.0174, 'grad_norm': 1.931981493252384, 'learning_rate': 5.506299580027998e-08, 'completion_length': 195.92858123779297, 
'rewards/only_full_func_accuracy_reward': 0.6735119223594666, 'rewards/format_reward': 1.0, 'reward': 1.6735119819641113, 'reward_std': 0.02321428433060646, 'kl': 0.43359375, 'epoch': 0.94} 94%|█████████▍| 4050/4286 [25:25:38<1:17:04, 19.59s/it] 95%|█████████▍| 4051/4286 [25:25:59<1:18:38, 20.08s/it] {'loss': 0.0625, 'grad_norm': 10.699100912620693, 'learning_rate': 5.4829678021465236e-08, 'completion_length': 203.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7455357611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.691964328289032, 'reward_std': 0.23987218737602234, 'kl': 1.5625, 'epoch': 0.95} 95%|█████████▍| 4051/4286 [25:25:59<1:18:38, 20.08s/it] 95%|█████████▍| 4052/4286 [25:26:18<1:15:58, 19.48s/it] {'loss': 0.0353, 'grad_norm': 8.073929560577414, 'learning_rate': 5.4596360242650485e-08, 'completion_length': 185.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.678571492433548, 'rewards/format_reward': 1.0, 'reward': 1.6785715818405151, 'reward_std': 0.0997552964836359, 'kl': 0.87890625, 'epoch': 0.95} 95%|█████████▍| 4052/4286 [25:26:18<1:15:58, 19.48s/it] 95%|█████████▍| 4053/4286 [25:26:36<1:14:51, 19.28s/it] {'loss': 0.0187, 'grad_norm': 3.006854113895363, 'learning_rate': 5.436304246383574e-08, 'completion_length': 179.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7857143878936768, 'reward_std': 0.12347953952848911, 'kl': 0.46923828125, 'epoch': 0.95} 95%|█████████▍| 4053/4286 [25:26:36<1:14:51, 19.28s/it] 95%|█████████▍| 4054/4286 [25:26:55<1:14:15, 19.21s/it] {'loss': 0.0336, 'grad_norm': 5.925402596198244, 'learning_rate': 5.4129724685020995e-08, 'completion_length': 190.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715818405151, 'reward_std': 0.06388126127421856, 'kl': 0.83740234375, 'epoch': 0.95} 95%|█████████▍| 4054/4286 [25:26:55<1:14:15, 
19.21s/it] 95%|█████████▍| 4055/4286 [25:27:16<1:15:42, 19.66s/it] {'loss': 0.0482, 'grad_norm': 33.55860523830084, 'learning_rate': 5.3896406906206244e-08, 'completion_length': 202.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6226191222667694, 'rewards/format_reward': 1.0, 'reward': 1.6226191520690918, 'reward_std': 0.10253103822469711, 'kl': 1.205078125, 'epoch': 0.95} 95%|█████████▍| 4055/4286 [25:27:16<1:15:42, 19.66s/it] 95%|█████████▍| 4056/4286 [25:27:35<1:14:36, 19.46s/it] {'loss': 0.0697, 'grad_norm': 18.381263690873233, 'learning_rate': 5.36630891273915e-08, 'completion_length': 176.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.528869092464447, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.493154764175415, 'reward_std': 0.16520974040031433, 'kl': 1.7421875, 'epoch': 0.95} 95%|█████████▍| 4056/4286 [25:27:35<1:14:36, 19.46s/it] 95%|█████████▍| 4057/4286 [25:27:53<1:12:32, 19.01s/it] {'loss': 0.0339, 'grad_norm': 1.9663345956762597, 'learning_rate': 5.3429771348576755e-08, 'completion_length': 170.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.754464328289032, 'rewards/format_reward': 1.0, 'reward': 1.754464328289032, 'reward_std': 0.026785715483129025, 'kl': 0.845703125, 'epoch': 0.95} 95%|█████████▍| 4057/4286 [25:27:53<1:12:32, 19.01s/it] 95%|█████████▍| 4058/4286 [25:28:12<1:12:05, 18.97s/it] {'loss': 0.0065, 'grad_norm': 2.940499815031926, 'learning_rate': 5.319645356976201e-08, 'completion_length': 192.76786041259766, 'rewards/only_full_func_accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.028740640729665756, 'kl': 0.1640625, 'epoch': 0.95} 95%|█████████▍| 4058/4286 [25:28:12<1:12:05, 18.97s/it] 95%|█████████▍| 4059/4286 [25:28:30<1:11:13, 18.83s/it] {'loss': 0.0127, 'grad_norm': 2.149601278524689, 'learning_rate': 5.2963135790947265e-08, 'completion_length': 200.00000762939453, 'rewards/only_full_func_accuracy_reward': 
0.6145833432674408, 'rewards/format_reward': 1.0, 'reward': 1.614583432674408, 'reward_std': 0.008928571827709675, 'kl': 0.31640625, 'epoch': 0.95} 95%|█████████▍| 4059/4286 [25:28:30<1:11:13, 18.83s/it] 95%|█████████▍| 4060/4286 [25:28:50<1:11:38, 19.02s/it] {'loss': 0.0738, 'grad_norm': 3.4753892880754584, 'learning_rate': 5.272981801213252e-08, 'completion_length': 184.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5491071492433548, 'rewards/format_reward': 1.0, 'reward': 1.5491071939468384, 'reward_std': 0.059310128912329674, 'kl': 1.83984375, 'epoch': 0.95} 95%|█████████▍| 4060/4286 [25:28:50<1:11:38, 19.02s/it] 95%|█████████▍| 4061/4286 [25:29:08<1:10:39, 18.84s/it] {'loss': 0.0444, 'grad_norm': 2.0317631774727114, 'learning_rate': 5.2496500233317776e-08, 'completion_length': 194.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.643750011920929, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6258929371833801, 'reward_std': 0.09107143431901932, 'kl': 1.109375, 'epoch': 0.95} 95%|█████████▍| 4061/4286 [25:29:08<1:10:39, 18.84s/it] 95%|█████████▍| 4062/4286 [25:29:27<1:10:33, 18.90s/it] {'loss': 0.0081, 'grad_norm': 7.947427734992462, 'learning_rate': 5.226318245450303e-08, 'completion_length': 188.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.6502976715564728, 'rewards/format_reward': 1.0, 'reward': 1.6502977013587952, 'reward_std': 0.008928571827709675, 'kl': 0.203125, 'epoch': 0.95} 95%|█████████▍| 4062/4286 [25:29:27<1:10:33, 18.90s/it] 95%|█████████▍| 4063/4286 [25:29:48<1:11:46, 19.31s/it] {'loss': 0.0143, 'grad_norm': 1.4325277730158306, 'learning_rate': 5.202986467568828e-08, 'completion_length': 195.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.5204081982374191, 'rewards/format_reward': 1.0, 'reward': 1.520408272743225, 'reward_std': 0.018707480281591415, 'kl': 0.357421875, 'epoch': 0.95} 95%|█████████▍| 4063/4286 [25:29:48<1:11:46, 19.31s/it] 95%|█████████▍| 4064/4286 [25:30:07<1:11:14, 
19.25s/it] {'loss': 0.0166, 'grad_norm': 1.7930752050670444, 'learning_rate': 5.1796546896873535e-08, 'completion_length': 169.2857208251953, 'rewards/only_full_func_accuracy_reward': 0.7098214626312256, 'rewards/format_reward': 1.0, 'reward': 1.7098215222358704, 'reward_std': 0.026785715483129025, 'kl': 0.4130859375, 'epoch': 0.95} 95%|█████████▍| 4064/4286 [25:30:07<1:11:14, 19.25s/it] 95%|█████████▍| 4065/4286 [25:30:25<1:09:49, 18.95s/it] {'loss': 0.0071, 'grad_norm': 1.5211533044123953, 'learning_rate': 5.156322911805879e-08, 'completion_length': 192.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7105655670166016, 'rewards/format_reward': 1.0, 'reward': 1.7105655670166016, 'reward_std': 0.0342261902987957, 'kl': 0.17822265625, 'epoch': 0.95} 95%|█████████▍| 4065/4286 [25:30:25<1:09:49, 18.95s/it] 95%|█████████▍| 4066/4286 [25:30:43<1:08:04, 18.57s/it] {'loss': 0.0165, 'grad_norm': 1.5253914042999621, 'learning_rate': 5.1329911339244046e-08, 'completion_length': 163.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.7574405372142792, 'rewards/format_reward': 1.0, 'reward': 1.7574405670166016, 'reward_std': 0.025764448568224907, 'kl': 0.412109375, 'epoch': 0.95} 95%|█████████▍| 4066/4286 [25:30:43<1:08:04, 18.57s/it] 95%|█████████▍| 4067/4286 [25:31:06<1:13:11, 20.05s/it] {'loss': 0.0383, 'grad_norm': 6.25683740808385, 'learning_rate': 5.10965935604293e-08, 'completion_length': 207.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 1.0, 'reward': 1.6011905670166016, 'reward_std': 0.08173839747905731, 'kl': 0.95703125, 'epoch': 0.95} 95%|█████████▍| 4067/4286 [25:31:06<1:13:11, 20.05s/it] 95%|█████████▍| 4068/4286 [25:31:27<1:13:38, 20.27s/it] {'loss': 0.0569, 'grad_norm': 2.313631337487701, 'learning_rate': 5.0863275781614556e-08, 'completion_length': 188.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.636904776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 
1.6190477013587952, 'reward_std': 0.19281134009361267, 'kl': 1.421875, 'epoch': 0.95} 95%|█████████▍| 4068/4286 [25:31:27<1:13:38, 20.27s/it] 95%|█████████▍| 4069/4286 [25:31:48<1:13:45, 20.40s/it] {'loss': 0.0628, 'grad_norm': 10.311786802507797, 'learning_rate': 5.062995800279981e-08, 'completion_length': 185.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.5505952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.532738208770752, 'reward_std': 0.06730847526341677, 'kl': 1.5693359375, 'epoch': 0.95} 95%|█████████▍| 4069/4286 [25:31:48<1:13:45, 20.40s/it] 95%|█████████▍| 4070/4286 [25:32:09<1:14:34, 20.72s/it] {'loss': 0.0307, 'grad_norm': 1.4846111264238224, 'learning_rate': 5.039664022398507e-08, 'completion_length': 184.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6443453133106232, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.626488208770752, 'reward_std': 0.11233501601964235, 'kl': 0.76806640625, 'epoch': 0.95} 95%|█████████▍| 4070/4286 [25:32:09<1:14:34, 20.72s/it] 95%|█████████▍| 4071/4286 [25:32:29<1:13:11, 20.43s/it] {'loss': 0.0527, 'grad_norm': 5.679127763364832, 'learning_rate': 5.0163322445170316e-08, 'completion_length': 203.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.7573129534721375, 'rewards/format_reward': 1.0, 'reward': 1.7573130130767822, 'reward_std': 0.09421503450721502, 'kl': 1.31640625, 'epoch': 0.95} 95%|█████████▍| 4071/4286 [25:32:29<1:13:11, 20.43s/it] 95%|█████████▌| 4072/4286 [25:32:51<1:14:47, 20.97s/it] {'loss': 0.0312, 'grad_norm': 5.993036649974477, 'learning_rate': 4.993000466635557e-08, 'completion_length': 208.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.5425595939159393, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5247024297714233, 'reward_std': 0.08773510158061981, 'kl': 0.78125, 'epoch': 0.95} 95%|█████████▌| 4072/4286 [25:32:51<1:14:47, 20.97s/it] 95%|█████████▌| 4073/4286 [25:33:11<1:12:48, 20.51s/it] {'loss': 0.083, 
'grad_norm': 14.30865952512191, 'learning_rate': 4.9696686887540826e-08, 'completion_length': 187.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.690178632736206, 'rewards/format_reward': 1.0, 'reward': 1.6901786923408508, 'reward_std': 0.1368986964225769, 'kl': 2.07421875, 'epoch': 0.95} 95%|█████████▌| 4073/4286 [25:33:11<1:12:48, 20.51s/it] 95%|█████████▌| 4074/4286 [25:33:29<1:10:11, 19.87s/it] {'loss': 0.0389, 'grad_norm': 6.885308107041197, 'learning_rate': 4.946336910872608e-08, 'completion_length': 184.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.711309552192688, 'rewards/format_reward': 1.0, 'reward': 1.7113096714019775, 'reward_std': 0.10117879882454872, 'kl': 0.97216796875, 'epoch': 0.95} 95%|█████████▌| 4074/4286 [25:33:29<1:10:11, 19.87s/it] 95%|█████████▌| 4075/4286 [25:33:51<1:12:36, 20.65s/it] {'loss': 0.0377, 'grad_norm': 4.687972759404149, 'learning_rate': 4.923005132991134e-08, 'completion_length': 186.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.6949405670166016, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.677083432674408, 'reward_std': 0.13761192932724953, 'kl': 0.94140625, 'epoch': 0.95} 95%|█████████▌| 4075/4286 [25:33:51<1:12:36, 20.65s/it] 95%|█████████▌| 4076/4286 [25:34:15<1:15:10, 21.48s/it] {'loss': 0.0488, 'grad_norm': 8.165888820395946, 'learning_rate': 4.899673355109659e-08, 'completion_length': 198.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6666668057441711, 'reward_std': 0.0850711241364479, 'kl': 1.21875, 'epoch': 0.95} 95%|█████████▌| 4076/4286 [25:34:15<1:15:10, 21.48s/it] 95%|█████████▌| 4077/4286 [25:34:33<1:11:38, 20.57s/it] {'loss': 0.0171, 'grad_norm': 6.209984175894577, 'learning_rate': 4.876341577228185e-08, 'completion_length': 185.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.8154762387275696, 'rewards/format_reward': 1.0, 'reward': 1.8154762983322144, 
'reward_std': 0.06787716411054134, 'kl': 0.4267578125, 'epoch': 0.95} 95%|█████████▌| 4077/4286 [25:34:33<1:11:38, 20.57s/it] 95%|█████████▌| 4078/4286 [25:34:54<1:11:11, 20.54s/it] {'loss': 0.0441, 'grad_norm': 18.872231035203843, 'learning_rate': 4.8530097993467096e-08, 'completion_length': 176.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.7217262983322144, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.686012089252472, 'reward_std': 0.15400169044733047, 'kl': 1.1015625, 'epoch': 0.95} 95%|█████████▌| 4078/4286 [25:34:54<1:11:11, 20.54s/it] 95%|█████████▌| 4079/4286 [25:35:19<1:15:48, 21.97s/it] {'loss': 0.0254, 'grad_norm': 6.623745414935619, 'learning_rate': 4.829678021465235e-08, 'completion_length': 192.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.5997024476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5818453431129456, 'reward_std': 0.07790559902787209, 'kl': 0.6357421875, 'epoch': 0.95} 95%|█████████▌| 4079/4286 [25:35:19<1:15:48, 21.97s/it] 95%|█████████▌| 4080/4286 [25:35:38<1:12:02, 20.98s/it] {'loss': 0.051, 'grad_norm': 21.879490856968253, 'learning_rate': 4.806346243583761e-08, 'completion_length': 191.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.696428656578064, 'reward_std': 0.166666679084301, 'kl': 1.2734375, 'epoch': 0.95} 95%|█████████▌| 4080/4286 [25:35:38<1:12:02, 20.98s/it] 95%|█████████▌| 4081/4286 [25:35:58<1:11:10, 20.83s/it] {'loss': 0.0641, 'grad_norm': 10.13039164337513, 'learning_rate': 4.783014465702286e-08, 'completion_length': 198.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5994048118591309, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5636906027793884, 'reward_std': 0.16553625254891813, 'kl': 1.60302734375, 'epoch': 0.95} 95%|█████████▌| 4081/4286 [25:35:58<1:11:10, 20.83s/it] 95%|█████████▌| 4082/4286 [25:36:17<1:08:59, 20.29s/it] {'loss': 0.0254, 'grad_norm': 
5.189479777910375, 'learning_rate': 4.759682687820812e-08, 'completion_length': 185.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.8455357253551483, 'rewards/format_reward': 1.0, 'reward': 1.845535933971405, 'reward_std': 0.08146392367780209, 'kl': 0.634765625, 'epoch': 0.95} 95%|█████████▌| 4082/4286 [25:36:17<1:08:59, 20.29s/it] 95%|█████████▌| 4083/4286 [25:36:36<1:07:02, 19.82s/it] {'loss': 0.0176, 'grad_norm': 4.127382879072545, 'learning_rate': 4.736350909939337e-08, 'completion_length': 191.03572845458984, 'rewards/only_full_func_accuracy_reward': 0.6875000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6875001788139343, 'reward_std': 0.01785714365541935, 'kl': 0.439453125, 'epoch': 0.95} 95%|█████████▌| 4083/4286 [25:36:36<1:07:02, 19.82s/it] 95%|█████████▌| 4084/4286 [25:36:55<1:05:29, 19.45s/it] {'loss': 0.0543, 'grad_norm': 10.236576065946984, 'learning_rate': 4.713019132057863e-08, 'completion_length': 171.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.7425595819950104, 'rewards/format_reward': 1.0, 'reward': 1.7425596117973328, 'reward_std': 0.07589309778995812, 'kl': 1.36376953125, 'epoch': 0.95} 95%|█████████▌| 4084/4286 [25:36:55<1:05:29, 19.45s/it] 95%|█████████▌| 4085/4286 [25:37:14<1:05:03, 19.42s/it] {'loss': 0.0192, 'grad_norm': 5.975817764432339, 'learning_rate': 4.689687354176388e-08, 'completion_length': 170.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.736607164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7187500596046448, 'reward_std': 0.08852548897266388, 'kl': 0.47900390625, 'epoch': 0.95} 95%|█████████▌| 4085/4286 [25:37:14<1:05:03, 19.42s/it] 95%|█████████▌| 4086/4286 [25:37:34<1:05:08, 19.54s/it] {'loss': 0.0189, 'grad_norm': 5.8967365042815905, 'learning_rate': 4.666355576294913e-08, 'completion_length': 175.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.62351194024086, 'rewards/format_reward': 1.0, 'reward': 1.6235119700431824, 'reward_std': 
0.03869047574698925, 'kl': 0.4736328125, 'epoch': 0.95} 95%|█████████▌| 4086/4286 [25:37:34<1:05:08, 19.54s/it] 95%|█████████▌| 4087/4286 [25:37:55<1:07:00, 20.20s/it] {'loss': 0.011, 'grad_norm': 1.4514258271498366, 'learning_rate': 4.643023798413439e-08, 'completion_length': 193.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.5142857432365417, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.496428668498993, 'reward_std': 0.07916059158742428, 'kl': 0.2744140625, 'epoch': 0.95} 95%|█████████▌| 4087/4286 [25:37:55<1:07:00, 20.20s/it] 95%|█████████▌| 4088/4286 [25:38:14<1:04:55, 19.67s/it] {'loss': 0.0612, 'grad_norm': 5.359737732509799, 'learning_rate': 4.619692020531964e-08, 'completion_length': 180.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6989796161651611, 'rewards/format_reward': 1.0, 'reward': 1.698979675769806, 'reward_std': 0.07545755244791508, 'kl': 1.5234375, 'epoch': 0.95} 95%|█████████▌| 4088/4286 [25:38:14<1:04:55, 19.67s/it] 95%|█████████▌| 4089/4286 [25:38:34<1:05:13, 19.86s/it] {'loss': 0.0569, 'grad_norm': 4.878389114412146, 'learning_rate': 4.59636024265049e-08, 'completion_length': 186.26786041259766, 'rewards/only_full_func_accuracy_reward': 0.6702381670475006, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6523810625076294, 'reward_std': 0.05673840269446373, 'kl': 1.42578125, 'epoch': 0.95} 95%|█████████▌| 4089/4286 [25:38:34<1:05:13, 19.86s/it] 95%|█████████▌| 4090/4286 [25:38:58<1:08:55, 21.10s/it] {'loss': 0.0528, 'grad_norm': 5.66517628829323, 'learning_rate': 4.573028464769015e-08, 'completion_length': 197.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.6137330532073975, 'rewards/format_reward': 1.0, 'reward': 1.6137331128120422, 'reward_std': 0.07395920902490616, 'kl': 1.318359375, 'epoch': 0.95} 95%|█████████▌| 4090/4286 [25:38:58<1:08:55, 21.10s/it] 95%|█████████▌| 4091/4286 [25:39:19<1:08:24, 21.05s/it] {'loss': 0.0359, 'grad_norm': 5.501407765230437, 'learning_rate': 
4.549696686887541e-08, 'completion_length': 182.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.7336309850215912, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6979168057441711, 'reward_std': 0.1160714328289032, 'kl': 0.89599609375, 'epoch': 0.95} 95%|█████████▌| 4091/4286 [25:39:19<1:08:24, 21.05s/it]
95%|█████████▌| 4092/4286 [25:39:39<1:07:18, 20.82s/it] {'loss': 0.0329, 'grad_norm': 7.6830791937001734, 'learning_rate': 4.5263649090060664e-08, 'completion_length': 187.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7288690507411957, 'rewards/format_reward': 1.0, 'reward': 1.7288691401481628, 'reward_std': 0.03783673234283924, 'kl': 0.82421875, 'epoch': 0.95}
[2025-03-03 06:47:15,393] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
95%|█████████▌| 4093/4286 [25:40:00<1:06:17, 20.61s/it] {'loss': 0.0352, 'grad_norm': 2.5948122727440217, 'learning_rate': 4.503033131124591e-08, 'completion_length': 168.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7589286267757416, 'rewards/format_reward': 1.0, 'reward': 1.7589287161827087, 'reward_std': 0.05357143096625805, 'kl': 0.880859375, 'epoch': 0.95}
[2025-03-03 06:47:36,524] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4094/4286 [25:40:21<1:06:27, 20.77s/it] {'loss': 0.0423, 'grad_norm': 7.675156905110298, 'learning_rate': 4.479701353243117e-08, 'completion_length': 185.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127977013587952, 'reward_std': 0.0744047611951828, 'kl': 1.0576171875, 'epoch': 0.96}
96%|█████████▌| 4095/4286 [25:40:39<1:03:44, 20.02s/it] {'loss': 0.0105, 'grad_norm': 2.64274888865748, 'learning_rate': 4.456369575361642e-08, 'completion_length': 184.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6991071999073029, 'rewards/format_reward': 1.0, 'reward': 1.6991072297096252, 'reward_std': 0.016366009949706495, 'kl': 0.26318359375, 'epoch': 0.96}
96%|█████████▌| 4096/4286 [25:40:57<1:01:44, 19.50s/it] {'loss': 0.0266, 'grad_norm': 1.3385764380851561, 'learning_rate': 4.433037797480168e-08, 'completion_length': 189.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7738096117973328, 'rewards/format_reward': 1.0, 'reward': 1.7738096714019775, 'reward_std': 0.0892857201397419, 'kl': 0.6669921875, 'epoch': 0.96}
96%|█████████▌| 4097/4286 [25:41:16<1:01:12, 19.43s/it] {'loss': 0.0329, 'grad_norm': 2.7532308809492854, 'learning_rate': 4.4097060195986934e-08, 'completion_length': 191.5178680419922, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.7127976417541504, 'reward_std': 0.07220851257443428, 'kl': 0.82421875, 'epoch': 0.96}
96%|█████████▌| 4098/4286 [25:41:39<1:03:57, 20.41s/it] {'loss': 0.0264, 'grad_norm': 8.16880810151745, 'learning_rate': 4.386374241717219e-08, 'completion_length': 181.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.598214328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5803572535514832, 'reward_std': 0.07738095847889781, 'kl': 0.6591796875, 'epoch': 0.96}
96%|█████████▌| 4099/4286 [25:42:02<1:05:49, 21.12s/it] {'loss': 0.0414, 'grad_norm': 3.9610981310783204, 'learning_rate': 4.3630424638357444e-08, 'completion_length': 209.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.6205357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6205357909202576, 'reward_std': 0.06593661196529865, 'kl': 1.037109375, 'epoch': 0.96}
96%|█████████▌| 4100/4286 [25:42:25<1:07:39, 21.83s/it] {'loss': 0.1039, 'grad_norm': 2.4879367022588785, 'learning_rate': 4.33971068595427e-08, 'completion_length': 193.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.607142984867096, 'reward_std': 0.21646223962306976, 'kl': 2.59765625, 'epoch': 0.96}
96%|█████████▌| 4101/4286 [25:46:33<4:36:18, 89.61s/it] {'loss': 0.0677, 'grad_norm': 4.586057964081529, 'learning_rate': 4.316378908072795e-08, 'completion_length': 186.4107208251953, 'rewards/only_full_func_accuracy_reward': 0.49970243871212006, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4818453192710876, 'reward_std': 0.1397606935352087, 'kl': 1.69140625, 'epoch': 0.96}
[2025-03-03 06:54:11,749] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4102/4286 [25:46:56<3:33:13, 69.53s/it] {'loss': 0.0505, 'grad_norm': 4.980757377858485, 'learning_rate': 4.2930471301913204e-08, 'completion_length': 211.62501525878906, 'rewards/only_full_func_accuracy_reward': 0.4970238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4791668057441711, 'reward_std': 0.08162223733961582, 'kl': 1.265625, 'epoch': 0.96}
96%|█████████▌| 4103/4286 [25:47:16<2:46:44, 54.67s/it] {'loss': 0.0756, 'grad_norm': 3.268033498109178, 'learning_rate': 4.269715352309846e-08, 'completion_length': 203.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6101190894842148, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5922620296478271, 'reward_std': 0.11516373325139284, 'kl': 1.8828125, 'epoch': 0.96}
96%|█████████▌| 4104/4286 [25:47:34<2:12:28, 43.68s/it] {'loss': 0.0082, 'grad_norm': 1.0950185784869035, 'learning_rate': 4.2463835744283714e-08, 'completion_length': 186.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6726190149784088, 'rewards/format_reward': 1.0, 'reward': 1.6726192235946655, 'reward_std': 0.060691386461257935, 'kl': 0.2060546875, 'epoch': 0.96}
96%|█████████▌| 4105/4286 [25:47:52<1:48:41, 36.03s/it] {'loss': 0.0266, 'grad_norm': 4.1635685232113975, 'learning_rate': 4.223051796546897e-08, 'completion_length': 174.2321548461914, 'rewards/only_full_func_accuracy_reward': 0.7875000536441803, 'rewards/format_reward': 1.0, 'reward': 1.7875000834465027, 'reward_std': 0.02023809589445591, 'kl': 0.66552734375, 'epoch': 0.96}
96%|█████████▌| 4106/4286 [25:48:13<1:34:37, 31.54s/it] {'loss': 0.0285, 'grad_norm': 3.2554867868082358, 'learning_rate': 4.1997200186654225e-08, 'completion_length': 193.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.5258928835391998, 'rewards/format_reward': 1.0, 'reward': 1.5258929133415222, 'reward_std': 0.03024324495345354, 'kl': 0.712890625, 'epoch': 0.96}
[2025-03-03 06:55:49,427] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4107/4286 [25:48:34<1:24:07, 28.20s/it] {'loss': 0.0646, 'grad_norm': 6.259649837699495, 'learning_rate': 4.176388240783948e-08, 'completion_length': 176.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.6410714685916901, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5875001549720764, 'reward_std': 0.16214874386787415, 'kl': 1.61328125, 'epoch': 0.96}
96%|█████████▌| 4108/4286 [25:48:53<1:15:47, 25.55s/it] {'loss': 0.0396, 'grad_norm': 9.477584978700067, 'learning_rate': 4.1530564629024735e-08, 'completion_length': 190.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.6622024476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6443453431129456, 'reward_std': 0.0702988775447011, 'kl': 0.98583984375, 'epoch': 0.96}
96%|█████████▌| 4109/4286 [25:49:12<1:09:22, 23.52s/it] {'loss': 0.0222, 'grad_norm': 4.325970091925884, 'learning_rate': 4.1297246850209984e-08, 'completion_length': 197.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.7321429252624512, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142858505249023, 'reward_std': 0.10714286006987095, 'kl': 0.5537109375, 'epoch': 0.96}
96%|█████████▌| 4110/4286 [25:49:30<1:04:10, 21.88s/it] {'loss': 0.0495, 'grad_norm': 0.706664627888258, 'learning_rate': 4.106392907139524e-08, 'completion_length': 186.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7351190745830536, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.717262089252472, 'reward_std': 0.1130952425301075, 'kl': 1.234375, 'epoch': 0.96}
96%|█████████▌| 4111/4286 [25:49:50<1:02:18, 21.36s/it] {'loss': 0.0732, 'grad_norm': 4.70178439743555, 'learning_rate': 4.0830611292580495e-08, 'completion_length': 197.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6187500953674316, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5830358266830444, 'reward_std': 0.09538307040929794, 'kl': 1.82421875, 'epoch': 0.96}
96%|█████████▌| 4112/4286 [25:50:09<59:38, 20.57s/it] {'loss': 0.0392, 'grad_norm': 2.1165879591075503, 'learning_rate': 4.059729351376575e-08, 'completion_length': 193.75000762939453, 'rewards/only_full_func_accuracy_reward': 0.6345238387584686, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6166667342185974, 'reward_std': 0.07300060614943504, 'kl': 0.97607421875, 'epoch': 0.96}
96%|█████████▌| 4113/4286 [25:50:28<58:08, 20.17s/it] {'loss': 0.0093, 'grad_norm': 0.6376587071941847, 'learning_rate': 4.0363975734951005e-08, 'completion_length': 172.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.0, 'kl': 0.23291015625, 'epoch': 0.96}
96%|█████████▌| 4114/4286 [25:50:47<57:11, 19.95s/it] {'loss': 0.0277, 'grad_norm': 1.337483464160827, 'learning_rate': 4.013065795613626e-08, 'completion_length': 153.60714721679688, 'rewards/only_full_func_accuracy_reward': 0.7440476417541504, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7261905670166016, 'reward_std': 0.1071428619325161, 'kl': 0.69384765625, 'epoch': 0.96}
[2025-03-03 06:58:25,555] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4115/4286 [25:51:10<58:56, 20.68s/it] {'loss': 0.0123, 'grad_norm': 3.078346882651222, 'learning_rate': 3.9897340177321516e-08, 'completion_length': 218.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.583333432674408, 'reward_std': 0.0833333358168602, 'kl': 0.3076171875, 'epoch': 0.96}
[2025-03-03 06:58:47,997] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4116/4286 [25:51:32<1:00:05, 21.21s/it] {'loss': 0.0509, 'grad_norm': 1.9382891855724917, 'learning_rate': 3.9664022398506764e-08, 'completion_length': 196.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.6651785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6473214626312256, 'reward_std': 0.0982142873108387, 'kl': 1.2734375, 'epoch': 0.96}
96%|█████████▌| 4117/4286 [25:51:51<57:21, 20.37s/it] {'loss': 0.0284, 'grad_norm': 7.086812005828206, 'learning_rate': 3.943070461969202e-08, 'completion_length': 188.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.6815476417541504, 'rewards/format_reward': 1.0, 'reward': 1.68154776096344, 'reward_std': 0.07990731298923492, 'kl': 0.708984375, 'epoch': 0.96}
[2025-03-03 06:59:25,879] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
96%|█████████▌| 4118/4286 [25:52:10<56:16, 20.10s/it] {'loss': 0.0087, 'grad_norm': 5.6355630331418025, 'learning_rate': 3.9197386840877275e-08, 'completion_length': 174.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6845239102840424, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.0, 'kl': 0.216796875, 'epoch': 0.96}
96%|█████████▌| 4119/4286 [25:52:33<58:09, 20.90s/it] {'loss': 0.0941, 'grad_norm': 3.008019732208484, 'learning_rate': 3.8964069062062524e-08, 'completion_length': 166.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.7364927232265472, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7007784843444824, 'reward_std': 0.19368132948875427, 'kl': 2.3515625, 'epoch': 0.96}
96%|█████████▌| 4120/4286 [25:52:53<57:18, 20.72s/it] {'loss': 0.0703, 'grad_norm': 1.9445270581185732, 'learning_rate': 3.873075128324778e-08, 'completion_length': 176.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.6659903228282928, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6481332182884216, 'reward_std': 0.11572084948420525, 'kl': 1.7578125, 'epoch': 0.96}
96%|█████████▌| 4121/4286 [25:53:11<54:33, 19.84s/it] {'loss': 0.0558, 'grad_norm': 4.7375132210979, 'learning_rate': 3.8497433504433034e-08, 'completion_length': 170.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.68601194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6681548953056335, 'reward_std': 0.1227565836161375, 'kl': 1.3984375, 'epoch': 0.96}
96%|█████████▌| 4122/4286 [25:53:33<55:48, 20.42s/it] {'loss': 0.0369, 'grad_norm': 12.248068099665261, 'learning_rate': 3.826411572561829e-08, 'completion_length': 205.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.4895833730697632, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4538691639900208, 'reward_std': 0.12042887508869171, 'kl': 0.921875, 'epoch': 0.96}
96%|█████████▌| 4123/4286 [25:53:52<54:32, 20.08s/it] {'loss': 0.0155, 'grad_norm': 1.7197292606098415, 'learning_rate': 3.8030797946803545e-08, 'completion_length': 191.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6934524476528168, 'rewards/format_reward': 1.0, 'reward': 1.6934524774551392, 'reward_std': 0.07419108413159847, 'kl': 0.38671875, 'epoch': 0.96}
96%|█████████▌| 4124/4286 [25:54:17<57:54, 21.45s/it] {'loss': 0.064, 'grad_norm': 5.309745128987059, 'learning_rate': 3.7797480167988794e-08, 'completion_length': 197.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.4836309999227524, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4300596714019775, 'reward_std': 0.17919664084911346, 'kl': 1.6015625, 'epoch': 0.96}
96%|█████████▌| 4125/4286 [25:54:35<55:04, 20.52s/it] {'loss': 0.0493, 'grad_norm': 1.2512670249898774, 'learning_rate': 3.756416238917405e-08, 'completion_length': 193.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.6175595670938492, 'rewards/format_reward': 1.0, 'reward': 1.6175596714019775, 'reward_std': 0.098214291036129, 'kl': 1.23046875, 'epoch': 0.96}
96%|█████████▋| 4126/4286 [25:54:57<56:21, 21.13s/it] {'loss': 0.031, 'grad_norm': 5.557281040261721, 'learning_rate': 3.7330844610359304e-08, 'completion_length': 188.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.59226194024086, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5565477013587952, 'reward_std': 0.1250000074505806, 'kl': 0.7763671875, 'epoch': 0.96}
96%|█████████▋| 4127/4286 [25:55:21<57:53, 21.85s/it] {'loss': 0.0445, 'grad_norm': 12.088714550154284, 'learning_rate': 3.709752683154456e-08, 'completion_length': 207.33930206298828, 'rewards/only_full_func_accuracy_reward': 0.635416716337204, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5997024774551392, 'reward_std': 0.14135089982300997, 'kl': 1.1162109375, 'epoch': 0.96}
96%|█████████▋| 4128/4286 [25:55:40<55:18, 21.00s/it] {'loss': 0.0857, 'grad_norm': 8.9995947919173, 'learning_rate': 3.6864209052729815e-08, 'completion_length': 189.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.5208333730697632, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5029762983322144, 'reward_std': 0.1531669795513153, 'kl': 2.1484375, 'epoch': 0.96}
96%|█████████▋| 4129/4286 [25:56:00<53:48, 20.56s/it] {'loss': 0.0496, 'grad_norm': 6.601188633913347, 'learning_rate': 3.663089127391507e-08, 'completion_length': 207.7678680419922, 'rewards/only_full_func_accuracy_reward': 0.7306548058986664, 'rewards/format_reward': 1.0, 'reward': 1.7306549549102783, 'reward_std': 0.1135203018784523, 'kl': 1.234375, 'epoch': 0.96}
96%|█████████▋| 4130/4286 [25:56:20<53:09, 20.45s/it] {'loss': 0.0307, 'grad_norm': 6.800081880598534, 'learning_rate': 3.6397573495100325e-08, 'completion_length': 217.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.7633928954601288, 'rewards/format_reward': 1.0, 'reward': 1.7633930444717407, 'reward_std': 0.10518360137939453, 'kl': 0.76953125, 'epoch': 0.96}
96%|█████████▋| 4131/4286 [25:56:38<50:46, 19.66s/it] {'loss': 0.0252, 'grad_norm': 0.9302709730674312, 'learning_rate': 3.6164255716285574e-08, 'completion_length': 159.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.690476268529892, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.0714285746216774, 'kl': 0.62939453125, 'epoch': 0.96}
96%|█████████▋| 4132/4286 [25:56:56<49:12, 19.17s/it] {'loss': 0.0307, 'grad_norm': 2.122629164996866, 'learning_rate': 3.593093793747083e-08, 'completion_length': 189.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6934524178504944, 'rewards/format_reward': 1.0, 'reward': 1.693452537059784, 'reward_std': 0.04597861785441637, 'kl': 0.76806640625, 'epoch': 0.96}
96%|█████████▋| 4133/4286 [25:57:14<47:57, 18.80s/it] {'loss': 0.0284, 'grad_norm': 0.6175985178989082, 'learning_rate': 3.5697620158656085e-08, 'completion_length': 195.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.7547619342803955, 'rewards/format_reward': 1.0, 'reward': 1.7547619938850403, 'reward_std': 0.04404763085767627, 'kl': 0.7099609375, 'epoch': 0.96}
[2025-03-03 07:04:48,824] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption.
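The recurring stage3.py warning above recommends calling get_accelerator().empty_cache() in the training loop so that all ranks flush their allocator caches at the same time. A minimal, hypothetical sketch of that synchronized schedule (the names `maybe_flush` and `flush_every` are illustrative, not part of DeepSpeed; in a real loop `flush_fn` would be `get_accelerator().empty_cache` from `deepspeed.accelerator`):

```python
def maybe_flush(step, flush_every, flush_fn):
    """Call flush_fn on a fixed step schedule.

    Because the schedule depends only on the global step counter, every
    rank flushes its allocator cache at the same step, which is what the
    DeepSpeed stage3 warning suggests. In practice flush_fn would be
    get_accelerator().empty_cache (from deepspeed.accelerator import
    get_accelerator); here it is an injected callable so the scheduling
    logic stands alone.
    """
    if flush_every > 0 and step % flush_every == 0:
        flush_fn()
        return True
    return False
```

Inside the loop body, something like `maybe_flush(step, 50, get_accelerator().empty_cache)` would flush every 50 optimizer steps on all ranks; the interval is a tuning knob, since flushing too often also costs throughput.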
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 96%|█████████▋| 4134/4286 [25:57:33<48:06, 18.99s/it] {'loss': 0.0178, 'grad_norm': 91.40025794478572, 'learning_rate': 3.546430237984134e-08, 'completion_length': 192.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.6595238447189331, 'rewards/format_reward': 1.0, 'reward': 1.6595239043235779, 'reward_std': 0.05714285932481289, 'kl': 0.4443359375, 'epoch': 0.96} 96%|█████████▋| 4134/4286 [25:57:33<48:06, 18.99s/it] 96%|█████████▋| 4135/4286 [25:57:53<48:50, 19.41s/it] {'loss': 0.0551, 'grad_norm': 2.174675192594814, 'learning_rate': 3.5230984601026595e-08, 'completion_length': 190.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.5109127461910248, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4930557012557983, 'reward_std': 0.10811333917081356, 'kl': 1.37890625, 'epoch': 0.96} 96%|█████████▋| 4135/4286 [25:57:53<48:50, 19.41s/it][2025-03-03 07:05:32,821] [WARNING] [stage3.py:2134:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 97%|█████████▋| 4136/4286 [25:58:17<51:40, 20.67s/it] {'loss': 0.0789, 'grad_norm': 1.986399310170416, 'learning_rate': 3.499766682221185e-08, 'completion_length': 200.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5449405312538147, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4913691282272339, 'reward_std': 0.20161600410938263, 'kl': 1.97265625, 'epoch': 0.97} 97%|█████████▋| 4136/4286 [25:58:17<51:40, 20.67s/it] 97%|█████████▋| 4137/4286 [25:58:38<51:24, 20.70s/it] {'loss': 0.063, 'grad_norm': 59.232285839696594, 'learning_rate': 3.4764349043397106e-08, 'completion_length': 203.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.6476190984249115, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.629762053489685, 'reward_std': 0.0803829412907362, 'kl': 1.57421875, 'epoch': 0.97} 97%|█████████▋| 4137/4286 [25:58:38<51:24, 20.70s/it] 97%|█████████▋| 4138/4286 [25:58:56<49:09, 19.93s/it] {'loss': 0.019, 'grad_norm': 5.820342881710555, 'learning_rate': 3.453103126458236e-08, 'completion_length': 184.6607208251953, 'rewards/only_full_func_accuracy_reward': 0.65476194024086, 'rewards/format_reward': 1.0, 'reward': 1.6547620296478271, 'reward_std': 0.0595238134264946, 'kl': 0.474609375, 'epoch': 0.97} 97%|█████████▋| 4138/4286 [25:58:56<49:09, 19.93s/it] 97%|█████████▋| 4139/4286 [25:59:15<48:22, 19.74s/it] {'loss': 0.0195, 'grad_norm': 5.940848846482215, 'learning_rate': 3.429771348576761e-08, 'completion_length': 206.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.789285808801651, 'rewards/format_reward': 1.0, 'reward': 1.7892858386039734, 'reward_std': 0.08176744729280472, 'kl': 0.4873046875, 'epoch': 0.97} 97%|█████████▋| 4139/4286 [25:59:15<48:22, 19.74s/it] 97%|█████████▋| 4140/4286 [25:59:34<47:26, 19.50s/it] {'loss': 0.0139, 
'grad_norm': 0.7305617982940678, 'learning_rate': 3.4064395706952865e-08, 'completion_length': 193.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.580357164144516, 'rewards/format_reward': 1.0, 'reward': 1.5803572535514832, 'reward_std': 0.01785714365541935, 'kl': 0.34521484375, 'epoch': 0.97} 97%|█████████▋| 4140/4286 [25:59:34<47:26, 19.50s/it] 97%|█████████▋| 4141/4286 [25:59:58<50:18, 20.81s/it] {'loss': 0.1325, 'grad_norm': 2.857897143065331, 'learning_rate': 3.383107792813812e-08, 'completion_length': 196.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5952381491661072, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5416667461395264, 'reward_std': 0.24681014567613602, 'kl': 3.3125, 'epoch': 0.97} 97%|█████████▋| 4141/4286 [25:59:58<50:18, 20.81s/it] 97%|█████████▋| 4142/4286 [26:00:19<50:08, 20.89s/it] {'loss': 0.0474, 'grad_norm': 7.4867641822251905, 'learning_rate': 3.3597760149323376e-08, 'completion_length': 182.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.647321492433548, 'rewards/format_reward': 1.0, 'reward': 1.6473215818405151, 'reward_std': 0.04740536957979202, 'kl': 1.18359375, 'epoch': 0.97} 97%|█████████▋| 4142/4286 [26:00:19<50:08, 20.89s/it] 97%|█████████▋| 4143/4286 [26:00:38<48:37, 20.40s/it] {'loss': 0.0212, 'grad_norm': 10.18788812473157, 'learning_rate': 3.336444237050863e-08, 'completion_length': 182.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.5784439146518707, 'rewards/format_reward': 1.0, 'reward': 1.5784439444541931, 'reward_std': 0.06954945996403694, 'kl': 0.5302734375, 'epoch': 0.97} 97%|█████████▋| 4143/4286 [26:00:38<48:37, 20.40s/it] 97%|█████████▋| 4144/4286 [26:01:00<49:28, 20.90s/it] {'loss': 0.0498, 'grad_norm': 14.15252665299774, 'learning_rate': 3.3131124591693886e-08, 'completion_length': 204.44644165039062, 'rewards/only_full_func_accuracy_reward': 0.7211309969425201, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6854168176651, 'reward_std': 
0.17984014376997948, 'kl': 1.24609375, 'epoch': 0.97} 97%|█████████▋| 4144/4286 [26:01:00<49:28, 20.90s/it] 97%|█████████▋| 4145/4286 [26:01:20<48:10, 20.50s/it] {'loss': 0.0814, 'grad_norm': 1.5410754017448947, 'learning_rate': 3.289780681287914e-08, 'completion_length': 149.98214721679688, 'rewards/only_full_func_accuracy_reward': 0.7619048357009888, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7440477013587952, 'reward_std': 0.1364774089306593, 'kl': 2.03515625, 'epoch': 0.97} 97%|█████████▋| 4145/4286 [26:01:20<48:10, 20.50s/it] 97%|█████████▋| 4146/4286 [26:01:41<47:55, 20.54s/it] {'loss': 0.0434, 'grad_norm': 2.5850095557057844, 'learning_rate': 3.266448903406439e-08, 'completion_length': 190.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.6250000596046448, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.07259614393115044, 'kl': 1.091796875, 'epoch': 0.97} 97%|█████████▋| 4146/4286 [26:01:41<47:55, 20.54s/it] 97%|█████████▋| 4147/4286 [26:02:03<48:48, 21.07s/it] {'loss': 0.0617, 'grad_norm': 5.907560701284954, 'learning_rate': 3.2431171255249646e-08, 'completion_length': 209.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.5559824258089066, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.520268201828003, 'reward_std': 0.133208766579628, 'kl': 1.5390625, 'epoch': 0.97} 97%|█████████▋| 4147/4286 [26:02:03<48:48, 21.07s/it] 97%|█████████▋| 4148/4286 [26:02:21<46:19, 20.14s/it] {'loss': 0.0347, 'grad_norm': 11.023576174928234, 'learning_rate': 3.21978534764349e-08, 'completion_length': 182.89286041259766, 'rewards/only_full_func_accuracy_reward': 0.661309540271759, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6434524059295654, 'reward_std': 0.11466726381331682, 'kl': 0.8701171875, 'epoch': 0.97} 97%|█████████▋| 4148/4286 [26:02:21<46:19, 20.14s/it] 97%|█████████▋| 4149/4286 [26:02:41<45:56, 20.12s/it] {'loss': 0.1099, 'grad_norm': 7.210300241536765, 'learning_rate': 
3.1964535697620156e-08, 'completion_length': 185.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6279762387275696, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5922620296478271, 'reward_std': 0.21440255641937256, 'kl': 2.75, 'epoch': 0.97} 97%|█████████▋| 4149/4286 [26:02:41<45:56, 20.12s/it] 97%|█████████▋| 4150/4286 [26:02:59<44:30, 19.63s/it] {'loss': 0.0683, 'grad_norm': 6.7606817197927676, 'learning_rate': 3.173121791880541e-08, 'completion_length': 187.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.5505952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.532738208770752, 'reward_std': 0.09631870314478874, 'kl': 1.70703125, 'epoch': 0.97} 97%|█████████▋| 4150/4286 [26:02:59<44:30, 19.63s/it] 97%|█████████▋| 4151/4286 [26:03:18<43:22, 19.28s/it] {'loss': 0.0505, 'grad_norm': 1119.6179044888559, 'learning_rate': 3.149790013999067e-08, 'completion_length': 188.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6931548416614532, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6574406027793884, 'reward_std': 0.1253054067492485, 'kl': 1.263671875, 'epoch': 0.97} 97%|█████████▋| 4151/4286 [26:03:18<43:22, 19.28s/it] 97%|█████████▋| 4152/4286 [26:03:37<42:55, 19.22s/it] {'loss': 0.0551, 'grad_norm': 5.630598274202513, 'learning_rate': 3.126458236117592e-08, 'completion_length': 172.67857360839844, 'rewards/only_full_func_accuracy_reward': 0.5639881491661072, 'rewards/format_reward': 1.0, 'reward': 1.563988208770752, 'reward_std': 0.040532153099775314, 'kl': 1.3759765625, 'epoch': 0.97} 97%|█████████▋| 4152/4286 [26:03:37<42:55, 19.22s/it] 97%|█████████▋| 4153/4286 [26:03:59<44:10, 19.93s/it] {'loss': 0.4673, 'grad_norm': 34339.10313404106, 'learning_rate': 3.103126458236118e-08, 'completion_length': 196.0357208251953, 'rewards/only_full_func_accuracy_reward': 0.505357176065445, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4875000715255737, 'reward_std': 0.17216748744249344, 'kl': 
11.7265625, 'epoch': 0.97} 97%|█████████▋| 4153/4286 [26:03:59<44:10, 19.93s/it] 97%|█████████▋| 4154/4286 [26:04:19<43:59, 19.99s/it] {'loss': 0.0234, 'grad_norm': 4.783860068935083, 'learning_rate': 3.0797946803546426e-08, 'completion_length': 202.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6836309731006622, 'rewards/format_reward': 1.0, 'reward': 1.6836310625076294, 'reward_std': 0.043294661678373814, 'kl': 0.5849609375, 'epoch': 0.97} 97%|█████████▋| 4154/4286 [26:04:19<43:59, 19.99s/it] 97%|█████████▋| 4155/4286 [26:04:37<42:22, 19.41s/it] {'loss': 0.0268, 'grad_norm': 1.7605706256296598, 'learning_rate': 3.056462902473168e-08, 'completion_length': 176.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.05952380783855915, 'kl': 0.66748046875, 'epoch': 0.97} 97%|█████████▋| 4155/4286 [26:04:37<42:22, 19.41s/it] 97%|█████████▋| 4156/4286 [26:04:55<41:29, 19.15s/it] {'loss': 0.053, 'grad_norm': 0.9989834175086908, 'learning_rate': 3.033131124591694e-08, 'completion_length': 173.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.755952388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7380953431129456, 'reward_std': 0.12347954511642456, 'kl': 1.32421875, 'epoch': 0.97} 97%|█████████▋| 4156/4286 [26:04:55<41:29, 19.15s/it] 97%|█████████▋| 4157/4286 [26:05:14<41:11, 19.16s/it] {'loss': 0.0418, 'grad_norm': 6.533593299982559, 'learning_rate': 3.009799346710219e-08, 'completion_length': 186.8571548461914, 'rewards/only_full_func_accuracy_reward': 0.6309524476528168, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6130953431129456, 'reward_std': 0.09959554672241211, 'kl': 1.046875, 'epoch': 0.97} 97%|█████████▋| 4157/4286 [26:05:14<41:11, 19.16s/it] 97%|█████████▋| 4158/4286 [26:05:35<41:52, 19.63s/it] {'loss': 0.0122, 'grad_norm': 9.244722928332184, 'learning_rate': 2.986467568828745e-08, 'completion_length': 
194.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.6458333432674408, 'rewards/format_reward': 1.0, 'reward': 1.6458334922790527, 'reward_std': 0.0416666679084301, 'kl': 0.30517578125, 'epoch': 0.97} 97%|█████████▋| 4158/4286 [26:05:35<41:52, 19.63s/it] 97%|█████████▋| 4159/4286 [26:05:54<40:57, 19.35s/it] {'loss': 0.0086, 'grad_norm': 1.6682061808624171, 'learning_rate': 2.9631357909472703e-08, 'completion_length': 189.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.8005952835083008, 'rewards/format_reward': 1.0, 'reward': 1.8005953431129456, 'reward_std': 0.01785714365541935, 'kl': 0.21484375, 'epoch': 0.97} 97%|█████████▋| 4159/4286 [26:05:54<40:57, 19.35s/it] 97%|█████████▋| 4160/4286 [26:06:12<40:10, 19.13s/it] {'loss': 0.0327, 'grad_norm': 0.9422942706035015, 'learning_rate': 2.9398040130657955e-08, 'completion_length': 190.0178680419922, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904763579368591, 'reward_std': 0.04054497182369232, 'kl': 0.81640625, 'epoch': 0.97} 97%|█████████▋| 4160/4286 [26:06:12<40:10, 19.13s/it] 97%|█████████▋| 4161/4286 [26:06:34<41:23, 19.87s/it] {'loss': 0.0597, 'grad_norm': 9.723491091528022, 'learning_rate': 2.916472235184321e-08, 'completion_length': 191.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6026787161827087, 'reward_std': 0.16864144057035446, 'kl': 1.48828125, 'epoch': 0.97} 97%|█████████▋| 4161/4286 [26:06:34<41:23, 19.87s/it] 97%|█████████▋| 4162/4286 [26:06:54<41:04, 19.88s/it] {'loss': 0.0224, 'grad_norm': 4.307269654527083, 'learning_rate': 2.8931404573028465e-08, 'completion_length': 196.16072845458984, 'rewards/only_full_func_accuracy_reward': 0.495535746216774, 'rewards/format_reward': 1.0, 'reward': 1.4955358505249023, 'reward_std': 0.06306749954819679, 'kl': 0.560546875, 'epoch': 0.97} 97%|█████████▋| 4162/4286 [26:06:54<41:04, 19.88s/it] 
97%|█████████▋| 4163/4286 [26:07:12<39:49, 19.42s/it] {'loss': 0.0256, 'grad_norm': 1.1764719032015272, 'learning_rate': 2.869808679421372e-08, 'completion_length': 180.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.8363095819950104, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.818452537059784, 'reward_std': 0.08928571827709675, 'kl': 0.6396484375, 'epoch': 0.97} 97%|█████████▋| 4163/4286 [26:07:12<39:49, 19.42s/it] 97%|█████████▋| 4164/4286 [26:07:35<41:17, 20.31s/it] {'loss': 0.0421, 'grad_norm': 3.2842679732376023, 'learning_rate': 2.8464769015398973e-08, 'completion_length': 206.50000762939453, 'rewards/only_full_func_accuracy_reward': 0.5409226715564728, 'rewards/format_reward': 1.0, 'reward': 1.5409227013587952, 'reward_std': 0.06101190857589245, 'kl': 1.056640625, 'epoch': 0.97} 97%|█████████▋| 4164/4286 [26:07:35<41:17, 20.31s/it] 97%|█████████▋| 4165/4286 [26:07:53<39:56, 19.80s/it] {'loss': 0.0121, 'grad_norm': 3.466564234373761, 'learning_rate': 2.8231451236584228e-08, 'completion_length': 204.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.8259353935718536, 'rewards/format_reward': 1.0, 'reward': 1.825935423374176, 'reward_std': 0.03784884884953499, 'kl': 0.30224609375, 'epoch': 0.97} 97%|█████████▋| 4165/4286 [26:07:53<39:56, 19.80s/it][2025-03-03 07:15:29,206] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 97%|█████████▋| 4166/4286 [26:08:13<39:43, 19.86s/it] {'loss': 0.0177, 'grad_norm': 2.507985956421544, 'learning_rate': 2.7998133457769483e-08, 'completion_length': 161.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.7946428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7946429252624512, 'reward_std': 0.055015724152326584, 'kl': 0.443359375, 'epoch': 0.97} 97%|█████████▋| 4166/4286 [26:08:13<39:43, 19.86s/it] 97%|█████████▋| 4167/4286 [26:08:34<40:06, 20.22s/it] {'loss': 0.0864, 'grad_norm': 7.813966691657515, 'learning_rate': 2.7764815678954735e-08, 'completion_length': 193.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6205357313156128, 'rewards/format_reward': 1.0, 'reward': 1.6205358505249023, 'reward_std': 0.08336639031767845, 'kl': 2.1640625, 'epoch': 0.97} 97%|█████████▋| 4167/4286 [26:08:34<40:06, 20.22s/it] 97%|█████████▋| 4168/4286 [26:08:54<39:13, 19.95s/it] {'loss': 0.0393, 'grad_norm': 8.021838746069143, 'learning_rate': 2.753149790013999e-08, 'completion_length': 195.66072845458984, 'rewards/only_full_func_accuracy_reward': 0.4970238655805588, 'rewards/format_reward': 1.0, 'reward': 1.49702388048172, 'reward_std': 0.07862300798296928, 'kl': 0.982421875, 'epoch': 0.97} 97%|█████████▋| 4168/4286 [26:08:54<39:13, 19.95s/it] 97%|█████████▋| 4169/4286 [26:09:12<37:48, 19.38s/it] {'loss': 0.0474, 'grad_norm': 6.065497510182703, 'learning_rate': 2.7298180121325242e-08, 'completion_length': 194.2678680419922, 'rewards/only_full_func_accuracy_reward': 0.6547619700431824, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6369049549102783, 'reward_std': 0.1242439541965723, 'kl': 1.18359375, 'epoch': 0.97} 97%|█████████▋| 4169/4286 [26:09:12<37:48, 19.38s/it] 97%|█████████▋| 4170/4286 [26:09:31<37:11, 19.24s/it] {'loss': 0.0228, 'grad_norm': 
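The stage3.py warning above recommends adding `get_accelerator().empty_cache()` calls to the training loop so that all ranks flush their allocator caches at the same step. A minimal sketch of that pattern follows; it is not from this run's code. `StubAccelerator` is a stand-in for `deepspeed.accelerator.get_accelerator()` (so the sketch runs without DeepSpeed or a GPU), and `EMPTY_CACHE_INTERVAL` is a hypothetical knob, not a DeepSpeed setting.

```python
# Sketch of the pattern the DeepSpeed warning suggests: flush allocator
# caches at a fixed step interval so every rank flushes simultaneously.
# StubAccelerator is a stand-in for deepspeed.accelerator.get_accelerator();
# in a real run you would do:
#   from deepspeed.accelerator import get_accelerator
# and the real accelerator's empty_cache() would call torch.cuda.empty_cache().

EMPTY_CACHE_INTERVAL = 50  # hypothetical: flush every N optimizer steps


class StubAccelerator:
    """Stand-in for the object returned by get_accelerator()."""

    def __init__(self):
        self.flush_count = 0

    def empty_cache(self):
        # Real accelerator: torch.cuda.empty_cache() on this rank's device.
        self.flush_count += 1


_ACCEL = StubAccelerator()


def get_accelerator():
    # DeepSpeed exposes this as deepspeed.accelerator.get_accelerator().
    return _ACCEL


def training_loop(num_steps):
    for step in range(1, num_steps + 1):
        # ... forward / backward / optimizer.step() would happen here ...
        if step % EMPTY_CACHE_INTERVAL == 0:
            # Every rank hits this branch at the same step number, so the
            # caches are flushed together, as the warning recommends.
            get_accelerator().empty_cache()
    return _ACCEL.flush_count


flushes = training_loop(200)  # flushes at steps 50, 100, 150, 200
```

The interval trades throughput for memory headroom: flushing too often serializes ranks on allocator work, while never flushing lets fragmentation build until the allocator is forced into the unsynchronized per-rank flushes the log is warning about.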
6.375676729301566, 'learning_rate': 2.7064862342510498e-08, 'completion_length': 189.10714721679688, 'rewards/only_full_func_accuracy_reward': 0.5223214626312256, 'rewards/format_reward': 1.0, 'reward': 1.5223214626312256, 'reward_std': 0.04193221032619476, 'kl': 0.5703125, 'epoch': 0.97} 97%|█████████▋| 4170/4286 [26:09:31<37:11, 19.24s/it] 97%|█████████▋| 4171/4286 [26:09:49<36:28, 19.03s/it] {'loss': 0.0428, 'grad_norm': 4.353751109145021, 'learning_rate': 2.683154456369575e-08, 'completion_length': 191.55358123779297, 'rewards/only_full_func_accuracy_reward': 0.604166716337204, 'rewards/format_reward': 1.0, 'reward': 1.6041668057441711, 'reward_std': 0.05405071750283241, 'kl': 1.06884765625, 'epoch': 0.97} 97%|█████████▋| 4171/4286 [26:09:49<36:28, 19.03s/it] 97%|█████████▋| 4172/4286 [26:10:09<36:19, 19.12s/it] {'loss': 0.0408, 'grad_norm': 1.1427468243475827, 'learning_rate': 2.6598226784881005e-08, 'completion_length': 200.67858123779297, 'rewards/only_full_func_accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892858505249023, 'reward_std': 0.10138823837041855, 'kl': 1.0205078125, 'epoch': 0.97} 97%|█████████▋| 4172/4286 [26:10:09<36:19, 19.12s/it] 97%|█████████▋| 4173/4286 [26:10:30<37:22, 19.85s/it] {'loss': 0.0251, 'grad_norm': 3.9424671287193362, 'learning_rate': 2.636490900606626e-08, 'completion_length': 187.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.610119104385376, 'rewards/format_reward': 1.0, 'reward': 1.610119104385376, 'reward_std': 0.040071723982691765, 'kl': 0.626953125, 'epoch': 0.97} 97%|█████████▋| 4173/4286 [26:10:30<37:22, 19.85s/it] 97%|█████████▋| 4174/4286 [26:10:51<37:34, 20.13s/it] {'loss': 0.0728, 'grad_norm': 9.312935905807706, 'learning_rate': 2.6131591227251516e-08, 'completion_length': 206.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5565476715564728, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.520833432674408, 'reward_std': 
0.1327940635383129, 'kl': 1.8173828125, 'epoch': 0.97} 97%|█████████▋| 4174/4286 [26:10:51<37:34, 20.13s/it] 97%|█████████▋| 4175/4286 [26:11:09<36:15, 19.60s/it] {'loss': 0.0121, 'grad_norm': 10.642181566663211, 'learning_rate': 2.5898273448436768e-08, 'completion_length': 186.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.705357164144516, 'rewards/format_reward': 1.0, 'reward': 1.7053572535514832, 'reward_std': 0.05357143376022577, 'kl': 0.30224609375, 'epoch': 0.97} 97%|█████████▋| 4175/4286 [26:11:09<36:15, 19.60s/it] 97%|█████████▋| 4176/4286 [26:11:27<35:04, 19.14s/it] {'loss': 0.0336, 'grad_norm': 2.261104283109872, 'learning_rate': 2.5664955669622023e-08, 'completion_length': 189.12500762939453, 'rewards/only_full_func_accuracy_reward': 0.6958333551883698, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6779763102531433, 'reward_std': 0.09642857499420643, 'kl': 0.83935546875, 'epoch': 0.97} 97%|█████████▋| 4176/4286 [26:11:27<35:04, 19.14s/it] 97%|█████████▋| 4177/4286 [26:11:47<35:11, 19.37s/it] {'loss': 0.0436, 'grad_norm': 11.66488916864372, 'learning_rate': 2.5431637890807278e-08, 'completion_length': 194.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.6412698924541473, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6234127879142761, 'reward_std': 0.10281243920326233, 'kl': 1.09228515625, 'epoch': 0.97} 97%|█████████▋| 4177/4286 [26:11:47<35:11, 19.37s/it] 97%|█████████▋| 4178/4286 [26:12:06<34:26, 19.13s/it] {'loss': 0.0265, 'grad_norm': 1.8383858277454537, 'learning_rate': 2.5198320111992534e-08, 'completion_length': 183.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.7212302088737488, 'rewards/format_reward': 1.0, 'reward': 1.7212302684783936, 'reward_std': 0.0337301641702652, 'kl': 0.6630859375, 'epoch': 0.97} 97%|█████████▋| 4178/4286 [26:12:06<34:26, 19.13s/it] 98%|█████████▊| 4179/4286 [26:12:24<33:40, 18.89s/it] {'loss': 0.0074, 'grad_norm': 19.678749441543058, 'learning_rate': 
2.4965002333177786e-08, 'completion_length': 181.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.7413690984249115, 'rewards/format_reward': 1.0, 'reward': 1.7413691878318787, 'reward_std': 0.033261971548199654, 'kl': 0.185546875, 'epoch': 0.98} 98%|█████████▊| 4179/4286 [26:12:24<33:40, 18.89s/it] 98%|█████████▊| 4180/4286 [26:12:46<34:48, 19.70s/it] {'loss': 0.0321, 'grad_norm': 2.499740629723656, 'learning_rate': 2.473168455436304e-08, 'completion_length': 194.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.6755952835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6577382683753967, 'reward_std': 0.07234940677881241, 'kl': 0.802734375, 'epoch': 0.98} 98%|█████████▊| 4180/4286 [26:12:46<34:48, 19.70s/it] 98%|█████████▊| 4181/4286 [26:13:03<33:16, 19.01s/it] {'loss': 0.0447, 'grad_norm': 11.992582236464175, 'learning_rate': 2.4498366775548296e-08, 'completion_length': 182.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.630952388048172, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6130953431129456, 'reward_std': 0.06010328233242035, 'kl': 1.11767578125, 'epoch': 0.98} 98%|█████████▊| 4181/4286 [26:13:03<33:16, 19.01s/it] 98%|█████████▊| 4182/4286 [26:13:22<32:38, 18.83s/it] {'loss': 0.0274, 'grad_norm': 0.8916216489127582, 'learning_rate': 2.4265048996733548e-08, 'completion_length': 196.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.8199405372142792, 'rewards/format_reward': 1.0, 'reward': 1.8199405670166016, 'reward_std': 0.04740536957979202, 'kl': 0.68701171875, 'epoch': 0.98} 98%|█████████▊| 4182/4286 [26:13:22<32:38, 18.83s/it] 98%|█████████▊| 4183/4286 [26:13:40<32:06, 18.71s/it] {'loss': 0.0068, 'grad_norm': 3.5116258719062796, 'learning_rate': 2.4031731217918803e-08, 'completion_length': 187.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6562500298023224, 'rewards/format_reward': 1.0, 'reward': 1.6562501192092896, 'reward_std': 0.025190778076648712, 'kl': 0.17138671875, 
'epoch': 0.98} 98%|█████████▊| 4183/4286 [26:13:40<32:06, 18.71s/it] 98%|█████████▊| 4184/4286 [26:14:04<34:20, 20.20s/it] {'loss': 0.034, 'grad_norm': 2.3818834221535523, 'learning_rate': 2.379841343910406e-08, 'completion_length': 191.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.011904762126505375, 'kl': 0.849609375, 'epoch': 0.98} 98%|█████████▊| 4184/4286 [26:14:04<34:20, 20.20s/it] 98%|█████████▊| 4185/4286 [26:14:23<33:30, 19.91s/it] {'loss': 0.0381, 'grad_norm': 110.06573768991439, 'learning_rate': 2.3565095660289314e-08, 'completion_length': 189.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.7380952835083008, 'rewards/format_reward': 1.0, 'reward': 1.7380953431129456, 'reward_std': 0.10125327110290527, 'kl': 0.94921875, 'epoch': 0.98} 98%|█████████▊| 4185/4286 [26:14:23<33:30, 19.91s/it] 98%|█████████▊| 4186/4286 [26:14:47<35:25, 21.26s/it] {'loss': 0.0614, 'grad_norm': 7.512250199908811, 'learning_rate': 2.3331777881474566e-08, 'completion_length': 189.82144165039062, 'rewards/only_full_func_accuracy_reward': 0.38690483570098877, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3690476417541504, 'reward_std': 0.11585775390267372, 'kl': 1.53515625, 'epoch': 0.98} 98%|█████████▊| 4186/4286 [26:14:47<35:25, 21.26s/it] 98%|█████████▊| 4187/4286 [26:15:08<34:42, 21.03s/it] {'loss': 0.0388, 'grad_norm': 6.857564429319519, 'learning_rate': 2.309846010265982e-08, 'completion_length': 191.80358123779297, 'rewards/only_full_func_accuracy_reward': 0.7895833849906921, 'rewards/format_reward': 1.0, 'reward': 1.789583444595337, 'reward_std': 0.03101765736937523, 'kl': 0.9697265625, 'epoch': 0.98} 98%|█████████▊| 4187/4286 [26:15:08<34:42, 21.03s/it] 98%|█████████▊| 4188/4286 [26:15:28<34:01, 20.83s/it] {'loss': 0.0765, 'grad_norm': 4.624629945049112, 'learning_rate': 2.2865142323845077e-08, 'completion_length': 202.42858123779297, 
'rewards/only_full_func_accuracy_reward': 0.5907738506793976, 'rewards/format_reward': 1.0, 'reward': 1.59077388048172, 'reward_std': 0.15798483043909073, 'kl': 1.90234375, 'epoch': 0.98} 98%|█████████▊| 4188/4286 [26:15:28<34:01, 20.83s/it] 98%|█████████▊| 4189/4286 [26:15:47<32:36, 20.17s/it] {'loss': 0.0387, 'grad_norm': 4.34239327314181, 'learning_rate': 2.2631824545030332e-08, 'completion_length': 197.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6026786267757416, 'rewards/format_reward': 1.0, 'reward': 1.6026787161827087, 'reward_std': 0.07624644227325916, 'kl': 0.96484375, 'epoch': 0.98} 98%|█████████▊| 4189/4286 [26:15:47<32:36, 20.17s/it] 98%|█████████▊| 4190/4286 [26:16:05<31:14, 19.52s/it] {'loss': 0.0068, 'grad_norm': 0.9312600763952241, 'learning_rate': 2.2398506766215584e-08, 'completion_length': 178.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.816964328289032, 'rewards/format_reward': 1.0, 'reward': 1.8169643878936768, 'reward_std': 0.01709691435098648, 'kl': 0.17041015625, 'epoch': 0.98} 98%|█████████▊| 4190/4286 [26:16:05<31:14, 19.52s/it] 98%|█████████▊| 4191/4286 [26:16:24<30:36, 19.33s/it] {'loss': 0.034, 'grad_norm': 2.4438102306985856, 'learning_rate': 2.216518898740084e-08, 'completion_length': 182.64286041259766, 'rewards/only_full_func_accuracy_reward': 0.4732143133878708, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4375001192092896, 'reward_std': 0.1246987134218216, 'kl': 0.85009765625, 'epoch': 0.98} 98%|█████████▊| 4191/4286 [26:16:24<30:36, 19.33s/it] 98%|█████████▊| 4192/4286 [26:16:45<31:15, 19.95s/it] {'loss': 0.058, 'grad_norm': 2.109546963832073, 'learning_rate': 2.1931871208586094e-08, 'completion_length': 201.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.6654762327671051, 'rewards/format_reward': 1.0, 'reward': 1.6654762625694275, 'reward_std': 0.09614031482487917, 'kl': 1.44921875, 'epoch': 0.98} 98%|█████████▊| 4192/4286 [26:16:45<31:15, 19.95s/it] 98%|█████████▊| 4193/4286 
[26:17:04<30:23, 19.60s/it] {'loss': 0.0127, 'grad_norm': 5.9512473794301854, 'learning_rate': 2.169855342977135e-08, 'completion_length': 192.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6473214626312256, 'rewards/format_reward': 1.0, 'reward': 1.6473215818405151, 'reward_std': 0.07800615765154362, 'kl': 0.31787109375, 'epoch': 0.98} 98%|█████████▊| 4193/4286 [26:17:04<30:23, 19.60s/it] 98%|█████████▊| 4194/4286 [26:17:24<30:09, 19.67s/it] {'loss': 0.009, 'grad_norm': 1.260779600508107, 'learning_rate': 2.1465235650956602e-08, 'completion_length': 170.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.674107164144516, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.008928571827709675, 'kl': 0.22509765625, 'epoch': 0.98} 98%|█████████▊| 4194/4286 [26:17:24<30:09, 19.67s/it] 98%|█████████▊| 4195/4286 [26:17:44<30:01, 19.80s/it] {'loss': 0.0217, 'grad_norm': 5.850990778001105, 'learning_rate': 2.1231917872141857e-08, 'completion_length': 189.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6773809790611267, 'rewards/format_reward': 1.0, 'reward': 1.6773810386657715, 'reward_std': 0.06030578725039959, 'kl': 0.54541015625, 'epoch': 0.98} 98%|█████████▊| 4195/4286 [26:17:44<30:01, 19.80s/it] 98%|█████████▊| 4196/4286 [26:18:06<30:42, 20.47s/it] {'loss': 0.0704, 'grad_norm': 2.976834968911767, 'learning_rate': 2.0998600093327112e-08, 'completion_length': 185.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5639881789684296, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5282739400863647, 'reward_std': 0.17844071984291077, 'kl': 1.7578125, 'epoch': 0.98} 98%|█████████▊| 4196/4286 [26:18:06<30:42, 20.47s/it] 98%|█████████▊| 4197/4286 [26:18:27<30:41, 20.69s/it] {'loss': 0.0622, 'grad_norm': 3.1495918678336823, 'learning_rate': 2.0765282314512368e-08, 'completion_length': 191.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.48055557906627655, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.4626984596252441, 'reward_std': 0.1209157407283783, 'kl': 1.552734375, 'epoch': 0.98} 98%|█████████▊| 4197/4286 [26:18:27<30:41, 20.69s/it][2025-03-03 07:26:05,663] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 98%|█████████▊| 4198/4286 [26:18:50<31:16, 21.33s/it] {'loss': 0.0502, 'grad_norm': 3.348546385500149, 'learning_rate': 2.053196453569762e-08, 'completion_length': 202.23214721679688, 'rewards/only_full_func_accuracy_reward': 0.6145833730697632, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5788691639900208, 'reward_std': 0.15178572572767735, 'kl': 1.251953125, 'epoch': 0.98} 98%|█████████▊| 4198/4286 [26:18:50<31:16, 21.33s/it][2025-03-03 07:26:25,077] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time 98%|█████████▊| 4199/4286 [26:19:09<30:05, 20.75s/it] {'loss': 0.0291, 'grad_norm': 19.548027941458493, 'learning_rate': 2.0298646756882875e-08, 'completion_length': 179.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6324404925107956, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.08311965316534042, 'kl': 0.7265625, 'epoch': 0.98} 98%|█████████▊| 4199/4286 [26:19:09<30:05, 20.75s/it] 98%|█████████▊| 4200/4286 [26:19:30<29:47, 20.79s/it] {'loss': 0.0388, 'grad_norm': 4.201634187150531, 'learning_rate': 2.006532897806813e-08, 'completion_length': 199.4821548461914, 'rewards/only_full_func_accuracy_reward': 0.6781463027000427, 'rewards/format_reward': 1.0, 'reward': 1.6781463027000427, 'reward_std': 0.12229341268539429, 'kl': 0.97119140625, 'epoch': 0.98} 98%|█████████▊| 4200/4286 [26:19:30<29:47, 20.79s/it] 98%|█████████▊| 4201/4286 [26:22:59<1:49:21, 77.19s/it] {'loss': 0.0604, 'grad_norm': 2.5769605937427764, 'learning_rate': 1.9832011199253382e-08, 'completion_length': 190.69644165039062, 'rewards/only_full_func_accuracy_reward': 0.370535746216774, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.352678656578064, 'reward_std': 0.10128936171531677, 'kl': 1.51171875, 'epoch': 0.98} 98%|█████████▊| 4201/4286 [26:22:59<1:49:21, 77.19s/it] 98%|█████████▊| 4202/4286 [26:23:19<1:23:58, 59.98s/it] {'loss': 0.0116, 'grad_norm': 2.0373835360363874, 'learning_rate': 1.9598693420438638e-08, 'completion_length': 209.46429443359375, 'rewards/only_full_func_accuracy_reward': 0.5744048058986664, 'rewards/format_reward': 1.0, 'reward': 1.5744048953056335, 'reward_std': 0.035961021669209, 'kl': 0.29052734375, 'epoch': 0.98} 98%|█████████▊| 4202/4286 [26:23:19<1:23:58, 59.98s/it] 98%|█████████▊| 4203/4286 [26:23:38<1:06:01, 47.73s/it] {'loss': 
0.0279, 'grad_norm': 0.8963574684061931, 'learning_rate': 1.936537564162389e-08, 'completion_length': 181.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.738095223903656, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.720238208770752, 'reward_std': 0.0714285746216774, 'kl': 0.6943359375, 'epoch': 0.98} 98%|█████████▊| 4203/4286 [26:23:38<1:06:01, 47.73s/it] 98%|█████████▊| 4204/4286 [26:23:58<53:55, 39.46s/it] {'loss': 0.062, 'grad_norm': 3.5317331943440515, 'learning_rate': 1.9132057862809145e-08, 'completion_length': 195.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.6651786267757416, 'rewards/format_reward': 1.0, 'reward': 1.665178656578064, 'reward_std': 0.05575069412589073, 'kl': 1.546875, 'epoch': 0.98} 98%|█████████▊| 4204/4286 [26:23:58<53:55, 39.46s/it] 98%|█████████▊| 4205/4286 [26:24:17<44:57, 33.30s/it] {'loss': 0.0069, 'grad_norm': 2.967720124175602, 'learning_rate': 1.8898740083994397e-08, 'completion_length': 203.1964340209961, 'rewards/only_full_func_accuracy_reward': 0.703869104385376, 'rewards/format_reward': 1.0, 'reward': 1.7038692235946655, 'reward_std': 0.019238397479057312, 'kl': 0.171875, 'epoch': 0.98} 98%|█████████▊| 4205/4286 [26:24:17<44:57, 33.30s/it] 98%|█████████▊| 4206/4286 [26:24:36<38:41, 29.02s/it] {'loss': 0.0139, 'grad_norm': 18.71940789813737, 'learning_rate': 1.8665422305179652e-08, 'completion_length': 190.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.5622449368238449, 'rewards/format_reward': 1.0, 'reward': 1.5622450113296509, 'reward_std': 0.01598639413714409, 'kl': 0.3466796875, 'epoch': 0.98} 98%|█████████▊| 4206/4286 [26:24:36<38:41, 29.02s/it] 98%|█████████▊| 4207/4286 [26:24:55<34:11, 25.97s/it] {'loss': 0.0072, 'grad_norm': 3.6730584836862032, 'learning_rate': 1.8432104526364907e-08, 'completion_length': 183.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.8422619700431824, 'rewards/format_reward': 1.0, 'reward': 1.8422619700431824, 'reward_std': 
0.0297619067132473, 'kl': 0.1787109375, 'epoch': 0.98} 98%|█████████▊| 4207/4286 [26:24:55<34:11, 25.97s/it] 98%|█████████▊| 4208/4286 [26:25:13<30:47, 23.69s/it] {'loss': 0.0068, 'grad_norm': 3.2180791240340647, 'learning_rate': 1.8198786747550163e-08, 'completion_length': 181.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.030682744458317757, 'kl': 0.1689453125, 'epoch': 0.98} 98%|█████████▊| 4208/4286 [26:25:13<30:47, 23.69s/it] 98%|█████████▊| 4209/4286 [26:25:33<28:48, 22.45s/it] {'loss': 0.0111, 'grad_norm': 3.335509123567878, 'learning_rate': 1.7965468968735415e-08, 'completion_length': 192.83929443359375, 'rewards/only_full_func_accuracy_reward': 0.5699405074119568, 'rewards/format_reward': 1.0, 'reward': 1.5699405670166016, 'reward_std': 0.07716727629303932, 'kl': 0.27783203125, 'epoch': 0.98} 98%|█████████▊| 4209/4286 [26:25:33<28:48, 22.45s/it] 98%|█████████▊| 4210/4286 [26:25:52<27:23, 21.63s/it] {'loss': 0.0303, 'grad_norm': 3.109593015302696, 'learning_rate': 1.773215118992067e-08, 'completion_length': 173.5714340209961, 'rewards/only_full_func_accuracy_reward': 0.6089285910129547, 'rewards/format_reward': 1.0, 'reward': 1.608928620815277, 'reward_std': 0.05867985263466835, 'kl': 0.7607421875, 'epoch': 0.98} 98%|█████████▊| 4210/4286 [26:25:52<27:23, 21.63s/it] 98%|█████████▊| 4211/4286 [26:26:11<25:43, 20.58s/it] {'loss': 0.0359, 'grad_norm': 2.047649368304466, 'learning_rate': 1.7498833411105925e-08, 'completion_length': 184.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678572535514832, 'reward_std': 0.15660357102751732, 'kl': 0.89697265625, 'epoch': 0.98} 98%|█████████▊| 4211/4286 [26:26:11<25:43, 20.58s/it] 98%|█████████▊| 4212/4286 [26:26:31<25:25, 20.61s/it] {'loss': 0.0589, 'grad_norm': 3.1998226358841624, 'learning_rate': 1.726551563229118e-08, 
'completion_length': 201.17858123779297, 'rewards/only_full_func_accuracy_reward': 0.6154762208461761, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.597619116306305, 'reward_std': 0.11974501982331276, 'kl': 1.470703125, 'epoch': 0.98} 98%|█████████▊| 4212/4286 [26:26:31<25:25, 20.61s/it] 98%|█████████▊| 4213/4286 [26:26:53<25:23, 20.86s/it] {'loss': 0.0197, 'grad_norm': 2.7353644697465866, 'learning_rate': 1.7032197853476433e-08, 'completion_length': 177.08929443359375, 'rewards/only_full_func_accuracy_reward': 0.7261905372142792, 'rewards/format_reward': 1.0, 'reward': 1.7261906266212463, 'reward_std': 0.011904764920473099, 'kl': 0.49267578125, 'epoch': 0.98} 98%|█████████▊| 4213/4286 [26:26:53<25:23, 20.86s/it] 98%|█████████▊| 4214/4286 [26:27:15<25:28, 21.22s/it] {'loss': 0.0201, 'grad_norm': 7.179820449912677, 'learning_rate': 1.6798880074661688e-08, 'completion_length': 178.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.5863095819950104, 'rewards/format_reward': 1.0, 'reward': 1.5863096117973328, 'reward_std': 0.034693021327257156, 'kl': 0.501953125, 'epoch': 0.98} 98%|█████████▊| 4214/4286 [26:27:15<25:28, 21.22s/it] 98%|█████████▊| 4215/4286 [26:27:33<24:06, 20.37s/it] {'loss': 0.0711, 'grad_norm': 1.989987050436229, 'learning_rate': 1.6565562295846943e-08, 'completion_length': 195.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.5803572237491608, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5625001192092896, 'reward_std': 0.150876946747303, 'kl': 1.78515625, 'epoch': 0.98} 98%|█████████▊| 4215/4286 [26:27:33<24:06, 20.37s/it] 98%|█████████▊| 4216/4286 [26:27:55<24:15, 20.79s/it] {'loss': 0.0562, 'grad_norm': 8.503086962308217, 'learning_rate': 1.6332244517032195e-08, 'completion_length': 199.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.5654762387275696, 'rewards/format_reward': 1.0, 'reward': 1.5654762983322144, 'reward_std': 0.13423834182322025, 'kl': 1.404296875, 'epoch': 0.98} 98%|█████████▊| 
4216/4286 [26:27:55<24:15, 20.79s/it] 98%|█████████▊| 4217/4286 [26:28:14<23:09, 20.13s/it] {'loss': 0.0111, 'grad_norm': 1.6994623203990478, 'learning_rate': 1.609892673821745e-08, 'completion_length': 194.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.6741071343421936, 'rewards/format_reward': 1.0, 'reward': 1.6741072535514832, 'reward_std': 0.07465150393545628, 'kl': 0.27734375, 'epoch': 0.98} 98%|█████████▊| 4217/4286 [26:28:14<23:09, 20.13s/it] 98%|█████████▊| 4218/4286 [26:28:31<21:55, 19.34s/it] {'loss': 0.0089, 'grad_norm': 6.043085221849337, 'learning_rate': 1.5865608959402706e-08, 'completion_length': 163.42857360839844, 'rewards/only_full_func_accuracy_reward': 0.7083333730697632, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.04191340319812298, 'kl': 0.22216796875, 'epoch': 0.98} 98%|█████████▊| 4218/4286 [26:28:31<21:55, 19.34s/it] 98%|█████████▊| 4219/4286 [26:28:50<21:21, 19.13s/it] {'loss': 0.0092, 'grad_norm': 7.119861347406628, 'learning_rate': 1.563229118058796e-08, 'completion_length': 177.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.6577381193637848, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.032437440007925034, 'kl': 0.228515625, 'epoch': 0.98} 98%|█████████▊| 4219/4286 [26:28:50<21:21, 19.13s/it] 98%|█████████▊| 4220/4286 [26:29:10<21:35, 19.63s/it] {'loss': 0.0261, 'grad_norm': 7.8285930693052705, 'learning_rate': 1.5398973401773213e-08, 'completion_length': 191.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.5741071701049805, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.538392961025238, 'reward_std': 0.06541862338781357, 'kl': 0.6494140625, 'epoch': 0.98} 98%|█████████▊| 4220/4286 [26:29:10<21:35, 19.63s/it] 98%|█████████▊| 4221/4286 [26:29:29<20:54, 19.30s/it] {'loss': 0.0187, 'grad_norm': 96.69935480136826, 'learning_rate': 1.516565562295847e-08, 'completion_length': 178.2857208251953, 'rewards/only_full_func_accuracy_reward': 
0.6660714745521545, 'rewards/format_reward': 1.0, 'reward': 1.6660715341567993, 'reward_std': 0.050651200115680695, 'kl': 0.4677734375, 'epoch': 0.98} 98%|█████████▊| 4221/4286 [26:29:29<20:54, 19.30s/it] 99%|█████████▊| 4222/4286 [26:29:50<21:09, 19.84s/it] {'loss': 0.049, 'grad_norm': 2.9668823778225164, 'learning_rate': 1.4932337844143724e-08, 'completion_length': 179.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.6276786029338837, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6098215579986572, 'reward_std': 0.0898132249712944, 'kl': 1.2265625, 'epoch': 0.99} 99%|█████████▊| 4222/4286 [26:29:50<21:09, 19.84s/it] 99%|█████████▊| 4223/4286 [26:30:08<20:15, 19.30s/it] {'loss': 0.029, 'grad_norm': 6.420571316927044, 'learning_rate': 1.4699020065328977e-08, 'completion_length': 180.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7083334028720856, 'rewards/format_reward': 1.0, 'reward': 1.708333432674408, 'reward_std': 0.0357142873108387, 'kl': 0.7275390625, 'epoch': 0.99} 99%|█████████▊| 4223/4286 [26:30:08<20:15, 19.30s/it] 99%|█████████▊| 4224/4286 [26:30:26<19:34, 18.95s/it] {'loss': 0.0075, 'grad_norm': 3.5410476153622072, 'learning_rate': 1.4465702286514233e-08, 'completion_length': 186.0714340209961, 'rewards/only_full_func_accuracy_reward': 0.7470238208770752, 'rewards/format_reward': 1.0, 'reward': 1.7470239400863647, 'reward_std': 0.005952383857220411, 'kl': 0.18603515625, 'epoch': 0.99} 99%|█████████▊| 4224/4286 [26:30:26<19:34, 18.95s/it] 99%|█████████▊| 4225/4286 [26:30:46<19:29, 19.17s/it] {'loss': 0.0105, 'grad_norm': 1.9125279563824698, 'learning_rate': 1.4232384507699486e-08, 'completion_length': 181.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.5922619104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5744048953056335, 'reward_std': 0.04602411389350891, 'kl': 0.26171875, 'epoch': 0.99} 99%|█████████▊| 4225/4286 [26:30:46<19:29, 19.17s/it] 99%|█████████▊| 4226/4286 [26:31:06<19:30, 
19.51s/it] {'loss': 0.0587, 'grad_norm': 13.57370545203503, 'learning_rate': 1.3999066728884742e-08, 'completion_length': 187.48214721679688, 'rewards/only_full_func_accuracy_reward': 0.613690510392189, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.595833420753479, 'reward_std': 0.15569988545030355, 'kl': 1.466796875, 'epoch': 0.99} 99%|█████████▊| 4226/4286 [26:31:06<19:30, 19.51s/it] 99%|█████████▊| 4227/4286 [26:31:25<18:54, 19.23s/it] {'loss': 0.0743, 'grad_norm': 5.019887507812805, 'learning_rate': 1.3765748950069995e-08, 'completion_length': 196.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6461310088634491, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.628273844718933, 'reward_std': 0.1332686049863696, 'kl': 1.85986328125, 'epoch': 0.99} 99%|█████████▊| 4227/4286 [26:31:25<18:54, 19.23s/it] 99%|█████████▊| 4228/4286 [26:31:44<18:27, 19.10s/it] {'loss': 0.0222, 'grad_norm': 14.28305564860265, 'learning_rate': 1.3532431171255249e-08, 'completion_length': 192.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6324405074119568, 'rewards/format_reward': 1.0, 'reward': 1.6324405670166016, 'reward_std': 0.08035714365541935, 'kl': 0.5576171875, 'epoch': 0.99} 99%|█████████▊| 4228/4286 [26:31:44<18:27, 19.10s/it] 99%|█████████▊| 4229/4286 [26:32:03<18:17, 19.25s/it] {'loss': 0.0145, 'grad_norm': 4.784815908621302, 'learning_rate': 1.3299113392440503e-08, 'completion_length': 183.9464340209961, 'rewards/only_full_func_accuracy_reward': 0.5782738327980042, 'rewards/format_reward': 1.0, 'reward': 1.578273892402649, 'reward_std': 0.09285644814372063, 'kl': 0.36328125, 'epoch': 0.99} 99%|█████████▊| 4229/4286 [26:32:03<18:17, 19.25s/it] 99%|█████████▊| 4230/4286 [26:32:21<17:25, 18.67s/it] {'loss': 0.0073, 'grad_norm': 0.7750422244123029, 'learning_rate': 1.3065795613625758e-08, 'completion_length': 180.8214340209961, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 
1.7559524774551392, 'reward_std': 0.025651192292571068, 'kl': 0.18115234375, 'epoch': 0.99}  99%|█████████▊| 4230/4286 [26:32:21<17:25, 18.67s/it]
{'loss': 0.0454, 'grad_norm': 35.862879684978054, 'learning_rate': 1.2832477834811011e-08, 'completion_length': 167.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.595238208770752, 'reward_std': 0.1190476268529892, 'kl': 1.13671875, 'epoch': 0.99}  99%|█████████▊| 4231/4286 [26:32:39<17:07, 18.68s/it]
{'loss': 0.0778, 'grad_norm': 6.082171875480919, 'learning_rate': 1.2599160055996267e-08, 'completion_length': 207.28572845458984, 'rewards/only_full_func_accuracy_reward': 0.6797619462013245, 'rewards/format_reward': 1.0, 'reward': 1.6797620058059692, 'reward_std': 0.0681816479191184, 'kl': 1.94140625, 'epoch': 0.99}  99%|█████████▊| 4232/4286 [26:33:01<17:38, 19.59s/it]
{'loss': 0.0314, 'grad_norm': 3.4626507812983247, 'learning_rate': 1.236584227718152e-08, 'completion_length': 172.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.713988184928894, 'rewards/format_reward': 1.0, 'reward': 1.7139882445335388, 'reward_std': 0.04640500992536545, 'kl': 0.787109375, 'epoch': 0.99}  99%|█████████▉| 4233/4286 [26:33:19<16:55, 19.16s/it]
{'loss': 0.0294, 'grad_norm': 14.292323221010546, 'learning_rate': 1.2132524498366774e-08, 'completion_length': 205.96429443359375, 'rewards/only_full_func_accuracy_reward': 0.605654776096344, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.58779776096344, 'reward_std': 0.1357702501118183, 'kl': 0.7333984375, 'epoch': 0.99}  99%|█████████▉| 4234/4286 [26:33:41<17:17, 19.95s/it]
{'loss': 0.0306, 'grad_norm': 7.286965999157949, 'learning_rate': 1.189920671955203e-08, 'completion_length': 192.07144165039062, 'rewards/only_full_func_accuracy_reward': 0.5550595670938492, 'rewards/format_reward': 1.0, 'reward': 1.5550596714019775, 'reward_std': 0.04685881920158863, 'kl': 0.76806640625, 'epoch': 0.99}  99%|█████████▉| 4235/4286 [26:34:02<17:11, 20.23s/it]
{'loss': 0.0307, 'grad_norm': 1.5765134464620307, 'learning_rate': 1.1665888940737283e-08, 'completion_length': 194.32144165039062, 'rewards/only_full_func_accuracy_reward': 0.5997024178504944, 'rewards/format_reward': 1.0, 'reward': 1.5997024774551392, 'reward_std': 0.05838929861783981, 'kl': 0.767578125, 'epoch': 0.99}  99%|█████████▉| 4236/4286 [26:34:22<16:43, 20.08s/it]
{'loss': 0.0151, 'grad_norm': 5.320928179851487, 'learning_rate': 1.1432571161922538e-08, 'completion_length': 197.94644165039062, 'rewards/only_full_func_accuracy_reward': 0.6264881491661072, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6086310744285583, 'reward_std': 0.081052640452981, 'kl': 0.37548828125, 'epoch': 0.99}  99%|█████████▉| 4237/4286 [26:34:41<16:19, 19.99s/it]
{'loss': 0.0347, 'grad_norm': 7.2870823640038225, 'learning_rate': 1.1199253383107792e-08, 'completion_length': 202.3571548461914, 'rewards/only_full_func_accuracy_reward': 0.772321492433548, 'rewards/format_reward': 1.0, 'reward': 1.7723215222358704, 'reward_std': 0.056547620333731174, 'kl': 0.869140625, 'epoch': 0.99}  99%|█████████▉| 4238/4286 [26:35:03<16:30, 20.64s/it]
{'loss': 0.0344, 'grad_norm': 2.459380558188502, 'learning_rate': 1.0965935604293047e-08, 'completion_length': 188.21429443359375, 'rewards/only_full_func_accuracy_reward': 0.5829081833362579, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5471939444541931, 'reward_std': 0.13700179383158684, 'kl': 0.857421875, 'epoch': 0.99}  99%|█████████▉| 4239/4286 [26:35:24<16:02, 20.48s/it]
{'loss': 0.0187, 'grad_norm': 5.022697586763219, 'learning_rate': 1.0732617825478301e-08, 'completion_length': 201.1071548461914, 'rewards/only_full_func_accuracy_reward': 0.6056548357009888, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5877977013587952, 'reward_std': 0.07095707580447197, 'kl': 0.4658203125, 'epoch': 0.99}  99%|█████████▉| 4240/4286 [26:35:42<15:15, 19.90s/it]
{'loss': 0.0176, 'grad_norm': 3.227015532599903, 'learning_rate': 1.0499300046663556e-08, 'completion_length': 190.1428680419922, 'rewards/only_full_func_accuracy_reward': 0.6443452835083008, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.626488208770752, 'reward_std': 0.06250000558793545, 'kl': 0.439453125, 'epoch': 0.99}  99%|█████████▉| 4241/4286 [26:36:03<15:11, 20.26s/it]
{'loss': 0.0087, 'grad_norm': 1.8244779478704338, 'learning_rate': 1.026598226784881e-08, 'completion_length': 181.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6883929371833801, 'rewards/format_reward': 1.0, 'reward': 1.6883929371833801, 'reward_std': 0.03209754452109337, 'kl': 0.216796875, 'epoch': 0.99}  99%|█████████▉| 4242/4286 [26:36:21<14:18, 19.52s/it]
{'loss': 0.0471, 'grad_norm': 3.1225355766625214, 'learning_rate': 1.0032664489034065e-08, 'completion_length': 201.7857208251953, 'rewards/only_full_func_accuracy_reward': 0.6994048357009888, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.68154776096344, 'reward_std': 0.1076585166156292, 'kl': 1.1796875, 'epoch': 0.99}  99%|█████████▉| 4243/4286 [26:36:40<13:55, 19.44s/it]
{'loss': 0.0501, 'grad_norm': 2.7381356664840073, 'learning_rate': 9.799346710219319e-09, 'completion_length': 206.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6666668057441711, 'reward_std': 0.130827397108078, 'kl': 1.25, 'epoch': 0.99}  99%|█████████▉| 4244/4286 [26:37:05<14:42, 21.01s/it]
[2025-03-03 07:44:41,719] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.0587, 'grad_norm': 2.3037683719451985, 'learning_rate': 9.566028931404572e-09, 'completion_length': 204.6071548461914, 'rewards/only_full_func_accuracy_reward': 0.5401785969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5223215818405151, 'reward_std': 0.1179309505969286, 'kl': 1.47265625, 'epoch': 0.99}  99%|█████████▉| 4245/4286 [26:37:26<14:20, 20.98s/it]
{'loss': 0.0073, 'grad_norm': 1.114332842451177, 'learning_rate': 9.332711152589826e-09, 'completion_length': 183.57144165039062, 'rewards/only_full_func_accuracy_reward': 0.6949405074119568, 'rewards/format_reward': 1.0, 'reward': 1.6949405670166016, 'reward_std': 0.008928571827709675, 'kl': 0.181640625, 'epoch': 0.99}  99%|█████████▉| 4246/4286 [26:37:44<13:28, 20.22s/it]
{'loss': 0.0108, 'grad_norm': 7.278188197785304, 'learning_rate': 9.099393373775081e-09, 'completion_length': 193.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.7514881789684296, 'rewards/format_reward': 1.0, 'reward': 1.7514882683753967, 'reward_std': 0.029548224061727524, 'kl': 0.27001953125, 'epoch': 0.99}  99%|█████████▉| 4247/4286 [26:38:03<12:52, 19.80s/it]
{'loss': 0.0292, 'grad_norm': 4.532910649743139, 'learning_rate': 8.866075594960335e-09, 'completion_length': 188.71429443359375, 'rewards/only_full_func_accuracy_reward': 0.735119104385376, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7172620296478271, 'reward_std': 0.12684168107807636, 'kl': 0.7294921875, 'epoch': 0.99}  99%|█████████▉| 4248/4286 [26:38:22<12:18, 19.43s/it]
{'loss': 0.0069, 'grad_norm': 0.5907163129404978, 'learning_rate': 8.63275781614559e-09, 'completion_length': 194.17857360839844, 'rewards/only_full_func_accuracy_reward': 0.6919643580913544, 'rewards/format_reward': 1.0, 'reward': 1.6919643878936768, 'reward_std': 0.008928571827709675, 'kl': 0.17333984375, 'epoch': 0.99}  99%|█████████▉| 4249/4286 [26:38:40<11:49, 19.18s/it]
{'loss': 0.0263, 'grad_norm': 11.068567704011434, 'learning_rate': 8.399440037330844e-09, 'completion_length': 209.33929443359375, 'rewards/only_full_func_accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715818405151, 'reward_std': 0.07142858020961285, 'kl': 0.6572265625, 'epoch': 0.99}  99%|█████████▉| 4250/4286 [26:39:03<12:05, 20.16s/it]
{'loss': 0.0199, 'grad_norm': 2.9513331630157396, 'learning_rate': 8.166122258516098e-09, 'completion_length': 180.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.5482143312692642, 'rewards/format_reward': 1.0, 'reward': 1.5482143759727478, 'reward_std': 0.041900184005498886, 'kl': 0.4990234375, 'epoch': 0.99}  99%|█████████▉| 4251/4286 [26:39:22<11:35, 19.86s/it]
{'loss': 0.0483, 'grad_norm': 1.5495080445896667, 'learning_rate': 7.932804479701353e-09, 'completion_length': 186.8928680419922, 'rewards/only_full_func_accuracy_reward': 0.6979166865348816, 'rewards/format_reward': 1.0, 'reward': 1.6979168057441711, 'reward_std': 0.0565476268529892, 'kl': 1.20263671875, 'epoch': 0.99}  99%|█████████▉| 4252/4286 [26:39:40<10:54, 19.24s/it]
{'loss': 0.0159, 'grad_norm': 0.9761347562304338, 'learning_rate': 7.699486700886607e-09, 'completion_length': 187.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.5747023969888687, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5568453669548035, 'reward_std': 0.06201038882136345, 'kl': 0.396484375, 'epoch': 0.99}  99%|█████████▉| 4253/4286 [26:39:59<10:39, 19.36s/it]
{'loss': 0.0085, 'grad_norm': 4.152235517468497, 'learning_rate': 7.466168922071862e-09, 'completion_length': 198.4464340209961, 'rewards/only_full_func_accuracy_reward': 0.6904762387275696, 'rewards/format_reward': 1.0, 'reward': 1.6904763579368591, 'reward_std': 0.0476190522313118, 'kl': 0.2138671875, 'epoch': 0.99}  99%|█████████▉| 4254/4286 [26:40:20<10:30, 19.69s/it]
{'loss': 0.0821, 'grad_norm': 25.184906771377655, 'learning_rate': 7.232851143257116e-09, 'completion_length': 221.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.6058249175548553, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5701106786727905, 'reward_std': 0.12535633146762848, 'kl': 2.046875, 'epoch': 0.99}  99%|█████████▉| 4255/4286 [26:40:44<10:48, 20.91s/it]
{'loss': 0.0693, 'grad_norm': 5.337715038710265, 'learning_rate': 6.999533364442371e-09, 'completion_length': 174.73214721679688, 'rewards/only_full_func_accuracy_reward': 0.6309524178504944, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6130953431129456, 'reward_std': 0.12021520361304283, 'kl': 1.73046875, 'epoch': 0.99}  99%|█████████▉| 4256/4286 [26:41:04<10:22, 20.74s/it]
{'loss': 0.0085, 'grad_norm': 1.2071048874143298, 'learning_rate': 6.7662155856276244e-09, 'completion_length': 183.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.5773810148239136, 'rewards/format_reward': 1.0, 'reward': 1.5773810744285583, 'reward_std': 0.01785714365541935, 'kl': 0.21240234375, 'epoch': 0.99}  99%|█████████▉| 4257/4286 [26:41:22<09:40, 20.03s/it]
{'loss': 0.0274, 'grad_norm': 4.002294284192922, 'learning_rate': 6.532897806812879e-09, 'completion_length': 192.19644165039062, 'rewards/only_full_func_accuracy_reward': 0.6324405372142792, 'rewards/format_reward': 1.0, 'reward': 1.6324406266212463, 'reward_std': 0.04900030745193362, 'kl': 0.68310546875, 'epoch': 0.99}  99%|█████████▉| 4258/4286 [26:41:41<09:11, 19.69s/it]
{'loss': 0.0253, 'grad_norm': 2.1772142398776753, 'learning_rate': 6.299580027998133e-09, 'completion_length': 170.39286041259766, 'rewards/only_full_func_accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.06890111975371838, 'kl': 0.63330078125, 'epoch': 0.99}  99%|█████████▉| 4259/4286 [26:41:59<08:33, 19.04s/it]
{'loss': 0.027, 'grad_norm': 2.772791567723558, 'learning_rate': 6.066262249183387e-09, 'completion_length': 187.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6654762327671051, 'rewards/format_reward': 1.0, 'reward': 1.6654762625694275, 'reward_std': 0.034704845398664474, 'kl': 0.673828125, 'epoch': 0.99}  99%|█████████▉| 4260/4286 [26:42:19<08:28, 19.55s/it]
{'loss': 0.0438, 'grad_norm': 1.8108097646703791, 'learning_rate': 5.8329444703686415e-09, 'completion_length': 205.35714721679688, 'rewards/only_full_func_accuracy_reward': 0.6491071879863739, 'rewards/format_reward': 1.0, 'reward': 1.649107277393341, 'reward_std': 0.07083334028720856, 'kl': 1.09326171875, 'epoch': 0.99}  99%|█████████▉| 4261/4286 [26:42:39<08:12, 19.69s/it]
{'loss': 0.0551, 'grad_norm': 1.0902725452580126, 'learning_rate': 5.599626691553896e-09, 'completion_length': 193.3214340209961, 'rewards/only_full_func_accuracy_reward': 0.586309552192688, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.568452537059784, 'reward_std': 0.12295747548341751, 'kl': 1.37451171875, 'epoch': 0.99}  99%|█████████▉| 4262/4286 [26:43:00<07:58, 19.94s/it]
{'loss': 0.0131, 'grad_norm': 1.6839784161224425, 'learning_rate': 5.3663089127391504e-09, 'completion_length': 174.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.6696428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6696429252624512, 'reward_std': 0.010309826582670212, 'kl': 0.32861328125, 'epoch': 0.99}  99%|█████████▉| 4263/4286 [26:43:19<07:32, 19.66s/it]
{'loss': 0.0282, 'grad_norm': 12.120468993030272, 'learning_rate': 5.132991133924405e-09, 'completion_length': 187.58929443359375, 'rewards/only_full_func_accuracy_reward': 0.6315476596355438, 'rewards/format_reward': 1.0, 'reward': 1.6315476894378662, 'reward_std': 0.05698513612151146, 'kl': 0.70703125, 'epoch': 0.99}  99%|█████████▉| 4264/4286 [26:43:39<07:12, 19.64s/it]
{'loss': 0.0134, 'grad_norm': 6.600789375832465, 'learning_rate': 4.899673355109659e-09, 'completion_length': 204.05358123779297, 'rewards/only_full_func_accuracy_reward': 0.6580357551574707, 'rewards/format_reward': 1.0, 'reward': 1.6580357551574707, 'reward_std': 0.09205486625432968, 'kl': 0.3349609375, 'epoch': 1.0}  100%|█████████▉| 4265/4286 [26:43:57<06:47, 19.42s/it]
{'loss': 0.0492, 'grad_norm': 6.312152033696758, 'learning_rate': 4.666355576294913e-09, 'completion_length': 191.6428680419922, 'rewards/only_full_func_accuracy_reward': 0.6398810148239136, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6041668057441711, 'reward_std': 0.17554771155118942, 'kl': 1.2333984375, 'epoch': 1.0}  100%|█████████▉| 4266/4286 [26:44:18<06:37, 19.90s/it]
{'loss': 0.0071, 'grad_norm': 1.5062106259032793, 'learning_rate': 4.4330377974801675e-09, 'completion_length': 173.30358123779297, 'rewards/only_full_func_accuracy_reward': 0.7559524178504944, 'rewards/format_reward': 1.0, 'reward': 1.7559524774551392, 'reward_std': 0.01626220904290676, 'kl': 0.1767578125, 'epoch': 1.0}  100%|█████████▉| 4267/4286 [26:44:36<06:06, 19.26s/it]
{'loss': 0.0824, 'grad_norm': 523.169287156582, 'learning_rate': 4.199720018665422e-09, 'completion_length': 190.1607208251953, 'rewards/only_full_func_accuracy_reward': 0.6011905074119568, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.583333432674408, 'reward_std': 0.09493973851203918, 'kl': 2.056640625, 'epoch': 1.0}  100%|█████████▉| 4268/4286 [26:44:55<05:41, 18.98s/it]
{'loss': 0.0182, 'grad_norm': 9.436388842968203, 'learning_rate': 3.9664022398506764e-09, 'completion_length': 194.87500762939453, 'rewards/only_full_func_accuracy_reward': 0.6330357193946838, 'rewards/format_reward': 1.0, 'reward': 1.6330357789993286, 'reward_std': 0.04902936052531004, 'kl': 0.4560546875, 'epoch': 1.0}  100%|█████████▉| 4269/4286 [26:45:14<05:25, 19.18s/it]
{'loss': 0.0709, 'grad_norm': 2.51376391791416, 'learning_rate': 3.733084461035931e-09, 'completion_length': 203.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.5768849700689316, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.559027910232544, 'reward_std': 0.16109099984169006, 'kl': 1.767578125, 'epoch': 1.0}  100%|█████████▉| 4270/4286 [26:45:39<05:33, 20.85s/it]
[2025-03-03 07:53:16,281] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
{'loss': 0.024, 'grad_norm': 6.492668571473638, 'learning_rate': 3.4997666822211854e-09, 'completion_length': 177.01786041259766, 'rewards/only_full_func_accuracy_reward': 0.7901786267757416, 'rewards/format_reward': 1.0, 'reward': 1.790178656578064, 'reward_std': 0.031143157742917538, 'kl': 0.60205078125, 'epoch': 1.0}  100%|█████████▉| 4271/4286 [26:46:00<05:15, 21.03s/it]
{'loss': 0.0278, 'grad_norm': 5.80148629126257, 'learning_rate': 3.2664489034064395e-09, 'completion_length': 191.51786041259766, 'rewards/only_full_func_accuracy_reward': 0.6830357611179352, 'rewards/format_reward': 1.0, 'reward': 1.6830358505249023, 'reward_std': 0.06250000465661287, 'kl': 0.69287109375, 'epoch': 1.0}  100%|█████████▉| 4272/4286 [26:46:19<04:43, 20.22s/it]
{'loss': 0.015, 'grad_norm': 4.8590808671148755, 'learning_rate': 3.0331311245916935e-09, 'completion_length': 191.5357208251953, 'rewards/only_full_func_accuracy_reward': 0.6065476834774017, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5886905193328857, 'reward_std': 0.09242979087866843, 'kl': 0.37353515625, 'epoch': 1.0}  100%|█████████▉| 4273/4286 [26:46:41<04:32, 20.95s/it]
[2025-03-03 07:54:14,915] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
{'loss': 0.0078, 'grad_norm': 0.2851360795118726, 'learning_rate': 2.799813345776948e-09, 'completion_length': 168.62500762939453, 'rewards/only_full_func_accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.0, 'kl': 0.1953125, 'epoch': 1.0}  100%|█████████▉| 4274/4286 [26:46:59<03:59, 19.96s/it]
{'loss': 0.0228, 'grad_norm': 9.865305059926708, 'learning_rate': 2.5664955669622025e-09, 'completion_length': 192.6964340209961, 'rewards/only_full_func_accuracy_reward': 0.6577380895614624, 'rewards/format_reward': 1.0, 'reward': 1.657738208770752, 'reward_std': 0.046098590828478336, 'kl': 0.5712890625, 'epoch': 1.0}  100%|█████████▉| 4275/4286 [26:47:18<03:35, 19.57s/it]
{'loss': 0.046, 'grad_norm': 2.988270789301084, 'learning_rate': 2.3331777881474565e-09, 'completion_length': 181.42858123779297, 'rewards/only_full_func_accuracy_reward': 0.71726194024086, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6994048953056335, 'reward_std': 0.11876922100782394, 'kl': 1.150390625, 'epoch': 1.0}  100%|█████████▉| 4276/4286 [26:47:36<03:11, 19.14s/it]
{'loss': 0.0091, 'grad_norm': 2.183072086246251, 'learning_rate': 2.099860009332711e-09, 'completion_length': 215.14286041259766, 'rewards/only_full_func_accuracy_reward': 0.690476268529892, 'rewards/format_reward': 1.0, 'reward': 1.6904762983322144, 'reward_std': 0.08290597051382065, 'kl': 0.22900390625, 'epoch': 1.0}  100%|█████████▉| 4277/4286 [26:47:57<02:57, 19.69s/it]
{'loss': 0.0149, 'grad_norm': 6.55538011103947, 'learning_rate': 1.8665422305179655e-09, 'completion_length': 198.9107208251953, 'rewards/only_full_func_accuracy_reward': 0.5238095223903656, 'rewards/format_reward': 1.0, 'reward': 1.5238096117973328, 'reward_std': 0.0, 'kl': 0.3720703125, 'epoch': 1.0}  100%|█████████▉| 4278/4286 [26:48:19<02:44, 20.55s/it]
{'loss': 0.0141, 'grad_norm': 2.724090341274087, 'learning_rate': 1.6332244517032197e-09, 'completion_length': 193.37500762939453, 'rewards/only_full_func_accuracy_reward': 0.5842262208461761, 'rewards/format_reward': 1.0, 'reward': 1.5842262506484985, 'reward_std': 0.0720009058713913, 'kl': 0.35205078125, 'epoch': 1.0}  100%|█████████▉| 4279/4286 [26:48:38<02:20, 20.12s/it]
{'loss': 0.0074, 'grad_norm': 6.602794651650291, 'learning_rate': 1.399906672888474e-09, 'completion_length': 192.92858123779297, 'rewards/only_full_func_accuracy_reward': 0.62202388048172, 'rewards/format_reward': 1.0, 'reward': 1.6220239400863647, 'reward_std': 0.035207461565732956, 'kl': 0.18603515625, 'epoch': 1.0}  100%|█████████▉| 4280/4286 [26:48:57<01:58, 19.76s/it]
{'loss': 0.025, 'grad_norm': 2.3404621413154887, 'learning_rate': 1.1665888940737283e-09, 'completion_length': 176.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6845238506793976, 'rewards/format_reward': 1.0, 'reward': 1.6845239400863647, 'reward_std': 0.01785714365541935, 'kl': 0.625, 'epoch': 1.0}  100%|█████████▉| 4281/4286 [26:49:16<01:36, 19.36s/it]
{'loss': 0.0118, 'grad_norm': 13.19763197431407, 'learning_rate': 9.332711152589827e-10, 'completion_length': 216.3928680419922, 'rewards/only_full_func_accuracy_reward': 0.7127976715564728, 'rewards/format_reward': 1.0, 'reward': 1.71279776096344, 'reward_std': 0.07557233422994614, 'kl': 0.294921875, 'epoch': 1.0}  100%|█████████▉| 4282/4286 [26:49:36<01:18, 19.71s/it]
{'loss': 0.0158, 'grad_norm': 13.15822066834395, 'learning_rate': 6.99953336444237e-10, 'completion_length': 184.00000762939453, 'rewards/only_full_func_accuracy_reward': 0.7205357551574707, 'rewards/format_reward': 1.0, 'reward': 1.7205357551574707, 'reward_std': 0.07351037487387657, 'kl': 0.39453125, 'epoch': 1.0}  100%|█████████▉| 4283/4286 [26:49:56<00:59, 19.83s/it]
[2025-03-03 07:57:38,137] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
{'loss': 0.058, 'grad_norm': 2.3228248985842153, 'learning_rate': 4.666355576294914e-10, 'completion_length': 201.25000762939453, 'rewards/only_full_func_accuracy_reward': 0.641369104385376, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5699406266212463, 'reward_std': 0.2164071798324585, 'kl': 1.44921875, 'epoch': 1.0}  100%|█████████▉| 4284/4286 [26:50:22<00:43, 21.62s/it]
[2025-03-03 07:58:00,699] [WARNING] [stage3.py:2134:step] 1 pytorch allocator cache flushes since last step.
{'loss': 0.0464, 'grad_norm': 5.314056467720467, 'learning_rate': 2.333177788147457e-10, 'completion_length': 179.9821548461914, 'rewards/only_full_func_accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.11798760294914246, 'kl': 1.1572265625, 'epoch': 1.0}  100%|█████████▉| 4285/4286 [26:50:45<00:21, 21.90s/it]
{'loss': 0.007, 'grad_norm': 0.3876862576984759, 'learning_rate': 0.0, 'completion_length': 204.6666717529297, 'rewards/only_full_func_accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.0, 'kl': 0.17529296875, 'epoch': 1.0}  100%|██████████| 4286/4286 [26:51:02<00:00, 20.60s/it]
{'train_runtime': 96872.45, 'train_samples_per_second': 0.619, 'train_steps_per_second': 0.044, 'train_loss': 0.03789731134604421, 'epoch': 1.0}
100%|██████████| 4286/4286 [26:54:29<00:00, 22.60s/it]
wandb: 🚀 View run ONLY-FULL-SHUFFLE-R1-ZERO-VLLM-Correct-Qwen2-VL-7B-GRPO-TRANCE-60k-2025-03-02-05-04-15 at: https://wandb.ai/tanhuajie264-peking-university/vison-open-r1/runs/qwup6qq1
wandb: Find logs at: wandb/run-20250302_050711-qwup6qq1/logs
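The repeated stage3.py warnings above recommend flushing the CUDA allocator cache at the same point on all ranks. A minimal sketch of that advice, assuming a PyTorch training loop; `synchronized_empty_cache` and `flush_every` are hypothetical names introduced here, not part of DeepSpeed's API (the warning itself suggests `get_accelerator().empty_cache()` as the DeepSpeed-native call):

```python
def synchronized_empty_cache():
    """Flush the CUDA caching allocator on this rank, after a barrier so
    every rank flushes at the same time (the point of the warning above)."""
    import torch  # imported here so defining the helper needs nothing

    if torch.distributed.is_available() and torch.distributed.is_initialized():
        torch.distributed.barrier()  # keep all ranks in lockstep
    torch.cuda.empty_cache()         # no-op if CUDA was never initialized


# Hypothetical usage inside the loop (engine/loader stand in for the
# DeepSpeed engine and dataloader used in this run):
#
# for step, batch in enumerate(loader):
#     loss = engine(batch)
#     engine.backward(loss)
#     engine.step()
#     if step % flush_every == 0:
#         synchronized_empty_cache()
```

Calling it every step would hurt throughput, since `empty_cache()` forces the allocator to release cached blocks back to the driver; flushing every few dozen steps is the usual compromise when the cache-flush warnings appear frequently.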