Object shard /models/Nemotron-4-340B-Reward/model_weights/model.rm_head._extra_state/shard_0_1.pt not found
The Instruct model seems to work as expected, but I am having issues with the Reward model.
Tested containers: nemo_24_01_framework and nemo_24_03_01_framework
Resource config:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
Attempted resolution: re-downloaded the Nemotron-4-340B-Reward model (repeatedly)
Related error: https://github.com/NVIDIA/NeMo/issues/8785
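As a quick sanity check on the resource config, the sketch below confirms that the GPU count matches the parallelism requested in the job overrides (tensor_model_parallel_size=8, pipeline_model_parallel_size=2). The helper name check_parallel_layout is hypothetical, and the divisibility rule is the usual Megatron-style assumption that the world size must be a multiple of tensor-parallel times pipeline-parallel size.

```python
def check_parallel_layout(num_nodes, devices_per_node, tp_size, pp_size):
    """Return the implied data-parallel size, or raise if the layout cannot fit.

    Hypothetical helper for illustration; not part of NeMo or Megatron.
    """
    world_size = num_nodes * devices_per_node
    model_parallel_size = tp_size * pp_size
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"TP*PP = {model_parallel_size}"
        )
    return world_size // model_parallel_size


# Values taken from the SBATCH settings and the job overrides in this report.
dp_size = check_parallel_layout(num_nodes=2, devices_per_node=8, tp_size=8, pp_size=2)
print("data-parallel size:", dp_size)
```

With 2 nodes of 8 GPUs each, the 16 ranks are fully consumed by TP=8 and PP=2, which is consistent with the "MEMBER: .../16" lines in the log, so the layout itself does not look like the problem.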
On the second node I experience the following error:
Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/16
Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/16
Error executing job with overrides: ['rm_model_file=/models/Nemotron-4-340B-Reward', 'trainer.num_nodes=2', 'trainer.devices=8', '++model.tensor_model_parallel_size=8', '++model.pipeline_model_parallel_size=2', 'inference.micro_batch_size=2', 'inference.port=1424']
Error executing job with overrides: ['rm_model_file=/models/Nemotron-4-340B-Reward', 'trainer.num_nodes=2', 'trainer.devices=8', '++model.tensor_model_parallel_size=8', '++model.pipeline_model_parallel_size=2', 'inference.micro_batch_size=2', 'inference.port=1424']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/serialization.py", line 210, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 997, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 444, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 425, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/models/Nemotron-4-340B-Reward/model_weights/model.rm_head._extra_state/shard_0_1.pt'
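Since the error only shows up on the second node, one thing worth ruling out is whether the shard file is actually visible from that node. The following is a minimal sketch, not part of NeMo or Megatron: find_missing_shards is a hypothetical helper, and the relative path is simply the one from the traceback above. Running it on each node (for example via srun) would show whether the file is absent everywhere or only on some hosts.

```python
import socket
from pathlib import Path


def find_missing_shards(checkpoint_dir, expected_relpaths):
    """Return the expected relative paths that are not regular files under checkpoint_dir.

    Hypothetical helper for illustration only.
    """
    root = Path(checkpoint_dir)
    return [rel for rel in expected_relpaths if not (root / rel).is_file()]


if __name__ == "__main__":
    # Path and shard name copied from the FileNotFoundError above.
    missing = find_missing_shards(
        "/models/Nemotron-4-340B-Reward",
        ["model_weights/model.rm_head._extra_state/shard_0_1.pt"],
    )
    # Printing the hostname makes it easy to compare results across nodes.
    print(f"{socket.gethostname()}: missing shards = {missing}")
```

If the file is missing on every node, the checkpoint contents or the container version is the more likely cause; if it is missing only on one node, the mount of /models on that node would be the first thing to check.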
Apologies for the late reply, I just saw this. (In the future, if it is urgent, please follow up by email at [email protected].)
Please use the container from docker pull nvcr.io/nvidia/nemo:24.01.framework, as specified in the README, and try again. The download itself should be fine.