tori29umai committed
Commit 0a6cb98 · verified · 1 Parent(s): 8b92d1b

Upload error.txt

Files changed (1): error.txt (+151, -0)
error.txt ADDED
@@ -0,0 +1,151 @@
+ [rank0]:[E609 00:24:47.186853648 ProcessGroupNCCL.cpp:629] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600010 milliseconds before timing out.
+ [rank0]:[E609 00:24:47.189780854 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 0] failure detected by watchdog at work sequence id: 1 PG status: last enqueued work: 1, last completed work: -1
+ [rank0]:[E609 00:24:47.190080357 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
+ [rank1]:[E609 00:24:48.277501819 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
+ [rank1]:[E609 00:24:48.279945091 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 1 PG status: last enqueued work: 1, last completed work: -1
+ [rank1]:[E609 00:24:48.280072432 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
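
The two ProcessGroupNCCL.cpp:667 lines above report that the failed collective's stack trace is unavailable because FlightRecorder is disabled. A minimal sketch of enabling it for a future run, assuming the variable is set before the NCCL process group is created (the buffer size of 2000 is an illustrative value, not one taken from this log):

# Hedged sketch: turn on the NCCL FlightRecorder mentioned in the watchdog
# messages above. TORCH_NCCL_TRACE_BUFFER_SIZE must be in the environment
# before torch.distributed initializes NCCL, e.g. at the very top of the
# training script or exported before running `accelerate launch`.
import os

os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")  # assumed, illustrative size

# ... the usual imports (torch, accelerate, ...) and training code follow ...
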
+ [rank0]: Traceback (most recent call last):
+ [rank0]: File "/musubi-tuner/fpack_train_network.py", line 617, in <module>
+ [rank0]: trainer.train(args)
+ [rank0]: File "/musubi-tuner/hv_train_network.py", line 1648, in train
+ [rank0]: network, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(network, optimizer, train_dataloader, lr_scheduler)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1446, in prepare
+ [rank0]: result = tuple(
+ [rank0]: ^^^^^^
+ [rank0]: File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1447, in <genexpr>
+ [rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1289, in _prepare_one
+ [rank0]: return self.prepare_model(obj, device_placement=device_placement)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1595, in prepare_model
+ [rank0]: model = torch.nn.parallel.DistributedDataParallel(
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/musubi-tuner/venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
+ [rank0]: _verify_param_shape_across_processes(self.process_group, parameters)
+ [rank0]: File "/musubi-tuner/venv/lib/python3.11/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
+ [rank0]: return dist._verify_params_across_processes(process_group, tensors, logger)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: RuntimeError: DDP expects same model across all ranks, but Rank 0 has 880 params, while rank 1 has inconsistent 0 params.
+ [rank1]: Traceback (most recent call last):
+ [rank1]: File "/musubi-tuner/fpack_train_network.py", line 617, in <module>
+ [rank1]: trainer.train(args)
+ [rank1]: File "/musubi-tuner/hv_train_network.py", line 1648, in train
+ [rank1]: network, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(network, optimizer, train_dataloader, lr_scheduler)
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank1]: File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1446, in prepare
+ [rank1]: result = tuple(
+ [rank1]: ^^^^^^
+ [rank1]: File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1447, in <genexpr>
+ [rank1]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank1]: File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1289, in _prepare_one
+ [rank1]: return self.prepare_model(obj, device_placement=device_placement)
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank1]: File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1595, in prepare_model
+ [rank1]: model = torch.nn.parallel.DistributedDataParallel(
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank1]: File "/musubi-tuner/venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
+ [rank1]: _verify_param_shape_across_processes(self.process_group, parameters)
+ [rank1]: File "/musubi-tuner/venv/lib/python3.11/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
+ [rank1]: return dist._verify_params_across_processes(process_group, tensors, logger)
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank1]: RuntimeError: DDP expects same model across all ranks, but Rank 1 has 880 params, while rank 0 has inconsistent 0 params.
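
Note that both tracebacks end in the same DDP consistency check, and each rank reports 880 local parameters while seeing 0 from its peer. Combined with the timed-out ALLGATHER above (SeqNum=1, NumelIn=1, NumelOut=2), this is consistent with the very first verification collective never completing, rather than the two ranks genuinely building different networks. For orientation, a rough Python-level sketch of what that verification step does; the real dist._verify_params_across_processes is implemented in C++ and also compares shapes, so the names and simplifications here are purely illustrative:

# Hedged sketch of the parameter-count verification that DDP's constructor
# performs across ranks. The all_gather below corresponds to the ALLGATHER
# with NumelIn=1, NumelOut=2 that the NCCL watchdog reports as timing out.
import torch
import torch.distributed as dist


def sketch_verify_param_count(module: torch.nn.Module) -> None:
    # Assumes dist.init_process_group(...) has already been called.
    world_size = dist.get_world_size()
    local_count = torch.tensor([sum(1 for _ in module.parameters())])

    # Output slots start at zero; if the collective never completes (as the
    # watchdog reports here), a peer's slot may still hold its initial zeros,
    # which would match the "inconsistent 0 params" messages above.
    gathered = [torch.zeros_like(local_count) for _ in range(world_size)]
    dist.all_gather(gathered, local_count)

    counts = [int(t.item()) for t in gathered]
    if len(set(counts)) != 1:
        raise RuntimeError(
            f"DDP expects same model across all ranks, got param counts {counts}"
        )
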
+ [rank0]:[E609 00:24:48.714209617 ProcessGroupNCCL.cpp:681] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
+ [rank0]:[E609 00:24:48.714323378 ProcessGroupNCCL.cpp:695] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
+ [rank0]:[E609 00:24:48.722522951 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600010 milliseconds before timing out.
+ Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
+ frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x72f281acc788 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
+ frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x23d (0x72f2307d39ad in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x9e8 (0x72f2307d4fa8 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72f2307d5c4d in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #4: <unknown function> + 0xd8198 (0x72f284b27198 in /opt/conda/bin/../lib/libstdc++.so.6)
+ frame #5: <unknown function> + 0x94ac3 (0x72f2856c6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
+ frame #6: clone + 0x44 (0x72f285757a04 in /lib/x86_64-linux-gnu/libc.so.6)
+
+ terminate called after throwing an instance of 'c10::DistBackendError'
+ what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600010 milliseconds before timing out.
+ Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
+ frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x72f281acc788 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
+ frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x23d (0x72f2307d39ad in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x9e8 (0x72f2307d4fa8 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72f2307d5c4d in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #4: <unknown function> + 0xd8198 (0x72f284b27198 in /opt/conda/bin/../lib/libstdc++.so.6)
+ frame #5: <unknown function> + 0x94ac3 (0x72f2856c6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
+ frame #6: clone + 0x44 (0x72f285757a04 in /lib/x86_64-linux-gnu/libc.so.6)
+
+ Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
+ frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x72f281acc788 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
+ frame #1: <unknown function> + 0x10d2c3e (0x72f2307a6c3e in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #2: <unknown function> + 0xd6d5ed (0x72f2304415ed in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #3: <unknown function> + 0xd8198 (0x72f284b27198 in /opt/conda/bin/../lib/libstdc++.so.6)
+ frame #4: <unknown function> + 0x94ac3 (0x72f2856c6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
+ frame #5: clone + 0x44 (0x72f285757a04 in /lib/x86_64-linux-gnu/libc.so.6)
+
+ [rank1]:[E609 00:24:48.048166525 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
+ [rank1]:[E609 00:24:48.048284856 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
+ [rank1]:[E609 00:24:48.056881613 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
+ Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
+ frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7da22f57d788 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
+ frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x23d (0x7da1de1d39ad in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x9e8 (0x7da1de1d4fa8 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7da1de1d5c4d in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #4: <unknown function> + 0xd8198 (0x7da2325e2198 in /opt/conda/bin/../lib/libstdc++.so.6)
+ frame #5: <unknown function> + 0x94ac3 (0x7da233181ac3 in /lib/x86_64-linux-gnu/libc.so.6)
+ frame #6: clone + 0x44 (0x7da233212a04 in /lib/x86_64-linux-gnu/libc.so.6)
+
+ terminate called after throwing an instance of 'c10::DistBackendError'
+ what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
+ Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
+ frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7da22f57d788 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
+ frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x23d (0x7da1de1d39ad in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x9e8 (0x7da1de1d4fa8 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7da1de1d5c4d in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #4: <unknown function> + 0xd8198 (0x7da2325e2198 in /opt/conda/bin/../lib/libstdc++.so.6)
+ frame #5: <unknown function> + 0x94ac3 (0x7da233181ac3 in /lib/x86_64-linux-gnu/libc.so.6)
+ frame #6: clone + 0x44 (0x7da233212a04 in /lib/x86_64-linux-gnu/libc.so.6)
+
+ Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
+ frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7da22f57d788 in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
+ frame #1: <unknown function> + 0x10d2c3e (0x7da1de1a6c3e in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #2: <unknown function> + 0xd6d5ed (0x7da1dde415ed in /musubi-tuner/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
+ frame #3: <unknown function> + 0xd8198 (0x7da2325e2198 in /opt/conda/bin/../lib/libstdc++.so.6)
+ frame #4: <unknown function> + 0x94ac3 (0x7da233181ac3 in /lib/x86_64-linux-gnu/libc.so.6)
+ frame #5: clone + 0x44 (0x7da233212a04 in /lib/x86_64-linux-gnu/libc.so.6)
+
+ W0609 00:24:49.814000 3604 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3673 closing signal SIGTERM
+ E0609 00:24:50.740000 3604 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 3672) of binary: /musubi-tuner/venv/bin/python
+ Traceback (most recent call last):
+ File "/musubi-tuner/venv/bin/accelerate", line 8, in <module>
+ sys.exit(main())
+ ^^^^^^
+ File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
+ args.func(args)
+ File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1204, in launch_command
+ multi_gpu_launcher(args)
+ File "/musubi-tuner/venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 825, in multi_gpu_launcher
+ distrib_run.run(args)
+ File "/musubi-tuner/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
+ elastic_launch(
+ File "/musubi-tuner/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
+ return launch_agent(self._config, self._entrypoint, list(args))
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/musubi-tuner/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
+ raise ChildFailedError(
+ torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
+ =====================================================
+ fpack_train_network.py FAILED
+ -----------------------------------------------------
+ Failures:
+ <NO_OTHER_FAILURES>
+ -----------------------------------------------------
+ Root Cause (first observed failure):
+ [0]:
+ time : 2025-06-09_00:24:49
+ host : 1a962a38c261
+ rank : 0 (local_rank: 0)
+ exitcode : -6 (pid: 3672)
+ error_file: <N/A>
+ traceback : Signal 6 (SIGABRT) received by PID 3672
+ =====================================================
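
For context on where the failure sits in the training flow: both ranks die inside Accelerate's standard prepare() call (hv_train_network.py, line 1648), which is where each rank wraps the `network` object in DistributedDataParallel and therefore where the first NCCL collective of the run is issued. A minimal sketch of that pattern under `accelerate launch --multi_gpu`, with toy objects standing in for musubi-tuner's actual network, optimizer, dataloader and scheduler:

# Hedged sketch of the Accelerate pattern the tracebacks pass through.
# accelerator.prepare() wraps the model in DistributedDataParallel, whose
# constructor runs the cross-rank parameter-verification ALLGATHER that this
# log shows timing out. All objects below are illustrative stand-ins.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

network = torch.nn.Linear(8, 8)  # stand-in for the trained network
optimizer = torch.optim.AdamW(network.parameters(), lr=1e-4)
train_dataloader = DataLoader(TensorDataset(torch.randn(16, 8)), batch_size=4)
lr_scheduler = torch.optim.lr_scheduler.ConstantLR(optimizer)

# Under a multi-GPU launch, this is the call that hangs in the log above.
network, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    network, optimizer, train_dataloader, lr_scheduler
)
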