Multi-GPU error when running the VideoX-Fun example script examples/wan2.1_fun/predict_v2v_control.py with this model

#6
by ericliu127 - opened

The modified script, as a diff against the original example:
index 33b1c8c..fa9c1d6 100755
--- a/examples/wan2.1_fun/predict_v2v_control.py
+++ b/examples/wan2.1_fun/predict_v2v_control.py
@@ -41,7 +41,7 @@ GPU_memory_mode = "sequential_cpu_offload"
# Please ensure that the product of ulysses_degree and ring_degree equals the number of GPUs used.
# For example, if you are using 8 GPUs, you can set ulysses_degree = 2 and ring_degree = 4.
# If you are using 1 GPU, you can set ulysses_degree = 1 and ring_degree = 1.
-ulysses_degree = 1
+ulysses_degree = 2
ring_degree = 1

# Support TeaCache.

@@ -74,19 +74,19 @@ vae_path = None
lora_path = None

# Other params

-sample_size = [832, 480]
+sample_size = [480, 832]
video_length = 49
-fps = 16
+fps = 30

# Use torch.float16 if GPU does not support torch.bfloat16
# Some graphics cards, such as V100 and 2080 Ti, do not support torch.bfloat16

weight_dtype = torch.bfloat16
control_video = "asset/pose.mp4"
-ref_image = None
+ref_image = "asset/ref_picture.png"

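# Using a longer neg prompt such as "blurry, mutation, deformation, distortion, dark picture, text subtitles, static picture, comic strip, comics, line art, no subject." can increase stability.
# Adding words such as "quiet, static" to the neg prompt can increase dynamism.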

-prompt = "在这个阳光明媚的户外花园里,美女身穿一袭及膝的白色无袖连衣裙,裙摆在她轻盈的舞姿中轻柔地摆动,宛如一只翩翩起舞的蝴蝶。阳光透过树叶间洒下斑驳的光影,映衬出她柔和的脸庞和清澈的眼眸,显得格外优雅。仿佛每一个动作都在诉说着青春与活力,她在草地上旋转,裙摆随之飞扬,仿佛整个花园都因她的舞动而欢愉。周围五彩缤纷的花朵在微风中摇曳,玫瑰、菊花、百合,各自释放出阵阵香气,营造出一种轻松而愉快的氛围。"
+prompt = "A beautiful and sexy woman."
negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

# Using a longer neg prompt such as "Blurring, mutation, deformation, distortion, dark and solid, comics, text subtitles, line art." can increase stability.
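
The rule in the script comment (ulysses_degree times ring_degree must equal the number of GPUs) can be checked up front with a small sketch like the one below. `check_parallel_degrees` is a hypothetical helper written for illustration and is not part of VideoX-Fun; the values in the last line are the ones reported in the log (ulysses_degree=2, ring_degree=1, world_size=2), so this particular rule does not appear to be violated here.

```python
def check_parallel_degrees(ulysses_degree: int, ring_degree: int, world_size: int) -> None:
    """Raise ValueError unless ulysses_degree * ring_degree == world_size.

    Hypothetical helper for illustration only; not a VideoX-Fun API.
    """
    if ulysses_degree < 1 or ring_degree < 1:
        raise ValueError("ulysses_degree and ring_degree must both be >= 1")
    product = ulysses_degree * ring_degree
    if product != world_size:
        raise ValueError(
            f"ulysses_degree * ring_degree = {product}, "
            f"but {world_size} GPU process(es) were launched"
        )


# Values taken from the log in this post: the check passes without raising.
check_parallel_degrees(ulysses_degree=2, ring_degree=1, world_size=2)
```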

#=============================================================================================================================
Error Log:
W0425 16:19:44.455000 139412325037184 torch/distributed/run.py:779]
W0425 16:19:44.455000 139412325037184 torch/distributed/run.py:779] *****************************************
W0425 16:19:44.455000 139412325037184 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0425 16:19:44.455000 139412325037184 torch/distributed/run.py:779] *****************************************
/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
/mnt/sat/ai/VideoX-Fun/videox_fun/dist/wan_xfuser.py:32: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@amp.autocast(enabled=False)
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py:212: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@amp.autocast(enabled=False)
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py:223: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@amp.autocast(enabled=False)
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py:282: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@amp.autocast(enabled=False)
/mnt/sat/ai/VideoX-Fun/videox_fun/dist/wan_xfuser.py:32: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@amp.autocast(enabled=False)
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py:212: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@amp.autocast(enabled=False)
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py:223: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@amp.autocast(enabled=False)
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py:282: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@amp.autocast(enabled=False)
[W425 16:19:48.196924434 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
parallel inference enabled: ulysses_degree=2 ring_degree=1 rank=1 world_size=2
DEBUG 04-25 16:19:48 [parallel_state.py:207] world_size=2 rank=1 local_rank=-1 distributed_init_method=env:// backend=nccl
[W425 16:19:48.200560414 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
parallel inference enabled: ulysses_degree=2 ring_degree=1 rank=0 world_size=2
DEBUG 04-25 16:19:48 [parallel_state.py:207] world_size=2 rank=0 local_rank=-1 distributed_init_method=env:// backend=nccl
rank=1 device=cuda:1
rank=0 device=cuda:0
loaded 3D transformer's pretrained weights from models/Diffusion_Transformer/Wan2.1-Fun-1.3B-Control/./ ...
loaded 3D transformer's pretrained weights from models/Diffusion_Transformer/Wan2.1-Fun-1.3B-Control/./ ...
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_vae.py:697: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(pretrained_model_path, map_location="cpu")
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_vae.py:697: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(pretrained_model_path, map_location="cpu")

missing keys: 0;

unexpected keys: 0;

[] []

missing keys: 0;

unexpected keys: 0;

[] []
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_text_encoder.py:334: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(pretrained_model_path, map_location="cpu")
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_text_encoder.py:334: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(pretrained_model_path, map_location="cpu")
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_image_encoder.py:544: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(pretrained_model_path, map_location="cpu")
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_image_encoder.py:544: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(pretrained_model_path, map_location="cpu")

missing keys: 0;

unexpected keys: 0;

[] []

missing keys: 0;

unexpected keys: 0;

[] []
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:181: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
self.video_processor = VideoProcessor(vae_scale_factor=self.vae.spacial_compression_ratio)
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:182: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae.spacial_compression_ratio)
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:184: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
vae_scale_factor=self.vae.spacial_compression_ratio, do_normalize=False, do_binarize=True, do_convert_grayscale=True
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:181: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
self.video_processor = VideoProcessor(vae_scale_factor=self.vae.spacial_compression_ratio)
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:182: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae.spacial_compression_ratio)
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:184: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
vae_scale_factor=self.vae.spacial_compression_ratio, do_normalize=False, do_binarize=True, do_convert_grayscale=True
Enable TeaCache with threshold 0.1 and skip the first 5 steps.
Enable TeaCache with threshold 0.1 and skip the first 5 steps.
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:324: FutureWarning: Accessing config attribute temporal_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'temporal_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.temporal_compression_ratio'.
(num_frames - 1) // self.vae.temporal_compression_ratio + 1,
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:325: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
height // self.vae.spacial_compression_ratio,
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:326: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
width // self.vae.spacial_compression_ratio,
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:324: FutureWarning: Accessing config attribute temporal_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'temporal_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.temporal_compression_ratio'.
(num_frames - 1) // self.vae.temporal_compression_ratio + 1,
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:325: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
height // self.vae.spacial_compression_ratio,
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:326: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
width // self.vae.spacial_compression_ratio,
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_image_encoder.py:526: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with torch.cuda.amp.autocast(dtype=self.dtype):
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_image_encoder.py:526: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with torch.cuda.amp.autocast(dtype=self.dtype):
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:659: FutureWarning: Accessing config attribute latent_channels directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'latent_channels' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.latent_channels'.
target_shape = (self.vae.latent_channels, (num_frames - 1) // self.vae.temporal_compression_ratio + 1, width // self.vae.spacial_compression_ratio, height // self.vae.spacial_compression_ratio)
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:659: FutureWarning: Accessing config attribute temporal_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'temporal_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.temporal_compression_ratio'.
target_shape = (self.vae.latent_channels, (num_frames - 1) // self.vae.temporal_compression_ratio + 1, width // self.vae.spacial_compression_ratio, height // self.vae.spacial_compression_ratio)
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:659: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
target_shape = (self.vae.latent_channels, (num_frames - 1) // self.vae.temporal_compression_ratio + 1, width // self.vae.spacial_compression_ratio, height // self.vae.spacial_compression_ratio)
0%| | 0/50 [00:00<?, ?it/s]/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:676: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with torch.cuda.amp.autocast(dtype=weight_dtype):
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py:868: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with amp.autocast(dtype=torch.float32):
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:659: FutureWarning: Accessing config attribute latent_channels directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'latent_channels' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.latent_channels'.
target_shape = (self.vae.latent_channels, (num_frames - 1) // self.vae.temporal_compression_ratio + 1, width // self.vae.spacial_compression_ratio, height // self.vae.spacial_compression_ratio)
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:659: FutureWarning: Accessing config attribute temporal_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'temporal_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.temporal_compression_ratio'.
target_shape = (self.vae.latent_channels, (num_frames - 1) // self.vae.temporal_compression_ratio + 1, width // self.vae.spacial_compression_ratio, height // self.vae.spacial_compression_ratio)
/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:659: FutureWarning: Accessing config attribute spacial_compression_ratio directly via 'AutoencoderKLWan' object attribute is deprecated. Please access 'spacial_compression_ratio' over 'AutoencoderKLWan's config object instead, e.g. 'unet.config.spacial_compression_ratio'.
target_shape = (self.vae.latent_channels, (num_frames - 1) // self.vae.temporal_compression_ratio + 1, width // self.vae.spacial_compression_ratio, height // self.vae.spacial_compression_ratio)
0%| | 0/50 [00:00<?, ?it/s]/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py:676: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with torch.cuda.amp.autocast(dtype=weight_dtype):
/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py:868: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with amp.autocast(dtype=torch.float32):
0%| | 0/50 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/sat/ai/VideoX-Fun/examples/wan2.1_fun/predict_v2v_control.py", line 216, in <module>
[rank1]: sample = pipeline(
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/VideoX-Fun/videox_fun/pipeline/pipeline_wan_fun_control.py", line 677, in __call__
[rank1]: noise_pred = self.transformer(
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
[rank1]: output = module._old_forward(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py", line 959, in forward
[rank1]: x = block(x, **kwargs)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py", line 578, in forward
[rank1]: x = cross_attn_ffn(x, context, context_lens, e)
[rank1]: File "/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py", line 568, in cross_attn_ffn
[rank1]: x = x + self.cross_attn(self.norm3(x), context, context_lens, dtype)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/VideoX-Fun/videox_fun/models/wan_transformer3d.py", line 461, in forward
[rank1]: q = self.norm_q(self.q(x.to(dtype))).view(b, -1, n, d)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
[rank1]: args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/accelerate/hooks.py", line 355, in pre_forward
[rank1]: set_module_tensor_to_device(
[rank1]: File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 329, in set_module_tensor_to_device
[rank1]: new_value = value.to(device)
[rank1]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank1]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank1]:[E425 16:20:24.762983353 ProcessGroupNCCL.cpp:1515] [PG 17 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x756afe2cbf86 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x756afe27ad10 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x756afe3a6f08 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x756aafd8aa76 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x756aafd8fc90 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x756aafd9694a in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x756aafd98d8c in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x756afeedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x756b02894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x756b02926850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 17 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x756afe2cbf86 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x756afe27ad10 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x756afe3a6f08 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x756aafd8aa76 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x756aafd8fc90 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x756aafd9694a in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x756aafd98d8c in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x756afeedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x756b02894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x756b02926850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x756afe2cbf86 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe1a5e4 (0x756aafa1a5e4 in /mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x756afeedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x756b02894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x756b02926850 in /lib/x86_64-linux-gnu/libc.so.6)

W0425 16:20:26.085000 139412325037184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 32248 closing signal SIGTERM
E0425 16:20:26.752000 139412325037184 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 32249) of binary: /mnt/sat/ai/ComfyUI/comfyui-venv/bin/python
Traceback (most recent call last):
File "/mnt/sat/ai/ComfyUI/comfyui-venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/sat/ai/ComfyUI/comfyui-venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/wan2.1_fun/predict_v2v_control.py FAILED

Failures:

Root Cause (first observed failure):
[0]:
time : 2025-04-25_16:20:26
host : xx299
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 32249)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 32249

It looks like the main error is: [rank1]: RuntimeError: CUDA error: an illegal memory access was encountered. What could be causing this?
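
The log itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so that the failing CUDA call is reported synchronously. A minimal sketch of preparing such a rerun from Python follows. `build_debug_launch` is a hypothetical helper, and the torchrun command line is an assumption about how the script was launched (2 processes, inferred from world_size=2 in the log), not the exact command used.

```python
import os
from typing import Dict, List, Tuple


def build_debug_launch(script: str, nproc_per_node: int) -> Tuple[List[str], Dict[str, str]]:
    """Return a (command, environment) pair for a synchronous-CUDA debug run.

    Hypothetical helper for illustration; the torchrun arguments are an
    assumption about the launch, not taken from the post.
    """
    env = dict(os.environ)
    # Make CUDA kernel launches synchronous so the Python traceback points
    # at the call that actually faulted (as the error message recommends).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = ["torchrun", f"--nproc_per_node={nproc_per_node}", script]
    return cmd, env


cmd, env = build_debug_launch("examples/wan2.1_fun/predict_v2v_control.py", 2)
print(" ".join(cmd))
# To actually launch it: subprocess.run(cmd, env=env, check=False)
```

The Python traceback above ends in accelerate's pre_forward hook moving a weight to the device, which is consistent with GPU_memory_mode = "sequential_cpu_offload" in the script. Because CUDA errors are reported asynchronously, the kernel that actually faulted may be elsewhere, which is why a blocking rerun is worth trying first.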
