Questions about generation on a V100
Great model!
I managed to generate videos, but I was unable to match the resolutions and lengths you posted from your RTX 4090 test.
I'm running it on a V100. I expected slower generation times, but I'm almost always running out of memory at resolutions above 600 and more than 24 frames.
What worked on the V100 32GB with low_gpu_memory_mode = True:
- resolution 600, 24 frames
- resolution 384, 48 frames
I've tried to replicate the 4090 results without success. Could you share the exact config you tested with on the 4090?
Could this be related (I don't know how) to the older Volta architecture?
(We're loading the model in bf16, so I can't really imagine how it could be; just asking.)
Test:
V100 32GB - 24 frames, resolution 768, low_gpu_memory_mode = False, offload steps = 0
predict_i2v_80g.py
Pipeline loaded ...
0%| | 0/25 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/kecso/Documents/workspace/Ruyi-Models/predict_i2v_80g.py", line 230, in
sample = pipeline(
^^^^^^^^^
...
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.08 GiB. GPU 0 has a total capacity of 31.74 GiB of which 15.20 GiB is free. Including non-PyTorch memory, this process has 16.53 GiB memory in use. Of the allocated memory 15.93 GiB is allocated by PyTorch, and 244.25 MiB is reserved by PyTorch but unallocated.
Am I missing something here?
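As a side note, the free/total figures in the OOM message can be cross-checked with plain PyTorch before launching the pipeline; a minimal sketch (nothing here is specific to Ruyi-Models):

```python
import torch

# Free and total device memory as reported by the CUDA driver, in bytes.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free:  {free_bytes / 1024**3:.2f} GiB")
print(f"total: {total_bytes / 1024**3:.2f} GiB")

# The failing step tried to allocate 22.08 GiB in a single request, which
# already exceeds the 15.20 GiB reported free, so this looks like a genuine
# capacity limit rather than fragmentation.
```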
I think this issue is similar to the one mentioned in this GitHub issue.
The current model requires the bfloat16 data type, which is only supported by NVIDIA's Ampere architecture (RTX 30 series) and later. Therefore, GPUs based on architectures prior to Ampere may encounter undesirable issues.
I think the main cause is related to the backend used by scaled_dot_product_attention. Could you help us identify which backend is being used? The following code snippet can be used to test the backends:
```python
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

backends = []
backends.append(SDPBackend.CUDNN_ATTENTION)
# backends.append(SDPBackend.EFFICIENT_ATTENTION)
# backends.append(SDPBackend.MATH)
# backends.append(SDPBackend.FLASH_ATTENTION)

...

with sdpa_kernel(backends):
    out = F.scaled_dot_product_attention(
        query, key, value,
        attn_mask=None,
        dropout_p=0.0,
        is_causal=False
    )
```
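A quick complementary check is to ask PyTorch which SDPA backends are globally enabled and what compute capability the device reports (the cudnn_sdp_enabled query only exists in newer PyTorch builds, hence the getattr guard below):

```python
import torch

# Compute capability: Volta (V100) reports (7, 0); Ampere starts at (8, 0).
print("device capability:", torch.cuda.get_device_capability(0))

# Global on/off switches for the fused SDPA backends.
print("flash SDP enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP enabled:         ", torch.backends.cuda.math_sdp_enabled())

# The cuDNN SDPA switch is only present in recent PyTorch releases.
cudnn_flag = getattr(torch.backends.cuda, "cudnn_sdp_enabled", None)
print("cuDNN SDP enabled:        ", cudnn_flag() if cudnn_flag else "n/a")
```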
Thanks for getting back to me.
What confuses me is that it's not failing outright on bf16; according to this post, bf16 on the V100 is internally run in f32:
https://discuss.pytorch.org/t/bfloat16-on-nvidia-v100-gpu/201629
Sure, I'll run the test code and get back to you soon.
Thankfully I also have another machine with RTX 3090s, so I'll try on that as well.
Thanks again for the model!
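As a rough way to see whether bf16 is genuinely accelerated or effectively emulated on a given GPU, one can time a large matmul in both dtypes; on Ampere and newer, bf16 should be clearly faster, while on Volta, per the post above, it reportedly falls back to f32 internally. A sketch (sizes are arbitrary):

```python
import time
import torch

def time_matmul(dtype, n=4096, iters=20):
    # Time an n x n matmul in the given dtype, returning ms per call.
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b  # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1000

print(f"float32:  {time_matmul(torch.float32):.2f} ms")
print(f"bfloat16: {time_matmul(torch.bfloat16):.2f} ms")
```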
I've run this:
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Define input tensors
query = torch.randn(2, 4, 8, device='cuda')  # (batch_size, seq_len, embed_dim)
key = torch.randn(2, 4, 8, device='cuda')    # (batch_size, seq_len, embed_dim)
value = torch.randn(2, 4, 8, device='cuda')  # (batch_size, seq_len, embed_dim)

# Select backends for testing
backends = [SDPBackend.CUDNN_ATTENTION]  # You can add other backends here

# Use the chosen backend with the scaled_dot_product_attention function
with sdpa_kernel(backends):
    out = F.scaled_dot_product_attention(
        query, key, value,
        attn_mask=None,   # Optional mask
        dropout_p=0.0,    # No dropout for testing
        is_causal=False   # Non-causal attention
    )

print("Scaled Dot-Product Attention Output:")
print(out)
```
STDOUT:
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:773.)
out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:558.)
out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:775.)
out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: Flash attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:546.)
out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:777.)
out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: All fused kernels requires query, key and value to be 4 dimensional, but got Query dim: 3, Key dim: 3, Value dim: 3 instead. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:317.)
out = F.scaled_dot_product_attention(
Traceback (most recent call last):
File "/home/kecso/Documents/workspace/Ruyi-Models/test1.py", line 15, in <module>
out = F.scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No available kernel. Aborting execution.
Hope this helps
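One thing the warnings make explicit is that all fused kernels require 4-dimensional inputs (batch, heads, seq_len, head_dim), so the 3-D toy tensors above were rejected before any backend ran. Re-running with 4-D half-precision tensors, as in the sketch below (shapes are illustrative), would show more directly which backends this V100 can actually use:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Fused SDPA kernels expect (batch, num_heads, seq_len, head_dim).
query = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
key = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
value = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

for backend in (SDPBackend.CUDNN_ATTENTION,
                SDPBackend.EFFICIENT_ATTENTION,
                SDPBackend.FLASH_ATTENTION,
                SDPBackend.MATH):
    try:
        with sdpa_kernel([backend]):
            F.scaled_dot_product_attention(query, key, value)
        print(f"{backend}: OK")
    except RuntimeError as err:
        print(f"{backend}: unavailable ({err})")
```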
- Thank you for your testing. Based on the outputs, I believe that scaled_dot_product_attention could be the main cause. Since the other backends are not available, PyTorch might fall back to SDPBackend.MATH, which requires a significant amount of GPU memory. I have listed my results below:
Profiling with shape (32, 8, 4096, 128)...
Results:

| Implementation | Avg Time (ms) | Std Dev (ms) | Peak Memory (GB) |
|----------------|---------------|--------------|------------------|
| sdpa_cudnn     | 12.503        | 0.340        | 1.258            |
| sdpa_memory    | 19.870        | 0.269        | 1.008            |
| sdpa_math      | 154.977       | 181.029      | 54.759           |
| sdpa_flash     | 11.492        | 0.241        | 1.258            |
As shown in the table above, sdpa_math indeed uses a substantial amount of GPU memory.
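The exact profiling harness isn't shown; a minimal sketch of how numbers like these could be reproduced (shape as in the table; the dtype, timings, and peak memory will vary by GPU and PyTorch version):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

def profile_backend(backend, shape=(32, 8, 4096, 128), dtype=torch.float16, iters=10):
    # Returns (avg time in ms, peak allocated memory in GB) for one backend.
    q, k, v = (torch.randn(*shape, device="cuda", dtype=dtype) for _ in range(3))
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with sdpa_kernel([backend]):
        F.scaled_dot_product_attention(q, k, v)  # warm-up
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v)
        end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters, torch.cuda.max_memory_allocated() / 1024**3

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION,
                SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH):
    try:
        avg_ms, peak_gb = profile_backend(backend)
        print(f"{backend}: {avg_ms:.3f} ms, peak {peak_gb:.3f} GB")
    except RuntimeError as err:
        print(f"{backend}: unavailable ({err})")
```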
- Regarding bfloat16 and float32, I think that having the same number of exponent bits (8 bits) may be why PyTorch falls back to float32 internally. The computation should execute without any loss of precision, but it will require more GPU memory.
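A quick way to confirm this: bfloat16 and float32 share the same exponent width, so they cover the same dynamic range, and every bfloat16 value is exactly representable in float32; the upcast costs only memory (twice the bytes per element). A small check:

```python
import torch

# Same dynamic range: bfloat16 and float32 both use 8 exponent bits.
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38
print(torch.finfo(torch.float32).max)   # ~3.40e+38

# ... but float32 needs twice the storage per element.
print(torch.finfo(torch.bfloat16).bits, torch.finfo(torch.float32).bits)  # 16 32

# Upcasting bfloat16 -> float32 and back is exact (no rounding error).
x = torch.randn(1024, dtype=torch.bfloat16)
assert torch.equal(x.float().to(torch.bfloat16), x)
```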