Generate on V100 questions

#10
by csabakecskemeti - opened

Great model!

I managed to generate videos, but I'm unable to match the resolutions and lengths you posted from your RTX 4090 test.

I'm running it on a V100. I expected slower generation times, but I'm pretty much always running out of memory at resolutions above 600 and with more than 24 frames.
What worked on a V100 32GB with low_gpu_memory_mode = True:

  • resolution 600, 24 frames
  • resolution 384, 48 frames

I've tried to replicate the RTX 4090 results, but without success. Could you share the exact config used for the 4090 test?
Could this be related (I don't know how...) to the older Volta architecture?
(We load the model in bf16, so I can't really imagine how it could be; just asking.)

Test:
V100 32GB - 24 frames, res 768, low_gpu_memory_mode = False, offload steps = 0
predict_i2v_80g.py

Pipeline loaded ...
0%| | 0/25 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/kecso/Documents/workspace/Ruyi-Models/predict_i2v_80g.py", line 230, in
sample = pipeline(
^^^^^^^^^
...
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.08 GiB. GPU 0 has a total capacity of 31.74 GiB of which 15.20 GiB is free. Including non-PyTorch memory, this process has 16.53 GiB memory in use. Of the allocated memory 15.93 GiB is allocated by PyTorch, and 244.25 MiB is reserved by PyTorch but unallocated.

Am I missing something here?

IamCreateAI org

I think this issue is similar to the one mentioned in this GitHub issue.

The current model requires the bfloat16 data type, which is only supported by NVIDIA's Ampere architecture (RTX 30 series) and later. Therefore, GPUs based on architectures prior to Ampere may encounter undesirable issues.
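
As a quick check (a minimal sketch, not part of the repository), you can print the device's compute capability and PyTorch's own bf16 support flag on the machine in question; Volta (V100) reports compute capability 7.0, while Ampere and later report 8.0 or higher:

import torch

# Volta (V100) is compute capability 7.0; Ampere (RTX 30 series) and later are 8.0+.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

# PyTorch's report of whether bfloat16 is usable on this device.
print("bf16 supported:", torch.cuda.is_bf16_supported())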

I think the main cause is related to the backend used in scaled_dot_product_attention. Could you help us identify which backend is being utilized? The following code snippet can be used to test the backend:

import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

backends = []
backends.append(SDPBackend.CUDNN_ATTENTION)
# backends.append(SDPBackend.EFFICIENT_ATTENTION)
# backends.append(SDPBackend.MATH)
# backends.append(SDPBackend.FLASH_ATTENTION)

...

with sdpa_kernel(backends):
    out = F.scaled_dot_product_attention(
        query, key, value,
        attn_mask=None,
        dropout_p=0.0,
        is_causal=False
    )

Thanks for getting back to me.
What confuses me is that it doesn't fail on bf16; according to this post, bf16 is internally run as fp32 on the V100:
https://discuss.pytorch.org/t/bfloat16-on-nvidia-v100-gpu/201629

Sure, I'll run the test code and get back to you soon.

Thankfully I have another machine with RTX 3090s; I'll try it on that.

Thanks again for the model!

I've run this:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Define input tensors
query = torch.randn(2, 4, 8, device='cuda')  # (batch_size, seq_len, embed_dim)
key = torch.randn(2, 4, 8, device='cuda')    # (batch_size, seq_len, embed_dim)
value = torch.randn(2, 4, 8, device='cuda')  # (batch_size, seq_len, embed_dim)

# Select backends for testing
backends = [SDPBackend.CUDNN_ATTENTION]  # You can add other backends here

# Use the chosen backend with the scaled_dot_product_attention function
with sdpa_kernel(backends):
    out = F.scaled_dot_product_attention(
        query, key, value,
        attn_mask=None,    # Optional mask
        dropout_p=0.0,     # No dropout for testing
        is_causal=False    # Non-causal attention
    )

print("Scaled Dot-Product Attention Output:")
print(out)

STDOUT:

/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:773.)
  out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:558.)
  out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:775.)
  out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: Flash attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:546.)
  out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:777.)
  out = F.scaled_dot_product_attention(
/home/kecso/Documents/workspace/Ruyi-Models/test1.py:15: UserWarning: All fused kernels requires query, key and value to be 4 dimensional, but got Query dim: 3, Key dim: 3, Value dim: 3 instead. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:317.)
  out = F.scaled_dot_product_attention(
Traceback (most recent call last):
  File "/home/kecso/Documents/workspace/Ruyi-Models/test1.py", line 15, in <module>
    out = F.scaled_dot_product_attention(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No available kernel. Aborting execution.

Hope this helps
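
One more observation from the warnings above: the last one says all fused kernels require 4-dimensional query, key and value tensors, while the test tensors here are 3-dimensional. A minimal sketch (hypothetical shapes, not from the repository) that retries each backend with 4D fp16 tensors of shape (batch, num_heads, seq_len, head_dim) and reports which ones actually run:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# 4D inputs as the warning requests: (batch, num_heads, seq_len, head_dim)
query = torch.randn(2, 8, 64, 64, device="cuda", dtype=torch.float16)
key = torch.randn(2, 8, 64, 64, device="cuda", dtype=torch.float16)
value = torch.randn(2, 8, 64, 64, device="cuda", dtype=torch.float16)

# Try each backend on its own and report whether it is available on this GPU.
for backend in (SDPBackend.CUDNN_ATTENTION,
                SDPBackend.EFFICIENT_ATTENTION,
                SDPBackend.FLASH_ATTENTION,
                SDPBackend.MATH):
    try:
        with sdpa_kernel([backend]):
            F.scaled_dot_product_attention(query, key, value,
                                           attn_mask=None, dropout_p=0.0,
                                           is_causal=False)
        print(backend, "OK")
    except RuntimeError as err:
        print(backend, "not available:", err)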

IamCreateAI org
  1. Thank you for your testing. Based on the outputs, I believe that scaled_dot_product_attention could be the main cause. Since the other backends are not available, PyTorch might consider using SDPBackend.MATH, which does require a significant amount of GPU memory. I have listed my results below:
Profiling with shape (32, 8, 4096, 128)...

Results:
----------------------------------------------------------------------
Implementation    Avg Time (ms)    Std Dev (ms)     Peak Memory (GB)
----------------------------------------------------------------------
sdpa_cudnn               12.503           0.340                1.258
sdpa_memory              19.870           0.269                1.008
sdpa_math               154.977         181.029               54.759
sdpa_flash               11.492           0.241                1.258

As shown in the table above, sdpa_math indeed uses a substantial amount of GPU memory.

  2. Regarding bfloat16 and float32: I think that having the same number of exponent bits (8) may be why PyTorch internally falls back to float32. This should execute without any loss of precision, but it requires more GPU memory.
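
A small illustration of that last point (a sketch, not from the repository): every bfloat16 value is exactly representable in float32, since the two formats share the same 8 exponent bits, so the upcast loses nothing; the cost is that each element takes twice the memory:

import torch

x = torch.randn(1024).bfloat16()

# Upcasting to float32 and truncating back reproduces the original values exactly.
print(torch.equal(x, x.float().bfloat16()))  # True

# The cost is memory: 2 bytes per element in bf16 vs 4 bytes in fp32.
print(x.element_size(), x.float().element_size())  # prints: 2 4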
