Can it be run on a 3090 with 24GB VRAM?

#18
by mnemic - opened

I'm able to run it and caption images at resolutions around 1024x1024, but it runs out of memory on images around 1920x1080.

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [00:04<00:00,  1.00it/s]
Processing Images:   0%|                                                                     | 0/2 [00:00<?, ?it/s]C:\AI\!Training\qwenvl2.7b\venv\Lib\site-packages\transformers\models\qwen2_vl\modeling_qwen2_vl.py:350: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
Error processing C:\AI\!Training\qwenvl2.7b\input\2024-09-02 - 15.58.42_00001_.png: CUDA out of memory. Tried to allocate 3.28 GiB. GPU 0 has a total capacity of 24.00 GiB of which 2.46 GiB is free. Of the allocated memory 19.82 GiB is allocated by PyTorch, and 227.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Processing Images:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ                              | 1/2 [00:01<00:01,  1.18s/it]Error processing C:\AI\!Training\qwenvl2.7b\input\2024-09-02 - 15.59.04_00001_.png: CUDA out of memory. Tried to allocate 3.28 GiB. GPU 0 has a total capacity of 24.00 GiB of which 2.44 GiB is free. Of the allocated memory 19.82 GiB is allocated by PyTorch, and 256.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Processing Images: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:01<00:00,  1.08it/s]
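
For what it's worth, the allocator setting suggested in the error message can be enabled before CUDA initializes; a minimal sketch (note that this mitigates fragmentation rather than adding capacity):

# Must run before the first CUDA allocation, e.g. at the very top of the script;
# equivalently, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the shell.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"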

I could always resize the images, but it seems a shame when it's so close.

Any suggestions?

I'm using this inference code:

# Imports needed by this snippet; qwen_model, qwen_processor, and device
# are assumed to be initialized elsewhere in the script.
import numpy as np
from PIL import Image
from qwen_vl_utils import process_vision_info

# Caption generation function
def qwen_caption(image, prompt):
    if not isinstance(image, Image.Image):
        image = Image.fromarray(np.uint8(image))

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]

    # Build the chat-formatted prompt and extract the image/video inputs.
    text = qwen_processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)

    inputs = qwen_processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Generate, then strip the prompt tokens from each output sequence.
    generated_ids = qwen_model.generate(**inputs, max_new_tokens=256)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = qwen_processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]
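
For context, the function above assumes the model, processor, and device already exist. A minimal loading sketch under that assumption (variable names mirror the snippet; the checkpoint ID is the standard 7B-Instruct one):

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

device = "cuda"

# Loading in bfloat16 roughly halves the weight memory versus float32.
qwen_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
).to(device)

qwen_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")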

I have no issue with this on a 3090 at full weights. If your images are very large, look at the example of initializing the processor with image-size limits. You can also specify the size of individual images in the messages, following the other examples on their GitHub; a sketch of both approaches is below.
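
A sketch of both approaches, based on the min_pixels / max_pixels knobs documented in the Qwen2-VL README (the specific values below are illustrative, not tuned):

from transformers import AutoProcessor

# Cap each image's pixel budget at the processor level. Qwen2-VL consumes
# images in 28x28 patches, so limits are conventionally written in those units.
min_pixels = 256 * 28 * 28
max_pixels = 1024 * 28 * 28  # lower this further if OOM persists
qwen_processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

# Or bound a single image directly in the message; qwen-vl-utils rounds
# resized_height/resized_width to multiples of 28.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
                "resized_height": 756,   # e.g. 1920x1080 scaled down
                "resized_width": 1344,
            },
            {"type": "text", "text": prompt},
        ],
    }
]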

Interesting that it's working for you. Thanks for the advice!
