Several Issues Regarding Environment Setup

#5
by ZAHNGYUXUAN - opened

The main branches of transformers and vLLM may be unstable. If you run into issues when running code from the main branch, please install the pinned commits below:

setuptools>=80.9.0
setuptools_scm>=8.3.1
git+https://github.com/huggingface/transformers.git@91221da2f1f68df9eb97c980a7206b14c4d3a9b0
git+https://github.com/vllm-project/vllm.git@220aee902a291209f2975d4cd02dadcc6749ffe6
torchvision>=0.22.0
gradio>=5.35.0
pre-commit>=4.2.0
PyMuPDF>=1.26.1
av>=14.4.0
accelerate>=1.6.0
spaces>=0.37.1

Please note: these are source code installations. Do not use pip releases or mirror sites to install them.

vLLM involves compiling C++ source code, which may take some time. A precompiled installation option is available for faster setup.
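One quick way to confirm you ended up with the source builds rather than PyPI releases is to check the reported version strings after installation. This is only a rough check; the exact version numbers in the comments are illustrative, not guaranteed:

import transformers
import vllm

# Source installs from a git commit usually carry a ".dev" suffix
# (transformers main reports something like "4.54.0.dev0"; vLLM built from
# source derives its version via setuptools_scm), whereas PyPI releases
# report a plain version such as "4.53.0".
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)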

We are actively communicating with maintainers of the related repositories.
Sorry for the inconvenience, and thank you for your patience.

ZAHNGYUXUAN pinned discussion


Do you think something like this would work in the meantime? Really glad someone finally incorporated the excellent AIMv2 encoder (besides Ovis2):

SAMPLE SNIPPET
import torch
import transformers
import transformers.image_transforms as image_transforms
from typing import List

# Store original function
_original_group_images_by_shape = image_transforms.group_images_by_shape

def patched_group_images_by_shape(images: List[torch.Tensor], disable_grouping: bool = False):
    """Patched version with disable_grouping parameter."""
    if disable_grouping:
        # Process individually - return single group
        if not images:
            return {}, {}
        
        # Create single group with all images
        first_shape = images[0].shape[1:]
        return {first_shape: torch.stack(images, dim=0)}, {i: (first_shape, i) for i in range(len(images))}
    
    # Use original function for grouped processing
    return _original_group_images_by_shape(images)

# Apply monkey patch
image_transforms.group_images_by_shape = patched_group_images_by_shape

# Also patch BaseImageProcessorFast if needed
if hasattr(transformers, 'BaseImageProcessorFast'):
    original_preprocess = transformers.BaseImageProcessorFast._preprocess
    
    def patched_preprocess(self, images, **kwargs):
        # Add disable_grouping parameter if missing
        if 'disable_grouping' not in kwargs:
            kwargs['disable_grouping'] = not (hasattr(images[0], 'device') and images[0].device.type == 'cuda')
        return original_preprocess(self, images, **kwargs)
    
    transformers.BaseImageProcessorFast._preprocess = patched_preprocess
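
As a quick sanity check (toy tensors only; the 224x224 shape is just illustrative), the disable_grouping path of the patch above can be exercised like this after running the snippet:

import torch

# Two same-shape dummy "images"; with disable_grouping=True the patched
# function stacks them into a single group keyed by images[0].shape[1:].
dummy_images = [torch.rand(3, 224, 224), torch.rand(3, 224, 224)]
grouped, index_map = patched_group_images_by_shape(dummy_images, disable_grouping=True)

print(list(grouped.keys()))                # [torch.Size([224, 224])]
print(next(iter(grouped.values())).shape)  # torch.Size([2, 3, 224, 224])
print(index_map)                           # {0: (torch.Size([224, 224]), 0), 1: (torch.Size([224, 224]), 1)}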
Z.ai & THUKEG org

You can now try vLLM >= 0.9.2 and transformers == 4.53.1.
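
For anyone who wants to try the vLLM path, here is a minimal offline-inference sketch. It is not an official example: the bracketed placeholders are mine, and building the prompt with the HF processor's chat template (so the image placeholder tokens match the model) is an assumption; adjust it to your setup.

from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

MODEL_PATH = "[PATH TO THE MODEL FOLDER OR HUB ID]"  # placeholder, not a real path
IMAGE_PATH = "[PATH TO AN IMAGE]"                    # placeholder, not a real path

# Build the chat prompt with the HF processor so the image placeholder tokens
# follow the model's chat template, then hand the raw image to vLLM.
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=MODEL_PATH)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open(IMAGE_PATH)}},
    SamplingParams(temperature=0.0, max_tokens=1024),
)
print(outputs[0].outputs[0].text)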

"You can now try vLLM >= 0.9.2 and transformers == 4.53.1."

I tried transformers (not vLLM) 4.53.1 and couldn't get it to work, but if I use the specific commit you mentioned above, it does work. Any official word from Hugging Face on a fix? Maybe I'm not doing it right?


"vLLM involves compiling C++ source code, which may take some time. A precompiled installation option is available for faster setup."
Where is the precompiled installation option?

I can confirm that it's working now. To recap: Transformers 4.53.1 was NOT working, but 4.53.2 IS NOW WORKING.

Here is a sample script for those interested in this excellent model. Note that it requires pip-installing the CUDA-related libraries and using my custom set_cuda_paths function; if you don't like that approach, it at least shows the basic logic of how to run the model. It assumes you have an Nvidia GPU, and I've only tested it on Windows.

WORKING SCRIPT HERE
# uses apple/aimv2-huge-patch14-336 whereas Ovis2 uses the large 448

USE_QUANTIZATION = True  # Set to True to use BitsAndBytesConfig, False for no quantization

import sys, os, time, threading, queue
from pathlib import Path
import pynvml, torch
from PIL import Image
from transformers import Glm4vForConditionalGeneration, AutoProcessor

if USE_QUANTIZATION:
    from transformers import BitsAndBytesConfig

def set_cuda_paths():
    venv_base = Path(sys.executable).parent.parent
    nvidia_base_path = venv_base / 'Lib' / 'site-packages' / 'nvidia'
    paths_to_add = [
        str(nvidia_base_path / 'cuda_runtime' / 'bin'),
        str(nvidia_base_path / 'cuda_runtime' / 'bin' / 'lib' / 'x64'),
        str(nvidia_base_path / 'cuda_runtime' / 'include'),
        str(nvidia_base_path / 'cublas' / 'bin'),
        str(nvidia_base_path / 'cudnn' / 'bin'),
        str(nvidia_base_path / 'cuda_nvrtc' / 'bin'),
        str(nvidia_base_path / 'cuda_nvcc' / 'bin'),
    ]
    current_value = os.environ.get('PATH', '')
    os.environ['PATH'] = os.pathsep.join(paths_to_add + [current_value] if current_value else paths_to_add)
    os.environ['CUDA_PATH'] = str(nvidia_base_path / 'cuda_runtime')

def monitor_vram(vram_queue, handle, stop_flag, interval=0.1):
    max_usage = 0
    while not stop_flag.is_set():
        usage = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024 ** 2
        max_usage = max(max_usage, usage)
        vram_queue.put(max_usage)
        time.sleep(interval)

set_cuda_paths()

quantization_config = None
if USE_QUANTIZATION:
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    )

pynvml.nvmlInit()
gpu_handle = pynvml.nvmlDeviceGetHandleByIndex(0)
baseline_usage = pynvml.nvmlDeviceGetMemoryInfo(gpu_handle).used / 1024 ** 2

model_path = r"[PATH TO THE FOLDER CONTAINING THE MODEL AND RELATED FILES]"
image_path = r"[PATH TO A SPECIFIC IMAGE TO PROCESS]"

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)

model_kwargs = {
    "torch_dtype": dtype,
    "device_map": "auto",
    "attn_implementation": "sdpa",
}

if USE_QUANTIZATION:
    model_kwargs["quantization_config"] = quantization_config

model = Glm4vForConditionalGeneration.from_pretrained(
    model_path,
    **model_kwargs
).eval()

image = Image.open(image_path)
user_prompt = "Describe in English as much detail as possible what this image depicts?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image
            },
            {
                "type": "text", 
                "text": user_prompt
            },
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(device)

vram_queue = queue.Queue()
stop_flag = threading.Event()
vram_thread = threading.Thread(target=monitor_vram, args=(vram_queue, gpu_handle, stop_flag))
vram_thread.start()

start = time.time()
with torch.inference_mode():
    out_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

if USE_QUANTIZATION:
    torch.cuda.synchronize()

end = time.time()

stop_flag.set()
vram_thread.join()

peak_vram = 0
while not vram_queue.empty():
    peak_vram = max(peak_vram, vram_queue.get())

model_vram = peak_vram - baseline_usage
elapsed = end - start

generated_ids_trimmed = [
    out_ids[0][len(inputs.input_ids[0]):]
]
response = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0].strip()

print(f"\n{response}\n")
print(f"Max VRAM Usage: {model_vram:.2f} MB")
print(f"Characters per second: {len(response) / elapsed:.2f}")
print(f"Compute Time: {elapsed:.2f} seconds")
print(f"Quantization: {'Enabled (4-bit)' if USE_QUANTIZATION else 'Disabled'}")
