Is there any tutorial on how to use the multimodal capabilities?

#8
by lucasjin - opened

miniG looks awesome. The benchmark leaderboard isn't the only metric that reveals a model's performance, but if a model is extremely good, it should handle these benchmarks well.

For model inference, refer to THUDM/glm-4-9b-chat-1m and THUDM/glm-4v-9b.

It should be like this:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Tokenizer from the GLM-4V repo; its chat template also handles the image preprocessing
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)

query = 'Describe the image.'
image = Image.open("your image").convert('RGB')
# Build chat-formatted inputs; the image is passed inside the message dict
inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
                                       add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                       return_dict=True)  # chat mode

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "CausalLM/miniG",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "temperature": 0.3, "top_p": 0.8}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens so only the newly generated reply is decoded
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
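
As a small optional variation (a sketch reusing the same tokenizer, inputs, and gen_kwargs from above), you can stream tokens as they are generated with Transformers' TextStreamer:

from transformers import TextStreamer

# Print tokens as they are produced instead of waiting for the full output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
with torch.no_grad():
    model.generate(**inputs, streamer=streamer, **gen_kwargs)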
JosephusCheung changed discussion status to closed

Hi, what's the vision encoder used here, and what's the input resolution?
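
One way to check is to inspect the model config directly (a sketch; it assumes miniG exposes a GLM-4V-style vision_config section, which isn't confirmed here):

from transformers import AutoConfig

# Load only the config to inspect vision settings without downloading the weights
config = AutoConfig.from_pretrained("CausalLM/miniG", trust_remote_code=True)
# vision_config (if present) typically lists image_size, patch_size, etc.
print(getattr(config, "vision_config", None))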
