Is there any tutorial on how to use the multimodal capabilities?

#8
by lucasjin - opened

miniG looks awesome. The benchmark leaderboard isn't the only metric that reveals a model's performance, but if a model is extremely good, it should handle these benchmarks well.

For model inference, refer to THUDM/glm-4-9b-chat-1m and THUDM/glm-4v-9b.

It should be like this:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Tokenizer from the GLM-4V repo; its chat template also handles the image preprocessing
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)

query = 'Describe the image.'
image = Image.open("your image").convert('RGB')
# Build chat-formatted inputs; the image is passed inside the message dict
inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
                                       add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                       return_dict=True)  # chat mode

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "CausalLM/miniG",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "temperature": 0.3, "top_p": 0.8}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens so only the newly generated reply is decoded
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
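
As a small optional variation (a sketch reusing the same tokenizer, inputs, and gen_kwargs from above), you can stream tokens as they are generated with Transformers' TextStreamer:

from transformers import TextStreamer

# Print tokens as they are produced instead of waiting for the full output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
with torch.no_grad():
    model.generate(**inputs, streamer=streamer, **gen_kwargs)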
JosephusCheung changed discussion status to closed

Hi, what's the vision encoder used here, and what's the input resolution?
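
One way to check is to inspect the model config directly (a sketch; it assumes miniG exposes a GLM-4V-style vision_config section, which isn't confirmed here):

from transformers import AutoConfig

# Load only the config to inspect vision settings without downloading the weights
config = AutoConfig.from_pretrained("CausalLM/miniG", trust_remote_code=True)
# vision_config (if present) typically lists image_size, patch_size, etc.
print(getattr(config, "vision_config", None))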
