LLaVA-JP Model Card
Model detail
Model type:
LLaVA-JP is a vision-language model that can converse about input images.
This model was trained by fine-tuning lightblue/karasu-1.1B with the LLaVA method, using google/siglip-so400m-patch14-384 as the image encoder.
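As a rough illustration of how much visual input the language model receives: the encoder name indicates 14-pixel patches on a 384-pixel input. The sketch below estimates the number of patch tokens per image, assuming a standard ViT-style patch embedding that drops any remainder pixels (an assumption about the encoder, not something this card states). The function name `num_image_patches` is hypothetical.

```python
def num_image_patches(image_size: int, patch_size: int) -> int:
    """Estimate the patch-token count for a square image.

    Assumes non-overlapping square patches; pixels that do not fill a
    whole patch along an edge are dropped.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    # 384 // 14 = 27 patches per side, so 27 * 27 patches in total.
    print(num_image_patches(384, 14))  # prints 729
```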
Training:
In the first phase, the vision projector was trained on LLaVA-Pretrain-JA.
In the second phase, the model was fine-tuned on LLaVA-v1.5-Instruct-620K-JA.
Resources for more information: https://github.com/tosiyuki/LLaVA-JP/tree/main
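The two training phases can be pictured as a choice of which parameter groups receive gradients. This is a minimal sketch following the usual LLaVA recipe (projector only in the first phase, projector plus language model in the second, image encoder frozen throughout). Whether LLaVA-JP freezes exactly these groups is an assumption, and the parameter-name prefixes below are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical name prefixes for the trainable parts in each phase.
STAGE_TRAINABLE_PREFIXES: Dict[int, Tuple[str, ...]] = {
    1: ("mm_projector.",),                     # phase 1: projector only
    2: ("mm_projector.", "language_model."),   # phase 2: projector + LLM
}


def trainable_flags(param_names: Iterable[str], stage: int) -> Dict[str, bool]:
    """Map each parameter name to True if it is trained in the given stage."""
    if stage not in STAGE_TRAINABLE_PREFIXES:
        raise ValueError(f"unknown stage: {stage}")
    prefixes = STAGE_TRAINABLE_PREFIXES[stage]
    return {name: name.startswith(prefixes) for name in param_names}


if __name__ == "__main__":
    names = [
        "vision_tower.embeddings.weight",
        "mm_projector.0.weight",
        "language_model.layers.0.weight",
    ]
    for stage in (1, 2):
        print(stage, trainable_flags(names, stage))
```

With a PyTorch model, flags like these would typically be applied by setting `requires_grad` on each entry of `model.named_parameters()`.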
Comparing VLMs:
Model | JA-VG-VQA-500 (ROUGE-L) | JA-VLM-Bench-In-the-Wild (ROUGE-L) | Heron-Bench(Detail) | Heron-Bench(Conv) | Heron-Bench(Complex) | Heron-Bench(Average) |
---|---|---|---|---|---|---|
Japanese Stable VLM | - | 40.50 | 25.15 | 51.23 | 37.84 | 38.07 |
EvoVLM-JP-v1-7B | 19.70 | 51.25 | 50.31 | 44.42 | 40.47 | 45.07 |
Heron BLIP Japanese StableLM Base 7B llava-620k | 14.51 | 33.26 | 49.09 | 41.51 | 45.72 | 45.44 |
Heron GIT Japanese StableLM Base 7B | 15.18 | 37.82 | 42.77 | 54.20 | 43.53 | 46.83 |
llava-jp-1.3b-v1.0-620k | 12.69 | 44.58 | 51.21 | 41.05 | 45.95 | 44.84 |
llava-jp-1.3b-v1.1 | 13.33 | 44.40 | 50.00 | 51.83 | 48.98 | 50.39 |
llava-jp-karasu-1.1b-v1.0-620k | 13.23 | 44.59 | 42.16 | 43.79 | 40.35 | 42.16 |
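Two of the columns report ROUGE-L, which scores a generated answer by the longest common subsequence (LCS) it shares with the reference answer. The sketch below is an illustration of the metric, not the evaluation script used to produce the table. It assumes character-level tokens (a common simplification for unsegmented Japanese text) and the balanced F1 variant; neither choice is confirmed by this card.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Character-level ROUGE-L F1 in the range [0, 1]."""
    cand, ref = list(candidate), list(reference)
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "猫の隣にはノートパソコンがあります。"
    candidate = "猫の隣にはパソコンがあります。"
    print(round(rouge_l_f1(candidate, reference), 4))
```

The table values appear to be on a 0 to 100 scale, so a score from this sketch would be multiplied by 100 for comparison.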
How to use the model
1. Download dependencies
git clone https://github.com/tosiyuki/LLaVA-JP.git -b develop
2. Inference
import requests
import torch
import transformers
from PIL import Image
from transformers.generation.streamers import TextStreamer
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.llava_llama import LlavaLlamaForCausalLM
from llava.train.arguments_dataclass import ModelArguments, DataArguments, TrainingArguments
from llava.train.dataset import tokenizer_image_token
if __name__ == "__main__":
    # Parse the repository's argument dataclasses from the command line.
    parser = transformers.HfArgumentParser(
        (ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    model_path = 'toshi456/llava-jp-karasu-1.1b-v1.0-620k'
    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.bfloat16 if device == "cuda" else torch.float32

    model = LlavaLlamaForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        torch_dtype=torch_dtype,
        device_map=device,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_path,
        model_max_length=1532,
        padding_side="right",
        use_fast=False,
    )
    model.eval()

    conv_mode = "karasu"
    conv = conv_templates[conv_mode].copy()

    # image pre-process
    image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
    vision_tower = model.get_model().vision_tower
    image_size = vision_tower.image_processor.size["height"]
    if vision_tower.scales is not None:
        # multiply the base size by the number of configured scales
        image_size = vision_tower.image_processor.size["height"] * len(vision_tower.scales)
    # convert straight to the model's device and dtype
    image_tensor = vision_tower.image_processor(
        image,
        return_tensors='pt',
        size={"height": image_size, "width": image_size}
    )['pixel_values'].to(device=device, dtype=torch_dtype)

    # create prompt
    # format: ユーザー: <image>\n{prompt}   ("ユーザー" means "User")
    prompt = "猫の隣には何がありますか?"  # "What is next to the cat?"
    inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt,
        tokenizer,
        IMAGE_TOKEN_INDEX,
        return_tensors='pt'
    ).unsqueeze(0).to(device)
    # drop the last token of the encoded prompt before generating
    input_ids = input_ids[:, :-1]

    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    streamer = TextStreamer(tokenizer, skip_prompt=True, timeout=20.0)

    # predict (the streamer prints the answer as it is generated)
    with torch.inference_mode():
        model.generate(
            inputs=input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.1,
            top_p=1.0,
            max_new_tokens=512,
            streamer=streamer,
            use_cache=True,
        )
    # Example output: 猫の隣にはノートパソコンがあります。
    # ("There is a laptop next to the cat.")
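The `tokenizer_image_token` call in the script turns the `<image>` placeholder into a single sentinel id that the model later swaps for image features. The sketch below illustrates that idea with a toy tokenizer. It is modeled on the upstream LLaVA helper rather than copied from this repository: the sentinel value -200 and the `<image>` literal are assumed from upstream LLaVA's constants, `toy_tokenize` and `splice_image_token` are hypothetical names, and special-token handling such as BOS is omitted.

```python
from typing import Callable, List

IMAGE_PLACEHOLDER = "<image>"  # assumed to match DEFAULT_IMAGE_TOKEN
IMAGE_SENTINEL_ID = -200       # assumed to match IMAGE_TOKEN_INDEX


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: one id per character (its Unicode code point)."""
    return [ord(ch) for ch in text]


def splice_image_token(
    prompt: str,
    tokenize: Callable[[str], List[int]] = toy_tokenize,
    sentinel: int = IMAGE_SENTINEL_ID,
) -> List[int]:
    """Tokenize the text around each placeholder and insert the sentinel id."""
    ids: List[int] = []
    for i, chunk in enumerate(prompt.split(IMAGE_PLACEHOLDER)):
        if i > 0:
            # One sentinel marks where image features will be inserted.
            ids.append(sentinel)
        ids.extend(tokenize(chunk))
    return ids


if __name__ == "__main__":
    print(splice_image_token("a<image>\nb"))  # prints [97, -200, 10, 98]
```

Because the sentinel is negative, it cannot collide with a real vocabulary id, which is what lets the model locate the image position unambiguously.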
Training dataset
Stage 1 Pretrain
- LLaVA-Pretrain-JA
Stage 2 Fine-tuning
- LLaVA-v1.5-Instruct-620K-JA
Acknowledgement
License
cc-by-nc-4.0