ConvLLaVA-JP Model Card

Model details

Model type:

ConvLLaVA-JP is a large vision-language model (LVLM) that can converse about input images.
It was trained using laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft as the image encoder and llm-jp/llm-jp-1.3b-v1.0 as the text decoder, and it accepts high-resolution 768 x 768 input images.

Training:

In the first phase, the Vision Projector and Stage 5 were trained using LLaVA-Pretrain-JA.
In the second phase, the Image Encoder, Vision Projector, Stage 5, and LLM were trained using LLaVA-Pretrain-JA.
In the third phase, the Vision Projector and LLM were fine-tuned using LLaVA-v1.5-Instruct-620K-JA.
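
The three phases differ in which components are updated and which dataset is used. As a rough sketch, the schedule can be written as a lookup table. The component labels (`image_encoder`, `stage5`, `vision_projector`, `llm`) and the `trainable_flags` helper are illustrative assumptions, not identifiers from the LLaVA-JP repository.

```python
# Illustrative sketch: the names below are hypothetical labels for the
# components mentioned in this card, not attribute names from LLaVA-JP.
TRAINING_PHASES = {
    1: {"dataset": "LLaVA-Pretrain-JA",
        "trainable": {"vision_projector", "stage5"}},
    2: {"dataset": "LLaVA-Pretrain-JA",
        "trainable": {"image_encoder", "vision_projector", "stage5", "llm"}},
    3: {"dataset": "LLaVA-v1.5-Instruct-620K-JA",
        "trainable": {"vision_projector", "llm"}},
}

ALL_COMPONENTS = ("image_encoder", "stage5", "vision_projector", "llm")


def trainable_flags(phase: int) -> dict[str, bool]:
    """Return {component: is_trainable} for the given phase (1, 2, or 3)."""
    trainable = TRAINING_PHASES[phase]["trainable"]
    return {name: name in trainable for name in ALL_COMPONENTS}


if __name__ == "__main__":
    for phase in TRAINING_PHASES:
        print(phase, TRAINING_PHASES[phase]["dataset"], trainable_flags(phase))
```

In a PyTorch training script, flags like these would typically decide whether `requires_grad_` is enabled on each component's parameters; how the repository actually wires this up is not described in this card.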

Resources for more information: https://github.com/tosiyuki/LLaVA-JP/tree/main

Comparing VLMs

| Model | JA-VG-VQA-500 (ROUGE-L) | JA-VLM-Bench-In-the-Wild (ROUGE-L) | Heron-Bench (Detail) | Heron-Bench (Conv) | Heron-Bench (Complex) | Heron-Bench (Average) |
|---|---|---|---|---|---|---|
| Japanese Stable VLM | - | 40.50 | 25.15 | 51.23 | 37.84 | 38.07 |
| EvoVLM-JP-v1-7B | 19.70 | 51.25 | 50.31 | 44.42 | 40.47 | 45.07 |
| Heron BLIP Japanese StableLM Base 7B llava-620k | 14.51 | 33.26 | 49.09 | 41.51 | 45.72 | 45.44 |
| Heron GIT Japanese StableLM Base 7B | 15.18 | 37.82 | 42.77 | 54.20 | 43.53 | 46.83 |
| llava-jp-1.3b-v1.0-620k | 12.69 | 44.58 | 51.21 | 41.05 | 45.95 | 44.84 |
| llava-jp-1.3b-v1.1 | 13.33 | 44.40 | 50.00 | 51.83 | 48.98 | 50.39 |
| ConvLLaVA-JP-1.3b-768 | 12.05 | 42.80 | 44.24 | 40.00 | 48.16 | 44.96 |
| ConvLLaVA-JP-1.3b-1280 | 11.88 | 43.64 | 38.95 | 44.79 | 41.24 | 42.31 |
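
Two of the columns in this comparison report ROUGE-L, which scores a generated answer by its longest common subsequence (LCS) with the reference answer. The following is a minimal sketch of an LCS-based ROUGE-L F-score, assuming pre-tokenized input and equal weighting of precision and recall. The tokenizer, weighting, and score scaling used by the benchmarks are not stated in this card, so the functions here are illustrative only.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """ROUGE-L F1 between two token sequences; 0.0 when nothing overlaps."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Character-level tokens are an assumption made for this toy example.
    reference = list("猫の隣にはノートパソコンがあります。")
    candidate = list("猫の隣にノートパソコンがあります。")
    print(f"{rouge_l_f1(reference, candidate):.3f}")
```

For Japanese text the choice of tokenizer (characters, morphemes, or subwords) changes the score noticeably, so values computed this way are not directly comparable to the reported numbers.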

How to use the model

1. Download dependencies

git clone https://github.com/tosiyuki/LLaVA-JP.git

2. Inference

import requests
import torch
import transformers
from PIL import Image

from transformers.generation.streamers import TextStreamer
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.llava_gpt2 import LlavaGpt2ForCausalLM
from llava.train.dataset import tokenizer_image_token


if __name__ == "__main__":
    model_path = 'toshi456/ConvLLaVA-JP-1.3b-768'
    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.bfloat16 if device=="cuda" else torch.float32

    model = LlavaGpt2ForCausalLM.from_pretrained(
        model_path, 
        low_cpu_mem_usage=True,
        use_safetensors=True,
        torch_dtype=torch_dtype,
        device_map=device,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_path,
        model_max_length=1532,
        padding_side="right",
        use_fast=False,
    )
    model.eval()

    conv_mode = "v1"
    conv = conv_templates[conv_mode].copy()

    # image pre-process
    image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
    
    # Preprocess the image, add a batch dimension, and match the model's device and dtype.
    image_tensor = model.get_model().vision_tower.image_processor(image).unsqueeze(0).to(device=device, dtype=torch_dtype)

    # create prompt
    # User: <image>\n{prompt}
    prompt = "猫の隣には何がありますか?"  # "What is next to the cat?"
    inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt, 
        tokenizer, 
        IMAGE_TOKEN_INDEX, 
        return_tensors='pt'
    ).unsqueeze(0)
    if device == "cuda":
        input_ids = input_ids.to(device)

    input_ids = input_ids[:, :-1]  # drop the trailing </sep> token that is appended to the input
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    streamer = TextStreamer(tokenizer, skip_prompt=True, timeout=20.0)

    # predict
    with torch.inference_mode():
        output_id = model.generate(
            inputs=input_ids,
            images=image_tensor,
            do_sample=False,
            temperature=1.0,
            top_p=1.0,
            max_new_tokens=256,
            streamer=streamer,
            use_cache=True,
        )
    # Expected output: 猫の隣にはノートパソコンがあります。 ("There is a laptop next to the cat.")

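The inference script defines `stop_str` and `keywords` but does not pass them to `generate`; the answer is printed by the streamer. If the generated text is decoded manually instead, it can be cut at the first stop string. The helper below is a hypothetical sketch (`trim_at_stop` is not part of LLaVA-JP), and the `</sep>` marker in the usage lines is an assumed placeholder rather than the confirmed value of the conversation template's separator.

```python
def trim_at_stop(text: str, stop_strings: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop string, then strip whitespace."""
    cut = len(text)
    for stop in stop_strings:
        if not stop:
            continue  # an empty stop string would match at index 0
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


if __name__ == "__main__":
    # "</sep>" is an assumed placeholder for the template's separator.
    decoded = "猫の隣にはノートパソコンがあります。</sep>"
    print(trim_at_stop(decoded, ["</sep>"]))
```

In the script, a helper like this would be applied to the string obtained by decoding `output_id` with the tokenizer, passing `keywords` as the list of stop strings.
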
Training dataset

Stage1 and Stage2 Pretrain

LLaVA-Pretrain-JA

Stage3 Fine-tuning

LLaVA-v1.5-Instruct-620K-JA

Acknowledgement

License

cc-by-nc-4.0

Safetensors

Model size: 2.1B params
Tensor type: F32, BF16