|
---
language:
- ja
- en
base_model:
- sbintuitions/sarashina2-7b
license: mit
tags:
- multimodal
- vision-language
- llama
- qwen2_vl
pipeline_tag: image-to-text
library_name: transformers
---
|
|
|
# Sarashina2-Vision-8B |
|
**Sarashina2-Vision-8B** is a Japanese Large Vision Language Model trained by [SB Intuitions](https://www.sbintuitions.co.jp/). |
|
|
|
This model combines the [Sarashina2-7B](https://huggingface.co/sbintuitions/sarashina2-7b) language model with the image encoder of [Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B).
|
|
|
It achieved the highest scores on four benchmarks (as of 2025/03/07) compared with other Japanese VLMs.
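
If you want to see how the two components are combined, you can inspect the composite configuration (an optional sanity check; the exact config field names are defined by the model's remote code):

```python
from transformers import AutoConfig

# trust_remote_code is required because the architecture is defined in the
# repository's custom modeling code.
config = AutoConfig.from_pretrained("sbintuitions/sarashina2-vision-8b", trust_remote_code=True)
print(config)
```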
|
|
|
## How to use |
|
### 1. Install dependencies |
|
|
|
```sh
pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate
```
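
If inference later fails with import or signature errors, it is worth confirming that the pinned `transformers` version was actually installed:

```sh
python -c "import transformers; print(transformers.__version__)"  # expected: 4.47.0
```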
|
|
|
### 2. Inference |
|
The following script loads the model and runs inference on a sample image.
|
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Define model path
model_path = "sbintuitions/sarashina2-vision-8b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Prompt: "Of the things in this photo, what building is considered the most
# famous, and where in the photo does it appear?"
message = [{"role": "user", "content": "この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?"}]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?
### Assistant:"""

# Fetch the sample image and preprocess it together with the prompt
sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-8b/resolve/main/sample.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
stopping_criteria = processor.get_stopping_criteria(["\n###"])

# Inference: greedy decoding (do_sample=False), stopping at the next turn marker
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    stopping_criteria=stopping_criteria,
)
# Strip the prompt tokens from each sequence before decoding
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
# Output: "The most famous building shown in this photo is Tokyo Tower. Tokyo Tower
# is a landmark of Tokyo, and in this photo it appears beyond the high-rise buildings."
"""この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京のランドマークであり、この写真では、高層ビル群の向こう側に写っています。"""
```
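
To print tokens as they are generated rather than waiting for the full answer, you can pass a `TextStreamer` to `generate`. This is a minimal sketch reusing the objects above; it assumes the underlying tokenizer is available as `processor.tokenizer`:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced; skip_prompt avoids
# echoing the input prompt, and decode kwargs are forwarded to the tokenizer.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    stopping_criteria=stopping_criteria,
    streamer=streamer,
)
```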
|
|
|
### Example |
|
<img src="https://huggingface.co/sbintuitions/sarashina2-vision-8b/resolve/main/sample.jpg" width="350"> |
|
|
|
|Prompt|Output|
|-|-|
|この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?<br>(Of the things in this photo, what building is considered the most famous, and where does it appear?)|この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京のランドマークであり、この写真では、高層ビル群の向こう側に写っています。<br>(The most famous building shown in this photo is Tokyo Tower. Tokyo Tower is a landmark of Tokyo, and in this photo it appears beyond the high-rise buildings.)|
|真ん中に映っている赤と白の物は何ですか?<br>(What is the red and white object in the middle?)|真ん中に映っている赤と白のものはクレーンです。<br>(The red and white object in the middle is a crane.)|
|
|
|
## Training |
|
**Sarashina2-Vision** is trained through the following three-stage process (a minimal code sketch follows the list):

1. We tune the parameters of the projector on caption datasets.
2. We tune the parameters of the vision encoder and the projector on caption datasets.
3. We tune the parameters of the projector and the LLM on visual instruction datasets.
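
In code, this schedule amounts to toggling which parameter groups are trainable at each stage. The sketch below is purely illustrative: the submodule prefixes `vision_encoder`, `projector`, and `llm` are assumptions for clarity, not the attribute names used in the released training code.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, stage: int) -> None:
    """Freeze everything except the parameter groups tuned in the given stage.

    Hypothetical parameter-group prefixes; the real module names may differ.
    """
    trainable = {
        1: ("projector",),                   # Stage 1: projector only (caption data)
        2: ("vision_encoder", "projector"),  # Stage 2: vision encoder + projector (caption data)
        3: ("projector", "llm"),             # Stage 3: projector + LLM (visual instruction data)
    }[stage]
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in trainable)
```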
|
|
|
## Evaluation Results |
|
|Model|Model Size|JMMMU<sup>*1</sup>|Heron-Bench<sup>*2</sup>|JDocQA|
|-|-|-|-|-|
|[heron-chat-git-ja-stablelm-base-7b-v1](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v1)|7B|0.294|0.461|0.069|
|[llava-calm2-siglip](https://huggingface.co/cyberagent/llava-calm2-siglip)|7B|0.07|0.521|0.084|
|[Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2)|8B|0.389|0.509|0.103|
|[Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B)|14B|0.302|0.433|0.06|
|[llm-jp-3-vila-14b](https://huggingface.co/llm-jp/llm-jp-3-vila-14b)|14B|0.23|**0.665**|0.176|
|[EZO-InternVL2-26B](https://huggingface.co/AXCXEPT/EZO-InternVL2-26B)|26B|0.389|0.609|0.196|
|[Sarashina2-Vision-8B](https://huggingface.co/sbintuitions/sarashina2-vision-8b)|8B|0.393|0.648|0.229|
|[Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)|14B|**0.433**|0.644|**0.245**|

1. Evaluated only single-image samples (1,286 samples). If answer extraction failed, we treated the sample as incorrect (score 0) instead of making a random choice, to eliminate stochasticity (a sketch of this rule follows the list).
2. GPT-4o (gpt-4o-2024-08-06) was used as the judge for LLM-as-a-Judge.
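
For concreteness, the deterministic rule in footnote 1 could look like the following sketch (hypothetical, not the official evaluation harness; it assumes choice labels A–E and a naive extraction pattern):

```python
import re

def score_sample(prediction: str, answer: str) -> int:
    """Score one multiple-choice sample without randomness: if no choice
    letter can be extracted from the model output, count it as incorrect
    (score 0) instead of falling back to a random guess."""
    match = re.search(r"\b([A-E])\b", prediction)
    if match is None:
        return 0  # extraction failed -> incorrect
    return int(match.group(1) == answer)
```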
|
|
|
|
|
## Ethical Considerations and Limitations |
|
Sarashina2-Vision may generate meaningless sequences, inaccurate instances, or biased/objectionable outputs. Before using Sarashina2-Vision, we ask developers to tune the model based on human preferences and safety considerations.
|
|
|
## License
|
[MIT License](./LICENSE) |
|
|