|
---
language:
- ja
- en
base_model:
- sbintuitions/sarashina2-7b
license: mit
tags:
- multimodal
- vision-language
- llama
- qwen2_vl
pipeline_tag: image-to-text
library_name: transformers
---
|
|
|
# Sarashina2-Vision-8B |
|
**Sarashina2-Vision-8B** is a Japanese Large Vision Language Model trained by [SB Intuitions](https://www.sbintuitions.co.jp/). |
|
|
|
This model combines the [Sarashina2-7B](https://huggingface.co/sbintuitions/sarashina2-7b) language model with the image encoder of [Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B).
|
|
|
It achieved the highest scores on four benchmarks (as of 2025/03/07) compared with other Japanese VLMs.
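
If you want to see how the two components are combined, you can inspect the composite configuration (an optional sanity check; the exact config field names are defined by the model's remote code):

```python
from transformers import AutoConfig

# trust_remote_code is required because the architecture is defined in the
# repository's custom modeling code.
config = AutoConfig.from_pretrained("sbintuitions/sarashina2-vision-8b", trust_remote_code=True)
print(config)
```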
|
|
|
## How to use |
|
### 1. Install dependencies |
|
|
|
```sh
pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate
```
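
If inference later fails with import or signature errors, it is worth confirming that the pinned `transformers` version was actually installed:

```sh
python -c "import transformers; print(transformers.__version__)"  # expected: 4.47.0
```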
|
|
|
### 2. Inference |
|
The following script loads the model and runs inference on a sample image.
|
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Define model path
model_path = "sbintuitions/sarashina2-vision-8b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Prompt: "Of the things in this photo, what building is considered the most
# famous, and where in the photo does it appear?"
message = [{"role": "user", "content": "この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?"}]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?
### Assistant:"""

# Fetch the sample image and preprocess it together with the prompt
sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-8b/resolve/main/sample.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
stopping_criteria = processor.get_stopping_criteria(["\n###"])

# Inference: greedy decoding (do_sample=False), stopping at the next turn marker
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    stopping_criteria=stopping_criteria,
)
# Strip the prompt tokens from each sequence before decoding
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
# Output: "The most famous building shown in this photo is Tokyo Tower. Tokyo Tower
# is a landmark of Tokyo, and in this photo it appears beyond the high-rise buildings."
"""この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京のランドマークであり、この写真では、高層ビル群の向こう側に写っています。"""
```
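
To print tokens as they are generated rather than waiting for the full answer, you can pass a `TextStreamer` to `generate`. This is a minimal sketch reusing the objects above; it assumes the underlying tokenizer is available as `processor.tokenizer`:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced; skip_prompt avoids
# echoing the input prompt, and decode kwargs are forwarded to the tokenizer.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    stopping_criteria=stopping_criteria,
    streamer=streamer,
)
```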
|
|
|
### Example |
|
<img src="https://huggingface.co/sbintuitions/sarashina2-vision-8b/resolve/main/sample.jpg" width="350"> |
|
|
|
|Prompt|Output|
|-|-|
|この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?<br>(Of the things in this photo, what building is considered the most famous, and where does it appear?)|この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京のランドマークであり、この写真では、高層ビル群の向こう側に写っています。<br>(The most famous building shown in this photo is Tokyo Tower. Tokyo Tower is a landmark of Tokyo, and in this photo it appears beyond the high-rise buildings.)|
|真ん中に映っている赤と白の物は何ですか?<br>(What is the red and white object in the middle?)|真ん中に映っている赤と白のものはクレーンです。<br>(The red and white object in the middle is a crane.)|
|
|
|
## Training |
|
**Sarashina2-Vision** is trained through the following three-stage process (a minimal code sketch follows the list):

1. We tune the parameters of the projector on caption datasets.
2. We tune the parameters of the vision encoder and the projector on caption datasets.
3. We tune the parameters of the projector and the LLM on visual instruction datasets.
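
In code, this schedule amounts to toggling which parameter groups are trainable at each stage. The sketch below is purely illustrative: the submodule prefixes `vision_encoder`, `projector`, and `llm` are assumptions for clarity, not the attribute names used in the released training code.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, stage: int) -> None:
    """Freeze everything except the parameter groups tuned in the given stage.

    Hypothetical parameter-group prefixes; the real module names may differ.
    """
    trainable = {
        1: ("projector",),                   # Stage 1: projector only (caption data)
        2: ("vision_encoder", "projector"),  # Stage 2: vision encoder + projector (caption data)
        3: ("projector", "llm"),             # Stage 3: projector + LLM (visual instruction data)
    }[stage]
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in trainable)
```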
|
|
|
## Evaluation Results |
|
|Model|Model Size|JMMMU<sup>*1</sup>|Heron-Bench<sup>*2</sup>|JDocQA|
|-|-|-|-|-|
|[heron-chat-git-ja-stablelm-base-7b-v1](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v1)|7B|0.294|0.461|0.069|
|[llava-calm2-siglip](https://huggingface.co/cyberagent/llava-calm2-siglip)|7B|0.07|0.521|0.084|
|[Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2)|8B|0.389|0.509|0.103|
|[Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B)|14B|0.302|0.433|0.06|
|[llm-jp-3-vila-14b](https://huggingface.co/llm-jp/llm-jp-3-vila-14b)|14B|0.23|**0.665**|0.176|
|[EZO-InternVL2-26B](https://huggingface.co/AXCXEPT/EZO-InternVL2-26B)|26B|0.389|0.609|0.196|
|[Sarashina2-Vision-8B](https://huggingface.co/sbintuitions/sarashina2-vision-8b)|8B|0.393|0.648|0.229|
|[Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)|14B|**0.433**|0.644|**0.245**|

1. Evaluated only single-image samples (1,286 samples). If answer extraction failed, we treated the sample as incorrect (score 0) instead of making a random choice, to eliminate stochasticity (a sketch of this rule follows the list).
2. GPT-4o (gpt-4o-2024-08-06) was used as the judge for LLM-as-a-Judge.
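
For concreteness, the deterministic rule in footnote 1 could look like the following sketch (hypothetical, not the official evaluation harness; it assumes choice labels A–E and a naive extraction pattern):

```python
import re

def score_sample(prediction: str, answer: str) -> int:
    """Score one multiple-choice sample without randomness: if no choice
    letter can be extracted from the model output, count it as incorrect
    (score 0) instead of falling back to a random guess."""
    match = re.search(r"\b([A-E])\b", prediction)
    if match is None:
        return 0  # extraction failed -> incorrect
    return int(match.group(1) == answer)
```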
|
|
|
|
|
## Ethical Considerations and Limitations |
|
Sarashina2-Vision may generate meaningless sequences, inaccurate instances, or biased/objectionable outputs. Before using Sarashina2-Vision, we ask developers to tune the model based on human preferences and safety considerations.
|
|
|
## License
|
[MIT License](./LICENSE) |
|
|