🎉 News

[2024.03.10] Base recipe out!
[2024.03.10] Finetune scripts out!
[2024.02.25] Update evaluation scripts and docs!
[2024.02.25] Data descriptions out. Release TinyLLaVA-1.5B and TinyLLaVA-2.0B!
[2024.02.24] Example code on inference and model loading added!
[2024.02.23] Evaluation code and scripts released!
[2024.02.21] Creating the TinyLLaVABench repository on GitHub!
[2024.02.21] Our paper: TinyLLaVA: A Framework of Small-scale Large Multimodal Models is out!
[2024.01.11] Our first model TinyLLaVA-1.4B is out!

⌛ TODO

Add support for Ollama and llama.cpp.
Developers' guide / How to build demo locally.
Training and custom finetuning docs.
Model Zoo descriptions.
Examples and inference.
Release code for training.
Add descriptions for evaluation.
Add descriptions for data preparation.
Release TinyLLaVA-1.5B and TinyLLaVA-2.0B.
Release TinyLLaVA-3.1B.
Release the evaluation code and weights today (2024.2.23).

🔥 High performance, but with fewer parameters

Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL.

Install
Model Zoo
Demo
Quick Start
Run Inference
Evaluation
Data
Train
Custom Finetune

🔧 Requirements and Installation

We recommend the following setup.

Clone this repository and navigate to the TinyLLaVABench folder

git clone https://github.com/DLCV-BUAA/TinyLLaVABench.git
cd TinyLLaVABench

Install Package

conda create -n tinyllava python=3.10 -y
conda activate tinyllava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training cases

pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Upgrade to the latest code base

git pull
pip install -e .

# if you see some import errors when you upgrade, please try running the command below (without #)
# pip install flash-attn --no-build-isolation --no-cache-dir
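
After installing, a quick import check can confirm that the environment is usable. The sketch below is not part of the repository: `check_environment` is a hypothetical helper, and the package list is an assumption based on the install commands and imports shown in this README.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "transformers", "tinyllava", "flash_attn")):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, found in check_environment().items():
        # flash_attn is only needed for the training install
        print(f"{name:<14}{'ok' if found else 'missing'}")
```

A `missing` entry for `flash_attn` is expected if you skipped the training packages.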

🐳 Model Zoo

Legacy Model

tiny-llava-hf

Pretrained Models

Model Details

Name	LLM	Checkpoint	LLaVA-Bench-Wild	MME	MMBench	MM-Vet	SQA-image	VQA-v2	GQA	TextVQA
TinyLLaVA-3.1B	Phi-2	TinyLLaVA-3.1B	75.8	1464.9	66.9	32.0	69.1	79.9	62.0	59.1
TinyLLaVA-2.0B	StableLM-2-1.6B	TinyLLaVA-2.0B	66.4	1433.8	63.3	32.6	64.7	78.9	61.9	56.4
TinyLLaVA-1.5B	TinyLlama	TinyLLaVA-1.5B	60.8	1276.5	55.2	25.8	60.3	76.9	60.3	51.7

Demo

Gradio Web Demo

Launch a local web demo by running:

python tinyllava/serve/app.py --model-path bczhou/TinyLLaVA-3.1B --model-name TinyLLaVA-3.1B

CLI Inference

We also support running inference with CLI. To use our model, run:

python -m tinyllava.serve.cli \
    --model-path bczhou/TinyLLaVA-3.1B \
    --image-file "./tinyllava/serve/examples/extreme_ironing.jpg"

🔧 Quick Start

Load model

from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "bczhou/TinyLLaVA-3.1B"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

🔧 Run Inference

Here's an example of running inference with TinyLLaVA-3.1B

Run Inference

from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "bczhou/TinyLLaVA-3.1B"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": "phi",
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)

Important

We use different conv_mode for different models. Replace the conv_mode in args according to this table:

model	conv_mode
TinyLLaVA-3.1B	phi
TinyLLaVA-2.0B	phi
TinyLLaVA-1.5B	v1
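
To avoid setting the wrong conv_mode by hand, the table can be written as a small lookup. This is a minimal sketch, not code from the repository: `CONV_MODES` and `conv_mode_for` are hypothetical names, and the lookup assumes the model path ends in one of the model names listed in the table.

```python
# conv_mode per model, copied from the table in this README
CONV_MODES = {
    "TinyLLaVA-3.1B": "phi",
    "TinyLLaVA-2.0B": "phi",
    "TinyLLaVA-1.5B": "v1",
}


def conv_mode_for(model_path: str) -> str:
    """Return the conv_mode for a path such as 'bczhou/TinyLLaVA-3.1B'."""
    name = model_path.rstrip("/").split("/")[-1]
    if name not in CONV_MODES:
        raise ValueError(f"No conv_mode known for {name!r}")
    return CONV_MODES[name]
```

The result can then be passed as the "conv_mode" entry of args, for example `conv_mode_for(model_path)`.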

Evaluation

To ensure reproducibility, we evaluate the models with greedy decoding.

See Evaluation.md
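
For reference, greedy decoding corresponds to the `temperature: 0` and `num_beams: 1` settings used in the inference args above. The sketch below shows one way such settings could be turned into keyword arguments for a Hugging Face style `generate()` call. `generation_kwargs` is a hypothetical helper, not the repository's evaluation code.

```python
def generation_kwargs(temperature=0.0, top_p=None, num_beams=1, max_new_tokens=512):
    """Build keyword arguments for a Hugging Face style generate() call."""
    kwargs = {"num_beams": num_beams, "max_new_tokens": max_new_tokens}
    if temperature > 0:
        # Sampling: output varies from run to run
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
        if top_p is not None:
            kwargs["top_p"] = top_p
    else:
        # No sampling; with num_beams=1 this is greedy decoding,
        # which always picks the most likely next token
        kwargs["do_sample"] = False
    return kwargs
```

With the defaults, the returned dictionary disables sampling, so repeated runs on the same input should produce the same text.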

Data Preparation

In our paper, we used two different datasets: the LLaVA dataset and the ShareGPT4V dataset, and compared their differences. In this section, we provide information on data preparation.

Pretraining Images

LLaVA: The pretraining images of LLaVA are from the 558K subset of the LAION-CC-SBU dataset.
ShareGPT4V: The pretraining images of ShareGPT4V are a mixture of the 558K LAION-CC-SBU subset, the SAM dataset, and the COCO dataset.

Pretraining Annotations

LLaVA: The pretraining annotations of LLaVA are here.
ShareGPT4V: The pretraining annotations of ShareGPT4V are here.

SFT Images & Annotations

The two SFT datasets are mostly the same, except that the 23K detailed description data in LLaVA-1.5-SFT is replaced with detailed captions randomly sampled from the 100K ShareGPT4V data.

Download data

Download relevant images

LAION-CC-SBU-558K: images.zip
COCO: This dataset is from the COCO2017 challenge. Download: train2017
WebData: This dataset is curated by the ShareGPT4V project. Download: images. Only for academic usage.
SAM: This dataset is collected by Meta. Download: images. We only use 000000~000050.tar for now. If you just want to use ShareGPT4V for SFT, you can quickly download 9K images from here.
GQA: GQA project page. Download: images
OCR-VQA: OCR-VQA project page. Download: download script. We save all files as .jpg
TextVQA: TextVQA project page. Download: trainvalimages
VisualGenome: VisualGenome project page. Download: part1, part2

Download relevant annotations

LLaVA's pretraining annotations: blip_laion_cc_sbu_558k.json
LLaVA's SFT annotations: llava_v1_5_mix665k.json
ShareGPT4V's pretraining annotations: share-captioner_coco_lcs_sam_1246k_1107.json
ShareGPT4V's SFT annotations: sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json

Organize Data

Organize the image files and annotation files as follows in path/to/your/data:

data
├── llava
│   ├── llava_pretrain
│   │   ├── images
│   │   ├── blip_laion_cc_sbu_558k.json
├── coco
│   ├── train2017
├── sam
│   ├── images
├── gqa
│   ├── images
├── ocr_vqa
│   ├── images
├── textvqa
│   ├── train_images
├── vg
│   ├── VG_100K
│   ├── VG_100K_2
├── share_textvqa
│   ├── images
├── web-celebrity
│   ├── images
├── web-landmark
│   ├── images
├── wikiart
│   ├── images
├── text_files
│   ├── llava_v1_5_mix665k.json
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
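
Before training, it can help to confirm that the layout above is complete. The following is a minimal sketch, not a script from the repository: `EXPECTED` simply lists the paths from the tree above, and `missing_entries` is a hypothetical helper that reports which of them are absent under your data root.

```python
from pathlib import Path

# Paths relative to the data root, taken from the directory tree above
EXPECTED = [
    "llava/llava_pretrain/images",
    "llava/llava_pretrain/blip_laion_cc_sbu_558k.json",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
    "text_files/llava_v1_5_mix665k.json",
    "text_files/share-captioner_coco_lcs_sam_1246k_1107.json",
    "text_files/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
]


def missing_entries(data_root):
    """Return the expected paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Replace with your own data directory
    for rel in missing_entries("path/to/your/data"):
        print(f"missing: {rel}")
```

The check only tests for existence. It does not verify that the image folders are fully downloaded.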

Train

In this section, we describe the base recipe.

Hyperparameters

The hyperparameters used in both pretraining and finetuning are provided below.

Pretraining

Model	Global Batch Size	Learning rate	Epochs	Max length	Weight decay
TinyLLaVA-3.1B	256	1e-3	1	3072	0

Finetuning

Model	Global Batch Size	Learning rate	Epochs	Max length	Weight decay
TinyLLaVA-3.1B	128	2e-5	1	3072	0
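
The global batch size is the product of the per-device batch size, the number of GPUs, and the gradient accumulation steps. If your hardware differs from ours, the sketch below shows how to pick the accumulation steps so that the global batch size in the tables is preserved. `grad_accum_steps` is a hypothetical helper, and the GPU counts and per-device batch sizes in the usage lines are illustrative assumptions, not the settings of our scripts.

```python
def grad_accum_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Return the gradient accumulation steps that reach the global batch size."""
    per_step = per_device_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of "
            f"{per_device_batch} x {num_gpus} = {per_step}"
        )
    return global_batch // per_step


if __name__ == "__main__":
    # Pretraining (global batch 256) on a hypothetical 8-GPU node
    print(grad_accum_steps(256, per_device_batch=32, num_gpus=8))
    # Finetuning (global batch 128) on a hypothetical 4-GPU node
    print(grad_accum_steps(128, per_device_batch=8, num_gpus=4))
```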

Pretrain

Replace the paths in the script with your own paths.

Training script with DeepSpeed ZeRO-2: pretrain.sh.

Finetune

Replace the paths in the script with your own paths.

Training script with DeepSpeed ZeRO-3: finetune.sh.

Custom-Finetune

Check out our custom finetune using LoRA here.

- Prompt Template

The model supports multi-image and multi-prompt generation. When using the model, make sure to follow the correct prompt template (USER: <image>xxx\nASSISTANT:), where the <image> token is a special placeholder token for image embeddings.
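
A small helper keeps the template consistent across prompts. This is a minimal sketch for the single-image case: `build_prompt` is a hypothetical name, and the layout (image token, newline, question) follows the example prompts used later in this README.

```python
IMAGE_TOKEN = "<image>"


def build_prompt(question: str) -> str:
    """Wrap a question in the USER/ASSISTANT template with one image token."""
    return f"USER: {IMAGE_TOKEN}\n{question.strip()}\nASSISTANT:"


if __name__ == "__main__":
    # repr() shows the newline characters explicitly
    print(repr(build_prompt("What are these?")))
```

For several prompts, call `build_prompt` once per question and pass the resulting list to the processor.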

Model Inference from `pipeline` and `transformers`

- Using `pipeline`:

Below we use the "bczhou/tiny-llava-v1-hf" checkpoint.

from transformers import pipeline
from PIL import Image
import requests
model_id = "bczhou/tiny-llava-v1-hf"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0])
# Output: {'generated_text': 'USER:  \nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT: The label 15 represents lava, which is a type of volcanic rock.'}

- Using pure `transformers`:

Below is an example script to run generation in float16 precision on a GPU device:

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
model_id = "bczhou/tiny-llava-v1-hf"
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)
raw_image = Image.open(requests.get(image_file, stream=True).raw)
# Keyword arguments avoid depending on the positional order of text and images
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

✏ Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our paper 📝.

@misc{zhou2024tinyllava,
      title={TinyLLaVA: A Framework of Small-scale Large Multimodal Models}, 
      author={Baichuan Zhou and Ying Hu and Xi Weng and Junlong Jia and Jie Luo and Xien Liu and Ji Wu and Lei Huang},
      year={2024},
      eprint={2402.14289},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

❤️ Community efforts

Our codebase is built upon the LLaVA project. Great work!
Our project uses data from the ShareGPT4V project. Great work!
