BakLLaVA Model Card

BakLLaVA is derived from the original LLaVA architecture and uses Mistral-7B as its text backbone.

Below is the model card for the BakLLaVA 7B model, copied from the original BakLLaVA model card.

BakLLaVA 1 is a Mistral 7B base augmented with the LLaVA 1.5 architecture. In this first version, we show that a Mistral 7B base outperforms Llama 2 13B on several benchmarks. You can run BakLLaVA-1 from our repo (https://github.com/SkunkworksAI/BakLLaVA), which we are currently updating to make fine-tuning and inference easier.

Note: BakLLaVA-1 is fully open-source, but it was trained on data that includes LLaVA's corpus, which is not commercially permissive. We will fix this in the upcoming release.

BakLLaVA 2 is cooking with a significantly larger (commercially viable) dataset and a novel architecture that expands beyond the current LLaVA method. BakLLaVA-2 will do away with the restrictions of BakLLaVA-1.

How to use the model

First, make sure you have transformers >= 4.35.3 installed. The model supports multi-image and multi-prompt generation, meaning that you can pass multiple images in your prompt. Also make sure to follow the correct prompt template (USER: xxx\nASSISTANT:) and add the token <image> at each location where you want to query an image.
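As a rough illustration of the prompt template, the sketch below formats a single user turn by hand. The helper `build_prompt` and the constant `IMAGE_TOKEN` are hypothetical names introduced here, not part of transformers, and placing the `<image>` tokens before the question is an assumption; you can put them wherever the image should be referenced.

```python
# Hypothetical helper (not part of transformers): formats one user turn
# following the template quoted above, "USER: xxx\nASSISTANT:".
IMAGE_TOKEN = "<image>"


def build_prompt(question: str, num_images: int = 1) -> str:
    """Return a prompt with one <image> token per image, then the question."""
    # Assumption: image tokens go first, each on its own line.
    parts = [IMAGE_TOKEN] * num_images + [question]
    return "USER: " + "\n".join(parts) + "\nASSISTANT:"


print(build_prompt("What is shown in this image?"))
print(build_prompt("What differs between these images?", num_images=2))
```

When you use `processor.apply_chat_template`, as in the examples further down, the processor builds this string for you, so a manual helper like this is only needed when you write raw prompts.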

You can also check out the Google Colab demo to run LLaVA on a free-tier Google Colab instance.

Or check out our Spaces demo!

Using `pipeline`:

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/bakLlava-v1-hf")
messages = [
    {
      "role": "user",
      "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"},
          {"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
        ],
    },
]

out = pipe(text=messages, max_new_tokens=20)
print(out)
# Example output:
# [{'input_text': [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg'}, {'type': 'text', 'text': 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'}]}], 'generated_text': 'Lava'}]
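
To pull just the answer out of the pipeline result, a small sketch like the following may help. It assumes the output has the shape printed above (a list of dicts, each with a `generated_text` key); `sample_out` and `generated_texts` are illustrative names introduced here, not part of transformers.

```python
# Sample result mirroring the structure printed above (image entry omitted).
sample_out = [
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What does the label 15 represent?"},
                ],
            }
        ],
        "generated_text": "Lava",
    }
]


def generated_texts(pipeline_output):
    """Collect the generated_text field of every result in the list."""
    return [item["generated_text"] for item in pipeline_output]


print(generated_texts(sample_out))  # ['Lava']
```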

Using pure `transformers`:

Below is an example script to run generation in float16 precision on a GPU device:

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/bakLlava-v1-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
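
Since the model accepts several images per prompt, the conversation structure used above can be extended with one `{"type": "image"}` entry per image. The sketch below only builds that structure; `make_conversation` is a hypothetical helper, and `image_1` and `image_2` in the comments stand for PIL images you would load yourself.

```python
def make_conversation(question, num_images=1):
    """Build a single-turn conversation with num_images image placeholders."""
    # One {"type": "image"} entry per image, followed by the text question.
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


conversation = make_conversation("What differs between these two images?", num_images=2)
print(conversation)

# With the processor from the script above, this would be used roughly as:
#   prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
#   inputs = processor(images=[image_1, image_2], text=prompt, return_tensors="pt")
```

The number of images passed to the processor should match the number of image entries in the conversation.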

From transformers>=v4.48, you can also pass an image URL or local path in the conversation history and let the chat template handle the rest. The chat template will load the image for you and return the inputs as torch.Tensor objects, which you can pass directly to model.generate().

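messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

# Tokenize the chat and load the image in one step, then move the inputs
# to the same device and dtype as the model loaded above.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))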

Model optimization

4-bit quantization through `bitsandbytes` library

First, install bitsandbytes (pip install bitsandbytes) and make sure you have access to a CUDA-compatible GPU device. Then change the snippet above as follows:

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
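
To get a feel for why 4-bit loading helps, here is a back-of-the-envelope estimate of the memory needed for the weights alone. The figure of 7 billion parameters is an assumption based on the Mistral 7B backbone; the sketch ignores the vision encoder, quantization metadata, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 7B parameters for the text backbone.
ASSUMED_PARAMS = 7_000_000_000

fp16 = weight_memory_gib(ASSUMED_PARAMS, 16)
int4 = weight_memory_gib(ASSUMED_PARAMS, 4)
print(f"float16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB")
# float16: 13.0 GiB, 4-bit: 3.3 GiB
```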

Use Flash-Attention 2 to further speed up generation

First, make sure flash-attn is installed; refer to the original Flash Attention repository for installation instructions. Then change the snippet above as follows (newer transformers releases spell this option attn_implementation="flash_attention_2"):

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)

Evaluations

Training dataset

558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
450K academic-task-oriented VQA data mixture.
40K ShareGPT data.
Additional private data (permissive)
