
Model Card for RaviNaik/Llava-Phi2

This is a multimodal implementation of the Phi2 model, inspired by LLaVA-Phi.

Model Details

LLM Backbone: Phi2
Vision Tower: clip-vit-large-patch14-336
Pretraining Dataset: LAION-CC-SBU dataset with BLIP captions (200k samples)
Finetuning Dataset: Instruct 150k dataset based on COCO
Finetuned Model: RaviNaik/Llava-Phi2
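
The vision tower name encodes its geometry: "patch14" is the patch size in pixels and "336" is the input resolution. As a rough sketch, assuming (as in LLaVA-style models generally) that each image patch becomes one visual token for the language model and that the CLS token is not counted, the number of image tokens per picture can be estimated as below. The function name is illustrative and not part of the llava-phi codebase.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches a ViT cuts an image into."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size  # patches along one edge
    return per_side * per_side           # patches over the whole image


# clip-vit-large-patch14-336: 336-pixel input, 14-pixel patches
print(num_patch_tokens(336, 14))  # 24 * 24 = 576
```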

Model Sources

Original Repository: Llava-Phi
Paper: LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
Demo: Demo Link

How to Get Started with the Model

Use the code below to get started with the model.

Clone the LLaVA-Phi repository and navigate to the llava-phi folder

git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi

Install the package

conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
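
After the editable install, a quick check that the package is importable can catch a wrong or inactive conda environment before the model is run. This is a minimal sketch; the module name `llava_phi` is an assumption inferred from the script path `llava_phi/eval/run_llava_phi.py`, and `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# "llava_phi" is assumed from the repository layout, not confirmed by the card.
if is_installed("llava_phi"):
    print("llava_phi is importable")
else:
    print("llava_phi not found; is the llava_phi conda environment active?")
```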

Run the Model

python llava_phi/eval/run_llava_phi.py --model-path="RaviNaik/Llava-Phi2" \
    --image-file="https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true" \
    --query="How many people are there in the image?"
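
To ask several questions or use several images, the same command can be assembled from Python. The sketch below only builds the argument list from the three flags shown above (`--model-path`, `--image-file`, `--query`); `build_command` is a hypothetical helper, and the actual launch is left commented out because it must be run from inside the cloned llava-phi folder.

```python
import shlex
import sys

SCRIPT = "llava_phi/eval/run_llava_phi.py"  # path relative to the cloned repo


def build_command(model_path: str, image_file: str, query: str) -> list[str]:
    """Build the argument list for run_llava_phi.py using the documented flags."""
    return [
        sys.executable,
        SCRIPT,
        f"--model-path={model_path}",
        f"--image-file={image_file}",
        f"--query={query}",
    ]


cmd = build_command(
    "RaviNaik/Llava-Phi2",
    "https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true",
    "How many people are there in the image?",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To actually launch it (from inside the llava-phi folder):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with the spaces and the question mark in the query.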

Acknowledgement

This implementation is based on the wonderful work done by:
LLaVA-Phi
LLaVA
Phi2

Safetensors
Model size: 3.09B params
Tensor type: F32
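
The parameter count and tensor type give a rough lower bound on the memory needed just to hold the weights. The sketch below is back-of-the-envelope arithmetic only: it assumes 4 bytes per F32 parameter and ignores activations, the image features, and framework overhead. `weight_bytes` is an illustrative helper.

```python
def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory needed to store the weights alone, in bytes."""
    return num_params * bytes_per_param


params = 3_090_000_000              # 3.09B params, as listed above
total = weight_bytes(params, 4)     # F32 uses 4 bytes per parameter
print(f"{total / 1e9:.2f} GB")      # decimal gigabytes
print(f"{total / 2**30:.2f} GiB")   # binary gibibytes
```

Loading the weights in a 2-byte format would halve this figure. The card does not say whether the checkpoint has been tested at reduced precision.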