---
license: mit
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: visual-question-answering
---

# Model Card

This is a multimodal implementation of the [Phi-2](https://huggingface.co/microsoft/phi-2) model, inspired by [LLaVA-Phi](https://github.com/zhuyiche/llava-phi).

## Model Details

1. LLM Backbone: [Phi-2](https://huggingface.co/microsoft/phi-2)
2. Vision Tower: [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)
3. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200k samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
4. Finetuning Dataset: [Instruct 150k dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
5. Finetuned Model: [marianna13/llava-phi-2-3b](https://huggingface.co/marianna13/llava-phi-2-3b)

### Model Sources

- **Original Repository:** [LLaVA-Phi](https://github.com/zhuyiche/llava-phi)
- **Paper:** [LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model](https://arxiv.org/pdf/2401.02330)
- **Demo:** [Demo Link](https://huggingface.co/spaces/RaviNaik/MultiModal-Phi2)
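
### How the Components Fit Together

The list above names the building blocks; the sketch below is a minimal, illustrative example of how a LLaVA-style pipeline combines them. The two-layer MLP projector, its dimensions, and the `encode_image` helper are assumptions for illustration and do not reflect the released checkpoint's actual projector or weight layout.

```python
# Minimal sketch of a LLaVA-style stack: CLIP ViT-L/14-336 vision tower,
# a projector, and Phi-2 as the LLM backbone.
import torch
import torch.nn as nn
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
)

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Hypothetical 2-layer MLP projector mapping CLIP patch features (1024-d)
# into the Phi-2 embedding space (2560-d); the real projector may differ.
projector = nn.Sequential(
    nn.Linear(vision_tower.config.hidden_size, llm.config.hidden_size),
    nn.GELU(),
    nn.Linear(llm.config.hidden_size, llm.config.hidden_size),
)

def encode_image(pixel_values: torch.Tensor) -> torch.Tensor:
    """Return projected patch embeddings to be prepended to the text embeddings."""
    with torch.no_grad():
        outputs = vision_tower(pixel_values, output_hidden_states=True)
    # LLaVA-style models typically take the penultimate layer's patch tokens
    # and drop the CLS token before projecting.
    patch_features = outputs.hidden_states[-2][:, 1:, :]
    return projector(patch_features.to(projector[0].weight.dtype))
```

In this design, the projected image tokens are concatenated with the token embeddings of the text prompt before being fed to the LLM, which is the mechanism LLaVA-Phi adopts from LLaVA.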