---
license: mit
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: visual-question-answering
---

# Model Card

This is a multimodal implementation of the [Phi-2](https://huggingface.co/microsoft/phi-2) model, inspired by [LLaVA-Phi](https://github.com/zhuyiche/llava-phi).

## Model Details

1. LLM Backbone: [Phi-2](https://huggingface.co/microsoft/phi-2)
2. Vision Tower: [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)
3. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200k samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
4. Finetuning Dataset: [Instruct 150k dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
5. Finetuned Model: [marianna13/llava-phi-2-3b](https://huggingface.co/marianna13/llava-phi-2-3b)

### Model Sources

- **Original Repository:** [LLaVA-Phi](https://github.com/zhuyiche/llava-phi)
- **Paper:** [LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model](https://arxiv.org/pdf/2401.02330)
- **Demo:** [Demo Link](https://huggingface.co/spaces/RaviNaik/MultiModal-Phi2)
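
### How the Components Fit Together

The list above names the building blocks; the sketch below is a minimal, illustrative example of how a LLaVA-style pipeline combines them. The two-layer MLP projector, its dimensions, and the `encode_image` helper are assumptions for illustration and do not reflect the released checkpoint's actual projector or weight layout.

```python
# Minimal sketch of a LLaVA-style stack: CLIP ViT-L/14-336 vision tower,
# a projector, and Phi-2 as the LLM backbone.
import torch
import torch.nn as nn
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
)

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Hypothetical 2-layer MLP projector mapping CLIP patch features (1024-d)
# into the Phi-2 embedding space (2560-d); the real projector may differ.
projector = nn.Sequential(
    nn.Linear(vision_tower.config.hidden_size, llm.config.hidden_size),
    nn.GELU(),
    nn.Linear(llm.config.hidden_size, llm.config.hidden_size),
)

def encode_image(pixel_values: torch.Tensor) -> torch.Tensor:
    """Return projected patch embeddings to be prepended to the text embeddings."""
    with torch.no_grad():
        outputs = vision_tower(pixel_values, output_hidden_states=True)
    # LLaVA-style models typically take the penultimate layer's patch tokens
    # and drop the CLS token before projecting.
    patch_features = outputs.hidden_states[-2][:, 1:, :]
    return projector(patch_features.to(projector[0].weight.dtype))
```

In this design, the projected image tokens are concatenated with the token embeddings of the text prompt before being fed to the LLM, which is the mechanism LLaVA-Phi adopts from LLaVA.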