---
license: mit
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: visual-question-answering
---

# Model Card for Llava-Phi2

This is a multimodal implementation of the [Phi2](https://huggingface.co/microsoft/phi-2) model, inspired by [LLaVA-Phi](https://github.com/zhuyiche/llava-phi).

## Model Details
1. LLM Backbone: [Phi2](https://huggingface.co/microsoft/phi-2)
2. Vision Tower: [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)
3. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200k samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
4. Finetuning Dataset: [Instruct 150k dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
5. Finetuned Model: [RaviNaik/Llava-Phi2](https://huggingface.co/RaviNaik/Llava-Phi2)
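
How the pieces fit together (LLaVA-style): the vision tower encodes the image into patch features, a small trainable projector maps those features into Phi2's embedding space, and the projected image tokens are fed to the LLM together with the text tokens. The snippet below is only an illustrative sketch of that wiring; the class and variable names are made up and it is not the actual llava-phi code.

```python
# Illustrative LLaVA-style wiring (assumed, simplified): CLIP patch features are
# projected into Phi2's embedding space and prepended to the text embeddings.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel


class ToyLlavaPhi(nn.Module):  # hypothetical name, for illustration only
    def __init__(self,
                 llm_name="microsoft/phi-2",
                 vision_name="openai/clip-vit-large-patch14-336"):
        super().__init__()
        self.vision_tower = CLIPVisionModel.from_pretrained(vision_name)
        # Older transformers versions may need trust_remote_code=True for phi-2.
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Projector: CLIP hidden size (1024) -> Phi2 hidden size (2560).
        self.projector = nn.Sequential(
            nn.Linear(self.vision_tower.config.hidden_size, self.llm.config.hidden_size),
            nn.GELU(),
            nn.Linear(self.llm.config.hidden_size, self.llm.config.hidden_size),
        )

    def forward(self, pixel_values, input_ids):
        # Patch features from the vision tower (drop the CLS token, as in LLaVA).
        image_feats = self.vision_tower(pixel_values=pixel_values).last_hidden_state[:, 1:, :]
        image_embeds = self.projector(image_feats)
        # Prepend the projected image tokens to the text token embeddings.
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```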

### Model Sources

- **Original Repository:** [LLaVA-Phi](https://github.com/zhuyiche/llava-phi)
- **Paper:** [LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model](https://arxiv.org/pdf/2401.02330)
- **Demo:** [MultiModal-Phi2 Space](https://huggingface.co/spaces/RaviNaik/MultiModal-Phi2)

## How to Get Started with the Model

Use the code below to get started with the model.
1. Clone the [llava-phi](https://github.com/zhuyiche/llava-phi) repository and navigate to the llava-phi folder:
```bash
git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi
```
2. Install the package:
```bash
conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip # enable PEP 660 support
pip install -e .
```
3. Run the model:
```bash
python llava_phi/eval/run_llava_phi.py --model-path="RaviNaik/Llava-Phi2" \
    --image-file="https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true" \
    --query="How many people are there in the image?"
```
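
If you prefer to drive the evaluation script from Python rather than the shell, a thin wrapper around the same command works; the sketch below simply re-issues the CLI call shown above (run it from the llava-phi repo root with the llava_phi environment active).

```python
# Minimal wrapper around the run_llava_phi.py CLI shown above (assumes the
# llava_phi conda environment is active and the current directory is the repo root).
import subprocess

result = subprocess.run(
    [
        "python", "llava_phi/eval/run_llava_phi.py",
        "--model-path=RaviNaik/Llava-Phi2",
        "--image-file=https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true",
        "--query=How many people are there in the image?",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # stdout contains whatever the script prints (typically the answer)
```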

### Acknowledgement
This implementation is based on the wonderful work done by: \
[LLaVA-Phi](https://github.com/zhuyiche/llava-phi) \
[LLaVA](https://github.com/haotian-liu/LLaVA) \
[Phi2](https://huggingface.co/microsoft/phi-2)