---
license: apache-2.0
---

# Omni-VideoAssistant
Omni-VideoAssistant is a large language model for video question answering.
[Code base](https://github.com/wanghao-cst/Omni-VideoAssistant).

## πŸ“ Updates
* **[2023.12.09]**  πŸ€—[Hugging Face](https://huggingface.co/harvey2333/omni_video_assistant_6_1): the improved **Model V6.1** is available now! Welcome to **watch** this repository for the latest updates.
* **[2023.12.06]**  Gradio & CLI **inference demos** are available now.
* **[2023.12.01]**  πŸ€—[Hugging Face](https://huggingface.co/harvey2333/omni_video_assistant_5_3) **Preview Model** is available now!

<details open><summary>πŸ’‘ I also have other video-language projects that may interest you ✨. </summary><p>

> [**OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation**](https://arxiv.org/abs/2308.04126) <br>
> Dongyang Yu, Shihao Wang, Yuan Fang, Wangpeng An <br>
[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/shajiayu1/MVCE/) [![arXiv](https://img.shields.io/badge/Arxiv-2308.04126-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2308.04126) <br></p></details>


## πŸ”¨ Preparation
```bash
git clone https://github.com/wanghao-cst/Omni-VideoAssistant
cd Omni-VideoAssistant
```
```shell
conda create -n omni python=3.10 -y
conda activate omni
pip install --upgrade pip
pip install -e .
```

## 🌟 Start here
### Download Omni Preview Model
The checkpoint needs to be downloaded manually only for CLI inference; the Gradio web UI downloads it automatically.
[Omni Preview Model 6.1](https://huggingface.co/harvey2333/omni_video_assistant_6_1)
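
A minimal way to fetch the checkpoint is to clone the model repository with Git LFS (a sketch assuming `git-lfs` is installed; `huggingface-cli download` is an alternative):

```shell
# Clone the V6.1 checkpoint repository (assumes git-lfs is installed)
git lfs install
git clone https://huggingface.co/harvey2333/omni_video_assistant_6_1
# Use the resulting local directory as --model-path in the CLI commands below
```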

### Inference in Gradio Web UI

```shell
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.gradio_demo
```
<p align="left">
<img src="assets/gradio_demo.png" width=100%>
</p>

### Inference in CLI
```shell
CUDA_VISIBLE_DEVICES=0 python -m llava.eval.run_omni \
    --model-path "path to omni checkpoints" \
    --image-file "llava/serve/examples/extreme_ironing.jpg" \
    --query "What is unusual about this image?"
CUDA_VISIBLE_DEVICES=0 python -m llava.eval.run_omni \
    --model-path "path to omni checkpoints" \
    --video-file "llava/serve/examples/0A8CF.mp4" \
    --query "Describe the activity in the video"
```

## πŸ”₯ Results Comparison (based on model 5.3; evaluation of 6.1 is in progress)
### Image understanding
<p align="left">
<img src="assets/val_img.png" width=100%>
</p>

### Video understanding
<p align="left">
<img src="assets/val_vid.png" width=100%>
</p>


## 😊 Acknowledgment

This work builds on [MVCE](https://github.com/shajiayu1/MVCE/) for unlimited training data generation and [LLaVA](https://github.com/haotian-liu/LLaVA/) for the pretrained model.