---
license: apache-2.0
datasets:
- ardamamur/EgoExOR
language:
- en
metrics:
- f1
base_model:
- liuhaotian/llava-v1.5-7b
---

# EgoExOR Scene Graph Foundation Model

This repository hosts the foundation model for **surgical scene graph generation** trained on the [EgoExOR](https://huggingface.co/datasets/ardamamur/EgoExOR) dataset – a multimodal, multi-perspective dataset collected in a simulated operating room (OR) environment.

> **EgoExOR** stands for **Egocentric and Exocentric Operating Room**, integrating data from wearable AR glasses (egocentric) and static cameras (exocentric), enabling holistic modeling of complex surgical interactions.

## 🧠 Model Overview

The EgoExOR model is a dual-branch architecture that separately processes:

- **Egocentric inputs**: Egocentric RGB video, hand tracking, gaze vectors, and audio
- **Exocentric inputs**: Multiview exocentric RGB-D video, point cloud data, and ultrasound imagery

Each branch employs transformer-based fusion before the embedding tokens are passed to a large language model (Vicuna-7B via LLaVA) to **autoregressively generate scene graph triplets**:
**(subject, predicate, object)** – e.g., `(assistant, injecting, patient)`

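Because triplets are emitted as free text, they must be parsed back into structured form for downstream use. The snippet below is only an illustrative sketch under the assumption that the model prints triplets in the `(subject, predicate, object)` style shown above; the exact output format and the official post-processing are defined by the training code in the GitHub repo.

```python
import re

# Illustrative parser for generated triplet text, e.g.
# "(assistant, injecting, patient); (nurse, holding, instrument)".
# The separator and exact formatting here are assumptions, not the official format.
TRIPLET_RE = re.compile(r"\(\s*([^,()]+)\s*,\s*([^,()]+)\s*,\s*([^,()]+)\s*\)")

def parse_triplets(generated_text: str) -> list[tuple[str, str, str]]:
    """Extract (subject, predicate, object) tuples from the model's output text."""
    return [(s.strip(), p.strip(), o.strip())
            for s, p, o in TRIPLET_RE.findall(generated_text)]

print(parse_triplets("(assistant, injecting, patient); (nurse, holding, instrument)"))
# -> [('assistant', 'injecting', 'patient'), ('nurse', 'holding', 'instrument')]
```
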
## 📊 Benchmark Results

This model outperforms prior single-stream baselines like [ORacle](https://arxiv.org/pdf/2404.07031) and [MM2SG](https://arxiv.org/pdf/2503.02579) by effectively leveraging perspective-specific signals.

| Model              | UI F1    | MISS F1  | Overall F1 |
|--------------------|----------|----------|------------|
| ORacle (Baseline)  | 0.72     | 0.64     | 0.67       |
| MM2SG (Baseline)   | 0.79     | 0.66     | 0.72       |
| **EgoExOR (Ours)** | **0.84** | **0.69** | **0.76**   |

*UI and MISS refer to the dataset's two procedures (Ultrasound Injection and MISS); F1 is reported per procedure and overall.*

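The scores above compare predicted and ground-truth triplet sets. As a rough illustration of how such a set-based F1 can be computed for a single frame (the official evaluation script in the GitHub repo may differ in matching rules and aggregation):

```python
# Rough sketch of a set-based triplet F1 for one frame. The official evaluation
# (matching rules, per-procedure aggregation) lives in the GitHub repo and may differ.
def triplet_f1(predicted: set[tuple[str, str, str]],
               ground_truth: set[tuple[str, str, str]]) -> float:
    if not predicted and not ground_truth:
        return 1.0  # nothing to predict, nothing predicted
    tp = len(predicted & ground_truth)  # exact triplet matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

gt = {("assistant", "injecting", "patient"), ("nurse", "holding", "instrument")}
pred = {("assistant", "injecting", "patient")}
print(round(triplet_f1(pred, gt), 2))  # 0.67
```
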
For detailed benchmark results and dataset information, see the [paper](https://arxiv.org/abs/TODO) and [GitHub repo](https://github.com/ardamamur/EgoExOR).

## 🗃️ Dataset
EgoExOR provides:
- 84,553 frames (94 mins)
- 2 surgical procedures (Ultrasound Injection & MISS)
- 36 entities, 22 predicates
- Over 573,000 triplets
- Multimodal signals: RGB, depth, gaze, audio, ultrasound, point cloud, hand tracking

You can find the dataset processing tools in the [GitHub repo](https://github.com/ardamamur/EgoExOR).

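To experiment locally, the raw dataset snapshot can be fetched from the Hub. A minimal sketch using `huggingface_hub`, assuming only the `ardamamur/EgoExOR` dataset repo referenced above (the official loading and processing tools are in the GitHub repo):

```python
# Minimal sketch: download the EgoExOR dataset snapshot from the Hugging Face Hub.
# Only the repo id from this card is assumed; use the GitHub repo's tools for
# the official processing pipeline.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ardamamur/EgoExOR", repo_type="dataset")
print(f"EgoExOR dataset downloaded to: {local_dir}")
```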