---
license: apache-2.0
datasets:
- ardamamur/EgoExOR
language:
- en
metrics:
- f1
base_model:
- liuhaotian/llava-v1.5-7b
---
# EgoExOR Scene Graph Foundation Model
<table>
  <tr>
    <td style="padding: 0;">
      <a href="https://huggingface.co/datasets/ardamamur/EgoExOR">
        <img src="https://img.shields.io/badge/Data-4d5eff?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor" alt="Data">
      </a>
    </td>
    <td style="padding: 0;">
      <a href="https://github.com/ardamamur/EgoExOR">
        <img src="https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white" alt="Code">
      </a>
    </td>
  </tr>
</table>

This repository hosts the foundation model for **surgical scene graph generation** trained on the [EgoExOR](https://huggingface.co/datasets/ardamamur/EgoExOR) dataset – a multimodal, multi-perspective dataset collected in a simulated operating room (OR) environment.

> Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating
> advanced perception models to enhance safety and efficiency. Existing datasets either provide partial egocentric views or sparse exocentric multi-view context,
> but do not explore a comprehensive combination of both. We introduce EgoExOR,
> the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives.
> Spanning 94 minutes (84,553 frames at 15 FPS) of two simulated spine procedures,
> Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery,
>  EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses,
> exocentric RGB and depth from RGB-D cameras, and ultrasound imagery.
> Its detailed scene graph annotations, covering 36 entities and 22 relations (~573,000 triplets), enable robust modeling of clinical interactions,
> supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of
> two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR’s multimodal and multi-perspective signals.
> Our dataset and benchmark set a new foundation for OR perception, offering a rich,
> multimodal resource for next-generation clinical perception. Our code is available at [EgoExOR GitHub](https://github.com/ardamamur/EgoExOR) and the dataset at [EgoExOR Hugging Face Dataset](https://huggingface.co/datasets/ardamamur/EgoExOR).

## 🧠 Model Overview

<p align="center">
  <img src="https://github.com/ardamamur/EgoExOR/blob/main/figures/model_overview.png?raw=true" alt="EgoExOR Overview" width="80%"/>
</p>
<p align="center">
  <em>Figure: Overview of the proposed EgoExOR model for surgical scene graph generation. The model
employs a dual-branch architecture to separately process egocentric and exocentric modalities. Fused
embeddings are passed to a large language model (LLM) to autoregressively generate scene graph
triplets representing entities and their interactions.</em>
</p>

To fully exploit EgoExOR's rich multi-perspective data, we introduce a new baseline model with a dual-branch architecture. The egocentric branch processes first-person RGB, hand pose, and gaze data, while the exocentric branch handles third-person RGB-D, ultrasound recordings, audio, and point clouds. Each branch uses a 2-layer transformer to fuse its inputs into N feature embeddings; these are concatenated and fed into the LLM for triplet prediction. By explicitly separating and then fusing perspective-specific features, the model better captures actions and staff interactions, outperforming single-stream baselines in modeling complex OR dynamics.
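
The sketch below illustrates the dual-branch fusion idea in PyTorch. It is a minimal, hypothetical re-implementation for illustration only: the module names, embedding dimension, number of fused tokens `N`, and the use of learnable query tokens are assumptions, not the released code; the official implementation lives in the GitHub repo.

```python
# Minimal sketch of the dual-branch fusion idea (not the released implementation).
import torch
import torch.nn as nn

class PerspectiveBranch(nn.Module):
    """Fuses the modality tokens of one perspective (ego or exo) with a 2-layer transformer."""
    def __init__(self, dim: int = 768, n_fused_tokens: int = 8, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Learnable query tokens that collect the branch output (the N feature embeddings).
        self.fused_queries = nn.Parameter(torch.randn(1, n_fused_tokens, dim))

    def forward(self, modality_tokens: torch.Tensor) -> torch.Tensor:
        # modality_tokens: (batch, num_modality_tokens, dim), already projected per modality.
        b = modality_tokens.size(0)
        x = torch.cat([self.fused_queries.expand(b, -1, -1), modality_tokens], dim=1)
        x = self.encoder(x)
        return x[:, : self.fused_queries.size(1)]  # keep only the N fused embeddings

class DualBranchFusion(nn.Module):
    """Ego + exo branches; the concatenated output is fed to the LLM as visual context."""
    def __init__(self, dim: int = 768, n_fused_tokens: int = 8):
        super().__init__()
        self.ego_branch = PerspectiveBranch(dim, n_fused_tokens)
        self.exo_branch = PerspectiveBranch(dim, n_fused_tokens)

    def forward(self, ego_tokens: torch.Tensor, exo_tokens: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.ego_branch(ego_tokens), self.exo_branch(exo_tokens)], dim=1)
        return fused  # (batch, 2 * N, dim)

# Example shapes: 3 ego modalities x 16 tokens, 4 exo modalities x 16 tokens.
model = DualBranchFusion()
out = model(torch.randn(2, 48, 768), torch.randn(2, 64, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```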

## 📊 Benchmark Results

This model outperforms prior single-stream baselines like [ORacle](https://arxiv.org/pdf/2404.07031) and [MM2SG](https://arxiv.org/pdf/2503.02579) 
by effectively leveraging perspective-specific signals.

| Model              | UI F1    | MISS F1  | Overall F1 |
|--------------------|----------|----------|------------|
| ORacle (Baseline)  | 0.70     | 0.71     | 0.69       |
| MM2SG (Baseline)   | 0.77     | 0.68     | 0.72       |
| **EgoExOR (Ours)** | **0.86** | **0.70** | **0.79**   |

As shown in the table above, the dual-branch EgoExOR model achieves the highest macro F1 overall. Several predicates in EgoExOR rely on understanding transient tool-hand trajectories and fine-grained action cues, which single-stream models struggle to capture. This underscores the importance of explicitly modeling multiple viewpoints and leveraging all available modalities to improve OR scene understanding.
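
For reference, the snippet below shows one way a predicate-level macro F1 over scene graph triplets could be computed. The exact matching protocol (entity vocabulary, per-procedure splits) is defined by the EgoExOR benchmark code; this helper is an illustrative stand-in, not the official evaluation script.

```python
# Illustrative macro F1 over relation predicates, averaged across the predicate set.
from collections import defaultdict

def macro_f1(pred_triplets, gt_triplets):
    """pred_triplets / gt_triplets: per-frame lists of (subject, predicate, object) tuples."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, gt in zip(pred_triplets, gt_triplets):
        pred_set, gt_set = set(pred), set(gt)
        for _, p, _ in pred_set & gt_set:
            tp[p] += 1
        for _, p, _ in pred_set - gt_set:
            fp[p] += 1
        for _, p, _ in gt_set - pred_set:
            fn[p] += 1
    f1s = []
    for p in set(tp) | set(fp) | set(fn):
        prec = tp[p] / (tp[p] + fp[p]) if (tp[p] + fp[p]) else 0.0
        rec = tp[p] / (tp[p] + fn[p]) if (tp[p] + fn[p]) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

# Toy example: one frame with one correct and one spurious prediction -> macro F1 of 0.5.
print(macro_f1(
    [[("surgeon", "holding", "needle"), ("nurse", "touching", "table")]],
    [[("surgeon", "holding", "needle")]],
))
```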

## 🗃️ Dataset

EgoExOR provides:
- 84,553 frames (94 mins)
- 2 simulated surgical procedures (Ultrasound-Guided Needle Insertion & Minimally Invasive Spine Surgery)
- 36 entities, 22 predicates
- Over 573,000 triplets
- Multimodal signals: RGB, depth, gaze, audio, ultrasound, point cloud, hand tracking

You can find the dataset processing tools in the [EgoExOR GitHub repo](https://github.com/ardamamur/EgoExOR).
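
A minimal way to pull the raw dataset files locally is to mirror the Hugging Face dataset repo, as sketched below. This only assumes the files are hosted in the `ardamamur/EgoExOR` dataset repo; the actual file layout and preprocessing steps are documented in the GitHub repo.

```python
# Hedged sketch: download a local copy of the EgoExOR dataset repo from the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ardamamur/EgoExOR",
    repo_type="dataset",
)
print(f"EgoExOR files downloaded to: {local_dir}")
```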

## 🔗 Links

- 🖥️ Code: [EgoExOR GitHub](https://github.com/ardamamur/EgoExOR)
- 🤗 Dataset: [EgoExOR Hugging Face Dataset](https://huggingface.co/datasets/ardamamur/EgoExOR)
- 🤗 Model Card & Weights: [EgoExOR Hugging Face Model](https://huggingface.co/ardamamur/EgoExOR)