---
license: apache-2.0
datasets:
- ardamamur/EgoExOR
language:
- en
metrics:
- f1
base_model:
- liuhaotian/llava-v1.5-7b
---

# EgoExOR Scene Graph Foundation Model
Figure: Overview of the proposed EgoExOR model for surgical scene graph generation. The model employs a dual-branch architecture to separately process egocentric and exocentric modalities. Fused embeddings are passed to a large language model (LLM) to autoregressively generate scene graph triplets representing entities and their interactions.
**EgoExOR Model.** To fully exploit EgoExOR's rich multi-perspective data, we introduce a new baseline model featuring a dual-branch architecture. The egocentric branch processes first-person RGB, hand pose, and gaze data, while the exocentric branch handles third-person RGB-D, ultrasound recordings, audio, and point clouds. Each branch uses a 2-layer transformer to fuse its inputs into N feature embeddings. These are concatenated and fed into the LLM for triplet prediction (see the architecture sketch at the end of this card). By explicitly separating and fusing perspective-specific features, our model better captures actions and staff interactions, outperforming single-stream baselines in modeling complex OR dynamics.

## 📊 Benchmark Results

This model outperforms prior single-stream baselines such as [ORacle](https://arxiv.org/pdf/2404.07031) and [MM2SG](https://arxiv.org/pdf/2503.02579) by effectively leveraging perspective-specific signals.

| Model              | UI F1    | MISS F1  | Overall F1 |
|--------------------|----------|----------|------------|
| ORacle (Baseline)  | 0.70     | 0.71     | 0.69       |
| MM2SG (Baseline)   | 0.77     | 0.68     | 0.72       |
| **EgoExOR (Ours)** | **0.86** | **0.70** | **0.79**   |

Overall, as shown in the table above, the dual-branch EgoExOR model achieves the highest macro F1. Several predicates in EgoExOR rely on understanding transient tool-hand trajectories and fine-grained action cues, which underscores the importance of explicitly modeling multiple viewpoints and leveraging all available modalities to improve OR scene understanding.

## 🗃️ Dataset

EgoExOR provides:
- 84,553 frames (94 minutes)
- 2 surgical procedures (Ultrasound Injection & MISS)
- 36 entities, 22 predicates
- Over 573,000 triplets
- Multimodal signals: RGB, depth, gaze, audio, ultrasound, point cloud, hand tracking

Dataset processing tools are available in the [GitHub repo](https://github.com/ardamamur/EgoExOR); a minimal download sketch is given at the end of this card.

## 🔗 Links

- 🖥️ Code: [EgoExOR GitHub](https://github.com/ardamamur/EgoExOR)
- 🤗 Dataset: [EgoExOR Hugging Face Dataset](https://huggingface.co/datasets/ardamamur/EgoExOR)
- 🤗 Model Card & Weights: [EgoExOR Hugging Face Model](https://huggingface.co/ardamamur/EgoExOR)
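## 🧠 Architecture Sketch

The PyTorch snippet below is an illustrative sketch of the dual-branch fusion described above: each perspective branch fuses its modality tokens with a 2-layer transformer into N feature embeddings, and the two branches are concatenated before being passed to the LLM. Class names, token counts, and dimensions are assumptions for illustration only; refer to the GitHub repo for the released implementation.

```python
# Hypothetical sketch of the dual-branch fusion; names and sizes are illustrative,
# not the released EgoExOR implementation.
import torch
import torch.nn as nn

class PerspectiveBranch(nn.Module):
    """Fuses the modality tokens of one perspective with a 2-layer transformer."""
    def __init__(self, dim=1024, num_tokens=32, num_heads=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Learnable queries that become the N fused feature embeddings.
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))

    def forward(self, modality_tokens):  # (B, T, dim): concatenated per-modality features
        batch = modality_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        fused = self.encoder(torch.cat([queries, modality_tokens], dim=1))
        return fused[:, : self.queries.size(0)]  # keep only the N query outputs

class EgoExORFusion(nn.Module):
    """Ego + exo branches -> concatenated embeddings handed to the LLM."""
    def __init__(self, dim=1024, num_tokens=32):
        super().__init__()
        self.ego_branch = PerspectiveBranch(dim, num_tokens)
        self.exo_branch = PerspectiveBranch(dim, num_tokens)

    def forward(self, ego_tokens, exo_tokens):
        ego = self.ego_branch(ego_tokens)    # first-person RGB, hand pose, gaze
        exo = self.exo_branch(exo_tokens)    # RGB-D, ultrasound, audio, point cloud
        return torch.cat([ego, exo], dim=1)  # prepended to the LLM input for triplet prediction
```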
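## ⬇️ Getting the Weights and Data

A minimal sketch for fetching the released assets from the Hugging Face Hub with `huggingface_hub`. The file layout inside the repositories is not described here, so inspect the downloaded folders and see the GitHub repo for the dataset processing tools.

```python
# Minimal download sketch; assumes the huggingface_hub package is installed.
from huggingface_hub import snapshot_download

weights_dir = snapshot_download(repo_id="ardamamur/EgoExOR")                       # model weights
dataset_dir = snapshot_download(repo_id="ardamamur/EgoExOR", repo_type="dataset")  # multimodal data
print(weights_dir, dataset_dir)
```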