## Key Features
- We show that scene graphs (SGs) can encode surgical scenes in a human-readable format (see the illustrative sketch below).
- We propose a novel pre-training step that encodes global and local information from (image, mask, SG) triplets. The learned embeddings are employed to condition graph-to-image diffusion for high-quality and precisely controllable surgical simulation.
- We evaluate our generative approach on scenes from cataract surgeries using quantitative fidelity and diversity measurements, followed by an extensive user study involving clinical experts.
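The snippet below is purely an illustration of such a human-readable scene graph, not the repository's actual data schema; the class names, positions, and relation labels are hypothetical placeholders.

```python
# Hypothetical, human-readable scene-graph entry (NOT the repository's schema):
# nodes carry a class label and a normalised spatial position, edges encode
# pairwise spatial relations between scene components.
example_scene_graph = {
    "nodes": [
        {"id": 0, "class": "pupil",         "position": [0.52, 0.48]},
        {"id": 1, "class": "surgical_tool", "position": [0.70, 0.35]},
    ],
    "edges": [
        {"source": 1, "target": 0, "relation": "above"},
    ],
}
```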
## Setup
git clone https://github.com/MECLabTUDA/SurGrID.git
cd SurGrID
conda create -n surgrid python=3.8.5 pip=20.3.3
conda activate surgrid
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
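To quickly verify the environment, the following short check (not part of the repository) prints the installed versions and confirms that the CUDA 11.8 build of PyTorch can see a GPU:

```python
# Quick sanity check for the setup above.
import torch
import torchvision

print("torch:", torch.__version__)              # expected: 2.0.1+cu118
print("torchvision:", torchvision.__version__)  # expected: 0.15.2+cu118
print("CUDA available:", torch.cuda.is_available())
```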
## Model Checkpoints and Dataset
Download the checkpoints of all the necessary models from the provided sources and place them in [results](./results). We also provide the processed CADIS dataset, containing images, segmentation masks, and their scene graphs. Update the paths of the dataset in [configs](./configs).

- Checkpoints: VQGANs, Graph Encoders, Diffusion Model
- Processed Dataset: CADIS
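If helpful, a small helper along the following lines can flag dataset paths in the configs that do not exist on your machine; only the `configs` directory name is taken from this repository, and PyYAML is assumed to be available. Everything else is an illustrative sketch.

```python
# Illustrative helper: scan the YAML configs for path-like string values and
# report any that do not exist on this machine. URLs and other strings that
# contain "/" may be flagged too; treat the output as hints only.
import glob
import os
import yaml

def find_missing_paths(config_dir="configs"):
    for cfg_file in glob.glob(os.path.join(config_dir, "**", "*.yaml"), recursive=True):
        with open(cfg_file) as f:
            cfg = yaml.safe_load(f)
        stack = [cfg]
        while stack:
            node = stack.pop()
            if isinstance(node, dict):
                stack.extend(node.values())
            elif isinstance(node, list):
                stack.extend(node)
            elif isinstance(node, str) and "/" in node and not os.path.exists(node):
                print(f"{cfg_file}: possibly missing path -> {node}")

if __name__ == "__main__":
    find_missing_paths()
```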
## Sampling SurGrID
python script/sampler_diffusion.py --conf configs/eval/eval_combined_emb.yaml
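To get a quick visual overview of the generated samples, a snippet like the one below can tile them into a single grid image. The `results/samples` directory is a hypothetical output location; point `sample_dir` at wherever the sampler actually wrote its images.

```python
# Illustrative only: tile generated PNG samples into one overview grid.
import glob
import torch
from torchvision.io import read_image
from torchvision.utils import make_grid, save_image

sample_dir = "results/samples"  # hypothetical output location
files = sorted(glob.glob(f"{sample_dir}/*.png"))[:16]
imgs = torch.stack([read_image(f).float() / 255.0 for f in files])
save_image(make_grid(imgs, nrow=4), "samples_grid.png")
```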
## Training SurGrID
Step 1: Train Separate VQGANs for Image and Segmentation
python surgrid/taming/main.py --base configs/vqgan/vqgan_image_cadis.yaml -t --gpus 0,
python surgrid/taming/main.py --base configs/vqgan/vqgan_segmentation_cadis.yaml -t --gpus 0,
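For intuition only: the VQGANs compress images and segmentation masks into discrete latents by snapping each encoder output to its nearest codebook entry. Below is a minimal sketch of that quantisation step; names and sizes are placeholders, not the repository's implementation.

```python
# Conceptual vector-quantisation step as used in a VQGAN bottleneck.
import torch

def quantize(latents, codebook):
    """latents: (N, D) encoder outputs, codebook: (K, D) learned codes."""
    dists = torch.cdist(latents, codebook)  # (N, K) pairwise L2 distances
    indices = dists.argmin(dim=1)           # nearest code per latent vector
    return codebook[indices], indices

codes, idx = quantize(torch.randn(16, 64), torch.randn(512, 64))
```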
Step 2: Train Both Graph Encoders
python script/trainer_graph.py --mode masked --conf configs/graph/graph_cadis.yaml
python script/trainer_graph.py --mode segclip --conf configs/graph/graph_cadis.yaml
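Conceptually, this pre-training aligns graph embeddings with embeddings of the corresponding segmentation masks so that the graph encoder captures both global scene layout and local component information; the exact objectives live in `script/trainer_graph.py` and the graph config. The following is a minimal, self-contained sketch of such a symmetric contrastive alignment, where all names, dimensions, and the temperature are placeholders rather than the repository's API.

```python
# Conceptual CLIP-style alignment between graph and segmentation embeddings.
import torch
import torch.nn.functional as F

def alignment_loss(graph_emb, seg_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matching (graph, segmentation) pairs."""
    g = F.normalize(graph_emb, dim=-1)   # (B, D)
    s = F.normalize(seg_emb, dim=-1)     # (B, D)
    logits = g @ s.t() / temperature     # (B, B) cosine-similarity matrix
    targets = torch.arange(g.size(0), device=g.device)
    # Each graph should match its own mask, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```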
Step 3: Train Diffusion Model
python script/trainer_diffusion.py --conf configs/trainer/combined_emb.yaml
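At a high level, the diffusion model learns to denoise VQGAN latents while being conditioned on the pre-trained graph embeddings. The schematic training step below only illustrates such a conditioned noise-prediction objective; `denoiser`, the `cond` keyword, and the linear noise schedule are placeholders, not the repository's implementation.

```python
# Schematic noise-prediction step for a denoiser conditioned on a graph embedding.
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, latents, graph_emb, num_timesteps=1000):
    t = torch.randint(0, num_timesteps, (latents.size(0),), device=latents.device)
    noise = torch.randn_like(latents)
    # A simple linear stand-in for the cumulative noise schedule (illustrative only).
    alpha_bar = (1.0 - t.float() / num_timesteps).view(-1, 1, 1, 1)
    noisy = alpha_bar.sqrt() * latents + (1.0 - alpha_bar).sqrt() * noise
    pred = denoiser(noisy, t, cond=graph_emb)  # conditioning on the SG embedding
    return F.mse_loss(pred, noise)
```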
## Training SurGrID on a New Dataset
The files below need to be adapted:
## Clinical Expert Assessment
python script/demo_surgrid.py --conf configs/trainer/combined_emb.yaml
Our demo GUI allows loading ground-truth graphs along with the corresponding ground-truth image. The graph's nodes can be moved, deleted, or have their class changed. We instruct our participants to load four different ground-truth graphs and sequentially perform the following actions on each. They are requested to score the samples' realism and coherence with the graph input on a Likert scale of 1 to 7; the per-participant results are summarised as mean ± standard deviation in the table below, followed by a small aggregation sketch:
- First, participants are instructed to generate a batch of four samples from the ground-truth SG without modifications.
- Second, the participants are requested to spatially move nodes in the canvas and again judge the synthesised samples.
- Third, participants change the class of one of the instrument nodes and judge the generated images.
- Lastly, participants are instructed to remove one of the instruments or miscellaneous classes and judge the synthesised image a final time.
| Clinician | Synthesisation from GT (Realism) | Synthesisation from GT (Coherence) | Spatial Modification (Realism) | Spatial Modification (Coherence) | Tool Modification (Realism) | Tool Modification (Coherence) | Tool Removal (Realism) | Tool Removal (Coherence) |
|---|---|---|---|---|---|---|---|---|
| P1 | 6.5±0.5 | 6.5±1.0 | 6.3±0.9 | 6.3±0.9 | 5.3±1.2 | 4.5±1.9 | 6.3±0.9 | 5.5±2.3 |
| P2 | 5.3±0.9 | 5.3±0.5 | 4.5±0.5 | 4.3±2.0 | 5.3±0.9 | 5.8±0.9 | 5.5±1.2 | 5.5±1.9 |
| P3 | 6.3±0.9 | 6.3±0.9 | 6.5±1.0 | 5.5±0.5 | 6.0±0.8 | 6.8±0.5 | 6.3±0.5 | 6.5±0.5 |
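As a small illustration of how such Likert ratings are turned into the mean ± standard deviation entries above (the ratings in this snippet are made-up placeholders, not the actual study responses):

```python
# Aggregate Likert ratings (1-7) into "mean ± std" strings as in the table above.
# The example ratings are placeholders, not actual study data.
import statistics

def summarize(ratings):
    mean = statistics.mean(ratings)
    std = statistics.pstdev(ratings)  # population std; a sample std is equally plausible
    return f"{mean:.1f} ± {std:.1f}"

example_ratings = [6, 7, 5, 6]  # hypothetical realism scores for one participant/task
print(summarize(example_ratings))  # -> 6.0 ± 0.7
```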
## Citations
If you use SurGrID in your work, please cite the following paper:
@article{frisch2025surgrid,
title={SurGrID: Controllable Surgical Simulation via Scene Graph to Image Diffusion},
author={Frisch, Yannik and Sivakumar, Ssharvien Kumar and K{\"o}ksal, {\c{C}}a{\u{g}}han and B{\"o}hm, Elsa and Wagner, Felix and Gericke, Adrian and Ghazaei, Ghazal and Mukhopadhyay, Anirban},
journal={arXiv preprint arXiv:2502.07945},
year={2025}
}
## Acknowledgement
Thanks to the following projects and theoretical works, which we have either used directly or drawn inspiration from: