# Synthetic Visual Genome

This repository contains the **ROBIN-3B model** based on [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B), introduced in the paper [Synthetic Visual Genome](https://arxiv.org/abs/2506.07643). It is designed for **scene graph understanding** and **dense visual relationship** reasoning.

## 🤖 Checkpoints

- **Robin-3b Stage 2 [this repo]**: [🤗 hf-model](https://huggingface.co/jamepark3922/robin-qwen2.5-3b)
- Robin-3b Stage 1: TBD
- Robin-3b Stage 0: TBD

## 🚀 Quick Start: Scene Graph Generation with SAM

Generate a **scene graph** for each image using **Segment Anything (SAM)** masks and optional **GroundingDINO** object regions.

1. Install [Segment Anything](https://github.com/facebookresearch/segment-anything):

```
pip install git+https://github.com/facebookresearch/segment-anything.git
```

2. Download all the checkpoints:
   - [ViT-H SAM model](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth)
   - [Robin-3b](https://huggingface.co/jamepark3922/robin-qwen2.5-3b)
     - Run `git clone https://huggingface.co/jamepark3922/robin-qwen2.5-3b`
   - [CLIP-convnext](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/blob/main/open_clip_pytorch_model.bin)

The default layout of the checkpoints:

```
├── demo
├── checkpoints
│   ├── robin-qwen2.5-3b-sg-stage2
│   └── sam_vit_h_4b8939.pth
└── open_clip_pytorch_model.bin
```

Note: you may need to set `"mm_vision_tower"` in the Robin-3b model's `config.json` to the absolute path of `open_clip_pytorch_model.bin`.

#### Scene Graph Generation for a Single Image 🖼️

Refer to the [SyntheticVG repo](https://github.com/jamepark3922/SyntheticVG) for the full code to generate a scene graph for a single image using SAM and Robin-3b.

```python
import json

import cv2
import numpy as np
import requests
import torch
from PIL import Image
from segment_anything import sam_model_registry

from svg.pipeline.region_proposal.region_generator import SamGroundingDinoRegionGenerator
from svg.pipeline.grounding.grounding_dino import GroundingDinoSAM
from svg.pipeline.captioning.gpt4o import GPT4Captioner
from svg.pipeline.robin import RobinPipeline
from svg.draw_utils import visualize_masks

image = Image.open(requests.get('http://farm4.staticflickr.com/3377/3573516590_a1f6cf2cbd_z.jpg', stream=True).raw)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the SAM mask generator.
sam_ckpt = 'checkpoints/sam_vit_h_4b8939.pth'
sam_model = sam_model_registry["vit_h"](checkpoint=sam_ckpt).to(device)

# Optional: GroundingDINO + GPT-4o captioner for additional region grounding.
print('Loading GroundingDino model...')
grounding_model = GroundingDinoSAM(
    "IDEA-Research/grounding-dino-base",
    sam_model,
    device
)
captioner = GPT4Captioner()

region_generator = SamGroundingDinoRegionGenerator(
    sam_model=sam_model,
    grounding_model=grounding_model,  # None if not using
    captioner=captioner
)
regions: list[dict] = region_generator.generate_regions(image, region_mode='merged')

# Generate a scene graph from the proposed regions.
robin_path = 'checkpoints/robin-qwen2.5-3b-sg-stage2'
model = RobinPipeline(robin_path, device=device)
sg, _ = model.generate_scene_graph(image, regions)
objects: list[str] = sg['objects']
relations: list[tuple[int, int, str]] = sg['relations']

# Visualize the scene graph.
image_rgb = np.array(image)
image_with_masks: np.ndarray = visualize_masks(
    image_rgb,
    regions,
    draw_bbox=True, draw_mask=True, draw_polygon=False, white_padding=50
)
cv2.imwrite('scene_graph.jpg', cv2.cvtColor(image_with_masks, cv2.COLOR_RGB2BGR))  # OpenCV expects BGR

with open('scene_graph.json', 'w') as f:
    json.dump(sg, f, indent=4)
```
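The generated graph can also be inspected directly from `objects` and `relations`. Below is a minimal sketch, assuming the `(subject_index, object_index, predicate)` tuple layout typed in the snippet above, where the two indices point into `objects`; the `format_triplets` helper is hypothetical, not part of the `svg` package:

```python
# Hypothetical helper: render relations as readable triplets.
# Assumes `sg` from the example above, with relations given as
# (subject_index, object_index, predicate) tuples.
def format_triplets(objects: list[str], relations: list[tuple[int, int, str]]) -> list[str]:
    return [
        f'{objects[subj]} --{predicate}--> {objects[obj]}'
        for subj, obj, predicate in relations
    ]

for triplet in format_triplets(sg['objects'], sg['relations']):
    print(triplet)
```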
You can also run `predict.py` to generate a scene graph for a single image:

```
python predict.py --image_path path/to/image.jpg
```

## BibTeX 🖊️

If you find this work useful, please consider citing:

```
@misc{park2025syntheticvisualgenome,
      title={Synthetic Visual Genome},
      author={Jae Sung Park and Zixian Ma and Linjie Li and Chenhao Zheng and Cheng-Yu Hsieh and Ximing Lu and Khyathi Chandu and Quan Kong and Norimasa Kobori and Ali Farhadi and Yejin Choi and Ranjay Krishna},
      year={2025},
      eprint={2506.07643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.07643},
}
```