---
license: mit
language: en
library_name: pytorch
tags:
- multimodal
- image-retrieval
- contrastive-learning
- floorplan-retrieval
- architecture
- computer-vision
- natural-language-processing
pipeline_tag: feature-extraction
model-index:
- name: CLIP-MLP-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.393
      name: Precision@3
    - type: UPR
      value: 0.607
      name: Unique Preference Rate
- name: BERT-ResNet-CA-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.226
      name: Precision@3
    - type: UPR
      value: 0.179
      name: Unique Preference Rate
---

# Floorplan Retrieval with Design Intent Models

This repository contains two models trained for the research paper: **"Unlocking Floorplan Retrieval with Design Intent via Contrastive Multimodal Learning"**.

These models are designed to retrieve architectural floorplans from a database based on a source image and a natural language instruction describing a desired change. This enables a more intuitive and goal-driven search for architects and designers.

## Model Details

Two architectures were trained for this task using a triplet contrastive learning framework. The goal is to learn a shared embedding space where a query (source image + text instruction) is closer to a positive target image (one that satisfies the instruction) than to a negative image.
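
In other words, training reduces to a triplet margin objective over anchor, positive, and negative embeddings. The snippet below is a minimal, self-contained sketch of that setup, with random tensors standing in for real model outputs; the batch size, embedding size, and margin here are illustrative only.

```python
import torch
import torch.nn as nn

# Illustrative triplet setup: the anchor is the fused (source image + instruction)
# query, the positive is a floorplan that satisfies the instruction, and the
# negative is any other floorplan. Random tensors stand in for real embeddings.
embed_dim = 256
anchor = torch.randn(8, embed_dim)    # fused query embeddings (batch of 8)
positive = torch.randn(8, embed_dim)  # embeddings of matching target floorplans
negative = torch.randn(8, embed_dim)  # embeddings of non-matching floorplans

triplet_loss = nn.TripletMarginLoss(margin=1.0)
loss = triplet_loss(anchor, positive, negative)  # pulls the anchor toward the positive
print(loss.item())
```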

### 1. `CLIP-MLP-Floorplan-Retriever` (Recommended)

This model uses the pre-trained multimodal embeddings from CLIP (ViT-B/32). The image and text embeddings are concatenated and passed through a simple MLP for fusion (sketched below the list). This model demonstrated superior performance in both quantitative metrics and user studies.

- **Image Encoder**: CLIP Vision Transformer (ViT-B/32)
- **Text Encoder**: CLIP Text Transformer
- **Fusion**: Concatenation + Multi-Layer Perceptron (MLP)
- **Training Loss**: `TripletMarginWithDistanceLoss` with cosine similarity (margin=0.2)
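
As a rough illustration of this design, the sketch below shows a concatenation-plus-MLP fusion head together with the cosine-distance triplet loss listed above. The hidden and output dimensions are assumptions for illustration and may differ from the released checkpoint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipMlpFusion(nn.Module):
    """Illustrative fusion head: concatenate CLIP ViT-B/32 image and text
    embeddings (512-d each) and project them with a small MLP. Layer sizes
    are assumptions, not necessarily those of the released model."""

    def __init__(self, clip_dim=512, hidden_dim=512, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, image_emb, text_emb):
        fused = torch.cat([image_emb, text_emb], dim=-1)  # (B, 1024)
        return self.mlp(fused)                            # (B, out_dim)

fusion = ClipMlpFusion()
query = fusion(torch.randn(4, 512), torch.randn(4, 512))  # (4, 256) fused query embeddings

# Triplet loss with distance defined as 1 - cosine similarity, margin 0.2.
loss_fn = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b),
    margin=0.2,
)
```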

### 2. `BERT-ResNet-CA-Floorplan-Retriever`

This model uses separate pre-trained encoders for image and text. A cross-attention module fuses the features, allowing the image representation to attend to linguistic cues from the instruction (a sketch follows the list below).

- **Image Encoder**: ResNet50
- **Text Encoder**: BERT (base-uncased)
- **Fusion**: Cross-Attention Module
- **Training Loss**: `TripletMarginLoss` with L2 (Euclidean) distance (margin=1.0)
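
The sketch below illustrates a cross-attention fusion of this kind, with the pooled image features acting as the query over the BERT token states. The projection size and number of heads are assumptions and may differ from the released checkpoint.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion: pooled ResNet50 image features (2048-d)
    attend over BERT token embeddings (768-d). Dimensions and head count are
    assumptions, not necessarily those of the released model."""

    def __init__(self, image_dim=2048, text_dim=768, fused_dim=512, num_heads=8):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)

    def forward(self, image_feat, text_tokens):
        # image_feat: (B, 2048) pooled CNN features  -> query
        # text_tokens: (B, T, 768) BERT token states -> keys and values
        q = self.image_proj(image_feat).unsqueeze(1)   # (B, 1, fused_dim)
        kv = self.text_proj(text_tokens)               # (B, T, fused_dim)
        fused, _ = self.cross_attn(q, kv, kv)          # image attends to the text
        return fused.squeeze(1)                        # (B, fused_dim)

fusion = CrossAttentionFusion()
query = fusion(torch.randn(4, 2048), torch.randn(4, 16, 768))  # (4, 512) fused queries

# Triplet loss for this variant: Euclidean (L2) distance with margin 1.0.
loss_fn = nn.TripletMarginLoss(margin=1.0, p=2)
```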

## How to Use

You can use these models to get a fused embedding for a (floorplan, instruction) pair, then compare this embedding (e.g., using cosine similarity) against a pre-computed database of floorplan embeddings to find the best match.
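
The ranking step itself reduces to a cosine-similarity search over that database; the sketch below illustrates it with random tensors standing in for real embeddings (sizes are placeholders).

```python
import torch
import torch.nn.functional as F

# Placeholders: in practice, query_emb comes from the fusion model for a
# (source floorplan, instruction) pair, and database_embs is pre-computed
# over the candidate floorplan collection.
query_emb = torch.randn(1, 256)         # fused query embedding
database_embs = torch.randn(1000, 256)  # one embedding per candidate floorplan

scores = F.cosine_similarity(query_emb, database_embs)  # (1000,) similarity scores
top_scores, top_indices = scores.topk(k=3)              # indices of the top-3 matches
print(top_indices.tolist())
```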

First, install the necessary libraries:

```bash
pip install torch transformers Pillow