JoaoMigSilva committed on
Commit e5e0cc8 · verified · 1 Parent(s): 51fdb52

Create README.md

Files changed (1)
  1. README.md +80 -0
README.md ADDED
@@ -0,0 +1,80 @@
---
license: mit
language: en
library_name: pytorch
tags:
- multimodal
- image-retrieval
- contrastive-learning
- floorplan-retrieval
- architecture
- computer-vision
- natural-language-processing
pipeline_tag: feature-extraction
model-index:
- name: CLIP-MLP-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.393
      name: Precision@3
    - type: UPR
      value: 0.607
      name: Unique Preference Rate
- name: BERT-ResNet-CA-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.226
      name: Precision@3
    - type: UPR
      value: 0.179
      name: Unique Preference Rate
---

# Floorplan Retrieval with Design Intent Models

This repository contains two models trained for the research paper **"Unlocking Floorplan Retrieval with Design Intent via Contrastive Multimodal Learning"**.

The models retrieve architectural floorplans from a database given a source floorplan image and a natural language instruction describing a desired change, enabling a more intuitive, goal-driven search for architects and designers.

## Model Details

Two architectures were trained for this task using a triplet contrastive learning framework. The goal is to learn a shared embedding space in which a query (source image + text instruction) lies closer to a positive target image (one that satisfies the instruction) than to a negative image.

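As a concrete illustration of the training objective, the sketch below sets up a cosine-distance triplet loss in PyTorch, mirroring the loss listed for the CLIP-MLP model. The embedding tensors are random placeholders, not outputs of the released models.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings; in practice these come from the fusion models described below.
query_emb = torch.randn(8, 512)     # fused (source image + instruction) queries
positive_emb = torch.randn(8, 512)  # target floorplans that satisfy the instruction
negative_emb = torch.randn(8, 512)  # floorplans that do not


# Cosine distance = 1 - cosine similarity, for use with TripletMarginWithDistanceLoss.
def cosine_distance(a, b):
    return 1.0 - F.cosine_similarity(a, b)


triplet_loss = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=cosine_distance, margin=0.2
)
loss = triplet_loss(query_emb, positive_emb, negative_emb)
```
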
### 1. `CLIP-MLP-Floorplan-Retriever` (Recommended)

This model uses the pre-trained multimodal embeddings from CLIP (ViT-B/32). The image and text embeddings are concatenated and passed through a simple MLP for fusion (a minimal sketch follows the list below). It demonstrated superior performance on both the quantitative metrics and the user studies.

- **Image Encoder**: CLIP Vision Transformer (ViT-B/32)
- **Text Encoder**: CLIP Text Transformer
- **Fusion**: Concatenation + Multi-Layer Perceptron (MLP)
- **Training Loss**: `TripletMarginWithDistanceLoss` with cosine similarity (margin = 0.2)

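A minimal sketch of this fusion scheme, assuming the standard `openai/clip-vit-base-patch32` checkpoint from Hugging Face `transformers` and a hypothetical two-layer MLP head (layer sizes are illustrative, not the released weights):

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical fusion head: concatenated CLIP image + text embeddings -> joint space.
fusion_mlp = nn.Sequential(
    nn.Linear(512 + 512, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
)

image = Image.new("RGB", (224, 224), "white")  # stand-in for a real source floorplan
inputs = processor(
    text=["add a second bathroom next to the master bedroom"],
    images=image,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    # Concatenate the two 512-d embeddings and fuse them into a single query vector.
    query_emb = fusion_mlp(torch.cat([img_emb, txt_emb], dim=-1))
```
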
### 2. `BERT-ResNet-CA-Floorplan-Retriever`

This model uses separate pre-trained encoders for image and text. A cross-attention module fuses their features, allowing the image representation to attend to linguistic cues from the instruction (a rough sketch follows the list below).

- **Image Encoder**: ResNet50
- **Text Encoder**: BERT (base-uncased)
- **Fusion**: Cross-Attention Module
- **Training Loss**: `TripletMarginLoss` with L2 (Euclidean) distance (margin = 1.0)

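A rough sketch of this kind of cross-attention fusion, using `nn.MultiheadAttention` with ResNet50 region features as queries and BERT token features as keys/values. The projection width, head count, and mean pooling are illustrative assumptions, not the paper's exact module:

```python
import torch
import torch.nn as nn
from torchvision.models import ResNet50_Weights, resnet50
from transformers import BertModel, BertTokenizer

# Pre-trained encoders: keep ResNet50's spatial feature map (drop avgpool + fc).
resnet = nn.Sequential(*list(resnet50(weights=ResNet50_Weights.DEFAULT).children())[:-2])
bert = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Project image regions to BERT's width, then let them attend to instruction tokens.
img_proj = nn.Linear(2048, 768)
cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed floorplan image
tokens = tokenizer(["move the kitchen next to the living room"], return_tensors="pt")

with torch.no_grad():
    feat_map = resnet(image)                                  # (1, 2048, 7, 7)
    img_seq = img_proj(feat_map.flatten(2).transpose(1, 2))   # (1, 49, 768)
    txt_seq = bert(**tokens).last_hidden_state                # (1, seq_len, 768)
    # Image regions (queries) attend to instruction tokens (keys/values).
    fused, _ = cross_attn(img_seq, txt_seq, txt_seq)
    query_emb = fused.mean(dim=1)                             # pooled joint embedding
```
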
74
+ ## How to Use
75
+
76
+ You can use these models to get a fused embedding for a (floorplan, instruction) pair. You can then compare this embedding (e.g., using cosine similarity) against a pre-computed database of floorplan embeddings to find the best match.
77
+
78
+ First, install the necessary libraries:
79
+ ```bash
80
+ pip install torch transformers Pillow
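
After installation, retrieval reduces to ranking pre-computed floorplan embeddings against the fused query embedding. The sketch below uses random placeholder embeddings and an arbitrary 256-dimensional space; substitute the outputs of whichever model above you load:

```python
import torch
import torch.nn.functional as F

# Placeholder database: N pre-computed floorplan embeddings of shape (N, D),
# produced offline with the chosen model's image branch.
database_embeddings = torch.randn(1000, 256)
database_ids = [f"floorplan_{i:04d}" for i in range(1000)]

# Placeholder fused query embedding (source floorplan + instruction), shape (1, D).
query_embedding = torch.randn(1, 256)

# Rank the database by cosine similarity and keep the top 3 matches
# (Precision@3 is the metric reported in the model index above).
scores = F.cosine_similarity(query_embedding, database_embeddings, dim=-1)
top_scores, top_idx = scores.topk(k=3)
for rank, (score, idx) in enumerate(zip(top_scores.tolist(), top_idx.tolist()), start=1):
    print(f"{rank}. {database_ids[idx]} (cosine similarity: {score:.3f})")
```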