---
license: mit
language: en
library_name: pytorch
tags:
- multimodal
- image-retrieval
- contrastive-learning
- floorplan-retrieval
- architecture
- computer-vision
- natural-language-processing
pipeline_tag: feature-extraction
model-index:
- name: CLIP-MLP-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.393
      name: Precision@3
    - type: UPR
      value: 0.607
      name: Unique Preference Rate
- name: BERT-ResNet-CA-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.226
      name: Precision@3
    - type: UPR
      value: 0.179
      name: Unique Preference Rate
---
# Floorplan Retrieval with Design Intent Models
This repository contains two models trained for the research paper: **"Unlocking Floorplan Retrieval with Design Intent via Contrastive Multimodal Learning"**.
These models are designed to retrieve architectural floorplans from a database based on a source image and a natural language instruction describing a desired change. This enables a more intuitive and goal-driven search for architects and designers.
## Model Details
Two architectures were trained for this task using a triplet contrastive learning framework. The goal is to learn a shared embedding space where a query (source image + text instruction) is closer to a positive target image (that satisfies the instruction) than to a negative image.
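For illustration, here is a minimal sketch of how such a triplet objective can be set up in PyTorch, assuming the cosine-based distance of the CLIP-MLP variant below (interpreted as 1 − cosine similarity) and random tensors standing in for real fused query and target embeddings:
```python
import torch
import torch.nn.functional as F

# Cosine-distance triplet loss (margin matches the CLIP-MLP variant described below).
cosine_distance = lambda a, b: 1.0 - F.cosine_similarity(a, b)
triplet_loss = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=cosine_distance, margin=0.2
)

# Placeholder embeddings for one batch: the fused (source image + instruction) query
# is the anchor; the target floorplan that satisfies the instruction is the positive;
# any other floorplan is the negative.
anchor = torch.randn(8, 512)    # fused query embeddings
positive = torch.randn(8, 512)  # embeddings of matching target floorplans
negative = torch.randn(8, 512)  # embeddings of non-matching floorplans

loss = triplet_loss(anchor, positive, negative)
print(loss.item())
```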
### 1. `CLIP-MLP-Floorplan-Retriever` (Recommended)
This model uses the pre-trained multimodal embeddings from CLIP (ViT-B/32). The image and text embeddings are concatenated and passed through a simple MLP for fusion. This model demonstrated superior performance in both quantitative metrics and user studies.
- **Image Encoder**: CLIP Vision Transformer (ViT-B/32)
- **Text Encoder**: CLIP Text Transformer
- **Fusion**: Concatenation + Multi-Layer Perceptron (MLP)
- **Training Loss**: `TripletMarginWithDistanceLoss` with a cosine-similarity-based distance (margin=0.2)
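A minimal sketch of this fusion design is shown below, assuming the `openai/clip-vit-base-patch32` checkpoint from the Hugging Face Hub; the MLP widths, output dimension, example image path, and instruction text are illustrative placeholders rather than the released configuration or weights:
```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

class ClipMlpRetriever(nn.Module):
    """Fuses CLIP image and text embeddings with a small MLP (illustrative sizes)."""
    def __init__(self, embed_dim=512, hidden_dim=512, out_dim=256):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)      # (B, 512)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)   # (B, 512)
        fused = torch.cat([img, txt], dim=-1)                              # (B, 1024)
        return self.mlp(fused)                                             # fused query embedding

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = ClipMlpRetriever().eval()

# Placeholder inputs: a source floorplan image and an instruction describing the desired change.
image = Image.open("source_floorplan.png").convert("RGB")
inputs = processor(text=["add a second bathroom next to the master bedroom"],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    query_embedding = model(inputs["pixel_values"], inputs["input_ids"],
                            inputs["attention_mask"])
```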
### 2. `BERT-ResNet-CA-Floorplan-Retriever`
This model uses separate pre-trained encoders for image and text. A cross-attention module is used to fuse the features, allowing the image representation to attend to linguistic cues from the instruction.
- **Image Encoder**: ResNet50
- **Text Encoder**: BERT (base-uncased)
- **Fusion**: Cross-Attention Module
- **Training Loss**: `TripletMarginLoss` with Euclidean (L2) distance (margin=1.0)
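A minimal sketch of this design, assuming `torchvision` for the ResNet50 backbone and `bert-base-uncased` from the Hugging Face Hub; the projection sizes, number of attention heads, and pooling strategy are illustrative assumptions, not the released configuration:
```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import BertModel

class BertResnetCrossAttention(nn.Module):
    """ResNet50 spatial features attend to BERT token features via cross-attention
    (illustrative dimensions)."""
    def __init__(self, dim=768, num_heads=8, out_dim=256):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep spatial map (B, 2048, 7, 7)
        self.img_proj = nn.Linear(2048, dim)
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, out_dim)

    def forward(self, images, input_ids, attention_mask):
        feat = self.cnn(images)                                    # (B, 2048, 7, 7)
        feat = feat.flatten(2).transpose(1, 2)                     # (B, 49, 2048)
        img_tokens = self.img_proj(feat)                           # (B, 49, 768)
        txt_tokens = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state  # (B, T, 768)
        # Image tokens query the instruction tokens for linguistic cues.
        attended, _ = self.cross_attn(query=img_tokens, key=txt_tokens, value=txt_tokens,
                                      key_padding_mask=attention_mask == 0)
        return self.out(attended.mean(dim=1))                      # pooled query embedding
```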
## How to Use
Use these models to compute a fused embedding for a (floorplan, instruction) pair, then compare that embedding (e.g., via cosine similarity) against a pre-computed database of floorplan embeddings to find the best matches.
First, install the necessary libraries:
```bash
pip install torch transformers Pillow
```
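As a minimal sketch of the retrieval step described above, with random tensors standing in for a real query embedding and a pre-computed database of floorplan embeddings:
```python
import torch
import torch.nn.functional as F

# query_embedding: (1, D) fused embedding for the (floorplan, instruction) pair.
# database_embeddings: (N, D) pre-computed embeddings for all candidate floorplans.
query_embedding = torch.randn(1, 256)
database_embeddings = torch.randn(1000, 256)

# Rank candidates by cosine similarity and keep the top 3.
similarities = F.cosine_similarity(query_embedding, database_embeddings, dim=-1)  # (N,)
top_scores, top_indices = similarities.topk(k=3)
print(top_indices.tolist(), top_scores.tolist())
```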