---
license: mit
language: en
library_name: pytorch
tags:
- multimodal
- image-retrieval
- contrastive-learning
- floorplan-retrieval
- architecture
- computer-vision
- natural-language-processing
pipeline_tag: feature-extraction
model-index:
- name: CLIP-MLP-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.393
      name: Precision@3
    - type: UPR
      value: 0.607
      name: Unique Preference Rate
- name: BERT-ResNet-CA-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.226
      name: Precision@3
    - type: UPR
      value: 0.179
      name: Unique Preference Rate
---

# Floorplan Retrieval with Design Intent Models

This repository contains two models trained for the research paper: **"Unlocking Floorplan Retrieval with Design Intent via Contrastive Multimodal Learning"**.

These models are designed to retrieve architectural floorplans from a database based on a source image and a natural language instruction describing a desired change. This enables a more intuitive and goal-driven search for architects and designers.

## Model Details

Two architectures were trained for this task using a triplet contrastive learning framework. The goal is to learn a shared embedding space where a query (source image + text instruction) is closer to a positive target image (one that satisfies the instruction) than to a negative image.
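
In other words, training reduces to a triplet margin objective over anchor, positive, and negative embeddings. The snippet below is a minimal, self-contained sketch of that setup, with random tensors standing in for real model outputs; the batch size, embedding size, and margin here are illustrative only.

```python
import torch
import torch.nn as nn

# Illustrative triplet setup: the anchor is the fused (source image + instruction)
# query, the positive is a floorplan that satisfies the instruction, and the
# negative is any other floorplan. Random tensors stand in for real embeddings.
embed_dim = 256
anchor = torch.randn(8, embed_dim)    # fused query embeddings (batch of 8)
positive = torch.randn(8, embed_dim)  # embeddings of matching target floorplans
negative = torch.randn(8, embed_dim)  # embeddings of non-matching floorplans

triplet_loss = nn.TripletMarginLoss(margin=1.0)
loss = triplet_loss(anchor, positive, negative)  # pulls the anchor toward the positive
print(loss.item())
```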

### 1. `CLIP-MLP-Floorplan-Retriever` (Recommended)

This model uses the pre-trained multimodal embeddings from CLIP (ViT-B/32). The image and text embeddings are concatenated and passed through a simple MLP for fusion (sketched below the list). This model demonstrated superior performance in both quantitative metrics and user studies.

- **Image Encoder**: CLIP Vision Transformer (ViT-B/32)
- **Text Encoder**: CLIP Text Transformer
- **Fusion**: Concatenation + Multi-Layer Perceptron (MLP)
- **Training Loss**: `TripletMarginWithDistanceLoss` with cosine similarity (margin=0.2)
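
As a rough illustration of this design, the sketch below shows a concatenation-plus-MLP fusion head together with the cosine-distance triplet loss listed above. The hidden and output dimensions are assumptions for illustration and may differ from the released checkpoint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipMlpFusion(nn.Module):
    """Illustrative fusion head: concatenate CLIP ViT-B/32 image and text
    embeddings (512-d each) and project them with a small MLP. Layer sizes
    are assumptions, not necessarily those of the released model."""

    def __init__(self, clip_dim=512, hidden_dim=512, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, image_emb, text_emb):
        fused = torch.cat([image_emb, text_emb], dim=-1)  # (B, 1024)
        return self.mlp(fused)                            # (B, out_dim)

fusion = ClipMlpFusion()
query = fusion(torch.randn(4, 512), torch.randn(4, 512))  # (4, 256) fused query embeddings

# Triplet loss with distance defined as 1 - cosine similarity, margin 0.2.
loss_fn = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b),
    margin=0.2,
)
```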

### 2. `BERT-ResNet-CA-Floorplan-Retriever`

This model uses separate pre-trained encoders for image and text. A cross-attention module fuses the features, allowing the image representation to attend to linguistic cues from the instruction (a sketch follows the list below).

- **Image Encoder**: ResNet50
- **Text Encoder**: BERT (base-uncased)
- **Fusion**: Cross-Attention Module
- **Training Loss**: `TripletMarginLoss` with L2 (Euclidean) distance (margin=1.0)
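
The sketch below illustrates a cross-attention fusion of this kind, with the pooled image features acting as the query over the BERT token states. The projection size and number of heads are assumptions and may differ from the released checkpoint.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion: pooled ResNet50 image features (2048-d)
    attend over BERT token embeddings (768-d). Dimensions and head count are
    assumptions, not necessarily those of the released model."""

    def __init__(self, image_dim=2048, text_dim=768, fused_dim=512, num_heads=8):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)

    def forward(self, image_feat, text_tokens):
        # image_feat: (B, 2048) pooled CNN features  -> query
        # text_tokens: (B, T, 768) BERT token states -> keys and values
        q = self.image_proj(image_feat).unsqueeze(1)   # (B, 1, fused_dim)
        kv = self.text_proj(text_tokens)               # (B, T, fused_dim)
        fused, _ = self.cross_attn(q, kv, kv)          # image attends to the text
        return fused.squeeze(1)                        # (B, fused_dim)

fusion = CrossAttentionFusion()
query = fusion(torch.randn(4, 2048), torch.randn(4, 16, 768))  # (4, 512) fused queries

# Triplet loss for this variant: Euclidean (L2) distance with margin 1.0.
loss_fn = nn.TripletMarginLoss(margin=1.0, p=2)
```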

## How to Use

You can use these models to get a fused embedding for a (floorplan, instruction) pair, then compare this embedding (e.g., using cosine similarity) against a pre-computed database of floorplan embeddings to find the best match.
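
The ranking step itself reduces to a cosine-similarity search over that database; the sketch below illustrates it with random tensors standing in for real embeddings (sizes are placeholders).

```python
import torch
import torch.nn.functional as F

# Placeholders: in practice, query_emb comes from the fusion model for a
# (source floorplan, instruction) pair, and database_embs is pre-computed
# over the candidate floorplan collection.
query_emb = torch.randn(1, 256)         # fused query embedding
database_embs = torch.randn(1000, 256)  # one embedding per candidate floorplan

scores = F.cosine_similarity(query_emb, database_embs)  # (1000,) similarity scores
top_scores, top_indices = scores.topk(k=3)              # indices of the top-3 matches
print(top_indices.tolist())
```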

First, install the necessary libraries:

```bash
pip install torch transformers Pillow