---
license: mit
language: en
library_name: pytorch
tags:
- multimodal
- image-retrieval
- contrastive-learning
- floorplan-retrieval
- architecture
- computer-vision
- natural-language-processing
pipeline_tag: feature-extraction
model-index:
- name: CLIP-MLP-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.393
      name: Precision@3
    - type: UPR
      value: 0.607
      name: Unique Preference Rate
- name: BERT-ResNet-CA-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.226
      name: Precision@3
    - type: UPR
      value: 0.179
      name: Unique Preference Rate
---

# Floorplan Retrieval with Design Intent Models

This repository contains two models trained for the research paper **"Unlocking Floorplan Retrieval with Design Intent via Contrastive Multimodal Learning"**.

The models retrieve architectural floorplans from a database given a source image and a natural language instruction describing a desired change, enabling a more intuitive, goal-driven search for architects and designers.

## Model Details

Two architectures were trained for this task using a triplet contrastive learning framework. The goal is to learn a shared embedding space in which a query (source image + text instruction) lies closer to a positive target image (one that satisfies the instruction) than to a negative image.

### 1. `CLIP-MLP-Floorplan-Retriever` (Recommended)

This model uses the pre-trained multimodal embeddings from CLIP (ViT-B/32). The image and text embeddings are concatenated and passed through a simple MLP for fusion. This model demonstrated superior performance in both quantitative metrics and user studies.

- **Image Encoder**: CLIP Vision Transformer (ViT-B/32)
- **Text Encoder**: CLIP Text Transformer
- **Fusion**: Concatenation + Multi-Layer Perceptron (MLP)
- **Training Loss**: `TripletMarginWithDistanceLoss` with cosine distance (margin = 0.2)

### 2. `BERT-ResNet-CA-Floorplan-Retriever`

This model uses separate pre-trained encoders for image and text. A cross-attention module fuses the two feature streams, allowing the image representation to attend to linguistic cues in the instruction.

- **Image Encoder**: ResNet50
- **Text Encoder**: BERT (base-uncased)
- **Fusion**: Cross-Attention Module
- **Training Loss**: `TripletMarginLoss` with L2 (Euclidean) distance (margin = 1.0)

## How to Use

Use these models to compute a fused embedding for a (floorplan, instruction) pair, then compare that embedding (e.g., via cosine similarity) against a pre-computed database of floorplan embeddings to find the best match.

First, install the necessary libraries:

```bash
pip install torch transformers Pillow
```
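
The snippet below is a minimal sketch of computing a fused query embedding with the CLIP-MLP variant. The CLIP backbone and processor use the standard `transformers` API; the `FusionMLP` class, its layer sizes, the checkpoint name `clip_mlp_fusion.pt`, and the file/instruction names are illustrative assumptions — adapt them to the weights actually shipped in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained CLIP ViT-B/32 backbone (shared image and text encoders).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class FusionMLP(nn.Module):
    """Hypothetical fusion head: concatenate image + text embeddings, then MLP.
    Layer sizes are assumptions, not the exact architecture from the paper."""
    def __init__(self, clip_dim=512, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * clip_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, image_emb, text_emb):
        return self.net(torch.cat([image_emb, text_emb], dim=-1))

fusion = FusionMLP().to(device).eval()
# Assumed checkpoint name; replace with the file from this repo:
# fusion.load_state_dict(torch.load("clip_mlp_fusion.pt", map_location=device))

image = Image.open("source_floorplan.png").convert("RGB")  # example input
instruction = "add a second bathroom next to the master bedroom"

inputs = processor(text=[instruction], images=image,
                   return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    query_emb = fusion(image_emb, text_emb)        # shape: (1, out_dim)
    query_emb = F.normalize(query_emb, dim=-1)     # unit-normalize for cosine search
```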
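
Continuing from the snippet above, retrieval reduces to a nearest-neighbor search over the database. The random `db_embeddings` tensor stands in for your pre-computed, normalized floorplan embeddings; how those database embeddings were produced (raw CLIP image features vs. fused embeddings) is not fixed here and should match your setup.

```python
import torch

# Placeholder database: N candidate floorplan embeddings, unit-normalized,
# on the same device as the query.
db_embeddings = torch.randn(1000, query_emb.shape[-1], device=query_emb.device)
db_embeddings = torch.nn.functional.normalize(db_embeddings, dim=-1)

# On unit-norm vectors, cosine similarity is just a dot product.
scores = query_emb @ db_embeddings.T      # shape: (1, N)
topk = scores.topk(k=3, dim=-1)
print("Top-3 candidate indices:", topk.indices.tolist())
print("Similarities:", topk.values.tolist())
```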
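
The `BERT-ResNet-CA-Floorplan-Retriever` builds its fused embedding differently: the image representation attends to the instruction's token features, as described under Model Details. The sketch below illustrates that idea with `nn.MultiheadAttention`; the projection dimensions, head count, and use of a single pooled image query are assumptions, not the exact module from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical cross-attention fusion: image features (query) attend to
    instruction token features (keys/values). Dimensions are illustrative."""
    def __init__(self, img_dim=2048, txt_dim=768, dim=512, heads=8):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)  # ResNet50 pooled features -> shared dim
        self.txt_proj = nn.Linear(txt_dim, dim)  # BERT token features -> shared dim
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feat, txt_tokens):
        q = self.img_proj(img_feat).unsqueeze(1)  # (B, 1, dim): image as the query
        kv = self.txt_proj(txt_tokens)            # (B, T, dim): instruction tokens
        fused, _ = self.attn(q, kv, kv)
        return fused.squeeze(1)                   # (B, dim) fused embedding

fusion = CrossAttentionFusion()
img_feat = torch.randn(4, 2048)       # e.g. ResNet50 global-pooled features
txt_tokens = torch.randn(4, 16, 768)  # e.g. BERT last_hidden_state
print(fusion(img_feat, txt_tokens).shape)  # torch.Size([4, 512])
```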
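
Finally, for anyone reproducing training: the two objectives listed under Model Details map directly onto standard PyTorch losses. The tensors below are random placeholders for the (anchor, positive, negative) triplets, where the anchor is the fused query embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_distance(a, b):
    # Distance = 1 - cosine similarity, so smaller means more similar.
    return 1.0 - F.cosine_similarity(a, b)

# CLIP-MLP variant: triplet loss over cosine distance, margin 0.2.
triplet_cosine = nn.TripletMarginWithDistanceLoss(
    distance_function=cosine_distance, margin=0.2)

# BERT-ResNet-CA variant: triplet loss over L2 distance, margin 1.0.
triplet_l2 = nn.TripletMarginLoss(margin=1.0, p=2)

anchor   = torch.randn(8, 512)  # fused (source image + instruction) queries
positive = torch.randn(8, 512)  # target floorplans satisfying the instruction
negative = torch.randn(8, 512)  # non-matching floorplans

print(triplet_cosine(anchor, positive, negative).item())
print(triplet_l2(anchor, positive, negative).item())
```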