---
license: mit
language: en
library_name: pytorch
tags:
  - multimodal
  - image-retrieval
  - contrastive-learning
  - floorplan-retrieval
  - architecture
  - computer-vision
  - natural-language-processing
pipeline_tag: feature-extraction
model-index:
  - name: CLIP-MLP-Floorplan-Retriever
    results:
      - task:
          type: feature-extraction
          name: Feature Extraction
        dataset:
          type: jmsilva/Synthetic_Floorplan_Intent_Dataset
          name: Synthetic Floorplan Intent Dataset
        metrics:
          - type: Precision@3
            value: 0.393
            name: Precision@3
          - type: UPR
            value: 0.607
            name: Unique Preference Rate
  - name: BERT-ResNet-CA-Floorplan-Retriever
    results:
      - task:
          type: feature-extraction
          name: Feature Extraction
        dataset:
          type: jmsilva/Synthetic_Floorplan_Intent_Dataset
          name: Synthetic Floorplan Intent Dataset
        metrics:
          - type: Precision@3
            value: 0.226
            name: Precision@3
          - type: UPR
            value: 0.179
            name: Unique Preference Rate
---

# Floorplan Retrieval with Design Intent Models

This repository contains two models trained for the research paper: "Unlocking Floorplan Retrieval with Design Intent via Contrastive Multimodal Learning".

These models are designed to retrieve architectural floorplans from a database based on a source image and a natural language instruction describing a desired change. This enables a more intuitive and goal-driven search for architects and designers.

## Model Details

Two architectures were trained for this task using a triplet contrastive learning framework. The goal is to learn a shared embedding space where a query (source image + text instruction) is closer to a positive target image (that satisfies the instruction) than to a negative image.
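
As a concrete illustration, here is a minimal PyTorch sketch of this triplet objective using the CLIP-MLP configuration listed below (cosine distance, margin 0.2); the embedding dimension and batch size are placeholders, not values taken from the released training code.

```python
import torch
import torch.nn.functional as F

# Triplet objective: pull the (source image + instruction) query towards the
# positive target floorplan and push it away from the negative one.
# Cosine distance with margin 0.2 matches the CLIP-MLP configuration below;
# the BERT-ResNet-CA variant instead uses nn.TripletMarginLoss (L2, margin 1.0).
def cosine_distance(a, b):
    return 1.0 - F.cosine_similarity(a, b)

triplet_loss = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=cosine_distance, margin=0.2
)

# Placeholder embeddings: a batch of 8 queries with 512-dimensional embeddings.
anchor   = torch.randn(8, 512)   # fused query embedding (source image + instruction)
positive = torch.randn(8, 512)   # floorplan that satisfies the instruction
negative = torch.randn(8, 512)   # floorplan that does not
loss = triplet_loss(anchor, positive, negative)
```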

### 1. CLIP-MLP-Floorplan-Retriever (Recommended)

This model uses the pre-trained multimodal embeddings from CLIP (ViT-B/32). The image and text embeddings are concatenated and passed through a simple MLP for fusion. This model demonstrated superior performance in both quantitative metrics and user studies.

- Image Encoder: CLIP Vision Transformer (ViT-B/32)
- Text Encoder: CLIP Text Transformer
- Fusion: Concatenation + Multi-Layer Perceptron (MLP) (see the sketch below)
- Training Loss: TripletMarginWithDistanceLoss with cosine distance (1 - cosine similarity) as the distance function, margin = 0.2
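
A minimal sketch of this architecture, assuming a two-layer MLP head; the hidden and output dimensions used here (512 and 256) are illustrative and may not match the released checkpoint.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

class CLIPMLPFusion(nn.Module):
    """Concatenate CLIP ViT-B/32 image and text embeddings and fuse them with an MLP.
    The MLP depth and widths here are assumptions, not the exact released head."""

    def __init__(self, clip_name="openai/clip-vit-base-patch32",
                 hidden_dim=512, out_dim=256):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        clip_dim = self.clip.config.projection_dim  # 512 for ViT-B/32
        self.mlp = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img_emb = self.clip.get_image_features(pixel_values=pixel_values)
        txt_emb = self.clip.get_text_features(input_ids=input_ids,
                                              attention_mask=attention_mask)
        # Concatenate the two modalities and project into the shared space.
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))
```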

### 2. BERT-ResNet-CA-Floorplan-Retriever

This model uses separate pre-trained encoders for image and text. A cross-attention module is used to fuse the features, allowing the image representation to attend to linguistic cues from the instruction.

- Image Encoder: ResNet50
- Text Encoder: BERT (base-uncased)
- Fusion: Cross-Attention Module (see the sketch below)
- Training Loss: TripletMarginLoss with Euclidean (L2) distance, margin = 1.0
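
A rough sketch of this design, with the image feature acting as the attention query over the instruction tokens; the projection size, number of attention heads, and pooling choices are assumptions rather than the exact released configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import BertModel

class CrossAttentionFusion(nn.Module):
    """BERT + ResNet50 fusion via cross-attention.
    embed_dim and num_heads are illustrative assumptions."""

    def __init__(self, embed_dim=256, num_heads=4):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-1])  # pooled 2048-d features
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.img_proj = nn.Linear(2048, embed_dim)
        self.txt_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, images, input_ids, attention_mask):
        img = self.img_proj(self.image_encoder(images).flatten(1)).unsqueeze(1)   # (B, 1, D)
        txt = self.txt_proj(
            self.text_encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state)   # (B, T, D)
        # The image representation attends to the instruction tokens.
        fused, _ = self.cross_attn(query=img, key=txt, value=txt,
                                   key_padding_mask=(attention_mask == 0))
        return fused.squeeze(1)                                                    # (B, D) fused embedding
```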

## How to Use

You can use these models to get a fused embedding for a (floorplan, instruction) pair. You can then compare this embedding (e.g., using cosine similarity) against a pre-computed database of floorplan embeddings to find the best match.

First, install the necessary libraries:

```bash
pip install torch transformers Pillow
```
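
The snippet below is a hedged end-to-end sketch built on the CLIPMLPFusion module shown above. The file names `clip_mlp_fusion.pt`, `floorplan_embeddings.pt`, and `source_floorplan.png` are placeholders for illustration, not files guaranteed to ship with this repository; substitute the actual checkpoint and embedding database you use.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPProcessor

# Assumes the CLIPMLPFusion class sketched above is in scope.
# The checkpoint name is a placeholder -- substitute the released file.
model = CLIPMLPFusion()
model.load_state_dict(torch.load("clip_mlp_fusion.pt", map_location="cpu"))
model.eval()

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Build the query: a source floorplan plus an instruction describing the desired change.
image = Image.open("source_floorplan.png").convert("RGB")
instruction = "add a second bathroom next to the master bedroom"
inputs = processor(text=[instruction], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    query = model(inputs["pixel_values"], inputs["input_ids"],
                  inputs["attention_mask"])            # (1, D) fused query embedding

# Rank a pre-computed database of floorplan embeddings by cosine similarity.
database = torch.load("floorplan_embeddings.pt")       # (N, D) tensor, placeholder file
scores = F.cosine_similarity(query, database)          # (N,) similarity scores
top3 = scores.topk(3).indices
print("Top-3 candidate floorplans:", top3.tolist())
```

For the BERT-ResNet-CA variant, the same flow applies, with a BERT tokenizer and torchvision image transforms in place of the CLIP processor.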