JoaoMigSilva committed on
Commit e5e0cc8 · verified · 1 Parent(s): 51fdb52

Create README.md

Files changed (1)
  1. README.md +80 -0
README.md ADDED
@@ -0,0 +1,80 @@
---
license: mit
language: en
library_name: pytorch
tags:
- multimodal
- image-retrieval
- contrastive-learning
- floorplan-retrieval
- architecture
- computer-vision
- natural-language-processing
pipeline_tag: feature-extraction
model-index:
- name: CLIP-MLP-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.393
      name: Precision@3
    - type: UPR
      value: 0.607
      name: Unique Preference Rate
- name: BERT-ResNet-CA-Floorplan-Retriever
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      type: jmsilva/Synthetic_Floorplan_Intent_Dataset
      name: Synthetic Floorplan Intent Dataset
    metrics:
    - type: Precision@3
      value: 0.226
      name: Precision@3
    - type: UPR
      value: 0.179
      name: Unique Preference Rate
---

# Floorplan Retrieval with Design Intent Models

This repository contains two models trained for the research paper **"Unlocking Floorplan Retrieval with Design Intent via Contrastive Multimodal Learning"**.

The models retrieve architectural floorplans from a database given a source floorplan image and a natural language instruction describing a desired change, enabling a more intuitive, goal-driven search for architects and designers.

## Model Details

Two architectures were trained for this task using a triplet contrastive learning framework. The goal is to learn a shared embedding space in which a query (source image + text instruction) lies closer to a positive target image (one that satisfies the instruction) than to a negative image.

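As a concrete illustration of the training objective, the sketch below sets up a cosine-distance triplet loss in PyTorch, mirroring the loss listed for the CLIP-MLP model. The embedding tensors are random placeholders, not outputs of the released models.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings; in practice these come from the fusion models described below.
query_emb = torch.randn(8, 512)     # fused (source image + instruction) queries
positive_emb = torch.randn(8, 512)  # target floorplans that satisfy the instruction
negative_emb = torch.randn(8, 512)  # floorplans that do not


# Cosine distance = 1 - cosine similarity, for use with TripletMarginWithDistanceLoss.
def cosine_distance(a, b):
    return 1.0 - F.cosine_similarity(a, b)


triplet_loss = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=cosine_distance, margin=0.2
)
loss = triplet_loss(query_emb, positive_emb, negative_emb)
```
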
### 1. `CLIP-MLP-Floorplan-Retriever` (Recommended)

This model uses the pre-trained multimodal embeddings from CLIP (ViT-B/32). The image and text embeddings are concatenated and passed through a simple MLP for fusion (a minimal sketch follows the list below). It demonstrated superior performance on both the quantitative metrics and the user studies.

- **Image Encoder**: CLIP Vision Transformer (ViT-B/32)
- **Text Encoder**: CLIP Text Transformer
- **Fusion**: Concatenation + Multi-Layer Perceptron (MLP)
- **Training Loss**: `TripletMarginWithDistanceLoss` with cosine similarity (margin = 0.2)

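A minimal sketch of this fusion scheme, assuming the standard `openai/clip-vit-base-patch32` checkpoint from Hugging Face `transformers` and a hypothetical two-layer MLP head (layer sizes are illustrative, not the released weights):

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical fusion head: concatenated CLIP image + text embeddings -> joint space.
fusion_mlp = nn.Sequential(
    nn.Linear(512 + 512, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
)

image = Image.new("RGB", (224, 224), "white")  # stand-in for a real source floorplan
inputs = processor(
    text=["add a second bathroom next to the master bedroom"],
    images=image,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    # Concatenate the two 512-d embeddings and fuse them into a single query vector.
    query_emb = fusion_mlp(torch.cat([img_emb, txt_emb], dim=-1))
```
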
### 2. `BERT-ResNet-CA-Floorplan-Retriever`

This model uses separate pre-trained encoders for image and text. A cross-attention module fuses their features, allowing the image representation to attend to linguistic cues from the instruction (a rough sketch follows the list below).

- **Image Encoder**: ResNet50
- **Text Encoder**: BERT (base-uncased)
- **Fusion**: Cross-Attention Module
- **Training Loss**: `TripletMarginLoss` with L2 (Euclidean) distance (margin = 1.0)

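A rough sketch of this kind of cross-attention fusion, using `nn.MultiheadAttention` with ResNet50 region features as queries and BERT token features as keys/values. The projection width, head count, and mean pooling are illustrative assumptions, not the paper's exact module:

```python
import torch
import torch.nn as nn
from torchvision.models import ResNet50_Weights, resnet50
from transformers import BertModel, BertTokenizer

# Pre-trained encoders: keep ResNet50's spatial feature map (drop avgpool + fc).
resnet = nn.Sequential(*list(resnet50(weights=ResNet50_Weights.DEFAULT).children())[:-2])
bert = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Project image regions to BERT's width, then let them attend to instruction tokens.
img_proj = nn.Linear(2048, 768)
cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed floorplan image
tokens = tokenizer(["move the kitchen next to the living room"], return_tensors="pt")

with torch.no_grad():
    feat_map = resnet(image)                                  # (1, 2048, 7, 7)
    img_seq = img_proj(feat_map.flatten(2).transpose(1, 2))   # (1, 49, 768)
    txt_seq = bert(**tokens).last_hidden_state                # (1, seq_len, 768)
    # Image regions (queries) attend to instruction tokens (keys/values).
    fused, _ = cross_attn(img_seq, txt_seq, txt_seq)
    query_emb = fused.mean(dim=1)                             # pooled joint embedding
```
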
74
+ ## How to Use
75
+
76
+ You can use these models to get a fused embedding for a (floorplan, instruction) pair. You can then compare this embedding (e.g., using cosine similarity) against a pre-computed database of floorplan embeddings to find the best match.
77
+
78
+ First, install the necessary libraries:
79
+ ```bash
80
+ pip install torch transformers Pillow
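
After installation, retrieval reduces to ranking pre-computed floorplan embeddings against the fused query embedding. The sketch below uses random placeholder embeddings and an arbitrary 256-dimensional space; substitute the outputs of whichever model above you load:

```python
import torch
import torch.nn.functional as F

# Placeholder database: N pre-computed floorplan embeddings of shape (N, D),
# produced offline with the chosen model's image branch.
database_embeddings = torch.randn(1000, 256)
database_ids = [f"floorplan_{i:04d}" for i in range(1000)]

# Placeholder fused query embedding (source floorplan + instruction), shape (1, D).
query_embedding = torch.randn(1, 256)

# Rank the database by cosine similarity and keep the top 3 matches
# (Precision@3 is the metric reported in the model index above).
scores = F.cosine_similarity(query_embedding, database_embeddings, dim=-1)
top_scores, top_idx = scores.topk(k=3)
for rank, (score, idx) in enumerate(zip(top_scores.tolist(), top_idx.tolist()), start=1):
    print(f"{rank}. {database_ids[idx]} (cosine similarity: {score:.3f})")
```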