<b>This page is dedicated to the GAEA model.</b>

<p align="center">
<img src="Assets/teaser.jpg" alt="teaser" width="600px"/>
</p>

<p align="justify"> We compare the performance of various LMMs on the geographically grounded visual question answering task included in our new GAEA-Bench benchmark. Most LMMs can describe the Wat Pho statue, but only GAEA, our Geolocation Aware Assistant, retrieves the correct nearby cafe, Cafe Amazon <i>(left)</i>. A qualitative SVQA comparison shows GAEA’s ability to provide accurate, location-specific answers where other LMMs fail <i>(right)</i>.</p>
<h2 align="left"> Model Description</h2>

<h3 align="left">Architecture</h3>

<p align="left"><img src="Assets/arch.png" alt="arch" width="400px"/></p>

<p align="justify"> <b>Overview of the GAEA model architecture and workflow.</b> An input image is first processed by a Vision Transformer (ViT) encoder, whose output is passed through a visual projector to obtain visual embeddings. In parallel, the input text prompt is converted into text embeddings. The combined visual and textual embeddings are then fed into the Qwen2.5 LLM, which generates a response conditioned on the multimodal input. We follow a single-stage training approach, unfreezing the MLP projector and performing LoRA fine-tuning in the same stage. </p>
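As a rough illustration of this flow, the sketch below mocks the projector and the embedding concatenation with toy NumPy tensors. Every dimension and weight here is a placeholder chosen for readability, not taken from the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; the real model's sizes differ).
n_patches, d_vit = 256, 1024   # ViT patch tokens and their feature width
t_text, d_llm = 32, 3584       # text tokens and the LLM hidden width

# 1) ViT encoder output: one feature vector per image patch.
vit_feats = rng.standard_normal((n_patches, d_vit))

# 2) Visual projector (a single linear map here, standing in for the MLP):
#    maps ViT features into the LLM embedding space.
w_proj = rng.standard_normal((d_vit, d_llm)) * 0.02
visual_embeds = vit_feats @ w_proj            # shape (n_patches, d_llm)

# 3) Text prompt embeddings from the LLM's embedding table (mocked here).
text_embeds = rng.standard_normal((t_text, d_llm))

# 4) Concatenate along the sequence axis; this combined sequence is what
#    the LLM consumes as multimodal input.
llm_input = np.concatenate([visual_embeds, text_embeds], axis=0)
print(llm_input.shape)  # (288, 3584)
```

The single linear map is the simplification; the point is only that projection makes the visual tokens dimensionally compatible with the text tokens before they enter the LLM.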
<!-- <h2 align="left"> How To Use</h2> -->

<h3 align="left">Comparison with SoTA LMMs on GAEA-Bench (Conversational)</h3>

<p align="left">
<img src="Assets/GAEA-Benc-Eval.png" alt="GAEA-Benc-Eval" width="500px"/>
</p>

<p align="justify"> We benchmark 11 open-source and proprietary LMMs on GAEA-Bench. Notably, GAEA outperforms all open-source models and surpasses the proprietary models on decision-making questions (MCQs and TFs). We report the relative performance change for each model compared to GAEA. We use GPT-4o as a judge for evaluation; since it has been documented that LLM judges prefer their own long-form output, the scores for these models are likely overestimated. </p>
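For concreteness, the relative performance change reported per model is a plain percentage difference against GAEA's score. The sketch below uses made-up scores for illustration, not numbers from our results.

```python
def relative_change_pct(model_score: float, gaea_score: float) -> float:
    """Percentage change of a model's score relative to GAEA's score.

    Negative values mean the model scores below GAEA.
    """
    return 100.0 * (model_score - gaea_score) / gaea_score

# Hypothetical scores, for illustration only.
print(relative_change_pct(54.0, 60.0))  # -10.0
```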
<p align="left">
<img src="Assets/question_types_stats.jpg" alt="question-types-stats" width="500px"/>
</p>

<p align="justify">We showcase the performance of various LMMs on four diverse question types. GAEA performs best on average across all question forms.</p>

<h3 align="left">Qualitative Results (Conversational)</h3>

<p align="left">
<img src="Assets/queston_types_qual.jpg" alt="queston-types-qual" width="500px"/>
</p>

<p align="justify"> A qualitative MCQ comparison showing GAEA’s ability to provide accurate answers where other LMMs fail. </p>
<h3 align="left">Comparison with Specialized Models on Standard Geolocalization Datasets</h3>

<p align="left">
<img src="Assets/Geolocalization_results.png" alt="Geolocalization_results" width="400px"/>
</p>

<p align="justify"> We benchmark the performance of various specialized models on standard geolocation datasets. GAEA demonstrates competitive results, outperforming GaGA on multiple distance thresholds in both IM2GPS and IM2GPS3k. </p>
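For readers reproducing this protocol: IM2GPS-style evaluation typically reports the fraction of predictions whose great-circle (haversine) distance to the ground-truth coordinate falls within fixed thresholds, commonly 1, 25, 200, 750, and 2500 km for street, city, region, country, and continent level. A minimal sketch, assuming (latitude, longitude) pairs in degrees:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def accuracy_at_thresholds(pred_coords, true_coords,
                           thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions within each distance threshold of the ground truth."""
    dists = [haversine_km(p[0], p[1], t[0], t[1])
             for p, t in zip(pred_coords, true_coords)]
    return {km: sum(d <= km for d in dists) / len(dists) for km in thresholds_km}
```

For example, one exact prediction and one prediction thousands of kilometres off yield 0.5 accuracy at every threshold.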
<h3 align="left">Comparison with best SoTA LMMs on City/Country Prediction</h3>

<p align="left">
<img src="Assets/City_Country_results.jpg" alt="City-Country-results" width="400px"/>
</p>

<p align="justify"> Classification accuracy for both city and country labels, where GAEA surpasses several recent LMMs in performance. </p>