---
license: cc-by-nc-4.0
datasets:
- ucf-crcv/GAEA-Train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
<h1 align="left"> GAEA: A Geolocation Aware Conversational Assistant [WACV 2026🔥]</h1>
<h3 align="left"> Summary</h3>
<p align="justify"> Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. In recent days, with the tremendous progress of large multimodal models (LMMs) — proprietary and open-source — researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed; beyond general tasks, for more specialized downstream tasks, such as geolocalization, LMMs struggle. In this work, we propose solving this problem by introducing a conversational model, GAEA, that provides information regarding the location of an image as the user requires. No large-scale dataset enabling the training of such a model exists. Thus, we propose GAEA-1.4M, a comprehensive dataset comprising over 800k images and approximately 1.4M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 3.5k image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 18.2% and the best proprietary model, GPT-4o, by 7.2%. Our dataset, model, and codes are publicly available. </p>
## `GAEA` is the first open-source conversational assistant equipped with global-scale geolocalization capabilities.
[📄 Paper](https://arxiv.org/abs/2503.16423)
[🤗 HuggingFace Collection](https://huggingface.co/collections/ucf-crcv/gaea-67d514a61d48eb1708b13a08)
[🌐 Project Page](https://ucf-crcv.github.io/GAEA/)
**Main contributions:**
1) **`GAEA-Train: A Diverse Training Dataset:`** We propose GAEA-Train, a new dataset designed for training conversational image geolocalization models, incorporating diverse visual and contextual data.
2) **`GAEA-Bench: Evaluating Conversational Geolocalization:`** To assess conversational capabilities in geolocalization, we introduce GAEA-Bench, a benchmark featuring various question-answer formats.
3) **`GAEA: An Interactive Geolocalization Chatbot:`** We present GAEA, a conversational chatbot that extends beyond geolocalization to provide rich contextual insights about locations from images.
4) **`Benchmarking Against State-of-the-Art LMMs:`** We quantitatively compare our model’s performance against 8 open-source and 3 proprietary LMMs, including GPT-4o and Gemini-2.0-Flash.
<b> This page is dedicated to the GAEA model </b>
<p align="center">
<img src="Assets/teaser.jpg" alt="teaser" width="800px"/></a>
</p>
<p align="justify"> We compare the performance of various LMMs on the geographically-grounded visual-question-answering task, included in our new GAEA-Bench benchmark. Most LMMs can describe the Wat Pho statue, but only GAEA, our Geolocation Aware Assistant, retrieves the correct nearby cafe, Cafe Amazon <i>(left)</i>. Qualitative SVQA comparison showing GAEA’s ability to provide accurate, location-specific answers where other LMMs fail <i>(right)</i>.</p>
<h2 align="left"> Model Description</h2>
<h3 align="left">Architecture</h3>
<p align="left"><img src="Assets/arch.png" alt="arch" width="400px"/></p>
<p align="justify"> <b>Overview of the GAEA model architecture and workflow.</b> An input image is first processed by a Vision Transformer (ViT) encoder, whose output is projected through a visual projector to obtain visual embeddings. Simultaneously, the input text prompt is converted into text embeddings. The combined visual and textual embeddings are then fed into the Qwen2.5 LLM space, which generates a response based on the multimodal input. We follow the single-stage training approach, unfreezing MLP, and performing LoRA fine-tuning in the same stage. </p>
<h2 align="left">How To Use</h2>
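<p align="justify"> Since GAEA is fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct, a minimal inference sketch with the standard Qwen2.5-VL classes in <code>transformers</code> might look like the following. The repository id <code>ucf-crcv/GAEA</code> and the example prompt are illustrative assumptions; check the HuggingFace collection linked above for the exact checkpoint name. </p>

```python
# Minimal inference sketch (assumptions: repo id "ucf-crcv/GAEA", standard Qwen2.5-VL loading path).
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "ucf-crcv/GAEA"  # hypothetical repo id; see the GAEA collection for the exact name
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("street_view.jpg")  # any local photo to geolocalize
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which city and country was this photo most likely taken in?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```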
<h2 align="left">Evaluation Results</h2>
<h3 align="left">Comparison with SoTA LMMs on GAEA-Bench (Conversational) </h3>
<p align="left">
<img src="Assets/GAEA-Benc-Eval.png" alt="GAEA-Benc-Eval" width="500px"/></a>
</p>
<p align="justify"> We benchmark 11 open-source and proprietary LMMs on GAEA-Bench. Notably, GAEA outperforms all open-source models and fares higher than the proprietary models on decision-making questions (MCQs and TFs). We provide the relative performance change for each model compared to GAEA. We use GPT-4o as a judge for evaluation, and it has been documented that LLMs as judges prefer their long-form output; hence, the scores for these models are likely overestimated. </p>
<p align="left">
<img src="Assets/question_types_stats.jpg" alt="question-types-stats" width="500px"/></a>
</p>
<p align="justify">We showcase the performance of various LMMs on four diverse question types. GAEA outperforms on average across all question forms.</p>
<h3 align="left">Qualitative Results (Conversational) </h3>
<p align="left">
<img src="Assets/queston_types_qual.jpg" alt="queston-types-qual" width="500px"/></a>
</p>
<p align="justify"> Qualitative MCQs comparison showing GAEA’s ability to provide accurate answers where other LMMs fail. </p>
<h3 align="left">Comparison with Specialized Models on Standard Geolocalization Datasets</h3>
<p align="left">
<img src="Assets/Geolocalization_results.png" alt="Geolocalization_results" width="400px"/></a>
</p>
<p align="justify"> We benchmark the performance of various specialized models on standard geolocation datasets. GAEA demonstrates competitive results, outperforming GaGA on multiple distance thresholds in both IM2GPS and IM2GPS3k. </p>
<h3 align="left">Comparison with best SoTA LMMs on City/Country Prediction </h3>
<p align="left">
<img src="Assets/City_Country_results.jpg" alt="City-Country-results" width="400px"/></a>
</p>
<p align="justify"> Classification accuracy for both city and country labels, where GAEA surpasses several recent LMMs in performance. </p>
---
# Citation
**BibTeX:**
```bibtex
@misc{campos2025gaeageolocationawareconversational,
title={GAEA: A Geolocation Aware Conversational Assistant},
author={Ron Campos and Ashmal Vayani and Parth Parag Kulkarni and Rohit Gupta and Aritra Dutta and Mubarak Shah},
year={2025},
eprint={2503.16423},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.16423},
}
```
---
## Licensing Information
We release our work under [CC BY-NC 4.0 License](https://creativecommons.org/licenses/by-nc/4.0/). The CC BY-NC 4.0 license allows others to share, remix, and adapt the work, as long as it is for non-commercial purposes and proper attribution is given to the original creator.