---
license: cc-by-nc-4.0
datasets:
- ucf-crcv/GAEA-Train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
<h1 align="left"> GAEA: A Geolocation Aware Conversational Assistant [WACV 2026🔥]</h1>
<h3 align="left"> Summary</h3>
<p align="justify"> Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. In recent days, with the tremendous progress of large multimodal models (LMMs) — proprietary and open-source — researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed; beyond general tasks, for more specialized downstream tasks, such as geolocalization, LMMs struggle. In this work, we propose solving this problem by introducing a conversational model, GAEA, that provides information regarding the location of an image as the user requires. No large-scale dataset enabling the training of such a model exists. Thus, we propose GAEA-1.4M, a comprehensive dataset comprising over 800k images and approximately 1.4M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 3.5k image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 18.2% and the best proprietary model, GPT-4o, by 7.2%. Our dataset, model, and codes are publicly available. </p>
## `GAEA` is the first open-source conversational assistant equipped with global-scale geolocalization capabilities.
[📄 Paper](https://arxiv.org/abs/2503.16423)
[🤗 HuggingFace Collection](https://huggingface.co/collections/ucf-crcv/gaea-67d514a61d48eb1708b13a08)
[🌐 Project Page](https://ucf-crcv.github.io/GAEA/)
**Main contributions:**
1) **`GAEA-Train: A Diverse Training Dataset:`** We propose GAEA-Train, a new dataset designed for training conversational image geolocalization models, incorporating diverse visual and contextual data.
2) **`GAEA-Bench: Evaluating Conversational Geolocalization:`** To assess conversational capabilities in geolocalization, we introduce GAEA-Bench, a benchmark featuring various question-answer formats.
3) **`GAEA: An Interactive Geolocalization Chatbot:`** We present GAEA, a conversational chatbot that extends beyond geolocalization to provide rich contextual insights about locations from images.
4) **`Benchmarking Against State-of-the-Art LMMs:`** We quantitatively compare our model’s performance against 8 open-source and 3 proprietary LMMs, including GPT-4o and Gemini-2.0-Flash.
<b> This page is dedicated to the GAEA model </b>
<p align="center">
<img src="Assets/teaser.jpg" alt="teaser" width="800px"/></a>
</p>
<p align="justify"> We compare the performance of various LMMs on the geographically-grounded visual-question-answering task, included in our new GAEA-Bench benchmark. Most LMMs can describe the Wat Pho statue, but only GAEA, our Geolocation Aware Assistant, retrieves the correct nearby cafe, Cafe Amazon <i>(left)</i>. Qualitative SVQA comparison showing GAEA’s ability to provide accurate, location-specific answers where other LMMs fail <i>(right)</i>.</p>
<h2 align="left"> Model Description</h2>
<h3 align="left">Architecture</h3>
<p align="left"><img src="Assets/arch.png" alt="arch" width="400px"/></p>
<p align="justify"> <b>Overview of the GAEA model architecture and workflow.</b> An input image is first processed by a Vision Transformer (ViT) encoder, whose output is projected through a visual projector to obtain visual embeddings. Simultaneously, the input text prompt is converted into text embeddings. The combined visual and textual embeddings are then fed into the Qwen2.5 LLM space, which generates a response based on the multimodal input. We follow the single-stage training approach, unfreezing MLP, and performing LoRA fine-tuning in the same stage. </p>
<h2 align="left">How To Use</h2>
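<p align="justify"> Since GAEA is fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct, a minimal inference sketch with the standard Qwen2.5-VL classes in <code>transformers</code> might look like the following. The repository id <code>ucf-crcv/GAEA</code> and the example prompt are illustrative assumptions; check the HuggingFace collection linked above for the exact checkpoint name. </p>

```python
# Minimal inference sketch (assumptions: repo id "ucf-crcv/GAEA", standard Qwen2.5-VL loading path).
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "ucf-crcv/GAEA"  # hypothetical repo id; see the GAEA collection for the exact name
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("street_view.jpg")  # any local photo to geolocalize
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which city and country was this photo most likely taken in?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```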
<h2 align="left">Evaluation Results</h2>
<h3 align="left">Comparison with SoTA LMMs on GAEA-Bench (Conversational) </h3>
<p align="left">
<img src="Assets/GAEA-Benc-Eval.png" alt="GAEA-Benc-Eval" width="500px"/></a>
</p>
<p align="justify"> We benchmark 11 open-source and proprietary LMMs on GAEA-Bench. Notably, GAEA outperforms all open-source models and fares higher than the proprietary models on decision-making questions (MCQs and TFs). We provide the relative performance change for each model compared to GAEA. We use GPT-4o as a judge for evaluation, and it has been documented that LLMs as judges prefer their long-form output; hence, the scores for these models are likely overestimated. </p>
<p align="left">
<img src="Assets/question_types_stats.jpg" alt="question-types-stats" width="500px"/></a>
</p>
<p align="justify">We showcase the performance of various LMMs on four diverse question types. GAEA outperforms on average across all question forms.</p>
<h3 align="left">Qualitative Results (Conversational) </h3>
<p align="left">
<img src="Assets/queston_types_qual.jpg" alt="queston-types-qual" width="500px"/></a>
</p>
<p align="justify"> Qualitative MCQs comparison showing GAEA’s ability to provide accurate answers where other LMMs fail. </p>
<h3 align="left">Comparison with Specialized Models on Standard Geolocalization Datasets</h3>
<p align="left">
<img src="Assets/Geolocalization_results.png" alt="Geolocalization_results" width="400px"/></a>
</p>
<p align="justify"> We benchmark the performance of various specialized models on standard geolocation datasets. GAEA demonstrates competitive results, outperforming GaGA on multiple distance thresholds in both IM2GPS and IM2GPS3k. </p>
<h3 align="left">Comparison with best SoTA LMMs on City/Country Prediction </h3>
<p align="left">
<img src="Assets/City_Country_results.jpg" alt="City-Country-results" width="400px"/></a>
</p>
<p align="justify"> Classification accuracy for both city and country labels, where GAEA surpasses several recent LMMs in performance. </p>
---
# Citation
**BibTeX:**
```bibtex
@misc{campos2025gaeageolocationawareconversational,
title={GAEA: A Geolocation Aware Conversational Assistant},
author={Ron Campos and Ashmal Vayani and Parth Parag Kulkarni and Rohit Gupta and Aritra Dutta and Mubarak Shah},
year={2025},
eprint={2503.16423},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.16423},
}
```
---
## Licensing Information
We release our work under [CC BY-NC 4.0 License](https://creativecommons.org/licenses/by-nc/4.0/). The CC BY-NC 4.0 license allows others to share, remix, and adapt the work, as long as it is for non-commercial purposes and proper attribution is given to the original creator.