yali30
/

findingdory-qwen2.5-VL-3B-finetuned

+---
+datasets:
+- yali30/findingdory
+language:
+- en
+base_model:
+- Qwen/Qwen2.5-VL-3B-Instruct
+pipeline_tag: image-text-to-text
+library_name: transformers
+tags:
+- habitat
+- embodied-ai
+- memory
+---
+<a href="https://arxiv.org/abs/2506.15635" target="_blank">
+    <img alt="arXiv" src="https://img.shields.io/badge/arXiv-FindingDory-red?logo=arxiv" height="20" />
+</a>
+<a href="https://findingdory-benchmark.github.io/" target="_blank">
+    <img alt="Website" src="https://img.shields.io/badge/🌎_Website-FindingDory-blue.svg" height="20" />
+</a>
+<a href="https://github.com/findingdory-benchmark/findingdory-trl" target="_blank" style="display: inline-block; margin-right: 10px;">
+    <img alt="GitHub Code" src="https://img.shields.io/badge/Code-FindingDory--TRL-white?&logo=github&logoColor=white" />
+</a>
+<center><h1>FindingDory: A Benchmark to Evaluate Memory in Embodied Agents</h1>
+  <a href="https://www.karmeshyadav.com/">Karmesh Yadav*</a>,
+  <a href="https://yusufali98.github.io/">Yusuf Ali*</a>,
+  <a href="https://gunshigupta.netlify.app/">Gunshi Gupta</a>,
+  <a href="https://www.cs.ox.ac.uk/people/yarin.gal/website/">Yarin Gal</a>,
+  <a href="https://faculty.cc.gatech.edu/~zk15/">Zsolt Kira</a>
+</center>
+Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce **FindingDory**, a benchmark in Habitat that evaluates memory-based reasoning across 60 long-horizon tasks.
+In this repo, we release a **Qwen2.5-VL-3B-Instruct** checkpoint fine-tuned on the training split of **FindingDory**. It takes as input image frames from a video the agent collected earlier, subsampled to 96 frames. Its output is a **frame index** (or a set of indices) pointing to the image in the agent’s history that satisfies the task instruction (e.g. “navigate to the object you interacted with _immediately after_ the mug”).
+At deployment, the image corresponding to the predicted index is fed into a low-level navigation policy to complete the embodied task.
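The input/output contract described above can be sketched as follows. This is a minimal illustration, not the released pipeline: it assumes uniform subsampling (the card only says the video is subsampled to 96 frames) and assumes the model's answer contains plain integers that index into the subsampled frames. The helper names `subsample_indices` and `parse_frame_indices` are hypothetical.

```python
import re


def subsample_indices(num_frames: int, target: int = 96) -> list[int]:
    """Pick up to `target` evenly spaced frame indices from `num_frames` frames.

    Assumption: uniform spacing that always keeps the first and last frame.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    if num_frames <= 0:
        return []
    if num_frames <= target:
        return list(range(num_frames))
    if target == 1:
        return [0]
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * (num_frames - 1)) // (target - 1) for i in range(target)]


def parse_frame_indices(text: str, num_frames: int) -> list[int]:
    """Extract integer indices from generated text, dropping out-of-range values.

    Assumption: the answer contains plain non-negative integers.
    """
    return [int(tok) for tok in re.findall(r"\d+", text) if int(tok) < num_frames]


# Hypothetical usage: a 960-frame episode and a made-up model answer.
kept = subsample_indices(960)
predicted = parse_frame_indices("[12, 47]", len(kept))
# Map indices over the subsampled frames back to the original video.
original_frames = [kept[i] for i in predicted]
```

Under these assumptions, `original_frames` identifies which frames of the full episode video would be handed to the low-level navigation policy.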
+🏋️ Training details
+| Property | Value |
+| -------- | ----- |
+| Epochs   | 5 |
+| Effective batch | 32 |
+| LR schedule | Cosine (LR=5e-6, Warmup ratio=0.1)  |
+| Image resolution | TODO |
+| Compute  | 8 × A40 48 GB for ~18 hours |
+| Input frames | 96 images |
+| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.95) |
+| Best checkpoint | TODO |
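
For readers unfamiliar with the schedule in the table, the sketch below shows one common reading of "cosine with warmup ratio 0.1": a linear ramp to the peak learning rate over the first 10% of steps, then a half-cosine decay to zero. The linear warmup shape, the decay-to-zero floor, and the function name `lr_at_step` are assumptions. The card does not state which scheduler implementation was used.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-6, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumptions: warmup is linear from 0, and the cosine decays to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length of 1000 optimiser steps (not from the card).
schedule = [lr_at_step(s, 1000) for s in range(1001)]
```

With these assumptions the learning rate peaks at 5e-6 at the end of warmup and reaches zero at the final step.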
+📊 Evaluation
+We compare the performance of our finetuned `FindingDory-Qwen2.5-VL-3B-SFT` checkpoint against other models below:
+| Model | High-level Success Rate | Notes |
+| ----- | ----------------------- | ----- |
+| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
+| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
+| Gemma3-12B-it | 13.2% | zero-shot |
+| GPT-4o | 27.3% | zero-shot |
+| Gemini-2.0-Flash | 25.4% | zero-shot |
+Check out Fig. 2 in the paper for more details.
+📄 Citation
+```
+@article{yadav2025findingdory,
+  title     = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
+  author    = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
+  journal   = {arXiv preprint arXiv:2506.15635},
+  year      = {2025}
+}
+```