FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
Karmesh Yadav*, Yusuf Ali*, Gunshi Gupta, Yarin Gal, Zsolt Kira
Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce FindingDory, a benchmark in Habitat that evaluates memory-based reasoning across 60 long-horizon tasks.
In this repo, we release a Qwen2.5-VL-3B-Instruct checkpoint fine-tuned on the training split of FindingDory. The model takes as input image frames from a video the agent collected earlier, subsampled to 96 frames. It outputs a frame index (or a set of indices) pointing to the image in the agent’s history that satisfies the task instruction (e.g. “navigate to the object you interacted with immediately after the mug”).
At deployment, the image corresponding to the predicted index is fed into a low-level navigation policy to complete the embodied task.
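The input/output contract described above can be sketched in a few lines. This is a minimal illustration, not the released pipeline: it assumes frames are subsampled uniformly and that the model's reply contains the frame indices as plain integers, neither of which is specified in this card. The helper names `subsample_indices`, `parse_frame_indices`, and `to_original_frames` are hypothetical.

```python
import re

import numpy as np

NUM_FRAMES = 96  # number of frames the model receives, per this card


def subsample_indices(num_video_frames, num_samples=NUM_FRAMES):
    """Pick evenly spaced frame positions (assumption: uniform subsampling)."""
    if num_video_frames <= 0:
        raise ValueError("video must contain at least one frame")
    if num_video_frames <= num_samples:
        # Short videos are passed through unchanged.
        return list(range(num_video_frames))
    positions = np.linspace(0, num_video_frames - 1, num_samples)
    return positions.round().astype(int).tolist()


def parse_frame_indices(model_output, num_samples=NUM_FRAMES):
    """Extract integers from the reply and drop any that are out of range.

    Assumption: the reply lists indices as plain integers, e.g. "[12, 40]".
    """
    indices = [int(tok) for tok in re.findall(r"\d+", model_output)]
    return [i for i in indices if i < num_samples]


def to_original_frames(predicted, kept):
    """Map indices over the subsampled frames back to original video frames."""
    return [kept[i] for i in predicted]


kept = subsample_indices(1000)            # 96 positions in a 1000-frame video
predicted = parse_frame_indices("[0, 95]")
goal_frames = to_original_frames(predicted, kept)
```

The frames selected this way are what would be handed to the low-level navigation policy as goal images.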
🏋️ Training details
Property | Value |
---|---|
Epochs | 5 (Total training steps 12840) |
Effective batch | 32 |
LR schedule | Cosine (LR=5e-6, Warmup ratio=0.1) |
Max pixels | 360 × 420 |
Compute | 8 × A40 (48 GB) for ~84 hours |
Input frames | 96 images (~10k tokens) |
Optimiser | AdamW(β₁ = 0.9, β₂ = 0.95) |
Best checkpoint | 8800 Steps |
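For readers who want to see how the learning rate evolves under the settings in the table, here is a minimal sketch. It assumes the common convention of a linear warmup to the peak rate followed by a cosine decay to zero; the card states only "Cosine", the peak LR, and the warmup ratio, so the warmup shape and the final value of zero are assumptions. The function name `lr_at_step` is hypothetical.

```python
import math

TOTAL_STEPS = 12840   # from the table above
PEAK_LR = 5e-6        # from the table above
WARMUP_RATIO = 0.1    # from the table above


def lr_at_step(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR,
               warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given optimizer step.

    Assumptions: linear warmup from 0 to peak_lr, then cosine decay to 0.
    """
    warmup_steps = round(total_steps * warmup_ratio)  # 1284 with these values
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at the step of the best checkpoint reported above.
lr_best = lr_at_step(8800)
```

Under these assumptions the best checkpoint (step 8800) falls in the decay phase, well after the 1284 warmup steps.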
📊 Evaluation
We compare the performance of our finetuned FindingDory-Qwen2.5-VL-3B-SFT
checkpoint against other models below:
Model | High-level Success Rate | Notes |
---|---|---|
FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
Gemma3-12B-it | 13.2% | zero-shot |
GPT-4o | 27.3% | zero-shot |
Gemini-2.0-Flash | 25.4% | zero-shot |
See Fig. 2 in the paper for more details.
📄 Citation
@article{yadav2025findingdory,
title = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
author = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
journal = {arXiv preprint arXiv:2506.15635},
year = {2025}
}
Model tree for yali30/findingdory-qwen2.5-VL-3B-finetuned
Base model
Qwen/Qwen2.5-VL-3B-Instruct