
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

Karmesh Yadav*, Yusuf Ali*, Gunshi Gupta, Yarin Gal, Zsolt Kira

Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce FindingDory, a benchmark built in the Habitat simulator that evaluates memory-based reasoning across 60 long-horizon tasks.

In this repo, we release a Qwen2.5-VL-3B-Instruct checkpoint trained on the training split of FindingDory. It takes as input image frames from a video previously collected by the agent, subsampled to 96 frames. Its output is a frame index (or a set of indices) pointing to the image in the agent's history that satisfies the task instruction (e.g. "navigate to the object you interacted with immediately after the mug").
At deployment, the image corresponding to the predicted index is fed to a low-level navigation policy to complete the embodied task.
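
Below is a minimal inference sketch using the standard Qwen2.5-VL API from 🤗 `transformers` (with `qwen-vl-utils`). The frame paths and the exact prompt wording are illustrative assumptions; the benchmark's own tooling may format inputs differently:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "yali30/findingdory-qwen2.5-VL-3B-finetuned"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 96 frames subsampled from the agent's previously collected video.
# Paths are hypothetical placeholders.
frames = [f"episode/frame_{i:03d}.jpg" for i in range(96)]

messages = [{
    "role": "user",
    "content": [
        *[{"type": "image", "image": f} for f in frames],
        {"type": "text",
         "text": ("Navigate to the object you interacted with immediately "
                  "after the mug. Answer with the frame index.")},
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)

# Decode only the newly generated tokens; the model answers with
# a frame index (or indices) into the 96-frame history.
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The predicted index can then be mapped back to the corresponding frame and handed to the low-level navigation policy.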

🏋️ Training details

| Property | Value |
|---|---|
| Epochs | 5 (12,840 total training steps) |
| Effective batch size | 32 |
| LR schedule | Cosine (LR = 5e-6, warmup ratio = 0.1) |
| Max pixels | 360 × 420 |
| Compute | 8 × A40 48 GB for ~84 hours |
| Input frames | 96 images (~10k tokens) |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Best checkpoint | step 8800 |
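
For reference, a hypothetical Hugging Face `TrainingArguments` configuration matching the hyperparameters above (the per-device batch size / gradient-accumulation split and the actual training stack are assumptions, not the authors' released recipe):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="findingdory-qwen2.5-vl-3b-sft",
    num_train_epochs=5,
    per_device_train_batch_size=1,   # 8 GPUs × grad-accum 4 → effective batch 32
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    bf16=True,                       # checkpoint is released in BF16
    save_steps=400,                  # hypothetical; best checkpoint was step 8800
)
```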

📊 Evaluation

We compare the performance of our finetuned FindingDory-Qwen2.5-VL-3B-SFT checkpoint against other models below:

| Model | High-level success rate | Notes |
|---|---|---|
| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
| Gemma3-12B-it | 13.2% | zero-shot |
| GPT-4o | 27.3% | zero-shot |
| Gemini-2.0-Flash | 25.4% | zero-shot |

Check out Fig. 2 in the paper for more details.

📄 Citation

@article{yadav2025findingdory,
  title     = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
  author    = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
  journal   = {arXiv preprint arXiv:2506.15635},
  year      = {2025}
}
