
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

Karmesh Yadav*, Yusuf Ali*, Gunshi Gupta, Yarin Gal, Zsolt Kira

Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce FindingDory, a benchmark built in the Habitat simulator that evaluates memory-based reasoning across 60 long-horizon tasks.

In this repo, we release a Qwen2.5-VL-3B-Instruct checkpoint trained on the training split of FindingDory. It takes as input image frames from a video previously collected by the agent, subsampled to 96 frames. Its output is a frame index (or a set of indices) pointing to the image in the agent's history that satisfies the task instruction (e.g. "navigate to the object you interacted with immediately after the mug").
At deployment, the image corresponding to the predicted index is fed to a low-level navigation policy to complete the embodied task.
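
Below is a minimal inference sketch using the standard Qwen2.5-VL API from 🤗 `transformers` (with `qwen-vl-utils`). The frame paths and the exact prompt wording are illustrative assumptions; the benchmark's own tooling may format inputs differently:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "yali30/findingdory-qwen2.5-VL-3B-finetuned"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 96 frames subsampled from the agent's previously collected video.
# Paths are hypothetical placeholders.
frames = [f"episode/frame_{i:03d}.jpg" for i in range(96)]

messages = [{
    "role": "user",
    "content": [
        *[{"type": "image", "image": f} for f in frames],
        {"type": "text",
         "text": ("Navigate to the object you interacted with immediately "
                  "after the mug. Answer with the frame index.")},
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)

# Decode only the newly generated tokens; the model answers with
# a frame index (or indices) into the 96-frame history.
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The predicted index can then be mapped back to the corresponding frame and handed to the low-level navigation policy.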

🏋️ Training details

| Property | Value |
|---|---|
| Epochs | 5 (12,840 total training steps) |
| Effective batch size | 32 |
| LR schedule | Cosine (LR = 5e-6, warmup ratio = 0.1) |
| Max pixels | 360 × 420 |
| Compute | 8 × A40 48 GB for ~84 hours |
| Input frames | 96 images (~10k tokens) |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Best checkpoint | step 8800 |
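
For reference, a hypothetical Hugging Face `TrainingArguments` configuration matching the hyperparameters above (the per-device batch size / gradient-accumulation split and the actual training stack are assumptions, not the authors' released recipe):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="findingdory-qwen2.5-vl-3b-sft",
    num_train_epochs=5,
    per_device_train_batch_size=1,   # 8 GPUs × grad-accum 4 → effective batch 32
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    bf16=True,                       # checkpoint is released in BF16
    save_steps=400,                  # hypothetical; best checkpoint was step 8800
)
```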

📊 Evaluation

We compare the performance of our finetuned FindingDory-Qwen2.5-VL-3B-SFT checkpoint against other models below:

| Model | High-level success rate | Notes |
|---|---|---|
| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
| Gemma3-12B-it | 13.2% | zero-shot |
| GPT-4o | 27.3% | zero-shot |
| Gemini-2.0-Flash | 25.4% | zero-shot |

Check out Fig. 2 in the paper for more details.

📄 Citation

@article{yadav2025findingdory,
  title     = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
  author    = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
  journal   = {arXiv preprint arXiv:2506.15635},
  year      = {2025}
}
