Files changed (1)
  1. README.md +70 -0
README.md ADDED
---
datasets:
- yali30/findingdory
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- habitat
- embodied-ai
- memory
---
<a href="https://arxiv.org/abs/2506.15635" target="_blank">
  <img alt="arXiv" src="https://img.shields.io/badge/arXiv-FindingDory-red?logo=arxiv" height="20" />
</a>
<a href="https://findingdory-benchmark.github.io/" target="_blank">
  <img alt="Website" src="https://img.shields.io/badge/🌎_Website-FindingDory-blue.svg" height="20" />
</a>
<a href="https://github.com/findingdory-benchmark/findingdory-trl" target="_blank" style="display: inline-block; margin-right: 10px;">
  <img alt="GitHub Code" src="https://img.shields.io/badge/Code-FindingDory--TRL-white?&logo=github&logoColor=white" />
</a>

<center><h1>FindingDory: A Benchmark to Evaluate Memory in Embodied Agents</h1>
<a href="https://www.karmeshyadav.com/">Karmesh Yadav*</a>,
<a href="https://yusufali98.github.io/">Yusuf Ali*</a>,
<a href="https://gunshigupta.netlify.app/">Gunshi Gupta</a>,
<a href="https://www.cs.ox.ac.uk/people/yarin.gal/website/">Yarin Gal</a>,
<a href="https://faculty.cc.gatech.edu/~zk15/">Zsolt Kira</a>
</center>

Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce **FindingDory**, a benchmark in Habitat that evaluates memory-based reasoning across 60 long-horizon tasks.

In this repo, we release a **Qwen2.5-VL-3B-Instruct** checkpoint trained on the training split of **FindingDory**. It takes as input image frames from a video previously collected by the agent, subsampled to 96 frames, and outputs a **frame index** (or a set of indices) pointing to the image in the agent's history that satisfies the task instruction (e.g. "navigate to the object you interacted with _immediately after_ the mug").
At deployment, the image corresponding to the predicted index is fed to a low-level navigation policy to complete the embodied task.
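
As a minimal inference sketch with the 🤗 `transformers` Qwen2.5-VL classes, the snippet below uniformly subsamples an episode to 96 frames, asks the model for the frame index that satisfies an instruction, and decodes the answer. The repo id, frame directory, and prompt wording are placeholders and may not match the exact format used for training and evaluation (see the FindingDory-TRL code for that).

```python
# Hedged inference sketch: repo id, frame paths, and prompt wording are placeholders.
import glob
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "<this-model-repo-id>"  # replace with this checkpoint's Hub id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
# Image resolution can also be controlled via the processor's min_pixels / max_pixels.
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Uniformly subsample the agent's episode video to 96 frames.
frame_paths = sorted(glob.glob("episode_frames/*.png"))  # hypothetical directory
num_frames = 96
step = max(len(frame_paths) / num_frames, 1)
frames = [
    Image.open(frame_paths[int(i * step)]).convert("RGB")
    for i in range(min(num_frames, len(frame_paths)))
]

instruction = "Navigate to the object you interacted with immediately after the mug."  # example task
messages = [{
    "role": "user",
    "content": [{"type": "image"} for _ in frames] + [{"type": "text", "text": instruction}],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=frames, padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # expected to contain the predicted frame index (or indices)
```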

## 🏋️ Training details

| Property | Value |
| -------- | ----- |
| Epochs | 5 |
| Effective batch size | 32 |
| LR schedule | Cosine (LR = 5e-6, warmup ratio = 0.1) |
| Image resolution | TODO |
| Compute | 8 × A40 (48 GB) for ~18 hours |
| Input frames | 96 images |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Best checkpoint | TODO |

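Training builds on TRL (see the FindingDory-TRL repo linked above). Purely as an illustration of how the table's hyperparameters could be expressed, here is a hedged `SFTConfig` sketch; the per-device batch / gradient-accumulation split, precision, and output directory are assumptions rather than the released recipe.

```python
# Illustrative mapping of the table above onto a TRL SFTConfig (not the exact released recipe).
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="qwen2.5-vl-3b-findingdory-sft",  # assumption
    num_train_epochs=5,
    per_device_train_batch_size=1,       # assumption: 8 GPUs x 1 x grad_accum 4 -> effective batch 32
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    bf16=True,                           # assumption: mixed precision on A40s
)
```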

## 📊 Evaluation

We compare the performance of our finetuned `FindingDory-Qwen2.5-VL-3B-SFT` checkpoint against other models below:

| Model | High-level Success Rate | Notes |
| ----- | ----------------------- | ----- |
| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
| Gemma3-12B-it | 13.2% | zero-shot |
| GPT-4o | 27.3% | zero-shot |
| Gemini-2.0-Flash | 25.4% | zero-shot |

Check out Fig. 2 in the paper for more details.

## 📄 Citation

```bibtex
@article{yadav2025findingdory,
  title   = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
  author  = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
  journal = {arXiv preprint arXiv:2506.15635},
  year    = {2025}
}
```