Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
Abstract
Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (how agents memorize, update, and retrieve long-term information), remains under-evaluated due to a lack of benchmarks. We refer to agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored to static, long-context settings such as book-based QA, and thus do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmark covers all four competencies. We therefore introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. It combines reformulated existing datasets with newly constructed ones, covers all four memory competencies, and provides a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
⚙️ MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
MemoryAgentBench is a unified benchmark framework for comprehensively evaluating the memory capabilities of LLM agents. Through four core competencies (Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Conflict Resolution) and an incremental multi-turn interaction design, it exposes the limitations and shortcomings of current memory agents and compares performance across a range of memory agents.
Four Core Competencies for Evaluation
What capabilities does AI need to truly "remember"? We argue that merely storing and retrieving information is far from sufficient. The memory system needs to possess four key competencies:
1. Accurate Retrieval (AR)
This is the most fundamental capability: precisely locating the required information in a long dialogue history. For instance, after hours of conversation with an AI, if you ask about a detail mentioned three hours earlier, can it find it quickly and accurately? This requires not only single-hop retrieval but also multi-hop reasoning.
2. Test-Time Learning (TTL)
Truly intelligent systems should be able to continuously learn new skills during interactions. For example, if you teach an AI a new classification method through a few examples, can it flexibly apply this in subsequent conversations? This "learning-while-using" capability is crucial for building adaptive AI.
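One way to picture test-time learning (a minimal sketch, not the benchmark's implementation) is an agent that stores taught examples in memory and replays them as few-shot demonstrations at answer time; the `call_llm` function below is a hypothetical wrapper around any chat-completion API.

```python
# Minimal sketch: test-time learning by accumulating few-shot demonstrations.
class FewShotMemory:
    def __init__(self):
        self.examples = []  # (input_text, label) pairs taught during the dialogue

    def add_example(self, text, label):
        self.examples.append((text, label))

    def classify(self, query, call_llm):
        # Replay every demonstration seen so far, then ask for the new label.
        demos = "\n".join(f"Input: {t}\nLabel: {l}" for t, l in self.examples)
        prompt = f"{demos}\nInput: {query}\nLabel:"
        return call_llm(prompt).strip()  # call_llm is a hypothetical LLM wrapper
```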
3. Long-Range Understanding (LRU)
Unlike fragmented information retrieval, long-range understanding requires AI to form global cognition. Just like after reading a novel, you not only remember specific plot points but also understand the overall narrative and character relationships. AI needs to abstract high-level understanding from long conversations.
4. Conflict Resolution (CR)
Information in the real world is dynamic. When users say "I changed jobs" or "this theory has been disproven," AI must identify and update outdated information rather than simply accumulating old and new knowledge.
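A minimal way to picture conflict resolution (an illustrative sketch, not the paper's method) is a fact store keyed by subject and relation, where a newer statement overwrites the stale value instead of being appended next to it:

```python
# Illustrative conflict-resolving fact store: later statements replace earlier ones.
class FactStore:
    def __init__(self):
        self.facts = {}  # (subject, relation) -> (value, turn_index)

    def update(self, subject, relation, value, turn):
        key = (subject, relation)
        # Keep only the most recent statement about this fact.
        if key not in self.facts or turn > self.facts[key][1]:
            self.facts[key] = (value, turn)

    def query(self, subject, relation):
        entry = self.facts.get((subject, relation))
        return entry[0] if entry else None

store = FactStore()
store.update("user", "employer", "Acme Corp", turn=3)
store.update("user", "employer", "Globex", turn=57)   # "I changed jobs"
assert store.query("user", "employer") == "Globex"    # old value is replaced, not kept
```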
Careful Dataset Design
From "feeding data" to "simulating real interactions," MemoryAgentBench demonstrates ingenuity in dataset design: The research team both adapted existing datasets and created two new ones. All data is split into chunks to simulate real multi-turn interaction scenarios—just like your daily conversations with an AI assistant, where information accumulates gradually rather than being injected all at once.
1. Newly Constructed Datasets:
EventQA: Requires AI to understand temporal event chains in novels and predict "what happens next".
FactConsolidation: Specifically designed to test conflict resolution capabilities, including single-hop and multi-hop difficulty levels.
Notably, the team adopted an "inject once, query multiple times" design: one long text corresponds to multiple questions, which significantly improves evaluation efficiency.
2. Unified Evaluation Protocol:
Memory Construction Phase → Incremental chunk input → Build/Update memory
Query Execution Phase → Pose questions → Answer based on memory → Evaluate accuracy
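In code, the two-phase protocol might look like the sketch below; the agent interface (`memorize`, `answer`) and the scoring function are assumptions for illustration, not the benchmark's actual API.

```python
# Sketch of the two-phase protocol: incremental memory construction, then querying.
def evaluate(agent, chunks, qa_pairs, score_fn):
    # Phase 1: memory construction - chunks arrive one turn at a time.
    for chunk in chunks:
        agent.memorize(chunk)                 # build / update memory incrementally

    # Phase 2: query execution - "inject once, query multiple times".
    scores = []
    for question, reference in qa_pairs:
        prediction = agent.answer(question)   # answer using memory only
        scores.append(score_fn(prediction, reference))
    return sum(scores) / len(scores)
```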
Key Findings 🔍
1. RAG is Not a Silver Bullet 🎯
RAG shows clear advantages on accurate-retrieval tasks: even a simple BM25 retriever significantly outperforms the GPT-4o-mini baseline (100% vs. 22.8% on the NIAH-MQ task). However, RAG has a serious weakness: it performs poorly on tasks requiring global understanding, because it can only retrieve local information fragments.
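For reference, a BM25 retrieve-then-read baseline of the kind evaluated here can be sketched with the `rank_bm25` package; the whitespace tokenization and the `call_llm` reader are simplifying assumptions.

```python
from rank_bm25 import BM25Okapi

def bm25_answer(chunks, question, call_llm, top_k=5):
    # Index the accumulated chunks with BM25 (whitespace tokenization for simplicity).
    tokenized = [c.lower().split() for c in chunks]
    bm25 = BM25Okapi(tokenized)

    # Retrieve the top-k chunks most relevant to the question.
    retrieved = bm25.get_top_n(question.lower().split(), chunks, n=top_k)

    # Read: answer only from the retrieved fragments - hence the weak global understanding.
    context = "\n\n".join(retrieved)
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```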
2. Long Context ≠ Universal Solution 🔑
Although GPT-4.1-mini supports contexts of up to a million tokens, it does not achieve top performance across tasks: it reaches only 45.8% accuracy on ∞Bench-QA, and its computational overhead increases linearly with context length.
3. Commercial Systems Fall Short of Expectations 😔
Three primary factors lead to poor performance of commercial memory systems. First, severe information loss—Mem0 compresses information by extracting "facts," resulting in substantial context loss. Second, limited retrieval mechanisms—while MemGPT supports multiple retrieval iterations, it lacks temporal and structural metadata. Third, absence of global perspective—these methods cannot reconstruct complete documents, performing particularly poorly on long-range understanding tasks.
4. Conflict Resolution Remains Challenging ⚠️
For single-hop conflict resolution, memory agents built with GPT-4o achieve only 60% accuracy. In multi-hop conflict resolution scenarios, all methods achieve single-digit accuracy rates (at most 7%), highlighting this as a critical bottleneck for current memory systems.
5. Ablation Studies Reveal Optimization Directions 🔬
Balancing Chunk Size: Smaller chunks (512 tokens) benefit accurate retrieval tasks (RULER-QA accuracy reaches 90%), while larger chunks (4096 tokens) better preserve semantic coherence for continuous text understanding. Dynamic chunk size adjustment based on task type is recommended.
Marginal Effects of Top-K: Increasing K from 2 to 10 yields significant performance gains for accurate retrieval tasks (BM25 improves from 49.5% to 61%), but shows limited impact on learning tasks, indicating that simply increasing retrieval volume is not a panacea.
Computational Latency Gaps: The difference in computational overhead between simple and complex systems is staggering: Mem0's memory-construction time is 20,000x that of BM25, and with 512-token chunks as memory input, Cognee needs 3.3 hours to process a single long-context sample. From a practical deployment perspective, commercial systems must balance performance against efficiency.
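The chunk-size and top-k knobs from the ablations above can be made concrete with a token-level chunker; the use of `tiktoken` and the exact configurations below are assumptions for illustration.

```python
import tiktoken

def chunk_by_tokens(text, chunk_size=512, encoding_name="cl100k_base"):
    """Split text into fixed-size token chunks (e.g., 512 or 4096 tokens)."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

# Smaller chunks with larger top-k favor accurate retrieval;
# larger chunks preserve coherence for long-range understanding.
retrieval_config = {"chunk_size": 512, "top_k": 10}
understanding_config = {"chunk_size": 4096, "top_k": 2}
```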
Conclusion 📌
MemoryAgentBench marks significant progress toward systematically evaluating LLM memory mechanisms. By assessing four core competencies together, it reveals for the first time the limitations of current state-of-the-art methods in dynamic memory updates and long-range consistency, and it provides a standardized evaluation framework for building AI agents with genuine memory capabilities. In the future, we will collect more realistic real-world conversation data to further enrich the benchmark's diversity and authenticity, and explore comprehensive memory architectures that balance accurate retrieval, test-time learning, long-range understanding, and conflict resolution.
📄 Paper: https://arxiv.org/pdf/2507.05257
💻 Code: https://github.com/HUST-AI-HYZ/MemoryAgentBench
📚 Datasets: https://huggingface.co/datasets/ai-hyz/MemoryAgentBench