MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Abstract
Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Experiments show that MemServe significantly improves job completion time and time-to-first-token (TTFT).
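To make the MemPool abstraction concrete, below is a minimal, hypothetical sketch of what such an API surface could look like: handles to KV-cache segments living in CPU DRAM or GPU HBM, a lookup path for context caching, and a transfer path for disaggregated inference. All class and method names (KVHandle, MemPool.put/match/transfer) are illustrative assumptions, not MemServe's published interface.

```python
# Hypothetical sketch of a MemPool-style API surface; the real MemServe
# interface is not reproduced here, so every name below is illustrative.
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class KVHandle:
    """Opaque handle to a KV-cache segment held somewhere in the pool."""
    instance_id: str   # serving instance that owns the memory
    device: str        # "gpu_hbm" or "cpu_dram"
    prefix_hash: int   # identifies which prompt prefix it caches
    num_tokens: int


class MemPool:
    """Elastic pool of CPU DRAM and GPU HBM across instances (illustrative)."""

    def __init__(self) -> None:
        self._index: Dict[int, KVHandle] = {}

    def put(self, handle: KVHandle) -> None:
        """Register a KV segment so later requests can reuse it (context caching)."""
        self._index[handle.prefix_hash] = handle

    def match(self, prefix_hash: int) -> Optional[KVHandle]:
        """Look up a cached prefix; None means a cold prefill is required."""
        return self._index.get(prefix_hash)

    def transfer(self, handle: KVHandle, dst_instance: str) -> KVHandle:
        """Model moving KV state, e.g. from a prefill to a decode instance."""
        return KVHandle(dst_instance, handle.device,
                        handle.prefix_hash, handle.num_tokens)
```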
We propose Memory-enhanced model Serving, or MemServe, to handle inter-request and intra-request optimizations within a unified system. To tackle the challenges of managing the KV cache across distributed instances, MemServe introduces an elastic memory pool, or MemPool, a substrate for managing all cluster memory, including CPU DRAM and GPU HBM. MemPool offers a rich set of APIs for managing distributed memory and KV cache. Utilizing these APIs, MemServe implements context caching over standard prefill-decode-colocated (PD-colocated) instances and disaggregated inference. Moreover, MemServe enhances disaggregated inference with context caching, reaping the benefits of both. Finally, to maximize KV cache reuse, MemServe employs a global scheduler that incorporates a locality-aware policy using novel global prompt trees for best-effort routing.
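As a rough illustration of how a global prompt tree could drive locality-aware, best-effort routing, the sketch below builds a token-level trie that records which instance caches which prefix and routes each request to the instance holding the longest cached prefix. The data structure and tie-breaking policy are assumptions for illustration, not MemServe's actual implementation.

```python
# Minimal sketch of locality-aware routing over a global prompt tree.
# The trie below is an assumed design; MemServe's real structures may differ.
from typing import Dict, List, Set, Tuple


class PromptTreeNode:
    def __init__(self) -> None:
        self.children: Dict[int, "PromptTreeNode"] = {}
        self.instances: Set[str] = set()  # instances caching KV up to this token


class GlobalPromptTree:
    """Token-level trie recording which instance caches which prompt prefix."""

    def __init__(self) -> None:
        self.root = PromptTreeNode()

    def insert(self, tokens: List[int], instance_id: str) -> None:
        """Record that instance_id holds KV cache for this token prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PromptTreeNode())
            node.instances.add(instance_id)

    def best_instance(self, tokens: List[int]) -> Tuple[str, int]:
        """Return (instance with the longest cached prefix, matched length)."""
        node, best = self.root, ("", 0)
        for depth, tok in enumerate(tokens, start=1):
            if tok not in node.children:
                break
            node = node.children[tok]
            if node.instances:
                best = (next(iter(node.instances)), depth)
        return best


# Best-effort routing: prefer the longest prefix hit, otherwise fall back
# to any instance chosen by the scheduler's load-balancing policy.
tree = GlobalPromptTree()
tree.insert([1, 2, 3, 4], "prefill-0")
instance, matched = tree.best_instance([1, 2, 3, 9])
print(instance, matched)  # -> prefill-0 3  (the first three tokens are cached)
```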