Taming the Titans: A Survey of Efficient LLM Inference Serving
Abstract
Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent research has produced a rapidly growing body of techniques that address these challenges. This paper provides a comprehensive survey of these techniques, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous yet important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.
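To make one of the instance-level ideas above concrete, the following is a minimal, hypothetical Python sketch of decoding length prediction feeding into request scheduling: requests with shorter predicted outputs are served first (a shortest-job-first policy), which reduces head-of-line blocking from long generations. The `Request` class and `predict_decode_len` predictor are illustrative placeholders, not APIs from the survey or any particular serving framework.

```python
# Illustrative sketch only (not from the survey): shortest-predicted-job-first
# request scheduling driven by a hypothetical decoding-length predictor.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    prompt: str


def predict_decode_len(req: Request) -> int:
    """Hypothetical predictor. A real serving system would use a small
    learned model or history-based heuristics to estimate how many
    tokens the response is likely to need."""
    return max(16, 4 * len(req.prompt.split()))


def schedule(pending: list[Request]) -> list[Request]:
    """Order pending requests by predicted decode length (shortest first)
    so short requests are not blocked behind long generations."""
    return sorted(pending, key=predict_decode_len)


if __name__ == "__main__":
    pending = [
        Request(0, "Write a detailed essay on GPU cluster scheduling"),
        Request(1, "Hi"),
        Request(2, "Summarize the abstract above in one sentence"),
    ]
    for req in schedule(pending):
        print(req.request_id, predict_decode_len(req), req.prompt)
```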
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- Seesaw: High-throughput LLM Inference via Model Re-sharding (2025)
- DynaServe: Unified and Elastic Tandem-Style Execution for Dynamic Disaggregated LLM Serving (2025)
- PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices (2025)
- HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference (2025)
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference (2025)
- Collaborative Speculative Inference for Efficient LLM Inference Serving (2025)
- FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework (2025)