sLLM: Accelerating LLM Inference using Semantic Load Balancing with Shared Memory Data Structures

Jieyu Lin, Sai Qian Zhang, Alberto Leon-Garcia

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As Large Language Models (LLMs) are increasingly deployed to support a broad spectrum of applications, enhancing inference efficiency and minimizing costs have become critical areas of focus. To address these challenges, researchers have explored optimizing the Key-Value (KV) cache within LLMs. However, existing approaches do not consider the potential benefits of sharing KV caches across multiple requests in a cluster environment. To address this gap, we introduce sLLM, a novel system that integrates an efficient shared-memory-based Semantic Load Balancer with a KV cache sharing mechanism. This design significantly reduces recomputation during LLM inference, thereby improving inference performance. Our evaluation demonstrates sLLM's effectiveness: the Semantic Load Balancer achieves up to a 7× reduction in request-dispatching latency, while the system as a whole decreases the Time-To-First-Token (TTFT) of LLM inference by 30–58%.
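
To make the dispatching idea concrete, the following is a minimal sketch, not the authors' implementation, of a prefix-aware dispatcher in the spirit of a Semantic Load Balancer: requests whose prompts share an already-cached prefix are routed to the worker holding that prefix's KV cache, so prefill recomputation can be skipped. All names (Dispatcher, PREFIX_CHUNK) are hypothetical, and a plain in-process dict stands in for the paper's shared-memory index; in the paper's setting that index would live in shared memory so multiple dispatcher processes could consult it without extra IPC.

    import hashlib

    # Prefix granularity in characters; a real system would count tokens (assumed).
    PREFIX_CHUNK = 16

    class Dispatcher:
        def __init__(self, num_workers):
            self.loads = [0] * num_workers  # in-flight requests per worker
            self.prefix_owner = {}          # prefix hash -> worker holding that KV cache

        def _prefix_hashes(self, prompt):
            # Hash successive prefix blocks, longest first, so the longest
            # already-cached prefix wins.
            longest = len(prompt) - len(prompt) % PREFIX_CHUNK
            for end in range(longest, 0, -PREFIX_CHUNK):
                yield hashlib.sha256(prompt[:end].encode()).hexdigest()

        def dispatch(self, prompt):
            # Prefer the worker that already caches the longest shared prefix;
            # otherwise fall back to the least-loaded worker.
            worker = None
            for h in self._prefix_hashes(prompt):
                if h in self.prefix_owner:
                    worker = self.prefix_owner[h]
                    break
            if worker is None:
                worker = min(range(len(self.loads)), key=self.loads.__getitem__)
            # Record every prefix block of this prompt as now cached there.
            for h in self._prefix_hashes(prompt):
                self.prefix_owner.setdefault(h, worker)
            self.loads[worker] += 1
            return worker

    if __name__ == "__main__":
        d = Dispatcher(num_workers=4)
        a = d.dispatch("Translate to French: good morning")
        b = d.dispatch("Translate to French: good evening")
        print(a == b)  # True: the shared prefix routes both to the same worker

A real deployment would also need eviction (removing prefix entries when a worker drops its KV cache) and load-aware tie-breaking; those details are omitted here for brevity.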

Original language: English (US)
Title of host publication: Proceedings of the 25th International Symposium on Quality Electronic Design, ISQED 2024
Publisher: IEEE Computer Society
ISBN (Electronic): 9798350309270
DOIs
State: Published - 2024
Event: 25th International Symposium on Quality Electronic Design, ISQED 2024 - Hybrid, San Francisco, United States
Duration: Apr 3, 2024 – Apr 5, 2024

Publication series

Name: Proceedings - International Symposium on Quality Electronic Design, ISQED
ISSN (Print): 1948-3287
ISSN (Electronic): 1948-3295

Conference

Conference: 25th International Symposium on Quality Electronic Design, ISQED 2024
Country/Territory: United States
City: Hybrid, San Francisco
Period: 4/3/24 – 4/5/24


ASJC Scopus subject areas

  • Hardware and Architecture
  • Electrical and Electronic Engineering
  • Safety, Risk, Reliability and Quality
