Low Latency RNN Inference with Cellular Batching

Pin Gao, Yongwei Wu, Lingfan Yu, Jinyang Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Performing inference on pre-trained neural network models must meet the requirement of low latency, which is often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but they do not perform well when serving Recurrent Neural Networks (RNNs) with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN “cell” (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves much lower latency and also higher throughput than existing systems.
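The core idea of the abstract, batching at the granularity of a cell so that requests can join and leave between steps, can be illustrated with a small simulation. The Python sketch below is a hypothetical illustration, not BatchMaker's implementation: it assumes a single Elman-style RNN cell type (h' = tanh(Wx + Uh)), time measured in cell steps, and a fixed maximum batch size. All names (`Request`, `cellular_batching`, `graph_batching`, `latencies`, and so on) are invented for this example. The `graph_batching` function is a simplified stand-in for systems that batch a fixed set of dataflow graphs.

```python
import math
from dataclasses import dataclass, field

# Hypothetical sketch: one Elman-style RNN cell, h' = tanh(W x + U h),
# scheduled one cell step at a time. Not BatchMaker's actual implementation.


@dataclass
class Request:
    rid: int
    arrival: int                 # step at which the request arrives
    inputs: list                 # one input vector per RNN step
    pos: int = 0                 # index of the next cell to execute
    hidden: list = field(default_factory=list)
    finish: int = -1             # step whose cell produced the final output


def batched_cell(W, U, xs, hs):
    """Run one cell for every row of the batch; all rows share W and U."""
    out = []
    for x, h in zip(xs, hs):
        out.append([math.tanh(sum(w * v for w, v in zip(W[i], x))
                              + sum(u * v for u, v in zip(U[i], h)))
                    for i in range(len(W))])
    return out


def step_rows(rows, W, U):
    """Execute the next cell of each request in `rows` as a single batch."""
    outs = batched_cell(W, U, [r.inputs[r.pos] for r in rows], [r.hidden for r in rows])
    for r, h in zip(rows, outs):
        r.hidden, r.pos = h, r.pos + 1


def cellular_batching(requests, W, U, max_batch=4):
    """Re-form the batch at every cell step: arrivals join, finished requests leave."""
    pending = sorted(requests, key=lambda r: r.arrival)
    active, done, step = [], [], 0
    while pending or active:
        while pending and pending[0].arrival <= step and len(active) < max_batch:
            r = pending.pop(0)
            r.hidden = [0.0] * len(W)
            active.append(r)
        if active:
            step_rows(active, W, U)
            finished = [r for r in active if r.pos == len(r.inputs)]
            for r in finished:       # results return as soon as the last cell runs
                r.finish = step
                active.remove(r)
            done.extend(finished)
        step += 1
    return done


def graph_batching(requests, W, U, max_batch=4):
    """Baseline: a fixed batch runs to completion before any new request joins."""
    pending = sorted(requests, key=lambda r: r.arrival)
    done, step = [], 0
    while pending:
        step = max(step, pending[0].arrival)
        batch = []
        while pending and pending[0].arrival <= step and len(batch) < max_batch:
            r = pending.pop(0)
            r.hidden = [0.0] * len(W)
            batch.append(r)
        for _ in range(max(len(r.inputs) for r in batch)):
            step_rows([r for r in batch if r.pos < len(r.inputs)], W, U)
            step += 1
        for r in batch:              # results leave only when the whole batch ends
            r.finish = step - 1
        done.extend(batch)
    return done


def latencies(done):
    """Cell steps from arrival to final output, keyed by request id."""
    return {r.rid: r.finish - r.arrival + 1 for r in sorted(done, key=lambda r: r.rid)}


W = [[0.5, -0.2], [0.1, 0.3]]        # input-to-hidden weights (2x2), arbitrary
U = [[0.2, 0.0], [0.0, 0.2]]         # hidden-to-hidden weights (2x2), arbitrary


def make_requests():
    # A long request (5 cells) at step 0 and a short one (2 cells) at step 1.
    return [Request(0, 0, [[1.0, 0.0]] * 5), Request(1, 1, [[0.0, 1.0]] * 2)]


if __name__ == "__main__":
    print(latencies(cellular_batching(make_requests(), W, U)))  # {0: 5, 1: 2}
    print(latencies(graph_batching(make_requests(), W, U)))     # {0: 5, 1: 6}
```

In this toy workload the short request joins the running batch at its next cell step and finishes in 2 steps under cellular batching, but under the fixed-batch baseline it waits for the long request's batch to drain and takes 6 steps. Both schedules compute identical hidden states, because each row of the batched cell is computed independently with the same shared weights; only the scheduling differs.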

Original language: English (US)
Title of host publication: Proceedings of the 13th EuroSys Conference, EuroSys 2018
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450355841
DOI: https://doi.org/10.1145/3190508.3190541
State: Published - Apr 23 2018
Event: 13th EuroSys Conference, EuroSys 2018 - Porto, Portugal
Duration: Apr 23 2018 to Apr 26 2018

Publication series

Name: Proceedings of the 13th EuroSys Conference, EuroSys 2018
Volume: 2018-January

Other

Other: 13th EuroSys Conference, EuroSys 2018
Country: Portugal
City: Porto
Period: 4/23/18 to 4/26/18

Keywords

  • Batching
  • Dataflow Graph
  • Inference
  • Recurrent Neural Network

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Hardware and Architecture
  • Computer Networks and Communications


Cite this

Gao, P., Wu, Y., Yu, L., & Li, J. (2018). Low Latency RNN Inference with Cellular Batching. In Proceedings of the 13th EuroSys Conference, EuroSys 2018 (Proceedings of the 13th EuroSys Conference, EuroSys 2018; Vol. 2018-January). Association for Computing Machinery, Inc. https://doi.org/10.1145/3190508.3190541