TY - GEN
T1 - Low Latency RNN Inference with Cellular Batching
AU - Gao, Pin
AU - Wu, Yongwei
AU - Yu, Lingfan
AU - Li, Jinyang
N1 - Funding Information:
This paper was greatly improved by the advice of our extraordinary colleagues Minjie Wang, Chien-Chin Huang, and Cheng Tan. We thank the anonymous reviewers and our shepherd, Gustavo Alonso, for their valuable comments and helpful suggestions. We thank Shizhen Xu for sharing his knowledge of GPUs. The NVIDIA DGX-1 used for this research was donated by NVIDIA Corporation. This work is supported by the NVIDIA AI Lab (NVAIL) and GPU Center of Excellence, the National Key Research & Development Program of China (2016YFB1000504), the Natural Science Foundation of China (61433008, 61373145, 61572280, 61133004, 61502019, U1435216), and the National Basic Research (973) Program of China (2014CB340402). Pin Gao’s work is also supported by the China Scholarship Council.
PY - 2018/4/23
Y1 - 2018/4/23
AB - Performing inference on pre-trained neural network models must meet low-latency requirements, which are often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but this approach does not perform well when serving Recurrent Neural Networks (RNNs) with dynamic dataflow graphs. We propose cellular batching, a technique that improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN “cell” (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves much lower latency and higher throughput than existing systems.
KW - Batching
KW - Dataflow Graph
KW - Inference
KW - Recurrent Neural Network
UR - http://www.scopus.com/inward/record.url?scp=85052014907&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052014907&partnerID=8YFLogxK
U2 - 10.1145/3190508.3190541
DO - 10.1145/3190508.3190541
M3 - Conference contribution
AN - SCOPUS:85052014907
T3 - Proceedings of the 13th EuroSys Conference, EuroSys 2018
BT - Proceedings of the 13th EuroSys Conference, EuroSys 2018
PB - Association for Computing Machinery, Inc
T2 - 13th EuroSys Conference, EuroSys 2018
Y2 - 23 April 2018 through 26 April 2018
ER -