TY - GEN
T1 - Supporting very large models using automatic dataflow graph partitioning
AU - Wang, Minjie
AU - Huang, Chien-Chin
AU - Li, Jinyang
N1 - Funding Information:
This work is supported in part by the National Science Foundation under award CNS-1816717, the NVIDIA AI Lab (NVAIL) at NYU, and AWS cloud credits for research. Our shepherd, Chris De Sa, and the anonymous reviewers provided helpful feedback that improved this work. We also thank Jeff Hammond for pointing us to related work in the HPC community, especially work on tensor contraction engines.
Publisher Copyright:
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/3/25
Y1 - 2019/3/25
N2 - This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow. To automatically partition each operator, we propose describing the semantics of an operator in a simple language inspired by Halide. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves a 25%-400% speedup over alternative approaches to training very large models.
AB - This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow. To automatically partition each operator, we propose describing the semantics of an operator in a simple language inspired by Halide. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves a 25%-400% speedup over alternative approaches to training very large models.
UR - http://www.scopus.com/inward/record.url?scp=85063904164&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063904164&partnerID=8YFLogxK
U2 - 10.1145/3302424.3303953
DO - 10.1145/3302424.3303953
M3 - Conference contribution
AN - SCOPUS:85063904164
T3 - Proceedings of the 14th EuroSys Conference 2019
BT - Proceedings of the 14th EuroSys Conference 2019
PB - Association for Computing Machinery, Inc
T2 - 14th European Conference on Computer Systems, EuroSys 2019
Y2 - 25 March 2019 through 28 March 2019
ER -