TY - GEN
T1 - Supporting very large models using automatic dataflow graph partitioning
AU - Wang, Minjie
AU - Huang, Chien-Chin
AU - Li, Jinyang
N1 - Publisher Copyright: © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/3/25
Y1 - 2019/3/25
N2 - This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow. In order to automatically partition each operator, we propose to describe the semantics of an operator in a simple language inspired by Halide. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves 25% - 400% speedup over alternative approaches to train very large models.
AB - This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow. In order to automatically partition each operator, we propose to describe the semantics of an operator in a simple language inspired by Halide. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves 25% - 400% speedup over alternative approaches to train very large models.
UR - http://www.scopus.com/inward/record.url?scp=85063904164&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063904164&partnerID=8YFLogxK
U2 - 10.1145/3302424.3303953
DO - 10.1145/3302424.3303953
M3 - Conference contribution
AN - SCOPUS:85063904164
T3 - Proceedings of the 14th EuroSys Conference 2019
BT - Proceedings of the 14th EuroSys Conference 2019
PB - Association for Computing Machinery, Inc
T2 - 14th European Conference on Computer Systems, EuroSys 2019
Y2 - 25 March 2019 through 28 March 2019
ER -