TY - GEN
T1 - A Framework for distributed deep neural network training with heterogeneous computing platforms
AU - Gu, Bontak
AU - Kong, Joonho
AU - Munir, Arslan
AU - Kim, Young Geun
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/12
Y1 - 2019/12
N2 - Deep neural network (DNN) training is generally performed on cloud computing platforms. However, cloud-based training has several problems, such as network bottlenecks, server management costs, and privacy concerns. One of the most promising solutions to these problems is distributed DNN model training, which trains the model using not only high-performance servers but also low-end, power-efficient mobile edge or user devices. However, due to the lack of a framework that can provide an optimal cluster configuration (i.e., determine which computing devices participate in DNN training tasks), it is difficult to perform efficient DNN model training that accounts for DNN service providers' preferences, such as training time or energy efficiency. In this paper, we introduce a novel framework for distributed DNN training that determines the best training cluster configuration from the available heterogeneous computing resources. Our proposed framework performs pre-training with a small number of training steps and estimates training time, power, energy, and energy-delay product (EDP) for each possible training cluster configuration. Based on the estimated metrics, our framework performs DNN training for the remaining steps with the best cluster configuration chosen according to the DNN service provider's preference. Our framework is implemented in TensorFlow and evaluated with three heterogeneous computing platforms and five widely used DNN models. According to our experimental results, our framework chooses the best cluster configuration for the DNN service provider's preference in 76.67% of the cases, with only a small training time overhead.
AB - Deep neural network (DNN) training is generally performed on cloud computing platforms. However, cloud-based training has several problems, such as network bottlenecks, server management costs, and privacy concerns. One of the most promising solutions to these problems is distributed DNN model training, which trains the model using not only high-performance servers but also low-end, power-efficient mobile edge or user devices. However, due to the lack of a framework that can provide an optimal cluster configuration (i.e., determine which computing devices participate in DNN training tasks), it is difficult to perform efficient DNN model training that accounts for DNN service providers' preferences, such as training time or energy efficiency. In this paper, we introduce a novel framework for distributed DNN training that determines the best training cluster configuration from the available heterogeneous computing resources. Our proposed framework performs pre-training with a small number of training steps and estimates training time, power, energy, and energy-delay product (EDP) for each possible training cluster configuration. Based on the estimated metrics, our framework performs DNN training for the remaining steps with the best cluster configuration chosen according to the DNN service provider's preference. Our framework is implemented in TensorFlow and evaluated with three heterogeneous computing platforms and five widely used DNN models. According to our experimental results, our framework chooses the best cluster configuration for the DNN service provider's preference in 76.67% of the cases, with only a small training time overhead.
KW - Deep neural network
KW - Distributed processing
KW - Edge computing
KW - Energy efficiency
KW - Training time
UR - http://www.scopus.com/inward/record.url?scp=85078899832&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078899832&partnerID=8YFLogxK
U2 - 10.1109/ICPADS47876.2019.00068
DO - 10.1109/ICPADS47876.2019.00068
M3 - Conference contribution
AN - SCOPUS:85078899832
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 430
EP - 437
BT - Proceedings - 2019 IEEE 25th International Conference on Parallel and Distributed Systems, ICPADS 2019
PB - IEEE Computer Society
T2 - 25th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2019
Y2 - 4 December 2019 through 6 December 2019
ER -