A Framework for distributed deep neural network training with heterogeneous computing platforms

Bontak Gu, Joonho Kong, Arslan Munir, Young Geun Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Deep neural network (DNN) training is generally performed on cloud computing platforms. However, cloud-based training has several problems such as network bottlenecks, server management costs, and privacy concerns. To overcome these problems, one of the most promising solutions is distributed DNN model training, which trains the model not only on high-performance servers but also on low-end, power-efficient mobile edge or user devices. However, without a framework that can determine an optimal cluster configuration (i.e., which computing devices participate in the DNN training task), it is difficult to perform efficient DNN model training that reflects DNN service providers' preferences such as training time or energy efficiency. In this paper, we introduce a novel framework for distributed DNN training that determines the best training cluster configuration from the available heterogeneous computing resources. Our proposed framework performs pre-training with a small number of training steps and estimates training time, power, energy, and energy-delay product (EDP) for each possible training cluster configuration. Based on the estimated metrics, our framework performs DNN training for the remaining steps with the best cluster configuration for the DNN service provider's preference. Our framework is implemented in TensorFlow and evaluated with three heterogeneous computing platforms and five widely used DNN models. According to our experimental results, our framework chooses the best cluster configuration for the DNN service provider's preference in 76.67% of the cases, with only a small training time overhead.
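The selection step described in the abstract can be illustrated with a minimal sketch (not the authors' implementation): given per-configuration estimates of training time, power, and energy obtained from the short pre-training runs, EDP is computed as energy multiplied by training time, and the configuration minimizing the metric that matches the service provider's preference is selected. All names below (ClusterEstimate, select_configuration, the example device sets and numbers) are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class ClusterEstimate:
        """Metrics estimated from a short pre-training run on one candidate
        cluster configuration (a subset of the available heterogeneous devices)."""
        devices: tuple          # e.g. ("server_gpu", "edge_board")
        time_s: float           # estimated time for the remaining training steps
        power_w: float          # estimated average power draw
        energy_j: float         # estimated energy consumption

        @property
        def edp(self) -> float:
            # Energy-delay product: energy (J) multiplied by training time (s)
            return self.energy_j * self.time_s

    def select_configuration(estimates, preference="time"):
        """Pick the cluster configuration that minimizes the preferred metric."""
        key = {
            "time": lambda e: e.time_s,
            "power": lambda e: e.power_w,
            "energy": lambda e: e.energy_j,
            "edp": lambda e: e.edp,
        }[preference]
        return min(estimates, key=key)

    # Hypothetical estimates for three candidate configurations
    candidates = [
        ClusterEstimate(("server",), time_s=1200.0, power_w=250.0, energy_j=300000.0),
        ClusterEstimate(("server", "edge"), time_s=900.0, power_w=280.0, energy_j=252000.0),
        ClusterEstimate(("edge", "mobile"), time_s=2400.0, power_w=40.0, energy_j=96000.0),
    ]

    best = select_configuration(candidates, preference="edp")
    print(best.devices, best.edp)

In this sketch, a time-oriented provider would get the server-plus-edge cluster, while an EDP- or energy-oriented provider would get the low-power edge-plus-mobile cluster; the actual framework additionally accounts for the pre-training overhead incurred while collecting these estimates.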

Original language: English (US)
Title of host publication: Proceedings - 2019 IEEE 25th International Conference on Parallel and Distributed Systems, ICPADS 2019
Publisher: IEEE Computer Society
Pages: 430-437
Number of pages: 8
ISBN (Electronic): 9781728125831
State: Published - Dec 2019
Event: 25th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2019 - Tianjin, China
Duration: Dec 4 2019 - Dec 6 2019

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
Volume: 2019-December
ISSN (Print): 1521-9097

Conference

Conference: 25th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2019
Country/Territory: China
City: Tianjin
Period: 12/4/19 - 12/6/19

Keywords

  • Deep neural network
  • Distributed processing
  • Edge computing
  • Energy efficiency
  • Training time

ASJC Scopus subject areas

  • Hardware and Architecture
