TY - JOUR
T1 - SuperSlash
T2 - A Unified Design Space Exploration and Model Compression Methodology for Design of Deep Learning Accelerators with Reduced Off-Chip Memory Access Volume
AU - Ahmad, Hazoor
AU - Arif, Tabasher
AU - Hanif, Muhammad Abdullah
AU - Hafiz, Rehan
AU - Shafique, Muhammad
N1 - Funding Information:
Manuscript received April 17, 2020; revised June 12, 2020; accepted July 6, 2020. Date of publication October 2, 2020; date of current version October 27, 2020. This work was supported by the Higher Education Commission of Pakistan, through NRPU Project “AxVision—Application-Specific Data-Aware, Approximate-Computing for Energy Efficient Image and Vision Processing Applications,” under Grant 10150. This article was presented at the International Conference on Hardware/Software Codesign and System Synthesis 2020 and appears as part of the ESWEEK-TCAD special issue. (Corresponding author: Rehan Hafiz.) Hazoor Ahmad, Tabasher Arif, and Rehan Hafiz are with the Faculty of Engineering, Information Technology University, Lahore 54700, Pakistan (e-mail: [email protected]; [email protected]; [email protected]).
Publisher Copyright:
© 1982-2012 IEEE.
PY - 2020/11
Y1 - 2020/11
AB - Deploying deep learning (DL) models on resource-constrained embedded devices is a challenging task. The limited on-chip memory of such devices increases the off-chip memory access volume, thus limiting the size of the DL models that can be efficiently realized on such systems. Design space exploration (DSE) under a memory constraint, or to achieve minimal off-chip memory access volume, has recently received much attention. Unfortunately, DSE alone cannot reduce the number of off-chip memory accesses beyond a certain point because the model size is fixed. Model compression via pruning can be employed to reduce the size of the model and the associated off-chip memory accesses. However, in this article, we demonstrate that pruned models with the same accuracy and model size may still require different numbers of off-chip memory accesses depending on the pruning strategy adopted. Thus, mainstream pruning techniques may not be closely tied to the design goals and are therefore hard to integrate with existing DSE techniques. To overcome this problem, we propose SuperSlash, a unified solution for DSE and model compression. SuperSlash estimates the off-chip memory access volume overhead of each layer of a DL model by exploring multiple design candidates. In particular, it evaluates multiple data reuse strategies for each layer, along with the possibility of layer fusion. Layer fusion reduces the off-chip memory access volume by avoiding intermediate off-chip storage of a layer's output and using it directly for the processing of the subsequent layer. SuperSlash then guides the pruning process via a ranking function, which ranks each layer according to its explored off-chip memory access cost. We demonstrate that SuperSlash not only offers extensive design space coverage but also achieves lower off-chip memory access volume (up to 57.71%, 25.83%, 47.73%, and 29.02% reduction for VGG16, ResNet56, ResNet110, and MobileNetV1, respectively) compared to the state of the art.
KW - Accelerators
KW - deep neural network (DNN)
KW - design space exploration (DSE)
KW - model compression
KW - off-chip memory access volume
KW - optimization
KW - pruning
UR - http://www.scopus.com/inward/record.url?scp=85096034061&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096034061&partnerID=8YFLogxK
DO - 10.1109/TCAD.2020.3012865
M3 - Article
AN - SCOPUS:85096034061
SN - 0278-0070
VL - 39
SP - 4191
EP - 4204
JO - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
JF - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
IS - 11
M1 - 9211496
ER -