TY - GEN
T1 - On-chip deep neural network storage with multi-level eNVM
AU - Donato, Marco
AU - Reagen, Brandon
AU - Pentecost, Lillian
AU - Gupta, Udit
AU - Brooks, David
AU - Wei, Gu-Yeon
N1 - Funding Information:
This work was supported in part by the Center for Applications Driving Architectures (ADA), one of six centers of JUMP, a Semiconductor Research Corporation program co-sponsored by DARPA. The work was also partially supported by the U.S. Government under the DARPA CRAFT and DARPA PERFECT programs. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/6/24
Y1 - 2018/6/24
N2 - One of the biggest performance bottlenecks of today's neural network (NN) accelerators is off-chip memory accesses [11]. In this paper, we propose a method to use multi-level, embedded nonvolatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm², well within the area allocated to SRAM in modern NN accelerators [6].
AB - One of the biggest performance bottlenecks of today's neural network (NN) accelerators is off-chip memory accesses [11]. In this paper, we propose a method to use multi-level, embedded nonvolatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm², well within the area allocated to SRAM in modern NN accelerators [6].
UR - http://www.scopus.com/inward/record.url?scp=85053658463&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85053658463&partnerID=8YFLogxK
U2 - 10.1145/3195970.3196083
DO - 10.1145/3195970.3196083
M3 - Conference contribution
AN - SCOPUS:85053658463
SN - 9781450357005
T3 - Proceedings - Design Automation Conference
BT - Proceedings of the 55th Annual Design Automation Conference, DAC 2018
PB - Association for Computing Machinery
T2 - 55th Annual Design Automation Conference, DAC 2018
Y2 - 24 June 2018 through 29 June 2018
ER -