TY - GEN
T1 - Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices
T2 - 34th ACM International Conference on Supercomputing, ICS 2020
AU - He, Xin
AU - Pal, Subhankar
AU - Amarnath, Aporva
AU - Feng, Siying
AU - Park, Dong Hyeon
AU - Rovinski, Austin
AU - Ye, Haojie
AU - Chen, Yuhan
AU - Dreslinski, Ronald
AU - Mudge, Trevor
N1 - Funding Information:
The material is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7864. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government.
Publisher Copyright:
© 2020 ACM.
PY - 2020/6/29
Y1 - 2020/6/29
N2 - While systolic arrays are widely used for dense-matrix operations, they are seldom used for sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and-Accumulate (MAC) units, similar to Google's Tensor Processing Unit (TPU), can be adapted to efficiently handle sparse matrices. TPU-like accelerators are built upon a 2D array of MAC units and have demonstrated high throughput and efficiency for dense matrix multiplication, which is a key kernel in machine learning algorithms and is the target of the TPU. In this work, we employ a co-designed approach of first developing a packing technique to condense a sparse matrix and then proposing a systolic-array-based system, Sparse-TPU, abbreviated to STPU, to handle the computations on the packed, denser matrix counterparts. To demonstrate the efficacy of our co-designed approach, we evaluate sparse matrix-vector multiplication on a broad set of synthetic and real-world sparse matrices. Experimental results show that STPU delivers 16.08X higher performance while consuming 4.39X and 19.79X lower energy for the integer (int8) and floating-point (float32) implementations, respectively, over a TPU baseline. Meanwhile, STPU incurs a 12.93% area overhead and an average 4.14% increase in dynamic energy over the TPU baseline for the float32 implementation.
AB - While systolic arrays are widely used for dense-matrix operations, they are seldom used for sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and-Accumulate (MAC) units, similar to Google's Tensor Processing Unit (TPU), can be adapted to efficiently handle sparse matrices. TPU-like accelerators are built upon a 2D array of MAC units and have demonstrated high throughput and efficiency for dense matrix multiplication, which is a key kernel in machine learning algorithms and is the target of the TPU. In this work, we employ a co-designed approach of first developing a packing technique to condense a sparse matrix and then proposing a systolic-array-based system, Sparse-TPU, abbreviated to STPU, to handle the computations on the packed, denser matrix counterparts. To demonstrate the efficacy of our co-designed approach, we evaluate sparse matrix-vector multiplication on a broad set of synthetic and real-world sparse matrices. Experimental results show that STPU delivers 16.08X higher performance while consuming 4.39X and 19.79X lower energy for the integer (int8) and floating-point (float32) implementations, respectively, over a TPU baseline. Meanwhile, STPU incurs a 12.93% area overhead and an average 4.14% increase in dynamic energy over the TPU baseline for the float32 implementation.
KW - application-specific hardware
KW - hardware accelerators
KW - hardware-software codesign
KW - sparse matrix condensing
KW - sparse matrix processing
KW - systolic array
UR - http://www.scopus.com/inward/record.url?scp=85088536385&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85088536385&partnerID=8YFLogxK
U2 - 10.1145/3392717.3392751
DO - 10.1145/3392717.3392751
M3 - Conference contribution
AN - SCOPUS:85088536385
T3 - Proceedings of the International Conference on Supercomputing
BT - Proceedings of the 34th ACM International Conference on Supercomputing, ICS 2020
PB - Association for Computing Machinery
Y2 - 29 June 2020 through 2 July 2020
ER -