TY - GEN
T1 - Sparse-TPU
T2 - 34th ACM International Conference on Supercomputing, ICS 2020
AU - He, Xin
AU - Pal, Subhankar
AU - Amarnath, Aporva
AU - Feng, Siying
AU - Park, Dong Hyeon
AU - Rovinski, Austin
AU - Ye, Haojie
AU - Chen, Yuhan
AU - Dreslinski, Ronald
AU - Mudge, Trevor
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/6/29
Y1 - 2020/6/29
N2 - While systolic arrays are widely used for dense-matrix operations, they are seldom used for sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and-Accumulate (MAC) units, similar to Google's Tensor Processing Unit (TPU), can be adapted to handle sparse matrices efficiently. TPU-like accelerators are built upon a 2D array of MAC units and have demonstrated high throughput and efficiency for dense matrix multiplication, which is a key kernel in machine learning algorithms and is the target of the TPU. In this work, we employ a co-design approach: we first develop a packing technique to condense a sparse matrix, and then propose a systolic-array-based system, Sparse-TPU (STPU), to perform the matrix computations on the packed, denser matrix. To demonstrate the efficacy of our co-designed approach, we evaluate sparse matrix-vector multiplication on a broad set of synthetic and real-world sparse matrices. Experimental results show that STPU delivers 16.08X higher performance while consuming 4.39X and 19.79X lower energy for integer (int8) and floating point (float32) implementations, respectively, over a TPU baseline. Meanwhile, STPU has 12.93% area overhead and an average of 4.14% increase in dynamic energy over the TPU baseline for the float32 implementation.
KW - application-specific hardware
KW - hardware accelerators
KW - hardware-software codesign
KW - sparse matrix condensing
KW - sparse matrix processing
KW - systolic array
UR - http://www.scopus.com/inward/record.url?scp=85088536385&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85088536385&partnerID=8YFLogxK
U2 - 10.1145/3392717.3392751
DO - 10.1145/3392717.3392751
M3 - Conference contribution
AN - SCOPUS:85088536385
T3 - Proceedings of the International Conference on Supercomputing
BT - Proceedings of the 34th ACM International Conference on Supercomputing, ICS 2020
PB - Association for Computing Machinery
Y2 - 29 June 2020 through 2 July 2020
ER -