TY - JOUR
T1 - Row-Wise Product-Based Sparse Matrix Multiplication Hardware Accelerator With Optimal Load Balancing
AU - Lee, Jong Hun
AU - Park, Beomjin
AU - Kong, Joonho
AU - Munir, Arslan
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2022
Y1 - 2022
N2 - Matrix multiplication is a main computation kernel of emerging workloads, such as deep neural networks and graph analytics. These workloads often exhibit high sparsity in data, which means a large portion of the elements in the data are zero-valued elements. Though systolic arrays have shown a significant performance and energy efficiency improvement over central processing units (CPUs) or graphic processing units (GPUs) when executing matrix multiplications, data sparsity is largely overlooked in the conventional systolic arrays. In this paper, we propose a row-wise product-based sparse matrix multiplication (SpMM) hardware accelerator for compressed sparse row (CSR)-formatted input matrices. Our hardware accelerator leverages row-wise product, which has advantages over inner-product or outer-product when executing the sparse matrix multiplications. As compared to the conventional systolic arrays, which cannot skip the ineffectual operations, our hardware accelerator only performs effectual operations with non-zero elements, improving the performance when executing SpMM. In addition, we also propose an optimal load balancing scheme when using multiple processing elements (PEs). Our load balancing scheme utilizes an operation count-based matrix tiling for parallel execution of the PEs and resource contention avoidance. According to our evaluation, our 32PE-SpMM accelerator shows 13.6× - 47.9× speedup over tensor processing unit (TPU)-like systolic arrays, on average. Furthermore, our operation count-based load balancing scheme shows better performance over the fixed tiling and non-zero element count-based tiling by up to 8.48% and 6.28%, respectively, with only up to 0.06% matrix tiling pre-processing latency overhead.
AB - Matrix multiplication is a main computation kernel of emerging workloads, such as deep neural networks and graph analytics. These workloads often exhibit high sparsity in data, which means a large portion of the elements in the data are zero-valued elements. Though systolic arrays have shown a significant performance and energy efficiency improvement over central processing units (CPUs) or graphic processing units (GPUs) when executing matrix multiplications, data sparsity is largely overlooked in the conventional systolic arrays. In this paper, we propose a row-wise product-based sparse matrix multiplication (SpMM) hardware accelerator for compressed sparse row (CSR)-formatted input matrices. Our hardware accelerator leverages row-wise product, which has advantages over inner-product or outer-product when executing the sparse matrix multiplications. As compared to the conventional systolic arrays, which cannot skip the ineffectual operations, our hardware accelerator only performs effectual operations with non-zero elements, improving the performance when executing SpMM. In addition, we also propose an optimal load balancing scheme when using multiple processing elements (PEs). Our load balancing scheme utilizes an operation count-based matrix tiling for parallel execution of the PEs and resource contention avoidance. According to our evaluation, our 32PE-SpMM accelerator shows 13.6× - 47.9× speedup over tensor processing unit (TPU)-like systolic arrays, on average. Furthermore, our operation count-based load balancing scheme shows better performance over the fixed tiling and non-zero element count-based tiling by up to 8.48% and 6.28%, respectively, with only up to 0.06% matrix tiling pre-processing latency overhead.
KW - load balancing
KW - matrix tiling
KW - row-wise product
KW - Sparse matrix multiplication
KW - speedup
UR - http://www.scopus.com/inward/record.url?scp=85132690878&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85132690878&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2022.3184116
DO - 10.1109/ACCESS.2022.3184116
M3 - Article
AN - SCOPUS:85132690878
SN - 2169-3536
VL - 10
SP - 64547
EP - 64559
JO - IEEE Access
JF - IEEE Access
ER -