TY - GEN
T1 - FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding
T2 - 28th Annual IEEE International Symposium on High-Performance Computer Architecture, HPCA 2022
AU - Zhang, Sai Qian
AU - McDanel, Bradley
AU - Kung, H. T.
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training by providing a wide dynamic range via a shared exponent across a group of values. In this paper, we propose a Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP. FAST supports matrix multiplication with variable-precision BFP input operands, enabling incremental increases in DNN precision throughout training. By increasing the BFP precision across both training iterations and DNN layers, FAST can greatly shorten the training time while reducing overall hardware resource usage. Our FAST Multiplier-Accumulator (fMAC) supports dot product computations under multiple BFP precisions. We validate our FAST system on multiple DNNs with different datasets, demonstrating a 2-6× training speedup on a single-chip platform over prior work based on mixed-precision or block floating point number systems while achieving similar validation accuracy.
AB - Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training by providing a wide dynamic range via a shared exponent across a group of values. In this paper, we propose a Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP. FAST supports matrix multiplication with variable-precision BFP input operands, enabling incremental increases in DNN precision throughout training. By increasing the BFP precision across both training iterations and DNN layers, FAST can greatly shorten the training time while reducing overall hardware resource usage. Our FAST Multiplier-Accumulator (fMAC) supports dot product computations under multiple BFP precisions. We validate our FAST system on multiple DNNs with different datasets, demonstrating a 2-6× training speedup on a single-chip platform over prior work based on mixed-precision or block floating point number systems while achieving similar validation accuracy.
UR - http://www.scopus.com/inward/record.url?scp=85130709575&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85130709575&partnerID=8YFLogxK
U2 - 10.1109/HPCA53966.2022.00067
DO - 10.1109/HPCA53966.2022.00067
M3 - Conference contribution
AN - SCOPUS:85130709575
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 846
EP - 860
BT - Proceedings - 2022 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2022
PB - IEEE Computer Society
Y2 - 2 April 2022 through 6 April 2022
ER -