TY - GEN
T1 - Term quantization
T2 - 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020
AU - Kung, H. T.
AU - McDanel, Bradley
AU - Zhang, Sai Qian
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
AB - We present a novel technique, called Term Quantization (TQ), for furthering quantization at run time to improve the computational efficiency of deep neural networks (DNNs) already quantized with conventional quantization methods. TQ operates on power-of-two terms in expressions of values. When computing a dot product, TQ dynamically selects a fixed number of the largest terms to use from the values of the two vectors. By exploiting the weight and data distributions typically present in DNNs, TQ has a minimal impact on DNN model performance (e.g., accuracy or perplexity). We use TQ to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We evaluate TQ on an MLP for MNIST, multiple CNNs for ImageNet, and an LSTM for Wikitext-2. We demonstrate significant reductions in inference computation costs (between 3-10x) compared to conventional uniform quantization at the same level of model performance.
KW - accelerator
KW - Deep neural network (DNN)
KW - quantization
UR - http://www.scopus.com/inward/record.url?scp=85101309204&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85101309204&partnerID=8YFLogxK
U2 - 10.1109/SC41405.2020.00100
DO - 10.1109/SC41405.2020.00100
M3 - Conference contribution
AN - SCOPUS:85101309204
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2020
PB - IEEE Computer Society
Y2 - 9 November 2020 through 19 November 2020
ER -
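
The abstract's core operation, truncating a quantized value to its largest power-of-two terms before a dot product, can be illustrated with a short sketch. The snippet below is a simplified, hypothetical per-value variant (the paper's TQ selects a fixed term budget dynamically across the values of a group, not per value); the function names are illustrative and are not the authors' code.

    # Minimal sketch of term truncation in the spirit of Term Quantization (TQ).
    # Assumption: per-value truncation; the paper applies a shared budget per group.

    def top_k_terms(x, k):
        """Keep only the k largest power-of-two terms of |x|, preserving sign."""
        sign = -1 if x < 0 else 1
        mag = abs(x)
        kept = 0
        for _ in range(k):
            if mag == 0:
                break
            msb = 1 << (mag.bit_length() - 1)  # largest remaining power-of-two term
            kept += msb
            mag -= msb
        return sign * kept

    def tq_dot(a, b, k):
        """Dot product with each operand truncated to its k largest terms."""
        return sum(top_k_terms(x, k) * top_k_terms(y, k) for x, y in zip(a, b))

    # Example with 8-bit quantized values, keeping at most 2 terms per value:
    # 117 = 64+32+16+4+1 is truncated to 64+32 = 96.
    a = [117, -42, 7, 88]
    b = [13, 56, -91, 3]
    print(tq_dot(a, b, k=2))

Because each truncated operand carries at most k nonzero terms, a multiply reduces to at most k*k shift-and-add operations, which is the source of the fixed per-element work that makes TQ amenable to tightly synchronized arrays such as systolic arrays.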