TY - GEN
T1 - Training for multi-resolution inference using reusable quantization terms
AU - Zhang, Sai Qian
AU - McDanel, Bradley
AU - Kung, H. T.
AU - Dong, Xin
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/4/19
Y1 - 2021/4/19
AB - Low-resolution uniform quantization (e.g., 4-bit bitwidth) for both Deep Neural Network (DNN) weights and data has emerged as an important technique for efficient inference. Departing from conventional quantization, we describe a novel training approach to support inference at multiple resolutions by reusing a single set of quantization terms (the same set of nonzero bits in values). The proposed approach streamlines the training and supports dynamic selection of resolution levels during inference. We evaluate the method on a diverse range of applications, including multiple CNNs on ImageNet, an LSTM on WikiText-2, and YOLO-v5 on COCO. We show that models resulting from our multi-resolution training can support up to 10 resolutions with only a moderate performance reduction (e.g., ≤ 1%) compared to training them individually. Lastly, using an FPGA, we compare our multi-resolution multiplier-accumulator (mMAC) against conventional MAC designs and evaluate the inference performance. We show that the mMAC design broadens the choices in trading off cost, efficiency, and latency across a range of computational budgets.
KW - co-design
KW - deep neural networks
KW - joint-optimization training
KW - multi-resolution inference
KW - quantization
KW - systolic arrays
UR - http://www.scopus.com/inward/record.url?scp=85104783641&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104783641&partnerID=8YFLogxK
U2 - 10.1145/3445814.3446741
DO - 10.1145/3445814.3446741
M3 - Conference contribution
AN - SCOPUS:85104783641
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 845
EP - 860
BT - Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2021
PB - Association for Computing Machinery
T2 - 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2021
Y2 - 19 April 2021 through 23 April 2021
ER -