TY - GEN
T1 - The architectural implications of Facebook's DNN-based personalized recommendation
AU - Gupta, Udit
AU - Wu, Carole-Jean
AU - Wang, Xiaodong
AU - Naumov, Maxim
AU - Reagen, Brandon
AU - Brooks, David
AU - Cottel, Bradford
AU - Hazelwood, Kim
AU - Hempstead, Mark
AU - Jia, Bill
AU - Lee, Hsien-Hsin S.
AU - Malevich, Andrey
AU - Mudigere, Dheevatsa
AU - Smelyanskiy, Mikhail
AU - Xiong, Liang
AU - Zhang, Xuan
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/2
Y1 - 2020/2
AB - The widespread application of deep learning has changed the landscape of computation in data centers. In particular, personalized recommendation for content ranking is now largely accomplished using deep neural networks. However, despite their importance and the amount of compute cycles they consume, relatively little research attention has been devoted to recommendation systems. To facilitate research and advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inference jobs can drastically improve latency-bounded throughput, and diversity across recommendation models leads to different optimization strategies.
UR - http://www.scopus.com/inward/record.url?scp=85084172326&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084172326&partnerID=8YFLogxK
DO - 10.1109/HPCA47549.2020.00047
M3 - Conference contribution
AN - SCOPUS:85084172326
T3 - Proceedings - 2020 IEEE International Symposium on High Performance Computer Architecture, HPCA 2020
SP - 488
EP - 501
BT - Proceedings - 2020 IEEE International Symposium on High Performance Computer Architecture, HPCA 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 26th IEEE International Symposium on High Performance Computer Architecture, HPCA 2020
Y2 - 22 February 2020 through 26 February 2020
ER -