TY - GEN
T1 - RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing
T2 - 47th ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2020
AU - Ke, Liu
AU - Gupta, Udit
AU - Cho, Benjamin Youngjae
AU - Brooks, David
AU - Chandra, Vikas
AU - Diril, Utku
AU - Firoozshahian, Amin
AU - Hazelwood, Kim
AU - Jia, Bill
AU - Lee, Hsien-Hsin S.
AU - Li, Meng
AU - Maher, Bert
AU - Mudigere, Dheevatsa
AU - Naumov, Maxim
AU - Schatz, Martin
AU - Smelyanskiy, Mikhail
AU - Wang, Xiaodong
AU - Reagen, Brandon
AU - Wu, Carole-Jean
AU - Hempstead, Mark
AU - Zhang, Xuan
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, providing up to 9.8× memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2× throughput improvement and 45.8% memory energy savings.
AB - Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, providing up to 9.8× memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2× throughput improvement and 45.8% memory energy savings.
UR - http://www.scopus.com/inward/record.url?scp=85091973291&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85091973291&partnerID=8YFLogxK
U2 - 10.1109/ISCA45697.2020.00070
DO - 10.1109/ISCA45697.2020.00070
M3 - Conference contribution
AN - SCOPUS:85091973291
T3 - Proceedings - International Symposium on Computer Architecture
SP - 790
EP - 803
BT - Proceedings - 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture, ISCA 2020
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 30 May 2020 through 3 June 2020
ER -