TY - GEN
T1 - Efficient utilization of GPGPU cache hierarchy
AU - Khairy, Mahmoud
AU - Zahran, Mohamed
AU - Wassal, Amr G.
N1 - Publisher Copyright: Copyright 2015 ACM.
PY - 2015/2/7
Y1 - 2015/2/7
N2 - Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of some irregular GPGPU applications. However, due to massive multithreading, GPGPU caches suffer from severe resource contention and low data sharing, which may degrade performance instead. In this work, we propose three techniques to efficiently utilize and improve the performance of GPGPU caches. The first technique aims to dynamically detect and bypass memory accesses that show streaming behavior. In the second technique, we propose dynamic warp throttling via cores sampling (DWT-CS) to alleviate cache thrashing by throttling the number of active warps per core. DWT-CS monitors the MPKI at L1; when it exceeds a specific threshold, all GPU cores are sampled with different numbers of active warps to find the optimal number of warps that mitigates thrashing and achieves the highest performance. Our proposed third technique addresses the problem of GPU cache associativity, since many GPGPU applications suffer from severe associativity stalls and conflict misses. Prior work proposed cache bypassing on associativity stalls. In this work, instead of bypassing, we employ a better cache indexing function, Pseudo Random Interleaving Cache (PRIC), that is based on polynomial modulus mapping, in order to fairly and evenly distribute memory accesses over cache sets. The proposed techniques improve the average performance of streaming and contention applications by 1.2X and 2.3X, respectively. Compared to prior work, they achieve 1.7X and 1.5X performance improvement over Cache-Conscious Wavefront Scheduler and Memory Request Prioritization Buffer, respectively.
AB - Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of some irregular GPGPU applications. However, due to massive multithreading, GPGPU caches suffer from severe resource contention and low data sharing, which may degrade performance instead. In this work, we propose three techniques to efficiently utilize and improve the performance of GPGPU caches. The first technique aims to dynamically detect and bypass memory accesses that show streaming behavior. In the second technique, we propose dynamic warp throttling via cores sampling (DWT-CS) to alleviate cache thrashing by throttling the number of active warps per core. DWT-CS monitors the MPKI at L1; when it exceeds a specific threshold, all GPU cores are sampled with different numbers of active warps to find the optimal number of warps that mitigates thrashing and achieves the highest performance. Our proposed third technique addresses the problem of GPU cache associativity, since many GPGPU applications suffer from severe associativity stalls and conflict misses. Prior work proposed cache bypassing on associativity stalls. In this work, instead of bypassing, we employ a better cache indexing function, Pseudo Random Interleaving Cache (PRIC), that is based on polynomial modulus mapping, in order to fairly and evenly distribute memory accesses over cache sets. The proposed techniques improve the average performance of streaming and contention applications by 1.2X and 2.3X, respectively. Compared to prior work, they achieve 1.7X and 1.5X performance improvement over Cache-Conscious Wavefront Scheduler and Memory Request Prioritization Buffer, respectively.
KW - Cache bypassing
KW - Cache management
KW - Conflict-avoiding
KW - GPGPU
KW - Warp throttling
UR - http://www.scopus.com/inward/record.url?scp=84938828873&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84938828873&partnerID=8YFLogxK
U2 - 10.1145/2716282.2716291
DO - 10.1145/2716282.2716291
M3 - Conference contribution
AN - SCOPUS:84938828873
T3 - ACM International Conference Proceeding Series
SP - 36
EP - 47
BT - ACM International Conference Proceeding Series
A2 - Gong, Xiang
PB - Association for Computing Machinery
T2 - 8th Annual Workshop on General Purpose Processing using Graphics Processing Unit, GPGPU 2015
Y2 - 7 February 2015
ER -