TY - GEN
T1 - Towards Scalable Lifetime Reliability Management for Dark Silicon Manycore Systems
AU - Rathore, Vijeta
AU - Chaturvedi, Vivek
AU - Singh, Amit K.
AU - Srikanthan, Thambipillai
AU - Shafique, Muhammad
N1 - Funding Information:
The contributions of coauthor Dr. Shafique to this work are supported in part by the German Research Foundation (DFG) as part of the GetSURE project within the SPP-1500 priority program "Dependable Embedded Systems".
Publisher Copyright:
© 2019 IEEE.
PY - 2019/7
Y1 - 2019/7
AB - Aggressive technology scaling has enabled very high integration density. Unfortunately, it has also led to issues such as process variation and increased power density, and consequently to rising chip temperatures, which accelerate device aging and degrade the lifetime reliability of components in a manycore system. Moreover, thermal and power limitations allow only a fraction of the chip to operate at full speed; the rest is dark silicon. Most lifetime reliability enhancement solutions for multi-/manycore systems in the literature are heuristic-based, while some use standard compute-intensive methods to solve the optimization problem, and so do not scale well with manycore size. The heuristic-based solutions search a fine-grained, and therefore huge, design space, which limits their scalability. These approaches also neither account for the impact of different applications' execution behavior on the aging of the underlying cores nor exploit the distribution of performance requirements across the cores. In this paper, we present our resource management strategies towards building scalable lifetime reliability enhancement solutions for dark silicon manycore systems. The first technique, the Hierarchical Mapping approach (HiMap), maps a periodic workload using a block-based hierarchical method that leverages dark cores for thermal mitigation. The second approach, LifeGuard, uses reinforcement learning to learn the applications' aging behavior and is aware of how the performance requirement pattern maps onto the core frequencies. It maps randomly arriving requests and scales with the number of applications and the size of the manycore.
KW - aging
KW - hierarchical
KW - lifetime reliability
KW - manycore systems
KW - mapping
KW - reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85073742260&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073742260&partnerID=8YFLogxK
U2 - 10.1109/IOLTS.2019.8854454
DO - 10.1109/IOLTS.2019.8854454
M3 - Conference contribution
AN - SCOPUS:85073742260
T3 - 2019 IEEE 25th International Symposium on On-Line Testing and Robust System Design, IOLTS 2019
SP - 204
EP - 207
BT - 2019 IEEE 25th International Symposium on On-Line Testing and Robust System Design, IOLTS 2019
A2 - Gizopoulos, Dimitris
A2 - Alexandrescu, Dan
A2 - Papavramidou, Panagiota
A2 - Maniatakos, Michail
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th IEEE International Symposium on On-Line Testing and Robust System Design, IOLTS 2019
Y2 - 1 July 2019 through 3 July 2019
ER -