TY - JOUR
T1 - A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs
AU - Vitkovskiy, Arseniy
AU - Soteriou, Vassos
AU - Nicopoulos, Chrysostomos
N1 - Funding Information:
Manuscript received September 9, 2011; revised January 21, 2012; accepted February 3, 2012. Date of current version July 18, 2012. This work was supported by a startup research grant provided by the Cyprus University of Technology and the Cyprus Research Promotion Foundation’s Grant TΠE/ΠΛHPO/0609(BIE)/09, cofunded by the Republic of Cyprus and the European Regional Development Fund. This paper was recommended by Associate Editor D. Atienza.
PY - 2012
Y1 - 2012
N2 - The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock, intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty, whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit at a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.
AB - The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock, intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty, whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit at a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.
KW - Fault-tolerance
KW - Networks-on-chip (NoCs)
KW - On-chip interconnection networks
KW - Router microarchitecture
KW - Routing algorithm
UR - http://www.scopus.com/inward/record.url?scp=84864116812&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84864116812&partnerID=8YFLogxK
U2 - 10.1109/TCAD.2012.2188801
DO - 10.1109/TCAD.2012.2188801
M3 - Article
AN - SCOPUS:84864116812
SN - 0278-0070
VL - 31
SP - 1235
EP - 1248
JO - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
JF - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
IS - 8
M1 - 6238398
ER -