A dynamically adjusting gracefully degrading link-level fault-tolerant mechanism for NoCs

Arseniy Vitkovskiy, Vassos Soteriou, Chrysostomos Nicopoulos

Research output: Contribution to journal › Article › peer-review

Abstract

The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock, intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links either at manufacture-time or at runtime (dynamically). As opposed to marking an entire parallel link as faulty, whenever some wires fail, the proposed methodology employs these partially-faulty links (PFLs) to continue the transfer of information, albeit at a gracefully degraded mode, in order to maintain network connectivity. Furthermore, the presented technique can designate PFLs as fully-faulty when several wires fail, by utilizing appropriate routing algorithms that bypass nonoperational links, while still maintaining load-balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.

Original language: English (US)
Article number: 6238398
Pages (from-to): 1235-1248
Number of pages: 14
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Volume: 31
Issue number: 8
State: Published - 2012

Keywords

  • Fault-tolerance
  • Networks-on-chip (NoCs)
  • On-chip interconnection networks
  • Router microarchitecture
  • Routing algorithm

ASJC Scopus subject areas

  • Software
  • Computer Graphics and Computer-Aided Design
  • Electrical and Electronic Engineering
