Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors

Bharath Srinivas Prabakaran, Mihika Dave, Florian Kriebel, Semeen Rehman, Muhammad Shafique

Research output: Contribution to journalArticlepeer-review

Abstract

State-of-the-art reliability techniques and mechanisms deploy full-scale redundancy, like double or triple modular redundancy (DMR, TMR), on different layers of the computing stack to detect and/or correct such transient faults. However, the techniques relying on full-scale redundancy incur significant area, performance, and/or power overheads, which might not always be feasible/practical due to system constraints such as deadlines and available power budget for the full chip (or a processor core). In this work, we propose a novel design methodology to generate and explore the architectural-space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes enable varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system while reducing the corresponding power overheads (or solving the inverse problem, i.e., maximizing the reliability under a given power constraint). Our experimental results show that a pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively. To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques like Distributed Multi-threaded Checkpointing (DMTCP), Hash-based Incremental Checkpointing (HBICT) and GNU zip, such that the correct processor state can be recovered once a fault is detected. We reduced the checkpoint sizes by a factor of ∼ 6× using a unique combination of different state compression techniques.

Original languageEnglish (US)
Article number8859234
Pages (from-to)145324-145339
Number of pages16
JournalIEEE Access
Volume7
DOIs
StatePublished - 2019

Keywords

  • architecture
  • AVF
  • checkpointing
  • design space exploration
  • fault-tolerance
  • hardening
  • heterogeneity
  • microprocessors
  • multi-cores
  • out-of-order
  • Reliability
  • resilience
  • superscalar

ASJC Scopus subject areas

  • Computer Science(all)
  • Materials Science(all)
  • Engineering(all)

Fingerprint Dive into the research topics of 'Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors'. Together they form a unique fingerprint.

Cite this