Abstract
State-of-the-art reliability techniques deploy full-scale redundancy, such as double or triple modular redundancy (DMR, TMR), at different layers of the computing stack to detect and/or correct transient faults. However, techniques that rely on full-scale redundancy incur significant area, performance, and/or power overheads, which may not be feasible under system constraints such as deadlines and the available power budget of the full chip (or a processor core). In this work, we propose a novel design methodology to generate and explore the architectural space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes offer varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system while reducing the corresponding power overheads (or to solve the inverse problem, i.e., maximizing the reliability under a given power constraint). Our experimental results show that a Pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively. To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques, such as Distributed Multi-threaded Checkpointing (DMTCP), Hash-based Incremental Checkpointing (HBICT), and GNU zip, so that the correct processor state can be recovered once a fault is detected. Using a unique combination of these state compression techniques, we reduced the checkpoint sizes by a factor of ~6×.
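The run-time selection described in the abstract (meet a reliability requirement at minimum power, or the inverse problem) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Mode` type, the function names, the mode names, and all numeric values below are hypothetical and chosen only to show the idea of filtering a set of reliability modes to its Pareto front and then picking a configuration under a constraint.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mode:
    """Hypothetical reliability mode; both metrics are 'lower is better'."""
    name: str
    vulnerability: float  # normalized core vulnerability (1.0 = unprotected)
    power: float          # normalized power (1.0 = unprotected)


def pareto_front(modes: List[Mode]) -> List[Mode]:
    """Keep only modes that no other mode dominates in both metrics."""
    front = []
    for m in modes:
        dominated = any(
            o.vulnerability <= m.vulnerability
            and o.power <= m.power
            and (o.vulnerability < m.vulnerability or o.power < m.power)
            for o in modes
        )
        if not dominated:
            front.append(m)
    return front


def min_power_mode(modes: List[Mode], max_vulnerability: float) -> Optional[Mode]:
    """Lowest-power Pareto mode that meets the vulnerability requirement."""
    feasible = [m for m in pareto_front(modes) if m.vulnerability <= max_vulnerability]
    return min(feasible, key=lambda m: m.power, default=None)


def min_vulnerability_mode(modes: List[Mode], max_power: float) -> Optional[Mode]:
    """Inverse problem: most reliable Pareto mode within the power budget."""
    feasible = [m for m in pareto_front(modes) if m.power <= max_power]
    return min(feasible, key=lambda m: m.vulnerability, default=None)


# Invented example values, not results from the paper.
MODES = [
    Mode("unprotected", 1.00, 1.00),
    Mode("partial_a", 0.40, 1.20),
    Mode("partial_b", 0.45, 1.30),  # dominated by partial_a
    Mode("partial_c", 0.15, 1.50),
    Mode("full_tmr", 0.02, 3.00),
]

if __name__ == "__main__":
    print(min_power_mode(MODES, max_vulnerability=0.2))
    print(min_vulnerability_mode(MODES, max_power=1.25))
```

Both selectors return `None` when no mode satisfies the constraint, which leaves the caller to decide how to handle an infeasible requirement.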
Original language | English (US) |
---|---|
Article number | 8859234 |
Pages (from-to) | 145324-145339 |
Number of pages | 16 |
Journal | IEEE Access |
Volume | 7 |
State | Published - 2019 |
Keywords
- architecture
- AVF
- checkpointing
- design space exploration
- fault-tolerance
- hardening
- heterogeneity
- microprocessors
- multi-cores
- out-of-order
- reliability
- resilience
- superscalar
ASJC Scopus subject areas
- Computer Science(all)
- Materials Science(all)
- Engineering(all)