Two-State Checkpointing for Energy-Efficient Fault Tolerance in Hard Real-Time Systems

Mohammad Salehi, Mohammad Khavari Tavana, Semeen Rehman, Muhammad Shafique, Alireza Ejlali, Jörg Henkel

Research output: Contribution to journal, Article, peer-review

Abstract

Checkpointing with rollback recovery is a well-established technique for tolerating transient faults. However, it incurs significant time and energy overheads, which are wasted in fault-free execution states and may even make the technique infeasible in hard real-time systems. This paper presents a low-overhead two-state checkpointing (TsCp) scheme for fault-tolerant hard real-time systems. It differentiates between the fault-free and faulty execution states and uses a different type of checkpoint interval for each state. The first type is nonuniform intervals, which are used while no fault has occurred. These intervals are determined by postponing checkpoint insertions in fault-free states, with the aim of decreasing the number of checkpoint insertions. The second type is uniform intervals, which are used from the time the first fault occurs. They are determined so as to minimize execution time in faulty states, leaving more time available for energy management in fault-free states. Experimental evaluation on an embedded processor (LEON3) and an emerging nonvolatile memory technology (ReRAM) shows that TsCp significantly reduces the number of checkpoints (62% on average) compared with previous works, while preserving fault tolerance. This reduces execution time and energy consumption by 14% and 13%, respectively. Furthermore, we combine TsCp with dynamic voltage scaling (DVS) and achieve up to 26% (21% on average) energy saving compared with state-of-the-art techniques.
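
The idea of choosing uniform intervals to minimize execution time in faulty states can be illustrated with a small sketch. It does not reproduce the TsCp formulation, which the abstract does not give. It assumes a common textbook cost model in which a task with `work` time units is split into `n` equal intervals, each checkpoint costs `overhead` time units, and each of `faults` transient faults forces one interval to be re-executed. All names and the cost model itself are assumptions made for illustration.

```python
import math


def worst_case_time(work, overhead, faults, n):
    """Worst-case completion time with n equal intervals (assumed toy model).

    Fault-free work, plus n checkpoint saves, plus one re-executed
    interval of length work / n for each fault.
    """
    return work + n * overhead + faults * (work / n)


def best_uniform_intervals(work, overhead, faults):
    """Return the integer n >= 1 that minimizes worst_case_time.

    Under the assumed model the real-valued minimizer is
    sqrt(faults * work / overhead); the cost is convex in n, so only
    the two neighbouring integers need to be compared.
    """
    n_real = math.sqrt(faults * work / overhead)
    candidates = sorted({max(1, math.floor(n_real)), max(1, math.ceil(n_real))})
    return min(candidates, key=lambda n: worst_case_time(work, overhead, faults, n))


if __name__ == "__main__":
    n = best_uniform_intervals(work=100.0, overhead=1.0, faults=1)
    print(n, worst_case_time(100.0, 1.0, 1, n))
```

Under this assumed model, fewer and longer intervals save checkpoint overhead but make each rollback more expensive, which is the trade-off a fault-aware interval choice has to balance.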

Original language: English (US)
Article number: 7387782
Pages (from-to): 2426-2437
Number of pages: 12
Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Volume: 24
Issue number: 7
State: Published - Jul 2016

Keywords

  • Checkpointing
  • embedded systems
  • energy management
  • fault tolerance
  • real-time
  • reliability

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Electrical and Electronic Engineering
