Abstract
Checkpointing with rollback recovery is a well-established technique to tolerate transient faults. However, it incurs significant time and energy overheads, which go wasted in fault-free execution states and may not even be feasible in hard real-time systems. This paper presents a low-overhead two-state checkpointing (TsCp) scheme for fault-tolerant hard real-time systems. It differentiates between the fault-free and faulty execution states and leverages two types of checkpoint intervals for these two different states. The first type is nonuniform intervals that are used while no fault has occurred. These intervals are determined based on postponing checkpoint insertions in fault-free states, with the aim of decreasing the number of checkpoint insertions. The second type is uniform intervals that are used from the time when the first fault occurs. They are determined so as to minimize execution time for faulty states, leaving more time available for energy management in fault-free states. Experimental evaluation on an embedded processor (LEON3) and an emerging nonvolatile memory technology (ReRAM) illustrates that TsCp significantly reduces the number of checkpoints (62% on average) compared with previous works, while preserving fault tolerance. This results in 14% and 13% reduced execution time and energy consumption, respectively. Furthermore, we combine TsCp with dynamic voltage scaling (DVS) and achieve up to 26% (21% on average) energy saving compared with the state-of-the-art techniques.
Original language | English (US) |
---|---|
Article number | 7387782 |
Pages (from-to) | 2426-2437 |
Number of pages | 12 |
Journal | IEEE Transactions on Very Large Scale Integration (VLSI) Systems |
Volume | 24 |
Issue number | 7 |
DOIs | |
State | Published - Jul 2016 |
Keywords
- Checkpointing
- embedded systems
- energy management
- fault tolerance
- real-time
- reliability
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Electrical and Electronic Engineering