TY - JOUR
T1 - Fine-Grained Checkpoint Recovery for Application-Specific Instruction-Set Processors
AU - Li, Tuo
AU - Shafique, Muhammad
AU - Ambrose, Jude Angelo
AU - Henkel, Jorg
AU - Parameswaran, Sri
N1 - Publisher Copyright:
© 1968-2012 IEEE.
Copyright:
Copyright 2018 Elsevier B.V., All rights reserved.
PY - 2017/4/1
Y1 - 2017/4/1
N2 - Checkpoint recovery (CR) is a classic fault-tolerance technique, which enables computing systems to execute correctly even when affected by transient faults. Although a number of software- and hardware-based approaches for CR exist, these approaches are usually too large or too slow, or they require extensive modifications to the software and the caching/memory schemes. In this paper, we propose a novel CR approach, which is based on re-engineering the instruction set of a target processor. We take the base instruction set and augment the native micro-operations, i.e., an architectural description language (ADL), with additional micro-operations to perform checkpointing at the granularity of basic blocks. The recovery mechanism is realized by three custom instructions, which can undo the corruption caused by transient faults during instruction execution by restoring incorrectly modified values in general-purpose registers, data memory, and special-purpose registers (PC, status registers, etc.). Our checkpoint storage is sized according to the application program executed. The experimental results show that our approach degrades the system performance by just 0.76 percent when there is no fault, and introduces an area overhead of 44 percent on average and 79 percent in the worst case. During the fault injection test with the benchmark applications, the recovery took just 62 clock cycles (worst case).
AB - Checkpoint recovery (CR) is a classic fault-tolerance technique, which enables computing systems to execute correctly even when affected by transient faults. Although a number of software- and hardware-based approaches for CR exist, these approaches are usually too large or too slow, or they require extensive modifications to the software and the caching/memory schemes. In this paper, we propose a novel CR approach, which is based on re-engineering the instruction set of a target processor. We take the base instruction set and augment the native micro-operations, i.e., an architectural description language (ADL), with additional micro-operations to perform checkpointing at the granularity of basic blocks. The recovery mechanism is realized by three custom instructions, which can undo the corruption caused by transient faults during instruction execution by restoring incorrectly modified values in general-purpose registers, data memory, and special-purpose registers (PC, status registers, etc.). Our checkpoint storage is sized according to the application program executed. The experimental results show that our approach degrades the system performance by just 0.76 percent when there is no fault, and introduces an area overhead of 44 percent on average and 79 percent in the worst case. During the fault injection test with the benchmark applications, the recovery took just 62 clock cycles (worst case).
KW - ASIP
KW - checkpoint recovery
KW - reliability
UR - http://www.scopus.com/inward/record.url?scp=85027417295&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85027417295&partnerID=8YFLogxK
U2 - 10.1109/TC.2016.2606378
DO - 10.1109/TC.2016.2606378
M3 - Article
AN - SCOPUS:85027417295
SN - 0018-9340
VL - 66
SP - 647
EP - 660
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 4
ER -