TY - GEN
T1 - Reduce
T2 - 2023 Design, Automation and Test in Europe Conference and Exhibition, DATE 2023
AU - Hanif, Muhammad Abdullah
AU - Shafique, Muhammad
N1 - Publisher Copyright:
© 2023 EDAA.
PY - 2023
Y1 - 2023
N2 - Fault-aware retraining has emerged as a prominent technique for mitigating permanent faults in Deep Neural Network (DNN) hardware accelerators. However, retraining leads to huge overheads, specifically when used for fine-tuning large DNNs designed for solving complex problems. Moreover, as each fabricated chip can have a distinct fault pattern, fault-aware retraining is required to be performed for each chip individually considering its unique fault map, which further aggravates the problem. To reduce the overall retraining cost, in this work, we introduce the concept of resilience-driven retraining amount selection. To realize this concept, we propose a novel framework, Reduce, that, at first, computes the resilience of the given DNN to faults at different fault rates and with different amounts of retraining. Then, based on the resilience, it computes the amount of retraining required for each chip considering its unique fault map. We demonstrate the effectiveness of our methodology for a systolic array-based DNN accelerator experiencing permanent faults in the computational array.
AB - Fault-aware retraining has emerged as a prominent technique for mitigating permanent faults in Deep Neural Network (DNN) hardware accelerators. However, retraining leads to huge overheads, specifically when used for fine-tuning large DNNs designed for solving complex problems. Moreover, as each fabricated chip can have a distinct fault pattern, fault-aware retraining is required to be performed for each chip individually considering its unique fault map, which further aggravates the problem. To reduce the overall retraining cost, in this work, we introduce the concept of resilience-driven retraining amount selection. To realize this concept, we propose a novel framework, Reduce, that, at first, computes the resilience of the given DNN to faults at different fault rates and with different amounts of retraining. Then, based on the resilience, it computes the amount of retraining required for each chip considering its unique fault map. We demonstrate the effectiveness of our methodology for a systolic array-based DNN accelerator experiencing permanent faults in the computational array.
UR - http://www.scopus.com/inward/record.url?scp=85162675017&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85162675017&partnerID=8YFLogxK
U2 - 10.23919/DATE56975.2023.10137268
DO - 10.23919/DATE56975.2023.10137268
M3 - Conference contribution
AN - SCOPUS:85162675017
T3 - Proceedings -Design, Automation and Test in Europe, DATE
BT - 2023 Design, Automation and Test in Europe Conference and Exhibition, DATE 2023 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 April 2023 through 19 April 2023
ER -