Reduce: A Framework for Reducing the Overheads of Fault-Aware Retraining

Muhammad Abdullah Hanif, Muhammad Shafique

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Fault-aware retraining has emerged as a prominent technique for mitigating permanent faults in Deep Neural Network (DNN) hardware accelerators. However, retraining leads to huge overheads, specifically when used for fine-tuning large DNNs designed for solving complex problems. Moreover, as each fabricated chip can have a distinct fault pattern, fault-aware retraining is required to be performed for each chip individually considering its unique fault map, which further aggravates the problem. To reduce the overall retraining cost, in this work, we introduce the concept of resilience-driven retraining amount selection. To realize this concept, we propose a novel framework, Reduce, that, at first, computes the resilience of the given DNN to faults at different fault rates and with different amounts of retraining. Then, based on the resilience, it computes the amount of retraining required for each chip considering its unique fault map. We demonstrate the effectiveness of our methodology for a systolic array-based DNN accelerator experiencing permanent faults in the computational array.

Original languageEnglish (US)
Title of host publication2023 Design, Automation and Test in Europe Conference and Exhibition, DATE 2023 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)9783981926378
DOIs
StatePublished - 2023
Event2023 Design, Automation and Test in Europe Conference and Exhibition, DATE 2023 - Antwerp, Belgium
Duration: Apr 17 2023Apr 19 2023

Publication series

NameProceedings -Design, Automation and Test in Europe, DATE
Volume2023-April
ISSN (Print)1530-1591

Conference

Conference2023 Design, Automation and Test in Europe Conference and Exhibition, DATE 2023
Country/TerritoryBelgium
CityAntwerp
Period4/17/234/19/23

ASJC Scopus subject areas

  • General Engineering

Fingerprint

Dive into the research topics of 'Reduce: A Framework for Reducing the Overheads of Fault-Aware Retraining'. Together they form a unique fingerprint.

Cite this