TRE-Map: Towards reducing the overheads of fault-aware retraining of deep neural networks by merging fault maps

Le Ha Hoang, Muhammad Abdullah Hanif, Muhammad Shafique

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Recently, fault-aware retraining has emerged as a promising approach to improve the error resilience of Deep Neural Networks (DNNs) against manufacturing-induced defects in DNN accelerators. However, state-of-the-art fault-aware retraining techniques incur a very large retraining overhead because each chip must be retrained separately for its unique fault map, which may render the approach practically infeasible when retraining is done on large datasets. To address this major limitation and improve the practicability of the fault-aware retraining methodology, this work proposes a novel concept of merging fault maps to effectively retrain a DNN for a group of faulty chips in a single fault-aware retraining round. Merging fault maps avoids per-chip retraining and thereby reduces the retraining overhead significantly. However, it also brings new challenges, such as training divergence (accuracy collapse) if a high number of accumulated faults is injected into the network in the first epoch. To address these challenges, we propose a methodology for effectively merging fault maps and then retraining DNNs. Experimental results show that our methodology offers at least 1.4x retraining speedup on average while improving the error resilience of the network (depending on the DNN models and the number of merged fault maps). For example, for the ResNet-32 model using a fault map generated from 5 fault maps at a fault rate of 6e-3, our methodology offers a 2x retraining speedup with a 0.6% classification accuracy drop compared to per-chip retraining.
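
The idea of merging fault maps and avoiding a large fault load in the first epoch can be illustrated with a minimal Python sketch. The abstract does not specify the merging rule, the fault model, or the injection schedule, so everything here is an assumption made for illustration: fault maps are modeled as boolean masks, merging is an element-wise OR, a faulty cell is modeled as a weight stuck at zero, and the linear ramp over epochs is a hypothetical schedule, not the authors' methodology. All function names are hypothetical.

```python
import numpy as np

def random_fault_map(shape, fault_rate, rng):
    """Boolean mask; True marks a faulty weight location (illustrative model)."""
    return rng.random(shape) < fault_rate

def merge_fault_maps(fault_maps):
    """Merge per-chip maps by element-wise OR (assumed merging rule)."""
    return np.logical_or.reduce(fault_maps)

def ramped_mask(merged, epoch, ramp_epochs, rng):
    """Expose a growing fraction of the merged faults per epoch (assumed schedule)."""
    fraction = min(1.0, (epoch + 1) / ramp_epochs)
    keep = rng.random(merged.shape) < fraction
    return merged & keep

def apply_faults(weights, mask):
    """Model a faulty cell as a weight stuck at zero (assumption)."""
    return np.where(mask, 0.0, weights)

rng = np.random.default_rng(0)
shape = (256, 256)

# Five per-chip fault maps at a fault rate of 6e-3, as in the abstract's example.
maps = [random_fault_map(shape, 6e-3, rng) for _ in range(5)]
merged = merge_fault_maps(maps)

weights = rng.standard_normal(shape)
ramp_epochs = 4
for epoch in range(ramp_epochs):
    mask = ramped_mask(merged, epoch, ramp_epochs, rng)
    faulty_weights = apply_faults(weights, mask)
    # A real retraining loop would run one training epoch on faulty_weights here.
    print(epoch, int(mask.sum()), int(merged.sum()))
```

Under these assumptions, the merged map contains every fault of every chip in the group, so a network retrained against it would be retrained once rather than once per chip. The ramp reaches the full merged map in the last ramp epoch.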

Original language: English (US)
Title of host publication: Proceedings - 2021 24th Euromicro Conference on Digital System Design, DSD 2021
Editors: Francesco Leporati, Salvatore Vitabile, Amund Skavhaug
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 434-441
Number of pages: 8
ISBN (Electronic): 9781665427036
State: Published - 2021
Event: 24th Euromicro Conference on Digital System Design, DSD 2021 - Virtual, Online, Italy
Duration: Sep 1 2021 → Sep 3 2021

Publication series

Name: Proceedings - 2021 24th Euromicro Conference on Digital System Design, DSD 2021

Conference

Conference: 24th Euromicro Conference on Digital System Design, DSD 2021
Country/Territory: Italy
City: Virtual, Online
Period: 9/1/21 → 9/3/21

Keywords

  • DNN accelerator
  • Deep neural networks
  • Fault maps
  • Manufacturing defects
  • Reliability
  • Resilience
  • SRAM

ASJC Scopus subject areas

  • Hardware and Architecture
  • Control and Systems Engineering
