TY - GEN
T1 - Improving GPU robustness by making use of faulty parts
AU - Durytskyy, Artem
AU - Zahran, Mohamed
AU - Karri, Ramesh
PY - 2011
Y1 - 2011
N2 - With hundreds of processing units in current state-of-the-art graphics processing units (GPUs), the probability that one or more processing units fail due to permanent faults, during fabrication or post deployment, increases drastically. In our experiments we found that the loss of a single streaming multiprocessor (SM) in an 8-SM GPU resulted in as much as 16% performance loss. The default method for dealing with faulty SMs is to turn them off. Although faulty SMs cannot be trusted to completely execute a single kernel (program assigned to an SM) correctly, we show that we can still make use of these SMs to improve system throughput by generating and supplying high-level hints to other functional SMs. By making the faulty SMs supply hints to functional SMs, we have been able to achieve an average speed-up of about 16% over the baseline case (wherein the faulty SMs are turned off). The proposed technique requires minimal hardware overhead and is highly scalable.
AB - With hundreds of processing units in current state-of-the-art graphics processing units (GPUs), the probability that one or more processing units fail due to permanent faults, during fabrication or post deployment, increases drastically. In our experiments we found that the loss of a single streaming multiprocessor (SM) in an 8-SM GPU resulted in as much as 16% performance loss. The default method for dealing with faulty SMs is to turn them off. Although faulty SMs cannot be trusted to completely execute a single kernel (program assigned to an SM) correctly, we show that we can still make use of these SMs to improve system throughput by generating and supplying high-level hints to other functional SMs. By making the faulty SMs supply hints to functional SMs, we have been able to achieve an average speed-up of about 16% over the baseline case (wherein the faulty SMs are turned off). The proposed technique requires minimal hardware overhead and is highly scalable.
UR - http://www.scopus.com/inward/record.url?scp=83455196011&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83455196011&partnerID=8YFLogxK
U2 - 10.1109/ICCD.2011.6081422
DO - 10.1109/ICCD.2011.6081422
M3 - Conference contribution
AN - SCOPUS:83455196011
SN - 9781457719523
T3 - Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors
SP - 346
EP - 351
BT - 2011 IEEE 29th International Conference on Computer Design, ICCD 2011
T2 - 29th IEEE International Conference on Computer Design 2011, ICCD 2011
Y2 - 9 November 2011 through 12 November 2011
ER -