TY - GEN
T1 - Improving availability in distributed systems with failure informers
AU - Leners, Joshua B.
AU - Gupta, Trinabh
AU - Aguilera, Marcos K.
AU - Walfish, Michael
N1 - Funding Information:
This paper was improved by the helpful comments of Lorenzo Alvisi, Sebastian Angel, Mahesh Balakrishnan, Russ Cox, Alan Dunn, James Grimmelmann, Rodrigo Rodrigues, Srinath Setty, Scott Shenker, and Edmund L. Wong. We thank the anonymous reviewers, and our shepherd Katerina Argyraki, for their suggestions. This research was supported in part by AFOSR grant FA9550-10-1-0073 and NSF grants 1055057 and 1040083.
Publisher Copyright:
© Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013. All rights reserved.
PY - 2013
Y1 - 2013
N2 - This paper addresses a core question in distributed systems: how should applications be notified of failures? When a distributed system acts on failure reports, the system's correctness and availability depend on the granularity and semantics of those reports. The system's availability also depends on coverage (failures are reported), accuracy (reports are justified), and timeliness (reports come quickly). This paper describes Pigeon, a failure reporting service designed to enable high availability in the applications that use it. Pigeon exposes a new abstraction, called a failure informer, which allows applications to take informed, application-specific recovery actions, and which encapsulates uncertainty, allowing applications to proceed safely in the presence of doubt. Pigeon also significantly improves over the previous state of the art in the three-way trade-off among coverage, accuracy, and timeliness.
UR - http://www.scopus.com/inward/record.url?scp=85076715108&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076715108&partnerID=8YFLogxK
M3 - Conference contribution
T3 - Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013
SP - 427
EP - 441
BT - Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013
PB - USENIX Association
T2 - 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013
Y2 - 2 April 2013 through 5 April 2013
ER -