TY - JOUR
T1 - Rollback-Recovery for Middleboxes
AU - Sherry, Justine
AU - Gao, Peter Xiang
AU - Basu, Soumya
AU - Panda, Aurojit
AU - Krishnamurthy, Arvind
AU - Maciocco, Christian
AU - Manesh, Maziar
AU - Martins, João
AU - Ratnasamy, Sylvia
AU - Rizzo, Luigi
AU - Shenker, Scott
N1 - Funding Information:
We thank the anonymous reviewers of the SIGCOMM program committee and our shepherd Jeff Chase for their thoughtful feedback on this paper. We thank the middlebox vendors we spoke with for helpful discussions about FTMB, reliability practices, and state of the art network appliances. Jiawei Chen and Eddie Dong at Intel kindly shared the Colo source code and helped us deploy it in our lab at Berkeley. Kay Ousterhout provided feedback on many iterations of this paper. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1106400. This work was in part made possible by generous financial support and technical feedback from Intel Research.
PY - 2015/8/17
Y1 - 2015/8/17
N2 - Network middleboxes must offer high availability, with automatic failover when a device fails. Achieving high availability is challenging because failover must correctly restore lost state (e.g., activity logs, port mappings) but must do so quickly (e.g., in less than typical transport timeout values to minimize disruption to applications) and with little overhead to failure-free operation (e.g., additional per-packet latencies of 10-100s of µs). No existing middlebox design provides failover that is correct, fast to recover, and imposes little increased latency on failure-free operations. We present a new design for fault-tolerance in middleboxes that achieves these three goals. Our system, FTMB (for Fault-Tolerant MiddleBox), adopts the classical approach of "rollback recovery" in which a system uses information logged during normal operation to correctly reconstruct state after a failure. However, traditional rollback recovery cannot maintain high throughput given the frequent output rate of middleboxes. Hence, we design a novel solution to record middlebox state which relies on two mechanisms: (1) 'ordered logging', which provides lightweight logging of the information needed after recovery, and (2) a 'parallel release' algorithm which, when coupled with ordered logging, ensures that recovery is always correct. We implement ordered logging and parallel release in Click and show that for our test applications our design adds only 30 µs of latency to median per-packet latencies. Our system introduces moderate throughput overheads (5-30%) and can reconstruct lost state in 40-275 ms for practical systems.
AB - Network middleboxes must offer high availability, with automatic failover when a device fails. Achieving high availability is challenging because failover must correctly restore lost state (e.g., activity logs, port mappings) but must do so quickly (e.g., in less than typical transport timeout values to minimize disruption to applications) and with little overhead to failure-free operation (e.g., additional per-packet latencies of 10-100s of µs). No existing middlebox design provides failover that is correct, fast to recover, and imposes little increased latency on failure-free operations. We present a new design for fault-tolerance in middleboxes that achieves these three goals. Our system, FTMB (for Fault-Tolerant MiddleBox), adopts the classical approach of "rollback recovery" in which a system uses information logged during normal operation to correctly reconstruct state after a failure. However, traditional rollback recovery cannot maintain high throughput given the frequent output rate of middleboxes. Hence, we design a novel solution to record middlebox state which relies on two mechanisms: (1) 'ordered logging', which provides lightweight logging of the information needed after recovery, and (2) a 'parallel release' algorithm which, when coupled with ordered logging, ensures that recovery is always correct. We implement ordered logging and parallel release in Click and show that for our test applications our design adds only 30 µs of latency to median per-packet latencies. Our system introduces moderate throughput overheads (5-30%) and can reconstruct lost state in 40-275 ms for practical systems.
KW - Middlebox reliability
KW - Parallel fault-tolerance
UR - http://www.scopus.com/inward/record.url?scp=85086591802&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85086591802&partnerID=8YFLogxK
U2 - 10.1145/2829988.2787501
DO - 10.1145/2829988.2787501
M3 - Article
AN - SCOPUS:85086591802
VL - 45
SP - 227
EP - 240
JO - Computer Communication Review
JF - Computer Communication Review
SN - 0146-4833
IS - 4
ER -